A few days ago, I saw this article on CNN analyzing key phrases by candidate from the second Democratic debate. The author pointed out that they used the tidytext package, specifically its built-in tf-idf functionality, to do that analysis. Coincidentally, I was looking at the descriptions of various ETFs extracted from RobinHood and Tiingo via the riingo and RobinHood packages.

library(tidyverse)
library(tidytext)
library(RobinHood)
library(riingo)
library(kableExtra)
## Sign into Robinhood
RH = RobinHood(username = keyring::key_list('robinhood')[1,2], 
               password = keyring::key_get('robinhood'))

## Get ETF List
etf <- get_tag(RH=RH, tag='etf')

## Get Metadata on Each
meta <- riingo_meta(etf)
glimpse(meta)
## Observations: 499
## Variables: 6
## $ ticker       <chr> "VOO", "SPY", "MJ", "VTI", "BOTZ", "QQQ", "VYM", "S…
## $ startDate    <dttm> 2010-09-09, 1993-01-29, 2018-02-07, 2001-05-31, 20…
## $ exchangeCode <chr> "NYSE ARCA", "NYSE ARCA", "NYSE ARCA", "NYSE ARCA",…
## $ endDate      <dttm> 2019-08-09, 2019-08-09, 2019-08-09, 2019-08-09, 20…
## $ name         <chr> "VANGUARD 500 INDEX FUND ETF SHARES", "SPDR SP 500 …
## $ description  <chr> "The Fund employs an indexing investment approach d…

Looking at the data, there's a nice description field to play with; however, it's filled with legal talk that can obscure what the ETF actually does.

For example, one that's probably of interest to this audience is AIEQ, the AI POWERED EQUITY ETF. Its description reads as follows:

The Fund is actively managed and invests primarily in equity securities listed on a U.S. exchange based on the results of a proprietary, quantitative model (the “EquBot Model”) developed by EquBot LLC (“EquBot”) that runs on the IBM Watson™ platform. EquBot, the Fund’s sub-adviser, is a technology based company focused on applying artificial intelligence (“AI”) based solutions to investment analyses. As an IBM Global Entrepreneur company, EquBot leverages IBM’s Watson AI to conduct an objective, fundamental analysis of U.S.-listed common stocks and real estate investment trusts (“REITs”) based on up to ten years of historical data and apply that analysis to recent economic and news data. Each day, the EquBot Model ranks each company based on the probability of the company benefiting from current economic conditions, trends, and world events and identifies approximately 30 to 125 companies with the greatest potential over the next twelve months for appreciation and their corresponding weights, while maintaining volatility (i.e., the range in which the portfolio’s returns vary) comparable to the broader U.S. equity market. The Fund may invest in the securities of companies of any market capitalization. The EquBot model recommends a weight for each company based on its potential for appreciation and correlation to the other companies in the Fund’s portfolio. The EquBot model limits the weight of any individual company to 10%. At times, a significant portion of the Fund’s assets may consist of cash and cash equivalents. IBM’s Watson AI is a computing platform capable of answering natural language questions by connecting large amounts of data, both structured (e.g., spreadsheets) and unstructured (e.g., news articles), and learning from each analysis it conducts (e.g., by recognizing patterns) to produce a more accurate answer with each subsequent question. The Fund’s investment adviser utilizes the recommendations of the EquBot Model to decide which securities to purchase and sell, while complying with the Investment Company Act of 1940 (the “1940 Act”) and its rules and regulations. The Fund’s investment adviser anticipates primarily making purchase and sale decisions based on information from the EquBot Model. The Fund may frequently and actively purchase and sell securities.
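If you want to pull that text yourself, a quick snippet against the meta tibble from above does it:

## Pull the raw description for a single ticker
meta %>%
  filter(ticker == 'AIEQ') %>%
  pull(description)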

There’s a lot there, and a lot of it is legalese. Being a data scientist, I think the idea of an AI-powered ETF is pretty cool, but (frankly) I’m not going to slog through 499 stuffy descriptions to find cool ETFs when there’s an easier way. That’s where tf-idf comes in.

What is tf-idf? It stands for Term Frequency–Inverse Document Frequency, and it’s a technique for finding important words in documents. Essentially, it scores each word (or phrase) by how often it appears in a document, downweighting words that are common across all documents, so the terms that score highest are the ones distinctive to that document.
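To make that concrete, here’s a minimal sketch on a made-up two-document corpus (the documents and words are purely illustrative, using the packages already loaded above). ‘fund’ appears in both documents, so its idf, and therefore its tf-idf, is zero; ‘apple’ appears in only one document and scores highest there.

## Toy corpus: two tiny 'documents'
toy <- tribble(
  ~doc, ~text,
  'a', 'fund fund apple',
  'b', 'fund bond bond'
)

## 'fund' shows up in both docs, so its idf is 0;
## 'apple' and 'bond' are document-specific and score highest
toy %>%
  unnest_tokens(term, text) %>%
  count(doc, term) %>%
  bind_tf_idf(term, doc, n)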

In the CNN article, they found phrases that were unique to candidates so they could easily identify how each candidate differed from the others. For example, the key phrases extracted for John Delaney were ‘impossible promises’, ‘real solutions’, and ‘private sector’. Anyone who watched the debate knew that he positioned himself in contrast to candidates pitching large progressive agendas. Marianne Williamson’s key phrases were ‘deep truth’, ‘false god’, ‘collectivized hatred’, and ‘heal’, which is 0% surprising to anyone who had heard her speak before.

To me, these summarizations seemed effective, so why not try the same thing with these ETFs?

To do this, I’ll use the bind_tf_idf function in the tidytext package. In addition to word counts, I’ll also use bigrams to try to capture phrasing. To prevent redundancies (like identifying the bigram ‘artificial intelligence’ alongside the words ‘artificial’ and ‘intelligence’), I’ll drop any unigram that already appears as part of a bigram. From the tf-idf scores, I’ll identify the top 3 terms per ETF.

## Extract Bigrams
ticker_bigram <- meta %>%
  select(ticker, description) %>%
  mutate(description=tolower(description)) %>%
  unnest_tokens(term, description, token = "ngrams", n = 2) %>%
  separate(term, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  unite(term, word1, word2, sep = " ") %>%
  count(ticker, term, sort = TRUE) 

## Pull Bigram Words
bg_words <- ticker_bigram %>% separate(term, c("word1", "word2"), sep = " ")

## Extract Unigrams
ticker_unigram <- meta %>%
  select(ticker, description) %>%
  mutate(description=tolower(description)) %>%
  unnest_tokens(term, description) %>%
  anti_join(stop_words, by=c('term'='word')) %>%
  filter(!term %in% unique(c(bg_words$word1, bg_words$word2))) %>%
  count(ticker, term, sort = TRUE)  

## Bind together, calculate tf-idf, identify top 3 words
ticker_tfidf <- bind_rows(ticker_unigram, ticker_bigram) %>%
  bind_tf_idf(term, ticker, n) %>%
  arrange(ticker, desc(tf_idf)) %>%
  group_by(ticker) %>%
  top_n(3, tf_idf) %>%
  select(ticker, term)  %>%
  summarize(terms = paste(sort(unique(term)),collapse=", "))


## Merge back to original df
meta <- meta %>%
  left_join(ticker_tfidf, by='ticker')

Now let’s look at the AIEQ ticker to see what terms have been extracted.

meta %>%
  filter(ticker=='AIEQ') %>%
  .$terms
## [1] "company based, equbot model, ibm’s watson, watson ai"

From these terms, we can see at a glance that this ETF is based on a model that uses IBM Watson. You still have to dig deeper to see the ETF’s whole strategy, but the extracted terms give you a general idea of what it’s about.

Let’s look at some other terms.

meta %>%
  filter(ticker %in% c('BIL', 'VPL', 'YOLO', 'LVL', 'AMJL', 'ARKW')) %>%
  select(ticker, name, terms) %>%
  knitr::kable() %>%
  kable_styling()
ticker | name | terms
YOLO | AdvisorShares Pure Cannabis | advisorshares pure, pure cannabis
ARKW | ARK WEB X.0 ETF | fintech innovation, shifting, technology infrastructure
VPL | VANGUARD PACIFIC STOCK INDEX FUND ETF SHARES | asia pacific, developed asia, ftse developed
AMJL | Credit Suisse AG Nassau Branch XLinks Monthly Pay 2X Leveraged Alerian MLP Index ETN 05/16/2036 | NA NA
LVL | INVESCO S&P GLOBAL DIVIDEND OPPORTUNITIES INDEX ETF | 100 common, 346.1 billion, 854.5 million, countries included, global broad, japan korea, korea singapore
BIL | SPDR BLOOMBERG BARCLAYS 1-3 MONTH T-BILL ETF | 1 month, 3 months, u.s treasury

There’s a range of usefulness here. ARKW’s terms show that it’s centered on fintech, and LVL’s terms show a focus on the Asia-Pacific region. YOLO, VPL, and BIL have informative terms, but they don’t give you much more than what’s already in the title. AMJL is a total miss, not returning any useful terms.

Overall, tf-idf was mostly successful in extracting useful terms from the ETF descriptions, even if some of them were redundant. There are plenty of other text summarization techniques out there, but this analysis shows that tf-idf can quickly extract key terms from text with minimal preprocessing.

Disclaimers

Not sure how this works, so I’m just covering my butt:
Nothing in this post should be considered investment advice, research, or an invitation to buy or sell a security

Also, if you invest based on anything I say, you are not a smart person.