Photo by Sz. Marton on Unsplash

NLP News Cypher | 04.12.20

Down the Rabbit Hole

Ricky Costa · Published in Towards AI · Apr 13, 2020


I called it RABBIT. My demo is finito. We built an app for those who are interested in streaming APIs, online inference, and transformers in production.

Update 04.13.20: rabbit.quantumstat.com

The web app (of which I’ve shown a glimpse in the past) attempts a very difficult balancing act. One of the hardest bottlenecks in deep learning today is leveraging state-of-the-art NLP models (transformers, which are RAM-expensive) while still being able to deploy them in production without making your server or bank account explode. I think I may have figured it out, at least for this app 😎.

What is it?

RABBIT streams tweets from dozens of financial news sources (the usual suspects: Bloomberg, CNBC, WSJ, and more) and runs two classifiers over them in real time!

What am I classifying? The 1st model classifies 21 topics in finance:

declassified

The 2nd model classifies whether the tweet is bullish, bearish, or neutral in stance. What does this mean? If you are an investor/trader holding gold and a tweet mentions that the price of gold is up, it would be labeled bullish; the inverse, bearish; and if you don’t care either way, neutral. Ideally, this app would be personalized to an individual user, but because what you will see is a demo for a general audience, I tried to generalize as much as possible with this classification schema.

As a result, the classification assumes first-order logic only; it does not account for n-order effects. For example, if you hold oil and oil goes up in price, that is labeled bullish, even though the reason oil went up might be a geopolitical conflict that could weigh on the broader market (bearish) — that would be a hypothetical n-order effect.

What does it run on?

I’ve architected the back-end with the option of expanding both compute and connections if required. The transformers are distilled versions of RoBERTa fine-tuned on over 10K tweets from a custom dataset. Currently, I’m leveraging message queues and an asynchronous framework to push tweets out to the user. Shout-out to Adam King for sparking the idea during one of our digital fireside chats. (FYI, you can check out his well-known GPT-2 model here: talktotransformer.com)

RABBIT uses a WebSocket connection for its streaming capabilities and runs on only 4 CPU cores. While this compute may seem small, married with this architecture it’s actually lightning fast (even while doing online inference with two transformers!). Since the WebSockets are connected to the browser and data serving is uni-directional, scaling to the client side is fairly robust.
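To make that pattern concrete, here’s a minimal sketch (not RABBIT’s actual code): an asyncio queue feeding a WebSocket handler that runs two 🤗 pipelines per tweet. The checkpoints, port, and stub stream are all placeholders.

```python
import asyncio
import json

import websockets
from transformers import pipeline

# Placeholder checkpoints: RABBIT's fine-tuned topic/stance models aren't public,
# so a stock distilled RoBERTa stands in for both (its labels will be meaningless).
topic_clf = pipeline("text-classification", model="distilroberta-base")
stance_clf = pipeline("text-classification", model="distilroberta-base")

queue: asyncio.Queue = asyncio.Queue()

async def fake_stream():
    """Stub for the Twitter streaming client: drop a sample tweet on the queue every second."""
    while True:
        await queue.put("Gold prices climb as investors seek safe havens.")
        await asyncio.sleep(1)

async def handler(websocket, path=None):
    """Serve one browser client: classify each queued tweet and push the result as JSON."""
    while True:
        text = await queue.get()
        # Run blocking model inference off the event loop so other clients keep getting served.
        topic = await asyncio.to_thread(topic_clf, text)
        stance = await asyncio.to_thread(stance_clf, text)
        await websocket.send(json.dumps({
            "tweet": text,
            "topic": topic[0]["label"],
            "stance": stance[0]["label"],
        }))

async def main():
    asyncio.create_task(fake_stream())
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # serve forever

if __name__ == "__main__":
    asyncio.run(main())
```

A real deployment would fan each tweet out to every connected client (this single shared queue hands a tweet to only one), but the shape of the pipeline — queue in, classify, push over the socket — is the same.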

Errata

Very recently there’s been some domain shift: the coronavirus has altered the news cycle, which has decreased the accuracy of the models. I will continually add more data to mitigate this; for now, it performs reasonably well.

Fin

I will officially release it tomorrow, April 13th. Check my Twitter for the update. FYI, the app is best experienced during weekday trading hours when the stock market is open, so you can see it stream really fast (even though technically you can check it out anytime you want).

Proud of this work. It’s cheap, it’s powerful and it’s fast.

Possible future approaches include pretraining a language model from scratch and then fine-tuning it on the custom dataset mentioned above. It would also be nice to surface more data in a dashboard alongside a live stock-market stream.

How was your week? 😎

This Week:

Bare Metal

Colab of the Week, on Self-Attention

Hugging Electra

Colbert AI

A Very Large News Dataset

A Token of Appreciation

Dataset of the Week: X-Stance

Bare Metal

AI chip makers are betting that NLP models will keep getting bigger and bigger even as their chips get smarter. The metal peeps say they want to isolate NN inputs to individual cores as opposed to batching them. The consequence is that only the neurons in your network that “need” to fire will do so, since they are isolated:

“Companies are fixated on the concept of ‘sparsity,’ the notion that many neural networks can be processed more efficiently if redundant information is stripped away. Lie observed that there is ‘a large, untapped potential for sparsity’ and that ‘neural networks are naturally sparse.’”

With this knowledge, the new AI chips don’t need to train as long and can drop out of training earlier on. 🧐
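As a toy illustration of what “sparsity” means here (not tied to any particular chip), here’s the fraction of activations a ReLU layer zeroes out — the portion of work sparsity-aware hardware can in principle skip:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected layer: a batch of inputs, weights, and a ReLU nonlinearity.
x = rng.standard_normal((32, 512))            # batch of 32 input vectors
w = rng.standard_normal((512, 1024)) * 0.05   # layer weights
pre_act = x @ w
post_act = np.maximum(pre_act, 0.0)           # ReLU zeroes out roughly half the units

sparsity = np.mean(post_act == 0.0)
print(f"Fraction of inactive neurons: {sparsity:.2%}")
# A core that skips the zeroed activations does proportionally less work,
# which is the efficiency win the chip makers are chasing.
```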

Colab of the Week, on Self-Attention

I’ll let you explore this one:
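In the meantime, here’s a bare-bones NumPy version of scaled dot-product self-attention — a sketch of the core operation, not taken from the linked notebook:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence (no masking)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # token-to-token similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # each output mixes all value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
x = rng.standard_normal((seq_len, d_model))          # 5 token embeddings
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)        # -> (5, 16)
```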

Hugging Electra

The new method for training language models with relatively low compute is now in the Hugging Face library. You may remember ELECTRA’s eye-catching performance 👇

It didn’t take long for developers to leverage ELECTRA: the Simple Transformers library, which is built on top of 🤗's Transformers, already has it:
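If you just want to poke at a pretrained checkpoint directly in 🤗 Transformers, something like this works (the “small” discriminator and the toy sentence are arbitrary choices of mine):

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizerFast

# Public "small" discriminator checkpoint; larger ones exist if you have the compute.
name = "google/electra-small-discriminator"
tokenizer = ElectraTokenizerFast.from_pretrained(name)
model = ElectraForPreTraining.from_pretrained(name)

# ELECTRA's discriminator scores every token: original, or swapped in by the generator?
sentence = "the chef ate the meal"   # imagine "ate" replaced an original "cooked"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits.squeeze()

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for token, score in zip(tokens, logits):
    print(f"{token:>8}  {'replaced' if score > 0 else 'original'}")
```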

Colbert AI

GPT-2 strikes back with a bit of humor. Developers Abbas Mohammed and Shubham Rao created this model by extracting monologues from the Late Show’s video captions on YouTube. They provided a nice Colab notebook with excellent documentation for anyone who wants to do something similar. (you may need to get your own dataset 😢)
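Their notebook walks through the full fine-tuning loop; as a flavor of the underlying approach, here’s a minimal sketch that generates text with stock GPT-2 through 🤗's pipeline (not their fine-tuned Colbert model):

```python
from transformers import pipeline, set_seed

# Stock GPT-2 via the text-generation pipeline; Colbert AI fine-tunes this same
# architecture on monologue transcripts before generating.
generator = pipeline("text-generation", model="gpt2")
set_seed(42)

prompt = "Ladies and gentlemen, welcome back to the show. Tonight,"
for out in generator(prompt, max_length=60, num_return_sequences=2):
    print(out["generated_text"])
    print("---")
```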

Colab:

A Very Large News Dataset

Found this gem on Reddit. With the average developer getting closer and closer to training their own language models from scratch, super big datasets will grow more popular among NLP developers in the long term. This dataset holds 2.7 million news articles from the past 4 years:

A Token of Appreciation

A relatively new tokenizer came to my attention this week, boasting its speed versus other well-known tokenizers (it was written in C++). If you want to compare how it fares against the others (Hugging Face, SentencePiece, and fastBPE), check out their benchmark results:

Main Repo:

Benchmarks:
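And if you want a rough, do-it-yourself comparison on your own text, a timing harness along these lines works with any Python-accessible tokenizer (here using 🤗's fast RoBERTa tokenizer as an example; the corpus and repeat count are arbitrary):

```python
import time
from transformers import AutoTokenizer

# Rough harness: swap in whichever tokenizer you want to time on your own corpus.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
texts = ["The price of gold is up 2% on the day."] * 10_000

start = time.perf_counter()
tokenizer(texts)
elapsed = time.perf_counter() - start
print(f"Tokenized {len(texts):,} texts in {elapsed:.2f}s "
      f"({len(texts) / elapsed:,.0f} texts/sec)")
```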

Dataset of the Week: X-Stance

What is it?

“The x-stance dataset contains more than 150 political questions, and 67k comments written by candidates on those questions.” The comments are in German, French, and Italian.

Sample:

Where is it?

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

If you enjoyed this article, help us out and share with friends!

For complete coverage, follow our Twitter: @Quantum_Stat

www.quantumstat.com


Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟