NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 02.21.21
🎉 1T or bust my dudes 🎉
There’s a group of ML hackers attempting to recreate GPT-3 on their own.
Earlier this year, EleutherAI sent data nerds buzzing when they released the paper for the Pile, their 825 GB English text corpus targeted at training large-scale language models. That takes care of the data problem; now all they need is the compute: 👇
They are building it using the Mesh TensorFlow library. We wish them the best of luck. Or, as their repo puts it: 1T or bust my dudes.
Their discord server:
Oh, and Hello Mars! 👽
If you enjoy the read, help us out by giving it a 👏👏 and share with friends 🙈.
PyTorch | Ray and Distributed Training
If you want to stay on top of the latest distributed training with PyTorch and Ray, this is a healthy intro:
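Under the hood, distributed data-parallel training is a shard-and-average loop: each worker computes gradients on its slice of the data, and the results are averaged before the weight update. Here is a minimal stdlib sketch of that pattern, using `ThreadPoolExecutor` as a stand-in for Ray workers (this is not Ray's actual API, just the core idea):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy dataset: points on the line y = 3x, so the ideal weight is w = 3.
data = [(x, 3.0 * x) for x in range(1, 9)]

def shard_gradient(shard, w):
    # Mean gradient of squared error on one shard: d/dw (w*x - y)^2 = 2*(w*x - y)*x
    grads = [2.0 * (w * x - y) * x for x, y in shard]
    return sum(grads) / len(grads)

def data_parallel_step(data, w, n_workers=4):
    # Split the data into equal shards, compute each shard's gradient on a
    # separate worker, then average -- the same all-reduce pattern that
    # PyTorch DDP and Ray apply across machines.
    shards = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        grads = list(pool.map(lambda s: shard_gradient(s, w), shards))
    return sum(grads) / len(grads)

w, lr = 0.0, 0.01
w -= lr * data_parallel_step(data, w)  # one synchronized training step
print(round(w, 2))  # -> 1.53
```

The averaged shard gradients equal the full-batch gradient here because the shards are equal-sized; frameworks like Ray handle the scheduling, fault tolerance, and cross-machine communication that this sketch ignores.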
Transformers Interpret
“Transformers interpret allows any transformers model to be explained in just two lines. It even supports visualizations in both notebooks and as savable html files.”
So for example if you were doing sentiment analysis on the sentence below:
“I love you, I like you”
This output 👇 would tell you what words have the biggest impact on inference.
[('BOS_TOKEN', 0.0),
('I', 0.46820529249283205),
('love', 0.46061853275727177),
('you', 0.566412765400519),
(',', -0.017154456486408547),
('I', -0.053763869433472),
('like', 0.10987746237531228),
('you', 0.48221682341218103),
('EOS_TOKEN', 0.0)]
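To see which tokens drive the prediction, you can rank the pairs above by absolute attribution. A plain-Python sketch, independent of the library itself:

```python
# (token, attribution) pairs as returned by the explainer above;
# positive scores push toward the predicted class, negative push away.
attributions = [
    ("BOS_TOKEN", 0.0),
    ("I", 0.46820529249283205),
    ("love", 0.46061853275727177),
    ("you", 0.566412765400519),
    (",", -0.017154456486408547),
    ("I", -0.053763869433472),
    ("like", 0.10987746237531228),
    ("you", 0.48221682341218103),
    ("EOS_TOKEN", 0.0),
]

# Rank by magnitude: the largest absolute scores matter most to the prediction.
ranked = sorted(attributions, key=lambda pair: abs(pair[1]), reverse=True)
print([tok for tok, _ in ranked[:3]])  # -> ['you', 'you', 'I']
```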
Then you visualize it with 1 line of code:
cls_explainer.visualize("distilbert_viz.html")
ConvLab-2
“ConvLab-2 is an open-source toolkit that enables researchers to build task-oriented dialog systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems.”
Question Generation Tutorial on Udemy
The creator of QuestGen library, Ramsri Golla, has a new course on Udemy!
And I've got a discount coupon you can use for his course. Here's a description of what you'll learn, in case you're interested:
- Generate assessments like MCQs, True/False questions etc from any content using state-of-the-art natural language processing techniques.
- Apply recent advancements like BERT, OpenAI GPT-2, and T5 transformers to solve real-world problems in edtech.
- Use NLP libraries like Spacy, NLTK, AllenNLP, HuggingFace transformers, etc.
- Use Google Colab environment to run all these algorithms.
- 4 hours on-demand video 🤖
25% Off Coupon:
MIT CS Courses
Electrical Engineering and Computer Science courses at MIT.
Wiki’s API
An article describing the genesis of Wikipedia's API: the Wikimedia Foundation (WMF) originally lacked a holistic API strategy, and the piece walks through how they solved that problem. The API was completed in December 2020.
Source Code:
Docker Swarm Implementation
Includes code…Hope you like YML files. 😁
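For flavor, a Swarm deployment is typically described by a Compose-format stack file. A minimal sketch (hypothetical service, not taken from the repo):

```yaml
# docker-stack.yml -- hypothetical example for illustration
version: "3.8"
services:
  web:
    image: nginx:alpine        # any stateless service works here
    ports:
      - "8080:80"
    deploy:
      replicas: 3              # Swarm schedules 3 copies across the cluster
      restart_policy:
        condition: on-failure
```

You'd deploy it with `docker stack deploy -c docker-stack.yml <stack_name>` on a Swarm manager node.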
Papers Without Code 😬
Where unreproducible papers come to live…
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
65 Million Probably Asked Questions and New Retriever Model
A new QA-pair retriever model, RePAQ, to complement Probably Asked Questions (PAQ), a resource of 65M automatically-generated QA-pairs.
Fact Check Summarization
Abstractive Summarization using two methods:
1. JAENS: joint entity and summary generation
2. Summary-worthy entity classification with summarization (multi-task learning)
This approach targets the factual consistency of entities in abstractive summarization (AS), an ongoing research problem.
*runs on fairseq*
Emoji Transfer
Training transformers for sentiment analysis with emoji data.
Relation Extraction Over Universal Graph
Distantly Supervised Relation Extraction (DS-RE) over knowledge graph and textual data.
Apache Log Generator
Automates the parsing of Apache logs by formulating it as a machine translation (MT) task.
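For context, the structured fields such a model has to recover are the ones a conventional regex baseline extracts from a Common Log Format line. A quick sketch (the log line is made up):

```python
import re

# A made-up line in Apache Common Log Format.
line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

# Conventional regex baseline: the structured "target" that an MT-style
# parser would learn to emit from the raw "source" line.
clf = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)
fields = clf.match(line).groupdict()
print(fields["status"], fields["request"])  # -> 200 GET /index.html HTTP/1.0
```

The appeal of the MT framing is skipping hand-written patterns like this one when log formats vary or drift.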
NoiseQA
A new question answering evaluation benchmark that takes into consideration how a QA model's deployment can impact performance. For example, QA interfaces such as speech, text, or translation can induce unique inference errors that most evaluation benchmarks don't consider.
Optimizing Inference on CPU for Transformers
An empirical analysis of the scalability and performance of running inference with a Transformer-based model on CPUs.
Exploring Transformers for NLG
A pithy introduction to the GPT, BERT, and XLNet transformers for NLG.
Dataset of the Week: ArtEmis
A dataset that associates human emotions with artworks and contains explanations in natural language of the rationale behind each triggered emotion.
Sample
Where is it?
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat