Baptism of Christ (aka there’s a giant UFO in the sky) | Gelder (1710)

NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER

The NLP Cypher | 06.06.21

Trinity

Ricky Costa · Published in Towards AI · Jun 7, 2021 · 9 min read

Welcome back to the simulation ✌. So the ACL 2021 data dump happened, and now we have a huge list of repos to get through in this week’s Repo Cypher. 😁

Also, we are updating the NLP Index very soon with 100+ new repos (many of which are mentioned here) alongside 30+ new NLP notebooks like this one 👇. If you would like to get an email alert for future newsletters and asset updates, you can sign up here.

Thank you, Niels Rogge.

So let us start with incoming awesomeness. Heard of the Graph4NLP library??? If you want to leverage graph neural networks (GNNs) for your NLP tasks, you may want to check it out (it’s in beta). It runs on top of the Deep Graph Library (DGL). Also, they have an awesome collection of papers on everything deep learning + graphs + NLP 👇.

https://github.com/graph4ai/graph4nlp_literature
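Since Graph4NLP sits on top of DGL, here’s a minimal DGL-only sketch of the underlying idea: run a graph convolution over a toy dependency graph of a sentence. The edge list, feature sizes, and numbers are made up for illustration; Graph4NLP’s own API wraps this kind of machinery for you.

```python
import torch
import dgl
from dgl.nn import GraphConv

# Toy "dependency graph" for a 4-token sentence: edges point head -> dependent.
# The edge list here is invented purely for illustration.
src = torch.tensor([1, 1, 3, 1])
dst = torch.tensor([0, 2, 2, 3])
g = dgl.graph((src, dst), num_nodes=4)
g = dgl.add_self_loop(g)  # GraphConv expects nodes to have in-edges

feats = torch.randn(4, 300)   # stand-in for word embeddings of the 4 tokens
conv = GraphConv(300, 128)    # one graph-convolution layer
h = conv(g, feats)            # contextualized node states, shape (4, 128)
print(h.shape)
```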

declassified

But wait, we need a simulation gut check

Oh… just an FYI: according to sources in the Pentagon, aliens can’t be ruled out as the explanation for all the aerial phenomena (~120 events) over the US. The New York Times got an inside scoop on the upcoming UFO report being prepared by the DNI (Director of National Intelligence). Bottom line: no one has a clue what is going on, and it may be aliens. You can read the full article here:

U.S. Finds No Evidence of Alien Technology in Flying Objects, but Can’t Rule It Out, Either

If you want to do your own investigative work, the FAA has a database of unmanned aircraft sightings. Some involve military encounters. 🤷‍♂️

Data:

So…

To conclude the intro: I’d like to welcome everyone to the halfway mark of the glorious year that is… 2021💖💖. A year where Elon runs crypto, aliens are real, and everyone wears masks. 👍

and now… a word from our sponsors:

😂😂😂

Important to note: Elon Musk’s crypto shitposting is so lit that it got the attention of the hacker group Anonymous. I’m guessing they weren't too keen on Elon’s tweets swaying the crypto markets. As a result, they decided to G-check him on sight. 🥶🥶

Tokenizers Go Bye Bye: ByT5 Model

No need for an intro: Google already gives one in their repo:

“ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword vocabulary like most other pretrained language models (BERT, XLM-R, T5, GPT-3), our ByT5 model operates directly on UTF-8 bytes, removing the need for any text preprocessing. Beyond the reduction in system complexity, we find that parameter-matched ByT5 models are competitive with mT5 across a range of tasks, and outperform mT5 on tasks that involve noisy text or are sensitive to spelling and pronunciation. This repo can be used to reproduce the experiments in the ByT5 paper.”

…and it’s already merged into Transformers:
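A minimal sketch, adapted from the Hugging Face docs, of what “tokenizer-free” means in practice: you feed the model raw UTF-8 bytes offset by 3, since ids 0–2 are reserved for the pad/eos/unk special tokens. The google/byt5-small checkpoint name is from the release.

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# No tokenizer: inputs are raw UTF-8 bytes, shifted by 3 to reserve
# ids 0-2 for the pad/eos/unk special tokens.
input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3

loss = model(input_ids, labels=labels).loss  # standard seq2seq training loss
```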

Binary Passage Retriever (BPR)

The peeps that rolled out the LUKE model dropped ACL bombs on Twitter: they reduced the memory size of the Dense Passage Retriever (DPR) model w/o losing QA accuracy. ✌

“Compared with DPR, BPR substantially reduces the memory cost from 65GB to 2GB without a loss of accuracy on two standard open-domain question answering benchmarks: Natural Questions and TriviaQA.” 🔥🔥
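The memory win comes from hashing each float32 passage embedding into a compact binary code, generating candidates via Hamming distance, and then reranking them against the full-precision query vector. A toy NumPy sketch of that idea (not the authors’ code; corpus size and shapes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for DPR-style dense vectors: 768-dim float32 embeddings.
passage_embs = rng.standard_normal((100_000, 768)).astype(np.float32)
query_emb = rng.standard_normal(768).astype(np.float32)

# BPR's core trick (simplified): binarize and bit-pack each embedding,
# so 768 floats (3,072 bytes) shrink to 96 bytes -- a 32x reduction.
passage_codes = np.packbits(passage_embs > 0, axis=1)   # (100_000, 96) uint8
query_code = np.packbits(query_emb > 0)                 # (96,) uint8

# Stage 1: candidate generation by Hamming distance (XOR + popcount).
hamming = np.unpackbits(passage_codes ^ query_code, axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:100]

# Stage 2: rerank by scoring the full-precision query vector against
# the binary passage codes expanded back to +/-1 vectors.
cand_signs = np.unpackbits(passage_codes[candidates], axis=1).astype(np.float32) * 2 - 1
rerank = candidates[np.argsort(-(cand_signs @ query_emb))]
```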

The Good, the Bad and the Mysterious: GPT-3

Across a multi-scale paradigm, a change in quantity on one scale is a change in quality in another.

Stanford decided to investigate GPT-3’s emergent phenomenon of few-shot learning. Warning: power laws incoming…

DoT: Double Transformer Model

Google has a new architecture for using transformers to parse table data.

“DoT, a double transformer model, that decomposes the problem into two sub-tasks: A shallow pruning transformer that selects the top-K tokens, followed by a deep task-specific transformer that takes as input those K tokens.”

paper
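A toy PyTorch sketch of the two-stage idea (layer counts and dimensions invented for illustration; this is not Google’s implementation): a shallow encoder scores every token, and only the top-K survivors go through the deep task encoder.

```python
import torch
import torch.nn as nn

class DoTSketch(nn.Module):
    def __init__(self, vocab=30_000, dim=256, k=128):
        super().__init__()
        self.k = k
        self.embed = nn.Embedding(vocab, dim)
        # Shallow pruning transformer: cheap pass over the full sequence.
        self.pruner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.scorer = nn.Linear(dim, 1)
        # Deep task-specific transformer: expensive pass over K tokens only.
        self.task = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=12)

    def forward(self, token_ids):                      # (batch, seq_len)
        h = self.pruner(self.embed(token_ids))
        scores = self.scorer(h).squeeze(-1)            # relevance score per token
        topk = scores.topk(self.k, dim=1).indices      # keep the top-K tokens
        kept = h.gather(1, topk.unsqueeze(-1).expand(-1, -1, h.size(-1)))
        return self.task(kept)                         # (batch, k, dim)

out = DoTSketch()(torch.randint(0, 30_000, (2, 512)))  # -> (2, 128, 256)
```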

Repo Cypher 👨‍💻

A collection of recently released repos that caught our 👁

SapBERT: Self-alignment pretraining for BERT

Introduces a novel cross-lingual biomedical entity linking task (XL-BEL), establishing a wide-coverage and reliable evaluation benchmark for cross-lingual entity representations in the biomedical domain across 10 languages.
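A minimal sketch of using SapBERT embeddings for entity linking by nearest neighbour, assuming the cambridgeltl/SapBERT-from-PubMedBERT-fulltext checkpoint id from their repo (the surface forms below are just illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# Encode entity surface forms; SapBERT uses the [CLS] vector as the embedding.
names = ["covid-19", "coronavirus infection", "high blood pressure"]
batch = tok(names, padding=True, return_tensors="pt")
with torch.no_grad():
    embs = model(**batch).last_hidden_state[:, 0]

# Nearest neighbour in this space ~ entity linking: the two COVID surface
# forms should land closer together than the unrelated hypertension term.
sims = torch.nn.functional.cosine_similarity(embs[0], embs[1:], dim=-1)
print(sims)  # expect sims[0] > sims[1]
```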

Connected Papers 📈

CIDER: Commonsense Inference for Dialogue Explanation and Reasoning

CIDER — a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference.

Connected Papers 📈

DynaEval

DynaEval serves as a unified framework for both turn and dialogue level evaluation in open-domain dialogue.

Connected Papers 📈

MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding

MPC-BERT, a pre-trained model for Multi-Party Conversation (MPC) understanding that considers learning who says what to whom in a unified model with several elaborated self-supervised tasks.

Connected Papers 📈

COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences

A new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs.

CitationIE

This work augments text representations by leveraging a complementary source of document context: the citation graph of referential links between citing and cited papers. This is used for information extraction in science papers.

Connected Papers 📈

Dataset of the Week: CoDesc

What is it?

A dataset of 4.2M pairs of Java source code and parallel natural language descriptions, collected from code search and code summarization studies.

Where is it?

Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.

For complete coverage, follow our Twitter: @Quantum_Stat

Quantum Stat


Subscribe to the NLP Cypher newsletter for the latest in NLP & ML code/research. 🤟