NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 06.06.21
Trinity
Welcome back to the simulation ✌ . So the ACL 2021 data dump happened, and now we have a huge list of repos to get through in the Repo Cypher this week. 😁
Also, we are updating the NLP index very soon with 100+ new repos (many of which are mentioned here) alongside 30+ new NLP notebooks like this one 👇 . If you would like to get an email alert for future newsletters and asset updates, you can sign up here.
So let us start with incoming awesomeness. Heard of the Graph4NLP library??? If you want to leverage graph neural networks (GNNs) for your NLP tasks, you may want to check it out (it’s in BETA). It runs on top of the Deep Graph Library (DGL). They also have an awesome collection of papers on everything deep learning + graphs + NLP 👇.
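If GNNs on text are new to you, here’s a tiny sketch of the core idea a library like Graph4NLP builds on: one round of message passing over a graph built from a sentence, where each node updates itself by averaging its neighbors’ feature vectors. The graph, features, and aggregation below are toy stand-ins, not the Graph4NLP or DGL API.

```python
# Minimal sketch of one GNN message-passing step (mean aggregation).
# Illustrative only -- this is NOT the Graph4NLP or DGL API.

# Toy dependency-style graph over "cats chase mice":
# hypothetical 2-d node features, edges as (dependent, head) pairs.
features = {
    "cats":  [1.0, 0.0],
    "chase": [0.0, 1.0],
    "mice":  [1.0, 1.0],
}
edges = [("cats", "chase"), ("mice", "chase")]

def message_pass(features, edges):
    """One step: each node averages its own and its neighbors' vectors."""
    incoming = {node: [feat] for node, feat in features.items()}
    for src, dst in edges:
        incoming[dst].append(features[src])
        incoming[src].append(features[dst])  # treat edges as undirected
    updated = {}
    for node, vecs in incoming.items():
        dim = len(vecs[0])
        updated[node] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return updated

print(message_pass(features, edges))
```

Stacking a few of these rounds is what lets a node’s representation absorb information from tokens several hops away in the graph, which is the appeal for syntax- and semantics-aware NLP tasks.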
https://github.com/graph4ai/graph4nlp_literature
But wait, we need a simulation gut check
Oh… just an FYI: aliens can’t be ruled out as the cause of all the aerial phenomena (~120 events) in the US, according to sources in the Pentagon. The New York Times got an inside scoop on the upcoming UFO report being prepared by the DNI (Director of National Intelligence). Bottom line: no one has a clue what is going on, and it may be aliens. You can read the full article here:
U.S. Finds No Evidence of Alien Technology in Flying Objects, but Can’t Rule It Out, Either
If you want to do your own investigative work, the FAA has a database of unmanned aircraft sightings. Some involve military encounters. 🤷‍♂️
Data:
So…
To conclude the intro: I’d like to welcome everyone to the halfway mark of the glorious year that is… 2021💖💖. A year where Elon runs crypto, aliens are real, and everyone wears masks. 👍
and now… a word from our sponsors:
Important to Note: Elon Musk’s crypto shitposting is so lit that it got the attention of the hacker group Anonymous. I’m guessing they weren't too keen on Elon’s tweets swaying the crypto markets. As a result, they decided to G check him on sight. 🥶🥶
Tokenizers Go Bye Bye: ByT5 Model
No need for an intro: Google already gives one in their repo:
“ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword vocabulary like most other pretrained language models (BERT, XLM-R, T5, GPT-3), our ByT5 model operates directly on UTF-8 bytes, removing the need for any text preprocessing. Beyond the reduction in system complexity, we find that parameter-matched ByT5 models are competitive with mT5 across a range of tasks, and outperform mT5 on tasks that involve noisy text or are sensitive to spelling and pronunciation. This repo can be used to reproduce the experiments in the ByT5 paper.”
… already merged in Transformers:
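To make the “tokenizer-free” part concrete, here’s a sketch of ByT5’s encoding scheme: the model consumes raw UTF-8 bytes, shifted by a small offset so the first few ids can serve as special tokens. This mirrors the published scheme (pad=0, eos=1, unk=2, byte b → id b + 3) but is a simplified illustration; in practice you’d use the ByT5Tokenizer that ships with the Transformers library.

```python
# How ByT5 "tokenizes": raw UTF-8 bytes, shifted to leave room for
# special tokens. Simplified illustration of the scheme -- use
# Hugging Face's ByT5Tokenizer in real code.

PAD, EOS, UNK = 0, 1, 2
OFFSET = 3  # ids 0-2 are reserved for the special tokens above

def byt5_encode(text: str) -> list:
    """Map text straight to byte-level ids; no vocabulary, no merges."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def byt5_decode(ids: list) -> str:
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = byt5_encode("héllo")  # 'é' is two UTF-8 bytes, so 6 byte ids + EOS
print(ids)
print(byt5_decode(ids))
```

Note how a typo or an accented character never produces an out-of-vocabulary token here, which is exactly why ByT5 shines on noisy and spelling-sensitive text.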
Binary Passage Retriever (BPR)
The peeps that rolled out the LUKE model dropped some ACL bombs on Twitter: they reduced the memory size of the Dense Passage Retriever (DPR) model w/o losing QA accuracy. ✌
“Compared with DPR, BPR substantially reduces the memory cost from 65GB to 2GB without a loss of accuracy on two standard open-domain question answering benchmarks: Natural Questions and TriviaQA.” 🔥🔥
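The gist of how BPR pulls that off: hash each dense passage embedding down to a binary code (one bit per dimension instead of a 32-bit float, which is where the 65GB → 2GB shrinkage comes from) and fetch candidates by Hamming distance. The sketch below shows the idea with a fixed sign-based hash and made-up 4-d vectors; the real model learns the hash end-to-end and reranks the Hamming candidates with the dense query vector.

```python
# Sketch of the BPR idea: binary-code the passage index, retrieve
# candidates by Hamming distance. Toy vectors, fixed sign hash --
# the actual model learns the hash and reranks the candidates.

def to_binary(vec):
    """Hash a dense vector to bits via the sign function."""
    return tuple(1 if x >= 0 else 0 for x in vec)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 4-d passage embeddings.
passages = {
    "p1": [0.9, -0.2, 0.4, -0.7],
    "p2": [-0.5, 0.8, -0.1, 0.3],
    "p3": [0.6, -0.4, 0.2, -0.9],
}
index = {pid: to_binary(v) for pid, v in passages.items()}

query = [1.0, -0.3, 0.5, -0.2]
q_code = to_binary(query)
candidates = sorted(index, key=lambda pid: hamming(q_code, index[pid]))
print(candidates)  # nearest-by-Hamming first
```

Hamming distance over packed bits is also extremely fast on modern CPUs (a couple of XOR + popcount instructions), so the speed win comes along for free with the memory win.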
The Good, the Bad and the Mysterious: GPT-3
Across a multi-scale paradigm, a change in quantity on one scale is a change in quality on another.
Stanford decided to investigate GPT-3’s emergent phenomenon of few-shot learning. Warning: power laws incoming…
DoT: Double Transformer Model
Google has a new architecture for using transformers to parse table data.
“DoT, a double transformer model, that decomposes the problem into two sub-tasks: A shallow pruning transformer that selects the top-K tokens, followed by a deep task-specific transformer that takes as input those K tokens.”
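The decomposition quoted above can be sketched in a few lines: a cheap shallow scorer ranks the input tokens, only the top-K survive, and just those reach the expensive deep model. The lexical-overlap scorer and the tiny table below are stand-ins of my own, not Google’s actual pruning transformer.

```python
# Sketch of the DoT decomposition: shallow scorer prunes to top-K
# tokens, deep model sees only those. The scorer here is a crude
# hypothetical stand-in, not Google's pruning transformer.

def shallow_scorer(tokens, query):
    """Hypothetical relevance score: lexical overlap with the query."""
    q = set(query.lower().split())
    return [1.0 if t.lower() in q else 0.0 for t in tokens]

def prune_top_k(tokens, scores, k):
    """Keep the K highest-scoring tokens, preserving original order."""
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]

# Flattened toy table plus a question about it.
table_tokens = ["city", "Paris", "population", "2.1M", "area", "105km2"]
query = "what is the population of Paris"
scores = shallow_scorer(table_tokens, query)
kept = prune_top_k(table_tokens, scores, k=3)
print(kept)  # only these tokens go to the deep task-specific transformer
```

Since self-attention cost grows quadratically with sequence length, shrinking a long flattened table to K tokens before the deep transformer is where the efficiency win comes from.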
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
SemEval2021 Reading Comprehension of Abstract Meaning
A task designed to evaluate machines’ ability to represent and understand abstract concepts, with a multiple-choice dataset.
Directed Sentiment Analysis in News Text
A dataset and code for extracting sentiment relationships between political entities in news text.
SapBERT: Self-alignment pretraining for BERT
Introduces a novel cross-lingual biomedical entity linking task (XL-BEL), establishing a wide-coverage and reliable evaluation benchmark for cross-lingual entity representations in the biomedical domain across 10 languages.
CIDER: Commonsense Inference for Dialogue Explanation and Reasoning
CIDER — a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference.
Transformer-based Text Classifier: Simple yet Identifiable
Source code for observing the identifiability of attention weights.
DynaEval
DynaEval serves as a unified framework for both turn and dialogue level evaluation in open-domain dialogue.
NeuralLog: Natural Language Inference with Joint Neural and Logical Reasoning
A symbolic and neural inference system that improves accuracy on the NLI task and can achieve state-of-the-art accuracy on the SICK and MED datasets.
Commit Autosuggestions
Using CodeBERT, newly introduced patch_type_embeddings can distinguish between added and deleted code.
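The mechanism is the same trick BERT uses for segment embeddings: each diff token gets its word embedding plus a small learned vector marking whether the token came from an added (+) or deleted (-) line. The 3-d tables and diff below are toy numbers of my own, not the actual CodeBERT weights or the repo’s API.

```python
# Sketch of the patch_type_embeddings idea: token embedding + a
# learned "added"/"deleted" vector, summed per token. Toy numbers,
# not the actual CodeBERT weights.

# Hypothetical 3-d embedding tables.
token_emb = {
    "return": [0.25, 0.5, 0.0],
    "None":   [0.0, 0.25, 0.5],
    "0":      [0.5, 0.0, 0.25],
}
patch_type_emb = {"added": [1.0, 0.0, 0.0], "deleted": [-1.0, 0.0, 0.0]}

def embed_diff(tokens_with_types):
    """Sum the token embedding and the patch-type embedding per token."""
    return [
        [t + p for t, p in zip(token_emb[tok], patch_type_emb[typ])]
        for tok, typ in tokens_with_types
    ]

# The diff "- return None" / "+ return 0" as (token, patch_type) pairs.
diff = [("return", "deleted"), ("None", "deleted"),
        ("return", "added"), ("0", "added")]
print(embed_diff(diff))
```

Note how the two occurrences of "return" end up with different vectors purely because of the patch-type term, which is exactly the signal the commit-message model needs.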
MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding
MPC-BERT, a pre-trained model for Multi-Party Conversation (MPC) understanding that considers learning who says what to whom in a unified model with several elaborated self-supervised tasks.
🌳 Fingerprinting Fine-tuned Language Models in the Wild
Experiments conducted to demonstrate the limitations of existing fingerprinting approaches in language models.
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences
A new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs.
DialoGraph
Incorporating interpretable strategy graph networks into negotiation dialogues.
CitationIE
This work augments text representations with a complementary source of document context: the citation graph of referential links between citing and cited papers. The graph is used for information extraction from scientific papers.
HERALD: An Annotation Efficient Method to Train User Engagement Predictors in Dialogs
HERALD, an annotation efficient framework that reframes the training data annotation process as a denoising problem.
Discriminative Reasoning for Document-level Relation Extraction
This new method outperforms the previous state-of-the-art performance on the large-scale DocRE dataset.
ConvoSumm: Conversation Summarization Benchmark
Four datasets for evaluating a model’s summarization performance on a broad spectrum of conversation data.
OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
A dataset consistent with OntoNotes, with 168 documents (∼150K tokens, 19,378 mentions, 4,471 coref chains) in 12 genres, including conversational genres.
Attention-Based Contextual Language Modeling Adaptation
MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain
Dataset of the Week: CoDesc
What is it?
A dataset of 4.2M Java source code snippets paired with their natural language descriptions, drawn from code search and code summarization studies.
Where is it?
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat