NATURAL LANGUAGE PROCESSING (NLP) WEEKLY NEWSLETTER
The NLP Cypher | 06.06.21
Trinity
Welcome back to the simulation ✌ . So the ACL 2021 data dump happened, and now we have a huge list of repos to get through in the Repo Cypher this week. 😁
Also, we are updating the NLP index very soon with 100+ new repos (many of which are mentioned here) alongside 30+ new NLP notebooks like this one 👇 . If you would like to get an email alert for future newsletters and asset updates, you can sign up here.
So let us start with incoming awesomeness. Heard of the Graph4NLP library??? If you want to leverage graph neural networks (GNNs) for your NLP tasks, you may want to check it out (it’s in BETA). It runs on top of the Deep Graph Library (DGL). They also have an awesome collection of papers on everything deep learning + graphs + NLP 👇.
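If GNNs on text are new to you, here’s a tiny sketch of the core idea a library like Graph4NLP builds on: one round of message passing over a graph built from a sentence, where each node updates itself by averaging its neighbors’ feature vectors. The graph, features, and aggregation below are toy stand-ins, not the Graph4NLP or DGL API.

```python
# Minimal sketch of one GNN message-passing step (mean aggregation).
# Illustrative only -- this is NOT the Graph4NLP or DGL API.

# Toy dependency-style graph over "cats chase mice":
# hypothetical 2-d node features, edges as (dependent, head) pairs.
features = {
    "cats":  [1.0, 0.0],
    "chase": [0.0, 1.0],
    "mice":  [1.0, 1.0],
}
edges = [("cats", "chase"), ("mice", "chase")]

def message_pass(features, edges):
    """One step: each node averages its own and its neighbors' vectors."""
    incoming = {node: [feat] for node, feat in features.items()}
    for src, dst in edges:
        incoming[dst].append(features[src])
        incoming[src].append(features[dst])  # treat edges as undirected
    updated = {}
    for node, vecs in incoming.items():
        dim = len(vecs[0])
        updated[node] = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return updated

print(message_pass(features, edges))
```

Stacking a few of these rounds is what lets a node’s representation absorb information from tokens several hops away in the graph, which is the appeal for syntax- and semantics-aware NLP tasks.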
https://github.com/graph4ai/graph4nlp_literature
But wait, we need a simulation gut check
Oh… just an FYI: aliens can’t be ruled out as the cause of all the aerial phenomena (~120 events) in the US, according to sources in the Pentagon. The New York Times got an inside scoop on the upcoming UFO report being prepared by the DNI (Director of National Intelligence). Bottom line: no one has a clue what is going on, and it may be aliens. You can read the full article here:
U.S. Finds No Evidence of Alien Technology in Flying Objects, but Can’t Rule It Out, Either
If you want to do your own investigative work, the FAA has a database of unmanned aircraft sightings. Some involve military encounters. 🤷‍♂️
Data:
So…
To conclude the intro: I’d like to welcome everyone to the halfway mark of the glorious year that is… 2021💖💖. A year where Elon runs crypto, aliens are real, and everyone wears masks. 👍
and now… a word from our sponsors:
Important to Note: Elon Musk’s crypto shitposting is so lit that it got the attention of the hacker group Anonymous. I’m guessing they weren't too keen on Elon’s tweets swaying the crypto markets. As a result, they decided to G check him on sight. 🥶🥶
Tokenizers Go Bye Bye: ByT5 Model
No need for an intro: Google already gives one in their repo:
“ByT5 is a tokenizer-free extension of the mT5 model. Instead of using a subword vocabulary like most other pretrained language models (BERT, XLM-R, T5, GPT-3), our ByT5 model operates directly on UTF-8 bytes, removing the need for any text preprocessing. Beyond the reduction in system complexity, we find that parameter-matched ByT5 models are competitive with mT5 across a range of tasks, and outperform mT5 on tasks that involve noisy text or are sensitive to spelling and pronunciation. This repo can be used to reproduce the experiments in the ByT5 paper.”
… already merged in Transformers:
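To make the “tokenizer-free” part concrete, here’s a sketch of ByT5’s encoding scheme: the model consumes raw UTF-8 bytes, shifted by a small offset so the first few ids can serve as special tokens. This mirrors the published scheme (pad=0, eos=1, unk=2, byte b → id b + 3) but is a simplified illustration; in practice you’d use the ByT5Tokenizer that ships with the Transformers library.

```python
# How ByT5 "tokenizes": raw UTF-8 bytes, shifted to leave room for
# special tokens. Simplified illustration of the scheme -- use
# Hugging Face's ByT5Tokenizer in real code.

PAD, EOS, UNK = 0, 1, 2
OFFSET = 3  # ids 0-2 are reserved for the special tokens above

def byt5_encode(text: str) -> list:
    """Map text straight to byte-level ids; no vocabulary, no merges."""
    return [b + OFFSET for b in text.encode("utf-8")] + [EOS]

def byt5_decode(ids: list) -> str:
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = byt5_encode("héllo")  # 'é' is two UTF-8 bytes, so 6 byte ids + EOS
print(ids)
print(byt5_decode(ids))
```

Note how a typo or an accented character never produces an out-of-vocabulary token here, which is exactly why ByT5 shines on noisy and spelling-sensitive text.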
Binary Passage Retriever (BPR)
The peeps that rolled out the LUKE model dropped some ACL bombs on Twitter: they reduced the memory size of the Dense Passage Retriever (DPR) model w/o losing QA accuracy. ✌
“Compared with DPR, BPR substantially reduces the memory cost from 65GB to 2GB without a loss of accuracy on two standard open-domain question answering benchmarks: Natural Questions and TriviaQA.” 🔥🔥
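The gist of how BPR pulls that off: hash each dense passage embedding down to a binary code (one bit per dimension instead of a 32-bit float, which is where the 65GB → 2GB shrinkage comes from) and fetch candidates by Hamming distance. The sketch below shows the idea with a fixed sign-based hash and made-up 4-d vectors; the real model learns the hash end-to-end and reranks the Hamming candidates with the dense query vector.

```python
# Sketch of the BPR idea: binary-code the passage index, retrieve
# candidates by Hamming distance. Toy vectors, fixed sign hash --
# the actual model learns the hash and reranks the candidates.

def to_binary(vec):
    """Hash a dense vector to bits via the sign function."""
    return tuple(1 if x >= 0 else 0 for x in vec)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 4-d passage embeddings.
passages = {
    "p1": [0.9, -0.2, 0.4, -0.7],
    "p2": [-0.5, 0.8, -0.1, 0.3],
    "p3": [0.6, -0.4, 0.2, -0.9],
}
index = {pid: to_binary(v) for pid, v in passages.items()}

query = [1.0, -0.3, 0.5, -0.2]
q_code = to_binary(query)
candidates = sorted(index, key=lambda pid: hamming(q_code, index[pid]))
print(candidates)  # nearest-by-Hamming first
```

Hamming distance over packed bits is also extremely fast on modern CPUs (a couple of XOR + popcount instructions), so the speed win comes along for free with the memory win.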
The Good, the Bad and the Mysterious: GPT-3
Across a multi-scale paradigm, a change in quantity on one scale is a change in quality on another.
Stanford decided to investigate GPT-3’s emergent phenomenon of few-shot learning. Warning: power laws incoming…
DoT: Double Transformer Model
Google has a new architecture for using transformers to parse table data.
“DoT, a double transformer model, that decomposes the problem into two sub-tasks: A shallow pruning transformer that selects the top-K tokens, followed by a deep task-specific transformer that takes as input those K tokens.”
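The decomposition quoted above can be sketched in a few lines: a cheap shallow scorer ranks the input tokens, only the top-K survive, and just those reach the expensive deep model. The lexical-overlap scorer and the tiny table below are stand-ins of my own, not Google’s actual pruning transformer.

```python
# Sketch of the DoT decomposition: shallow scorer prunes to top-K
# tokens, deep model sees only those. The scorer here is a crude
# hypothetical stand-in, not Google's pruning transformer.

def shallow_scorer(tokens, query):
    """Hypothetical relevance score: lexical overlap with the query."""
    q = set(query.lower().split())
    return [1.0 if t.lower() in q else 0.0 for t in tokens]

def prune_top_k(tokens, scores, k):
    """Keep the K highest-scoring tokens, preserving original order."""
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]

# Flattened toy table plus a question about it.
table_tokens = ["city", "Paris", "population", "2.1M", "area", "105km2"]
query = "what is the population of Paris"
scores = shallow_scorer(table_tokens, query)
kept = prune_top_k(table_tokens, scores, k=3)
print(kept)  # only these tokens go to the deep task-specific transformer
```

Since self-attention cost grows quadratically with sequence length, shrinking a long flattened table to K tokens before the deep transformer is where the efficiency win comes from.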
Repo Cypher 👨‍💻
A collection of recently released repos that caught our 👁
SemEval2021 Reading Comprehension of Abstract Meaning
A task designed to evaluate machines’ ability to represent and understand abstract concepts, with a multiple-choice dataset.
Directed Sentiment Analysis in News Text
A dataset and code for extracting sentiment relationships between political entities in news text.
SapBERT: Self-alignment pretraining for BERT
Introduces a novel cross-lingual biomedical entity linking task (XL-BEL), establishing a wide-coverage and reliable evaluation benchmark for cross-lingual entity representations in the biomedical domain across 10 languages.
CIDER: Commonsense Inference for Dialogue Explanation and Reasoning
CIDER — a manually curated dataset that contains dyadic dialogue explanations in the form of implicit and explicit knowledge triplets inferred using contextual commonsense inference.
Transformer-based Text Classifier: Simple yet Identifiable
Source code for observing the identifiability of attention weights.
DynaEval
DynaEval serves as a unified framework for both turn and dialogue level evaluation in open-domain dialogue.
NeuralLog: Natural Language Inference with Joint Neural and Logical Reasoning
A symbolic and neural inference system that improves accuracy on the NLI task and can achieve state-of-the-art accuracy on the SICK and MED datasets.
Commit Autosuggestions
Using CodeBERT, newly introduced patch_type_embeddings can distinguish between added and deleted code.
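The mechanism is the same trick BERT uses for segment embeddings: each diff token gets its word embedding plus a small learned vector marking whether the token came from an added (+) or deleted (-) line. The 3-d tables and diff below are toy numbers of my own, not the actual CodeBERT weights or the repo’s API.

```python
# Sketch of the patch_type_embeddings idea: token embedding + a
# learned "added"/"deleted" vector, summed per token. Toy numbers,
# not the actual CodeBERT weights.

# Hypothetical 3-d embedding tables.
token_emb = {
    "return": [0.25, 0.5, 0.0],
    "None":   [0.0, 0.25, 0.5],
    "0":      [0.5, 0.0, 0.25],
}
patch_type_emb = {"added": [1.0, 0.0, 0.0], "deleted": [-1.0, 0.0, 0.0]}

def embed_diff(tokens_with_types):
    """Sum the token embedding and the patch-type embedding per token."""
    return [
        [t + p for t, p in zip(token_emb[tok], patch_type_emb[typ])]
        for tok, typ in tokens_with_types
    ]

# The diff "- return None" / "+ return 0" as (token, patch_type) pairs.
diff = [("return", "deleted"), ("None", "deleted"),
        ("return", "added"), ("0", "added")]
print(embed_diff(diff))
```

Note how the two occurrences of "return" end up with different vectors purely because of the patch-type term, which is exactly the signal the commit-message model needs.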
MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding
MPC-BERT, a pre-trained model for Multi-Party Conversation (MPC) understanding that considers learning who says what to whom in a unified model with several elaborated self-supervised tasks.
🌳 Fingerprinting Fine-tuned Language Models in the Wild
Experiments conducted to demonstrate the limitations of existing fingerprinting approaches in language models.
COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences
A new commonsense reasoning benchmark dataset comprising natural language true/false statements, with each sample paired with its complementary counterpart, resulting in 4k sentence pairs.
DialoGraph
Incorporating interpretable strategy graph networks into negotiation dialogues.
CitationIE
This work augments text representations with a complementary source of document context: the citation graph of referential links between citing and cited papers. The graph is used for information extraction from scientific papers.
HERALD: An Annotation Efficient Method to Train User Engagement Predictors in Dialogs
HERALD, an annotation efficient framework that reframes the training data annotation process as a denoising problem.
Discriminative Reasoning for Document-level Relation Extraction
This new method outperforms the previous state-of-the-art performance on the large-scale DocRE dataset.
ConvoSumm: Conversation Summarization Benchmark
Four datasets for evaluating a model’s summarization performance on a broad spectrum of conversation data.
OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
A dataset consistent with OntoNotes, with 168 documents (∼150K tokens, 19,378 mentions, 4,471 coref chains) in 12 genres, including conversational genres.
Attention-Based Contextual Language Modeling Adaptation
MedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain
Dataset of the Week: CoDesc
What is it?
A dataset of 4.2M Java source code snippets paired with their natural language descriptions, drawn from code search and code summarization studies.
Where is it?
Every Sunday we do a weekly round-up of NLP news and code drops from researchers around the world.
For complete coverage, follow our Twitter: @Quantum_Stat