
Introduction to Word Embeddings

Implementing word embeddings using deep learning for natural language processing.

Bharat Choudhary · Published in Towards AI · Aug 4, 2020

Word Cloud from Kaggle

Word embedding is a method to capture the “meaning” of a word in a low-dimensional vector, and it can be used in a variety of tasks in Natural Language Processing (NLP).

Before beginning the word embedding tutorial, we should have an understanding of vector spaces and similarity measures.

Vector Space

A sequence of numbers used to identify a point in space is called a vector, and a whole collection of vectors that all belong to the same dataset is called a vector space.

Words in a text can also be represented as points in a higher-dimensional vector space, where words with similar meanings have similar representations. For example,

Photo by Allison Parrish from GitHub

The above image shows a vector representation of words on the scales of cuteness and size of animals. We can see that there is a semantic relationship between words based on shared properties. It is difficult to visualize higher-dimensional relationships between words, but the math behind them is the same, so it works in higher dimensions as well.

Similarity measures

A similarity measure is used to calculate the distance between vectors in the vector space; it measures how similar, or how far apart, two data points are. This allows us to capture the fact that words used in similar ways end up with similar representations, naturally capturing their meaning. There are many similarity measures available, but we will discuss Euclidean distance and cosine similarity.

Euclidean distance

One way to calculate how far apart two data points are in a vector space is to compute the Euclidean distance.

import math

def distance2d(x1, y1, x2, y2):
    # Euclidean distance between the points (x1, y1) and (x2, y2)
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)

So, the distance between “capybara” (70, 30) and “panda” (74, 40) from the above image example:

… is less than the distance between “tarantula” and “elephant”:

This shows that “panda” and “capybara” are more similar to each other than “tarantula” and “elephant” are.
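As a quick check with the coordinates given above (a minimal sketch; the tarantula and elephant coordinates below are illustrative placeholders, not values read from the figure):

# distance between “capybara” (70, 30) and “panda” (74, 40)
print(distance2d(70, 30, 74, 40))   # ≈ 10.77, relatively close

# distance between “tarantula” and “elephant”; the coordinates here are
# illustrative placeholders, substitute the values from the figure
print(distance2d(8, 3, 65, 90))     # ≈ 104.01, much farther apart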

Cosine similarity

It is a measure of similarity between two non-zero vectors in an inner product space, defined as the cosine of the angle between them.

from numpy import dot
from numpy.linalg import norm

# example word vectors, e.g. “capybara” (70, 30) and “panda” (74, 40)
a, b = [70, 30], [74, 40]
cos_sim = dot(a, b) / (norm(a) * norm(b))   # close to 1.0 for vectors pointing in similar directions

Now the question is: what are word embeddings and why do we use them?

In simple words, they are vector representations of words in sentences, documents, etc.

Word embedding is a learned representation of words in the form of numeric vectors. It learns a dense, distributed representation for a predefined fixed-size vocabulary from a corpus of text. The word embedding representation is capable of revealing many hidden relationships between words. For example, vector(“king”) - vector(“lords”) is similar to vector(“queen”) - vector(“princess”).

It is an improvement over traditional ways of representing words, such as the bag-of-words model, which produces large sparse vectors that are computationally impractical for representing an entire vocabulary. Those representations are sparse because of the vast vocabularies involved: a given word or document is represented by a large vector made up mostly of zero values.
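As a quick illustration of that sparsity (a minimal sketch, not from the article, using scikit-learn's CountVectorizer):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie was great", "the movie was terrible"]
bow = CountVectorizer().fit_transform(docs)
# each document becomes a vector as long as the whole vocabulary, mostly zeros;
# on a real corpus the vocabulary runs to tens of thousands of columns
print(bow.toarray())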

Two popular methods of learning word embeddings from the text include:

1. Word2Vec.

2. GloVe.

There are pre-trained models that were trained over a large corpus of text. We can use them for our use case.

In addition to these methods, a word embedding can be learned using a deep learning model. This can be a slower approach, but we can design it for our own use case: the model is trained on a specific training dataset as per our own requirements. Keras provides a very easy and flexible Embedding layer that can be used in neural networks on text data.

In this tutorial, we’re going to use Keras to train our own word embedding model, which can then be used for sentiment analysis, machine translation, language modelling, and various other natural language processing tasks.

Importing Module

Let’s get started by importing our dataset and modules and checking the data’s head. I took a dataset from Kaggle’s IMDB Movie Review-NLP.

import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

We’ll use Scikit-learn to divide our dataset into a training set and test set. We’ll train the word embedding on 70% of the data and test it on 30%.
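A minimal sketch of this split, assuming the Kaggle CSV has been loaded into a DataFrame with ‘review’ and ‘sentiment’ columns (the file name and column names are assumptions):

from sklearn.model_selection import train_test_split

df = pd.read_csv('IMDB Dataset.csv')                    # file name is an assumption
X = df['review'].values                                 # raw review text
y = (df['sentiment'] == 'positive').astype(int).values  # 1 = positive, 0 = negative
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)               # 70% train / 30% test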

INTEGER ENCODING ALL THE DOCUMENTS

After this, all the unique words will be represented by an integer. For this, we are using the one_hot function available in Keras. Note that vocab_size is specified as the total number of unique words so that each word gets its own integer encoding (one_hot hashes words, so a sufficiently large vocab_size keeps collisions unlikely rather than strictly impossible).

Note one important thing: the integer encoding for a given word stays the same across different texts, e.g. ‘year’ is encoded as 23518 in each and every document.
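A minimal sketch of this encoding step, assuming X_train and X_test hold the raw review strings from the split above (the whitespace-based vocabulary count is an assumption):

# vocabulary size: number of unique words across the training reviews
vocab_size = len(set(word for review in X_train for word in review.split()))

# encode every review as a list of integers; one_hot hashes each word, so the
# same word always maps to the same integer
encoded_train = [one_hot(review, vocab_size) for review in X_train]
encoded_test = [one_hot(review, vocab_size) for review in X_test]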

Let’s now have a look at one of the reviews. We’ll compare this sentence with its transformation as we move through the next steps.

I really didn't like this movie because it didn't really bring across the messages and ideas L'Engle brought out in her novel. We had read the novel in our English class and i absolutely loved it, i'm afraid i can't say the same for the film. There were some serious differences between the novel and the adapted version and it just didn't do any credit to the imaginative genius that is Madeleine L'Engle! This is the reason i gave it such a poor rating. Don't see this movie if you are a big fan of L'Engle's texts because you will be sorely disappointed. However, if you are watching the movie for entertainment purposes (or educational as was my case) then it is an alright movie!

This review will be converted into integer representation where each number represents a unique word.

[24608, 32542, 30289, 58025, 50966, 19624, 43296, 35850, 30289, 32542, 31519, 11569, 30465, 7968, 12928, 34105, 8750, 49668, 38039, 40264, 3503, 45016, 63074, 41404, 53275, 30465, 45016, 40264, 28666, 47101, 44909, 12928, 24608, 62202, 46727, 35850, 24425, 5515, 24608, 25601, 35725, 30465, 10577, 55918, 30465, 13875, 62286, 22967, 5067, 9001, 33291, 1247, 30465, 45016, 12928, 30465, 23555, 44142, 12928, 35850, 41976, 30289, 20229, 15687, 7845, 50705, 30465, 58301, 14031, 11556, 1495, 26143, 8750, 50966, 1495, 30465, 63056, 24608, 39847, 35850, 30936, 54227, 33469, 55622, 8193, 3111, 50966, 19624, 9403, 51670, 40033, 54227, 42254, 52367, 44935, 63226, 17625, 43296, 51670, 65642, 30053, 42863, 34757, 32894, 9403, 51670, 40033, 1112, 30465, 19624, 55918, 55169, 57666, 10193, 50176, 59413, 10480, 63135, 56156, 64520, 35850, 1495, 49938, 59074, 19624]

Padding the Text (to make every text the same length)

The Keras Embedding layer requires all individual documents to be of the same length. Hence, we will pad the shorter documents with 0s. Therefore, in the Keras Embedding layer, ‘input_length’ will be equal to the length (i.e., the number of words) of the longest document.

To pad the shorter documents, I am using the pad_sequences function from the Keras library.

The maximum number of words in any document is :  1719

Here, we found that the maximum number of words a review holds is 1719, so we will pad according to that. In padding, we add zeros (0) to sentences shorter than max_length; for shorter sentences, the 0s are added at the beginning.

For example:

array([    0,     0,     0, ..., 32875, 18129, 60728])
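A minimal sketch of the padding step (the variable names are assumptions; padding='pre' puts the zeros at the beginning, which is also the Keras default):

maxlen = max(len(seq) for seq in encoded_train)   # 1719 for this dataset

padded_train = pad_sequences(encoded_train, maxlen=maxlen, padding='pre')
padded_test = pad_sequences(encoded_test, maxlen=maxlen, padding='pre')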

CREATING THE EMBEDDINGS USING THE KERAS EMBEDDING LAYER

Now all the texts are of the same length (after padding), so we are ready to create and use the embedding layer.

PARAMETERS OF THE EMBEDDING LAYER

‘input_dim’ = the vocabulary size we choose, i.e., the number of unique words in the vocabulary.

‘output_dim’ = the number of dimensions we wish to embed into. Each word is represented by a vector of this many dimensions.

‘input_length’ = the length of the longest text, which is stored in the maxlen variable in the example.
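A minimal sketch of the model behind the summary below, assuming vocab_size and maxlen as defined above (for the summary shown, vocab_size works out to 527,680 / 8 = 65,960 and maxlen = 1719):

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=8, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))   # binary sentiment output

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
print(model.summary())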

Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 1719, 8) 527680
_________________________________________________________________
flatten_1 (Flatten) (None, 13752) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 13753
=================================================================
Total params: 541,433
Trainable params: 541,433
Non-trainable params: 0
_________________________________________________________________
None

Let’s now check the model accuracy on our training set.
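A minimal sketch of training and then evaluating on the training split (the number of epochs is an assumption):

model.fit(padded_train, y_train, epochs=20, verbose=0)

loss, accuracy = model.evaluate(padded_train, y_train)
print('Training Accuracy is', accuracy * 100)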

6000/6000 [==============================] - 1s 170us/step
Training Accuracy is 100.0

The next step we can do is check its accuracy on the test set.
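And the corresponding check on the held-out test set:

loss, accuracy = model.evaluate(padded_test, y_test)
print('Testing Accuracy is', accuracy * 100)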

4000/4000 [==============================] - 1s 179us/step
Testing Accuracy is 86.57500147819519

We are getting 100% training accuracy because the embedding was trained on that data, but the test data contains some unseen words, so we get somewhat lower accuracy there.

In practice, I would recommend using a pre-trained embedding that is kept fixed and performing learning on top of it. That will usually improve performance on test data.
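A minimal sketch of that setup, assuming an embedding_matrix of shape (vocab_size, embedding_dim) has already been built from pre-trained vectors such as Word2Vec or GloVe (the variable names are assumptions):

embedding_dim = embedding_matrix.shape[1]

model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=embedding_dim,
                    weights=[embedding_matrix],   # pre-trained word vectors
                    input_length=maxlen,
                    trainable=False))             # keep the embedding fixed
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])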

What’s Next

Now we have learned how to represent words as continuous vectors. Compared to other forms of text representation such as bag-of-words or TF-IDF (term frequency-inverse document frequency), word embeddings capture much richer semantic relationships between words and can significantly improve the performance of natural language processing (NLP) tasks.

Now, I would suggest you try word embeddings on your own NLP task; you will find a significant improvement in performance. You can also experiment with word embeddings on the same dataset by using pre-trained word embeddings such as Word2Vec as fixed vectors and performing learning on top of them.

Most often, you will notice that pre-trained models give higher accuracy on the testing set, because they were already trained on a large variety of NLP datasets. But if you have enough data and want to perform a specific task, then training your own word embedding is the better choice.

Code for Word Embedding is Available on GitHub.

Thanks for the read. I hope this helps you understand word embeddings and their importance in natural language processing (NLP).

Follow me on Medium. As always, I welcome feedback and constructive criticism and can be reached on LinkedIn.
