Mastering Sentiment Analysis with Python using the Attention Mechanism
Businesses across industries now recognize the importance of understanding customer opinions and sentiments. By gauging the sentiment behind product reviews, brand mentions, and service feedback, companies can gain vital insights into customer satisfaction, brand perception, and market trends.
In this article, we’re diving deep into the exciting world of sentiment analysis. We’ll show you how to build your very own sentiment analysis model using Python. And guess what? We’re bringing in the big guns — the attention mechanism. With the help of the amazing Keras library, we’ll train a deep-learning model that can read emotions like a pro.
We’re about to unravel the mysteries hidden within text data. Get ready to make sense of the countless opinions and sentiments floating around in the digital universe. With our step-by-step guide, you’ll be armed with the skills and tools to conquer sentiment analysis like a boss. Let’s dive into this adventure and uncover the true power of words!
Getting Started
Before we proceed, we will need to install some libraries. We will make use of the following libraries:
NLTK, which we will use for text preprocessing: removing stop words and lemmatizing tokens.
TensorFlow, an open-source platform for machine learning.
Keras, a library for building deep learning models.
To install these, simply open up a terminal and type:
$ pip install nltk tensorflow keras pandas matplotlib seaborn scikit-learn
Data Preparation
We will be using the IMDB Review Dataset, which contains movie reviews labeled as positive, negative, or unsupervised (unlabeled); only the positive and negative reviews will be used here. The dataset is available on Kaggle.
We download the dataset and load it into the program.
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
classification_report,
confusion_matrix,
roc_auc_score
)
import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.layers import Concatenate, Dense, Input, LSTM, Embedding, Dropout, Activation, GRU, Flatten
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model, Sequential
from keras.layers import Convolution1D
from keras import initializers, regularizers, constraints, optimizers, layers
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Import IMDB dataset and drop unnecessary columns and rows
df2 = pd.read_csv('imdb_master.csv', encoding="latin-1")
df2 = df2.drop(['Unnamed: 0','type','file'],axis=1)
df2.columns = ["review","sentiment"]
df2 = df2[df2.sentiment != 'unsup']
df = df2
# Set NLTK stop words and the WordNet lemmatizer
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
The resulting df has two columns: review, which contains the movie review, and sentiment, which is the sentiment of the review (either positive or negative).
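As a quick, optional sanity check, we can confirm the size and class balance of the remaining data before cleaning it:
# After dropping 'unsup', roughly 25,000 positive and 25,000 negative reviews should remain
print(df.shape)
print(df['sentiment'].value_counts())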
Clean Data
Before we can proceed, we need to clean and preprocess the text data. This involves several steps, including removing punctuation, lowercasing the text, lemmatizing words, removing stop words, and joining the text again into strings.
# Defining function to clean text data
def clean_text(text):
    # Remove punctuation and other non-word characters
    text = re.sub(r'[^\w\s]', '', text, flags=re.UNICODE)
    text = text.lower()
    # Lemmatize each token, first as a noun and then as a verb
    text = [lemmatizer.lemmatize(token) for token in text.split(" ")]
    text = [lemmatizer.lemmatize(token, "v") for token in text]
    # Drop English stop words and rejoin into a single string
    text = [word for word in text if word not in stop_words]
    text = " ".join(text)
    return text
# Applying the clean_text function to every row of the 'review' column
df['Processed_Reviews'] = df.review.apply(lambda x: clean_text(x))
We define a function clean_text that takes a string of text as input and applies a series of regex and NLP operations to remove unnecessary characters and extract useful features. We then apply this function to all the reviews in the dataset to obtain the preprocessed version of the data.
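For instance, applying the function to a short made-up review shows the effect of the whole pipeline (the exact output depends on the NLTK data downloaded above):
sample = "The movie was absolutely wonderful, and the acting impressed me!"
print(clean_text(sample))
# Punctuation is stripped, the text is lower-cased, stop words are dropped,
# and the remaining tokens are lemmatized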
![](https://miro.medium.com/v2/resize:fit:700/1*Evmq4HNbFUV70rRsLfqJ2A.png)
Split Data
Next, we encode the target labels, which represent the sentiment of each text, into numerical values. The LabelEncoder maps each unique sentiment label to a unique integer, allowing the model to work with numeric representations of the labels (here, 0 for negative and 1 for positive).
In order to evaluate the performance of our sentiment classification model, we need to split the dataset into training and testing sets. This is accomplished using the train_test_split function, which by default holds out 25% of the data for testing. The x_train and x_test variables store the processed reviews (text data) from the df dataframe, while y_train and y_test store the corresponding sentiment labels.
# Encode target labels between 0 and 1
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])
# Split into Train and Test sets
x_train, x_test, y_train, y_test = train_test_split(df['Processed_Reviews'],
                                                     df['sentiment'])
Set Model Parameters
To prepare the text data for modeling, it must be tokenized, meaning the sentences are split into individual words or tokens. The Keras library provides the Tokenizer class, which performs this task. By specifying the num_words parameter as MAX_FEATURES, we limit the vocabulary size to the most frequently occurring words in the training data.
The tokenizer.fit_on_texts method fits the tokenizer on the x_train text data. Once the tokenizer is fitted, we can convert the text data into sequences of integers using the texts_to_sequences method.
Machine learning models typically require input data of a uniform shape. To achieve this, the sequences of integers in list_tokenized_train are padded with zeros or truncated to ensure a fixed sequence length.
# Model Parameters
MAX_FEATURES = 6000
EMBED_SIZE = 128
RNN_CELL_SIZE = 32
MAX_LEN = 130 # Since our mean length is 128.5
# Tokenization
tokenizer = Tokenizer(num_words=MAX_FEATURES)
tokenizer.fit_on_texts(x_train)
# Converting into sequences
list_tokenized_train = tokenizer.texts_to_sequences(x_train)
# Padding Sequences
X_train = pad_sequences(list_tokenized_train, maxlen=MAX_LEN)
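As a quick sanity check on the MAX_LEN choice, we can look at the average review length after cleaning; it should land near the 128.5 quoted in the comment above (a sketch, since the exact figure depends on the preprocessing):
# Mean number of words per cleaned review
print(df['Processed_Reviews'].str.split().str.len().mean())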
Model Architecture
To perform sentiment analysis on the reviews, we will construct a deep learning model based on the attention mechanism. The model will be composed of three main sections, each playing a crucial role in the analysis process.
- Attention Mechanism: This layer enhances the model’s ability to concentrate on significant words while disregarding irrelevant ones. By assigning attention weights to different words in the input sequence, the model can focus on the most informative elements for sentiment analysis.
- Embedding Layer: This layer transforms each word in the input sequence into a fixed-size vector representation. It captures the semantic meaning and contextual information associated with each word, allowing the model to understand the nuances of the text.
- Bi-Directional RNN: This layer encodes the sentence into a fixed-length vector representation by leveraging both the forward and backward information flow. The bi-directional nature of this recurrent neural network (RNN) enables the model to capture the dependencies and sequential patterns in the text, contributing to a more comprehensive analysis.
By combining these three components, our model can effectively analyze the sentiment of the reviews, taking into account the importance of specific words, the semantic meaning of the text, and the contextual information within the sequences. This architecture empowers the model to make informed predictions and extract valuable insights from the input data.
Creating the Attention Layer
We start by creating the attention layer. We could simply use the final encoded state of an RNN for the prediction task, but RNNs tend to forget relevant information from earlier time steps, so we keep the encoded states from every step instead.
Because not all of these encoded states are equally informative, the model learns how much each one should contribute and forms its prediction from a weighted sum of them; that weighted sum is our attention mechanism.
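For reference, the layer below implements additive (Bahdanau-style) attention. For encoder outputs $h_1, \dots, h_T$ and a query state $s$, it computes
$$e_t = v^\top \tanh(W_1 h_t + W_2 s), \qquad \alpha_t = \mathrm{softmax}(e_t), \qquad c = \sum_{t=1}^{T} \alpha_t h_t,$$
where $W_1$, $W_2$, and $v$ correspond to the dense layers self.W1, self.W2, and self.V, and the context vector $c$ is what the rest of the network consumes. The following Python code creates the attention layer using Keras: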
# Attention Layer
class Attention(tf.keras.Model):
    def __init__(self, units):
        super(Attention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(
            self.W1(features) + self.W2(hidden_with_time_axis))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights
This code defines an Attention class that extends tf.keras.Model. In the constructor we set up three dense layers, self.W1, self.W2, and self.V, and the call method computes the attention itself.
We first expand the hidden state tensor to give it a time dimension, then calculate the score by passing the sum of self.W1(features) and self.W2(hidden_with_time_axis) through a tanh. The attention weights come from tf.nn.softmax(self.V(score), axis=1), and the context vector is the feature tensor features weighted by attention_weights and summed across the time dimension.
Embedding Layer
The next step involves creating an embedding layer to represent the words in our text data. The Embedding layer is used for this purpose, with MAX_FEATURES representing the maximum number of features and EMBED_SIZE specifying the dimensionality of the word embeddings.
# Add Embedding Layer
sequence_input = Input(shape=(MAX_LEN,), dtype="int32")
embedded_sequences = Embedding(MAX_FEATURES, EMBED_SIZE)(sequence_input)
Bidirectional RNN
To capture contextual information from both forward and backward sequences, a bidirectional recurrent neural network (RNN) is employed. The Bidirectional wrapper is used to encapsulate the LSTM layer, and the outputs of the bidirectional LSTM are then collected.
# Add Bidirectional layer
bilstm = Bidirectional(LSTM(64, return_sequences=True), name="bi_lstm_0")(embedded_sequences)
(lstm, forward_h, forward_c, backward_h, backward_c) = Bidirectional(LSTM(RNN_CELL_SIZE, return_sequences=True, return_state=True), name="bi_lstm_1")(bilstm)
Attention and Dense Layers
Since our model utilizes a bidirectional RNN, we concatenate the hidden states of each RNN before calculating attention weights and applying weighted summation. The concatenated hidden states are passed to the attention layer.
# Concatenate RNN hidden states and apply attention
state_h = Concatenate()([forward_h, backward_h])
state_c = Concatenate()([forward_c, backward_c])
context_vector, attention_weights = Attention(10)(lstm, state_h)
We then connect a dense layer to the context vector, followed by a dropout layer to prevent overfitting, and finally, a dense layer with a sigmoid activation function to produce the model’s output.
# Add Dense Layers
dense1 = Dense(20, activation="relu")(context_vector)
dropout = Dropout(0.05)(dense1)
output = Dense(1, activation="sigmoid")(dropout)
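With the output layer in place, we assemble the full model using the Keras functional API. This step is not shown explicitly above, but it is required before the plot_model and model.compile calls that follow:
# Build the model from the input and output tensors
model = Model(inputs=sequence_input, outputs=output)
model.summary()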
Visualizing the Model Architecture
To gain a better understanding of the model’s architecture and layer connections, we can generate a visual representation of the model using the plot_model function from the Keras library.
# Plot Model
keras.utils.plot_model(model, show_shapes=True, dpi=90)
This function generates a graphical representation of the model, showing the flow of data from the input layer to the output layer. The show_shapes=True parameter ensures that the shapes of the input and output tensors are displayed in the diagram. The resulting plot provides a clear visualization of the model's structure, allowing us to analyze and verify its design.
![](https://miro.medium.com/v2/resize:fit:700/1*G2zZOqKdNbiV08aIrGE0Tw.png)
Model Compilation
To compile the model, we specify the loss function, optimizer, and evaluation metrics. In the example below, we utilize binary cross-entropy as the loss function and the Adam optimizer. Additionally, we define a set of metrics to evaluate the model’s performance.
# Set Train Metrics
METRICS = [
keras.metrics.TruePositives(name='tp'),
keras.metrics.FalsePositives(name='fp'),
keras.metrics.TrueNegatives(name='tn'),
keras.metrics.FalseNegatives(name='fn'),
keras.metrics.BinaryAccuracy(name='accuracy'),
keras.metrics.Precision(name='precision'),
keras.metrics.Recall(name='recall'),
keras.metrics.AUC(name='auc'),
]
# Compile the Model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=METRICS)
By following these preprocessing steps and constructing the sentiment classification model, we can effectively prepare the text data for sentiment analysis. The combination of encoding, splitting, tokenization, and padding ensures that the data is ready for training and evaluation, leading to accurate sentiment predictions.
Training Model
After compiling the model, we can proceed with training it on the prepared training data. In this example, we will train the attention model for 5 epochs using mini-batches of 100 samples.
# Train Model
BATCH_SIZE = 100
EPOCHS = 5
history = model.fit(X_train, y_train,
batch_size=BATCH_SIZE,
epochs=EPOCHS,
validation_split=0.2)
During training, the model's performance on both the training and validation sets is recorded in the history object. This information will be useful for visualizing the model's training progress and evaluating its performance.
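Each entry in history.history is a list with one value per epoch, so we can inspect the recorded metrics directly:
# Names of the tracked training and validation metrics
print(history.history.keys())
# Validation accuracy for each of the 5 epochs
print(history.history['val_accuracy'])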
Model Evaluation
Once the model is trained, we can evaluate its performance on the test set. We first preprocess the test data by tokenizing and padding it using the same tokenizer and maximum length as before.
# Tokenize and padding
list_sentences_test = x_test
list_tokenized_test = tokenizer.texts_to_sequences(list_sentences_test)
X_test = pad_sequences(list_tokenized_test, maxlen=MAX_LEN)
We then use the trained model to make predictions on the test data. The model outputs a probability for each review, so we also threshold the probabilities at 0.5 to obtain binary class labels for the classification report and confusion matrix.
# Make Predictions
prediction = model.predict(X_test)
# Convert probabilities into binary class labels at a 0.5 threshold
y_pred = (prediction > 0.5).astype("int32").ravel()
Confusion Matrix
To evaluate the model's performance, we can compute various metrics such as accuracy, precision, recall, and AUC-ROC. We can also generate a classification report and plot a confusion matrix to gain further insights into the model's predictions.
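For example, the roc_auc_score function imported earlier gives the AUC-ROC in a single call when fed the raw predicted probabilities:
# AUC-ROC computed from the predicted probabilities
print(roc_auc_score(y_test, prediction))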
# Classification Report
report = classification_report(y_test, y_pred)
print(report)

def plot_cm(labels, predictions, p=0.5):
    cm = confusion_matrix(labels, predictions)
    plt.figure(figsize=(5, 5))
    sns.heatmap(cm, annot=True, fmt="d")
    plt.title("Confusion matrix (non-normalized)")
    plt.ylabel("Actual label")
    plt.xlabel("Predicted label")

plot_cm(y_test, y_pred)
The classification report provides metrics such as precision, recall, and F1-score for each class, as well as an average across all classes.
              precision    recall  f1-score   support

           0       0.88      0.85      0.87      6299
           1       0.85      0.88      0.87      6201

    accuracy                           0.87     12500
   macro avg       0.87      0.87      0.87     12500
weighted avg       0.87      0.87      0.87     12500
The confusion matrix visually represents the model’s performance by showing the number of true positives, true negatives, false positives, and false negatives. These metrics and visualizations give us a comprehensive understanding of how well the model is performing on the test set.
![](https://miro.medium.com/v2/resize:fit:500/1*oDuyjYrZ6g6e8Osf4V4xzw.png)
Validation Metrics
Next, we can plot the training and validation metrics over epochs to assess the model’s training progress.
# Figure Params
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
mpl.rcParams["figure.figsize"] = (12, 18)
# Plot Model Metrics
def plot_metrics(history):
    metrics = [
        "loss",
        "tp", "fp", "tn", "fn",
        "accuracy",
        "precision", "recall",
        "auc",
    ]
    # Create a subplot for each metric
    for n, metric in enumerate(metrics):
        name = metric.replace("_", " ").capitalize()
        plt.subplot(5, 2, n + 1)
        plt.plot(
            history.epoch,
            history.history[metric],
            color=colors[0],
            label="Train",
        )
        plt.plot(
            history.epoch,
            history.history["val_" + metric],
            color=colors[1],
            linestyle="--",
            label="Val",
        )
        plt.xlabel("Epoch")
        plt.ylabel(name)
        if metric == "loss":
            plt.ylim([0, plt.ylim()[1] * 1.2])
        elif metric == "accuracy":
            plt.ylim([0.4, 1])
        elif metric in ("tp", "fp", "tn", "fn"):
            plt.ylim([0, plt.ylim()[1]])
        elif metric == "precision":
            plt.ylim([0, 1])
        elif metric == "recall":
            plt.ylim([0.4, 1])
        else:
            plt.ylim([0, 1])
        plt.legend()

plot_metrics(history)
The plot_metrics function generates a set of subplots, each representing a different metric, including loss, true positives (tp), false positives (fp), true negatives (tn), false negatives (fn), accuracy, precision, recall, and AUC. By observing these plots, we can gain insights into the model's performance and determine whether any overfitting or underfitting is occurring.
![](https://miro.medium.com/v2/resize:fit:700/1*WIluVfSMWfzC7tuXUQnqQw.png)
ROC Curve
Finally, we can plot the receiver operating characteristic (ROC) curve to assess the model’s ability to discriminate between positive and negative sentiment.
from sklearn.metrics import roc_curve, auc
# Compute the ROC curve from the raw predicted probabilities
# (not from the thresholded labels)
fpr, tpr, _ = roc_curve(y_test, prediction.ravel())
roc_auc = auc(fpr, tpr)
# Plot ROC Curve
plt.figure()
lw = 2
plt.plot(fpr, tpr, color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
The ROC curve plots the true positive rate against the false positive rate at various classification thresholds. The area under the ROC curve (AUC-ROC) is a commonly used metric to evaluate the overall performance of a binary classification model. A higher AUC value indicates better discriminative ability.
![](https://miro.medium.com/v2/resize:fit:600/1*tRKw-9-o8GxQq27kk0ZeBg.png)
By evaluating the model’s performance using these evaluation techniques, we can gain valuable insights into its accuracy, precision, recall, and AUC-ROC score. These metrics provide a comprehensive assessment of the model’s effectiveness in sentiment classification.
Conclusion
And there you have it, my friends! We’ve journeyed through the fascinating realm of sentiment analysis, armed with Python and the mighty attention mechanism. We’ve seen how this cutting-edge technology can unravel the sentiments hidden within text, giving us valuable insights into customer opinions, brand perception, and market trends.
By harnessing the power of deep learning, we’ve learned to train a robust sentiment analysis model that can read emotions with astonishing accuracy. Armed with this knowledge, you can now tap into the vast amounts of data flowing through social media platforms and make data-driven decisions that propel your business forward.
Remember, sentiment analysis is not just about words; it’s about understanding people, their desires, and their pain points. It’s about gaining a deeper understanding of your audience and building meaningful connections. With your newfound skills, you can listen to the pulse of the digital world and adapt your strategies to meet the evolving needs of your customers.
So go ahead, embrace the power of sentiment analysis and let it guide you towards success. Explore the endless possibilities, uncover hidden sentiments, and make your mark in the world of data-driven decision-making.