Deep Learning

CycleGAN as a Denoising Engine for OCR Images

Cleaning up dirty scanned documents to restore their original form. With a clean document, OCR becomes a much more accurate task.

Sharon Lim · Published in Towards AI · Aug 15, 2020

Fig 1. Dirty input image (left), clean output image (right)

With the rapid growth of digitization, digitized content has become crucial for data processing, storage, and transmission. Optical Character Recognition (“OCR”) is the process of converting typed, handwritten, or printed text into a digitized format that is editable, searchable, and interpretable, obviating the need for manual data entry into systems.

More often than not, scanned documents contain noise that prevents OCR from recognizing the full content of the text. The scanning process often introduces noise such as watermarking, background noise, blurriness due to camera motion or shake, faded text, wrinkles, or coffee stains. Such noise poses readability challenges for existing text recognition algorithms and significantly reduces their performance.

Fig 2. Scanned document converted into a text document using OCR

Basic OCR Pre-processing Methods

  • Binarization is the conversion of a colored image into an image consisting of only black and white pixels by applying a threshold.
  • Skew correction generally involves determining the skew angle of the document image and correcting the image based on that angle.
  • Noise removal smooths the image by removing small dots or patches that have higher intensity than the rest of the image (binarization and noise removal are sketched in code after this list).
  • Thinning and skeletonization ensure a uniform stroke width for handwritten text, since different writers have different styles of writing.
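As a rough illustration, binarization and noise removal can be sketched with OpenCV as follows. This is a minimal sketch rather than a prescribed pipeline; the Otsu threshold and the median filter size are assumptions.

```python
import cv2

def basic_preprocess(path):
    """Minimal OCR pre-processing sketch: binarization followed by noise removal."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Binarization: Otsu's method picks the black/white threshold automatically.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Noise removal: a small median filter suppresses isolated specks and dots.
    return cv2.medianBlur(binary, 3)
```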

CycleGAN as an Advanced OCR Pre-processing Method

Generative Adversarial Networks (“GANs”) are a class of deep learning-based generative models. The GAN architecture involves two sub-models: a generator model for generating new examples, and a discriminator model for classifying whether examples are real (from the domain) or fake (produced by the generator).

Fig 3. Conceptual diagram of the GAN network (source: https://developers.google.com/machine-learning/gan/gan_structure)

CycleGAN, implemented in TensorFlow, was selected as an advanced OCR pre-processing method. An advantage of CycleGAN is that it does not require paired training data. Generally, paired data are datasets in which every data point in one sample is uniquely paired with a data point in another sample.

While input and output variables are still required, they do not need to directly correspond to each other. Since paired data is hard to find in most domains, the unsupervised training capabilities of CycleGAN are indeed very useful.

In the absence of paired images for training, CycleGAN is able to learn a mapping between the distribution of noisy images and the distribution of denoised images using unpaired data, achieving the image-to-image translation needed to clean noisy documents.

Image-to-image translation is the process of transforming an image from one domain (i.e. a noisy document image) to another (i.e. a clean document image). Features not directly tied to either domain, such as the text itself, should stay recognizably the same, while domain-specific attributes such as the background noise are changed.

CycleGAN Architecture

The architecture of CycleGAN comprises two pairs of generators and discriminators. Each generator has a corresponding discriminator, which attempts to distinguish its synthesized images from real ones. As with any GAN, the generators and discriminators learn adversarially: each generator attempts to “fool” its discriminator, while each discriminator learns not to be “fooled”.

In order for the generator to preserve the text of the dirty documents, the model computes the cycle consistency loss, which evaluates how closely an image that has been translated out of its domain and back again resembles its original version.
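Following the description above, a minimal sketch of the cycle consistency loss in TensorFlow might look like the snippet below; the weighting constant LAMBDA and the function name are assumptions, not taken from the project code.

```python
import tensorflow as tf

LAMBDA = 10  # assumed weight for the cycle consistency term

def cycle_consistency_loss(real_image, cycled_image):
    # Mean absolute (L1) error between an image and its round-trip reconstruction.
    return LAMBDA * tf.reduce_mean(tf.abs(real_image - cycled_image))
```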

Fig 4(a) Conversion of original dirty input to its translated clean output
Fig 4(b) Conversion of original clean input to its translated dirty output

The first generator, G-x2y, converts an original dirty input into a translated clean output. A discriminator, D-y, will attempt to evaluate whether the translated clean output is a real or generated image. The discriminator will then provide the probability that the evaluated image is a real image.

The second generator, G-y2x, converts an original clean input into a translated dirty output. The discriminator, D-x, will try to tell apart the real dirty images from generated ones. The created model will be trained in two directions, with a set of dirty images and a set of clean images, as illustrated above.
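The two translation directions can be illustrated with the sketch below. The tiny stand-in networks and the names g_x2y, g_y2x, d_x, and d_y are assumptions made only so the snippet runs; the actual project uses full CycleGAN generator and discriminator architectures.

```python
import tensorflow as tf

def tiny_net(out_channels):
    # Stand-in for the real generator/discriminator architectures (assumption).
    return tf.keras.Sequential([tf.keras.layers.Conv2D(out_channels, 3, padding="same")])

g_x2y, g_y2x = tiny_net(3), tiny_net(3)  # dirty -> clean and clean -> dirty generators
d_x, d_y = tiny_net(1), tiny_net(1)      # discriminators for the dirty and clean domains

dirty = tf.random.uniform((1, 256, 512, 3))  # placeholder batch of dirty documents
clean = tf.random.uniform((1, 256, 512, 3))  # placeholder batch of clean documents

# First direction: G-x2y cleans a dirty image; D-y judges how real it looks.
fake_clean = g_x2y(dirty)
p_real_clean = tf.sigmoid(d_y(fake_clean))

# Second direction: G-y2x dirties a clean image; D-x judges it.
fake_dirty = g_y2x(clean)
p_real_dirty = tf.sigmoid(d_x(fake_dirty))

# Round trips feed the cycle consistency loss sketched earlier.
cycled_dirty = g_y2x(fake_clean)
cycled_clean = g_x2y(fake_dirty)
```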

Methodology and Design

Background noise removal is the process of removing background noise such as uneven contrast, background spots, dog-eared pages, faded sunspots, and wrinkles from documents. Background noise limits the performance of OCR because it makes the text difficult to differentiate from its background.

The CycleGAN model was trained using the Kaggle Document Denoising Dataset, which consists of noisy documents with noise in various forms such as coffee stains, faded sunspots, dog-eared pages, and wrinkles.

Fig 5. Types of dirty documents in the Kaggle Document Denoising Dataset

In order to fine-tune the model training, synthetic text generation was performed to introduce more noise in addition to the Kaggle dataset. This was achieved using the DocCreator program, an open-source, multi-platform tool that can create a virtually unlimited number of ground-truth synthetic document images from a small number of real images. Various realistic degradation models were applied to the original corpus to generate the synthetic images.

Fig 6. Types of dirty documents in the synthetic text generated

The training data are grouped under the folders trainA and trainB, which hold the noisy and clean document images respectively. The validation data are organized under the folders testA and testB in the same way. A separate set of unseen noisy document images under the TEST folder was used to test the trained network and evaluate the models for removing background noise from document images.
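A minimal loading sketch for this layout is shown below, assuming PNG images, the standard CycleGAN convention of trainA holding the noisy and trainB the clean documents, and the input size and batch size reported in the next paragraph.

```python
import tensorflow as tf

IMG_HEIGHT, IMG_WIDTH = 256, 512

def load_folder(pattern):
    # Build a batched dataset of images scaled to [-1, 1] from a folder of PNGs.
    def decode(path):
        image = tf.io.decode_png(tf.io.read_file(path), channels=3)
        image = tf.image.resize(image, (IMG_HEIGHT, IMG_WIDTH))
        return (image / 127.5) - 1.0
    return tf.data.Dataset.list_files(pattern).map(decode).batch(3)

train_dirty = load_folder("trainA/*.png")  # noisy document images
train_clean = load_folder("trainB/*.png")  # clean document images
```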

Fig 7. Breakdown of training, validation and test data

For model training, the Adam optimizer was used with a learning rate of 0.0002 and a momentum term of 0.5, with noisy input images of size 256 × 512. Due to hardware constraints, the best results were obtained by training the CycleGAN model for 300 epochs with a batch size of 3.
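In Keras, the momentum term of Adam corresponds to the beta_1 argument (an interpretation, since the article does not name the exact parameter), so the optimizer setup can be sketched as:

```python
import tensorflow as tf

# One optimizer per network is typical for GAN training.
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=2e-4, beta_1=0.5)

EPOCHS = 300    # training length that gave the best results here
BATCH_SIZE = 3  # batch size used under the hardware constraints
```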

Evaluation of Results

A factorial design experiment was performed to examine how multiple factors could affect a dependent variable, both independently and together. With training parameters such as the Adam optimizer, input image size and batch size kept constant, four different factors were evaluated in this project.

Fig 8. Factors considered in this experiment

Due to the complexity of the CycleGAN architecture, an in-depth evaluation of the CycleGAN performance was carried out using various metrics as follows:

  • Discriminator loss takes two inputs: real images and generated images. The real loss is a sigmoid cross-entropy loss between the real images and an array of ones, since these are the real images. The generated loss is a sigmoid cross-entropy loss between the generated images and an array of zeros, since these are the fake images. The total discriminator loss is the sum of the real loss and the generated loss.
  • Accuracy is calculated by expressing the total discriminator loss as a percentage. The lower the discriminator loss, the higher the accuracy.
  • Generator loss is a sigmoid cross-entropy loss between the generated images and an array of ones. It also includes an L1 loss, the mean absolute error between the generated image and the target image, which encourages the generated image to be structurally similar to the target image.
  • For cycle consistency loss, the original dirty image is passed through the first generator to yield a generated image, which is then passed through the second generator to yield the reconstructed image. The mean absolute error is calculated between the original dirty image and the reconstructed image; the lower it is, the more structurally similar the reconstructed image is to the original dirty image.
  • Peak signal-to-noise ratio (“PSNR”) is defined as the ratio between the maximum possible power of a signal and the power of the distorting noise that degrades the quality of its representation. PSNR is usually expressed in terms of the mean squared error; the higher the PSNR value, the better the image quality. The loss terms and PSNR are sketched in code after this list.
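A rough TensorFlow rendering of these metrics is given below. The helper names and the L1 weight are assumptions, and tf.image.psnr is the only built-in metric used; this is a sketch of the described losses, not the project's exact code.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(disc_real, disc_generated):
    # Real images are compared against ones, generated images against zeros.
    real_loss = bce(tf.ones_like(disc_real), disc_real)
    generated_loss = bce(tf.zeros_like(disc_generated), disc_generated)
    return real_loss + generated_loss

def generator_loss(disc_generated, generated_image, target_image, l1_weight=100):
    # The generator wants its outputs scored as real, plus an L1 term that keeps
    # the generated image structurally close to the target image.
    adversarial = bce(tf.ones_like(disc_generated), disc_generated)
    l1 = tf.reduce_mean(tf.abs(target_image - generated_image))
    return adversarial + l1_weight * l1

def psnr(reference, generated, max_val=255.0):
    # Peak signal-to-noise ratio in decibels; higher means better image quality.
    return tf.image.psnr(reference, generated, max_val=max_val)
```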
Fig 9. Average results for quantifiable performance evaluation metrics

Comparing the results obtained from the four factors, the significant improvement in accuracy from 61% to 99% reflects a reduced discriminator loss obtained during the translation from original images to generated images. With a decrease in generator loss, the generated image was structurally similar to the original image.

For cycle consistency loss, considerable improvement was seen with the increase in training epochs and datasets. The decrease in mean absolute error meant that the reconstructed image was structurally similar to the original dirty image. There was no significant difference noted for the PSNR value as high image quality was obtained.

Let’s take a look at the learning curves upon training the model for 300 epochs using the combined dataset.

Fig 10(a) Generator and Discriminator Losses
Fig 10(b) Training Cycle Consistency Loss
Fig 10(c) Training Accuracy

The above learning curves, derived from the training dataset, give us an idea of how well the model is learning. These learning curves were evaluated on individual batches during each forward pass.

The loss curves reflected a good fit, as they decreased to a point of stability. The amount of “wiggle” in the loss is related to the batch size: when the batch size is 1, the wiggle is relatively high; when the batch size is the full dataset, the wiggle is minimal, because every gradient update should improve the loss function monotonically (unless the learning rate is set too high).

As for the training accuracy plot, there was some volatility given the stochastic nature of the learning algorithm. Although a slower learning rate of 0.0002 was used, the momentum term was set to 0.5 to help the model learn well.

  • Template matching is a method for searching for and finding the location of a template image within a larger image. The template is slid over the input image and compared with the patch of the input image beneath it. The result is a grayscale image in which each pixel denotes how closely the neighborhood of that pixel matches the template, as sketched below.
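This is a minimal OpenCV sketch; the file names are placeholders rather than paths from the project.

```python
import cv2

# Slide the template over the denoised output and score every location.
image = cv2.imread("denoised_output.png", cv2.IMREAD_GRAYSCALE)    # placeholder path
template = cv2.imread("text_template.png", cv2.IMREAD_GRAYSCALE)   # placeholder path

result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)  # grayscale score map
_, max_val, _, max_loc = cv2.minMaxLoc(result)
print(f"best match score {max_val:.3f} at location {max_loc}")
```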
Fig 11(a) Template matching for 100 epochs using only the Kaggle Dataset
Fig 11(b) Template matching for 300 epochs using only the Kaggle Dataset
Fig 11(c) Template matching for 100 epochs using only the synthetic text generated
Fig 11(d) Template matching for 300 epochs using the combined dataset

Overall, the CycleGAN model performed well for template matching. The matching result is a grayscale image whose intensity at each pixel indicates how closely the neighborhood of that pixel matched the template image. The detected points indicate good coverage of the text being detected and matched to the template image.

  • Mean squared error (“MSE”) measures the average of the squared errors, i.e. the average squared difference between the estimated values and the actual values. A value of zero indicates perfect similarity, while larger values indicate less similarity and keep increasing as the average difference between pixel intensities grows.
  • Structural similarity index (“SSIM”) attempts to model the perceived change in the structural information of the image, whereas MSE measures the absolute pixel-level error. Unlike MSE, the SSIM value varies between negative one and one, where one indicates perfect similarity. Both metrics are sketched in code below.
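The sketch uses placeholder file names and assumes both images share the same dimensions.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

test_image = cv2.imread("test_dirty.png", cv2.IMREAD_GRAYSCALE)         # placeholder path
output_image = cv2.imread("generated_clean.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# MSE: average squared difference between pixel intensities (0 means identical).
mse = np.mean((test_image.astype("float") - output_image.astype("float")) ** 2)

# SSIM: perceived structural similarity (1 means identical).
ssim = structural_similarity(test_image, output_image)

print(f"MSE: {mse:.2f}  SSIM: {ssim:.3f}")
```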
Fig 12(a) Output for 100 epochs using only the Kaggle Dataset
Fig 12(b) Output for 300 epochs using only the Kaggle Dataset
Fig 12(c) Output for 100 epochs using only the synthetic text generated
Fig 12(d) Output for 300 epochs using the combined dataset

With the increase in training epochs and dataset size, there is a notable decrease in MSE along with an improvement in the SSIM value. This indicates higher structural similarity and a reduced difference in pixel intensities between the test image and the generated output image.

  • Image subtraction is the process of subtracting the dirty image from the generated image. The purpose is to compare the pixel intensities of the two images to ascertain how well the dirty image has been cleaned by the trained model, as sketched below.
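A minimal sketch of the subtraction step, with placeholder file names:

```python
import cv2

dirty = cv2.imread("test_dirty.png", cv2.IMREAD_GRAYSCALE)           # placeholder path
generated = cv2.imread("generated_clean.png", cv2.IMREAD_GRAYSCALE)  # placeholder path

# Per the description above: subtract the dirty image from the generated image
# (cv2.subtract saturates negative values at zero) and inspect the result.
difference = cv2.subtract(generated, dirty)
cv2.imwrite("subtraction_output.png", difference)
```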
Fig 13(a) Image subtraction output for 100 epochs (left), and 300 epochs (right) using only the Kaggle Dataset
Fig 13(b) Image subtraction output for 100 epochs using only the synthetic text generated (left), and 300 epochs using the combined dataset (right)

The test image was subtracted from the generated output image to ascertain how well the dirty image had been cleaned by the trained model. Comparing the pixel intensities of the output images, favorable results (i.e. an almost fully black output image) were obtained by CycleGAN with a higher number of training epochs using the combined dataset.

The sample output images generated by CycleGAN for the various training parameters and factors are shown below.

  • Original column: the original dirty/clean image.
  • Translated column: the corresponding clean/dirty image produced by the trained model.
  • Reconstructed column: the image reconstructed back to its original dirty/clean state from the translated image.

Fig 14(a) Output image trained for 100 epochs using only the Kaggle Dataset
Fig 14(b) Missing text in a translated image

The output image trained for 100 epochs using only the Kaggle Dataset was not ideal for OCR. There was missing text noted in both the translated and reconstructed images.

Fig 15(a) Output image trained for 300 epochs using only the Kaggle Dataset
Fig 15(b) Missing text in a translated image

There was gradual improvement in the output image trained for 300 epochs using only the Kaggle Dataset. With a higher number of training epochs, the occurrence of missing text had decreased in both the translated and reconstructed images.

Fig 16(a) Output image trained for 100 epochs using only the synthetic text generated
Fig 16(b) Illegibility of text in a translated image

The output image trained for 100 epochs using only the synthetic text generated was illegible. Thus, increasing the number of epochs for training would help to improve the readability of output images.

Fig 17. Output image trained for 300 epochs using the combined dataset

As observed from the above output image trained for 300 epochs using the combined dataset, the model performance for CycleGAN had shown a significant improvement with an increased number of training epochs. The addition of noise with the use of synthetic text generation helped to increase the amount of training data, thereby improving model performance. Deep networks such as CycleGAN perform better with a large amount of training data. The risk of over-fitting was mitigated by exposing the training model to an increased number of training samples.

Recommendations

The following areas can be taken into consideration to improve the performance of CycleGAN:

  • Increase dataset through data augmentation
  • Increase the number of epochs for training
  • Experiment with different learning rates and learning rate schedules to enhance convergence

These recommendations are based on the improved performance observed when the dataset was enlarged via synthetic text generation and the number of training epochs was increased. The CycleGAN architecture has the capacity to handle a much larger training dataset, which provides good grounds for further improvement in model performance.

Conclusion

CycleGAN has proven to be an effective denoising engine to denoise and clean up documents for OCR.

In the absence of paired training data, CycleGAN’s use of cycle consistency loss addresses the challenge of learning meaningful transformations from unpaired data. It allows the generator to produce clean images that preserve the text of the dirty input image through image-to-image translation. The increased number of training epochs, combined with the larger dataset obtained through synthetic text generation, also helped to significantly improve the performance of the CycleGAN model.
