How to Extract Key Information from Business Documents using LayoutLMv3
A quick guide on using LayoutLMv3 to streamline business document understanding
To receive deep insights just like this and more, including top ML papers of the week, job postings, ML tips from real-world experience, and ML stories from researchers and builders, join my newsletter here.
The Need for Document Understanding
A lot of businesses produce a ton of documents every day, which are in turn consumed by other businesses: legal firms, accounting firms, e-commerce companies, and more.
Consuming them requires a ton of manual labor to read, understand, and extract the right information.
We can definitely do better.
Here’s one of the best approaches out there for document understanding, and one I’ve personally tried.
Introducing LayoutLMv3.
LayoutLMv3 falls into the category of models within the field of Intelligent Document Processing (IDP). This field aims to make document understanding easier for computers.
The better IDP algorithms become, the more streamlined the process of consuming and digesting information across different document formats becomes.
Here are the good and the bad about LayoutLMv3.
The Good about LayoutLMv3
LayoutLMv3 is a deep learning model: a multimodal Transformer for Document AI, pre-trained with unified text and image masking.
LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked.
This unified architecture and training objectives make LayoutLMv3 a general-purpose pretrained model for both text-centric and image-centric Document AI tasks.
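The word-patch alignment idea can be illustrated with a toy sketch. This is not code from the paper or the official repo; the function name and label strings are mine, and the real objective is computed over token and patch embeddings inside the model:

```python
# Toy sketch of word-patch alignment (WPA) labeling: a text word is
# "aligned" if its corresponding image patch was NOT masked during
# pre-training, and "unaligned" if that patch was masked.
def wpa_labels(word_to_patch, masked_patches):
    """word_to_patch: for each word index, the index of its image patch.
    masked_patches: set of patch indices hidden during pre-training."""
    return [
        "unaligned" if patch in masked_patches else "aligned"
        for patch in word_to_patch
    ]

# Words mapped to patches 0, 1, 1, 3; patches 1 and 2 are masked.
print(wpa_labels(word_to_patch=[0, 1, 1, 3], masked_patches={1, 2}))
# → ['aligned', 'unaligned', 'unaligned', 'aligned']
```

The model is trained to predict these alignment labels, which forces it to learn which image regions correspond to which words.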
Experimental results show that LayoutLMv3 achieves state-of-the-art performance on:

Text-centric tasks such as:
- form understanding,
- receipt understanding,
- and document visual question answering.

Image-centric tasks such as:
- document image classification,
- and document layout analysis.
The Bad about LayoutLMv3
LayoutLMv3 is very dependent on OCR engines.
This means that you can’t use it without a prior OCR model that does text detection and extraction.
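Concretely, the OCR engine gives you words and pixel bounding boxes, and LayoutLM-family models expect each box normalized to a 0–1000 coordinate space relative to the page size. A minimal helper (my own sketch of that standard normalization, not code from the LayoutLMv3 repo):

```python
# LayoutLMv3, like earlier LayoutLM models, expects each word's bounding
# box rescaled to a 0-1000 coordinate space, independent of page size.
def normalize_box(box, page_width, page_height):
    """box: (x0, y0, x1, y1) in pixels, as returned by an OCR engine."""
    x0, y0, x1, y1 = box
    return (
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    )

# A word box from a 1700x2200 px scanned page:
print(normalize_box((170, 220, 340, 264), 1700, 2200))
# → (100, 100, 200, 120)
```

These normalized boxes, together with the recognized words, are what you feed into the model alongside the page image.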
Also, if you want to train your own model, then the annotation of your dataset may not be straightforward.
You basically have to run an OCR engine first to extract the text.
Then you have to specify which texts represent which entity: invoice date, invoice number, customer name, customer address, …
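In practice, those entity annotations are usually converted into per-word BIO tags for fine-tuning. Here is a minimal sketch of that conversion; the label names and span format are illustrative assumptions, not a fixed LayoutLMv3 schema:

```python
# Sketch: convert entity span annotations into per-word BIO tags,
# the format commonly used to fine-tune LayoutLMv3 for entity extraction.
# Label names (INVOICE_DATE, CUSTOMER_NAME) are examples, not a standard.
def bio_tags(num_words, entities):
    """entities: list of (start_word, end_word_exclusive, label) spans."""
    tags = ["O"] * num_words
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# OCR words: ["Invoice", "Date:", "2023-01-15", "Acme", "Corp"]
print(bio_tags(5, [(2, 3, "INVOICE_DATE"), (3, 5, "CUSTOMER_NAME")]))
# → ['O', 'O', 'B-INVOICE_DATE', 'B-CUSTOMER_NAME', 'I-CUSTOMER_NAME']
```

Each tag then gets paired with the word's text and normalized bounding box to form one training example.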
There aren’t that many annotation tools out there to help you do this.
I personally had to build my own annotation tool because I needed to integrate LayoutLMv3 with a proprietary OCR engine.
Below is a sample output of how LayoutLMv3 can do question answering on a document.
To help you understand more about this model and even train it on your own data, here are some resources:
- You can check the original LayoutLMv3 paper.
- You can also check the GitHub repo.
- If you want to train and test the model yourself, check out this Colab.
- To annotate your data and prepare it for training with LayoutLMv3, you can check these annotation tools: this and this.
Conclusion
Document understanding is a crucial part of many businesses. There has been a lot of recent work aimed at making it easier, which in turn simplifies extracting and sharing useful information from documents. LayoutLMv3 stands out as one of the machine learning models showing the most promising results in this area.
However, it’s worth noting that this model has some drawbacks, particularly its dependency on a prior OCR engine.
References
[1] LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
[2] https://github.com/microsoft/unilm/tree/master/layoutlmv3