Using AI to understand and extract data from your documents

May 2, 2019

Data extraction from documents, or document analysis, has become one of the most sought-after capabilities among businesses. They are looking to the tech industry and the world of AI to solve this, now more than ever. It’s something that I’ve been working on very closely with many clients, with various use cases across different industries. Whenever I present a solution, the most common question I get is whether others can use “that solution” to extract data from their clients’ documents to optimize their business processes. The answer is maybe, but there is no one-size-fits-all solution. I am going to share the different aspects that you need to consider when looking at a document data extraction use case, along with some real examples.

First, some clarity on the technical capability versus the business value. The technical capability is making sense of documents and extracting data from them to be used in some way. Some examples are making a document more searchable with tagged metadata, automatically classifying and routing it to the right place, or automating your business application by feeding it critical data from those documents. The business value is that it helps automate and scale your business process, saving you a lot of time, effort and money.

When looking at any document understanding or document extraction use case, the first thing you need to do is understand the types of documents you are dealing with. Are these unstructured documents such as emails or news articles? Are these structured documents such as application forms or shipping labels? Do they contain tables of data? What is the file format: are these PDFs, Word documents, JPEGs, PNGs, etc.? If PDFs, are these digital or scanned? If they are structured forms as digital PDFs or Word documents, is there a template being used, and is there a finite set of formats or a large or effectively unlimited number of formats? If they are scanned PDFs or image files, are the images clear, with the same orientation and the same quality?

There are many important differences and permutations within documents, and the above are some of the most critical aspects to look for. The approach you take to implementing a robust solution will depend on those variables. Again, there is no one-size-fits-all solution.
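One way to make the questions above concrete is to record the answers as a small document profile before choosing an approach. The following is only a minimal sketch: the `DocumentProfile` fields mirror the questions in this article, but the `suggest_approach` function, its step names, and the mapping from answers to steps (including the OCR step) are illustrative assumptions, not a prescribed method.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Structure(Enum):
    UNSTRUCTURED = "unstructured"  # e.g. emails, news articles
    STRUCTURED = "structured"      # e.g. application forms, shipping labels


@dataclass
class DocumentProfile:
    structure: Structure
    file_format: str                 # e.g. "pdf", "docx", "jpeg", "png"
    is_scanned: bool                 # scanned image rather than digital text
    has_tables: bool
    known_templates: bool            # is there a finite, known set of layouts?
    consistent_quality: bool = True  # clear image, same orientation and quality


def suggest_approach(p: DocumentProfile) -> List[str]:
    """Return rough, illustrative next steps for a document profile."""
    steps: List[str] = []
    if p.is_scanned or p.file_format in {"jpeg", "jpg", "png"}:
        # Assumption: image-based documents need text recognition first.
        steps.append("run OCR")
        if not p.consistent_quality:
            steps.insert(0, "normalize orientation and image quality")
    if p.structure is Structure.UNSTRUCTURED:
        steps.append("train a domain-specific text model")
    elif p.known_templates:
        steps.append("map fields per template")
    else:
        steps.append("use a layout-agnostic key/value extraction model")
    if p.has_tables:
        steps.append("add table extraction")
    return steps
```

For instance, a profile for scanned vendor invoices with unknown layouts and uneven image quality would yield a different list of steps than a profile for plain-text customer emails, which is the point of asking these questions first.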

Let’s look at a few examples. Note: these solutions are examples based on real implementations, and the right choice will depend on the nature of your data. You can also use combinations of the solutions below to achieve your goals.

Example 1 — Email

There are many use cases where email communication is analyzed for certain things. This could be anything: auto damage claims, service desk inquiries, bullying, inappropriate behavior, etc. Let’s take “claims” as the example, where you need to automate the claims process by helping your customers file claims more quickly, using AI to identify everything you need to automate the process. In the emails your support organization receives from customers, you need to detect things like the type of claim (e.g. auto, home), the accident date, the type of auto damage (e.g. rear bumper is dented, passenger side door was smashed), and the injury sustained (e.g. left arm broken, extreme neck and back pain).

A solution for this type of use case would be to train a machine learning model to learn and understand the claims domain. You can use the IBM Watson Natural Language Classifier service (or IBM Watson Assistant) to classify each email and understand its intent. You can then use IBM Watson Knowledge Studio to train an ML model, and deploy it to either IBM Watson Discovery or the IBM Watson Natural Language Understanding service, to analyze and extract specific metadata around claims. You can also use Watson Tone Analyzer to identify the emotion of the customer (e.g. angry or frustrated) to prioritize your email actions. These are key differentiating products in the AI market offered on the IBM Watson platform. Based on all these Watson AI-powered insights, you are able to automate your claims process by assisting customer service agents working these claims or by automatically filing claims for customers.
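The flow described above (classify the email, extract claim details, gauge the customer's tone, then prioritize) can be sketched as a small pipeline. This is an illustration only: it does not call any Watson service, and the keyword lists, the date pattern, and every function and class name below are hypothetical stand-ins for trained models.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical keyword lists standing in for a trained intent classifier.
CLAIM_TYPE_KEYWORDS = {
    "auto": ["car", "bumper", "door", "vehicle"],
    "home": ["roof", "basement", "kitchen", "house"],
}
# Hypothetical stand-in for a tone/emotion model.
ANGER_WORDS = ["angry", "frustrated", "unacceptable", "furious"]
# Assumption: dates appear as MM/DD/YYYY; a trained entity model would handle more.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


@dataclass
class ClaimInsights:
    claim_type: str
    accident_dates: List[str] = field(default_factory=list)
    is_upset: bool = False

    @property
    def priority(self) -> str:
        # Upset customers are handled first.
        return "high" if self.is_upset else "normal"


def classify_claim_type(text: str) -> str:
    """Pick the claim type with the most keyword hits, or 'unknown' if none."""
    lowered = text.lower()
    scores = {
        claim_type: sum(lowered.count(word) for word in words)
        for claim_type, words in CLAIM_TYPE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def analyze_email(text: str) -> ClaimInsights:
    """Run the classify, extract, and tone steps on a single email body."""
    lowered = text.lower()
    return ClaimInsights(
        claim_type=classify_claim_type(text),
        accident_dates=DATE_PATTERN.findall(text),
        is_upset=any(word in lowered for word in ANGER_WORDS),
    )
```

In a real implementation each stand-in would be replaced by a call to the corresponding trained model, while the shape of the result (claim type, extracted details, priority) could stay much the same.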

Example 2 — Form Automation

In this use case, you’re a financial organization that needs to validate client financial documents in your audit and underwriting processes. You look to validate data such as personal and financial information. These are structured, digital PDF documents with a known set of templates provided by your financial organization.

A solution for this would entail using IBM Watson’s Natural Language Classifier service (or IBM Watson Assistant) to first classify the type of document that you are analyzing. Once you know what type of document you are analyzing, you know what type of information you are looking for. Next, you need to understand the layout of the document. This can be achieved in a number of different ways, such as creating a custom machine learning model in IBM Watson Studio and using coordinate-based annotation. You can also use IBM’s Business Automation Content Analyzer, an offering that can be trained to understand the structure of your document. Once you understand the layout of the document, you can leverage IBM Watson’s Visual Recognition service to detect objects like checkboxes and validate whether a signature has been provided.
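To illustrate the coordinate-based idea for documents with known templates, here is a minimal sketch that assigns words to fields according to where they sit on the page. It assumes the words and their bounding boxes have already been extracted from the PDF by some other tool; the template name, field names, and coordinates are all hypothetical, and this is not how any particular IBM product works internally.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in page units


@dataclass
class Word:
    text: str
    box: Box


# Hypothetical template: field name -> page region where its value appears.
LOAN_APPLICATION_TEMPLATE: Dict[str, Box] = {
    "applicant_name": (100, 50, 400, 70),
    "annual_income": (100, 90, 400, 110),
}


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def contains(region: Box, point: Tuple[float, float]) -> bool:
    """Check whether a point lies inside a region (edges included)."""
    x0, y0, x1, y1 = region
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1


def extract_fields(words: List[Word], template: Dict[str, Box]) -> Dict[str, str]:
    """Assign each word to the template region that contains its center."""
    fields: Dict[str, str] = {}
    for name, region in template.items():
        hits = [w for w in words if contains(region, center(w.box))]
        hits.sort(key=lambda w: w.box[0])  # left-to-right reading order
        fields[name] = " ".join(w.text for w in hits)
    return fields
```

Once the document type has been classified, the matching template can be looked up and passed to `extract_fields`. This only works because the set of templates is known and finite; documents with unknown layouts need a different approach.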

Example 3 — Invoice Management

In this use case, you are a shipping department that is trying to validate invoices or purchase orders from many different vendors, all with different formats and templates. These are primarily scanned images (converted to PDFs) that do not have perfect orientation or image quality. Your technical goal is to extract and validate buyer info, shipper info, and item information such as type of item, quantity, price, etc., to ensure that the right items have been billed, shipped and received.

A solution for this would be to leverage the IBM Watson Compare and Comply service’s “invoice understanding” model. This is a pre-trained model geared towards understanding various types of invoices and billing documents to extract key/value pairs of data and understand table structures. This accounts for many different variations of documents where the formats are not all known.
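Once key/value pairs and table rows have been extracted, the validation step itself is ordinary data comparison. The sketch below assumes the extracted line items have already been normalized into simple records and checks an invoice against the matching purchase order; the `LineItem` shape, the `find_discrepancies` function, and the message wording are hypothetical and are not the output format of any Watson service.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineItem:
    item: str
    quantity: int
    unit_price: float


def find_discrepancies(invoice: List[LineItem], order: List[LineItem]) -> List[str]:
    """Compare invoiced line items against what was ordered."""
    # Assumption: item names are unique within an order.
    ordered = {line.item: line for line in order}
    issues: List[str] = []
    for line in invoice:
        expected = ordered.pop(line.item, None)
        if expected is None:
            issues.append(f"{line.item}: billed but not ordered")
            continue
        if line.quantity != expected.quantity:
            issues.append(
                f"{line.item}: quantity {line.quantity} != ordered {expected.quantity}"
            )
        # Small tolerance to avoid false alarms from floating-point rounding.
        if abs(line.unit_price - expected.unit_price) > 0.005:
            issues.append(
                f"{line.item}: unit price {line.unit_price:.2f} "
                f"!= ordered {expected.unit_price:.2f}"
            )
    # Anything left in the order was never billed.
    for item in ordered:
        issues.append(f"{item}: ordered but not billed")
    return issues
```

An empty result means the invoice matches the order line by line. In practice, item names from scanned documents rarely match exactly, so a production version would likely need fuzzy matching or item codes rather than the exact string comparison used here.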

The Bottom Line

I’ve pointed out some of the most common use cases and solutions, but there are many other variations that can be achieved using the IBM Watson & Cloud Platform. Many organizations look to optimize their business processes using Watson AI and automation to increase their efficiency and scalability. So the next time you see a data extraction solution and wonder if “it” can work for you, you’ll know the key aspects to look for to help you answer that question.

Marc is the CTO for IBM Watson AI Strategic Partnerships. When not leading the technology vision and strategy for IBM Partnerships, Marc enjoys DJing, playing video games and wrestling.

Written by Marc Nehme