What is Intelligent Document Processing?

May 5, 2020

Intelligent document processing augments human understanding of unstructured data through data science tools like computer vision, optical character recognition, machine learning, and natural language processing in each stage of document data integration.

Intelligent document processing (IDP) is gaining attention because it provides disruptive solutions that automate data extraction projects that were previously extremely difficult, if not impossible, to solve.

But it’s not just about the technology. Optical character recognition (OCR) and data science tools have been around for a very long time. What’s new is the combination of these tools into a single platform solution, and it’s transforming the way we work, especially with documents. Discovering new sources of data creates better business outcomes and paves the way for human-initiated innovation.

IDP is a new way of capturing and extracting information. All the big technology companies are building intelligent tools, but the problem is that they aren’t accessible in a single, seamless platform. If you want the power of Azure, AWS, or Google’s advanced tools, they’re only available through APIs. These individual tools are great for testing and experimentation, but the modern enterprise needs a unified approach.

Intelligent document processing platforms are powerful software machines that fuel the data supply chain with labeled data from any text-based source.

Read on to discover:

Key components of IDP
How IDP manages each stage of document data integration
Why IDP is different from document capture
Examples of innovation using IDP
How to achieve success with IDP
IDP — the catalyst for transformation

What are the Key Components of an Intelligent Document Processing Platform?

Intelligent document processing platforms include every necessary step to transform paper or digital documents into accurately labeled data.

IDP platforms must:

Be industry agnostic
Be flexible to accommodate structured and unstructured data
Scale to process billions of extractions daily
Integrate with cloud and on-premise content management systems
Provide a visual interface for training and classification

Intelligent Document Processing Platforms Manage Each Stage of Document Data Integration

Document capture — The platform directly integrates with scanning hardware to digitize physical media like paper or microform. Because not every document is digital, a solution is required to speed up traditionally slow scanning processes. Built-in integrations ingest data from digitally born content like text files, PDFs, and Microsoft Office productivity documents.

Image processing — Image processing is provided by computer vision algorithms that prepare a document for both optimal OCR and archival. The IDP platform will create two versions of digitized documents — one optimized for machine reading, and the other for on-screen viewing in a content management system.

OCR — Accurate OCR is necessary for machines to read text on documents. One of the cornerstone features of IDP is the use of multiple OCR engines. A “layered” approach removes the need to wait for a single, better OCR engine: the platform synthesizes the results from multiple engines until near-100% accuracy is achieved.
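To make the layered idea concrete, here is a minimal sketch in Python of one way results from several engines could be synthesized: a per-token majority vote. It is an illustration only, not how any particular IDP platform works. The function name vote_tokens and the sample engine outputs are hypothetical, and the sketch assumes each engine returns the same number of tokens in the same order, which real OCR output rarely guarantees without an alignment step.

```python
from collections import Counter


def vote_tokens(engine_outputs):
    """Combine token-aligned OCR outputs by majority vote.

    engine_outputs: list of token lists, one per engine, all the same length
    (an assumption of this sketch; real outputs need alignment first).
    Returns (tokens, flagged), where flagged lists the positions at which
    no candidate won a strict majority and a human should take a look.
    """
    if not engine_outputs:
        return [], []
    length = len(engine_outputs[0])
    if any(len(tokens) != length for tokens in engine_outputs):
        raise ValueError("engine outputs must be token-aligned")

    merged, flagged = [], []
    for position, candidates in enumerate(zip(*engine_outputs)):
        token, votes = Counter(candidates).most_common(1)[0]
        merged.append(token)
        if votes * 2 <= len(candidates):  # no strict majority
            flagged.append(position)
    return merged, flagged


# Hypothetical outputs from three engines reading the same line.
outputs = [
    "Invoice total: $1,250.00".split(),
    "Invoice tota1: $1,250.00".split(),
    "lnvoice total: $1,250.00".split(),
]
merged, flagged = vote_tokens(outputs)
print(" ".join(merged))
print(flagged)
```

Each engine makes a different mistake, so the vote recovers the correct line. Positions without a clear winner are returned separately so they can be routed to a person instead of being silently guessed.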

Classification — Most business documents are groups of pages that contain different types of information. IDP classification engines are trained to recognize documents through machine learning and other intelligence-based techniques. Automatic document recognition is an important step in understanding the information within a document. Gone are the days of manual data entry for categorization.
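As a rough illustration of what "trained to recognize documents" can mean, the sketch below builds a word-count profile for each document type from a few labeled examples and assigns new text to the most similar profile. This is a deliberately simple stand-in for the machine learning an IDP platform would use, not a description of any vendor's engine. The labels, training sentences, and function names are all made up for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(examples):
    """examples: list of (label, text). Returns label -> summed word counts."""
    profiles = {}
    for label, text in examples:
        profiles.setdefault(label, Counter()).update(bag_of_words(text))
    return profiles


def classify(profiles, text):
    """Return the label whose profile is most similar to the text."""
    words = bag_of_words(text)
    return max(profiles, key=lambda label: cosine(words, profiles[label]))


# Hypothetical training pages labeled by a subject matter expert.
profiles = train([
    ("invoice", "Invoice number 1001. Amount due upon receipt. Remit payment to vendor."),
    ("invoice", "Invoice date and total amount due. Payment terms net 30."),
    ("lease", "This lease agreement is between the landlord and the tenant for the premises."),
    ("lease", "The tenant shall pay rent monthly to the landlord during the lease term."),
])

print(classify(profiles, "Remit total amount due on invoice 2002"))
print(classify(profiles, "The landlord and tenant signed the lease"))
```

A production classifier would use far more examples and richer features such as layout and position on the page, but the workflow is the same: people label examples, and the system generalizes from them.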

Extraction — Successful data extraction hinges on the software’s artificial understanding of content. Because AI is only as smart as its training, the system must be trainable to find and label all expected information within a document. This includes identifying sections of natural language documents and extracting specific data elements like dates, names, numbers, etc.
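For the simplest kind of data element, extraction can be sketched with pattern matching. The Python example below pulls ISO-style dates and dollar amounts out of a line of text. The patterns, the field names, and the sample sentence are assumptions chosen for the illustration. A real platform would combine trained models with many more formats than these two.

```python
import re
from datetime import date
from decimal import Decimal

# Assumed formats for this sketch: YYYY-MM-DD dates and $-prefixed amounts.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
AMOUNT_PATTERN = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def extract_fields(text):
    """Return the dates and dollar amounts found in text as typed values."""
    dates = []
    for year, month, day in DATE_PATTERN.findall(text):
        try:
            dates.append(date(int(year), int(month), int(day)))
        except ValueError:
            continue  # skip impossible dates such as 2020-13-45
    amounts = [
        Decimal(raw.replace(",", "")) for raw in AMOUNT_PATTERN.findall(text)
    ]
    return {"dates": dates, "amounts": amounts}


sample = "Invoice dated 2020-05-05, due 2020-06-04. Total: $1,250.00 (deposit $250.00)."
fields = extract_fields(sample)
print(fields["dates"])
print(fields["amounts"])
```

Returning typed values (date and Decimal objects rather than raw strings) is a design choice that makes the next stage, validation, easier to reason about.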

Data Validation — All extracted data must be verifiable to be trusted. IDP platforms are unique because they leverage external databases and pre-configured lexicons to validate information. Any data that doesn’t match up is flagged for human review and correction.
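The lexicon idea can be sketched in a few lines of Python: each field is checked against a set of allowed values, and anything that doesn't match is set aside for a person to review. The lexicons, field names, and sample record below are hypothetical. A real deployment would load them from external databases rather than hard-code them.

```python
def validate(record, lexicons):
    """Return the (field, value) pairs in record that fail lexicon lookup.

    record: dict of extracted field name -> string value.
    lexicons: dict of field name -> set of allowed upper-case values.
    Fields without a lexicon are not checked.
    """
    flagged = []
    for field, allowed in lexicons.items():
        value = record.get(field)
        if value is None or value.strip().upper() not in allowed:
            flagged.append((field, value))
    return flagged


# Hypothetical lexicons and an extracted record with one OCR error.
LEXICONS = {"state": {"OK", "TX", "KS"}, "currency": {"USD", "CAD"}}
record = {"state": "ok", "currency": "U5D", "vendor": "Acme"}

needs_review = validate(record, LEXICONS)
print(needs_review)
```

Here the misread currency code is caught, while the state passes after normalization. The vendor field is left alone because no lexicon covers it.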

Integration — Data integration requirements are extremely diverse. Because IDP platforms are critical sources in the data supply chain, they must integrate with all downstream applications. This includes cloud and local databases and document repositories. Labeled data and metadata are attached to human-readable copies of the data for portability and for search and discovery.

Why Intelligent Document Processing is Different from Document Capture

The biggest difference in IDP compared to traditional capture is innovation. The big names in capture stopped innovating their solutions over a decade ago. And the reason is two-fold:

First, those tools were created in an era when conserving compute was important. Their software architecture was not built for the scalability demanded by today’s data-hungry applications. And since many of these platforms have grown through acquisition, a platform-wide software rebuild to meet the requirements of IDP would simply be too expensive.

The second problem is that the customer base for the traditional document capture companies is large. They are profitable as-is and would like to avoid disrupting their customers’ existing workflows with a required upgrade. Instead of innovating capture, they have focused on developing other technologies like robotic process automation, or have rebranded to give the appearance of having IDP capabilities (sad, but true).

Where’s the Innovation?

One of the best examples of innovation through intelligent document processing is a massive project taken on by the U.S. Nuclear Regulatory Commission. I like to talk about this use-case because it includes a valuable lesson from the past.

Before their IDP project, they experienced a massive failure from a technology vendor who used a traditional capture approach. An attempt to integrate data from an archived data source took five years, and didn’t provide the promised results.

In what turned out to be one of the biggest and most successful government records projects, they integrated labeled data from over 39 million pages of records in under two years. The information contained in the documents was integrated into a central database where pristine document images were linked to the data.

In another example, one of the U.S.’s largest healthcare data processing companies needed a solution to process billing and claims information for hundreds of thousands of patients. The workload required on back-end systems was massive.

By using an intelligent document processing platform, they transform gigabyte-sized text files into billions of data extractions needed to complete mission-critical workflows on a daily basis.

But it isn’t just government or big enterprise that benefits from intelligent document processing. IDP platforms are being used to process:

Invoices & financial statements
Mortgage documents
Oil and gas documents
Contracts & leases
Explanation of benefits
Complex forms
Electronic files
Medical forms
And more!

How to Achieve Success with Intelligent Document Processing

The key to success with IDP platforms is in developing document data literacy. Before software is trained to integrate data, a significant amount of time must be spent gaining an understanding of what information is available and the business outcomes related to that information. If that sounds logical, it’s because it is! However, whether from marketing hype or mismatched expectations, there is a tendency to skip this step.

To achieve document data literacy, it’s critical to consult the subject matter experts who use the information to produce work. Their intimate understanding of both the business value and the interpretation of the information on the documents they work with ensures that the right data is extracted and clarifies what should be done with it.

Gaining a system-wide understanding of what your data represents and how it is used paves the way for improved workflows through intelligent automation and business process re-design.

Intelligent Document Processing is the Catalyst for Transformation

At the heart of intelligent document processing is the concept of change and disruption. Just as a caterpillar must change to achieve flight, the modern business cannot advance without a catalyst.

In all organizations, data plays a critical role in transformation. By either gaining new sources of data or finding new methods of analysis, organizations discover the valuable insight needed to disrupt their industry by creating something new.

Data is the most important element of “going digital.” It has been said that data will be the new oil, and that time has certainly arrived. Because the result of digital transformation is creating new value propositions, products, operating models, and capabilities, it is clear that data, and data alone, is the single most important factor for disruptive success.

And if you’re wondering where people fit in, they are at the epicenter of disruption. Advances in digital dexterity and data literacy give modern workers the tools needed to see the path towards change. IDP augments the modern workforce by providing a stream of valuable information into software applications. New workflows become transformative business enablers as we re-imagine the way we work.

Data is the great enabler of digital transformation, and organizations that invest in intelligent document processing will stay at the forefront of innovation and progress.