Advances in OCR (Optical Character Recognition) and AI (Artificial Intelligence) have made it possible to automate the extraction of highly accurate structured data from mortgage documents.
Since every stage of the loan lifecycle requires extensive document analysis, it pays to know how OCR and AI can help automate this process.
In this post, we’ll cover:
Automated structured data extraction is a process that takes unstructured data, like a mortgage document, as input. The output is structured data: a set of data points such as borrower name, age, and assets.
Structured means that for each type of document processed, you receive the same set of data points consistently. This data can then be used by other software systems.
A familiar example of data extraction in the mortgage industry is a Loan Officer reviewing borrower documents (unstructured data) and manually inputting required information into forms in a Loan Origination System (LOS).
In this context, the borrower documents are the unstructured input, and the values entered into the LOS are the structured output.
Automated structured data extraction is the same process, just performed by software instead of a human.
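To make that concrete, here's a minimal sketch of what such a structured record could look like in code. The document type and field names are illustrative assumptions, not a standard schema.

```python
# A minimal sketch of "structured data": the same fields a Loan Officer would
# key into the LOS, but as a machine-readable record. Field names are assumptions.
from dataclasses import dataclass

@dataclass
class BankStatementFields:
    borrower_name: str
    account_number: str
    ending_balance: float
    statement_date: str  # ISO 8601 so downstream systems can parse it easily

# What an automated extractor would hand to the LOS instead of manual data entry.
record = BankStatementFields(
    borrower_name="Jane Doe",
    account_number="****1234",
    ending_balance=15230.55,
    statement_date="2024-03-31",
)
print(record)
```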
Structured data from mortgage documents alone won’t make much impact on your mortgage operations. It’s how you use this data to automate processes that will.
You usually need structured data as input to automate the process.
And if you don’t have structured data, you either:
So, having structured data makes it possible for you to:
Here are a few examples of tasks you can automate by having structured data from mortgage documents:
Below is a six-step process for automated mortgage document processing and data extraction.
To extract data, your document processing system must first receive the documents.
Thus, the process begins with your system pulling documents for processing from various sources.
Common sources include:
Each document should be processed with a specialized processor to ensure consistent, structured, and accurate data.
The problem is that File ≠ Document.
A single PDF file can contain multiple documents.
For example, a correspondent lender loan package PDF might include more than 20 distinct documents.
So, it's crucial to identify and separate the individual documents within each file before processing.
An OCR & AI model trained for document classification and file splitting can be used. Such a model takes a PDF file as input and outputs a list of individual documents.
The more pages you're trying to process with a single processor, the lower the accuracy you can expect.
So, to process lengthy mortgage documents like URLA (Form 1003) with high accuracy, we need to split them even further by page.
This will let you use even more specialized processors trained to extract data from a single page.
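As a rough illustration of this splitting step, here's a small Python sketch that classifies each page and groups consecutive pages of the same type into documents. The classify_page function stands in for whatever classification model you actually use; its labels and keyword check are assumptions.

```python
# Sketch of file splitting: classify each page, then group consecutive pages
# of the same type into logical documents.
from itertools import groupby

def classify_page(page_text: str) -> str:
    """Placeholder for a real classification model returning a document label."""
    if "Uniform Residential Loan Application" in page_text:
        return "URLA_1003"
    return "OTHER"

def split_file(pages: list[str]) -> list[dict]:
    """Group consecutive pages that share a label into separate documents."""
    labeled = [(page_number, classify_page(text)) for page_number, text in enumerate(pages)]
    documents = []
    for label, group in groupby(labeled, key=lambda item: item[1]):
        documents.append({"type": label, "pages": [page_number for page_number, _ in group]})
    return documents

# A 3-page loan package: two URLA pages followed by an unrelated document.
print(split_file([
    "Uniform Residential Loan Application ...",
    "Uniform Residential Loan Application (continued) ...",
    "Closing Disclosure ...",
]))
```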
Once the system knows which documents are in the borrower file, you can route each one to a specialized processor to extract structured data.
Sometimes you'll need to apply multiple processors to the same page, because no single processor can extract every type of data.
For example, non-text data like signatures and checkboxes are better extracted by a different processor than the one you use to extract text data.
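Here's a hedged sketch of what that routing could look like: each document type maps to one or more processors, and their outputs are merged into a single record. The processor functions, document types, and fields are placeholders, not any specific vendor's API.

```python
# Sketch of routing: look up the processors for a document type and merge
# what each of them extracts into one record.

def extract_text_fields(pages: list[str]) -> dict:
    """Placeholder for a processor tuned for printed/typed text."""
    return {"borrower_name": "Jane Doe"}

def extract_marks(pages: list[str]) -> dict:
    """Placeholder for a processor tuned for signatures and checkboxes."""
    return {"borrower_signed": True, "intent_to_occupy": True}

PROCESSORS = {
    "URLA_1003": [extract_text_fields, extract_marks],
    "BANK_STATEMENT": [extract_text_fields],
}

def process_document(doc_type: str, pages: list[str]) -> dict:
    record: dict = {"document_type": doc_type}
    for processor in PROCESSORS.get(doc_type, []):
        record.update(processor(pages))  # later processors add their own fields
    return record

print(process_document("URLA_1003", ["...page text..."]))
```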
Sometimes, AI can't extract fields from documents with enough accuracy.
In these cases, we need to loop in humans to review the data and correct it where it's wrong.
Usually, AI document processing products offer out-of-the-box Human-In-The-Loop (HITL) interfaces to handle this workflow.
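A minimal sketch of that confidence gate, assuming each extracted field comes back with a confidence score; the field structure and the 0.90 threshold are assumptions you'd tune for your own workflow:

```python
# Sketch of a Human-In-The-Loop gate: fields below the confidence threshold go
# to human review, the rest are accepted automatically.
REVIEW_THRESHOLD = 0.90

extracted = {
    "borrower_name": {"value": "Jane Doe", "confidence": 0.98},
    "loan_amount": {"value": "350,000", "confidence": 0.72},  # e.g. a blurry scan
}

auto_accepted = {k: v for k, v in extracted.items() if v["confidence"] >= REVIEW_THRESHOLD}
needs_review = {k: v for k, v in extracted.items() if v["confidence"] < REVIEW_THRESHOLD}

# `needs_review` would be sent to the HITL interface; the corrected values come
# back and replace the low-confidence ones before the record moves downstream.
print("auto-accepted:", auto_accepted)
print("needs review:", needs_review)
```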
Once the data is extracted and any low-confidence items have been reviewed and corrected, we can push it downstream.
You can feed this data into other systems to automate mortgage lending operations.
Here are some common destinations:
The section above described the workflow for extracting data from mortgage documents.
But knowing the workflow is not the same as implementing it in your own operations.
Below, I outline how to approach building your automated data extraction workflow.
Start by defining what data you need to extract and where to use it.
Once you know what data you need, list the documents you need to get this data.
And once you know which documents you need, find out how you'll get them.
By the end of this step, you should have:
The next step is to find models based on the list you created in the previous step.
You'll need two types of models.
The first splits files into the document types you need to support.
The second extracts data from each document type.
To get these models, you have two options:
You can find more details about the differences between these options in the section below.
By the end of this step, you should have the following:
Once you have models, the next step is implementing the data extraction workflow.
By the end of this step, you should have an end-to-end extraction workflow, from getting raw files to pushing extracted data into downstream integrations.
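As a rough skeleton of that end-to-end flow, here's a sketch where every step (ingest, split, extract, review, push) is a stub you'd replace with your chosen provider's models or APIs; all names and data shapes are assumptions.

```python
# Skeleton of the end-to-end workflow: ingest -> split -> extract -> review -> push.
# Each function is a stand-in for the real integration at that step.

def ingest() -> list[list[str]]:
    """Pull raw files (here, lists of page texts) from your document sources."""
    return [["Uniform Residential Loan Application ...", "Closing Disclosure ..."]]

def split(file_pages: list[str]) -> list[dict]:
    """Classify pages and group them into individual documents."""
    return [
        {"type": "URLA_1003", "pages": file_pages[:1]},
        {"type": "CLOSING_DISCLOSURE", "pages": file_pages[1:]},
    ]

def extract(document: dict) -> dict:
    """Run the specialized processor(s) for this document type."""
    return {
        "document_type": document["type"],
        "fields": {"borrower_name": {"value": "Jane Doe", "confidence": 0.97}},
    }

def review(record: dict) -> dict:
    """Send low-confidence fields to human reviewers; return the corrected record."""
    return record

def push_downstream(record: dict) -> None:
    """Deliver the final structured data to the LOS or other downstream systems."""
    print("pushed:", record)

for file_pages in ingest():
    for document in split(file_pages):
        push_downstream(review(extract(document)))
```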
The last step is to fine-tune and up-train your models to improve accuracy.
This is especially important for mortgage documents, because few providers offer pre-trained models for them.
So unless you find a provider that already has pre-trained models for every document type you need to support, expect a period where you'll need to invest extra time into up-training.
The process will involve reviewing and correcting data for documents with low accuracy.
You can either:
By the end of this step, you should have a document data extraction system that processes most documents with high accuracy, with humans stepping in only rarely to correct low-confidence items.
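One lightweight way to feed that up-training loop is to log every human correction as a labeled example for the next training run. A minimal sketch, with an assumed file name and record shape:

```python
# Sketch of the up-training feedback loop: each human correction is appended as
# a labeled example that can be used to fine-tune the extractor later.
import json

TRAINING_LOG = "uptraining_examples.jsonl"  # assumed location for collected examples

def log_correction(doc_type: str, field: str, predicted: str, corrected: str) -> None:
    """Append one reviewed example so it can feed the next training run."""
    example = {
        "doc_type": doc_type,
        "field": field,
        "predicted": predicted,
        "corrected": corrected,
    }
    with open(TRAINING_LOG, "a") as f:
        f.write(json.dumps(example) + "\n")

# A reviewer fixed a misread loan amount; this becomes a training example.
log_correction("URLA_1003", "loan_amount", "35,000", "350,000")
```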
Quite a few AI document-processing products & tools are available on the market.
Their main difference is the degree to which they work for mortgage document extraction out of the box.
And it comes down to how many steps of the 6-step process they cover:
Some of the solutions cover all 6 steps. Other solutions cover none.
The less customization a solution needs, the higher the cost per document you can expect.
The more you invest in getting it working yourself, the lower the cost per document.
Below is a list of providers you can use to automate data extraction from mortgage documents. It's not exhaustive; these are the ones that, in my opinion, are most relevant for mortgage document data extraction.
Low-level:
💡 Low-level solutions are the ones that need the most engineering involvement to make them work for mortgage documents. But they tend to have the lowest cost per document.
Mid-level:
💡 Mid-level solutions are usually built on top of one or more low-level solutions and remove some of the implementation complexity. Most come with models pre-trained for mortgage documents and offer upstream/downstream integrations with popular mortgage software.
Specialised:
💡 Specialised solutions are usually built on top of mid-level solutions. They take them further by providing out-of-the-box automation using the data they extract.
I hope you enjoyed this piece and that it gave you insight into how to use OCR & AI to extract structured data from mortgage documents.
If you’d like to stay on top of the latest mortgage tech and how it can be applied to mortgage operations, consider joining our mortgage technology newsletter.