How to Extract Insights from Unstructured Data with Google Document AI (Step-by-Step Guide)
published on Mar 05, 2024

Introduction

Unstructured data poses a big challenge in fast-paced business environments. Traditional manual extraction and data entry from documents, such as invoices, is time-consuming and prone to errors. With its AI and ML capabilities, Google’s Document AI offers a powerful, efficient, and scalable solution for automating document processing workflows.

In the following sections, we’ll explore a practical example to analyze invoice data and upload it directly to a BigQuery table.

What is Document AI?

Document AI, a component of Google Cloud Platform, applies machine learning to extract data from unstructured documents such as PDFs and scanned images. By pinpointing key details like names and addresses, it turns unstructured documents into structured, actionable information and makes workflows more efficient.

Here’s why Google Document AI is a game-changer:

  • Streamlines Organization: It intelligently labels and organizes files for you, making it easier to find important information.

  • Minimizes Errors: Its advanced deep learning algorithms reduce the risk of human error in document processing. It knows how to tag, group, and organize your documents accurately.

  • Accelerates Outcomes: It comes with pretrained AI models that can be used without any training of your own. You can extract or classify data by submitting documents to an API endpoint, which shortens the time it takes to gain insights from your data.

  • Saves Money: It speeds up information extraction from documents, leading to significant cost savings. Your team will have more time for tasks that add greater value.

Document AI improves document processing but may still encounter difficulties with unreadable or unusually formatted data, potentially leading to errors. Training can enhance its performance, but businesses need to be aware of these issues before implementing the tool.

The effectiveness of the tool depends on the quality and consistency of the input data. For example, low-quality scans or handwritten text can present challenges to its accuracy.

How to scan PDF invoice data using Document AI

To get started, you’ll need an active Google Cloud Platform (GCP) account with a payment method set up, and a basic understanding of how GCP works.

Create a Service Account

A service account is a special type of Google account that is used by applications and services to interact with Google Cloud Platform (GCP) resources securely. We will need this account for our script to be able to interact with both Document AI and BigQuery.

To create a service account in Google Cloud Platform, follow these steps:

  1. Access the IAM & Admin page: After logging into your GCP Console, click on the “IAM & Admin” option in the left-hand menu to navigate to the IAM & Admin page.

  2. Open “Service Accounts”: Once on the IAM & Admin page, select the “Service Accounts” option. This displays a list of existing service accounts in your project.

  3. Establish a new service account: Click the “Create Service Account” button at the top of the page to start creating a new service account.

  4. Provide the service account details: Fill in the dialog box with a name and description for your service account. If you wish, you can also designate a unique service account ID.

  5. Assign permissions: Choose the roles or permissions to assign to the service account. Pick from a range of predefined roles like “Editor” or “Viewer,” or create custom roles with particular permissions. Your service account should have the “Document AI API User” and “BigQuery Data Editor” roles.

  6. Create the service account: Specify the details and permissions for the service account, then click on the “Create” button to create the service account.

  7. Download the service account key: After creating the service account, create and download a key in JSON format (in the console, open the service account’s “Keys” tab and choose “Add key” > “Create new key” > “JSON”). This file contains the service account’s credentials, which your application or service uses to authenticate and access GCP resources.

[Image: GCP Service Account]

Enable Document AI API and create a processor

Use the search panel to access the Document AI page. If you are a first-time user, you will need to enable the service.

To analyze documents and extract information, we use a “processor”. While you have the option to create a custom one, which requires a dataset with example documents and labeling, Google already offers a wide variety of ready-to-use, out-of-the-box models that need little or no training. You can check these models out in the processor list in Google’s Document AI documentation.

For our case, choose the “Invoice Parser” model: find it in the processor gallery and select “Create Processor.”

[Image: Doc AI Processor Gallery]

Once you create the processor, its details will be visible. For this use case, you will only need to note down the processor ID shown in the image.

[Image: Doc AI parser screen]

Collect documents

In this initial approach to Document AI, we will only process local files from our PC.

You should create a folder on your computer to store all the invoices you want to analyze. Ensure that all files have the same type (in our case, we will use PDF files for all invoices).

Create BigQuery Table

Before proceeding, you should create the final destination for the processed data. Here’s example code you can use, but feel free to modify it to suit your needs.

[Image: Create BigQuery Table]

Note that the ‘line_items_list’ will be a complex data type, consisting of a list of all items detected in the invoice.
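As a rough illustration of what such a table definition could look like, here is a minimal sketch in Python. Only ‘line_items_list’ is named in this post; every other column name, and the choice of STRING types, is an assumption made for the example. The BigQuery import sits inside the function so the schema itself can be inspected without the client library installed.

```python
# Hypothetical schema: only line_items_list comes from the article,
# the remaining column names and types are assumptions for illustration.
INVOICE_SCHEMA = [
    {"name": "file_name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "supplier_name", "type": "STRING", "mode": "NULLABLE"},
    {"name": "invoice_id", "type": "STRING", "mode": "NULLABLE"},
    {"name": "invoice_date", "type": "STRING", "mode": "NULLABLE"},
    {"name": "total_amount", "type": "STRING", "mode": "NULLABLE"},
    {
        # A repeated record models "a list of all items detected in the invoice".
        "name": "line_items_list",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "description", "type": "STRING", "mode": "NULLABLE"},
            {"name": "quantity", "type": "STRING", "mode": "NULLABLE"},
            {"name": "amount", "type": "STRING", "mode": "NULLABLE"},
        ],
    },
]


def create_invoice_table(table_id, credentials):
    """Create the table if it does not exist yet.

    table_id is expected in the form "project.dataset.table".
    """
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client(credentials=credentials, project=credentials.project_id)
    schema = [bigquery.SchemaField.from_api_repr(field) for field in INVOICE_SCHEMA]
    table = bigquery.Table(table_id, schema=schema)
    return client.create_table(table, exists_ok=True)
```

Storing the amounts as strings keeps the sketch simple; in practice you may prefer numeric and date types once you know how clean the extracted values are.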

Create the Script

Document AI uses a unified API endpoint. We use the same endpoint for each processor we create and only need to identify the processor by including its processor ID and project ID in the request we send to the API. As a result, the same client library and authentication work for every processor.
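To make this concrete, the sketch below builds the two strings a request needs: the processor’s resource name and the regional API host. The helper names and the example values are hypothetical; the location is the region chosen when the processor was created (for example “us” or “eu”).

```python
def processor_resource_name(project_id, location, processor_id):
    """Full resource name that identifies one processor in a request."""
    return f"projects/{project_id}/locations/{location}/processors/{processor_id}"


def regional_api_endpoint(location):
    """Host name of the Document AI endpoint for a given location."""
    return f"{location}-documentai.googleapis.com"


# Placeholder values, not real identifiers.
name = processor_resource_name("my-project", "eu", "abc123")
endpoint = regional_api_endpoint("eu")
```

Only the processor ID changes between processors in the same project and location, which is why the rest of the code can stay identical.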

The responses also use a standard document structure, which we call a document object, to simplify development. This document object contains all the document’s information, including the layout of the raw text, extracted entities, and languages.

We have to decide how our application will process the documents when we call the API. Document AI supports both batch (asynchronous) and online (synchronous) processing.

In our example, we’ll use online processing to process one invoice at a time. We could use batch processing, but the logic would be slightly more complex.

Install the Client libraries

We developed our code in Python. You can install the necessary libraries to interact with Document AI and BigQuery using the following commands.

pip install --upgrade google-cloud-documentai google-cloud-bigquery

Load modules and define variables

In this part, we import all the modules and functions the code needs. Because some of the values are sensitive, we store them as environment variables in an .env file.

import os

The “mime_type” variable tells Document AI what type of document it needs to process. When running the code from the console, the user provides the name of the file to process, which is stored in the “file_name” variable.
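A minimal sketch of this setup step could look as follows. The environment variable names and the load_settings helper are assumptions for illustration, and the MIME type is guessed from the file extension rather than hard-coded. If you keep the values in an .env file, load it into the environment first (for example with a package such as python-dotenv) before calling the function.

```python
import mimetypes


def load_settings(argv, environ):
    """Collect the file to process and the project settings in one dict.

    argv is expected to look like sys.argv and environ like os.environ.
    All environment variable names below are hypothetical.
    """
    if len(argv) < 2:
        raise SystemExit("usage: python <script> <file_name>")
    file_name = argv[1]

    # Derive the MIME type from the extension, e.g. .pdf -> application/pdf.
    mime_type, _ = mimetypes.guess_type(file_name)
    if mime_type is None:
        raise ValueError(f"cannot determine MIME type of {file_name!r}")

    return {
        "file_name": file_name,
        "mime_type": mime_type,
        "project_id": environ["GCP_PROJECT_ID"],
        "location": environ.get("DOCAI_LOCATION", "us"),
        "processor_id": environ["DOCAI_PROCESSOR_ID"],
        "credentials_path": environ["GOOGLE_APPLICATION_CREDENTIALS"],
        "table_id": environ["BQ_TABLE_ID"],
    }
```

In a script this would typically be called as load_settings(sys.argv, os.environ); passing both in as arguments keeps the function easy to try out with sample values.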

Get Service Account credentials

In the first step, we created a service account and downloaded its credentials in JSON format. These credentials allow our code to interact with Google Cloud services. Save the file in an appropriate location, such as the folder of the current project, and keep it out of version control. There are several ways to authenticate, but in this case, we use the from_service_account_file() method.

[Image: credential python script]
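A minimal sketch of this step, assuming the key file path is passed in by the caller. The validate_key_info helper is a hypothetical addition that fails early with a readable message if the file is not a service account key; from_service_account_file() comes from the google-auth package, which is installed together with the client libraries.

```python
import json

# Fields that a downloaded service account key file contains (not exhaustive).
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


def validate_key_info(info):
    """Check that a parsed key file looks like a service account key."""
    missing = [field for field in REQUIRED_KEY_FIELDS if field not in info]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    if info["type"] != "service_account":
        raise ValueError("key file is not a service account key")
    return info["client_email"]


def load_credentials(key_path):
    """Validate the JSON key file and build a credentials object from it."""
    from google.oauth2 import service_account  # part of google-auth

    with open(key_path) as key_file:
        validate_key_info(json.load(key_file))
    return service_account.Credentials.from_service_account_file(key_path)
```

The returned object can be handed to both the Document AI and the BigQuery clients.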

Create function to process the PDFs

This function carries out the following actions:

  • Initializes a Document AI client with the provided credentials.

  • Reads the file into memory.

  • Loads the binary data into a Document AI RawDocument object with the specified MIME type.

  • Configures a process request with the processor’s resource name and the raw document.

  • Processes the document using the Document AI client.

  • Returns the processed document.

[Image: process pdf]
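The steps listed above can be sketched roughly as follows. The function name and parameters are assumptions; the calls themselves come from the google-cloud-documentai client library. In this sketch the file is read before the client is created, so a wrong path fails before any API setup happens.

```python
def process_document(project_id, location, processor_id, file_path, mime_type, credentials):
    """Send one local file to a Document AI processor and return the parsed document."""
    from google.api_core.client_options import ClientOptions
    from google.cloud import documentai  # requires google-cloud-documentai

    # Read the file into memory.
    with open(file_path, "rb") as invoice_file:
        content = invoice_file.read()

    # Initialize the client against the regional endpoint of the processor.
    client = documentai.DocumentProcessorServiceClient(
        credentials=credentials,
        client_options=ClientOptions(api_endpoint=f"{location}-documentai.googleapis.com"),
    )

    # Wrap the bytes in a RawDocument with the given MIME type.
    raw_document = documentai.RawDocument(content=content, mime_type=mime_type)

    # Build the request from the processor's resource name and the raw document.
    request = documentai.ProcessRequest(
        name=client.processor_path(project_id, location, processor_id),
        raw_document=raw_document,
    )

    # Online (synchronous) processing: the call returns once parsing is done.
    result = client.process_document(request=request)
    return result.document
```

The returned object is the document object described earlier, including its text and extracted entities.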

Processing the Document AI response

While the Document AI response includes a wide variety of data, we are only interested in a few specifics. First, we define the final structure of the data that we will load into the BigQuery table.

[Image: response structure]
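One possible shape for that structure is a plain dictionary with one key per table column. Apart from line_items_list, the key names here are assumptions and should match whatever columns your BigQuery table actually has.

```python
import copy

# Hypothetical row template: one key per column of the destination table.
ROW_TEMPLATE = {
    "file_name": None,
    "supplier_name": None,
    "invoice_id": None,
    "invoice_date": None,
    "total_amount": None,
    "line_items_list": [],
}


def new_row(file_name):
    """Return a fresh, independent copy of the template for one invoice."""
    row = copy.deepcopy(ROW_TEMPLATE)
    row["file_name"] = file_name
    return row
```

The deep copy matters because of the nested list: without it, every invoice would append its line items to the same shared list.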

In the “entities” section below, we find the document data that interests us. In the context of Document AI, entities are specific pieces of information that the system extracts from a document during the processing phase.

These entities can encompass various data types, such as names, dates, addresses, numbers, or any other structured information in the document.

Document AI processes a document by analyzing its content and identifying relevant entities based on predefined rules and models. The system then extracts these entities and includes them in the processing result.

For instance, when Document AI processes an invoice document, it could extract entities like the vendor’s name, invoice number, date, total amount, and line items. Similarly, if it processes a legal contract, the entities could include the names of the parties involved, effective dates, contract terms, and obligations.

Choosing the Invoice processor developed by Google offers an advantage because it’s already optimized to identify specific entities in this type of document. You can view the fields that each Google parser (processor) detects in Google’s Document AI documentation.

The following block of code loops over the identified entities and fills our templates with the discovered data.

[Image: extract data]
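A minimal sketch of such a loop is shown below. It assumes the entity types supplier_name, invoice_id, invoice_date, total_amount and line_item (with nested properties such as line_item/description), and a row dictionary whose keys are hypothetical; check the field list of your processor version before relying on these names. In the Python client, an entity’s type is exposed as type_ and its raw text as mention_text.

```python
# Assumed entity types to copy straight into the row.
HEADER_FIELDS = ("supplier_name", "invoice_id", "invoice_date", "total_amount")
# Assumed sub-fields to keep for each line item.
LINE_ITEM_FIELDS = ("description", "quantity", "amount")


def fill_row(document, row):
    """Copy the entities we care about from a processed document into row."""
    row.setdefault("line_items_list", [])
    for entity in document.entities:
        if entity.type_ in HEADER_FIELDS:
            # Keep the first occurrence of each header field.
            if row.get(entity.type_) is None:
                row[entity.type_] = entity.mention_text
        elif entity.type_ == "line_item":
            item = dict.fromkeys(LINE_ITEM_FIELDS)
            for prop in entity.properties:
                # "line_item/description" -> "description"
                key = prop.type_.split("/")[-1]
                if key in item:
                    item[key] = prop.mention_text
            row["line_items_list"].append(item)
    return row
```

Entities that are not listed are simply ignored, so the same loop keeps working if the processor returns additional fields.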

Insert data into BigQuery

The following function loads the information into BigQuery. As input, it takes the data processed and formatted in the previous step.

[Image: load into bigquery]
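One way to write such a function is with the BigQuery client’s insert_rows_json() method, which streams rows into an existing table and returns a list of per-row errors. This is a sketch under that assumption; the function name, its parameters, and the optional client argument (handy for trying the function without a real project) are hypothetical.

```python
def load_into_bigquery(row, table_id, client=None, credentials=None):
    """Insert one processed invoice into the destination table.

    table_id is expected in the form "project.dataset.table".
    """
    if client is None:
        from google.cloud import bigquery  # requires google-cloud-bigquery

        client = bigquery.Client(credentials=credentials, project=credentials.project_id)

    # insert_rows_json returns an empty list when every row was accepted.
    errors = client.insert_rows_json(table_id, [row])
    if errors:
        raise RuntimeError(f"BigQuery rejected the row: {errors}")
    return row
```

Raising on any reported error keeps failed inserts from passing silently.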

Putting it all together and running

Group the previously developed functions by creating a main function.

[Image: mimetype]
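A minimal sketch of how the pieces can be wired together. To keep the snippet runnable on its own, the three steps are passed in as callables; in a real script, main would call your own processing, extraction, and loading functions directly. The argument name file_name matches the variable described earlier, and everything else is an assumption.

```python
import argparse


def main(argv, process, extract, load):
    """Run the pipeline for one invoice: process, extract, load.

    process(file_name) -> document
    extract(document, file_name) -> row
    load(row) -> None
    """
    parser = argparse.ArgumentParser(description="Parse one invoice with Document AI")
    parser.add_argument("file_name", help="name of the invoice file to process")
    args = parser.parse_args(argv)

    document = process(args.file_name)
    row = extract(document, args.file_name)
    load(row)
    return row
```

From a script entry point this would be called with sys.argv[1:] as argv, so the file name typed in the console ends up in args.file_name.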

Then we can run the file from the console by typing:

[Image: python script file]

Check the results

If you have successfully followed all the previous steps, you should now see a new row in your newly created BigQuery table. This row will contain all the information extracted from the processed invoice.

For our example, we analyzed a test invoice PDF, as shown in the image:

[Image: Google Invoice pdf]

And this is the data inserted in the table:

[Image: invoice parser BQ data table]

Conclusion

This post serves as an introduction to using Document AI. To move closer to a real-world application, you can take several further steps. For instance:

  • Process multiple files simultaneously in batches.

  • Set up automatic processes to handle new files instantly when added to a specific location.

  • Implement quality control and data validation checks to verify the reliability of the information Document AI generates.

In future installments, we’re likely to delve deeper into these aspects. If you want to learn more about how to leverage Google Cloud Platform for your business and get the most out of it, subscribe to our newsletter or reach out to us for any questions!
