Build a receipt and invoice processing pipeline with Amazon Textract

In today’s business landscape, organizations are constantly seeking ways to optimize their financial processes, enhance efficiency, and drive cost savings. One area that holds significant potential for improvement is accounts payable. At a high level, the accounts payable process includes receiving and scanning invoices, extracting the relevant data from scanned invoices, validation, approval, and archival. The second step (extraction) can be complex. Each invoice and receipt looks different. The labels are imperfect and inconsistent. The most important pieces of information, such as price, vendor name, vendor address, and payment terms, are often not explicitly labeled and have to be interpreted based on context. The traditional approach of using human reviewers to extract the data is time-consuming, error-prone, and not scalable.

In this post, we show how to automate the accounts payable process using Amazon Textract for data extraction. We also provide a reference architecture to build an invoice automation pipeline that enables extraction, verification, archival, and intelligent search.

Solution overview

The following architecture diagram shows the stages of a receipt and invoice processing workflow. It starts with a document capture stage to securely collect and store scanned invoices and receipts. The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on. In the next stage, you use predefined expense rules to determine whether to automatically approve or reject the receipt. Approved and rejected documents go to their respective folders within the Amazon Simple Storage Service (Amazon S3) bucket. For approved documents, you can search all the extracted fields and values using Amazon OpenSearch Service. You can visualize the indexed metadata using OpenSearch Dashboards. Approved documents are also set up to be moved to Amazon S3 Intelligent-Tiering for long-term retention and archival using S3 lifecycle policies.

The following sections take you through the process of creating the solution.


Prerequisites

To deploy this solution, you must have the following:

  • An AWS account.
  • An AWS Cloud9 environment. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser. It includes a code editor, debugger, and terminal.

To create the AWS Cloud9 environment, provide a name and description. Keep everything else as default. Choose the IDE link on the AWS Cloud9 console to navigate to the IDE. You’re now ready to use the AWS Cloud9 environment.

Deploy the solution

To set up the solution, you use the AWS Cloud Development Kit (AWS CDK) to deploy an AWS CloudFormation stack.

  1. In your AWS Cloud9 IDE terminal, clone the GitHub repository and install the dependencies. Run the following commands to deploy the InvoiceProcessor stack:
git clone https://github.com/aws-samples/amazon-textract-invoice-processor.git
cd amazon-textract-invoice-processor
pip install -r requirements.txt
cdk bootstrap
cdk deploy

The deployment takes around 25 minutes with the default configuration settings from the GitHub repo. Additional output information is also available on the AWS CloudFormation console.

  2. After the AWS CDK deployment is complete, create expense validation rules in an Amazon DynamoDB table. You can use the same AWS Cloud9 terminal to run the following commands:
aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE {'ruleId': 1, 'type': 'regex', 'field': 'INVOICE_RECEIPT_ID', 'check': '(?i)[0-9]{3}[a-z]{3}[0-9]{3}$', 'errorTxt': 'Receipt number is not valid. It is of the format: 123ABC456'}"
aws dynamodb execute-statement --statement "INSERT INTO \"$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-RulesTableName`].Value' --output text)\" VALUE {'ruleId': 2, 'type': 'regex', 'field': 'PO_NUMBER', 'check': '(?i)[a-z0-9]+$', 'errorTxt': 'PO number is not present'}"
  3. In the S3 bucket that starts with invoiceprocessorworkflow-invoiceprocessorbucketf1-*, create an uploads folder.

In Amazon Cognito, you should already have an existing user pool called OpenSearchResourcesCognitoUserPool*. We use this user pool to create a new user.

  1. On the Amazon Cognito console, navigate to the user pool OpenSearchResourcesCognitoUserPool*.
  2. Create a new Amazon Cognito user.
  3. Provide a user name and password of your choice and note them for later use.
  4. Upload the documents random_invoice1 and random_invoice2 to the S3 uploads folder to start the workflows.
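
The upload step can also be scripted instead of using the console. The following is a minimal sketch, assuming the boto3 SDK is installed and the sample documents exist as local files. The helper name, the bucket name, and the file paths are hypothetical placeholders, and the S3 client is passed in as a parameter so the helper can be exercised without AWS access.

```python
from pathlib import Path


def upload_invoices(s3_client, bucket, file_paths, prefix="uploads/"):
    """Upload local files under the given prefix and return the object keys used."""
    keys = []
    for path in file_paths:
        # The object key is the prefix plus the local file name.
        key = prefix + Path(path).name
        # upload_file(Filename, Bucket, Key) is the standard boto3 S3 client call.
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


# Example usage (placeholders, not values taken from the repository):
#   import boto3
#   upload_invoices(boto3.client("s3"), "<your-bucket-name>", ["<path-to-random_invoice1>"])
```

Objects that land under the uploads/ prefix are what start the workflows described in the following sections.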

Now let’s dive into each of the document processing steps.

Document capture

Customers handle invoices and receipts in a multitude of formats from different vendors. These documents are received through channels like hard copies, scanned copies uploaded to file storage, or shared storage devices. In the document capture stage, you store all scanned copies of receipts and invoices in highly scalable storage such as an S3 bucket.

Upload sample invoices


Extraction

The next stage is the extraction phase, where you pass the collected invoices and receipts to the Amazon Textract AnalyzeExpense API to extract financially related relationships between text such as vendor name, invoice receipt date, order date, amount due, amount paid, and so on.

AnalyzeExpense is an API dedicated to processing invoice and receipt documents. It’s available as both a synchronous and an asynchronous API. The synchronous API allows you to send images in bytes format, and the asynchronous API allows you to send files in JPG, PNG, TIFF, and PDF formats. The AnalyzeExpense API response consists of three distinct sections:

  • Summary fields – This section includes both normalized keys and the explicitly mentioned keys along with their values. AnalyzeExpense normalizes the keys for contact-related information such as vendor name and vendor address, tax ID-related keys such as tax payer ID, payment-related keys such as amount due and discount, and general keys such as invoice ID, delivery date, and account number. Keys that aren’t normalized still appear in the summary fields as key-value pairs. For a complete list of supported expense fields, refer to Analyzing Invoices and Receipts.
  • Line items – This section includes normalized line item keys such as item description, unit price, quantity, and product code.
  • OCR block – The block contains the raw text extract from the invoice page. The raw text extract can be used for postprocessing and identifying information that isn’t covered as part of the summary and line item fields.
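
To make the response structure more concrete, the following sketch flattens the summary fields and line items of an AnalyzeExpense response into plain dictionaries. It is an illustration only, not the code used by the repository: the helper names are hypothetical, and it assumes the response layout of the boto3 Textract client (ExpenseDocuments, SummaryFields, LineItemGroups). The client is passed in as a parameter, so the parsing can be tried on a saved JSON response without calling AWS.

```python
def summarize_expense(response):
    """Flatten an AnalyzeExpense response into summary fields and line items."""
    documents = []
    for doc in response.get("ExpenseDocuments", []):
        summary = {}
        for field in doc.get("SummaryFields", []):
            # Normalized keys arrive in Type.Text. Keys that are not normalized
            # are typed "OTHER", so fall back to the label printed on the page.
            key = field.get("Type", {}).get("Text", "OTHER")
            if key == "OTHER":
                key = field.get("LabelDetection", {}).get("Text", "OTHER")
            # Note: a repeated key overwrites the earlier value in this sketch.
            summary[key] = field.get("ValueDetection", {}).get("Text", "")
        line_items = []
        for group in doc.get("LineItemGroups", []):
            for item in group.get("LineItems", []):
                line_items.append({
                    f.get("Type", {}).get("Text", "OTHER"):
                        f.get("ValueDetection", {}).get("Text", "")
                    for f in item.get("LineItemExpenseFields", [])
                })
        documents.append({"summary": summary, "line_items": line_items})
    return documents


def analyze_receipt(textract_client, image_bytes):
    """Call the synchronous API with raw image bytes and flatten the result."""
    response = textract_client.analyze_expense(Document={"Bytes": image_bytes})
    return summarize_expense(response)
```

With a real client, `analyze_receipt(boto3.client("textract"), open(path, "rb").read())` would return one entry per expense document found in the image. The OCR block section (the Blocks list) is left out here because it is only needed for custom postprocessing.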

This post uses the Amazon Textract IDP CDK constructs (AWS CDK components to define infrastructure for intelligent document processing (IDP) workflows), which allow you to build use case-specific, customizable IDP workflows. The constructs and samples are a collection of components that enable the definition of IDP processes on AWS and are published to GitHub. The main concepts used are the AWS CDK constructs, the actual AWS CDK stacks, and AWS Step Functions.

The following figure shows the Step Functions workflow.

Step function workflow

The extraction workflow includes the following steps:

  • InvoiceProcessor-Decider – An AWS Lambda function that verifies if the input document format is supported by Amazon Textract. For more details about supported formats, refer to Input Documents.
  • DocumentSplitter – A Lambda function that generates 2,500-page (max) chunks from documents and can process large multi-page documents.
  • Map State – A Lambda function that processes each chunk in parallel.
  • TextractAsync – This task calls Amazon Textract using the asynchronous API following best practices with Amazon Simple Notification Service (Amazon SNS) notifications and uses OutputConfig to store the Amazon Textract JSON output to the S3 bucket you created earlier. It consists of two Lambda functions: one to submit the document for processing and one that is triggered on the SNS notification.
  • TextractAsyncToJSON2 – Because the TextractAsync task can produce multiple paginated output files, the TextractAsyncToJSON2 process combines them into one JSON file.
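
The splitting step can be pictured as dividing a page count into ranges of at most 2,500 pages, each of which is then processed in parallel. The following sketch shows only that arithmetic. It is not the DocumentSplitter implementation from the constructs, and the function name is hypothetical.

```python
def page_chunks(total_pages, max_pages=2500):
    """Return inclusive, 1-based (first, last) page ranges of at most max_pages pages."""
    if total_pages < 0 or max_pages < 1:
        raise ValueError("total_pages must be >= 0 and max_pages must be >= 1")
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


# A 6,000-page document yields three chunks; the last one is shorter.
print(page_chunks(6000))  # [(1, 2500), (2501, 5000), (5001, 6000)]
```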

We discuss the details of the next three steps in the following sections.

Verification and approval

For the verification stage, the SetMetaData Lambda function verifies whether the uploaded file is a valid expense according to the rules configured previously in the DynamoDB table. For this post, you use the following sample rules:

  • Verification is successful if INVOICE_RECEIPT_ID is present and matches the regex (?i)[0-9]{3}[a-z]{3}[0-9]{3}$ and if PO_NUMBER is present and matches the regex (?i)[a-z0-9]+$
  • Verification is unsuccessful if either PO_NUMBER or INVOICE_RECEIPT_ID is incorrect or missing in the document.
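
The two sample rules can be tried out locally with a few lines of Python. This is a sketch of the rule semantics described above, not the SetMetaData function itself. The function name is hypothetical, the rule dictionaries mirror the attributes inserted into DynamoDB earlier (type, field, check, errorTxt), and the use of re.search (rather than a full-string match) is an assumption.

```python
import re

# Sample rules mirroring the items inserted into the DynamoDB rules table.
SAMPLE_RULES = [
    {"ruleId": 1, "type": "regex", "field": "INVOICE_RECEIPT_ID",
     "check": r"(?i)[0-9]{3}[a-z]{3}[0-9]{3}$",
     "errorTxt": "Receipt number is not valid. It is of the format: 123ABC456"},
    {"ruleId": 2, "type": "regex", "field": "PO_NUMBER",
     "check": r"(?i)[a-z0-9]+$",
     "errorTxt": "PO number is not present"},
]


def validate_expense(fields, rules):
    """Return a list of error messages; an empty list means the expense passes."""
    errors = []
    for rule in rules:
        if rule.get("type") != "regex":
            continue  # only regex rules are handled in this sketch
        value = fields.get(rule["field"])
        # A missing field fails the rule, as does a value the regex rejects.
        if value is None or re.search(rule["check"], value) is None:
            errors.append(rule["errorTxt"])
    return errors
```

For example, `validate_expense({"INVOICE_RECEIPT_ID": "123ABC456", "PO_NUMBER": "PO1234"}, SAMPLE_RULES)` returns an empty list, whereas a document without a PO_NUMBER returns the second rule’s error text.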

After the files are processed, the expense verification function moves the input files to either approved or declined folders in the same S3 bucket.

S3 output

For the purposes of this solution, we use DynamoDB to store the expense validation rules. However, you can modify this solution to integrate with your own or commercial expense validation or management solutions.

Intelligent index and search

With the OpenSearchPushInvoke Lambda function, the extracted expense metadata is pushed to an OpenSearch Service index and is available for search.

The final TaskOpenSearchMapping step clears the context, which otherwise could exceed the Step Functions quota of maximum input or output size for a task, state, or workflow run.

After the OpenSearch Service index is created, you can search for keywords from the extracted text via OpenSearch Dashboards.

OpenSearch document search
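
The same keyword search can also be expressed as a query body, for example to run it from the Dev Tools console in OpenSearch Dashboards. The following is a minimal sketch that builds such a body with the simple_query_string query. The function name is hypothetical, and because the index name and field names created by the solution are not listed in this post, the query searches across all fields and the index name is left as a placeholder.

```python
import json


def build_keyword_query(keywords, size=10):
    """Build a search body that requires all keywords to match across indexed fields."""
    return {
        "size": size,
        "query": {
            "simple_query_string": {
                "query": keywords,
                "default_operator": "and",
                # lenient avoids type errors when a keyword hits non-text fields
                "lenient": True,
            }
        },
    }


# Print a body that can be pasted after: GET <your-index-name>/_search
print(json.dumps(build_keyword_query("123ABC456"), indent=2))
```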

Archival, audit, and analytics

To manage the lifecycle and archival of invoices and receipts, you can configure S3 lifecycle rules to transition S3 objects from the Standard to the Intelligent-Tiering storage class. S3 Intelligent-Tiering monitors access patterns and automatically moves objects to the Infrequent Access tier when they haven’t been accessed for 30 consecutive days. After 90 days of no access, the objects are moved to the Archive Instant Access tier without performance impact or operational overhead.
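
As an illustration, a lifecycle rule of this kind can be written as the configuration document accepted by the boto3 put_bucket_lifecycle_configuration call. This is a sketch under assumptions rather than the configuration the stack deploys: the function names, the rule ID, the approved/ prefix, and the transition after 0 days are illustrative choices.

```python
def intelligent_tiering_lifecycle(prefix="approved/", days=0):
    """Build a lifecycle configuration that moves objects to S3 Intelligent-Tiering."""
    return {
        "Rules": [
            {
                "ID": "approved-to-intelligent-tiering",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": days, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    }


def apply_lifecycle(s3_client, bucket, config):
    """Apply the configuration. This replaces any lifecycle rules already on the bucket."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```

After objects are in Intelligent-Tiering, the moves to the Infrequent Access and Archive Instant Access tiers happen automatically and need no further rules.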

For auditing and analytics, this solution uses OpenSearch Service for running analytics on invoice requests. OpenSearch Service enables you to effortlessly ingest, secure, search, aggregate, view, and analyze data for a number of use cases, such as log analytics, application search, enterprise search, and more.

Log in to OpenSearch Dashboards and navigate to Stack Management, Saved objects, then choose Import. Choose the invoices.ndjson file from the cloned repository and choose Import. This prepopulates indexes and builds the visualization.

OpenSearch import

Refresh the page and navigate to Home, Dashboard, and open Invoices. Now you can select and apply filters and expand the time window to explore past invoices.

OpenSearch dashboard

Clean up

When you’re finished evaluating Amazon Textract for processing receipts and invoices, we recommend cleaning up any resources that you might have created. Complete the following steps:

  1. Delete all content from the S3 bucket invoiceprocessorworkflow-invoiceprocessorbucketf1-*.
  2. In AWS Cloud9, run the following commands to delete Amazon Cognito resources and CloudFormation stacks:
cognito_user_pool=$(aws cloudformation list-exports --query 'Exports[?Name==`InvoiceProcessorWorkflow-CognitoUserPoolId`].Value' --output text)
echo $cognito_user_pool
cdk destroy
aws cognito-idp delete-user-pool --user-pool-id $cognito_user_pool
  3. Delete the AWS Cloud9 environment that you created from the AWS Cloud9 console.


Conclusion

In this post, we provided an overview of how to build an invoice automation pipeline using Amazon Textract for data extraction and create a workflow for validation, archival, and search. We provided code samples on how to use the AnalyzeExpense API for extraction of critical fields from an invoice.

To get started, sign in to the Amazon Textract console to try this feature. To learn more about Amazon Textract capabilities, refer to the Amazon Textract Developer Guide or Textract Resources. To learn more about IDP, refer to the IDP with AWS AI services Part 1 and Part 2 posts.

About the Authors

Sushant Pradhan is a Sr. Solutions Architect at Amazon Web Services, helping enterprise customers. His interests and experience include containers, serverless technology, and DevOps. In his spare time, Sushant enjoys spending time outdoors with his family.

Shibin Michaelraj is a Sr. Product Manager with the AWS Textract team. He is focused on building AI/ML-based products for AWS customers.

Suprakash Dutta is a Sr. Solutions Architect at Amazon Web Services. He focuses on digital transformation strategy, application modernization and migration, data analytics, and machine learning. He is part of the AI/ML community at AWS and designs intelligent document processing solutions.

Maran Chandrasekaran is a Senior Solutions Architect at Amazon Web Services, working with our enterprise customers. Outside of work, he loves to travel and ride his motorcycle in Texas Hill Country.
