Textract Configuration

This article provides the steps to set up a basic Textract configuration. For full details on the field mapping settings and options, see Intelligent Capture with Textract.

Vasion Automate Pro supports the following Amazon machine learning models:

  • Expense — used for accounting documents, for example, invoices, purchase orders, etc.
  • Lending — used for agreements, contracts, loan documents, etc.
  • Identity — used for personal identification, driver's licenses, passports, etc.
  • Analyze — used for documents that do not meet the other endpoint criteria.
  • Detect — used to detect the text in a document without needing to map data to an object.

In Vasion Automate Pro, these models are referred to as Textract API endpoints.
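As a rough orientation, the sketch below pairs each endpoint name with an Amazon Textract API operation. The operation names are real Textract operations, but the pairing is an assumption made for illustration; the source does not state which operation Vasion Automate Pro calls for each endpoint. `TEXTRACT_OPERATIONS` and `operation_for` are hypothetical names.

```python
# Assumed pairing of endpoint names (as shown in the drop-down) with
# Amazon Textract API operations. Illustration only, not confirmed by the product.
TEXTRACT_OPERATIONS = {
    "Expense": "AnalyzeExpense",
    "Lending": "StartLendingAnalysis",
    "Identity": "AnalyzeID",
    "Analyze": "AnalyzeDocument",
    "Detect": "DetectDocumentText",
}


def operation_for(endpoint: str) -> str:
    """Return the assumed Textract operation for an endpoint name."""
    try:
        return TEXTRACT_OPERATIONS[endpoint]
    except KeyError:
        raise ValueError(f"Unknown Textract API endpoint: {endpoint!r}") from None
```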

Requirements

Before you start a new Textract configuration you need the following:

  • A sample document used to map the fields identified by the Textract process.
  • An AWS account. One of the following is required:
    • The AWS access key ID and secret key.
    • An AWS Role ARN and External ID.
      • The Role will need to be given permissions for the products being used.
  • An S3 Bucket used to process the files. The AWS region and the name of the bucket are required.

    The S3 Bucket used to process the files should not be used as a storage location. It's used by Textract to process the documents to extract the text and data.
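The requirements above can be expressed as a small pre-flight check to run before opening the configuration page. This is a minimal sketch, not part of the product: the `TextractPrerequisites` class, its field names, and `missing_prerequisites` are hypothetical and only mirror the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextractPrerequisites:
    """Hypothetical container for the values listed under Requirements."""
    region: str
    bucket: str
    access_key_id: Optional[str] = None
    secret_key: Optional[str] = None
    role_arn: Optional[str] = None
    external_id: Optional[str] = None


def missing_prerequisites(p: TextractPrerequisites) -> List[str]:
    """Return a list of problems; an empty list means nothing is missing."""
    problems = []
    has_keys = bool(p.access_key_id and p.secret_key)
    has_role = bool(p.role_arn and p.external_id)
    # One complete credential pair is enough: key and secret, or role and external ID.
    if not (has_keys or has_role):
        problems.append(
            "Provide an access key ID and secret key, or a Role ARN and External ID."
        )
    if not p.region:
        problems.append("The AWS region is required.")
    if not p.bucket:
        problems.append("The S3 Bucket name is required.")
    return problems
```

An incomplete pair, such as an access key ID without a secret key, is reported as missing credentials.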

To utilize this feature, ensure that the IAM user has write access for the following actions:

  • ListCollections

  • CreateCollection

  • CreateUser

  • IndexFaces

  • AssociateFaces

  • SearchUsers

  • DeleteFaces
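The listed actions could be granted through an IAM policy document along the lines of the sketch below. The source does not name the AWS service these actions belong to; the names match the Amazon Rekognition API, so the `rekognition` prefix used here is an assumption. The `"*"` resource is a placeholder that you would normally narrow, and `build_policy` is a hypothetical helper.

```python
import json

# Action names copied from the list above.
ACTIONS = [
    "ListCollections",
    "CreateCollection",
    "CreateUser",
    "IndexFaces",
    "AssociateFaces",
    "SearchUsers",
    "DeleteFaces",
]


def build_policy(service_prefix: str, resource: str = "*") -> dict:
    """Build an IAM policy document that allows the listed actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"{service_prefix}:{name}" for name in ACTIONS],
                "Resource": resource,
            }
        ],
    }


# Assumption: the actions belong to Amazon Rekognition.
print(json.dumps(build_policy("rekognition"), indent=2))
```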

You are billed directly by Amazon for the number of Textract and Comprehend pages processed each month.

Textract Configuration

Once you have the required information, follow these steps.

  1. Navigate to Capture.
  2. Select Adv Image Processing from the side navigation.
  3. Select the New Configuration Type drop-down.
  4. Select Amazon Textract.

    New AIP configuration options.

  5. Enter a name for this Textract process in the Amazon Textract AIP Name field.

  6. Select either Key and Secret or Role Assumption for authentication.

    1. Key and Secret:
      1. AWS Access Key ID — enter the key ID for your AWS account.
      2. AWS Secret Key — enter the secret key for your account.
    2. Role Assumption:
      1. AWS Role ARN — enter the previously configured ARN.
      2. Duration (seconds) — enter a duration for the authentication to be maintained before reauthenticating.
      3. External ID — enter the External ID associated with the AWS Role ARN.
  7. Select Validate.

    Name and AWS account fields.

  8. Once the AWS account is validated, complete the following:
    1. AWS Region — use the drop-down to select the AWS region for the account.
    2. S3 Bucket — use the drop-down to select the bucket you created to use with Textract.
  9. Select Validate.

    AWS Region and S3 Bucket fields.

  10. Once the Bucket is validated, complete the following:
    1. Textract API — use the drop-down to select the ML model you want to use for this process.
    2. Include full text OCR data — select this option if you want to include an .rtf file containing all the text extracted by Textract. This option makes the data available in Full Text Searches.
    3. Object — use the drop-down to select the object where you want to save the data value.
    4. Select Map Field Data to map the data returned from Textract to the Vasion Automate Pro object data. For information on this step, see Map Field Data.

    5. If you would like to limit the number of pages sent to AWS, enter the desired page numbers or ranges, separated by commas, in the Process Selected Pages field. An empty value will process all pages of the document.

    6. Select the Amazon Comprehend PII Detection option to detect and handle PII found on documents. For more information on this functionality, see Comprehend PII configuration.
      ML model and Object fields.

New Textract configuration.
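To illustrate the kind of value the Process Selected Pages field accepts, the sketch below expands a comma-separated list of page numbers and ranges into individual pages. It is only a sketch: the hyphen as range separator, the clipping of ranges to the document length, and the `parse_page_selection` function itself are assumptions, not the product's documented behavior.

```python
from typing import List


def parse_page_selection(value: str, page_count: int) -> List[int]:
    """Expand a value such as '1, 3-5' into a sorted list of page numbers.

    An empty value selects every page, as described for the field.
    """
    if not value.strip():
        return list(range(1, page_count + 1))

    pages = set()
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate a stray comma
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page selection: {part!r}")
        # Assumption: pages beyond the end of the document are ignored.
        pages.update(range(start, min(end, page_count) + 1))
    return sorted(pages)
```

For a ten-page document, `parse_page_selection("1, 3-5", 10)` yields pages 1, 3, 4, and 5.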

Textract API Options

There are five Textract API endpoints available to use within the Vasion Automate Pro advanced image processing configurations.

Expense

The Expense API endpoint can be used to extract data from documents like invoices, purchase orders, receipts, etc.

To configure the Expense endpoint:

  1. Select Expense from the Amazon Textract API drop-down.
  2. If you would like this process to produce a full text file of the document, select the Include Full Text OCR Data check box.

Identity

The Identity API endpoint can be used to extract data from documents like driver's licenses, passports, birth certificates, etc.

To configure the Identity endpoint:

  1. Select Identity from the Amazon Textract API drop-down.
  2. If you would like this process to produce a full text file of the document, select the Include Full Text OCR Data check box.

Lending

The Lending API endpoint can be used to extract data from documents like bank statements, collateral agreements, pay stubs, etc.

To configure the Lending endpoint:

  1. Select Lending from the Amazon Textract API drop-down.
This endpoint does not support generating full text OCR data.

Analyze

The Analyze API endpoint can be used to capture structured document data.

  1. Select Analyze from the Amazon Textract API drop-down.
  2. If you would like this process to produce a full text file of the document, select the Include Full Text OCR Data check box.

Detect

The Detect endpoint can be used to gather and organize all available text and data on the page being scanned. The Detect endpoint option automatically includes the full text OCR data option and you cannot edit the check box. The Map Field Data option is not enabled.

  1. Select Detect from the Amazon Textract API drop-down.

Detect API option.
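The differences between the five endpoints can be summarized as a small lookup table. The sketch below restates the options described above; `EndpointOptions`, `ENDPOINT_OPTIONS`, and `includes_full_text` are hypothetical names, and treating Map Field Data as available for every endpoint except Detect is an assumption drawn from the configuration steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointOptions:
    full_text_ocr: str    # "optional", "always", or "unsupported"
    map_field_data: bool  # whether Map Field Data is enabled


ENDPOINT_OPTIONS = {
    "Expense": EndpointOptions("optional", True),
    "Identity": EndpointOptions("optional", True),
    "Lending": EndpointOptions("unsupported", True),
    "Analyze": EndpointOptions("optional", True),
    "Detect": EndpointOptions("always", False),
}


def includes_full_text(endpoint: str, checkbox_selected: bool) -> bool:
    """Report whether full text OCR data is produced for an endpoint."""
    mode = ENDPOINT_OPTIONS[endpoint].full_text_ocr
    if mode == "always":
        return True   # Detect: the check box is fixed on
    if mode == "unsupported":
        return False  # Lending: no full text OCR data
    return checkbox_selected
```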

Line Item Data

The Include Line Item Data check box returns individualized line item information so that it can be used for searches, stored and exported for other line-of-business applications, or used for external data analysis through other platforms.

The Line Item Fail Logic radio buttons apply to the collection of line items as a whole. If you want to capture the line item data regardless of Amazon Textract's confidence, select the Ignore radio button.

Include line item checkbox with fail logic radio buttons.

Map the Field Data

In the Map Field Data page, complete the following:

Import the Sample File

  1. Select Browse.
  2. In the Select Import File modal, do one of the following:
    • Drag and drop the file.
    • Select the drive / storage, navigate to the file location, and select the file.
  3. Select Continue.

    Select Import File modal with file selected.

New field map configuration.

Map the Fields

The object fields are shown as drop-downs. Use the drop-downs to select the fields you want to map. Some fields can be left blank, depending on the object and how you want to process the file. If you want to see a preview of the sample file, select the Preview button.

Fields mapped.

Select a chip from one of the mapped object drop-downs to see the value in the document highlighted with the confidence score color.

Redact

For object fields that contain sensitive data, select the Redact option. When the file is processed by Textract and the confidence level is 61% or higher, a version of the document is created with the information in the selected field redacted. When you open the document in the Document Viewer, you can still see the data in the object fields in the side panel.

Redacted document in Document Viewer.
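The redaction rule amounts to a simple threshold check. The sketch below is only an illustration of that rule: the 61% threshold comes from the description above, while the function and constant names are hypothetical and the confidence is assumed to be expressed as a percentage.

```python
# Threshold taken from the description: redaction applies at 61% or higher.
REDACTION_THRESHOLD_PERCENT = 61.0


def should_redact(redact_selected: bool, confidence_percent: float) -> bool:
    """Return True when a field value would be redacted in the created version."""
    return redact_selected and confidence_percent >= REDACTION_THRESHOLD_PERCENT
```

Under this reading, a field with Redact selected and a confidence of 60.9% is left unredacted, while the same field at 61% is redacted.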

Save the Configuration

  1. When you complete the mapping, select Save.
  2. In the Configuration page, select Save.

    Completed configuration page.