Last Updated: June 26, 2024
Textract and Comprehend
Amazon Textract pulls the text and structural information from files, and Amazon Comprehend performs a deeper analysis to detect Personally Identifiable Information (PII) in English or Spanish text documents. Depending on your needs, you can specify one or more types of PII you want to identify.
PII is any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. This includes information (i) that directly identifies an individual (e.g., name, address, social security number or other identifying number or code, telephone number, email address) or (ii) by which an agency intends to identify specific individuals in conjunction with other data elements, i.e., indirect identification. Information that permits the physical or online contacting of a specific individual is also treated as PII.
Amazon Comprehend refers to the different types of PII as entities. A list of PII entities is included in this topic. The Comprehend Types configuration includes an ALL option that covers every type listed. Keep in mind that the maximum number of characters for a text field is 100.
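As a rough illustration of how selecting entity types narrows the results, the sketch below filters a response shaped like the output of Amazon Comprehend's DetectPiiEntities operation (an `Entities` list whose items carry `Type`, `Score`, `BeginOffset`, and `EndOffset`). The function name `filter_pii_entities`, the sample response, and the handling of `ALL` are assumptions made for illustration only; they do not describe how Vasion implements this setting.

```python
from typing import Dict, Iterable, List


def filter_pii_entities(response: Dict, selected_types: Iterable[str]) -> List[Dict]:
    """Keep only entities whose Type was selected; "ALL" keeps every entity."""
    wanted = {t.upper() for t in selected_types}
    entities = response.get("Entities", [])
    if "ALL" in wanted:
        return list(entities)
    return [entity for entity in entities if entity["Type"] in wanted]


# Hypothetical response in the shape of a DetectPiiEntities result.
sample_response = {
    "Entities": [
        {"Type": "NAME", "Score": 0.99, "BeginOffset": 0, "EndOffset": 10},
        {"Type": "EMAIL", "Score": 0.98, "BeginOffset": 24, "EndOffset": 43},
        {"Type": "SSN", "Score": 0.97, "BeginOffset": 57, "EndOffset": 68},
    ]
}

only_email_and_ssn = filter_pii_entities(sample_response, ["EMAIL", "SSN"])
everything = filter_pii_entities(sample_response, ["ALL"])
```

Here `only_email_and_ssn` holds the two matching entities, while `everything` holds all three.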
Here we provide the steps to set up a basic configuration using Comprehend PII detection.
Requirements
Before you start a new Textract configuration, you need the following:
- An AWS account. The AWS access key ID and secret key are required.
- An S3 Bucket used to process the files. The AWS region and the name of the bucket are required.
The S3 Bucket used to process the files should not be used as a storage location. It's used by Textract to process the documents to extract the text and data.
- An object text field to store the PII information detected.
You are billed directly by Amazon for the number of pages processed each month.
Textract Configuration
Once you have the required information, follow these steps.
- From the side navigation, select Adv Image Processing.
- Select the New Configuration Type drop-down, and then select Textract.
- On the configuration page, complete the following:
- Textract AIP Name — enter the name you want to use to identify this process.
- AWS Access Key ID — enter the key ID for your AWS account.
- AWS Secret Key — enter the secret key for your account.
Select Validate.

- Once the AWS account is validated, complete the following:
- AWS Region — use the drop-down to select the AWS region for the account.
- S3 Bucket — use the drop-down to select the bucket you created to use with Textract.
Select Validate.

- Once the S3 Bucket is validated, complete the following:
- Textract API — use the drop-down to select Detect.
- Object — use the drop-down to select the object where you want to save the data value.
- Check the Comprehend PII detection check box.
- Select from the following options:
- Highlight — this option highlights the text identified as PII.
- Redact — this option redacts the text identified as PII.
- Field to store PII result — select the field where you want to store the PII data.
- Comprehend Types — use the drop-down to select the type of PII to detect.
Select Validate Comprehend Access.

- Once the access to Amazon Comprehend is validated, select Save from the top-right corner.
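To make the difference between the Highlight and Redact options concrete, here is a minimal sketch that applies each to plain text using character offsets in the style of Comprehend's `BeginOffset` and `EndOffset` fields. The function names, the `*` mask, the bracket markers, and the sample text are all hypothetical; how Vasion renders highlights or redactions in a document is not described here.

```python
from typing import Dict, List


def redact(text: str, entities: List[Dict], mask: str = "*") -> str:
    """Replace each detected span with one mask character per original character."""
    chars = list(text)
    for entity in entities:
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        chars[begin:end] = mask * (end - begin)
    return "".join(chars)


def highlight(text: str, entities: List[Dict],
              open_mark: str = "[", close_mark: str = "]") -> str:
    """Wrap each detected span in markers, leaving the text itself visible."""
    # Work right to left so earlier offsets stay valid after each insertion.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        text = text[:begin] + open_mark + text[begin:end] + close_mark + text[end:]
    return text


# Hypothetical input: offsets point at "Jane Doe" and "jane@example.com".
sample_text = "Contact Jane Doe at jane@example.com."
sample_entities = [
    {"Type": "NAME", "BeginOffset": 8, "EndOffset": 16},
    {"Type": "EMAIL", "BeginOffset": 20, "EndOffset": 36},
]

redacted = redact(sample_text, sample_entities)
highlighted = highlight(sample_text, sample_entities)
```

Redaction hides the matched characters while preserving the length of the text; highlighting keeps the text readable and only marks where PII was found.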
PII Entity Types
Amazon Comprehend supports universal entity types and some country-specific entity types. Below is a comprehensive list of both types.
PII universal entity types
Some PII entity types are universal (not specific to individual countries), such as email addresses and credit card numbers. Amazon Comprehend detects the following types of universal PII entities:
- ADDRESS
- A physical address, such as "100 Main Street, Anytown, USA" or "Suite #12, Building 123". An address can include information such as the street, building, location, city, state, country, county, zip code, precinct, and neighborhood.
- AGE
- An individual's age, including the quantity and unit of time. For example, in the phrase "I am 40 years old," Amazon Comprehend recognizes "40 years" as an age.
- AWS_ACCESS_KEY
- A unique identifier that's associated with a secret access key; you use the access key ID and secret access key to sign programmatic AWS requests cryptographically.
- AWS_SECRET_KEY
- A unique identifier that's associated with an access key. You use the access key ID and secret access key to sign programmatic AWS requests cryptographically.
- CREDIT_DEBIT_CVV
- A three-digit card verification code (CVV) that is present on VISA, MasterCard, and Discover credit and debit cards. For American Express credit or debit cards, the CVV is a four-digit numeric code.
- CREDIT_DEBIT_EXPIRY
- The expiration date for a credit or debit card. This number is usually four digits long and is often formatted as month/year or MM/YY. Amazon Comprehend recognizes expiration dates such as 01/21, 01/2021, and Jan 2021.
- CREDIT_DEBIT_NUMBER
- The number for a credit or debit card. These numbers can vary from 13 to 16 digits in length. However, Amazon Comprehend also recognizes credit or debit card numbers when only the last four digits are present.
- DATE_TIME
- A date can include a year, month, day, day of week, or time of day. For example, Amazon Comprehend recognizes "January 19, 2020" or "11 am" as dates. Amazon Comprehend will recognize partial dates, date ranges, and date intervals. It will also recognize decades, such as "the 1990s".
- DRIVER_ID
- The number assigned to a driver's license, which is an official document permitting an individual to operate one or more motorized vehicles on a public road. A driver's license number consists of alphanumeric characters.
- EMAIL
- An email address, such as marymajor@email.com.
- INTERNATIONAL_BANK_ACCOUNT_NUMBER
- An International Bank Account Number has specific formats in each country. See www.iban.com/structure.
- IP_ADDRESS
- An IPv4 address, such as 198.51.100.0.
- LICENSE_PLATE
- A license plate for a vehicle is issued by the state or country where the vehicle is registered. The format for passenger vehicles is typically five to eight digits, consisting of upper-case letters and numbers. The format varies depending on the location of the issuing state or country.
- MAC_ADDRESS
- A media access control (MAC) address is a unique identifier assigned to a network interface controller (NIC).
- NAME
- An individual's name. This entity type does not include titles, such as Dr., Mr., Mrs., or Miss. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the "John Doe Organization" as an organization, and it recognizes "Jane Doe Street" as an address.
- PASSWORD
- An alphanumeric string that is used as a password, such as "*very20special#pass*".
- PHONE
- A phone number. This entity type also includes fax and pager numbers.
- PIN
- A four-digit personal identification number (PIN) with which you can access your bank account.
- SWIFT_CODE
- A SWIFT code is a standard format of Bank Identifier Code (BIC) used to specify a particular bank or branch. Banks use these codes for money transfers such as international wire transfers.
- SWIFT codes consist of eight or 11 characters. The 11-character codes refer to specific branches, while eight-character codes (or 11-character codes ending in 'XXX') refer to the head or primary office.
- URL
- A web address, such as www.example.com.
- USERNAME
- A user name that identifies an account, such as a login name, screen name, nickname, or handle.
- VEHICLE_IDENTIFICATION_NUMBER
- A Vehicle Identification Number (VIN) uniquely identifies a vehicle. VIN content and format are defined in the ISO 3779 specification. Each country has specific codes and formats for VINs.
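The SWIFT_CODE rule in the list above (eight or 11 characters, with 'XXX' marking the primary office) can be sketched as a small format check. The character layout used in the pattern (four-letter bank code, two-letter country code, two-character location code, optional three-character branch code) comes from the general BIC structure rather than from this article, and the function name and sample codes are hypothetical. This only checks the shape of a code and says nothing about how Amazon Comprehend detects one.

```python
import re
from typing import Optional

# Bank code (4 letters), country code (2 letters), location code (2 characters),
# and an optional branch code (3 characters).
_BIC_PATTERN = re.compile(r"^[A-Z]{4}[A-Z]{2}[A-Z0-9]{2}([A-Z0-9]{3})?$")


def classify_swift_code(code: str) -> Optional[str]:
    """Return "primary office", "branch", or None when the format does not match."""
    match = _BIC_PATTERN.match(code.strip().upper())
    if match is None:
        return None
    branch = match.group(1)
    # Eight-character codes and codes ending in XXX refer to the primary office.
    if branch is None or branch == "XXX":
        return "primary office"
    return "branch"


# Made-up codes used only to exercise the three outcomes.
head_office = classify_swift_code("ABCDUS33")
head_office_long = classify_swift_code("ABCDUS33XXX")
branch_office = classify_swift_code("ABCDUS33NYC")
not_a_code = classify_swift_code("ABCD")
```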
Country-specific PII entity types
Some PII entity types are country-specific, such as passport numbers and other government-issued ID numbers. Amazon Comprehend detects the following types of country-specific PII entities:
- CA_HEALTH_NUMBER
- A Canadian Health Service Number is a 10-digit unique identifier, required for individuals to access healthcare benefits.
- CA_SOCIAL_INSURANCE_NUMBER
- A Canadian Social Insurance Number (SIN) is a nine-digit unique identifier, required for individuals to access government programs and benefits.
- The SIN is formatted as three groups of three digits, such as 123-456-789. A SIN can be validated through a simple check-digit process called the Luhn algorithm.
- IN_AADHAAR
- An Indian Aadhaar is a 12-digit unique identification number issued by the Indian government to the residents of India. The Aadhaar format has a space or hyphen after the fourth and eighth digit.
- IN_NREGA
- An Indian National Rural Employment Guarantee Act (NREGA) number consists of two letters followed by 14 numbers.
- IN_PERMANENT_ACCOUNT_NUMBER
- An Indian Permanent Account Number is a unique 10-character alphanumeric identifier issued by the Income Tax Department.
- IN_VOTER_NUMBER
- An Indian Voter ID consists of three letters followed by seven numbers.
- UK_NATIONAL_HEALTH_SERVICE_NUMBER
- A UK National Health Service Number is a 10-17 digit number, such as 485 777 3456. The current system formats the 10-digit number with spaces after the third and sixth digits. The final digit is an error-detecting checksum.
- The 17-digit number format has spaces after the 10th and 13th digits.
- UK_NATIONAL_INSURANCE_NUMBER
- A UK National Insurance Number (NINO) provides individuals with access to National Insurance (social security) benefits. It is also used for some purposes in the UK tax system.
- The number is nine characters long: two letters, followed by six numbers and one letter. A NINO can be formatted with a space or a dash after the two letters and after the second, fourth, and sixth digits.
- UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER
- A UK Unique Taxpayer Reference (UTR) is a 10-digit number that identifies a taxpayer or a business.
- BANK_ACCOUNT_NUMBER
- A US bank account number, which is typically 10 to 12 digits long. Amazon Comprehend also recognizes bank account numbers when only the last four digits are present.
- BANK_ROUTING
- A US bank account routing number. These are typically nine digits long, but Amazon Comprehend also recognizes routing numbers when only the last four digits are present.
- PASSPORT_NUMBER
- A US passport number. Passport numbers range from six to nine alphanumeric characters.
- US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER
- A US Individual Taxpayer Identification Number (ITIN) is a nine-digit number that starts with a "9" and contains a "7" or "8" as the fourth digit. An ITIN can be formatted with a space or a dash after the third and fifth digits.
- SSN
- A US Social Security Number (SSN) is a nine-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last four digits are present.
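Two entries in the list above mention check digits: the Canadian SIN (Luhn algorithm) and the UK NHS number (final checksum digit). The sketch below shows one way to verify each. The Luhn steps are the standard ones; the modulus 11 rule used for the NHS number is not described in this article and is included here as general background. The function names and sample numbers are illustrative, and nothing here reflects how Amazon Comprehend itself validates matches.

```python
from typing import List


def _digits(value: str) -> List[int]:
    """Drop spaces and hyphens; return the digits, or [] if anything else remains."""
    cleaned = value.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return []
    return [int(ch) for ch in cleaned]


def is_valid_sin(value: str) -> bool:
    """Luhn check for a nine-digit Canadian SIN."""
    digits = _digits(value)
    if len(digits) != 9:
        return False
    total = 0
    # Double every second digit counting from the right; subtract 9 if above 9.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def is_valid_nhs_number(value: str) -> bool:
    """Modulus 11 check for the 10-digit NHS number format."""
    digits = _digits(value)
    if len(digits) != 10:
        return False
    # Weights 10 down to 2 apply to the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check value of 10 never matches a single digit.
    return check == digits[9]


sin_ok = is_valid_sin("046 454 286")
nhs_ok = is_valid_nhs_number("943 476 5919")
```

The sample values in the entity descriptions, such as 123-456-789, illustrate formatting only and are not expected to pass these checks.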
You must have an existing Amazon Textract account to configure Amazon Textract in your Vasion Automate Pro instance.
Vasion has a simple, easy-to-use UI that walks you through the steps needed for configuring Amazon Textract in your Vasion instance. Please see our Amazon Textract documentation for specific configuration steps.