Textract and Comprehend

Amazon Textract extracts the text and structural information from files, and Amazon Comprehend performs a deeper analysis to detect Personally Identifiable Information (PII) in English or Spanish text documents. Depending on your needs, you can specify one or more types of PII you want to identify.

PII is any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means. It includes information (i) that directly identifies an individual (e.g., name, address, social security number or other identifying number or code, telephone number, email address, etc.) or (ii) by which an agency intends to identify specific individuals in conjunction with other data elements, i.e., indirect identification. Information that permits the physical or online contacting of a specific individual is also treated as PII.

Amazon Comprehend refers to the different types of PII as entities. A list of PII entities is included in this topic. The Comprehend Types configuration includes an ALL option that selects every entity type listed. Keep in mind that the maximum number of characters for a text field is 100.
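As a rough illustration of how the two services fit together, the Python sketch below joins the text lines from a Textract DetectDocumentText-style response, passes the text to Comprehend's `detect_pii_entities` call, and keeps only the selected entity types. It assumes the clients are boto3 Textract and Comprehend clients created elsewhere. The function names are hypothetical, and this is not necessarily how the product calls the services internally.

```python
from typing import Any, Dict, Iterable, List


def lines_from_textract(response: Dict[str, Any]) -> str:
    """Join the text of every LINE block in a DetectDocumentText-style response."""
    blocks = response.get("Blocks", [])
    return "\n".join(b["Text"] for b in blocks if b.get("BlockType") == "LINE")


def detect_pii(comprehend_client: Any, text: str, language_code: str = "en") -> List[Dict[str, Any]]:
    """Run PII detection; language_code is 'en' (English) or 'es' (Spanish)."""
    response = comprehend_client.detect_pii_entities(Text=text, LanguageCode=language_code)
    return response["Entities"]


def filter_entities(entities: Iterable[Dict[str, Any]], selected_types: Iterable[str]) -> List[Dict[str, Any]]:
    """Keep entities whose Type was selected; 'ALL' keeps every entity (assumed to mirror the ALL option)."""
    wanted = set(selected_types)
    if "ALL" in wanted:
        return list(entities)
    return [e for e in entities if e["Type"] in wanted]
```

Each returned entity is a dictionary with `Type`, `Score`, `BeginOffset`, and `EndOffset` keys, so selecting only `EMAIL` and `SSN`, for example, would be `filter_entities(entities, ["EMAIL", "SSN"])`.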

Here we provide the steps to set up a basic configuration using Comprehend PII detection.

Requirements

Before you start a new Textract configuration you need the following:

  • An AWS account. The AWS access key ID and secret key are required.
  • An S3 bucket used to process the files. The AWS region and the name of the bucket are required.

    The S3 Bucket used to process the files should not be used as a storage location. It's used by Textract to process the documents to extract the text and data.

  • An object text field to store the PII information detected.

You are billed directly by Amazon for the number of pages processed each month. Amazon Textract and Amazon Comprehend are two separate services, each with its own pricing.

Textract Configuration

Once you have the required information, follow these steps.

  1. From the side navigation, select Adv Image Processing.
  2. Select the New Configuration Type drop-down, and then select Textract.

    Textract configuration option

  3. On the configuration page, complete the following:
    1. Textract AIP Name — enter the name you want to use to identify this process.
    2. AWS Access Key ID — enter the key ID for your AWS account.
    3. AWS Secret Key — enter the secret key for your account.
    4. Select Validate.

      Name and AWS account fields

  4. Once the AWS account is validated, complete the following:
    1. AWS Region — use the drop-down to select the AWS region for the account.
    2. S3 Bucket — use the drop-down to select the bucket you created to use with Textract.
    3. Select Validate.

      AWS Region and S3 Bucket fields

  5. Once the S3 Bucket is validated, complete the following:
    1. Textract API — use the drop-down to select Detect.
    2. Object — use the drop-down to select the object where you want to save the data value.
  6. Check the Comprehend PII detection check box.
  7. Select from the following options:
    1. Highlight — this option highlights the text identified as PII.
    2. Redact — this option redacts the text identified as PII.
    3. Field to store PII result — select the field where you want to store the PII data.
    4. Comprehend Types — use the drop-down to select the type of PII to detect.
    5. Select Validate Comprehend Access.

      Comprehend PII detection options

  8. Once the access to Amazon Comprehend is validated, select Save from the top-right corner.

New Textract configuration
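To make the Highlight and Redact options concrete, here is a minimal Python sketch that applies each one to plain text using the `BeginOffset` and `EndOffset` values that Comprehend returns for every detected entity. The product applies these options to the processed document itself, so the function names, the mask character, and the highlight markers below are illustrative assumptions only. The sketch also assumes the detected spans do not overlap.

```python
from typing import Any, Dict, Iterable


def redact(text: str, entities: Iterable[Dict[str, Any]], mask_char: str = "*") -> str:
    """Replace every character inside a detected PII span with mask_char."""
    chars = list(text)
    for entity in entities:
        for i in range(entity["BeginOffset"], entity["EndOffset"]):
            chars[i] = mask_char
    return "".join(chars)


def highlight(text: str, entities: Iterable[Dict[str, Any]],
              open_mark: str = "[[", close_mark: str = "]]") -> str:
    """Wrap every detected PII span in markers (hypothetical markers)."""
    result = text
    # Work from the end of the text so earlier offsets stay valid after inserting markers.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        begin, end = entity["BeginOffset"], entity["EndOffset"]
        result = result[:begin] + open_mark + result[begin:end] + close_mark + result[end:]
    return result
```

Redaction keeps the text length unchanged, which means the offsets of the remaining entities still line up after masking.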

PII Entity Types

Amazon Comprehend supports universal entity types and some country-specific entity types. Below is a comprehensive list of both types.

PII universal entity types

Some PII entity types are universal (not specific to individual countries), such as email addresses and credit card numbers. Amazon Comprehend detects the following types of universal PII entities:

ADDRESS
A physical address, such as "100 Main Street, Anytown, USA" or "Suite #12, Building 123". An address can include information such as the street, building, location, city, state, country, county, zip code, precinct, and neighborhood.
AGE
An individual's age, including the quantity and unit of time. For example, in the phrase "I am 40 years old," Amazon Comprehend recognizes "40 years" as an age.
AWS_ACCESS_KEY
A unique identifier that's associated with a secret access key; you use the access key ID and secret access key to sign programmatic AWS requests cryptographically.
AWS_SECRET_KEY
A unique identifier that's associated with an access key. You use the access key ID and secret access key to sign programmatic AWS requests cryptographically.
CREDIT_DEBIT_CVV
A three-digit card verification code (CVV) that is present on VISA, MasterCard, and Discover credit and debit cards. For American Express credit or debit cards, the CVV is a four-digit numeric code.
CREDIT_DEBIT_EXPIRY
The expiration date for a credit or debit card. This number is usually four digits long and is often formatted as month/year or MM/YY. Amazon Comprehend recognizes expiration dates such as 01/21, 01/2021, and Jan 2021.
CREDIT_DEBIT_NUMBER
The number for a credit or debit card. These numbers can vary from 13 to 16 digits in length. However, Amazon Comprehend also recognizes credit or debit card numbers when only the last four digits are present.
DATE_TIME
A date can include a year, month, day, day of week, or time of day. For example, Amazon Comprehend recognizes "January 19, 2020" or "11 am" as dates. Amazon Comprehend will recognize partial dates, date ranges, and date intervals. It will also recognize decades, such as "the 1990s".
DRIVER_ID
The number assigned to a driver's license, which is an official document permitting an individual to operate one or more motorized vehicles on a public road. A driver's license number consists of alphanumeric characters.
EMAIL
An email address, such as marymajor@email.com.
INTERNATIONAL_BANK_ACCOUNT_NUMBER
An International Bank Account Number has specific formats in each country. See www.iban.com/structure.
IP_ADDRESS
An IPv4 address, such as 198.51.100.0.
LICENSE_PLATE
A license plate for a vehicle is issued by the state or country where the vehicle is registered. The format for passenger vehicles is typically five to eight digits, consisting of upper-case letters and numbers. The format varies depending on the location of the issuing state or country.
MAC_ADDRESS
A media access control (MAC) address is a unique identifier assigned to a network interface controller (NIC).
NAME
An individual's name. This entity type does not include titles, such as Dr., Mr., Mrs., or Miss. Amazon Comprehend does not apply this entity type to names that are part of organizations or addresses. For example, Amazon Comprehend recognizes the "John Doe Organization" as an organization, and it recognizes "Jane Doe Street" as an address.
PASSWORD
An alphanumeric string that is used as a password, such as "*very20special#pass*".
PHONE
A phone number. This entity type also includes fax and pager numbers.
PIN
A four-digit personal identification number (PIN) with which you can access your bank account.
SWIFT_CODE
A SWIFT code is a standard format of Bank Identifier Code (BIC) used to specify a particular bank or branch. Banks use these codes for money transfers such as international wire transfers.
SWIFT codes consist of eight or 11 characters. The 11-character codes refer to specific branches, while eight-character codes (or 11-character codes ending in 'XXX') refer to the head or primary office.
URL
A web address, such as www.example.com.
USERNAME
A user name that identifies an account, such as a login name, screen name, nickname, or handle.
VEHICLE_IDENTIFICATION_NUMBER
A Vehicle Identification Number (VIN) uniquely identifies a vehicle. VIN content and format are defined in the ISO 3779 specification. Each country has specific codes and formats for VINs.
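As a small worked example of one of the formats above, the following Python sketch classifies a SWIFT code by the length rule described under SWIFT_CODE. The function name and return values are hypothetical, and the check looks only at length and the 'XXX' suffix; it does not confirm that a code belongs to a real bank, and it is not how Amazon Comprehend detects the entity.

```python
from typing import Optional


def swift_code_kind(code: str) -> Optional[str]:
    """Return 'primary office', 'branch', or None if the length is not 8 or 11."""
    cleaned = code.strip().upper()
    if not cleaned.isalnum():
        return None
    if len(cleaned) == 8:
        return "primary office"
    if len(cleaned) == 11:
        # An 11-character code ending in 'XXX' also refers to the primary office.
        return "primary office" if cleaned.endswith("XXX") else "branch"
    return None
```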

Country-specific PII entity types

Some PII entity types are country-specific, such as passport numbers and other government-issued ID numbers. Amazon Comprehend detects the following types of country-specific PII entities:

CA_HEALTH_NUMBER
A Canadian Health Service Number is a 10-digit unique identifier, required for individuals to access healthcare benefits.
CA_SOCIAL_INSURANCE_NUMBER
A Canadian Social Insurance Number (SIN) is a nine-digit unique identifier, required for individuals to access government programs and benefits.
The SIN is formatted as three groups of three digits, such as 123-456-789. A SIN can be validated through a simple check-digit process called the Luhn algorithm.
IN_AADHAAR
An Indian Aadhaar is a 12-digit unique identification number issued by the Indian government to the residents of India. The Aadhaar format has a space or hyphen after the fourth and eighth digit.
IN_NREGA
An Indian National Rural Employment Guarantee Act (NREGA) number consists of two letters followed by 14 numbers.
IN_PERMANENT_ACCOUNT_NUMBER
An Indian Permanent Account Number is a 10-digit unique alphanumeric number issued by the Income Tax Department.
IN_VOTER_NUMBER
An Indian Voter ID consists of three letters followed by seven numbers.
UK_NATIONAL_HEALTH_SERVICE_NUMBER
A UK National Health Service Number is a 10-17 digit number, such as 485 777 3456. The current system formats the 10-digit number with spaces after the third and sixth digits. The final digit is an error-detecting checksum.
The 17-digit number format has spaces after the 10th and 13th digits.
UK_NATIONAL_INSURANCE_NUMBER
A UK National Insurance Number (NINO) provides individuals with access to National Insurance (social security) benefits. It is also used for some purposes in the UK tax system.
The number is nine characters long and starts with two letters, followed by six numbers and one letter. A NINO can be formatted with a space or a dash after the two letters and after the second, fourth, and sixth digits.
UK_UNIQUE_TAXPAYER_REFERENCE_NUMBER
A UK Unique Taxpayer Reference (UTR) is a 10-digit number that identifies a taxpayer or a business.
BANK_ACCOUNT_NUMBER
A US bank account number, which is typically 10 to 12 digits long. Amazon Comprehend also recognizes bank account numbers when only the last four digits are present.
BANK_ROUTING
A US bank account routing number. These are typically nine digits long, but Amazon Comprehend also recognizes routing numbers when only the last four digits are present.
PASSPORT_NUMBER
A US passport number. Passport numbers range from six to nine alphanumeric characters.
US_INDIVIDUAL_TAX_IDENTIFICATION_NUMBER
A US Individual Taxpayer Identification Number (ITIN) is a nine-digit number that starts with a "9" and contains a "7" or "8" as the fourth digit. An ITIN can be formatted with a space or a dash after the third and fourth digits.
SSN
A US Social Security Number (SSN) is a nine-digit number that is issued to US citizens, permanent residents, and temporary working residents. Amazon Comprehend also recognizes Social Security Numbers when only the last four digits are present.
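Several of the formats above have a structure that can be checked programmatically. The Python sketch below shows three such plausibility checks: the Luhn check mentioned for the Canadian SIN, a check-digit test for the 10-digit UK NHS number, and the digit rules given for the US ITIN. The function names are hypothetical. The NHS check assumes the commonly documented Modulus 11 scheme, which this topic does not describe. None of these checks represents how Amazon Comprehend detects entities, and passing a check does not mean a number was actually issued.

```python
from typing import List


def _digits(value: str) -> List[int]:
    """Extract the digits from a value, ignoring spaces and dashes."""
    return [int(c) for c in value if c.isdigit()]


def luhn_valid(value: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    digits = _digits(value)
    if not digits:
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def looks_like_sin(value: str) -> bool:
    """Canadian SIN: nine digits that pass the Luhn check."""
    return len(_digits(value)) == 9 and luhn_valid(value)


def nhs_check_digit_valid(value: str) -> bool:
    """10-digit NHS number: Modulus 11 check digit (assumed scheme)."""
    digits = _digits(value)
    if len(digits) != 10:
        return False
    # Weights 10 down to 2 are applied to the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    return check != 10 and check == digits[9]


def looks_like_itin(value: str) -> bool:
    """US ITIN: nine digits, first digit 9, fourth digit 7 or 8."""
    digits = _digits(value)
    return len(digits) == 9 and digits[0] == 9 and digits[3] in (7, 8)
```

Note that the SIN shown above, 123-456-789, illustrates the three-group layout only; it does not pass the Luhn check.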