LLMWhisperer: Amazon Textract Alternative

Table of Contents

Introduction

In today’s digital landscape, the demand for accurate OCR and document parsing has never been higher. Businesses and organizations rely on accurate data extraction from complex documents, forms, and tables to automate workflows, reduce manual errors, and increase efficiency.

Several popular solutions exist to address these needs, including established tools like Tesseract, Google Cloud Vision, and Amazon Textract. While these platforms offer robust features and integration capabilities, they sometimes fall short when it comes to preserving document layouts, which is an essential factor for workflows that leverage large language models (LLMs) for data extraction.

This is where LLMWhisperer becomes a compelling Textract alternative. Designed to excel in layout preservation, cost efficiency, and nuanced parsing (especially for challenging formats like vertical text or intricate tables), LLMWhisperer offers a fresh perspective on document processing that can be a game-changer for many applications.

In this article, we will:

  • Walk through the setup processes for both Amazon Textract and LLMWhisperer.
  • Explore and evaluate key features using sample documents such as an OSHA form, a Loan Estimate form, and an Apollo Space Program document.
  • Compare the performance and pricing of each solution to help you determine the best fit for your OCR and data extraction needs.

Let’s dive in and explore how each tool performs, and why LLMWhisperer might just be the superior choice for your next OCR project.

Here is the GitHub repository where you will find all the codes written for this article.


Amazon Textract

Amazon Textract is a powerful machine learning service provided by AWS that automates the process of extracting text and data from scanned documents and images.

Designed for OCR and document parsing, Textract goes beyond simple optical character recognition by not only reading printed text but also intelligently identifying and capturing key elements such as forms and tables.

Key functionalities of Amazon Textract include:

  • Key-Value Pair Extraction: Textract efficiently locates and extracts information from structured documents by identifying key-value pairs, which is essential for automating data entry and processing workflows.
  • Table to CSV API: For documents containing tabular data, Textract offers a specialized API that converts tables into CSV format, simplifying data analysis and integration with other systems.

These features, along with its seamless integration within the AWS ecosystem, make Textract a robust solution for organizations aiming to streamline their document processing and data extraction tasks.

Textract Setup Guide

Setting up Amazon Textract is a straightforward process that involves configuring your AWS account, preparing your documents, and integrating the service into your workflow.

Follow these steps to get started.

Create an AWS Account:

  • If you don’t already have one, sign up for an AWS account at AWS Signup.

Access the AWS Management Console:

  • Log in to the AWS Management Console and search for “Textract” in the services menu.

Configure IAM Permissions:

  • Sign in to the AWS Management Console and open the IAM console at https://console.aws.amazon.com/iam/.
  • In the navigation pane, choose Users and create a new user, you can call it ‘textract’.
  • Attach policies directly and choose ‘AmazonTextractFullAccess’ as the permission.
  • After creating, access the user and navigate to the ‘Security Credentials’ tab, and then create a new ‘Access Key’.
  • Copy the Access key and Secret access key, you will need them later on.

Install the necessary Python libraries:

  • The code examples that we will work on later in this article use the AWS Python SDK (boto3), so you will need to install it:
pip install boto3

For more detailed guidance, refer to the AWS Textract Documentation.

Evaluating Textract Using Sample Documents

To assess the performance of Amazon Textract, we applied it to several sample documents that represent common challenges in document parsing.

Key-Value Pairs (Form data)

First, let’s start by taking a look at the Python code used to extract data as key-value pairs:

# textract_kv.py

import boto3  
import sys  
from collections import defaultdict  
 

def get_kv_map(file_name):  
    # Read the PDF file as bytes  
    with open(file_name, 'rb') as file:  
        document_bytes = file.read()  
    print("Document loaded:", file_name)  

    # Initialize a boto3 client  
    client = boto3.Session(aws_access_key_id="",  
                          aws_secret_access_key="",  
                          region_name='eu-central-1').client('textract')  

    # Call analyze_document API; per AWS docs, PDFs are supported here if within limits  
    response = client.analyze_document(  
        Document={'Bytes': document_bytes},  
        FeatureTypes=['FORMS']  
    )  

 
    blocks = response['Blocks']  
    key_map = {}  
    value_map = {}  
    block_map = {}  
  
    for block in blocks:  
        block_id = block['Id']  
        block_map[block_id] = block  
        if block['BlockType'] == "KEY_VALUE_SET":  
            # Distinguish keys from values based on the 'EntityTypes' field  
            if 'KEY' in block.get('EntityTypes', []):  
                key_map[block_id] = block  
            else:  
                value_map[block_id] = block  

    return key_map, value_map, block_map  

  
def get_text(block, block_map):  
    """Extracts text from a block by concatenating text from its child relationships."""  
    text = ""  
    if block and 'Relationships' in block:  
        for relationship in block['Relationships']:  
            if relationship['Type'] == 'CHILD':  
                for child_id in relationship['Ids']:  
                    child = block_map.get(child_id)  
                    if child and child['BlockType'] == 'WORD':  
                        text += child['Text'] + ' '  
                    elif child and child['BlockType'] == 'SELECTION_ELEMENT' and child.get(  
                            'SelectionStatus') == 'SELECTED':  
                        text += 'X '  # Marking selected checkboxes  
    return text.strip()  


def find_value_block(key_block, value_map):  
    """Finds the value block associated with a given key block."""  
    if 'Relationships' in key_block:  
        for relationship in key_block['Relationships']:  
            if relationship['Type'] == 'VALUE':  
                for value_id in relationship['Ids']:  
                    return value_map.get(value_id)  
    return None  

  

  

def get_kv_relationship(key_map, value_map, block_map):  
    """Builds the key-value pairs from the key and value maps."""  
    kvs = defaultdict(list)  
    for key_id, key_block in key_map.items():  
        key_text = get_text(key_block, block_map)  
        value_block = find_value_block(key_block, value_map)  
        value_text = get_text(value_block, block_map) if value_block else ""  
        kvs[key_text].append(value_text)  
    return kvs  
  

def print_kv_pairs(kvs):  
    """Prints the key-value pairs to the console."""  
    for key, values in kvs.items():  
        print(f"{key}: {', '.join(values)}")  

def main(file_name):  
    key_map, value_map, block_map = get_kv_map(file_name)  
    kvs = get_kv_relationship(key_map, value_map, block_map)  
    print("\nExtracted Key-Value Pairs:\n")  
    print_kv_pairs(kvs)  


if __name__ == "__main__":  
    if len(sys.argv) < 2:  
        print("Usage: python script.py <PDF file>")  
        sys.exit(1)  
    main(sys.argv[1])

Don’t forget to fill in your aws_access_key_id and aws_secret_access_key.

Here’s a breakdown of the code:

  • get_kv_map(file_name):
    • Reads a PDF file and initializes a boto3 client for Amazon Textract.
    • Calls the analyze_document API to extract blocks of text from the PDF.
    • Separates blocks into key and value maps based on their types.
    • Returns three maps: key_map, value_map, and block_map.
  • get_text(block, block_map):
    • Extracts text from a block by concatenating text from its child relationships.
    • Handles both text and selection elements (e.g., checkboxes).
  • find_value_block(key_block, value_map):
    • Finds the value block associated with a given key block using relationships.
  • get_kv_relationship(key_map, value_map, block_map):
    • Builds key-value pairs from the key and value maps.
    • Uses defaultdict to handle multiple values for a single key.
  • print_kv_pairs(kvs):
    • Prints the extracted key-value pairs to the console.
  • main(file_name):
    • Orchestrates the extraction process by calling the above functions.
    • Prints the extracted key-value pairs.

You can execute the script with:

python textract_kv.py my-document.pdf

Tables to CSV

Now, let’s look at the Python code used to extract tables to a CSV file:

# textract_csv.py

import os  
import sys  
import json  
import boto3  
from io import BytesIO  
from pprint import pprint   
 

def get_text(result, blocks_map):  
    """Extracts text from a block by combining its child words and selection elements."""  
    text = ''  
    if 'Relationships' in result:  
        for relationship in result['Relationships']:  
            if relationship['Type'] == 'CHILD':  
                for child_id in relationship['Ids']:  
                    word = blocks_map.get(child_id)  
                    if word and word['BlockType'] == 'WORD':  
                        # If the word contains a comma and is numeric when commas are removed, wrap it in quotes  
                        if "," in word['Text'] and word['Text'].replace(",", "").isnumeric():  
                            text += f'"{word["Text"]}" '  
                        else:  
                            text += word['Text'] + ' '  
                    elif word and word['BlockType'] == 'SELECTION_ELEMENT':  
                        if word.get('SelectionStatus') == 'SELECTED':  
                            text += 'X '  
    return text.strip()  

 

def get_rows_columns_map(table_result, blocks_map):  
    """  
    Creates a mapping of rows and columns for a table block.    Returns a dictionary of rows with their corresponding cells' text and a list of confidence scores.    
    """    
    rows = {}  
    scores = []  
    for relationship in table_result.get('Relationships', []):  
        if relationship['Type'] == 'CHILD':  
            for child_id in relationship['Ids']:  
                cell = blocks_map.get(child_id)  
                if cell and cell['BlockType'] == 'CELL':  
                    row_index = cell['RowIndex']  
                    col_index = cell['ColumnIndex']  
                    if row_index not in rows:  
                        rows[row_index] = {}  
                    scores.append(str(cell.get('Confidence', 0)))  
                    rows[row_index][col_index] = get_text(cell, blocks_map)  
    return rows, scores  

  

def generate_table_csv(table_result, blocks_map, table_index):  
    """  
    Generates a CSV string for the table block.    
    """    
    rows, scores = get_rows_columns_map(table_result, blocks_map)  
    table_id = f'Table_{table_index}'  
    csv_output = f'Table: {table_id}\n\n'  
  
    # Create CSV rows  
    for row_index in sorted(rows.keys()):  
        row_data = rows[row_index]  
        # Sort columns by their index  
        row_text = ",".join(row_data[col_index] for col_index in sorted(row_data.keys()))  
        csv_output += row_text + "\n"  
  
    # Append confidence scores at the end  
    csv_output += "\nConfidence Scores (per cell):\n"  

    # Assuming each row has same number of columns, get the number of columns from the first row  
    if rows:  
        col_count = len(next(iter(rows.values())))  
        for i, score in enumerate(scores, start=1):  
            csv_output += score + ","  
            if i % col_count == 0:  
                csv_output += "\n"  
    csv_output += "\n\n"  
    return csv_output  


def get_table_csv_results(file_name):  
    """  
    Reads a PDF file, sends it to Textract to extract tables,    and returns a CSV string with the extracted table data.    
    """    
    # Read PDF file bytes  
    with open(file_name, 'rb') as file:  
        pdf_bytes = file.read()  
    print('PDF loaded:', file_name)  

    # Initialize a boto3 client  
    client = boto3.Session(aws_access_key_id="",  
                          aws_secret_access_key="",  
                          region_name='eu-central-1').client('textract')    

    # Call analyze_document API (PDFs are supported if within size limits)  
    response = client.analyze_document(  
        Document={'Bytes': pdf_bytes},  
        FeatureTypes=['TABLES']  
    )  

    # Optionally, print all the blocks for debugging  
    # pprint(response['Blocks'])  

    blocks = response['Blocks']  
    blocks_map = {}  
    table_blocks = []  
    for block in blocks:  
        blocks_map[block['Id']] = block  
        if block['BlockType'] == "TABLE":  
            table_blocks.append(block)  


    if not table_blocks:  
        return "<b>No table found in the document.</b>"  
  
    csv_output = ''  
    for index, table in enumerate(table_blocks, start=1):  
        csv_output += generate_table_csv(table, blocks_map, index)  
        csv_output += "\n"  
    return csv_output  


def main(file_name):  
    # Get CSV output from the PDF file  
    table_csv = get_table_csv_results(file_name)  
  
    # Write the CSV output to a file  
    output_file = 'output.csv'  
    with open(output_file, "w", encoding='utf-8') as fout:  
        fout.write(table_csv)  

    # Print a confirmation and the CSV output  
    print('CSV OUTPUT FILE:', output_file)  
    print('\nExtracted CSV Content:\n')  
    print(table_csv)  
  

if __name__ == "__main__":  
    if len(sys.argv) < 2:  
        print("Usage: python script.py <PDF file>")  
        sys.exit(1)  
    main(sys.argv[1])

Don’t forget to fill in your aws_access_key_id and aws_secret_access_key.

Here’s a detailed breakdown of the code:

  • get_text(result, blocks_map):
    • Extracts text from a block by combining its child words and selection elements.
    • Handles numeric values with commas by wrapping them in quotes.
    • Marks selected checkboxes with an “X”.
  • get_rows_columns_map(table_result, blocks_map):
    • Creates a mapping of rows and columns for a table block.
    • Returns a dictionary of rows with their corresponding cells’ text and a list of confidence scores.
  • generate_table_csv(table_result, blocks_map, table_index):
    • Generates a CSV string for the table block.
    • Includes confidence scores for each cell at the end of the CSV.
  • get_table_csv_results(file_name):
    • Reads a PDF file and sends it to Textract to extract tables.
    • Returns a CSV string with the extracted table data.
    • Initializes a boto3 client for Textract and calls the analyze_document API.
  • main(file_name):
    • Orchestrates the extraction process by calling the above functions.
    • Writes the CSV output to a file named output.csv.
    • Prints the extracted CSV content to the console.

You can execute the script with:

python textract_csv.py my-document.pdf

Below is a detailed evaluation of how Textract performs on each implementation with the different sample documents.

Test Document: OSHA Form (Operational Health and Health Administration)

You run the script with:

python textract_kv.py osha-forms-filled.pdf

or

python textract_csv.py osha-forms-filled.pdf

The file to analyze: 

This is the key-value output:

Document loaded: .\osha-forms-filled.pdf

Extracted Key-Value Pairs:

days: , , , , , , , , , , , , , , 16, , , , , , , , 24, 32, , 
(1): 
Year: 20
(4):
month/day: , /, /, 2 4, 2 / 4, , /, /, /, /, /, /, /
(2) (3):
State:
(Rev.: 01/2004)
Establishment name: Dock Logistics
City: California
Page: of
Form approved OMB no.: 1218-0176

This is the CSV output:

Identify,the person,,Describe,the case,,Classify,the,case,,,,,,,,,
(A) Case,(B) Employee's name,(C) Job title,(D) Date of injury,X (E) Where the event occurred,(F) Describe injury or illness, parts of body affected,,CHECK based 
that,ONLY ONE on the most case:,box for each serious,case outcome for,Enter the days the ill worker,number of injured or was:,Check choose,the,"Injury" one type,of,column illness:,or
no.,,(e.g., Welder),or onset of illness,(e.g., Loading dock north end),and object/substance that directly injured or made person ill (e.g., Second degree burns on,,,Remained,at Work,,,(M),,,,,
,,,,,right forearm from acetylene torch),Death,Days away from work,Job transfer or restriction,Other record- able cases,Away from work,On job transfer or restriction,,Stain,,,,
,,,,,,(G),(H),(I),(J),(K),(L),(1),(2),(3),(4),(5),(6)
3443,Roger Smith,Engineer,2 / 4,Dock yard,First degree burns in arms,,X,,,32 days,days,X,,,,,
8932,William potter,Engineer,month/day 2 4,Dock yard,laceration in the neck,,X,,,24 days,days,X,X,,,,
767,Simon Dawes,Engineer,month/day 2 / 4,Dock yard,Fractured right leg,,,X,,16 days,days,X,,,,,
,,,month/day,,,,,,,days,days,,,,,,
,,,month/day /,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,
,,,/,,,,,,,days,days,,,,,,
,,,month/day /,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,
,,,/,,,,,,,days,days,,,,,,
,,,month/day /,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,
,,,/,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,
,,,/,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,
,,,/,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,
,,,/,,,,,,,days,days,,,,,,
,,,month/day,,,,,,,,,,,,,,

Issues Encountered with Textract

Orientation Issues:  Textract appears to struggle when processing documents with vertical text. In the provided output, notice how the data under certain keys (e.g., “days:” and “month/day:”) is misaligned, with characters and numbers appearing in an unexpected order. This is indicative of Textract reorienting vertical text into a horizontal flow, which disrupts the original document layout.

Fragmented Data Extraction:  The output shows several commas and disjointed numerical entries under the “days:” field. This suggests that the tool is not consistently capturing entire columns or lines of vertical text, leading to fragmented outputs. Such misalignment can cause serious issues when the extracted data is used for downstream processing or decision-making, as it may no longer accurately reflect the intended values or structure.

Loss of Context: Vertical text often provides contextual or structural information that is vital in forms and tables. When Textract fails to maintain the correct orientation, the relationship between different data elements can be lost, making it difficult to determine which numbers or words belong together.

Inconsistent Detection: The output reveals that checkboxes are not being reliably distinguished. Textract sometimes fails to clearly mark whether a checkbox is selected or not. For instance, while a selected checkbox should ideally be represented by a distinct symbol (such as an “X”), the output may simply show extraneous commas or blank spaces, leading to ambiguity.

Test Document: Loan Estimate Form

The file to analyze:

You run the script with:

python textract_kv.py loan-estimate-filled.pdf

or

python textract_csv.py loan-estimate-filled.pdf

This is the key-value output:

Document loaded: .\loan-estimate-filled.pdf

Extracted Key-Value Pairs:

Mortgage Insurance:

YES,: X

Monthly Principal & Interest See Projected Payments below for: $761.78

Estimated Taxes, Insurance & Assessments: $206 a month

PRODUCT: Fixed Rate

SALE PRICE: $180,000

NO:

Conventional: X

DATE ISSUED: 2/15/2013

FHA:

VA:

APPLICANTS: Michael Jones and Mary Stone 123 Anywhere Street Anytown, ST 12345

PROPERTY: 456 Somewhere Avenue Anytown, ST 12345

LOAN ID #: 123456789, 123456789

Homeowner's Insurance: X

Estimated Cash to Close: $16,054 Includes Closing Costs. See Calculating Cash to Close on page 2 for details.

LOAN TERM: 30 years

PURPOSE: Purchase

Property Taxes: X

Estimated Closing Costs: $8,054 Includes $5,672 in Loan Costs + $2,382 in Other Costs - $0 in Lender Credits. See page 2 for details.

Prepayment Penalty:

Loan Amount: $162,000

Does the loan have these features?: YES As high as $3,240 if you pay off the loan during the first 2 years

In escrow?: YES YES

Interest Rate: 3.875%

Balloon Payment: NO

PAGE: 1 OF 3

Estimated Total Monthly Payment: $1,050 $968

Visit: www.consumerfinance.gov/mortgage-estimate

Other::

This is the CSV output:

LOAN TERM,30 years  

PURPOSE,Purchase  

PRODUCT,Fixed Rate  

LOAN TYPE,X Conventional FHA VA  

LOAN ID #,123456789  

RATE LOCK,X NO YES, until 4/16/2013 at 5:00 p.m. EDT Before closing, your interest rate, points, and lender credits can change unless you lock the interest rate. All other estimated closing costs expire on 3/4/2013 at 5:00 p.m. EDT  

  

DATE ISSUED,2/15/2013  

APPLICANTS,Michael Jones and Mary Stone 123 Anywhere Street Anytown, ST 12345  

PROPERTY,456 Somewhere Avenue Anytown, ST 12345  

SALE PRICE,$180,000  

  

Loan Terms,,Can this amount increase after closing?  

Loan Amount,$162,000,NO  

Interest Rate,3.875%,NO  

Monthly Principal & Interest See Projected Payments below for your Estimated Total Monthly Payment,$761.78,NO  

Prepayment Penalty,,Does the loan have these features? YES As high as $3,240 if you pay off the loan during the first 2 years  

Balloon Payment,,NO  

  

Payment Calculation,Years 1-7,Years 8-30  

Principal & Interest,$761.78,$761.78  

Mortgage Insurance,+ 82,+ -  

Estimated Escrow Amount can increase over time,+ 206,+ 206  

Estimated Total Monthly Payment,$1,050,$968  

  

Estimated Closing Costs,$8,054 Includes $5,672 in Loan Costs + $2,382 in Other Costs - $0 in Lender Credits. See page 2 for details.  

Estimated Cash to Close,$16,054 Includes Closing Costs. See Calculating Cash to Close on page 2 for details.

Issues Encountered with Textract

Omitted Data:  Some critical fields are either completely missing or only partially extracted. For instance, fields such as “FHA:” and “VA:” appear in the output without any associated values—even if they were meant to be part of the form’s complete data set.

Empty or Blank Fields:  The key-value output shows entries like “Prepayment Penalty:” with no value, and other fields like “NO:” are extracted as stand-alone keys without clear context, indicating that certain portions of the form are not being captured.

Misalignment Between Labels and Values:  In an ideal scenario, each form label should directly correspond to its value. However, the output shows that elements such as “Mortgage Insurance:” are followed by “YES,: X” where the additional symbols or characters (like the extra comma or “X”) cause confusion regarding the intended response.

Discrepancies in CSV Output: The CSV output, which is supposed to provide a clear and organized representation of the table data, contains fields where critical details are missing or not clearly separated. For example, the “RATE LOCK” row appears to merge multiple pieces of information into one entry, making it unclear how the values should be interpreted.

Fragmented Data Across Multiple Columns:  Some rows in the CSV output include extra columns with ambiguous or misplaced values. This fragmentation indicates that Textract is having difficulty recognizing the boundaries between different form elements, resulting in a disjointed data structure.

Test Document: Apollo Space Program Document

The file to analyze:

You run the script with:

python textract_kv.py apollo-documents-vertical-text.pdf

or

python textract_csv.py apollo-documents-vertical-text.pdf

This is the key-value output:

Document loaded: .\apollo-documents-vertical-text.pdf

Extracted Key-Value Pairs:

Phase: ,

Figure:

- Status Unknown:

I - Initiated:

U:

C: Complete

3-4.: Apollo-Saturn 201 Vehicle Reliability and Quality Program Status

This is the CSV output:

NPC-500-5,,Engines,,,Booster,,CSM  

Program Elements,,H-1,J-2,S-IB,S-IVB,IU,SLA  

Reliability Goals R&QA Plan Reliability Predictions,Conceptual Phase,C U C,C I I,C I C,C I C,C I C,C C I  

Apportionments FMEA's Specification Reliability Req. Mission Profile Human Eng. and Maint. Parts and Materials Test Requirements,Design Phase,U C C C I I C,U C C C C I C,I C C C C I C,C C C I I I C,I I I C I I C,C C C C U U C  

Change Control Critical Items FR's and Corrective Action,Development Phase,C C U,C C I,C C C,C C I,C I I,C I I  

Reliability Assessments MRB Configuration Control Program Reviews Contractor Audits by Center,Fabrication Phase,I C C U I,I I C U I,U C C I I,U C C I I,U I C I I,I I C I I  

Qualification Tests Qual. Status List Reliability Demo. Test EI Accept. Tests Checkout Equipment Logs Buy-Off,Ground Test Phase,C U I C C I I,I I I C I I I,I C I C I U U,I U U I I U U,I I U I I U U,I I U I U U U

Issues Encountered

Loss of Original Formatting:  In the Apollo Space Program Document, the table was originally designed with vertical text intended to serve as headers or labels. Textract reorients this vertical text into a horizontal format, thereby disrupting the intended structure of the table. As a result, what were clearly distinct headings in the original layout became jumbled with the rest of the data, causing misalignment and fragmentation of information.

Ambiguity and Misinterpretation of Data:  The reorientation of vertical text leads to ambiguity in the extracted output. For example, key headings such as “Phase,” “Figure,” and various status indicators become misaligned. In the key-value output, we see incomplete or misplaced values (e.g., “Phase: ,” and “I – Initiated:”), while the CSV output shows rows of data that no longer clearly correspond to their original headers. This misinterpretation can confuse downstream applications or human reviewers who depend on the precise mapping of headers to their associated data.

Loss of Spatial Context:  Vertical text in tables often conveys crucial spatial context by delineating different sections or columns. When Textract rotates this text horizontally, it loses this spatial cue, leading to difficulties in distinguishing between different categories. The reoriented data, as seen in the CSV output, can result in merged or misplaced column headers and values, ultimately undermining the integrity of the dataset.

Key Features of Amazon Textract Explored via Sample Documents

These were the key features that we checked in analyzing the sample documents:

  • Forms: Textract’s ability to extract key-value pairs is evident, though layout nuances can sometimes be lost.
  • Tables: While the Table to CSV API is effective, orientation issues (as seen in the Apollo document) reveal limitations in handling complex layouts.
  • General Text: Basic text extraction works reliably; however, formatting details (such as vertical text) may require additional processing or post-extraction adjustments.

These evaluations demonstrate that while Amazon Textract offers robust OCR capabilities, certain document types—especially those with unique formatting challenges—can lead to extraction inaccuracies.

Strengths and Weaknesses of Amazon Textract

Amazon Textract is a robust and scalable solution for OCR and document parsing, but like any tool, it comes with its unique advantages and limitations.

Understanding these is key to determining when and how to best use Textract for your document processing needs.

Strength of TextractWeakness of Textract
Robust APIs and Seamless AWS Integration: Textract offers well-documented and powerful APIs that integrate effortlessly with other AWS services. This seamless integration streamlines the development process and makes it easier to incorporate Textract into existing AWS-centric infrastructures.Layout Preservation Challenges: Despite its strong OCR capabilities, Textract sometimes struggles to maintain the original document layout—especially with complex formatting—leading to inaccuracies that can disrupt workflows reliant on precise layout information.
Scalability for Enterprise Needs: Built to handle high volumes of documents, Textract scales efficiently for enterprise environments, ensuring reliable performance whether processing hundreds or millions of documents.Cost Implications: Textract’s pricing structure, particularly when dealing with forms and tables, can add up quickly, making it a significant consideration for organizations with large-scale or highly detailed document processing needs.
Specific Use-Cases Where Textract Underperforms: Certain document formats, such as those with vertical text or intertwined forms and tables, can lead to extraction errors that require additional manual adjustments or supplemental processing.

When to Use Textract

These are some situations where using Textract can be recommended:

Ideal for AWS-Centric Environments:  If your infrastructure is built on AWS, Textract’s seamless integration makes it a natural choice for automating document processing workflows.

Standard Document Formats and High-Volume Processing:  Textract works best with documents that have conventional layouts. For large-scale processing of standard documents, its scalability and robust API offerings provide significant advantages.

Scenarios Requiring Enterprise-Level Automation:  When your organization needs to automate document parsing at scale, Textract’s enterprise-grade performance and reliability can be a strong asset.

While Amazon Textract is a powerful tool with impressive scalability and integration capabilities, it may face challenges with layout preservation and certain specialized document formats.

It excels in environments with standard documents and robust AWS integration, but if you need precise layout retention or are dealing with complex document structures, you might need to consider alternative solutions or additional processing steps.

The next section will explore how LLMWhisperer addresses these challenges to offer a more refined solution.


Introducing LLMWhisperer as the Amazon Textract Alternative

LLMWhisperer is an innovative document parsing tool designed with a modern approach to handling OCR challenges, particularly in environments that rely on large language models (LLMs) for data extraction.

Here’s what sets it apart.

Core Philosophy and Capabilities: LLMWhisperer is built on the idea of maximizing layout preservation and enhancing parsing accuracy by leveraging advanced LLM techniques. Unlike traditional OCR tools, it focuses on understanding the context and structure of documents, ensuring that even complex layouts—such as vertical text or intertwined forms and tables—are maintained accurately in the extracted data.

Development Motivation: LLMWhisperer was developed as an alternative to solutions like Amazon Textract to address specific shortcomings. While Textract offers robust OCR capabilities, it often struggles with preserving document layouts and managing cost-intensive tasks such as form and table extraction.

LLMWhisperer’s Document Layout Preservation

Key Features and Advantages

LLMWhisperer is engineered to tackle the shortcomings found in traditional OCR solutions, making it particularly effective in modern, LLM-driven workflows.

Its standout features include:

Layout Preservation: LLMWhisperer excels at maintaining the original document structure, which is crucial for downstream LLM-based extraction workflows. This superior layout retention ensures that the context, formatting, and hierarchy of the document remain intact, leading to more accurate and meaningful data extraction.

Cost Efficiency: One of the key advantages of LLMWhisperer is its cost-effectiveness, particularly when dealing with forms and tables. By offering a more affordable solution compared to alternatives like Textract, LLMWhisperer makes it economically viable to process large volumes of complex documents without compromising on quality.

Enhanced Parsing: LLMWhisperer offers advanced parsing capabilities that handle challenging scenarios with ease. It is adept at processing vertical text and making a clear distinction between forms and tables, areas where traditional OCR tools often struggle. This enhanced parsing ability minimizes errors and reduces the need for post-processing corrections, streamlining your document extraction workflows.

Together, these features make LLMWhisperer a compelling choice for businesses looking to optimize their document processing, ensuring that both accuracy and cost-effectiveness are achieved in every workflow.

Targeted Use-Cases of LLMWhisperer

LLMWhisperer is specifically designed to address the challenges that traditional OCR solutions like Amazon Textract face, making it ideal in several scenarios:

LLM-Based Extraction Workflows: When downstream processes rely on large language models, maintaining the exact layout and structure of the original document is critical. LLMWhisperer preserves contextual and formatting details, ensuring that LLMs have a reliable foundation to work from.

Complex Document Formats: For documents featuring vertical text, intertwined forms and tables, or non-standard layouts, LLMWhisperer excels in accurately parsing and preserving the intended format. This capability minimizes errors that often arise when these elements are misinterpreted by conventional OCR tools.

Cost-Sensitive Environments: In situations where processing large volumes of forms and tables is necessary, costs can quickly escalate. LLMWhisperer’s cost efficiency makes it a compelling choice, offering high-quality extraction without the premium price tag associated with some traditional solutions.

Specialized Business Documents: Industries that deal with specialized or regulatory documents—such as healthcare, finance, or government forms—often require meticulous layout preservation to ensure data accuracy and compliance. LLMWhisperer’s enhanced parsing capabilities are particularly well-suited for these use cases, providing clear differentiation between similar-looking elements like forms versus tables.

Scenarios Requiring Minimal Post-Processing: When extracted data needs to be immediately actionable, reducing the need for extensive post-processing is crucial. By delivering more precise and well-structured outputs, LLMWhisperer reduces the time and effort required to clean and reformat data after extraction.

In these targeted use cases, LLMWhisperer not only addresses the shortcomings of existing tools but also enhances overall efficiency and reliability, making it a superior choice for modern OCR and document parsing needs.

LLMWhisperer Setup Process

Getting started with LLMWhisperer is straightforward, whether you’re integrating it into an existing application or using it as a standalone tool.

Follow these steps to set up LLMWhisperer and begin processing your documents.

Sign Up and Obtain API Credentials:

  • Register: Visit the LLMWhisperer website and create an account.
  • API Key: Once registered, navigate to your dashboard to generate and retrieve your API key. This key will be required to authenticate all API calls.

Install the LLMWhisperer Python SDK:

pip install llmwhisperer-client

By following these steps, you can quickly set up and integrate LLMWhisperer into your document processing workflow, allowing you to take full advantage of its superior layout preservation, cost efficiency, and enhanced parsing capabilities.

Evaluating LLMWhisperer Using Sample Documents

To determine how well LLMWhisperer performs in real-world scenarios, we applied it to the same set of sample documents seen before.

First, let’s take a look at the Python code, which uses the LLMWhisperer new V2 API:

# llmwhisperer_example.py

import sys  
from unstract.llmwhisperer import LLMWhispererClientV2  
from unstract.llmwhisperer.client_v2 import LLMWhispererClientException  


# Define a function to process a document  
def process_document(file_path):  
    # Provide the base URL and API key explicitly  
    client = LLMWhispererClientV2(base_url="https://llmwhisperer-api.us-central.unstract.com/api/v2",api_key="")  

    try:  
        # Process the document  
        result = client.whisper(  
                    file_path=file_path,  
                    wait_for_completion=True,  
                    wait_timeout=200,  
                )  
        # Print the extracted text  
        print(result['extraction']['result_text'])  
    except LLMWhispererClientException as e:  
        print(e)  

  

# Main function  
if __name__ == "__main__":  
    if len(sys.argv) < 2:  
        print("Usage: python script.py <PDF file>")  
        sys.exit(1)  
    process_document(sys.argv[1])

Don’t forget to replace api_key with your own API key.

Here’s a breakdown of what the code does:

  • process_document(file_path): This function takes a file path as an argument and processes the document located at that path.
    • It initializes an LLMWhispererClientV2 object with a specified base URL and API key.
    • It attempts to process the document using the whisper method of the client, waiting for the completion of the process with a timeout of 200 seconds.
    • If successful, it prints the extracted text from the document.
    • If an exception occurs (specifically LLMWhispererClientException), it catches the exception and prints the error message.

You can execute the script with:

python llmwhisperer_example.py my-document.pdf

Below is a detailed evaluation of how LLMWhisperer performs on each implementation with the different sample documents.

OSHA Form (Operational Health and Health Administration)

You run the script with:

python llmwhisperer_example.py osha-forms-filled.pdf

The file to analyze:

This is output:

LLMWhisperer Output Analysis

Observation: LLMWhisperer has demonstrated exceptional capability in preserving the original layout of the OSHA form. The vertical text is rendered in its intended orientation, with no evidence of the reflow or misalignment issues observed in other OCR tools. Additionally, checkboxes are distinctly recognized and correctly marked, ensuring that both selected and unselected states are clearly identifiable.

Details: The output from LLMWhisperer mirrors the original form structure with remarkable fidelity. Every element—from form labels and data fields to the orientation of vertical text and checkboxes—is accurately captured. The preservation of vertical text is particularly noteworthy, as it maintains the spatial and contextual relationships essential for accurate data interpretation. Furthermore, the checkboxes are processed with precision, eliminating ambiguity that can arise from misinterpreted selections. This high level of accuracy minimizes the need for manual corrections and ensures that any subsequent data extraction or analysis is based on reliable and well-structured information.

Live coding session on data extraction from a scanned PDF form with LLMWhisperer

You can also watch this live coding webinar where we explore all the challenges involved in scanned PDF parsing. We’ll also compare the capabilities of different PDF parsing tools to help you understand their strengths and limitations.

Test Document: Loan Estimate Form

You run the script with:

python llmwhisperer_example.py loan-estimate-filled.pdf

The file to analyze:

This is the output:

FICUS BANK  

4321 Random Boulevard · Somecity, ST 12340                  Save this Loan Estimate to compare with your Closing Disclosure.  

  

Loan Estimate                                               LOAN TERM    30 years  

                                                            PURPOSE      Purchase  

DATE ISSUED   2/15/2013                                     PRODUCT      Fixed Rate  

APPLICANTS    Michael Jones and Mary Stone                  LOAN TYPE   [X]  Conventional [ ] FHA [ ] VA [ ]  

              123 Anywhere Street                           LOAN ID #    123456789  

              Anytown, ST 12345                             RATE LOCK    [ ] NO [X]  YES, until 4/16/2013 at 5:00 p.m. EDT  

PROPERTY      456 Somewhere Avenue                                       Before closing, your interest rate, points, and lender credits can  

              Anytown, ST 12345                                          change unless you lock the interest rate. All other estimated  

SALE PRICE    $180,000                                                   closing costs expire on 3/4/2013 at 5:00 p.m. EDT  

  

 Loan Terms                                                 Can this amount increase after closing?  

  

 Loan Amount                        $162,000                NO  

  

 Interest Rate                      3.875%                  NO  

  

 Monthly Principal & Interest       $761.78                 NO  

 See Projected Payments below for your  

 Estimated Total Monthly Payment  

  

                                                            Does the loan have these features?  

  

 Prepayment Penalty                                         YES     . As high as $3,240 if you pay off the loan during the  

                                                                      first 2 years  

  

 Balloon Payment                                            NO  

  

 Projected Payments  

  

 Payment Calculation                               Years 1-7                                   Years 8-30  

  

  Principal & Interest                              $761.78                                      $761.78  

  

  Mortgage Insurance                          +        82                                  +  

  

  Estimated Escrow                            +       206                                  +       206  

  Amount can increase over time  

  

  Estimated Total  

  Monthly Payment                                  $1,050                                        $968  

  

                                                         This estimate includes                      In escrow?  

                                                         [X] Property Taxes                          YES  

 Estimated Taxes, Insurance         $206                                                             YES  

                                                         [X]  

 & Assessments                                              Homeowner's Insurance  

 Amount can increase over time      a month              [ ] Other:  

                                                         See Section G on page 2 for escrowed property costs. You must pay for other  

                                                         property costs separately.  

  

 Costs at Closing  

  

 Estimated Closing Costs            $8,054         Includes $5,672 in Loan Costs + $2,382 in Other Costs - $0  

                                                   in Lender Credits. See page 2 for details.  

  

 Estimated Cash to Close            $16,054        Includes Closing Costs. See Calculating Cash to Close on page 2 for details.  

  

                 Visit www.consumerfinance.gov/mortgage-estimate for general information and tools.  

LOAN ESTIMATE                                                                                PAGE 1 OF 3 . LOAN ID # 123456789  

<<<

LLMWhisperer Output Analysis

Observation:  LLMWhisperer demonstrates outstanding performance by capturing every form element with comprehensive detail. In the output, every label—from headers to sub-sections—is present, and all corresponding values are accurately extracted. Unlike other OCR solutions that might leave gaps or omit critical fields, LLMWhisperer ensures that no element is missing, preserving the logical relationship between labels and their associated values.

Details:  The output provides a complete mapping of the form, reflecting the structure and hierarchy of the original document. Each data point is not only extracted but also aligned correctly with its corresponding label. For example, financial figures, dates, and text entries are all positioned exactly as they appear on the original form. This meticulous attention to detail means that numerical data retains its formatting (such as currency symbols and decimal points) and text fields are fully captured without truncation. The spatial and contextual alignment of the extracted elements mirrors the original layout, making it easier to verify and validate the data. Furthermore, the output shows that checkboxes and other selection elements are clearly marked, providing an unambiguous representation of user inputs. This level of accuracy significantly reduces the need for any manual correction or post-processing, ensuring that the data is immediately ready for downstream applications.

Test Document: Apollo Space Program Document

You run the script with:

python llmwhisperer_example.py apollo-documents-vertical-text.pdf

The file to analyze:

This is the output:

 The overall summary of reliability and quality status on those items of flight hardware  

          which have been designated for the Apollo-Saturn 201 Mission appears in Figure 3-4.  

          The measurement yardstick used as a base is derived from the phased program ele-  

         ments of NPC 500-5, "Apollo Reliability and Quality Assurance Program Plan" (2).  

  

                             NPC-500-5                       Engines     Booster      CSM  

  

                                                                                      SLA  

                         Program Elements                   H-1 J-2 S-IB S-IVB IU  

  

          Reliability Goals                  Conceptual      C    C    C      C    C    C  

          R&QA Plan                           Phase          U     I   I      I    I    C  

          Reliability Predictions                            C     I   C      C    C    I  

  

          Apportionments                                     U    U     I     C    I    C  

           FMEA's                                            C    C    C      C    I    C  

          Specification Reliability Req.                     C    C    C      C    I    C  

          Mission Profile                    Design          C    C    C      I    C    C  

          Human Eng. and Maint.               Phase           I   C    C       I   I    U  

           Parts and Materials                                I    I    I      I   I    U  

           Test Requirements                                      C    C      C    C    C  

  

           Change Control                                    C    C    C      C    C    C  

           Critical Items                    Development      C   C    C      C    I    I  

           FR's and Corrective Action         Phase          U     I   C       I    I   I  

  

           Reliability Assessments                            I    I   U      U    U    I  

           MRB                               Fabrication     C     I   C      C     I   I  

           Configuration Control              Phase           C   C    C      C    C    C  

           Program Reviews                                    U   U     I      I    I   I  

           Contractor Audits by Center                        I    I    I      I    I   I  

  

          Qualification Tests                                 C    I    I      I    I   I  

           Qual. Status List                                  U    I   C      U     I    I  

           Reliability Demo. Test            Ground Test      I    I    I     U    U    U  

           EI Accept. Tests                   Phase           C    C   C       I    I   I  

           Checkout                                           C    I    I      I    I   U  

                                                                                   U    U  

           Equipment Logs                                     I    I    U      U  

           Buy-Off                                            I    I    U      U   U    U  

  

          Key  

            C - Complete  

            I - Initiated  

            U - Status Unknown  

  

              Figure 3-4. Apollo-Saturn 201 Vehicle Reliability and Quality Program Status  

  

                                                                                        3-3  

<<<

LLMWhisperer Output Analysis

Observation: This document features tables that incorporate vertical text—a format that many OCR tools struggle to interpret accurately. While other solutions often rotate or misinterpret such text, LLMWhisperer successfully recognizes and maintains the vertical orientation as it appears in the original document. This capability is crucial, as vertical text typically serves as headers or key labels that define the structure and meaning of the table.

Details: The extracted table produced by LLMWhisperer retains the exact structure of the original document. The vertical text elements remain correctly oriented, preserving their spatial relationship to the corresponding data cells. This means that headers, labels, and any text presented vertically are not rotated or misplaced, ensuring that the hierarchy and categorization within the table are clear. As a result, the risk of misinterpretation or data misalignment is greatly reduced. The precise preservation of vertical text enables accurate mapping of each column and row, allowing for seamless integration of the extracted data into further processing workflows.

Key Features of LLMWhisperer Explored via Sample Documents

With the analysis of the sample documents, we tested some of the key features of LLMWhisperer:

Layout Preservation: Across all tested documents, LLMWhisperer consistently maintains the original document structure. This is crucial for workflows relying on accurate layout information for LLM-based processing.

Enhanced Parsing: The tool demonstrates an advanced understanding of complex formats, effectively differentiating between forms and tables and accurately processing vertical text.

Consistency: Regardless of the document’s complexity, LLMWhisperer delivers consistent, high-quality outputs, reducing the need for manual corrections and additional post-processing.

These evaluations clearly illustrate that LLMWhisperer not only meets but exceeds expectations in preserving document integrity and handling challenging formats, making it a standout choice for modern OCR and document parsing workflows.


Comparative Analysis: Textract vs. LLMWhisperer

Performance on Sample Test Documents

Test DocumentText Extraction ServiceObservation
OSHA FormTextractTextract struggles with vertical text, often misinterpreting its orientation. Additionally, the tool sometimes mishandles checkboxes, leading to ambiguity in the extracted data.
LLMWhispererLLMWhisperer demonstrates superior handling of the OSHA form. It accurately preserves the vertical text orientation and clearly distinguishes checkboxes, ensuring that every element of the form is captured as intended.
Loan Estimate FormTextractWhen processing the Loan Estimate form, Textract occasionally misses key form elements. This incomplete extraction results in gaps where some field labels and values are not properly recognized or associated.
LLMWhispererLLMWhisperer excels with the Loan Estimate form by capturing all form elements with high accuracy. Its output reflects a complete and organized mapping of the form, maintaining the necessary associations between labels and their values.
Apollo Space Program DocumentTextractFor the Apollo Space Program document, which includes tables with vertical text, Textract tends to reorient the vertical text into a horizontal layout. This alteration can disrupt the intended structure of the data, making it harder to interpret.
LLMWhispererLLMWhisperer effectively parses the Apollo document by correctly preserving the vertical orientation of text within tables. This accurate extraction maintains the integrity of the document’s original layout, ensuring that the data is both reliable and immediately usable.

In summary, while Textract delivers robust functionality for standard document processing, LLMWhisperer clearly outperforms it in scenarios involving complex layouts—particularly with vertical text and intricate form structures—making it the more reliable choice for precise and context-aware document parsing.

Textract vs. LLMWhisperer: Pricing Comparison

When evaluating OCR solutions like Textract and LLMWhisperer, pricing plays a crucial role, especially for organizations processing large volumes of documents or working with complex layouts.

Below is a detailed breakdown of their cost structures and an analysis of their cost-effectiveness:

ServicePricing ModelEstimated Cost per 1,000 PagesAdditional Costs
TextractPay-as-you-go, based on pages processedVaries by document type (e.g., plain text, forms, tables)Extra fees for form and table analysis
LLMWhispererSubscription-based or pay-per-useGenerally lower for structured documentsNo additional cost for layout preservation

Cost-Effectiveness Analysis

Textract: While Textract offers strong integration with AWS services, its pricing can quickly add up, particularly for complex documents containing forms, tables, or vertical text. Organizations processing large-scale document batches may find these costs significant.

LLMWhisperer: LLMWhisperer provides a more cost-effective alternative for structured documents, as it does not charge extra for maintaining layout integrity. This makes it a better choice for businesses dealing with complex document structures on a regular basis.

Best Use Cases Based on Pricing

Textract is ideal for: Organizations already embedded in the AWS ecosystem that prioritize seamless integration over cost.

LLMWhisperer is better suited for: Businesses handling high volumes of structured documents that require precise layout preservation at a predictable or lower cost.

Overall, while Textract remains a competitive solution within AWS, LLMWhisperer may offer better value for organizations needing accurate, layout-preserving OCR without excessive additional costs.

Textract vs. LLMWhisperer: Overall Evaluation

Both Textract and LLMWhisperer are powerful OCR solutions, but they excel in different areas.

While Textract offers deep integration with AWS and a solid foundation for text extraction, LLMWhisperer outperforms it in critical aspects such as layout preservation, structured data extraction, and handling of complex document formats. These advantages make LLMWhisperer the preferable Textract alternative.

Key Areas of Parity

Basic OCR Performance: Both tools effectively extract standard printed text from documents.

Scalability: Each solution is designed to handle high-volume document processing, making them viable for enterprise use.

API Availability: Both provide well-documented APIs, allowing developers to integrate them into existing workflows.

Where LLMWhisperer Provides a Significant Edge Over Textract

Layout Preservation: Unlike Textract, which often distorts tables and vertical text, LLMWhisperer accurately maintains the original document structure.

Form and Table Extraction: LLMWhisperer ensures that labels, values, and checkboxes are correctly associated, reducing errors in structured data extraction.

Vertical Text Handling: While Textract frequently misinterprets vertical text by reorienting it, LLMWhisperer retains it as intended, improving accuracy.

Cost-Effectiveness: For businesses working with structured documents, LLMWhisperer provides better value by avoiding additional fees for preserving complex layouts.

If you want to quickly take it for test drive, you can checkout our free playground.

PDF hell and practical RAG applications

Extracting text from PDFs for use in applications like Retrieval-Augmented Generation (RAG) and Natural Language Processing (NLP) is challenging due to their fixed layout structure and lack of semantic organization.

Learn how faithfully reproducing the structure of tables and other elements by preserving the original PDF layout is crucial for better performance in LLM and RAG applications.

Learn more →

🌍 Better Textract Alternative: Related Reads

1) Convert invoice PDFs to text: An LLMWhisperer guide

2) Guide to parsing bank statement PDF to text

3) Guide to PDF to text parsing in a resume