April 29, 2025
Nuno Bispo

From Inbox to Database: Automating Document Extraction with Unstract + n8n

Introduction

Accounting firms today are inundated with a continual stream of client documents arriving via email, from vendor invoices to specialized tax-form submissions.

Each morning, staff must sift through dozens or even hundreds of messages to find billable attachments, then painstakingly download and transcribe data into spreadsheets or accounting systems.

This repetitive manual process not only diverts skilled accountants from high-value analysis but also creates bottlenecks that slow turnaround times and inflate labor costs, all while introducing the risk of typos and mis-mapped fields that can cascade into reporting errors and compliance headaches.

n8n addresses the orchestration challenge by providing a self-hosted, open-source workflow automation platform.

Its visual, drag-and-drop interface lets you connect Gmail triggers, branching logic, custom JavaScript (or Python) functions, HTTP request nodes, and database clients into a single, end-to-end pipeline.

With n8n’s powerful branching and retry capabilities, you can filter incoming emails, parse subject lines to identify form types, and route attachments through the appropriate processing paths, all under your data governance.

Complementing n8n’s orchestration is Unstract’s custom extraction API. Within Unstract’s Prompt Studio, you build and train form-specific extractors—mapping fields like invoice numbers, dates, line-item details, or tax-form codes—then deploy them as REST endpoints.

When n8n submits a PDF or scanned image to Unstract, the API returns structured JSON that follows the fields you defined and is ready for insertion into your cloud database.

By combining n8n’s trigger-and-branch logic with Unstract’s intelligent document parsing, you eliminate manual keystrokes, accelerate processing, and reduce transcription errors on the way from inbox to database.

Use Case Deep Dive: Accounting Firm Tax Forms

Scenario Description

Imagine an accounting firm that receives a steady stream of tax-related documents via email, each clearly labelled in the subject line with its form type—say, Form 1040 for income tax and Form 990 for tax exemption.

For example, an incoming message might read “Adam Scott – Form 1040 – Income Tax” or “Acme Corp – Form 990 – Tax exemption”.

Although both arrive as PDF attachments, each form type has its own unique set of fields: Form 1040 has income lines, whereas Form 990 has revenue lines.

Distinguishing these immediately allows us to tailor the downstream processing precisely to each document’s schema.

Desired Workflow Outcomes

The goal is a fully automated pipeline that, based solely on the email subject, auto-routes each attachment into the correct extraction workflow and then persists the resulting structured JSON into separate database tables.

Form 1040 documents should flow through the “Form 1040” extractor and land in the form-1040 table, with columns for each income line item.

Form 990 files should invoke the “Form 990” extractor and populate the form-990 table, capturing revenue lines.

By isolating each form’s data in its table, we maintain clear separation of schemas, simplify analytics, and ensure that reporting or reconciliation queries never have to sift through irrelevant fields.

Prerequisites & Setup

n8n Self-Hosted Installation

To get full control over your data and meet security requirements, you can deploy n8n on your own infrastructure using Docker.

Although it is possible to deploy it with a single Docker command and the built-in SQLite database, the best and most reliable way to deploy it is with Postgres and a separate worker.

That can be easily achieved by creating a Docker Compose file:

version: '3.8'

volumes:
  db_storage:
  n8n_storage:
  redis_storage:

x-shared: &shared
  restart: always
  image: docker.n8n.io/n8nio/n8n
  environment:
    - DB_TYPE=postgresdb
    - DB_POSTGRESDB_HOST=postgres
    - DB_POSTGRESDB_PORT=5432
    - DB_POSTGRESDB_DATABASE=${POSTGRES_DB}
    - DB_POSTGRESDB_USER=${POSTGRES_NON_ROOT_USER}
    - DB_POSTGRESDB_PASSWORD=${POSTGRES_NON_ROOT_PASSWORD}
    - EXECUTIONS_MODE=queue
    - QUEUE_BULL_REDIS_HOST=redis
    - QUEUE_HEALTH_CHECK_ACTIVE=true
    - N8N_ENCRYPTION_KEY=${ENCRYPTION_KEY}
  links:
    - postgres
    - redis
  volumes:
    - n8n_storage:/home/node/.n8n
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:16
    restart: always
    environment:
      - POSTGRES_USER
      - POSTGRES_PASSWORD
      - POSTGRES_DB
      - POSTGRES_NON_ROOT_USER
      - POSTGRES_NON_ROOT_PASSWORD
    volumes:
      - db_storage:/var/lib/postgresql/data
      - ./init-data.sh:/docker-entrypoint-initdb.d/init-data.sh
    healthcheck:
      test: ['CMD-SHELL', 'pg_isready -h localhost -U ${POSTGRES_USER} -d ${POSTGRES_DB}']
      interval: 5s
      timeout: 5s
      retries: 10

  redis:
    image: redis:6-alpine
    restart: always
    volumes:
      - redis_storage:/data
    healthcheck:
      test: ['CMD', 'redis-cli', 'ping']
      interval: 5s
      timeout: 5s
      retries: 10

  n8n:
    <<: *shared
    ports:
      - 5678:5678

  n8n-worker:
    <<: *shared
    command: worker
    depends_on:
      - n8n

This Docker Compose file defines three named volumes—db_storage, n8n_storage, and redis_storage—to persist database, n8n, and Redis data across container restarts.

Using a shared YAML anchor (&shared), it standardizes configuration for both n8n and n8n-worker services: always-restart policies, the official docker.n8n.io/n8nio/n8n image, and environment variables that point n8n to a PostgreSQL backend, enable queued execution via Redis, and secure credentials with an encryption key.

It also mounts n8n_storage into each n8n container at /home/node/.n8n so workflows, credentials, and logs aren’t lost when containers restart.

The postgres service runs postgres:16, uses db_storage for its data directory, and runs the mounted init-data.sh script the first time the data directory is initialized, so that script must exist next to the Compose file. A healthcheck (pg_isready) ensures n8n only starts once PostgreSQL is accepting connections.

Meanwhile, the redis service runs redis:6-alpine with its own volume (redis_storage) and a redis-cli ping healthcheck, guaranteeing the queue backend is ready before any jobs are enqueued.

Finally, two n8n services inherit the shared settings:

n8n exposes port 5678 for the web UI and API, linking to Redis and PostgreSQL once they’re healthy.
n8n-worker, launched with command: worker, handles background job processing separately from the main web process.

This separation of web and worker processes, combined with persistent volumes and health-checked dependencies, delivers a robust, production-ready n8n deployment.

You will also need to create a .env file:

POSTGRES_USER=changeUser
POSTGRES_PASSWORD=changePassword
POSTGRES_DB=n8n

POSTGRES_NON_ROOT_USER=changeUser
POSTGRES_NON_ROOT_PASSWORD=changePassword

ENCRYPTION_KEY=changeEncryptionKey

Make sure to replace these placeholder values with your own credentials before starting the stack.
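
For the ENCRYPTION_KEY value, any sufficiently long random string should do. As a minimal sketch, the snippet below uses Python's standard secrets module to produce one; the helper name generate_encryption_key and the 32-byte length are assumptions made for illustration, not an n8n requirement.

```python
import secrets


def generate_encryption_key(num_bytes: int = 32) -> str:
    """Return a random hex string (two hex characters per byte)."""
    return secrets.token_hex(num_bytes)


# Print a line that can be pasted into the .env file.
print(f"ENCRYPTION_KEY={generate_encryption_key()}")
```

Keep the generated value somewhere safe: n8n uses this key to encrypt stored credentials, so the same key has to be supplied every time the containers are recreated.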

As usual, you can run the Docker compose file with:

docker-compose up

You can now access the workflow editor at http://localhost:5678.

On first opening of the URL, you will be asked to set up an owner account:

Skipping the initial pop-ups after login takes you to the Dashboard:

Tax Form Samples

To extract information accurately from the two types of tax forms, we are going to create two Unstract APIs, one designed specifically for each type.

Let’s start by looking at the sample documents we will use for each type.

1040 Tax form:


990 Tax Form:

To get an initial idea of the extracted information that Unstract and its LLMWhisperer can provide, let’s run one of the files in the LLMWhisperer Playground:

Unstract Prompt Studio Projects

Unstract is an open-source no-code LLM platform to launch APIs and ETL pipelines to structure unstructured documents.

Unstract’s Prompt Studio is a powerful tool that enables users to design and customize AI-driven prompts for extracting specific information from unstructured data.

In this section, we’ll focus on creating prompts to extract the necessary fields from the tax forms, like income, revenue, and identification.

Visit the Unstract website and create an account. The registration process is straightforward and grants you access to the platform’s features, including the Prompt Studio and LLMWhisperer tools.

Upon signing up, you will receive a 14-day trial that includes $10 in LLM tokens, allowing you to start using the account immediately.

Setting Up Prompts in Prompt Studio

Navigate to the Prompt Studio interface in Unstract and create a new project for the first tax form. Let’s call it ‘Form-1040’.

Use ‘Manage Documents’ to add the document you want to test with, then write the prompts for it.

Prompts are designed to instruct the AI to focus on specific information fields within the document.

Prompt: “Extract the first name, last name, home address, apt no (if exists), city, state, zip code. Return JSON with these exact field names.”

Note: Remember to set the output format as JSON.

Running the prompt, we get the following JSON:

Prompt: “From the income section, extract the total amount from Form W-2 as total amount, household wages, medical payments, other income, and total income. Return JSON with these exact field names.”

Note: Remember to set the output format as JSON.

Running the prompt, we get the following JSON:

Output Format

The extracted data is organized into structured JSON, as mentioned. The combined output of the different prompts is, for example:

{
  "identification": {
    "apt no": "234",
    "city": "Miami",
    "first name": "Roger",
    "home address": "23, Country avenue park, Beach road",
    "last name": "Ferdinand",
    "state": "Florida",
    "zip code": "6784"
  },
  "income": {
    "household wages": 23000,
    "medical payments": 8990,
    "other income": 9000,
    "total amount": 230000,
    "total income": 300000
  }
}
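
Because the database tables are flat, the nested sections of this JSON eventually have to be mapped onto individual columns. The sketch below shows one way to do that outside n8n; the function name flatten_extraction and the snake_case column naming are assumptions for illustration, so adjust them to whatever column names you actually create.

```python
def flatten_extraction(payload: dict) -> dict:
    """Flatten {"section": {"field name": value}} into {"field_name": value}."""
    row = {}
    for section, fields in payload.items():
        for name, value in fields.items():
            column = name.strip().lower().replace(" ", "_")
            if column in row:
                # The same field name appears in two sections:
                # prefix the later one with its section name to keep both.
                column = f"{section}_{column}"
            row[column] = value
    return row


# A shortened version of the combined output shown above.
sample = {
    "identification": {"first name": "Roger", "last name": "Ferdinand"},
    "income": {"household wages": 23000, "total income": 300000},
}
print(flatten_extraction(sample))
```

In the n8n workflow itself, the same mapping is done visually by dragging fields onto table columns, so this is only a way to reason about the shape of the data.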

Deploying as an API

Once you’ve set up your Prompt Studio project and fine-tuned your prompts for precise data extraction, the next step is to deploy your Unstract solution as an API.

This deployment enables you to integrate the parsing functionality directly into your applications or systems, including n8n, to support real-time processing and scalable operations.

Creating a Tool

Begin by converting your project into a tool that can be incorporated into a workflow. In your Prompt Studio project, click the Export as tool icon located at the top right corner.

This action will transform your project into a ready-to-use tool.

Creating a Workflow

Next, create a new workflow:

Navigate to BUILD → Workflows.
Click on + New Workflow to start a new workflow.

Then, in the Tools section on the right, locate the tool you just created (e.g., “Form-1040”) and drag and drop it into the workflow editor on the left side:

Creating an API

Now that your workflow is ready, you can transform it into an API. Begin by navigating to MANAGE → API Deployments and clicking on the + API Deployment button to create a new API deployment by selecting the created workflow.

Once the API is set up, you can use the Actions links to manage different aspects of the API.

For example, you can manage the API keys or download a Postman collection for testing:

You can now repeat the same steps for the other tax form type.

For example, a possible output for the second tax form is:

{
  "identification": {
    "address": "89, Edmond St, Apartment 4B",
    "employer number": "789933",
    "organization name": "Child Welfare Organization, Portland",
    "zip code": "6783"
  },
  "revenue": {
    "contributions": 30000,
    "investment": 3000,
    "program": 3987,
    "total revenue": 306987
  }
}

Once you have created the second API, you will see these APIs created: Form-1040-API and Form-990-API:

Make note of the following information (for each API):

API Name (for example Form-1040-API)
Organization Name (org_*******)
API Key (click on ‘…’ in the ‘Actions’ column and then ‘Manage keys’)

This information will be used when connecting the Unstract n8n node later on.
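
If you want to test a deployed API outside n8n or Postman, the request can also be assembled by hand. The following is only a sketch under stated assumptions: the endpoint URL is passed in as-is (copy it from the API deployment in Unstract), and the Bearer authorization scheme and the multipart field name files are assumptions that should be checked against the downloaded Postman collection. The function name build_extraction_request is hypothetical, and the request is built but not sent.

```python
import urllib.request
import uuid


def build_extraction_request(endpoint_url: str, api_key: str,
                             pdf_bytes: bytes, filename: str) -> urllib.request.Request:
    """Build (but do not send) a multipart POST request carrying one PDF."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the API expects the upload in a form field named "files".
        f'Content-Disposition: form-data; name="files"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        # Assumption: the API key is sent as a Bearer token.
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(endpoint_url, data=body, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would send it; the JSON that comes back should then contain the fields defined in Prompt Studio.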

Supporting Services

Gmail

You should generate an App Password for n8n to avoid OAuth complexity in accessing the inbox.

To manage your app passwords, you can go here:

Give it a name and then click on ‘Create’.

A pop-up will be shown with your app password; copy it to a safe location so you can use it later.

Note: To use an app password, you need to have 2-step authentication enabled in your account.
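
Before wiring the app password into n8n, it can be useful to confirm that it works over IMAP. The sketch below uses Python's standard imaplib module against imap.gmail.com on port 993 (the standard IMAP-over-TLS port); the function name check_imap_login and the client_factory parameter are assumptions added so the logic can be exercised without a real mailbox.

```python
import imaplib


def check_imap_login(user: str, app_password: str,
                     host: str = "imap.gmail.com", port: int = 993,
                     client_factory=imaplib.IMAP4_SSL) -> bool:
    """Log in over IMAP and report whether the inbox can be opened read-only."""
    client = client_factory(host, port)
    try:
        # Raises imaplib.IMAP4.error if the credentials are rejected.
        client.login(user, app_password)
        status, _ = client.select("INBOX", readonly=True)
        return status == "OK"
    finally:
        client.logout()
```

Calling check_imap_login("you@example.com", "your-app-password") with real values should return True when the app password is valid, and raise an error when the login is rejected.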

NeonDB Postgres

Create a new project:

And you will be redirected to the Dashboard:

Click on ‘Connect to your database’:

Select ‘Parameters only’ and click on ‘Copy snippet’. You will need these values later to connect n8n to this Postgres database.

Next, you can create the tables necessary to store the tax form data, starting with the table form-1040:

Add one column for each field defined in the JSON created in Unstract’s Prompt Studio. Then click on ‘Review and create’ and confirm by clicking on ‘Create table’.

Repeat the process for the table form-990:
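
If you prefer SQL over the table editor, the same tables can be described as CREATE TABLE statements. The sketch below only generates the statement text; the helper names, the column names, and the column types are assumptions for illustration. Note that a table name containing a hyphen, such as form-1040, has to be wrapped in double quotes in Postgres.

```python
def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def create_table_sql(table: str, columns: dict) -> str:
    """Build a CREATE TABLE statement with an auto-incrementing id column."""
    lines = ["id SERIAL PRIMARY KEY"]
    lines += [f"{quote_ident(col)} {sql_type}" for col, sql_type in columns.items()]
    return f"CREATE TABLE {quote_ident(table)} (\n  " + ",\n  ".join(lines) + "\n);"


# Hypothetical column names and types for a few of the Form 990 fields.
print(create_table_sql("form-990", {
    "organization_name": "TEXT",
    "employer_number": "TEXT",
    "total_revenue": "NUMERIC",
}))
```

Identifiers such as the employer number and zip code are kept as text here so that leading zeros are not lost.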

PDF Data Extraction: Architecture Overview

High-Level Flow Diagram

Below is a simplified flowchart illustrating the end-to-end pipeline—from incoming email to structured JSON in the database:

Component Responsibilities

n8n

Trigger: Watches a Gmail account for new emails with attachments.
Branching Logic: Parses the subject line to identify “Form 1040” vs. “Form 990” and routes each message accordingly.
API Calls: Sends the PDF attachment to the corresponding Unstract REST endpoint, receives parsed JSON, and then writes data into the target database.

Unstract

Document Parsing: Hosts form-specific extractors trained in Prompt Studio to recognize fields unique to each tax form.
JSON Extraction: Exposes each extractor as a secure REST API that converts uploaded PDFs into structured JSON payloads matching your schema.

Database (NeonDB Postgres)

Storage: Houses two tables—form-1040 and form-990—with columns aligned to the JSON fields produced by Unstract.
Schema Enforcement: Ensures data integrity via proper types and constraints, enabling reliable downstream analytics and reporting.

Step-by-Step Implementation

In this section, we will walk step by step through the configuration of each node in the workflow and its connections to the other nodes to process the tax forms.

Access the n8n dashboard at http://localhost:5678/:

Click on ‘Create Workflow’. You can rename your workflow to something like ‘Tax Forms’:

In the next sections, we will create and configure each of the necessary nodes.

1. Configure n8n IMAP Trigger

Since this is the first node of the workflow, you can create it by clicking on ‘Add first step’ and selecting ‘Email Trigger (IMAP)’ from the list:

Click on ‘Create new credential’ to associate your app password:

Fill in the required information as follows:

User -> The email address
Password -> The app password defined previously
Host -> imap.gmail.com

Click ‘Save’ to save the credentials.

n8n automatically attempts a test connection. If everything is correct, you will see the success message:

Finally, tick the option to ‘Download Attachments’:

Click ‘Back to canvas’ to return to the workflow editor and add the next node of the workflow.
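
To make the ‘Download Attachments’ option less of a black box, here is a rough sketch of the equivalent step in plain Python: pulling PDF attachments out of a raw email message with the standard email package. The function name pdf_attachments is an assumption for illustration; n8n performs this step for you and exposes the files as binary data on the trigger output.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def pdf_attachments(raw_message: bytes) -> list[tuple[str, bytes]]:
    """Return (filename, content) pairs for every PDF attached to the message."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            found.append((part.get_filename() or "attachment.pdf", part.get_content()))
    return found


# Build a small test message in memory instead of reading a real inbox.
demo = EmailMessage()
demo["Subject"] = "Adam Scott - Form 1040 - Income Tax"
demo.set_content("Please find the form attached.")
demo.add_attachment(b"%PDF-1.4 demo", maintype="application",
                    subtype="pdf", filename="form-1040.pdf")
print([name for name, _ in pdf_attachments(demo.as_bytes())])
```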

2. Handle Branching Logic

Use n8n’s Switch node to branch based on form type.

Click on the ‘+’ icon to add a new node after the Email Trigger:

Define the rules by selecting the appropriate field and matching expression:

Note: Here we use a simple ‘contains’ match on the strings Form 1040 and Form 990 for simplicity. In production settings, you should use a stricter regular expression.
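
As an illustration of what a stricter match could look like, the sketch below classifies a subject line with a regular expression that requires the word "form" followed by exactly one of the two known form numbers. The pattern, the function name classify_subject, and the returned labels are assumptions for illustration; the same pattern could be adapted for a regex rule in the Switch node.

```python
import re
from typing import Optional

# "form", optional spaces or hyphens, then 1040 or 990 as a whole number.
FORM_PATTERN = re.compile(r"\bform[\s\-]*(1040|990)\b", re.IGNORECASE)


def classify_subject(subject: str) -> Optional[str]:
    """Return "form-1040" or "form-990", or None if missing or ambiguous."""
    matches = {m.group(1) for m in FORM_PATTERN.finditer(subject)}
    if len(matches) != 1:
        return None  # no form type found, or both form types mentioned
    return f"form-{matches.pop()}"


print(classify_subject("Adam Scott – Form 1040 – Income Tax"))
print(classify_subject("Acme Corp – Form 990 – Tax exemption"))
```

Unlike a plain substring check, this rejects subjects such as "Form 10401" and refuses to guess when both form types appear in the same subject line.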

3. Call Unstract Extraction API

In order to use the Unstract APIs in n8n, you need to install the Unstract node, which is part of the community nodes.

Navigate to ‘Settings’ -> ‘Community nodes’ and click on ‘Install a community node’:

Fill in the npm package name for the Unstract node, n8n-nodes-unstract, and click ‘Install’:

After a couple of seconds, the community node for Unstract should be installed:

Return to the workflow, click on the ‘+’ icon after the branch named ‘Form 990’ and select the ‘Unstract’ node from the list:

Then click on ‘Create new credential’ and fill in the API key and the organization identifier (the org_******* value) noted previously:

Additionally, in the node configuration, you also need to fill in the ‘API Deployment Name’, which corresponds to the API Name from the Unstract API definition mentioned previously:

Then repeat the same steps for the Switch branch ‘Form 1040’, including setting a new credential since each API has its own credentials and API name:

4. Store Results in Database

Returning to the workflow editor, the next step is to configure the output of the Unstract APIs to a Postgres node.

Click on the ‘+’ icon after the Unstract node in the branch ‘Form 1040’ and select a Postgres node:

As the action, select ‘Insert rows in a table’:

Then click on ‘Create new credential’ and fill in the corresponding parameters from the NeonDB project, as described previously.

Note: Don’t forget to tick the Allow SSL option by scrolling down on this pop-up.

Now a field mapping needs to be configured between the Unstract API result and the Postgres corresponding table.

Select the schema, in this case ‘Public’, and the table ‘form-1040’. The dropdowns will populate with real-time info from the database:

Remove the id field and map each of the other fields and columns by dragging from the left INPUT section and dropping into the corresponding table field.
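
Conceptually, the mapping above amounts to a parameterized INSERT in which every mapped field becomes a column and the id is left to the database. The sketch below builds such a statement as text; the helper names and the column names are assumptions for illustration, and the %s placeholders follow the style used by common Python Postgres drivers.

```python
def quote_ident(name: str) -> str:
    """Quote a Postgres identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_insert(table: str, row: dict) -> tuple[str, tuple]:
    """Return an INSERT statement with placeholders plus the matching values."""
    columns = list(row)
    column_sql = ", ".join(quote_ident(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {quote_ident(table)} ({column_sql}) VALUES ({placeholders})"
    return sql, tuple(row[c] for c in columns)


# Hypothetical flat row for the form-1040 table (no id: the database assigns it).
sql, params = build_insert("form-1040", {"first_name": "Roger", "total_income": 300000})
print(sql)
print(params)
```

Passing the values separately from the SQL text, rather than formatting them into the string, keeps extracted document content from being interpreted as SQL.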

Repeat the process for the other branch ‘Form 990’:

5. Complete Workflow

The complete workflow is defined as:

It is triggered by a new email (read over IMAP), parses the email subject, chooses the appropriate branch, sends the attached PDF to the Unstract API, and inserts the JSON output into the corresponding Postgres table.
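
The same control flow can be summarized in a few lines of Python. This is only a sketch of the logic, not how n8n executes it: the function name process_email, the routes table, and the stand-in extractor and storage callables are all assumptions for illustration, with the real work done by the Unstract and Postgres nodes.

```python
from typing import Callable, Optional

Extractor = Callable[[bytes], dict]   # stands in for an Unstract API call
Sink = Callable[[dict], None]         # stands in for a Postgres insert


def process_email(subject: str, pdf: bytes,
                  routes: dict[str, tuple[Extractor, Sink]]) -> Optional[str]:
    """Route one attachment by subject; return the matched marker or None."""
    for marker, (extract, store) in routes.items():
        if marker.lower() in subject.lower():
            store(extract(pdf))
            return marker
    return None  # no branch matched, nothing is stored


# Stand-in extractors and in-memory "tables" for demonstration.
table_1040, table_990 = [], []
routes = {
    "Form 1040": (lambda pdf: {"form": "1040", "bytes": len(pdf)}, table_1040.append),
    "Form 990": (lambda pdf: {"form": "990", "bytes": len(pdf)}, table_990.append),
}
print(process_email("Acme Corp – Form 990 – Tax exemption", b"%PDF-1.4", routes))
```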

Testing & Validation

Since the workflow is not active by default, to test a workflow, you need to click on ‘Test workflow’.

After that, send an email with the appropriate subject, containing either ‘Form 1040’ or ‘Form 990’, and attach the corresponding tax form PDF.

The workflow will execute and process all the nodes.

You can check the workflow executions by selecting ‘Executions’ when in the ‘Editor’.

Here you can see the corresponding executions, including the paths taken by the workflow, here for ‘Form 1040’:

And here for ‘Form 990’:

Once you have confirmed that the workflow works as expected, you can set it to Active so it runs automatically. You can do this by clicking the ‘Inactive’ toggle in the editor to change it to ‘Active’.

Database Verification

As an additional verification, you can check the stored data in the NeonDB Postgres database by querying the inserted data.

Navigate to ‘Tables’ and select one of the tables, in this case, form-1040:

You can also check the form-990 table:

End-to-End Document Automation with n8n and Unstract

By combining n8n’s flexible, self-hosted orchestration with Unstract’s AI document extraction APIs, accounting firms can replace tedious, error-prone manual data entry with a reliable, fully automated pipeline.

Incoming emails are automatically filtered and routed based on simple subject-line logic, PDFs are sent to form-specific extractors trained in Prompt Studio, and the resulting structured JSON lands directly in your NeonDB Postgres tables—no human intervention required.

This end-to-end workflow accelerates document processing automation, frees up your team for higher-value work, and ensures consistent data quality.

n8n’s execution logs give you visibility into every step of the pipeline, and you can add retry settings and failure alerts on top of this workflow as your volume grows.

As your document types evolve, you can rapidly iterate on new extractors in Unstract and seamlessly plug them into your existing n8n workflows.

Ready to take the next step?

Whether you’re handling two tax-form variants today or scaling to dozens of document types tomorrow, Unstract + n8n delivers the agility, accuracy, and efficiency modern accounting teams demand.

Unstract is a no-code platform to eliminate manual processes involving unstructured data using the power of LLMs. The entire process discussed above can be set up without writing a single line of code. And that’s only the beginning. The extraction you set up can be deployed in one click as an API or ETL pipeline.

With API deployments, you can expose an API to which you send a PDF or an image and get back structured data in JSON format. With an ETL deployment, you can drop files into Google Drive, an Amazon S3 bucket, or one of a variety of other sources, and the platform will run extractions and store the extracted data in a database or a warehouse like Snowflake automatically. Unstract is open-source software and is available at https://github.com/Zipstack/unstract.

About Author

Nuno Bispo

Nuno Bispo is a Senior Software Engineer with more than 15 years of experience in software development. He has worked in various industries such as insurance, banking, and airlines, where he focused on building software using low-code platforms. Currently, Nuno works as an Integration Architect for a major multinational corporation. He has a degree in Computer Engineering.