
Advanced OCR with Llamaindex Extraction and n8n

By Granite Marketing•Published Jul 2, 2025•12 min read


How to Extract Structured Data from PDFs Using LlamaIndex and n8n

If you've ever tried to pull specific information from a PDF (company names, funding amounts, contact details), you know it's tedious work. Traditional OCR gives you text, sure, but it doesn't understand what that text means or how it should be organised.

LlamaIndex's extraction tools change that. Instead of dumping raw text at you, they let you define exactly what data you want and return it in a clean, structured format you can actually use. So when you wire this into an n8n workflow, you can automate the entire process. No copy-pasting, no manual data entry.

In this tutorial, we'll walk through building an n8n workflow that uses LlamaIndex Cloud to extract structured data from pitch decks. You'll learn how to set up the API calls, handle asynchronous jobs, and get back JSON that's ready to feed into your CRM, database, or wherever else you need it.

What Problem Does This Solve?

Let's say you're a VC analyst reviewing dozens of pitch decks every week. Each deck has the same basic info (company name, founders, funding ask, market size), but it's scattered across slides in different formats.

Manually extracting this into a spreadsheet takes hours. Traditional OCR tools give you all the text, but you still have to hunt through paragraphs to find what you need.

LlamaIndex Extraction solves this by letting you define a schema (basically a template that says "I want company name, funding amount, and team size"), and it pulls exactly those fields from the document. The result is structured JSON you can immediately use in automation workflows.

This isn't just for pitch decks. You can use the same approach for invoices, contracts, forms, research papers, any document where you need specific data points extracted reliably.

What You'll Need

Before you start building, make sure you have:

  • An n8n instance (cloud or self-hosted)
  • A LlamaIndex Cloud account with an API key (sign up here)
  • A custom extraction agent set up in LlamaIndex Cloud (we'll cover what this is in a moment)
  • A sample PDF to test with (a pitch deck works great)

If you're new to LlamaIndex Cloud, you'll want to create an extraction agent first. This is where you define your schema: the fields you want extracted. You can do this manually or let LlamaIndex generate a schema by uploading example documents and describing what you need.

It’s honestly as simple as:

1. Go to Extract in the left-hand menu

2. Click the playground icon and select “Manage Agents”

Manage agents

3. Create a new agent by pressing the button in the top right and give it a name. We called ours granite-agent.

4. Click on your agent’s name

Agent view

5. Select Auto-Generate (this is the fastest way to get set up)

Auto-generate schema

6. Write a simple prompt or upload a file to get suggestions

Prompt suggestion

7. Confirm your field names

Confirm field names and descriptions

8. Click Update Agent and publish your configuration. Handily, it’ll also show you any changes you’ve made before you commit them.

Update agent

9. Keep a record of your agent’s name somewhere

That’s it: you’re now set up and ready to build everything inside n8n.

How the Workflow Works

Here's the high-level flow:

  1. Fetch the extraction agent ID by name
  2. Download the PDF you want to process
  3. Upload the PDF to LlamaIndex Cloud to create a file object
  4. Start an extraction job using your agent and file
  5. Poll the job status until it's complete
  6. Retrieve the structured data from the completed job

LlamaIndex runs extraction jobs asynchronously, which means you kick off the job and then check back periodically until it's done. This is perfect for n8n's loop and wait capabilities.

Let's build it step by step.

Step 1: Get Your API Key

Main dashboard

Before we begin, it's important to know where to get your API key. This is straightforward in LlamaIndex: once you land on the main dashboard after signing up, you'll see a page that looks like this. All you have to do is click API Keys in the bottom-left corner.

API Key Preview

Then you just have to create your API key, and that’s it.

Pro Tip: It's usually good practice to create an API key that serves a specific function, for example a specific client or a particular section of a project. This will help you track usage and see which parts of your workflows are the most labour-intensive.

Step 2: Get Your Extraction Agent ID

First, you need to tell LlamaIndex which extraction agent to use. If you know the agent's name (granite-agent in our case), you can look up its ID with a simple API call.

In n8n, add an HTTP Request node with these settings:

  • Method: GET
  • URL: https://api.cloud.llamaindex.ai/api/v1/extraction/extraction-agents/by-name/YOUR_AGENT_NAME
  • Authentication: Generic Credential Type
    • Header Auth
    • Name: Authorization
    • Value: Bearer YOUR_LLAMA_CLOUD_API_KEY

Replace YOUR_AGENT_NAME with the name of your extraction agent (e.g., pitch_deck_extractor).

Tip: Don't add your API key directly to the HTTP node. Instead, select "Generic Credential Type" as the authentication, choose "Header Auth", and create your bearer token there. That way your API key is encrypted and never visible in the workflow, and you can reuse the same credential across multiple HTTP nodes and workflows without copy-pasting it every time.
HTTP node setup

The response will include an id field. You'll use this in later steps, so make sure to reference it as {{ $json.id }} in subsequent nodes.

Why this step matters: LlamaIndex uses agent IDs internally, but it's easier for you to reference agents by name. This lookup step bridges that gap.
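If you want to prototype this call outside n8n, the same node config can be sketched as a plain request spec. The base URL and endpoint match the ones above; `build_agent_lookup` is just an illustrative helper, not part of any SDK:

```python
BASE_URL = "https://api.cloud.llamaindex.ai/api/v1"

def build_agent_lookup(agent_name: str, api_key: str) -> dict:
    """Mirror the n8n HTTP node: GET the agent by name with a bearer token."""
    return {
        "method": "GET",
        "url": f"{BASE_URL}/extraction/extraction-agents/by-name/{agent_name}",
        "headers": {"Authorization": f"Bearer {api_key}"},
    }

# The response JSON carries the agent under "id" — the same value you'd
# reference as {{ $json.id }} in n8n.
```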

Step 3: Download the PDF

Next, you need to get the PDF you want to process. This could come from anywhere: a webhook, a cloud storage bucket, an email attachment, or a direct URL.

For this example, we'll assume you have a direct URL to a PDF.

Add another HTTP Request node:

  • Method: GET
  • URL: https://example.com/path/to/pitch-deck.pdf
  • Response Format: File

Setting the response format to "File" tells n8n to treat the response as binary data, which is exactly what LlamaIndex expects.

Step 4: Upload the PDF to LlamaIndex Cloud

Before LlamaIndex can extract data, it needs a reference to your file. You create this by uploading the PDF and getting back a file_id.

Add an HTTP Request node:

  • Method: POST
  • URL: https://api.cloud.llamaindex.ai/api/v1/files
  • Authentication: Same header auth as before
  • Body Content Type: Form Data
  • Body Parameters:
    • Name: upload_file
    • Value: data (this references the binary data from the previous step)

The response will include a file_id. Save this as {{ $json.id }} for the next step.

Upload document output
Tip: If you're processing multiple files, you can loop through them here. Just make sure each file gets its own file_id before moving forward.
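The upload node above can be sketched the same way as a request spec. This is illustrative only (with the requests library you'd pass the `files` entry as `files={"upload_file": (filename, pdf_bytes)}`); the endpoint and form field name match the node settings:

```python
BASE_URL = "https://api.cloud.llamaindex.ai/api/v1"

def build_upload_request(pdf_bytes: bytes, filename: str, api_key: str) -> dict:
    """Mirror the n8n node: POST the PDF as multipart form data under 'upload_file'."""
    return {
        "method": "POST",
        "url": f"{BASE_URL}/files",
        "headers": {"Authorization": f"Bearer {api_key}"},
        # Multipart form field 'upload_file' carries the binary PDF data.
        "files": {"upload_file": (filename, pdf_bytes)},
    }
```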

Step 5: Start the Extraction Job

Now you can kick off the actual extraction.

Add another HTTP Request node:

  • Method: POST
  • URL: https://api.cloud.llamaindex.ai/api/v1/extraction/jobs
  • Authentication: Same header auth
  • Body Content Type: JSON
  • Body:

{
  "webhook_configurations": [
    {
      "webhook_url": "your-webhook-url",
      "webhook_headers": {},
      "webhook_events": [
        "extract.pending",
        "extract.success",
        "extract.error",
        "extract.partial_success"
      ]
    }
  ],
  "extraction_agent_id": "{{ $('Get Extraction Agent').item.json.id }}",
  "file_id": "{{ $('Upload Document to LlamaExtract').item.json.id }}",
  "data_schema_override": {},
  "config_override": {
    "priority": "low",
    "extraction_target": "PER_DOC",
    "extraction_mode": "BALANCED",
    "multimodal_fast_mode": false,
    "use_reasoning": false,
    "cite_sources": false,
    "chunk_mode": "PAGE",
    "invalidate_cache": false
  }
}

Replace 'Get Extraction Agent' and 'Upload Document to LlamaExtract' with the actual names of your nodes.

The response will include a job_id. This is what you'll use to check the status and retrieve results if you go down the polling route.
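If you're assembling this body in code rather than in the node's JSON field, a small builder keeps the two IDs from the earlier steps in one place. `build_extraction_job` is an illustrative helper; the field names come straight from the body above:

```python
def build_extraction_job(agent_id: str, file_id: str, webhook_url=None) -> dict:
    """Assemble the job body from the agent ID and file ID gathered earlier."""
    body = {
        "extraction_agent_id": agent_id,
        "file_id": file_id,
        "data_schema_override": {},
        "config_override": {
            "priority": "low",
            "extraction_target": "PER_DOC",
            "extraction_mode": "BALANCED",
            "multimodal_fast_mode": False,
            "use_reasoning": False,
            "cite_sources": False,
            "chunk_mode": "PAGE",
            "invalidate_cache": False,
        },
    }
    if webhook_url:
        # Only register a webhook when you're using the webhook route
        # instead of polling.
        body["webhook_configurations"] = [{
            "webhook_url": webhook_url,
            "webhook_headers": {},
            "webhook_events": [
                "extract.pending", "extract.success",
                "extract.error", "extract.partial_success",
            ],
        }]
    return body
```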

Using LlamaIndex With Webhooks

You can also set a webhook URL to trigger when the analysis is completed, which is a much cleaner solution. Just don't forget to set the webhook node to POST; I always forget and then wonder why it isn't working...

Just be aware that the webhook sends two requests:

  1. One to let you know that it's pending
  2. Another one when it is completed

Therefore, we have to set up a filter to grab the successful response. Then we use a Set node with the body type set to object, which makes accessing the job ID simple so we can fetch the results of the execution.
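The filter logic can be sketched like this. The exact webhook payload shape is an assumption (an event-name field alongside the `body.data.job_id` path used below; check the payload your webhook actually receives), and `pick_successful_job` is an illustrative helper:

```python
def pick_successful_job(webhook_payloads):
    """Return the job_id from the first success event, or None.

    Assumes each payload carries the event name and job ID under body,
    e.g. body.event_type and body.data.job_id — verify against your
    actual webhook payloads before relying on these paths.
    """
    for payload in webhook_payloads:
        body = payload.get("body", {})
        if body.get("event_type") == "extract.success":
            return body.get("data", {}).get("job_id")
    return None
```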

Running a LlamaIndex job

Add an HTTP Request node:

  • Method: GET
  • URL: https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/{{ $('Set Return Data').item.json.body.data.job_id }}/result
  • Authentication: Same header auth
Final output

Optional parameters:

You don’t need to touch most of these to get started. The defaults are sensible. But if you want more control, here’s what’s available:

  • extraction_mode: Control how the extraction runs (default is usually fine).
    • Choose between speed and accuracy — for example, FAST for quicker runs, BALANCED for most use cases, or more advanced modes for complex layouts and tables.
  • system_prompt: Add custom instructions to guide the extraction (useful if you need specific formatting or context).
    • This is helpful when you want stricter outputs or domain-specific behaviour.
  • priority: Set job priority (e.g., "low").
    • Useful if you’re running large batches and want to manage cost or queue behaviour.
  • extraction_target: Decide how results are grouped.
    • PER_DOC returns one structured result per document.
    • PER_PAGE or PER_TABLE_ROW can split output into more granular entries.
  • multimodal_fast_mode: Speed up multimodal extraction.
    • Trades a bit of quality for faster processing. Usually safe to leave off unless optimising for cost.
  • use_reasoning: Enable deeper reasoning during extraction.
    • Helpful when information isn’t explicitly stated and needs interpretation.
  • cite_sources: Return citation metadata.
    • Adds page references and source locations so you can trace exactly where each value came from.
  • chunk_mode: Control how the document is split before extraction.
    • PAGE is standard. Other modes can help with structured or sectioned documents.
  • invalidate_cache: Force a fresh extraction.
    • Useful during testing if you’ve updated a document and don’t want cached results.
  • data_schema_override: Temporarily override the agent’s schema.
    • Great for one-off experiments without creating a new extraction agent.
  • webhook_configurations: Receive real-time status updates.
    • Instead of polling for results, you can trigger a webhook when the job is pending, successful, partially successful, or fails.

For the webhook, that's it. If you do want to use the polling version, then you can continue on to the following steps.

Step 6: Poll the Job Status (If Not Using Webhook)

Extraction jobs don't finish instantly: they can take anywhere from a few seconds to a couple of minutes depending on document complexity.

You need to check the job status periodically until it's complete.

Add a Loop Over Items node (or use the Wait node with a loop) to poll the status every 30 seconds.

Inside the loop, add an HTTP Request node:

  • Method: GET
  • URL: https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/{{ $('Start Extraction Job').item.json.id }}
  • Authentication: Same header auth

The response will include a status field. Possible values are:

  • pending
  • in_progress
  • completed
  • failed

Use an IF node to check if status equals completed. If not, wait 30 seconds and check again. If yes, move to the next step.

Why polling matters: LlamaIndex processes jobs asynchronously to handle large documents efficiently. Polling is the standard pattern for working with async APIs—you're not blocking the workflow, just checking in periodically.
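The loop-wait-check pattern above can be sketched as a small function. The status callable is injected so you can plug in the HTTP call (or a stub for testing); `poll_until_complete` is an illustrative helper, not an SDK function:

```python
import time

def poll_until_complete(get_status, interval=30, max_attempts=20, sleep=time.sleep):
    """Poll an async job until it completes, mirroring the IF-node loop above.

    get_status is any callable returning one of: "pending", "in_progress",
    "completed", "failed". sleep is injectable so tests don't have to wait.
    """
    for _ in range(max_attempts):
        status = get_status()
        if status == "completed":
            return True
        if status == "failed":
            # Surface the failure so the workflow can log and alert.
            raise RuntimeError("Extraction job failed; check the API error message")
        sleep(interval)
    raise TimeoutError("Job did not complete within the polling window")
```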

Step 7: Retrieve the Extracted Data (If Not Using Webhook)

Once the job is complete, you can fetch the structured data.

Add a final HTTP Request node:

  • Method: GET
  • URL: https://api.cloud.llamaindex.ai/api/v1/extraction/jobs/{{ $('Start Extraction Job').item.json.id }}/result
  • Authentication: Same header auth

The response will be a JSON object matching the schema you defined in your extraction agent.

For a pitch deck, it might look like this:


{
  "company_name": "Acme AI",
  "funding_amount": "$2M",
  "team_size": 5,
  "market_description": "AI-powered analytics for e-commerce brands",
  "founders": ["Jane Doe", "John Smith"]
}


Now you can route this data wherever you need it—into a Google Sheet, a CRM, a database, or even trigger a follow-up workflow.
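Before routing the result onward, it's worth checking that the fields you care about actually came back; extraction can return partial results when a deck omits something. A minimal sketch (`missing_fields` and the field list are illustrative, matching the sample output above):

```python
# Fields we expect every deck to yield, per the sample schema above.
REQUIRED_FIELDS = ["company_name", "funding_amount", "team_size", "founders"]

def missing_fields(extracted: dict) -> list:
    """List any expected fields the extraction came back without,
    so incomplete decks can be flagged instead of silently routed on."""
    return [f for f in REQUIRED_FIELDS if extracted.get(f) in (None, "", [])]
```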

Understanding Extraction Schemas

The real power here is in the schema. This is where you define exactly what data you want and how it should be structured.

When you create an extraction agent in LlamaIndex Cloud, you can:

  • Manually define fields: Specify field names, types (string, number, array), and descriptions
  • Auto-generate from examples: Upload sample documents and describe what you want—LlamaIndex will infer the schema (what we did in this tutorial)

For a pitch deck, a good schema might include:

  • company_name (string)
  • funding_amount (string)
  • team_size (number)
  • market_description (string)
  • founders (array of strings)
  • problem_statement (string)
  • solution_overview (string)

The more specific your schema, the better the extraction quality. If you need a date formatted a certain way, or a currency amount without symbols, you can add instructions in the field description.

Pro tip: Start with a simple schema and iterate. Test with a few documents, see what gets extracted correctly, and refine from there.
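To make the field list above concrete, here's a JSON-Schema-style sketch of the pitch deck schema. The exact shape LlamaIndex Cloud stores may differ from this; treat it as an illustration of how field names, types, and guiding descriptions fit together:

```python
# Illustrative pitch deck schema: field descriptions double as extraction
# instructions (e.g. you could add "without currency symbols" to a field).
PITCH_DECK_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string", "description": "Trading name of the company"},
        "funding_amount": {"type": "string", "description": "Funding ask as written in the deck"},
        "team_size": {"type": "number", "description": "Current headcount"},
        "market_description": {"type": "string", "description": "One-line market summary"},
        "founders": {"type": "array", "items": {"type": "string"}},
        "problem_statement": {"type": "string"},
        "solution_overview": {"type": "string"},
    },
}
```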

What You Can Do With the Extracted Data

Once you have structured JSON, the possibilities open up:

  • Auto-populate a CRM: Send company details directly into Salesforce, HubSpot, or Airtable
  • Build a deal pipeline: Track funding asks, team sizes, and market focus across dozens of decks
  • Generate summaries: Feed the extracted data into an AI summarisation tool to create one-pagers
  • Trigger follow-ups: If funding amount exceeds a threshold, automatically send an email or Slack message
  • Create dashboards: Aggregate data across documents to spot trends (e.g., average funding ask by industry)

In our test run, we extracted data from a pitch deck in under two minutes: something that would've taken 15+ minutes manually. And because it's structured, we could immediately filter, sort, and analyse it without any cleanup.

Tips and Gotchas

Handle failed jobs gracefully: Always check the status field. If a job fails, the API will return an error message: log it and alert yourself so you can investigate.

Watch your API rate limits: LlamaIndex Cloud has rate limits depending on your plan. If you're processing dozens of documents, consider batching or adding delays between jobs.

Test your schema thoroughly: The quality of extraction depends heavily on your schema. Test with a variety of documents to make sure it handles edge cases (missing fields, unusual formatting, etc.).

Use system prompts for edge cases: If the default extraction isn't quite right, you can add a system prompt to guide the model. For example: "Extract funding amount as a number without currency symbols."

Store file IDs if you need to reprocess: If you might want to re-run extraction with a different schema later, save the file_id: you won't need to re-upload the document.

Wrapping Up

LlamaIndex Extraction turns unstructured PDFs into structured, machine-readable data, and when you combine it with n8n, you can automate the entire process end-to-end.

This isn't just about saving time (though that's a big win). It's about making data accessible in a way that lets you build smarter workflows. Instead of manually copying fields into spreadsheets, you can route data directly into systems that act on it: CRMs, databases, analytics tools, AI agents.

If you're processing documents regularly: pitch decks, invoices, contracts, research papers, this approach scales in a way manual extraction never could.

The next time you're staring at a stack of PDFs, remember: you don't have to extract that data by hand. Let the machines do it.
