00:01:00.000 –> 00:15:52.000
Hello everyone, welcome to the webinar. I’m Mahashree, Product Marketing Specialist at Unstract, and also your host for this session today. Now LLMs today are transforming how we extract data from documents.
But without output accuracy, even the most advanced models can become a liability. Furthermore, extraction errors skew insights, invite risks, and also burden teams with fixes.
Now this is exactly where a platform like Unstract comes into the picture, with its robust suite of tools that enable more accurate LLM outputs.
00:15:52.000 –> 00:15:58.000
We have seen with our users that they’ve been able to achieve an accuracy rate of a whopping 99% using these tools.
00:15:58.000 –> 00:16:09.000
And over the course of this webinar, you’ll be able to better understand the accuracy challenges that we’re dealing with today, and how these tools really help you overcome them.
00:16:09.000 –> 00:16:16.000
Now, before I kickstart the session, here are a few housekeeping points I’d like to quickly run over.
00:16:16.000 –> 00:16:22.000
So, firstly, all attendees in this webinar will automatically be on mute throughout the course of this session.
00:16:22.000 –> 00:16:33.000
In case you have any questions, please do drop them in the Q&A tab that you’ll find in the bottom panel of your screen. Our team is working in the background, and they’ll be able to get back to you with answers via text.
00:16:33.000 –> 00:16:42.000
Now, in case any of your questions are left unanswered over text, not to worry, we will be taking them up towards the end of this session in an interactive Q&A.
00:16:42.000 –> 00:16:45.000
You can also chat with fellow attendees using the chat tab.
00:16:45.000 –> 00:16:49.000
This is where you let us know in case you run into any technical difficulties.
00:16:49.000 –> 00:17:00.000
And as a final point, when you exit this webinar, you’ll be redirected to a feedback form, and I request you to leave your feedback there. This will really help us improve our webinar experience going forward.
00:17:00.000 –> 00:17:09.000
So that said, here’s the agenda for today. We’ll open up this session by discussing what causes inaccurate document extraction.
00:17:09.000 –> 00:17:13.000
Now, this will set the problem statement that we’re really looking to address.
00:17:13.000 –> 00:17:21.000
It will be followed by an introduction to Unstract and LLM Whisperer, which is Unstract’s in-house text extraction service.
00:17:21.000 –> 00:17:27.000
Along with the powerful suite of accuracy-enabling capabilities that they bring with them.
00:17:27.000 –> 00:17:35.000
Post this, we’ll also have live demonstrations of these capabilities so that you can see accuracy-first extraction in real time.
00:17:35.000 –> 00:17:41.000
And finally, we’ll conclude this session with a summary and also a Q&A, in case we have any questions pending.
00:17:41.000 –> 00:17:49.000
So that said, here is the opening discussion, what causes inaccurate document extraction?
00:17:49.000 –> 00:17:55.000
Now, we’ll gain clarity on the key challenges that need to be addressed when it comes to document extraction.
00:17:55.000 –> 00:18:13.000
And, at Unstract, we’ve observed two broad categories of factors that contribute to inaccuracy in extraction. The first is document challenges, which deal with the documents themselves; the quality and structure of input data play a critical role in extraction accuracy.
00:18:13.000 –> 00:18:29.000
And secondly, there could also be shortcomings on the technological front; in this case, the technical challenges are the limitations that come with LLMs today. So we’ll take a look at both these categories in the upcoming segments.
00:18:29.000 –> 00:18:36.000
So up first, we have document challenges. As you can see, we have 3 challenges that are outlined over here.
00:18:36.000 –> 00:18:43.000
Now, these are based on our research and conversations with users of Unstract and LLM Whisperer on their pain points.
00:18:43.000 –> 00:18:47.000
So the first challenge we have is low-quality scans or images.
00:18:47.000 –> 00:19:02.000
Now, businesses largely still run on physical documents. These documents are usually scanned and uploaded into systems at a later stage for better maintenance or analytics, because these documents, again, contain invaluable business insights.
00:19:02.000 –> 00:19:07.000
But what we see today is that we’re not always going to get a good scan.
00:19:07.000 –> 00:19:15.000
We could have scans with bad lighting, we could have low resolution scans, skewed scans, or we may be even dealing with watermarks.
00:19:15.000 –> 00:19:22.000
And these are the factors that make it difficult for the system to extract accurate information from the documents.
00:19:22.000 –> 00:19:34.000
Now, one way of dealing with this issue today is to first pre-process your documents into a format that is best understood by the document extraction technology, or in this case, LLMs.
00:19:34.000 –> 00:19:43.000
So we’ll again go over the importance of pre-processing later on in this webinar, and we’ll be able to see how LLM Whisperer really supports you when it comes to this.
00:19:43.000 –> 00:19:52.000
Moving on to the next challenge, we have complex formats. Now, documents are designed keeping in mind only how easy they are for a human to comprehend.
00:19:52.000 –> 00:20:06.000
Now, not always is this going to be the same case with machines, especially when we’re dealing with complex formats that come with nested tables, multi-layout columns, forms with various checkboxes and input text fields, and so on.
00:20:06.000 –> 00:20:13.000
Now, again, the advantage that we see with LLMs, though, is that they read documents very much the way a human would.
00:20:13.000 –> 00:20:18.000
Which is why you’re not required to train the model or define any rule-based templates for extraction.
00:20:18.000 –> 00:20:26.000
However, we still do notice the need for preprocessing in terms of making the documents LLM ready.
00:20:26.000 –> 00:20:32.000
Which is really the forte of LLM Whisperer, and we’ll be looking at this in a while.
00:20:32.000 –> 00:20:45.000
And finally, we have unstructured content. Now, unlike forms or invoices that follow a specific template, unstructured documents like emails or letters don’t really offer clear labels or formats.
00:20:45.000 –> 00:20:56.000
This ambiguity makes it difficult to locate and extract relevant information accurately. So with that, we sum up the commonly faced document challenges that contribute to inaccuracy.
00:20:56.000 –> 00:21:02.000
On the screen, you can see example screenshots of documents that are quite difficult to process.
00:21:02.000 –> 00:21:13.000
So, over here, we have a letter, which is handwritten, and it is also a badly scanned document, and when we take a look at the checkered background, it’s going to make it even more difficult for the system to understand.
00:21:13.000 –> 00:21:31.000
We also have another screenshot of a condensed table over here, with a few columns and a lot of content. So these are the kinds of documents where, when they are directly passed to LLMs, the models might have trouble understanding the data and retrieving the right information you’re looking for.
00:21:31.000 –> 00:21:36.000
So with that, let me move on to the technical challenges.
00:21:36.000 –> 00:21:43.000
So over here, we have five accuracy-related technical challenges. The first one being LLM hallucinations.
00:21:43.000 –> 00:21:50.000
So, hallucinations are basically instances where an LLM generates an incorrect output quite confidently.
00:21:50.000 –> 00:21:55.000
Now, this could happen if the models do not have access to complete input context.
00:21:55.000 –> 00:22:08.000
Or if they interpret the prompts differently from what is required. Moreover, LLMs are non-deterministic in nature, and they will most probably produce an output no matter what, even if the input context is limited.
00:22:08.000 –> 00:22:20.000
So this, again, stresses the need to find a way to provide complete input context, and also to make sure that the prompts we are providing the model with are clear.
00:22:20.000 –> 00:22:29.000
Now, there are capabilities to aid in providing input context and also in building prompts, which we’ll look at later in this webinar.
00:22:29.000 –> 00:22:44.000
Now, another limitation that we see with LLMs is that currently there is no way to catch these hallucinations, and LLMs do not offer a way to validate your output, which brings me to the second point.
00:22:44.000 –> 00:22:53.000
Inability to validate LLM outputs. Now, without a verification mechanism, it’s again hard to trust the output.
00:22:53.000 –> 00:22:57.000
Now, to overcome these challenges, Unstract supports a powerful capability called LLM Challenge.
00:22:57.000 –> 00:23:04.000
Which is, again, a differentiator for us in the market, and we’ll cover it in more detail in the sections to come.
00:23:04.000 –> 00:23:09.000
Moving on from there, we have the third technical challenge that is poorly constructed prompts.
00:23:09.000 –> 00:23:20.000
Now, prompt engineering matters. If the prompt is vague or ambiguous, the model may not understand what to extract or how to format it.
00:23:20.000 –> 00:23:24.000
So, from there, next up, we have lack of complete context.
00:23:24.000 –> 00:23:40.000
So what we mean by this is that the LLM is unable to understand the document in its entirety. This again happens mostly when the document is difficult to read, as we’d seen earlier when we discussed challenges like badly scanned documents or complex layouts.
00:23:40.000 –> 00:23:46.000
Now, this again reiterates the need for preprocessing, which we’ll take a look at in a while.
00:23:46.000 –> 00:23:59.000
And finally, we have industry-specific jargon. Models may stumble when dealing with domain-specific terminology, whether it’s legal phrasing, it could be medical codes, or even financial abbreviations.
00:23:59.000 –> 00:24:04.000
So, unless the model has been fine-tuned or trained with domain-specific knowledge,
00:24:04.000 –> 00:24:14.000
it might misinterpret or even completely skip critical content. And with that, we conclude the section on the factors causing inaccurate document extraction.
00:24:14.000 –> 00:24:22.000
We’ll now move on to understanding what are the various features Unstract and LLM Whisperer have to offer to overcome these challenges.
00:24:22.000 –> 00:24:29.000
But before we move on to that, here’s a quick introduction to these two platforms.
00:24:29.000 –> 00:24:33.000
Unstract, as I mentioned before, is a purpose-built platform for document extraction and ETL.
00:24:33.000 –> 00:24:41.000
The functionalities of this platform can be broadly put into two buckets, that is, the development phase and the deployment phase.
00:24:41.000 –> 00:24:49.000
Now, the development phase is basically where you define what data you want to extract from your document, and in what schema you want this extraction to be done.
00:24:49.000 –> 00:24:54.000
So for this, we use a robust prompt engineering environment called Prompt Studio.
00:24:54.000 –> 00:25:03.000
So here’s where you would also specify a text extractor to pre-process your document before you extract any data from it.
00:25:03.000 –> 00:25:07.000
The popular choice that we see among our users for the text extraction service is LLM Whisperer.
00:25:07.000 –> 00:25:13.000
One reason is that it does this job wonderfully well in creating LLM-ready documents,
00:25:13.000 –> 00:25:22.000
with its powerful layout-preserving mode. And secondly, it is Unstract’s in-house text extractor and goes hand-in-hand with the platform that we built.
00:25:22.000 –> 00:25:27.000
Now, depending on your use case, you can also avail LLM Whisperer as a standalone solution.
00:25:27.000 –> 00:25:32.000
So once you extract the raw text from your document, or pre-process your document,
00:25:32.000 –> 00:25:40.000
you can define the prompts for data extraction, and also test these prompts across multiple document variants within the Prompt Studio.
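To make this development phase concrete, here’s a rough sketch, in Python, of the kind of field prompts and target schema you might set up for a document like a credit card statement. The field names, prompt wording, and schema below are invented for illustration and are not Prompt Studio’s actual configuration format.

    # Illustrative only: example field prompts and the structured schema you might
    # aim for when setting up a Prompt Studio project. Not Unstract's real config format.
    field_prompts = {
        "customer_name": "Extract the full name of the account holder.",
        "new_balance": "Extract the new balance as a plain number, without currency symbols.",
        "spend_line_items": "List every spend line item with its date, description, and amount.",
    }

    # The shape of the JSON record you expect back for each document (an assumption):
    expected_schema = {
        "customer_name": "string",
        "new_balance": "number",
        "spend_line_items": [{"date": "string", "description": "string", "amount": "number"}],
    }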
00:25:40.000 –> 00:25:47.000
Now, this is where you can also check out the performance of the various accuracy enablers we have, like LLM Challenge
00:25:47.000 –> 00:25:59.000
or Grammar, and we’ll come to this in a little while. So this wraps up the development phase. Once you have defined your prompts and you’re happy with the extraction outcomes that you’re getting with the uploaded documents,
00:25:59.000 –> 00:26:08.000
you can then export your project from Prompt Studio and choose from various deployment options in which you want this project to be deployed in your business.
00:26:08.000 –> 00:26:17.000
So Unstract inherently supports deployment as an ETL pipeline, an API, a task pipeline, and also as a human-in-the-loop deployment.
00:26:17.000 –> 00:26:26.000
In this session, however, we’ll only focus on the human-in-the-loop feature, as this helps in validating extraction outcomes and also preserving the accuracy of your output.
00:26:26.000 –> 00:26:36.000
But to understand these other deployment options in more detail, and the other features of Unstract in general, you can check out one of our previous webinars called Unstract 101,
00:26:36.000 –> 00:26:44.000
where we cover the end-to-end capabilities of the platform briefly and give you an overview of what the platform can do.
00:26:44.000 –> 00:26:58.000
So, our team will now drop the link to that webinar in chat, so you can always check that out in case you want to learn more. Now, to further sum it up, here are some numbers that might help you get a better understanding of the platform.
00:26:58.000 –> 00:27:09.000
Unstract now has 5K+ stars on GitHub. We have a 650+ member Slack community, and currently over 7 million pages are processed monthly by paid users alone.
00:27:09.000 –> 00:27:25.000
So, moving on to the various ways in which the platform can be integrated into your business: Unstract is available as an open-source edition with limited features for you to try out on your own, and you can also avail it as a cloud or an on-premise offering as well.
00:27:25.000 –> 00:27:44.000
Moving on to LLM Whisperer, this again has a playground, which is a cloud offering that basically lets you use the end-to-end functionalities of the platform for free, up to a daily limit of 100 pages. So you can upload up to 100 pages a day on the free plan and use all the capabilities, and you’ll be billed only beyond that.
00:27:44.000 –> 00:27:50.000
LLM Whisperer can be deployed as an API, a JavaScript client, or a Python client as well.
00:27:50.000 –> 00:27:59.000
And both platforms’ topmost priority is data privacy, as it goes without saying that customers do put important and potentially sensitive documents through them.
00:27:59.000 –> 00:28:06.000
So to ensure privacy and security, our platforms are ISO, GDPR, SOC 2, and HIPAA compliant as well.
00:28:06.000 –> 00:28:17.000
Now, apart from the various compliance standards we meet, the Unstract platform is also carefully designed from the ground up not to store any documents that pass through it during the normal course of operation.
00:28:17.000 –> 00:28:25.000
So that said, I think we’ve taken a good overview of both these platforms, Unstract, as well as LLM Whisperer.
00:28:25.000 –> 00:28:35.000
Now, we’ll move on to the next section of this webinar, where we’ll cover specific capabilities offered by both these platforms that enable accuracy in document extraction and processing.
00:28:35.000 –> 00:28:43.000
So, during our demo sections, you’ll also get a better understanding of how these capabilities work in action.
00:28:43.000 –> 00:28:52.000
So here’s a list of accuracy-enabling capabilities the platforms have to offer. We have layout preservation, confidence score, bounding box, human in the loop.
00:28:52.000 –> 00:29:02.000
LLM Challenge, grammar, and finally, preamble and postamble, which are all capabilities designed to combat the extraction accuracy challenges we’d looked at earlier.
00:29:02.000 –> 00:29:15.000
So now, I’ll move on and start covering these capabilities, and I’m sure you’d be able to make a link between the challenges we covered earlier and how these capabilities really offer a seamless solution for you to overcome them.
00:29:15.000 –> 00:29:24.000
So up first, we have layout preservation. Now, LLMs, as I mentioned earlier, consume information in a format that is best understood by humans.
00:29:24.000 –> 00:29:29.000
Which is why they require no prior training or rules to be defined.
00:29:29.000 –> 00:29:33.000
They can also process a document for the first time and still arrive at results.
00:29:33.000 –> 00:29:49.000
However, what is required for them to arrive at accurate results, especially on documents that may contain bad scans or complex layouts, is to first pre-process these documents into a format that is LLM-ready. And this is the secret sauce behind LLM Whisperer.
00:29:49.000 –> 00:30:00.000
LLM Whisperer is able to extract raw text from the document while preserving each and every piece of text, as well as maintaining the layout and even the spacing of the original document.
00:30:00.000 –> 00:30:16.000
And this is a way in which we can overcome the challenge of LLMs not receiving the full context of your document, as we ensure during pre-processing itself that every piece of text is extracted and the entire context is passed on, without compromising any information.
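To make layout preservation concrete, here’s a small, made-up fragment (names and numbers invented for illustration) showing the kind of layout-preserved raw text a pre-processor like LLM Whisperer aims to hand to the LLM: checkbox states, column alignment, and spacing survive as plain text, so the model effectively sees the form the way a human reader would.

    # Made-up fragment for illustration only; actual LLM Whisperer output will differ.
    layout_preserved_text = """
    Applicant Name: Jane Doe                          SSN: XXX-XX-1234
    Marital Status:   [ ] Single    [X] Married    [ ] Divorced

    Gross Monthly Income:  $ 8,000                    Monthly Rent:  $ 1,300
    """
    print(layout_preserved_text)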
00:30:16.000 –> 00:30:22.000
So for you to now better understand this, let me take you to the platform itself.
00:30:22.000 –> 00:30:27.000
So, uh, what you see over here, folks, is the LLM Whisperer Cloud deployment.
00:30:27.000 –> 00:30:38.000
So, this is the playground that I was talking about earlier, where you can upload your documents on your own and see how LLM Whisperer is able to extract the raw text and pre-process the documents into an LLM-ready format.
00:30:38.000 –> 00:30:44.000
Or you can also take a look at these pre-uploaded documents that we have over here of different types.
00:30:44.000 –> 00:30:48.000
So, just to show you around the playground,
00:30:48.000 –> 00:30:56.000
let me open up a sample document. What you have over here is a loan application, a form with some details filled in by hand.
00:30:56.000 –> 00:31:06.000
So this is again a scanned form, and let’s see how the platform is able to retrieve the text and preserve the layout while it does it.
00:31:06.000 –> 00:31:13.000
So you can see that LLM Whisperer has been able to retrieve all the text, including the details filled in by hand.
00:31:13.000 –> 00:31:21.000
So in this case, we have the applicant name, the social security number. You can see that in the marital status, it’s checked as married.
00:31:21.000 –> 00:31:27.000
So, let’s see how this is retrieved when the text is extracted. So we have the applicant name over here,
00:31:27.000 –> 00:31:50.000
the social security number, as well as the marital status, all extracted accurately. So this is how LLM Whisperer pre-processes your documents and converts them into a format that is LLM-ready. So this is the version that will later be sent to the LLM to perform document extraction, and in this way, we are able to preserve the complete context of the input document.
00:31:50.000 –> 00:31:56.000
Now, just to show you certain other documents that we have.
00:31:56.000 –> 00:32:09.000
So in this case, we have… a pretty condensed table with a lot of content, so let’s see how LLM Whisperer is able to preserve the layout while it extracts the text.
00:32:09.000 –> 00:32:24.000
All right, so in this case, you can see how LLM Whisperer has been able to maintain the layout and also retrieve the text from this particular document. And you can also define certain parameters for your extraction, like the horizontal or vertical lines that you see over here.
00:32:24.000 –> 00:32:34.000
So these are parameters that you can define, and in case you choose not to enable them, let me just open up how it would look for you.
00:32:34.000 –> 00:32:48.000
So in this particular example, we haven’t specified the vertical and horizontal lines as a parameter, so you basically have control over how your data is extracted and how you want to pre-process it.
00:32:48.000 –> 00:32:52.000
Now, we also spoke about how LLM Whisperer can be deployed as an API.
00:32:52.000 –> 00:32:56.000
So let me quickly take you there to show you how this deployment looks.
00:32:56.000 –> 00:33:02.000
So this is a Postman call that I’ve made. I’ve uploaded a sample document, which is the loan application.
00:33:02.000 –> 00:33:11.000
And you also have the various parameters that I was talking about earlier over here. So I can get down to the level of defining the median filter, or the Gaussian blur.
00:33:11.000 –> 00:33:15.000
And you can see that, in this case, we’ve enabled marking the vertical and horizontal lines.
00:33:15.000 –> 00:33:24.000
So, to give you an idea of what this document is, let me just go back.
00:33:24.000 –> 00:33:32.000
So this is the document that we’re dealing with. This document has two pages. The first page is what you’d seen earlier, which is the loan application with the details filled in by hand.
00:33:32.000 –> 00:33:39.000
And on the second page, we have an ID card scan that is disoriented, and it’s also not a very great scan that we’re looking at.
00:33:39.000 –> 00:33:47.000
So let’s see how LLM Whisperer is again able to extract text from this document. So we have a POST call to upload this document.
00:33:47.000 –> 00:33:53.000
We can check the status of this extraction. I’ve already run these calls, so I’m just quickly taking you through them.
00:33:53.000 –> 00:34:04.000
And finally, over here, we have the extracted text, so as we’d seen earlier, we have the name, the social security number, and all the rest of the information that is extracted while preserving the layout.
00:34:04.000 –> 00:34:13.000
Which, again, to reiterate, is the secret sauce behind preserving the entire context of the document, which really helps in boosting the accuracy of your results.
00:34:13.000 –> 00:34:28.000
And on the second page, you can see the driver license, or the applicant’s ID card. So it was a disoriented scan that we’d seen earlier, and you can see how the text is retrieved while preserving the layout, and this is then later passed to the LLMs.
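For anyone who wants to reproduce this flow in code rather than Postman, here is a minimal sketch of the same submit, poll, and retrieve sequence in Python. The endpoint paths, header name, parameter names, and response fields below are assumptions for illustration; check the LLM Whisperer API documentation for the exact contract.

    # Minimal sketch of the submit / poll / retrieve flow shown in the Postman demo.
    # Endpoints, headers, parameters, and response fields are assumed for illustration.
    import time
    import requests

    BASE_URL = "https://llmwhisperer.example.com/api"   # placeholder base URL
    HEADERS = {"unstract-key": "<YOUR_API_KEY>"}         # assumed auth header

    # 1. Submit the scanned document with pre-processing parameters
    #    (e.g. marking vertical and horizontal table lines, as in the demo).
    with open("loan_application.pdf", "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/whisper",
            headers=HEADERS,
            params={"mark_vertical_lines": True, "mark_horizontal_lines": True},
            data=f.read(),
        )
    job_id = resp.json()["whisper_hash"]   # assumed response field

    # 2. Poll the status endpoint until the extraction is finished.
    while True:
        status = requests.get(
            f"{BASE_URL}/whisper-status", headers=HEADERS, params={"whisper_hash": job_id}
        ).json()
        if status.get("status") == "processed":
            break
        time.sleep(2)

    # 3. Retrieve the layout-preserved raw text, ready to be passed to an LLM.
    text = requests.get(
        f"{BASE_URL}/whisper-retrieve", headers=HEADERS, params={"whisper_hash": job_id}
    ).text
    print(text[:500])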
00:34:28.000 –> 00:34:35.000
Now, we have time and again stressed the need for preprocessing and for, you know, preserving the layout of your documents.
00:34:35.000 –> 00:34:45.000
But just to show you how this actually works in real time, we have performed a small exercise where I’ve taken this very same document that you have over here,
00:34:45.000 –> 00:34:55.000
and I’ve passed it through leading LLMs that we have in the market today, like Gemini 2.5 Flash, ChatGPT Plus, and also Claude 3.7 Sonnet.
00:34:55.000 –> 00:35:02.000
So, we’ll be extracting some information from this particular document through all of these LLM models when we pass it directly.
00:35:02.000 –> 00:35:09.000
And we’ll see the difference in the output when we first pre-process this document and then pass it on to the LLM model.
00:35:09.000 –> 00:35:15.000
So just to give you an idea of what information we’re looking for, and what prompts we’ve passed:
00:35:15.000 –> 00:35:28.000
So we’ll be retrieving the personal information, and that’s the first prompt we have over here. Information like the name, the social security number, or marital status is all from the first page of the document, the form that you had seen.
00:35:28.000 –> 00:35:37.000
And again, we’ll also be retrieving some other data, like the hair color, height, weight, and sex, which is all information from the second page of the document.
00:35:37.000 –> 00:35:43.000
Moving on, in the second prompt, we have some customer contact information we’re looking to retrieve.
00:35:43.000 –> 00:35:47.000
In the third prompt, we’re looking to retrieve the applicant’s address.
00:35:47.000 –> 00:35:51.000
And, uh, fourthly, we are looking to see if the applicant is self-employed or a business owner.
00:35:51.000 –> 00:35:59.000
And finally, the fifth prompt is to retrieve the gross income and the rent as specified in the document.
00:35:59.000 –> 00:36:06.000
So firstly, we passed this same document along with the various prompts that you had seen right now to Gemini.
00:36:06.000 –> 00:36:12.000
You can see that the model is actually able to retrieve some of the information. You have the personal information, contact details,
00:36:12.000 –> 00:36:20.000
and all the other information that you’re looking for. But upon close inspection, what we do see is that, in this case,
00:36:20.000 –> 00:36:27.000
under personal information, we have the sex retrieved as null. So just to compare this with what it looked like on LLM Whisperer:
00:36:27.000 –> 00:36:33.000
you can see that this is basically Unstract’s interface; this is the Prompt Studio that I was talking about,
00:36:33.000 –> 00:36:42.000
where you can upload your documents and have them extracted using the text extraction service, so this is a pre-processed, LLM-ready format.
00:36:42.000 –> 00:36:47.000
And later, the LLM works on this and runs these prompts against this particular
00:36:47.000 –> 00:36:55.000
document, or the raw text that you have over here. So, upon doing this, we see over here, under personal information,
00:36:55.000 –> 00:37:12.000
that the sex is actually retrieved as female. So, just to take a look at what it is on the original document, let me take you back to the document, and when we look closer at the ID card, you can see that the sex is, in fact, given as female, which was possible because of the
00:37:12.000 –> 00:37:22.000
pre-processing stage that we had performed over here using LLM Whisperer, and which is why you do not have it on Gemini, where the document was directly passed to the LLM.
00:37:22.000 –> 00:37:32.000
Now, this is a small, incorrect output that we see over here, but in case we’re using this for further business operations, this might have severe business repercussions as well. So this is what we’re looking at.
00:37:32.000 –> 00:37:50.000
And especially when we’re dealing with enterprise use cases, you might be dealing with hundreds and thousands of documents, so it becomes impractical for you to sit and nitpick each of these outputs, because, again, another limitation that we saw earlier was that LLMs currently do not offer a way to flag these incorrect outputs.
00:37:50.000 –> 00:37:53.000
So just to take you through another example that we’ve seen.
00:37:53.000 –> 00:38:01.000
You can see over here, under gross monthly income, that the value is given as $8,000, and the rent is given as $4,300.
00:38:01.000 –> 00:38:14.000
Upon comparing this with the original document, you can see that while the monthly income is correct, it’s given as $8,000 over here, the monthly rent is actually $1,300.
00:38:14.000 –> 00:38:24.000
So the model probably misinterpreted the 1 as a 4 because of the handwriting it’s written in over here. So, let’s just compare this with LLM Whisperer.
00:38:24.000 –> 00:38:34.000
And you can see that when you first pre-process this document, you have the text retrieved as 1,300, as you can see highlighted over here on the screen.
00:38:34.000 –> 00:38:43.000
And furthermore, when you use this particular view for your extraction, the prompt that is run and the extracted data are also correct.
00:38:43.000 –> 00:38:49.000
So this is the impact that we’re looking to create when we first pre-process and preserve the layout of your original document.
00:38:49.000 –> 00:38:55.000
So just to compare this with how this performed on ChatGPT Plus and Claude 3.7 Sonnet:
00:38:55.000 –> 00:39:01.000
over here, we have the same prompts as well. What we saw with ChatGPT, as I’ll show you right now,
00:39:01.000 –> 00:39:24.000
is that when we upload the document to the system, the model actually does not take in the document at all. It flags this as an error, and it’s not able to accept the input document because of its scanned nature, which is, again, a common characteristic we saw with both ChatGPT as well as Claude: they were not able to perform well on a lot of scanned documents, and in many cases,
00:39:24.000 –> 00:39:35.000
the input itself was not taken in. So, moving on to Claude, we again have the same document over here, along with the prompts, and as you can see, Claude has actually flagged this document as empty.
00:39:35.000 –> 00:39:40.000
And it is not able to retrieve any data from this particular document.
00:39:40.000 –> 00:39:54.000
So this, again, folks, reiterates the need for you to really preserve the layout of your document in order to actually achieve correct and accurate extraction from the input document.
00:39:54.000 –> 00:40:02.000
So, with that, we’ll be moving on to the next accuracy-enabling capability, that is, Confidence Score.
00:40:02.000 –> 00:40:10.000
Now, it’s true that with just layout preservation, LLMs can extract data from any document, no matter the format.
00:40:10.000 –> 00:40:21.000
However, they can sometimes still be prone to inaccurate extractions, and as we looked at earlier, there is currently no way for you to verify whether the extraction is correct or not.
00:40:21.000 –> 00:40:28.000
So this is where the confidence score comes into the picture, which is another capability supported by LLM Whisperer.
00:40:28.000 –> 00:40:36.000
So with the confidence score, the system basically generates a score for each piece of text extracted, based on how accurate it thinks the extraction is.
00:40:36.000 –> 00:40:45.000
So this is a system-generated score, which gives users a meter, or a standard, to benchmark output quality, and also validate the extraction.
00:40:45.000 –> 00:40:49.000
So we’ll take a look at how the system does this in the demo.
00:40:49.000 –> 00:40:58.000
But I just quickly wanted to cover another accuracy enabler as well before we move on to the demo, since we’ll be looking at both of these in the same space.
00:40:58.000 –> 00:41:11.000
Now, along with the confidence score for each of the texts, the users will also have access to metadata on which line or coordinate on the original document a particular text was extracted from. So this is called the bounding box.
00:41:11.000 –> 00:41:16.000
Which can later help users locate the extracted portions from the source document.
00:41:16.000 –> 00:41:26.000
So, it would look somewhat like what you see over here in this screenshot, but we’ll again take a deeper look at both the confidence score and the bounding box in action.
00:41:26.000 –> 00:41:30.000
So let me go back to my LLM Whisperer deployment as an API.
00:41:30.000 –> 00:41:36.000
And let me just enable this capability, and I will be performing this call again.
00:41:36.000 –> 00:41:45.000
So you can see that I’ve uploaded this document. And we’ll now check the status of this, so you can see that the document is still processing.
00:41:45.000 –> 00:41:58.000
In just a couple of seconds, you’ll see, we’ll be able to get the metadata and the confidence score.
00:41:58.000 –> 00:42:05.000
So over here, you can see that both the pages have been extracted successfully, and we also have the time taken for the extraction.
00:42:05.000 –> 00:42:18.000
And thirdly, over here you can see that, upon running this call, I get the confidence score of each of the texts extracted from the document, along with the bounding box,
00:42:18.000 –> 00:42:35.000
or the exact coordinates where that particular text was located in the source document. So this confidence score, in this case 0.875, could be used as a benchmark. It is system-generated, and I can refer to this in order to see how accurate a particular extraction is.
00:42:35.000 –> 00:42:49.000
So similarly, we have the confidence scores for all of the other texts that were present in the document as well. And you also have line metadata, that is, basically the coordinates of each of the lines from the document that you had uploaded.
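To give a feel for how this metadata can be used downstream, here is a short sketch that walks a result payload of roughly that shape and flags any extracted text whose confidence falls below a threshold. The field names ("text", "confidence", "bbox", "page") are assumptions about the payload structure, not LLM Whisperer’s exact schema.

    # Field names are assumed for illustration; adapt them to the actual metadata schema.
    LOW_CONFIDENCE_THRESHOLD = 0.80

    def flag_low_confidence(segments):
        """Return extracted text segments whose confidence score is below the threshold,
        along with their bounding-box coordinates for later highlighting or review."""
        return [
            {
                "page": s["page"],
                "text": s["text"],
                "bbox": s["bbox"],                 # (x1, y1, x2, y2) on the source page
                "confidence": s["confidence"],
            }
            for s in segments
            if s["confidence"] < LOW_CONFIDENCE_THRESHOLD
        ]

    # Made-up sample resembling the demo (a 0.875 score would pass the threshold):
    sample = [
        {"page": 1, "text": "Applicant Name: Jane Doe", "bbox": (72, 110, 340, 128), "confidence": 0.875},
        {"page": 2, "text": "HAIR  BRN", "bbox": (60, 412, 140, 430), "confidence": 0.62},
    ]
    print(flag_low_confidence(sample))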
00:42:49.000 –> 00:43:02.000
So, using this metadata, you can also perform source document highlighting as one of the downstream operations, which I’ll be taking you to right now to take a look at how that works.
00:43:02.000 –> 00:43:11.000
So, again, folks, over here, what you see is the Unstract interface. This is the Prompt Studio. However, over here, we have a different document, which is a credit card statement that we have uploaded.
00:43:11.000 –> 00:43:17.000
And you can see that this has also first been extracted, or pre-processed.
00:43:17.000 –> 00:43:27.000
And these, what you see on the left-hand side, are the various prompts that we have run on this particular document. So, we’re looking to extract the customer name from this credit card statement, the address.
00:43:27.000 –> 00:43:44.000
the spend line items, and some payment information as well. And because I have deployed highlighting, and I have made use of the bounding box metadata that you’d seen earlier, what I can achieve with this metadata is basically that, when I click on the output over here for the customer name,
00:43:44.000 –> 00:43:50.000
you can see that the system automatically highlights the specific place from which this particular output was fetched.
00:43:50.000 –> 00:44:04.000
So you have the different places in the document highlighted over here, from which this particular customer name was fetched. And this is one of the powerful capabilities that will really support you when we come to the human-in-the-loop
00:44:04.000 –> 00:44:08.000
feature that we’ll be looking at in a little while. So, similarly, if I click on, let’s say, the city,
00:44:08.000 –> 00:44:16.000
you again have the exact portion of the original document from where this particular data was extracted.
00:44:16.000 –> 00:44:22.000
So that sums up confidence score, bounding box, and how they can be deployed in downstream operations such as highlighting.
00:44:22.000 –> 00:44:29.000
Let me move back to the presentation. And up next, we’ll be covering Human in the Loop.
00:44:29.000 –> 00:44:34.000
Now, here comes one of the most distinctive and highly valued features of Unstract.
00:44:34.000 –> 00:44:44.000
So, Human in the Loop enables humans to review, verify, and also, when necessary, modify the results of your document extraction projects.
00:44:44.000 –> 00:44:52.000
Now, this feature not only helps meet compliance requirements, as certain industries require a manual check while performing document extraction,
00:44:52.000 –> 00:44:58.000
but also plays a critical role in enhancing output accuracy by allowing manual validation.
00:44:58.000 –> 00:45:03.000
And the notion of the feature itself is to ensure the accuracy of LLM output.
00:45:03.000 –> 00:45:12.000
But again, let’s say you are processing hundreds and thousands of documents, so how do you decide which of these documents are sent for a human review?
00:45:12.000 –> 00:45:16.000
So, to find the answer to all these questions and learn more about the capability,
00:45:16.000 –> 00:45:24.000
let me quickly go back into the demo segment.
00:45:24.000 –> 00:45:38.000
All right. So, what you see over here is the Prompt Studio that we looked at earlier. We have the credit card project over here. So, after I specify all the prompts in this particular project, and I have also tested the output and it’s working well for me,
00:45:38.000 –> 00:45:44.000
I can export this project as a tool. So, once I export this project as a tool,
00:45:44.000 –> 00:45:54.000
we can move on to the workflows, where you can actually deploy it as an API, an ETL pipeline, a task pipeline, and also as a human-in-the-loop deployment.
00:45:54.000 –> 00:46:00.000
So in this case, we have deployed an ETL workflow, which is an ETL pipeline, and we have also enabled human in the loop.
00:46:00.000 –> 00:46:11.000
So, as you can see, this is the input configuration, where I get this particular file from. So I can define any of the file connectors, so I can get it from any of these sources.
00:46:11.000 –> 00:46:23.000
And after getting in the file, this particular credit card parser tool that I had exported from the project will be run on that particular document, and the relevant data will be extracted.
00:46:23.000 –> 00:46:31.000
So, you can simply drag and drop the tools as required from the tools pane that you see on your right side to the workflow chain.
00:46:31.000 –> 00:46:39.000
And once the tool that you want is run on the file system, the data extracted would be sent to a DB of your choice.
00:46:39.000 –> 00:46:48.000
So in this case, we have a DB that we’ve connected with, along with the table. So, over here is where you can enable human in the loop.
00:46:48.000 –> 00:46:54.000
So, this, to answer the question that we had raised earlier, how do I decide which of the documents pass through the manual review?
00:46:54.000 –> 00:47:04.000
So you can define the percentage of documents that pass through manual review over here. So in this case, if I say 30%, I want 30% of the documents to pass through manual review.
00:47:04.000 –> 00:47:10.000
And I can also define the logic, so how am I going to shortlist the documents that go for this review?
00:47:10.000 –> 00:47:26.000
So, over here, I can add my rules. For instance, let’s say that if the confidence score of the payment information is less than, say, 0.5, then I want this particular document to go into manual review.
00:47:26.000 –> 00:47:37.000
Or let’s say I have a high-ticket client, and they go by the name Richard. So I want this particular document also, again, to go through the human in the loop.
00:47:37.000 –> 00:47:45.000
So this is how you can really get down to the level of controlling which documents you want sent to the human in the loop.
00:47:45.000 –> 00:47:58.000
And you also have further settings over here, so after approval of the document, where do you want to send it? Do you want to send it to the destination DB, which is the database that we have connected with over here, or do you want to send it back to the queue for another round of review?
00:47:58.000 –> 00:48:07.000
So these are the other settings that you can define. And this is where you set the conditions for this feature and control how it works.
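As a rough mental model of the routing behaviour just described, here is a small sketch of the decision logic: rule-based overrides plus a random sampling percentage. This is only an illustration of the behaviour you configure in the pipeline settings, not Unstract’s actual implementation or rule schema.

    # Illustration only; in Unstract these rules are configured in the pipeline UI.
    import random

    REVIEW_PERCENTAGE = 30          # send roughly 30% of documents for manual review
    CONFIDENCE_FLOOR = 0.5          # rule: low-confidence payment info always goes to review
    HIGH_TICKET_NAME = "Richard"    # rule: documents for this client always go to review

    def needs_human_review(extracted: dict, confidence_scores: dict) -> bool:
        """Decide whether an extracted document should be queued for human review."""
        if confidence_scores.get("payment_information", 1.0) < CONFIDENCE_FLOOR:
            return True
        if extracted.get("customer_name") == HIGH_TICKET_NAME:
            return True
        # Otherwise fall back to random sampling at the configured percentage.
        return random.uniform(0, 100) < REVIEW_PERCENTAGE

    # Documents returning False go straight to the destination DB; the rest are
    # queued for the review interface shown next.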
00:48:07.000 –> 00:48:12.000
To show this to you in action, let me take you through the review.
00:48:12.000 –> 00:48:18.000
So this is the review interface; this is how it would look when you’re reviewing your documents.
00:48:18.000 –> 00:48:24.000
So you have the original document over here on the left, and you have all the output from this document on the right.
00:48:24.000 –> 00:48:33.000
So, to understand how accurately the system has extracted this output, as I mentioned earlier, we can deploy highlighting, where, for instance, if I click on the New Balance output over here,
00:48:33.000 –> 00:48:40.000
the system immediately highlights the corresponding spot on the source document from where it got this particular output.
00:48:40.000 –> 00:48:53.000
Now, not only this, but in case I want to edit a particular output, let’s say that this particular output extracted over here is not how I want it to be stored in the DB, I can just double-click on it,
00:48:53.000 –> 00:49:08.000
and I can edit it according to what I want. So this, again, helps you edit the output as well, and you have a lot more features to control access permissions on who would be able to perform the review and approve it
00:49:08.000 –> 00:49:22.000
before it gets stored into the DB. So with that, we’ve covered Human in the Loop, which is, again, one of the core pillars of the accuracy capabilities that we have in Unstract.
00:49:22.000 –> 00:49:34.000
And now we’ll move on to another powerful capability which enhances accuracy, which is LLM Challenge. This is specifically designed to combat LLM hallucinations and incorrect outputs.
00:49:34.000 –> 00:49:38.000
Now, while Human in the Loop is a method to verify LLM outputs.
00:49:38.000 –> 00:49:44.000
LLM Challenge is designed to prevent the occurrence of an incorrect output in the first place.
00:49:44.000 –> 00:49:47.000
So this is done through an LLM as a judge implementation.
00:49:47.000 –> 00:49:52.000
where we have two LLMs run on the same set of prompts.
00:49:52.000 –> 00:50:01.000
So one is the extraction LLM, and the other one is the Challenger LLM, and only if these two models arrive at a consensus on the result is the output displayed to the user.
00:50:01.000 –> 00:50:07.000
So LLM Challenge finds use especially when deployed in production for at-scale use cases.
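Here is a stripped-down sketch of the consensus pattern behind LLM Challenge: two independently prompted models must agree before a value is accepted. The two call_* functions are placeholders for real model calls (for example, an Anthropic extractor and a GPT challenger); this shows only the pattern, not Unstract’s actual implementation.

    # Sketch of the LLM-as-a-judge / consensus pattern only.
    def call_extraction_llm(prompt: str, context: str) -> str:
        # Placeholder: replace with a real call to your extraction LLM.
        return "1300.00"

    def call_challenger_llm(prompt: str, context: str) -> str:
        # Placeholder: replace with a real call to your challenger LLM, ideally a
        # flagship model from a different vendor.
        return "1300.00"

    def normalize(value: str) -> str:
        return " ".join(value.strip().lower().split())

    def challenged_extraction(prompt: str, context: str) -> dict:
        """Return the extracted value only when both models agree; otherwise withhold it."""
        primary = call_extraction_llm(prompt, context)
        challenger = call_challenger_llm(prompt, context)
        if normalize(primary) == normalize(challenger):
            return {"value": primary, "consensus": True}
        # No consensus: surface the disagreement instead of risking a hallucinated answer.
        return {"value": None, "consensus": False,
                "candidates": {"extraction": primary, "challenger": challenger}}

    print(challenged_extraction("Extract the monthly rent.", "<layout-preserved raw text>"))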
00:50:07.000 –> 00:50:13.000
And let’s understand this capability in more detail by moving on to the demo.
00:50:13.000 –> 00:50:23.000
So over here, again, we’ll be exploring this feature in the Prompt Studio, so we have the credit card statement, and where you can enable LLM Challenge would be under the Prompt Studio settings.
00:50:23.000 –> 00:50:30.000
So, under LLM Profiles, which is the first setting you have over here, you can define which LLM model you’re looking to use,
00:50:30.000 –> 00:50:33.000
along with the text extractor that you’re looking to use.
00:50:33.000 –> 00:50:44.000
So, once you define this profile, you also have the LLM Challenge setting, where you can define which Challenger LLM you’re looking to deploy in this particular project.
00:50:44.000 –> 00:50:57.000
And once you enable LLM Challenge, what happens is that the extractor LLM you have defined, in this case an Anthropic model, as well as the Challenger LLM, which is a GPT model, would run on each of these particular prompts.
00:50:57.000 –> 00:51:03.000
And only after they arrive at a consensus would this output be given to the user.
00:51:03.000 –> 00:51:07.000
So, to further understand this, you also have access to an LLM Challenge log,
00:51:07.000 –> 00:51:14.000
where you can see how each of these extractions has run, and what result each of the extractions produced.
00:51:14.000 –> 00:51:22.000
And you also have a score out of 5 on how far the two models, the extraction LLM and the challenger LLM, agree with one another.
00:51:22.000 –> 00:51:36.000
So this is, again, a way for you to prevent incorrect output, and a good tip, or technique, would be to define your extraction LLM and the challenger LLM from completely different vendors and have them be flagship models.
00:51:36.000 –> 00:51:47.000
So in spite of them working completely differently on the documents, if you still arrive at an output that the two models agree on, you can be fairly sure that it is, in fact, the accurate extraction.
00:51:47.000 –> 00:52:00.000
So this is how, again, LLM Challenge is used to prevent LLM hallucinations and incorrect outputs in the extraction, and it is also one of the differentiating capabilities that puts Unstract out there in the market.
00:52:00.000 –> 00:52:11.000
So, going back to the presentation, I’ll move on to the final set of capabilities that we have over here, that is, grammar, preamble, and postamble.
00:52:11.000 –> 00:52:21.000
Now, the grammar feature in Unstract allows you to define custom synonyms for words that may appear differently or in different forms across various documents.
00:52:21.000 –> 00:52:26.000
So this really helps in combating the industry-specific jargon we looked at earlier.
00:52:26.000 –> 00:52:32.000
This enables the system to recognize and interpret industry jargon, because you are first
00:52:32.000 –> 00:52:36.000
training it with the synonyms that you want the system to recognize.
00:52:36.000 –> 00:52:41.000
Moving on, the preamble and post-amble are features that help ensure smooth prompt engineering.
00:52:41.000 –> 00:52:55.000
So we looked at one other challenge earlier, where the system might not be able to retrieve accurate output when the prompts are not defined clearly. So this is a capability to aid in accurate prompt engineering.
00:52:55.000 –> 00:53:00.000
So the preamble is basically a common prompt that is added as a prefix to all the prompts that you define.
00:53:00.000 –> 00:53:04.000
So this is defined to guide the LLM’s approach to extraction.
00:53:04.000 –> 00:53:14.000
The postamble, similarly, is a prompt that is appended at the end of all the prompts that you define in the project, and it specifies how the responses should be formatted.
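To make the preamble and postamble idea concrete, here is a tiny sketch of how a shared prefix and suffix wrap each field prompt before it goes to the LLM. The wording of the preamble and postamble below is invented for illustration; it is not the text Prompt Studio pre-populates.

    # Illustration of prompt composition only; actual preamble/postamble text will differ.
    PREAMBLE = ("You are extracting data from the raw text of a document. "
                "Answer strictly from the text provided and do not guess.")
    POSTAMBLE = "Return the answer as valid JSON with no extra commentary."

    def build_prompt(field_prompt: str, document_text: str) -> str:
        """Wrap a single field prompt with the project-wide preamble and postamble."""
        return f"{PREAMBLE}\n\n{field_prompt}\n\nDocument:\n{document_text}\n\n{POSTAMBLE}"

    print(build_prompt("Extract the customer's full name.", "<layout-preserved raw text>"))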
00:53:14.000 –> 00:53:18.000
So let me just move on into the demo for you to get a better idea of this.
00:53:18.000 –> 00:53:21.000
So again, you can find these features in the Prompt Studio settings itself.
00:53:21.000 –> 00:53:30.000
So you have grammar over here. So as you can see, I have defined two synonyms, so I have defined withdraw for debit, and the word deposit for credit.
00:53:30.000 –> 00:53:44.000
So this will help my system understand and interpret the text better. Moving on to preamble and postamble: the preamble would be added at the beginning, and the postamble at the end, of all the other prompts that you saw in the Prompt Studio.
00:53:44.000 –> 00:53:56.000
So, you can either define this on your own, or the system also pre-populates it, and that’s what you have over here, so you can use this as is, edit it, or define a completely new preamble and postamble as well.
00:53:56.000 –> 00:54:06.000
So with that said, we have covered the final leg of capabilities that we had to enable better accuracy in LLM-driven document data extraction.
00:54:06.000 –> 00:54:11.000
And as we move on to the summary, or the conclusion, of this webinar:
00:54:11.000 –> 00:54:17.000
So, we saw layout preservation, confidence score, bounding box, human in the loop.
00:54:17.000 –> 00:54:22.000
as well as LLM Challenge, grammar, and finally, the preamble and postamble capabilities.
00:54:22.000 –> 00:54:27.000
And, just to check the pulse of the audience over here,
00:54:27.000 –> 00:54:34.000
I wanted to run a poll to see which of these capabilities would be most useful for
00:54:34.000 –> 00:54:38.000
the audience that we have over here. So let me just quickly run this poll.
00:54:38.000 –> 00:54:48.000
And you should be able to vote your answers.
00:54:48.000 –> 00:54:58.000
Okay, I hope I’ve launched the poll and you can see the questions.
00:54:58.000 –> 00:55:03.000
So I’ll give you a few seconds for you to finish answering.
00:55:03.000 –> 00:55:11.000
So each of these capabilities is, again, specifically designed to handle different accuracy challenges at different stages of document extraction.
00:55:11.000 –> 00:55:19.000
And what we have seen so far is that features like layout preservation, human in the loop, and LLM challenge are really the game changers.
00:55:19.000 –> 00:55:34.000
And, yeah, it looks like layout preservation is leading, and that is followed by the confidence score, and next, I think we have human in the loop and LLM Challenge as well.
00:55:34.000 –> 00:55:46.000
All right. So, I’ve just ended the poll, and I hope you can see the results as well.
00:55:46.000 –> 00:55:56.000
All right. So, moving on to the conclusion of this webinar.
00:55:56.000 –> 00:56:04.000
We saw the various factors contributing to document extraction inaccuracy. We also introduced Unstract and LLM Whisperer as a solution to these problems.
00:56:04.000 –> 00:56:09.000
And finally, we also saw a set of accuracy enablers that really help you overcome these challenges seamlessly.
00:56:09.000 –> 00:56:21.000
And through these solutions, what we have observed is that our users are able to achieve an accuracy rate as high as 99%, and that is really the reason for this webinar.
00:56:21.000 –> 00:56:25.000
So, while we have focused only on the accuracy-related features in this session,
00:56:25.000 –> 00:56:38.000
there is so much more that Unstract has to offer, and you can explore various other capabilities in the Prompt Studio as well, and choose from the other deployment options we had, like API deployment or task pipeline, and so on.
00:56:38.000 –> 00:56:46.000
So, for you to understand the remaining capabilities in more detail and see how these accuracy-enabling capabilities go hand-in-hand with them,
00:56:46.000 –> 00:57:04.000
you can either sign up for a free trial, and I’ll ask my team to drop the links in the chat, where you can explore the platform’s end-to-end capabilities on your own; we have a 14-day free trial. Or you can also refer to our previous webinars, blogs, and also the extensive documentation that we have.
00:57:04.000 –> 00:57:12.000
And another common route in which we see our users get to know the platform is to schedule a free demo,
00:57:12.000 –> 00:57:23.000
where one of our experts would be able to sit with you, understand your specific business needs, and see how you can customize Unstract or LLM Whisperer for your specific needs.
00:57:23.000 –> 00:57:30.000
So, in case you’re interested in this demo, please do leave your email IDs in the chat, and we’ll be able to reach out to you proactively.
00:57:30.000 –> 00:57:44.000
And that really brings me to the end of this session. Let’s move on to the Q&A in case we have any questions left.
00:57:44.000 –> 00:57:54.000
Okay, looks like all the questions have been answered already. All right. So that brings us to the end of the session, folks. Thank you so much for joining today.
00:57:54.000 –> 00:58:01.000
Please do leave your feedback on your way out; you’ll be redirected to a feedback form. I’m really looking forward to seeing you at our upcoming events.
00:58:01.000 –> 00:58:31.000
Have a great day, thank you.