How intelligent chunking makes LLMs better at extracting long documents

[00:00:00] 

Hi everybody. Uh, thank you so much for joining our session today. I’m your host Mahashree from the marketing team here at Unstract. Now in this session, we will be exploring a key aspect of high performing LLM document workflows that make them what they are. So when we look at real world documents, especially in document heavy industries like finance or banking, some of the most commonly processed documents would include contracts, loan agreements, financial reports, and so on.

[00:00:30] 

Now, these documents have traditionally posed challenges for automation for a couple of main reasons. First, they arrive in massive volumes. Second, they exist in varied formats and layouts. And this is exactly why many businesses are turning to LLMs for document processing today: by design, LLMs naturally overcome many of the limitations we used to face with previously available document extraction tools.

[00:01:00] 

However, if there is still one limitation that even the most advanced LLMs continue to grapple with even today, it would be document length. So when a document runs into hundreds of pages, extracting the right information accurately becomes difficult, and it can also get computationally expensive as well as time consuming.

So that is where chunking comes in as an answer, and that will be the primary focus of our webinar today. So in today’s session, we will pull back the curtains on how 

[00:01:30] 

Unstract uses advanced chunking and retrieval strategies to transform lengthy document context into actionable, usable data. So with that, let's explore the agenda for today.

So here’s what we’ll be covering. We will start by going over what is chunking and why we need it. Then we’ll cover some of the popular retrieval strategies that enable chunking today. Moving on from there, we have probably the most interesting segment of this webinar. We’ll watch how 

[00:02:00] 

Unstract actually deploys these strategies in action within the platform.

And we'll also compare how document extraction works with chunking versus without it. And finally, we'll conclude this session with a live Q&A where one of our experts will be on air to answer your questions. So that said, here are a few session essentials or ground rules that I'd like to lay out for this session.

So firstly, this webinar will be in listen-only mode, so all attendees will automatically be on

[00:02:30] 

mute. You can post your questions at any time in the Q&A tab, and we'll get back to you with answers via text. Now, if your question is left unanswered, not to worry, we'll take it up towards the end in our live Q&A.

Thirdly, you can interact with fellow attendees. Let us know where you're joining from using the chat tab. And this is also where you can let us know in case you run into any technical difficulties during the session. And as a final point, when you exit this webinar, you'll be redirected to a feedback form where we request you to leave a review so that we can

[00:03:00] 

continue to improve our sessions going forward.

 

So that said, let’s get started with our webinar. So firstly, our opening discussion would be what is chunking? So as you can see right here on the screen, the simple definition of chunking would be that it is a process of breaking down lengthy documents into smaller, more manageable sections that an LLM can process with more efficiency and accuracy.

So this is as simple as it gets, uh, it’s a 

[00:03:30] 

pretty easy concept to understand, but why do we need it, and why is it so crucial to document extraction? So here are all the reasons why. Firstly, one of the foremost reasons we need chunking, especially in LLM-enabled document extraction, is the LLM context length limits that we have.

So what this means is that any LLM you work with has a context limit: it can only process a certain number of tokens at any given point in time,

[00:04:00] 

which means it can only handle a certain number of words or characters at once. This is why, when your documents exceed this limit, we need to deploy chunking to make sure the content is sent in proper segments, with chunks sized so that the LLM can handle them.

This is one of the foremost reasons why we need chunking. Next, another key reason why chunking is required is to preserve contextual

[00:04:30] 

accuracy. So when a document is too long, the model might sometimes confuse sections or miss important connections while retrieving data. Chunking solves this by dividing the document into smaller, more meaningful sections.

So each part is processed with its local context intact. This way the model understands the content better and is able to provide you with more accurate answers. And that brings me to the third point: chunking also gives us better retrieval precision, because when you preserve the context of the

[00:05:00] 

document, it makes search and Q&A tasks easier and more reliable.

Retrieval is easier since we are focusing on a specific part rather than scanning the entire text blob. Fourthly, chunking reduces processing costs: feeding a massive document to an LLM repeatedly is computationally expensive, whereas chunking enables selective processing, so only the necessary chunks are passed to the model.

And this cuts down heavily on token usage as well as latency. 

[00:05:30] 

And finally, not only do we reduce processing costs with chunking, but the overall speed of the process is also improved when we chunk our documents. And with that, we’ve looked at all the reasons why we need chunking, and now we’ll take a look at what are the different retrieval strategies that can be used to enable chunking.

So these strategies are different ways of chunking and retrieving from your documents. But before we move to that, I just wanted to take a quick assessment of the audience that we have here today. So would you be

[00:06:00] 

able to answer this question in the chat folks? What are the documents that you are looking to perform chunking on?

And if you can, also mention the data fields that you're looking to extract from these documents. This would give us a better idea of the audience we're talking to today, and maybe I can also focus on the retrieval strategies that would be most useful for you. So I'll give you a minute or two to finish answering.

[00:06:30] 

All right, thanks James for your response. So we are seeing legal contracts, invoices,

financial documents, statements. So for each of these documents, depending on the document type, the document size, as well as the query, that is, the data that you're looking to

[00:07:00] 

extract, the retrieval strategy you choose would differ. We'll get into the retrieval strategies shortly, and we'll also be looking at how Unstract deploys them within the platform.

But before I get to all of that, let me take a quick detour and introduce Unstract to those of you who are completely new. This will take maybe two minutes. Unstract is an LLM-powered unstructured data ETL platform. If I had to briefly group all the capabilities of the platform,

[00:07:30] we'd have three main buckets.

That is the text extraction phase, the development phase, and then finally the deployment phase. So what happens in Unstract when you first upload a document is that the platform extracts the raw text from the document in a format that is LLM-ready. What we mean by LLM-ready is that LLMs basically consume information very much like humans do, so the best way to pass the context of your document to the LLM is to preserve the

[00:08:00] 

layout of the original document.

So that is what we are looking to achieve in the text extraction phase. We deploy a text extractor that extracts the raw text while preserving its layout before passing it to the LLM, and there are a couple of text extractors that you can integrate with within the platform.

However, our suggested extractor would be LLMWhisperer, which is known to do this wonderfully well. It's quite popular with our customers, and it's also our in-house text extractor, which goes hand in hand with the

[00:08:30] 

platform overall. LLMWhisperer is again available as a standalone solution that you can explore depending on your needs.

So once the first stage is done, once you extract the raw text, you move into the development phase, where you enter prompts in natural language which specify, one, what data you're looking to extract from the documents, and two, the schema of extraction you're looking to follow. You can enter your prompts and test them across various document variants no matter what the layout or formatting is, and the platform will be

[00:09:00] 

able to work in spite of all of that.

The development phase is also where you can enable and test chunking. You can set the chunk size, the retrieval strategy, and all the other features that we'll get into in more detail when we explore the demo segment. And finally, once you have specified your queries and prompts, run the project, and you're getting the output that you desire, you can deploy this project

in different ways, depending on your needs.

[00:09:30] 

So we have API deployment, ETL Pipeline, Task Pipeline, and Human in the Loop, which are our native deployment options available within the platform. For more advanced use cases, you can also go for n8n or MCP servers; we have n8n nodes for both Unstract and LLMWhisperer, and we also have MCP servers for both these platforms.

So that then gives you a gist of what the platform does and what can be achieved using it. So if I had to throw out a few numbers, uh, giving you an idea of where we stand today, we 

[00:10:00] 

have over 5.9K stars on GitHub, a 1,000-plus member Slack community, and currently we're processing over 9 million pages per month from paid users alone.

So here are the different ways in which you can deploy the platform. Unstract is available as an open-source edition where you can explore it completely on your own terms, though with a more limited feature set. And you can access the end-to-end capabilities as a cloud edition or as an on-prem version. Coming to LLMWhisperer,

[00:10:30] 

LLMWhisperer has a playground where you can test the end-to-end capabilities of the platform, and you can upload a hundred pages for free on a daily basis. This is a wonderful way for you to test how your particular business documents work with LLMWhisperer.

And if you're looking to deploy it, we have a cloud offering, an API deployment, a Python client, as well as a JavaScript client. And as I previously mentioned, we also have n8n nodes

[00:11:00] 

and an MCP server as well. And both these platforms are again compliant with all the major regulations like ISO, GDPR, SOC 2, and HIPAA.

So that said, let me actually go to the platform, give you a quick tour, and explore a little bit before we come back to the slides for the retrieval strategies. So what you see over here, folks, is the Unstract interface. And if I were logging in for the first time, what I'd have to do is set up my connectors over here under Settings.

So my connectors include the 

[00:11:30] 

LLMs that I'll be working with, the vector DBs, embedding models, and the text extractors. We have integrations available with all the popular LLMs out there, and you can see that I've already integrated with a couple of models over here. Similarly, you can integrate with a range of vector DBs, embedding models, as well as text extractors.

And over here is where you would find LLMWhisperer as well, along with other text extractors that are available in the market.

[00:12:00] 

So once this setup is done, once you have connected all the connectors that you require, you can get started with the prompt engineering phase. In this webinar, for want of time, I'll briefly go into one of the Prompt Studio projects I have already set up.

This is just to give you an idea of how this works and what you can do with it. As I mentioned, once you upload your document, the platform first extracts the raw text

[00:12:30] 

from the uploaded document. The project you're seeing over here is a credit card statement parser, where we are looking to extract details like the customer name, the customer address, the spend line items, and the payment information from the various credit card statements uploaded.

So you can see a credit card statement from American Express over here. It’s a few pages long. And you can, again, test these prompts with various other credit card statements as well. So if I have to look at the statement from the Bank of America, we have that over here as 

[00:13:00] 

well. So this is how you can upload different document variants for testing, folks.

And uh, as I mentioned, the first step that you do when you upload the document is to extract the raw text and, uh, you deploy the text extractor for this. In this case, we have deployed LLMWhisperer. And this is basically the, uh, extracted context. So this is the output that LLMWhisperer gives you. And as you can see, it has extracted the text while preserving the layout of the original document.

[00:13:30] 

So something even as small as the logo, the American Express logo, has some spacing between the two words, and even that is maintained in the extracted text. This is the level of precision we are looking at when we extract the text along with the layout. And this is the context the LLM will be working over to produce the output you've described in the prompts.

So coming to the prompts, you can see that under each of these prompts, I’ve given details of what data I want to extract as well as 

[00:14:00]

the schema for this extraction. In this case, I've deployed two different LLMs to work on this. And you can also specify an output type from the range that we have over here.

And when you click on Coverage, you have the output for this particular prompt from across all the documents you've uploaded in this project. So this is basically how Prompt Studio works, folks. You can see we have a JSON output over here for the various spend line items.

And once you’re happy with this, 

[00:14:30] 

with these various outputs, you can then deploy this project via any of the deployment options we looked at earlier. So that's a brief on how Prompt Studio works. Coming to chunking, you can enable that under the Prompt Studio settings. As you can see over here, we have a number of features under settings, and we won't be getting into the details of all of them for want of time.

You can again, access the links that we’ve posted in chat where we have 

[00:15:00] 

webinars on these features, blog posts, as well as extensive documentation. Just to take you through how chunking looks and how you can set it up within Prompt Studio: what you see over here is the LLM profile settings.

As you saw earlier, I can connect with a number of LLMs, vector DBs, text extractors, and so on. So what combination of these connectors am I going to be working with for this particular project? That is what I set up in an LLM profile over

[00:15:30] 

here, where I mention the LLM, the vector DB, and all the other details.

And along with this, I also have the option of setting the chunk size. What we mean by the chunk size, folks, is the size of one particular chunk. I can mention the number of tokens I want to set; it can be 3,000, it can be 2,500. It really depends on the document I'm working with, as well as the query that I enter.

So the best practice over here is if you are working 

[00:16:00] 

with smaller documents like invoices, then the chunk sizes are usually set a little smaller, whereas if I'm working with more elaborate documents, the best practice is to go for a bigger chunk size. Another important factor to consider is the query type, or the data that you're looking to extract.

So if you already know what data you want to extract, for instance, you have a specific keyword in mind and you're looking to extract data for that particular field, then the best practice is to go for

[00:16:30] 

a smaller chunk size. Whereas if you're looking to get an analytical output or a summarization of the context, the model might require larger context, in which case you'll have to go for a larger chunk size.

So it really depends on your particular use case, and there is no hard and fast rule to this. You'll have to upload your documents and test accordingly to see which setting works best, which is most optimal for you. So that is the chunk

[00:17:00] 

size over here. And just right below that we have the overlap feature.

The overlap is a value I can set. Let's say I'm chunking a document and it has three chunks that are adjacent to each other: A, B, and C. When I split the document between A and B, you have an edge over there where some context might be lost. In those cases, when you specify an overlap, some of the context from the end of A would be repeated at the beginning

[00:17:30] 

of B, so the content at that boundary is shared between the two chunks.

This is how you ensure that no context is lost when you send the chunks to the model for extraction. So you can set an overlap size over here; this is basically the number of tokens I want to overlap between two adjacent chunks. So once you've set these, you have the various

retrieval strategies over here. So these are the different strategies using which you can

[00:18:00] 

enable chunking in your documents. Over here we have seven major strategies that we currently support, and you can see that in the platform itself we've given some details on what each is best used for, the best use case, as well as the performance impact.

So the number of tokens it consumes, the cost impact, and so on for each of these strategies. While deploying them, you can go through this, and that is what I'm coming to next: I'll go back to my slides and explain each of these strategies and their use cases in more detail.

[00:18:30] 

Once I set this, there is also the matching count limit, or the top-k value. What this means is: out of all the chunks I get from this document, how many am I going to consider to get the output? For example, if a document is split into 10 different chunks and I set the top-k value as three, then I would be using the top three ranking chunks to get the desired output.

I can set this as two, I can set this as four. It really 

[00:19:00] 

depends on your particular use case. So that is basically how you set chunking up in Prompt Studio, folks. With that, let me go back to the slides, and I'll go over all the different chunking strategies we just looked at so you get an idea of what to enable for your particular document or use case.
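To make the chunk size, overlap, and top-k settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is illustrative only, not Unstract's implementation: tokens are approximated by whitespace-separated words, and the function name is just a placeholder.

```python
def chunk_tokens(text, chunk_size=3000, overlap=200):
    """Split text into fixed-size chunks with a shared overlap at each edge.

    Illustrative sketch: tokens are approximated by whitespace words here;
    a real pipeline would use the LLM's own tokenizer.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        end = start + chunk_size
        chunks.append(" ".join(tokens[start:end]))
        start = end - overlap  # step back so adjacent chunks share `overlap` tokens
    return chunks
```

With a setting like chunk_size=3000 and overlap=200, every pair of adjacent chunks shares roughly 200 tokens, which is exactly the edge context the overlap setting is meant to protect.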

So over here you can see that we have all these seven major chunking strategies. So each one of 

[00:19:30] 

them could be deployed depending on the query type and the data that you're looking to extract. So we'll go over them one by one, starting with the simple retrieval strategy. This is the most basic and commonly used approach, the simple vector retrieval.

Here's how this method works: it starts by splitting the document into chunks based on the chunk size and the overlap size that was defined, and then it retrieves the chunks that match most

[00:20:00] 

semantically with the query, that is, whichever of these chunks best match the query or the data that you're looking to retrieve.

It analyzes these chunks and retrieves the top-k of them, depending on what you have set. And that's about it; it's pretty direct. This strategy is useful mostly when you're looking to retrieve data from a specific section or a specific subheading in your document. So if you are looking to retrieve data for the same prompt or query from different segments of the document,

[00:20:30] then this is not the strategy

you would ideally go for. This is for very simple retrieval: it's fast and effective for short, direct queries. It may miss deeper context since chunks are processed independently, and, as I mentioned earlier, it might not be the best when you're looking to extract data from multiple points in the document.
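As a rough illustration of what simple vector retrieval does, here is a self-contained sketch. The toy hashing "embedding" stands in for a real embedding model and vector DB, an assumption made purely so the example runs on its own.

```python
import hashlib
import math

def toy_embed(text, dims=64):
    """Toy bag-of-words embedding via hashing -- a stand-in for a real embedding model."""
    vec = [0.0] * dims
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def simple_vector_retrieval(chunks, query, top_k=3):
    """Rank chunks by cosine similarity to the query embedding and keep the top k."""
    q = toy_embed(query)
    score = lambda c: sum(x * y for x, y in zip(toy_embed(c), q))
    return sorted(chunks, key=score, reverse=True)[:top_k]
```

The top-k chunks are then passed to the LLM as context, which is also why this strategy struggles when the answer is spread across more chunks than k.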

So this is the simple vector retrieval, and next comes fusion retrieval, also known as RAG Fusion.

[00:21:00] 

So like simple vector retrieval, it again starts by getting the document chunked based on the chunk size and the overlap that was specified. However, this method is especially useful when you're looking to retrieve data that is present in multiple locations in the document, because it is able to rank the different chunks that the document is split into.

And depending on the query and the various data points that you're looking to extract, it merges

[00:21:30] 

all the relevant chunks and retrieves the answer from this merged context, which gives you a context-rich answer. To retrieve these chunks, it uses multiple strategies: it can use a semantic similarity match or a keyword match.

 

It has its own algorithm for ranking and merging the results. This is ideal for complex queries that span multiple parts of a document, and it works best for long, interlinked documents like loan agreements, contracts, or insurance policies.
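A rough sketch of the fusion idea, assuming two stand-in rankers (a keyword counter and a character-trigram ranker in place of real BM25 and embedding search) combined with the standard Reciprocal Rank Fusion formula:

```python
from collections import defaultdict

def keyword_rank(chunks, query):
    """Order chunk indices by how many query words each chunk contains."""
    terms = set(query.lower().split())
    return sorted(range(len(chunks)),
                  key=lambda i: sum(w in terms for w in chunks[i].lower().split()),
                  reverse=True)

def trigram_rank(chunks, query):
    """Order chunk indices by character-trigram overlap with the query."""
    grams = lambda s: {s.lower()[i:i + 3] for i in range(len(s) - 2)}
    q = grams(query)
    return sorted(range(len(chunks)),
                  key=lambda i: len(grams(chunks[i]) & q),
                  reverse=True)

def rag_fusion(chunks, query, top_k=3, k=60):
    """Merge several rankings with Reciprocal Rank Fusion and keep the top-k chunks."""
    scores = defaultdict(float)
    for ranking in (keyword_rank(chunks, query), trigram_rank(chunks, query)):
        for rank, idx in enumerate(ranking):
            scores[idx] += 1.0 / (k + rank + 1)  # standard RRF scoring
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [chunks[i] for i in best]
```

Chunks that rank well under either signal float to the top, which is why fused results tend to cover data points spread across several parts of the document.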

[00:22:00] 

Up next, we have sub-question retrieval. This strategy is again used when you're looking to extract data from different locations in a document.

So how is it different from the previous strategy we looked at? Although the end goals of RAG Fusion and sub-question retrieval might be similar, the approaches are very different. While the fusion model looks at merging the chunks, sub-question retrieval takes your query and

[00:22:30] 

breaks it down further into smaller and more focused sub-questions.

It then works on each of these sub-questions to retrieve the relevant chunks. For example, if you ask what the loan amount, repayment terms, and eligibility conditions are for a particular loan, the system will create three sub-questions: one for the loan amount, one for the repayment terms, and one for the eligibility.

It then retrieves relevant chunks for each of these sub-questions and synthesizes the information into a complete answer. 

[00:23:00] 

So that is the difference in approach that we have. And again, it ensures no detail is missed. And this is more useful when you know exactly the details that you’re looking to extract from your documents.

So while RAG Fusion could be more useful when you want to extract multiple data points but the query is more ambiguous in nature, sub-question retrieval can be used when you can specifically name the exact data points you're looking to extract, the exact keywords or terms.

And again, this is suitable for documents like loan agreements and policies where answers span multiple sections.
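Here is a sketch of the sub-question flow. The retrieve() and llm() arguments are placeholders for whatever retriever and model client a stack provides; this shows the general pattern, not Unstract's internal code.

```python
def sub_question_retrieval(chunks, query, retrieve, llm, top_k=2):
    """Decompose a multi-part query, retrieve per sub-question, then synthesize.

    `retrieve(chunks, question, top_k)` returns relevant chunks and `llm(prompt)`
    returns model text -- both are placeholders for your own stack.
    """
    sub_questions = [
        q.strip()
        for q in llm("Break this question into independent sub-questions, "
                     "one per line:\n" + query).splitlines()
        if q.strip()
    ]
    evidence = []
    for sq in sub_questions:
        for chunk in retrieve(chunks, sq, top_k):
            evidence.append(f"Sub-question: {sq}\nContext:\n{chunk}")
    return llm("Answer the original question using only the context below.\n"
               f"Question: {query}\n\n" + "\n\n".join(evidence))
```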

[00:23:30] 

Up next, we have recursive retrieval. This strategy is designed for queries that require deep context or layered reasoning. What happens here is that instead of retrieving the information in a single pass, recursive retrieval works iteratively.

It retrieves an initial set of chunks and then uses that information to refine or guide the next

[00:24:00] 

iteration of retrieval. For example, if you ask the model to summarize all the conditions and clauses related to loan repayment across an agreement, the system might first retrieve sections mentioning repayment schedules, then refine its search to capture related conditions in other sections,

and finally combine all the relevant chunks to produce a complete answer. The key difference between sub-question retrieval and the recursive model is in how the

[00:24:30] 

query is handled. While sub-question retrieval breaks a complex query into multiple small questions and retrieves separately for each of them, recursive retrieval, by contrast,

keeps the query intact but refines the retrieval in multiple steps. This builds context progressively and is used to handle layered or nested information. So again, this is also useful when you're looking to retrieve data from different segments of the document

and when you have complex

[00:25:00]

queries. This is again useful when you are dealing with layout-heavy documents where you might have infographics or slightly difficult formats.
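A sketch of the iterative loop, again with retrieve() and llm() as placeholders and a stopping rule chosen just for illustration:

```python
def recursive_retrieval(chunks, query, retrieve, llm, max_rounds=3, top_k=2):
    """Retrieve in several passes, letting each pass refine the search query."""
    collected, search_query = [], query
    for _ in range(max_rounds):
        new = [c for c in retrieve(chunks, search_query, top_k) if c not in collected]
        if not new:
            break  # nothing fresh surfaced, so stop iterating
        collected.extend(new)
        # Ask the model what is still missing and use that as the next search query.
        search_query = llm(
            "Question:\n" + query +
            "\nContext gathered so far:\n" + "\n\n".join(collected) +
            "\nReply with a short search query for the missing detail, or DONE."
        )
        if search_query.strip().upper() == "DONE":
            break
    return llm("Answer the question using this context:\n" +
               "\n\n".join(collected) + "\n\nQuestion: " + query)
```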

Next up, we have router-based retrieval. This is a strategy designed for document sets that contain multiple types of content or domains, where the documents are very different by nature or by their domain. What router-based retrieval does is use an LLM to decide which of the retrieval

[00:25:30] 

strategies is the best fit for that particular document or query. It takes in the documents and then routes each query to the best-fit strategy depending on the domain. This ensures specialized handling of multi-domain or mixed document collections, and it improves retrieval by matching queries to the right retrieval logic.
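The routing step can be sketched as a small dispatcher; the strategy names and the llm() call here are placeholders, not the platform's actual routing logic:

```python
def router_retrieval(chunks, query, strategies, llm):
    """Let an LLM pick which retrieval strategy to run for a given query.

    `strategies` maps a name to a retrieval function; `llm` is a placeholder client.
    """
    menu = "\n".join(f"- {name}" for name in strategies)
    choice = llm("Pick the single best retrieval strategy for this query.\n"
                 f"Query: {query}\nOptions:\n{menu}\nReply with the name only.").strip()
    # Fall back to the first registered strategy if the model's answer is unrecognised.
    chosen = strategies.get(choice, next(iter(strategies.values())))
    return chosen(chunks, query)

# Example wiring (names are illustrative):
# strategies = {"simple": simple_vector_retrieval, "fusion": rag_fusion}
# context = router_retrieval(chunks, "Compare 2021 and 2022 revenue", strategies, llm)
```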

Up next, we have keyword table retrieval. This is a strategy which is

[00:26:00] 

designed to give priority to keywords that are present in structured formats within the document. This could be invoices, spreadsheets, forms that contain tables, and so on. So instead of relying purely on semantic similarity,

the system searches for the specific keywords you're looking for across the document. If you have a specific keyword in the query, it searches for that in tables or in other structured content and then ranks chunks based on how many times the keyword

[00:26:30] 

appears in a particular chunk, its position, and its proximity to relevant tables.

Depending on all these factors, it is able to rank the chunks that are most relevant for that particular query and then retrieve the data from them. For this ranking it uses various scoring models that can take into consideration the frequency, the proximity, the position, and so on.

And again, this is ideal for extracting numeric or labeled data from tables and forms, and it's highly accurate, especially when you're looking to perform search or Q&A with specific terms included in the query.
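A toy scorer that captures the flavour of keyword table retrieval, counting term frequency with a bonus for table-looking chunks; the exact weights are made up for illustration:

```python
import re

def keyword_table_retrieval(chunks, query, top_k=3):
    """Score chunks by keyword frequency, boosting chunks that look like tables."""
    terms = [t for t in re.findall(r"\w+", query.lower()) if len(t) > 2]

    def score(chunk):
        text = chunk.lower()
        frequency = sum(text.count(term) for term in terms)
        tabular = chunk.count("|") + len(re.findall(r" {3,}", chunk))  # pipes / column gaps
        digits = sum(ch.isdigit() for ch in chunk)
        return frequency * 10 + tabular + digits / 100.0  # illustrative weights

    return sorted(chunks, key=score, reverse=True)[:top_k]
```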

[00:27:00] 

And finally, let's talk about auto-merging. This is a strategy designed to retrieve information that spans adjacent chunks.

It detects the main chunks that have the data that is probably required for your query, and it also does a

[00:27:30] 

semantic similarity search across the adjacent chunks to see if it can include more context for better accuracy. So that is how auto-merging works.
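The auto-merging idea reduces to "take the best hits, then pull in their neighbours"; a sketch with a placeholder retrieve() ranking function:

```python
def auto_merging_retrieval(chunks, query, retrieve, top_k=2, window=1):
    """Retrieve the best-matching chunks, then merge in their adjacent neighbours.

    `retrieve(chunks, query, top_k)` is a placeholder ranker; `window` is how many
    neighbours on each side get pulled in for extra context.
    """
    keep = set()
    for hit in retrieve(chunks, query, top_k):
        i = chunks.index(hit)
        for j in range(max(0, i - window), min(len(chunks), i + window + 1)):
            keep.add(j)
    # Return in document order so the merged context reads naturally.
    return [chunks[j] for j in sorted(keep)]
```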

And this is again ideal for documents like contracts, policies, and agreements. So we can see that many of these strategies can be used across similar documents. However, depending on your query type, you'll have to test which one of these works best for your particular case. And again, we've given some best practices on what can be used for

[00:28:00] 

what use cases.

So that said, let me go into Unstract again, and I'll take you through how chunking can be deployed within the platform. Over here we have a sample Prompt Studio project with a financial report that runs 241 pages long, and we are looking to extract some information from this particular report.

So you can see that it’s a very lengthy document with a lot of data and we’ve used two 

[00:28:30] 

different retrieval strategies for each of these prompts, and we've compared how these two retrieval strategies work when used for different prompts. So let's get started. The retrieval strategies we've used in this particular project are given right there in the heading.

So we've used the keyword table strategy as well as the recursive strategy. The first prompt that I have over here is to get some data from a particular table. This is the keyword that I've entered: the consolidated

[00:29:00] 

cash flow statement. So I’m looking to extract the depreciation, amortization, and impairment value, which is a field that is present within the consolidated cash flow statement.

So right now I have given it the exact terms of where I'm looking to extract data from, and it is from a structured format within the document, that is, a table. Let's take a look at how this table looks in the document.

So as you can see, we have an exact match. We have an 

[00:29:30] 

exact table, so I already know where I want the data to come from. And we have the field that we are looking at right here, the depreciation, amortization, and impairment. It's the first row that I've highlighted over here, and we are looking to get the values for these three years.

And if you compare, you can see that we have used two different chunking strategies: in this one we've used the keyword table strategy, which you can check under the profile name, and over here we've used the recursive strategy.

[00:30:00] 

These are the profile names we've given to the LLM profiles that we have defined.

So we have two different LLM profiles, each using a different chunking strategy: one uses the keyword table strategy and the other uses the recursive retrieval strategy. And you can see that we've given the chunk size and the overlap size over here as well.

So using these two strategies, we have performed the same extraction.

[00:30:30] 

And you can see that while the keyword strategy has been able to retrieve the data accurately, we do not get an accurate retrieval when we use the recursive retrieval strategy. That is the difference, and that is why you have to use the right retrieval strategy depending on your query.

Whereas if you go to the second prompt, over here we're asking the system to compare the performance of this particular business in 2021 with 2022. So it's a pretty ambiguous,

[00:31:00]

query that I have over here, and I've again deployed two different strategies, the first being the keyword table strategy and the other the recursive retrieval.

 

Let's check out where this information is present in the document first. So we have the information right here. It is given as an infographic; it's not exactly a table, and it doesn't have text specifically telling you what the difference is. We have the 2022 financial highlights given

[00:31:30]

in a big blue font, whereas the 2021 numbers are given relatively smaller, right below the 2022 numbers, for each of these different aspects.

So you can see that this is an infographic, it is layout-heavy, and the recursive model has been able to work on it. It's able to get the meaning that is carried through proximity and design, and because of the iteration, it's able to link the meaning and bring you the complete answer that you see over here.

So you can see for each 

[00:32:00] 

of these aspects, we've retrieved the numbers for 2022 as well as 2021. So this is what we are looking to achieve, and this again throws light on why we need to go for the right retrieval strategy depending on our query type. Just for your understanding, right now we've used chunking strategies to retrieve data,

I mean, get data from this particular document. But if I were to specify the same prompts and not use any chunking

[00:32:30] 

strategy, let's see how the system runs. What you can expect is a context length limit error, because this is a pretty long document and without chunking it, the LLM will not be able to process it. So I'll just run it over here.

And that is the error you would be thrown. We have it right here: the prompt is too long. This, again, was a foremost reason we covered when we looked at why we need chunking in document extraction. So that said, we have another

[00:33:00] 

document where we have two other, uh, retrieval strategies that I wanted to cover.

So this document over here is an official safety certificate for a firefighting foam system that has been tested and approved. It has various details on how it's been tested, what the approval was, and all of that. The first query that I have over here is to extract the system description section of this foam system.

So it’s a very simple query, it’s very direct, and I’m even mentioning that it’s from a particular subheading or a particular 

[00:33:30] 

section. That is why it's probably best to go with the simple strategy, because it's computationally less expensive and it gets the work done.

So this is the output that I have for the simple retrieval. And if I show you where it fetched this output from in the document, this is the system description that was fetched. And again, you have the option to view the chunk used; this is true for the previous prompts that we looked at as well.

You can go through what was the exact chunk 

[00:34:00] 

that was used out of this document for this particular retrieval. In the second prompt, we are looking to extract a number of details like the product description, the terms of validity, the certificate number, and so on. While some of these data points, like the certificate number, could be directly present within the document, for details like the product description the model might have to do some semantic search and put it together.

 

So we’ve deployed auto merging for this purpose, and you can see that, uh, the one 

[00:34:30] 

that has auto-merging deployed has extracted all the details, while with the simple strategy you can see that certain details have been given as null, because the model was not able to retrieve them out of the chunks it considered using that particular retrieval strategy.

So again, when we talk about retrieval, I could even try using sub-question retrieval or recursive retrieval for this particular prompt, and it might

[00:35:00] 

work with those retrieval strategies as well. So there is no hard and fast rule on which strategy to deploy; you'll have to test it

for your queries and see which one gives you the output you need for downstream operations. And that is how you go about it. So again, this is how we deploy chunking in Unstract, folks, and that actually brings me to the end of the session. Let's go back to the slides. So

[00:35:30] 

in this webinar, we looked at what document chunking is, why it is needed, the various retrieval strategies, and finally how it is deployed in Unstract.

So let's conclude this session with a bunch of best practices that you can follow. Firstly, maintain logical boundaries: always split chunks at natural boundaries, such as paragraphs, sections, or tables. This avoids sentences or tables getting cut midway and also helps preserve the meaning in your chunks.

[00:36:00] 

Secondly, optimize chunk size. This is something we touched upon earlier: choose a chunk size that suits your document type. For shorter, structured documents like invoices or forms, smaller chunks might work best, and for longer narratives, contracts, or reports, slightly larger chunks would help retain the context.

Thirdly, use overlaps wisely. Add overlaps between chunks to prevent context loss at the edges. However, the overlap should be just large enough to capture

[00:36:30]

the connecting ideas, but not so much that it creates redundancy across chunks. Moving on, we have preserving structural elements.

Keep related tables, lists, headers, and other related structural elements together in the same chunk, especially when dealing with financial or legal documents. This ensures that numerical data isn't separated from its labels or titles. Fifthly, adapt chunking to query type. For direct lookup or search queries, smaller

[00:37:00] 

chunks could improve the precision.

Whereas for analytical or summarization queries, larger chunks would help retain the broader meaning present in the document. And finally, monitor token limits: ensure chunks fit within the model's context window. This avoids truncation during retrieval and ensures that the entire chunk is processed.
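As a small illustration of the first best practice (splitting at natural boundaries), here is a sketch that packs whole paragraphs into chunks under a token budget; word counts stand in for real token counts, and the function name is just a placeholder.

```python
def chunk_on_paragraphs(text, max_tokens=1500):
    """Pack whole paragraphs into chunks without exceeding a rough token budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, used = [], [], 0
    for para in paragraphs:
        size = len(para.split())  # crude token estimate; use the model tokenizer in practice
        if current and used + size > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```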

So with that, folks, in case we have any questions, we will be moving into the Q&A.

[00:37:30] 

I think we have one open question. Let me invite Nareen to take over the Q&A.

Yeah. So the question is about choosing between the different chunking approaches, document-based chunking versus semantic chunking and so on. It's difficult for users to decide which strategy works best, and it would involve a lot of testing to understand. Any plans around that?

[00:38:00] 

Yeah. Yeah. So, yes, Nathan, we are actually working on a lot of exciting updates. Basically, the idea is, like you have mentioned, to do this without user input.

So basically we dynamically decide by looking at the document. That's something we are working on. And in this quarter we are also working on a few other updates; we are introducing something called

[00:38:30] 

Agent Prompt Studio, where even the extraction itself is done in a kind of agentic way, and this is one of the things we are handling there.

And we are also introducing something called a golden set, where you can verify the extractions for a particular document set. If you make them the golden set, then you can cross-verify across variants how the extraction quality is and all that,

[00:39:00] 

particularly when you are changing the prompts and all that,

including prompt versioning. So anyway, we will be announcing more on this. But yeah, to answer your question, yes, that would be in the future. For now, we are trying to create as many guides as possible, because again, each use case is unique and we need to do this depending on the use case.

I hope that was useful. Next question: for a given document, how do we

[00:39:30] 

decide what strategy to use? Yeah, so you pick the strategy. We have all the strategies available as part of the Prompt Studio settings, so you would be picking the strategy, and that's why it's easy for you to test this across documents.

There is a user interface, which is Prompt Studio, where you can switch between different strategies; you can even compare

[00:40:00] 

side by side, right? So basically you can create multiple profiles within a Prompt Studio project, and each profile can have a different strategy. You can run the same prompt across different documents and different strategies and compare side by side what is working best, so it's very easy for you.

Yeah,

No, LLMWhisperer is not open source.

[00:40:30] 

It is our proprietary text extraction service. I mean, Unstract, the platform, the no-code interface, the UI, that is open source, which includes Prompt Studio. But yeah, LLMWhisperer is not open source.

What if the chunk size is unknown for a document section? So this is something you would decide based on the document itself, Shylee. So,

[00:41:00] 

it depends on the document, right? That's why you definitely need to test before you can actually go to production.

 

That's why we've enabled this in Unstract, to make it easier. And the chunk size, the way you do the chunking, also depends on the model you're using, because these days the recent models' input context windows are also growing. So you could actually

[00:41:30] 

accommodate, uh, much longer documents.

And you could actually do the extractions better these days. So it depends on the document. And we have a chunking guide, which we can share, that talks about how to go about choosing the chunk size depending on the model you're choosing. It's available in our documentation.

I’ll try and share the link as well, uh, in the roundup. Yeah. 

[00:42:00] 

Next question: from the same document, there may be different types of questions, so would we use the same strategy for all the types of questions? So, you could actually have different strategies, even across different prompts, so you can set it up however you like.

It all depends on the use case. Even if there are different types of questions, by looking at the document you will know

[00:42:30] 

how to set this up, right? So you can actually have separate Prompt Studio projects, or within a Prompt Studio project have individual prompt cells, where each cell can have a different LLM profile connected to it.

And you could have different questions, so you can mix and match however you like. Typically, if you know what kind of document you are extracting from, then this is easier. For example, if you're dealing with, let's say, a 10-Q or a 10-K document from

[00:43:00] 

the SEC, typically sub-question retrieval works well for most of the financial questions that you would ask, for example.

So the nature of the questions, what type of questions you're trying to answer or what type of extraction you're trying to do, and the type of document: once you have a good idea of these two things, you can narrow down on certain strategies that will work better in certain areas.

That's exactly what Mahashree tried to

[00:43:30] 

cover in today's webinar. Yeah, and we have a community Slack channel, so please feel free to join there and ask specific questions, because of course I'm giving a very generic answer here. But again, we are more than happy to help.

If you need any guidance on which type of chunking to use, or want to talk more about your specific use case, we are more than happy to help you work out a solution. Yeah.

[00:44:00] 

So, how do we extract table data across multiple pages? Let's say a hundred-page document where the column headers are only on the first page. So Mani, for this, you don't need to do any chunking. Typically, it depends on the document. Let's say, for example, it's a 10-page document and from page three to page eight there is a table where the header is only on page three.

You don’t even need to chunk it, right? 

[00:44:30] 

It's a smaller document, but this is a very common use case that we see again and again. In that case, LLMWhisperer just works out of the box. There is also a table mode in LLMWhisperer, which can also be super helpful.

And apart from this, we also launched something called the API Hub, where we have ready-made APIs; you could also use some of those. For example, we have an all-table extractor. So if you

[00:45:00] 

want to extract tables across, let's say, a 150-page or 200-page document where there are 10 tables spread across different places in the document,

you could use this endpoint to first scan the tables, get a list of tables, and then do the extraction. It's a two-part API. The first is the scanning, I mean the

[00:45:30] 

discover-tables endpoint, where you send the document and you get a list of tables in the document; then you pick, let's say, table number three, and we return just that table alone.

So that is possible. Yeah. If it doesn't fit in the context limit, then you need to chunk it, and then you need to use one of the retrieval strategies that we talked about, Mani. Yeah. So you can check out the

[00:46:00] 

all-table extractor. We have implemented a lot there, even if the document doesn't fit into the context limit, right?

If you're using the all-table extractor and the document doesn't fit into the context limit, we have implemented something called a rolling window. Basically, we send overlapping sections to the LLM, extract from each, then de-dupe and stitch the results back together.
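The rolling-window idea described here can be sketched roughly as follows; extract_rows() is a placeholder for the LLM extraction call, and this is not the API Hub's actual implementation:

```python
def rolling_window_extract(pages, extract_rows, window=10, overlap=2):
    """Extract table rows from a long document using overlapping page windows.

    `extract_rows(text)` is a placeholder LLM call returning a list of row tuples.
    Overlapping windows mean boundary rows can appear twice, so rows are
    de-duplicated before being stitched back together.
    """
    seen, rows = set(), []
    start = 0
    while start < len(pages):
        window_text = "\n".join(pages[start:start + window])
        for row in extract_rows(window_text):
            key = tuple(row)
            if key not in seen:        # de-dupe rows repeated across windows
                seen.add(key)
                rows.append(row)
        start += max(1, window - overlap)  # slide forward, keeping a small overlap
    return rows
```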

So if you're using the all-table extractor, it doesn't matter even

[00:46:30] 

if the document doesn't fit into the context limit. Check it out, the all-table extractor. And the PDF splitter is also something that we see used a lot. And no, Shish, it's not open source. It's a done-for-you kind of service.

I mean, it's a pay-as-you-go model, so there is no upfront commitment. With the all-table extractor, you just pay for the number of pages you process through the API. It's like

[00:47:00] 

LLMWhisperer: it's a pay-as-you-go model. And it works really well even for multiple tables spread across a document.

Uh, it works pretty well. You should check it out. Yeah.

Awesome. Thank you very much. Uh, folks, really appreciate you taking the time to attend our webinar today. Thank you. 

[00:47:30] 

Thank you folks. And also, I think someone’s raised their hand. Uh, would you like to drop your question in the chat?

We’ll just wait for a minute.

The Slack community link is there; the join link has been dropped in the chat, so please join there so that if you have any questions,

[00:48:00] 

we are more than happy to answer. There's a support channel specifically there, so you can ask your questions there. Yeah.

Alright. Okay. Thank you, everybody, and we'll be sharing the recording for the session shortly. We really look forward to seeing you at our upcoming sessions. Thank you so much. Thank you.