[00:00:00]
So hi everybody. I’m Mahashree, product marketing specialist at Unstract, and I’m really glad you could make it to the session and join this conversation. In this session, I’ll be conversing with Shuveb, co-founder and CEO of Unstract.
[00:00:21]
So I have a couple of questions pre-populated based on our interactions with users of Unstract and also other prospects that we’ve spoken to.
So for anybody who is curious about LLM-driven document extraction, where the industry is heading, and how Unstract plays a role in this space, I’m sure this session will be useful.
[00:00:39]
Hi Shuveb, how are you doing today? Good, Mahashree. Thank you for having me here. All right, let’s get started with the session. Before I move on to the Q&A segment, I just wanted to quickly go over a few things.
So we have over here a few session essentials, which are basically the ground rules that we’d like to follow in this session.
[00:01:00]
So firstly, all attendees will automatically be on mute throughout this entire session. You can always post your questions in the Q&A tab at any time during the session, and our team will get back to you with the answers via text.
You can also use the chat tab to interact with fellow attendees. This is also where you’d let us know in case you run
[00:01:20]
into any technical difficulties during the session. And finally, once we close the session, you’ll be redirected to a feedback form where I’d request you to leave a review so that we can continue to improve our sessions going forward.
So that said, here’s introducing Shuveb. Shuveb has been co-founder and CEO of Unstract for
[00:01:40]
over three years now. Previously, he was VP of Platform Engineering at Freshworks, the Nasdaq-listed global SaaS player, and he’s also a serial entrepreneur with a career spanning more than two decades.
He’s co-founded various internet startups and has also had the opportunity of working in startups that deal with petabytes of
[00:02:00]
data and billions of requests every year. I’m really interested in hearing his takes on the questions that we have for him today. Now, before I move on to that, I can see that some of you are new to the platform.
And I just wanted to quickly take you through what Unstract does and set the tone of this session so that you’d have better
[00:02:20]
context when I pose the questions to Shuveb. Unstract is an LLM-powered unstructured data ETL platform. If I were to briefly bucket the capabilities of the platform, I’d have two buckets over here:
the development phase as well as the deployment phase. The development phase is where you would be
[00:02:40]
uploading your documents and also specifying the data that you’re looking to extract and the schema you want it extracted in. This is done in a prompt engineering environment called Prompt Studio.
One of the key capabilities over here is that we would also be deploying a text extractor tool like LLMWhisperer, which is
[00:03:00]
Unstract’s in-house text extractor, to basically extract the raw text from your document and convert it into an LLM-consumable format. Once this stage is done, you can get started with defining your prompts and checking how the data is being extracted, and whether it’s in the format you’re looking for.
So once you’re happy with the extraction, you can move on
[00:03:20]
to deploy this particular Prompt Studio project. We currently support four major ways of deploying your project. First, we have API deployments: this is when you get your files from, let’s say, an application, and you want to process them and send the output data to another application.
And similarly, we also have ETL pipeline where
[00:03:40]
you would get the documents from a file system, process them, and send the results to another database or a data warehouse. And similarly, we also have Task Pipeline and Human-in-the-Loop deployments as well. So let me briefly take you to the platform and quickly walk you through how this works so that you’d get better context.
[00:04:00]
So hope you can see my screen now. Is it visible? Yeah. Alright, so this is the Unstract interface that I have over here and if you were to sign in for the first time, you would intuitively be guided to set up your connectors, which are basically certain prerequisites that you will require to
[00:04:20]
get started.
So this is an LLM-driven platform, and you can connect with the various LLMs that you want to use for data extraction. You’ll also have to set up vector DBs, embedding models, as well as text extractors. This is where you’d find LLMWhisperer along with
[00:04:40]
other text extractors in the market that you can choose.
So once you set up these prerequisites, you can move on to creating your Prompt Studio projects. This is where you’d be uploading your documents and specifying the data that you want extracted. For want of time, I’ll actually be getting into
[00:05:00]
one of the existing projects over here.
What I have over here is a credit card parser project, where we’ve specified a couple of prompts to extract key information or data from credit card statements. You can see that in this particular project, we’ve uploaded three credit card statements. What we have over here is the statement
[00:05:20]
from American Express, and over here we have various prompts.
One is extracting the customer name. We’re also extracting the customer address, the spend line items, as well as the payment information, and you can see how the output is structured. You would also give a description of how you want the output to be structured,
[00:05:40]
and you have access to various output data types that you can choose from as well.
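To make this pairing of prompts and output data types concrete, here’s a rough sketch in Python. The field names, prompt wording, and type labels below are hypothetical illustrations, not Unstract’s actual Prompt Studio format:

```python
import json

# Hypothetical Prompt Studio-style field definitions: each field pairs a
# natural-language prompt with the output data type expected from the LLM.
fields = {
    "customer_name": {
        "prompt": "Extract the full name of the credit card holder.",
        "type": "string",
    },
    "total_amount_due": {
        "prompt": "Extract the total amount due as a number, without currency symbols.",
        "type": "number",
    },
}

# A response like the one an LLM might return for one statement.
llm_output = json.loads(
    '{"customer_name": "Jane Doe", "total_amount_due": 1523.47}'
)

# Minimal check that the extraction matches the declared output types.
TYPE_MAP = {"string": str, "number": (int, float)}
for name, spec in fields.items():
    assert isinstance(llm_output[name], TYPE_MAP[spec["type"]]), name
```

The point of declaring the output type alongside each prompt is that the same checks can then be run over every statement you upload, regardless of the issuing bank’s layout.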
So this is how you can test how your prompts are working on the credit card statements. And just to give you an idea of how this would work across other statements, let me check out the statement from Bank of America. So you
[00:06:00]
can see here that we have a completely different credit card statement with a different layout, but the prompts are working perfectly well, and we are able to extract the required data as output as well.
So this is how you can upload multiple documents and test your prompts in Prompt Studio. And more importantly, under Raw View, this is
[00:06:20]
basically what the text extractor works on. It is able to extract the text from the original document and preserve its layout when it passes this to LLMs, because LLMs are known to consume information very much like humans do.
Since that’s the case, the best way to send the
[00:06:40]
information from your document would be to preserve the original layout. This is the context that will later be sent to the LLM, and it’s based on this context that these prompts are executed. Again, there are various features in Prompt Studio which I won’t get into in too much detail in this session.
So once you’ve basically set up this project, you can export it
[00:07:00]
as a tool and deploy it using any of the four deployment options that we’d seen. For instance, if I were to deploy it as an API, I would also get the API endpoint along with a downloadable Postman collection that I can use. So I think, folks, this sums up what I wanted to show you briefly as to how the platform functions.
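For a sense of what calling such an API deployment might look like from client code, here’s a minimal Python sketch. The endpoint path, bearer-token auth scheme, and response shape are assumptions for illustration; the generated Postman collection would define the real contract:

```python
import json
import urllib.request

def build_request(endpoint: str, api_key: str, payload: bytes) -> urllib.request.Request:
    """Build the POST request for a (hypothetical) extraction API deployment."""
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/pdf",
        },
        method="POST",
    )

def extract(endpoint: str, api_key: str, file_path: str) -> dict:
    """Send a document to the deployment and return the structured JSON result."""
    with open(file_path, "rb") as f:
        req = build_request(endpoint, api_key, f.read())
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

In a real integration, the returned JSON would then be forwarded to the downstream application, which is exactly the application-to-application flow described for API deployments above.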
And
[00:07:20]
let me just go back to my slides, and I think we are ready to get started with the questions. Are we good to go, Shuveb? Yeah, we are good to go, Mahashree. Alright, the first question I had for you, or the opening question, would be: who is Unstract for? Because you are involved in speaking to various prospects on a daily
[00:07:40]
basis.
So who are these people? What are their requirements, and why do they come to us? Very interestingly, after large language models became available, people figured out that, hey, we can use them to automate very complex processes that involve humans, processes that previously weren’t really automatable using
[00:08:00]
existing technology.
So businesses are figuring it out, and business leaders are reaching out to, essentially, their engineering teams. It’s almost always that we speak to only two personas, right? We either speak to engineering leaders or engineers, or we sometimes
[00:08:20]
speak to product owners who want to automate certain aspects of their product, right?
So these are the two personas that we usually speak to, right? Almost a hundred percent of the time, it’s split between the two. We very rarely speak to line-of-business owners or users. That’s very rare. Of course, some
[00:08:40]
of them could be highly technical, so they might have some kind of curiosity in figuring out what large language models can essentially do for them.
So sometimes they do reach out to us, but when we make a sale, it’s almost always to the engineering team, or sometimes the product team. Yeah. Alright. So with the product,
[00:09:00]
you can deploy it as a cloud offering or as an on-prem version, and it’s also open source.
Yes. So could you let me know a little bit about why the founders decided to open-source Unstract? Of course. I think, number one, much of the founding team, right, Arun and me, we come from an engineering background; everyone, of course,
[00:09:20]
comes from an engineering background, right?
At least Arun and me, we have a very strong open-source background, right? So we grew up on Linux and technologies like that, and we have a very strong affinity for open source. We know that pretty much all of the infrastructure today stands on the shoulders of open source.
So that’s of course a good reason. But I’d be lying if I
[00:09:40]
said there wasn’t also a business decision behind it. Think about the kind of customers we have: the customers that we have are typically enterprise-class customers.
Of course, we also have other startups, medium-sized companies, all sorts of companies. There’s some element of comfort that people get when they see that
[00:10:00]
software is open source, right? Because GitHub stars are a social currency, right? And also, people see that, hey, unfortunately, if a startup goes down,
people know that, okay, the open source project can continue. So there are multiple reasons for that. But also, for other entrepreneurs starting out, open source can be a channel for
[00:10:20]
leads. So there are multiple reasons behind it.
Yeah. Alright. So you spoke about different kinds of customers as well. Is there any particular segment or industry where we see most of our customers coming from, more than other industries? Oh yeah, totally. If you go by volume, I would
[00:10:40]
say that about 60 to 70% of the volume we process
should be from BFSI, also known as banking, financial services, and insurance. Typically, these tend to be regulated industries, and in regulated industries, what happens is that you’ll have to deal with all sorts of documents,
[00:11:00]
and then it’s very difficult to automate those processes if not for a solution like Unstract.
So that’s where we come in and help. But I also have to say that there are several SaaS companies that use Unstract, because sometimes their customers upload documents, and then what would’ve otherwise been a very manual
[00:11:20]
process can now be handled by essentially just calling an Unstract API, which makes the UX a lot better.
So there’s a sea change in the kind of UX that becomes possible whenever documents are involved, while using Unstract. But at the same time, the majority of the volume that we process today is from BFSI-segment companies.
[00:11:40]
Okay. So the companies that you mentioned are, as you said, in highly regulated industries as well.
Yeah. What are the security or privacy concerns that they have, given that they might be processing sensitive data? I know that we have customers ranging from startups to Fortune 500 enterprises. So what are the common security
[00:12:00]
concerns that you see these customers raising?
Yeah, sometimes they don’t want to send the data to the cloud. That’s a common concern they have. And earlier, when we started, there used to be another concern, which was: how can I send my data to a large language model? They used to be worried about that. Now we don’t see
[00:12:20]
that they worry about it anymore, because I think there’s some kind of trust now.
LLMs have become yet another cloud service. It’s like a database, or S3, or some storage; it doesn’t matter, right? So they simply send their data. And I think, more recently, all of the cloud providers have made it very clear that if you use
[00:12:40]
their large language model services, they will not train their large language models based on your data.
So of course, those enterprise agreements, or rather contracts, give customers that comfort level. But it doesn’t stop people from, let’s say, trying to install large language models on premise and things like that.
[00:13:00]
There are some extreme cases, but then again, most of them are okay with, number one, sending their data to large language models.
But when they feel that they absolutely need some sort of data security, they do ask us to give them single-tenant installations of Unstract and LLMWhisperer, which can be installed
[00:13:20]
essentially on their cloud, or what’s sometimes called on-premise, but it’s not really on-premise; it’s basically on their cloud.
Yeah. Okay. Alright. So in your earlier answer, you’d also hinted at how the UX of Unstract makes it easier to adopt for data extraction. Could you explain why Unstract
[00:13:40]
exists when engineers also have the choice of using other LLM-related dev tools, or could build this from scratch within their organizations?
Are there any other factors to this? That’s a great question. So see, what happens is that there are a lot of frameworks available which allow you to build agents, and which allow you to
[00:14:00]
extract data from unstructured documents. But at the same time, it’s a call that the customer needs to take:
do I really want to deal with all sorts of document types, structuring, and prompt engineering? There are so many things that go into making a system that can,
[00:14:20]
with very high accuracy and consistency, convert unstructured data into structured data, right?
Now, when that really needs to happen, would customers want to build more of a vertical solution and solve their business problem, or do they want to really get their hands dirty and solve what
[00:14:40]
I would call a low-level or platform-level problem? So that is a question that the customers have to answer.
Yeah, there are some customers who are really good at this. They don’t mind spending the time and energy on building a solution like this. But then again, they very quickly realize that it’s not as easy as it looks, and it’s just a big headache to maintain. But
[00:15:00]
yeah some folks do it.
I have seen that. Alright. Okay. When it comes to document extraction, of course, accuracy is something that’s very important. How good are LLMs compared to previous-gen tools, and how can LLMs go wrong when it comes to document extraction? Yeah,
[00:15:20]
yeah, they’re better for sure.
Compared to OCR technology, LLMs can do a really good job of structuring data and things like that, right? So definitely they can do that, but at the same time, things can go wrong. For example, LLMs can hallucinate, and unfortunately, there is no programmatic way of figuring out if a particular LLM has hallucinated during a
[00:15:40]
particular API call.
So it becomes super difficult to detect. That’s the reason why we have technologies like LLM Challenge, where we use two large language models that have to come to a consensus; unless they come to a consensus, we do not accept the result of a particular extraction. And then you can also have some kind of a rule engine and send it to a human in the loop.
So
[00:16:00]
we have all sorts of technologies that allow users to ensure that the results they’re getting from a large language model are really good. This is another problem that people who build a solution on their own have to take care of themselves, right?
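The two-model consensus idea behind LLM Challenge can be sketched roughly like this in Python; this is a simplified illustration of the approach, not Unstract’s actual implementation:

```python
def normalize(value):
    """Canonicalize values so trivial formatting differences don't block consensus."""
    if isinstance(value, str):
        return value.strip().lower()
    return value

def challenge(extraction_a: dict, extraction_b: dict) -> dict:
    """Accept a field only when two independently prompted LLMs agree on it;
    disagreements are flagged for review instead of being silently accepted."""
    accepted, disputed = {}, {}
    for field in extraction_a.keys() | extraction_b.keys():
        a, b = extraction_a.get(field), extraction_b.get(field)
        if normalize(a) == normalize(b):
            accepted[field] = a
        else:
            disputed[field] = {"model_a": a, "model_b": b}
    return {"accepted": accepted, "needs_review": disputed}

result = challenge(
    {"customer_name": "Jane Doe", "total_due": 1523.47},
    {"customer_name": "jane doe", "total_due": 1532.47},  # second model transposed digits
)
# customer_name reaches consensus; total_due is routed to human review
```

The value of the scheme is that a hallucination would have to occur identically in two independent models to slip through, which is far less likely than a single model hallucinating once.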
Yeah. All right. So you mentioned the human-in-the-loop
[00:16:20]
technology, which is also a common requirement across industries, and in certain industries it’s a hard requirement that you have a human in the loop. So can you elaborate a little bit more on it and how we are deploying it?
Sure, of course. So human in the loop can in some places be required by law. For example, in many countries, in fact, you cannot make insurance
[00:16:40]
decisions purely using AI. So you need to have a human in the loop. And sometimes it also happens that you have very bad input.
For example, you may have fax copies or doctors’ handwritten notes, all that kind of stuff, and then human oversight might actually help there, right? So for that reason, of course, we show the document on the left and the extracted
[00:17:00]
data on the right. But then the document can be very long, right?
Verifying whether a particular extracted value is correct or not can be very painful. That’s the reason why, in the human-in-the-loop technology that we have, whenever you click on a particular extracted value, we will scroll to the right location in the document and also highlight the area from
[00:17:20]
where the LLM actually picked up the data.
Right? So what happens is, this allows reviewers to review very efficiently. And this is critical, right? To enable this, we have worked pretty much from the ground up to make sure that this source-document highlighting can happen, so that efficient reviews can
[00:17:40]
happen, right?
So this is a multi-pronged problem, right? Human review can be a legal requirement, or it can be required because, like I said, a document can be of very bad quality sometimes. And humans generally have a lot of context around the document, which helps them figure out, hey, is something really going wrong?
Things like that.
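One way to support the click-to-highlight review flow described above is to carry source coordinates alongside every extracted value. The structure below is a hypothetical Python sketch, not Unstract’s actual data model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SourceSpan:
    """Where in the original document an extracted value was found."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class ExtractedValue:
    field: str
    value: str
    source: SourceSpan  # lets the review UI scroll to and highlight this region

extraction = [
    ExtractedValue("customer_name", "Jane Doe", SourceSpan(1, 72.0, 140.5, 210.0, 152.0)),
    ExtractedValue("total_due", "1523.47", SourceSpan(3, 402.0, 388.0, 470.0, 400.0)),
]

def highlight_target(items, field) -> Optional[SourceSpan]:
    """Return the page and bounding box the viewer should jump to when
    the reviewer clicks a given field; None if the field wasn't extracted."""
    for item in items:
        if item.field == field:
            return item.source
    return None
```

Because every value carries its own page and bounding box, the viewer can jump straight to the evidence rather than forcing the reviewer to scan a long document.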
[00:18:00]
Yeah. So that really helps. Alright. So talking about bad-quality documents: one way of handling them is probably with a manual check, the human in the loop. But we also have LLMWhisperer, which I think handles this to a certain extent, where it’s able to extract the text and also pre-process it into an LLM
[00:18:20]
consumable format.
So do you think it works in a way to fix bad-quality input documents? And can you also elaborate more on LLMWhisperer, because we also offer it as a standalone tool, and it’s doing pretty well as well? Sure. So LLMWhisperer is a service that can take documents and then provide raw text that can be fed to a large language model.
So
[00:18:40]
from a user perspective, there are a lot more users of LLMWhisperer than of Unstract, because there are a lot of other use cases for LLMWhisperer. People may get text from LLMWhisperer and use it in a RAG kind of application. RAG is very different: that’s usually some sort of question-answer use case, where the output of the large language model is
[00:19:00]
consumed more by a human being, right?
Whereas Unstract’s output is structured data, which you can use to automate use cases downstream, things like that, right? So LLMWhisperer is super important, because vision models are still not there as far as accuracy is concerned. You cannot just throw some document at a vision model
[00:19:20]
and ask it to give you structured data.
In many real-world use cases, it’ll fail. For example, you may also have to deal with documents like Excel sheets, all sorts of tough images, challenging documents like Word documents, documents that are handwritten, and documents that have checkboxes, radio buttons, smartphone-clicked images.
[00:19:40]
All sorts of complex documents. LLMWhisperer is a modular, separate service that solves this problem and solves it really well. We also have this technology called layout preservation, which can look at the layout of the input document and preserve the same layout in the output. So when you have multi
[00:20:00]
column layouts for documents, and forms that inherently have multiple columns, and then tables, checkboxes, and radio buttons, it can deal with all of these things.
That way, as Unstract, we don’t have to worry about what sort of input we are getting for the large language model to process and then
[00:20:20]
convert the data into the JSON that the user is asking for, right? So we absolutely don’t have to worry about that. And that’s also the reason why a lot of engineers use LLMWhisperer in their own custom applications that involve large language models.
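To make the layout-preservation idea concrete, here’s a toy Python sketch that places extracted words into a text grid based on their page coordinates, so the output text mirrors the page geometry. This illustrates the idea only; it is not LLMWhisperer’s actual algorithm:

```python
def render_layout(words, char_width=7.0):
    """Very rough layout preservation: place each word at a text column
    derived from its x coordinate, so the output mirrors the page geometry."""
    lines = {}
    for text, x, y in words:
        col = int(x / char_width)
        lines.setdefault(round(y), []).append((col, text))
    out = []
    for y in sorted(lines):
        buf = ""
        for col, text in sorted(lines[y]):
            buf = buf.ljust(col) + text
        out.append(buf.rstrip())
    return "\n".join(out)

# (text, x, y) triples as a PDF text extractor might report them
words = [
    ("Name:", 0, 10), ("Jane", 50, 10), ("Doe", 90, 10),
    ("Total:", 0, 30), ("1523.47", 50, 30),
]
print(render_layout(words))
# Name:  Jane Doe
# Total: 1523.47
```

A naive left-to-right text dump would instead interleave unrelated columns, separating labels from their values; keeping the geometry is what lets the LLM read the page the way a human would.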
Okay. All right. So we’ve spoken a little bit about Unstract and capabilities within the platform, like LLM Challenge, human in the loop,
[00:20:40]
and we’ve also now spoken about LLMWhisperer. So there’s an understanding of what the platform provides as a whole. But if someone were to embark on an LLM project for the first time, how should they approach it?
What are the pitfalls they should be wary of, or the best practices for them to run it effectively on the first go?
[00:21:00]
This is a very important question, because what happens is that when you introduce large language models into your organization and that project fails,
people are not going to trust that they can do automation using large language models for a long time, or at least until they forget that this failure occurred, right? So it’s a good
[00:21:20]
idea to strike a really good balance. Pick a process where a human is involved today, but one that isn’t, for example, very difficult to automate because some sort of large document is involved. So you want to take a look at, hey, projects that are not very complex, but at the
[00:21:40]
same time, projects that will have a significant impact on the organization if automated. It’s very important to hit this balance, right? Otherwise, what happens is that you get, for example, a project that is not very impactful, and then you automate it, and then there is no point.
Or you take a very complex project, and then it
[00:22:00]
fails straight away, and then people lose faith in the large language models. So you don’t want to have that either, right? That’s why it’s very important to find that balance, and then you go from there.
Alright. Okay. What does the future look like? Because right now we are seeing that people are moving into agentic workflows and setting
[00:22:20]
up these kinds of workflows in their businesses. So how does this play out in terms of document extraction and Unstract? How do we factor into this entire ecosystem?
What you mean is how we are approaching it? Can you repeat the question? Sorry. Yeah, sure. So what does the future look like for agentic automation when we talk about, yeah, Unstract, right?
[00:22:40]
So Unstract, as it is today, is either an API or an ETL pipeline, right? So essentially what happens is that, hey, we can take documents from
some storage, structure them, and then push that structured JSON data into a database. Or you call an API, you send a document, and then we send back JSON data. More and more, where we are going is into
[00:23:00]
workflows, right? Can we directly read emails? And then can we push the data into some kind of vertical system, like some CRM that customers already have?
So with a few customers, we’re already working on such use cases, where we are building agentic workflows that involve unstructured documents, for
[00:23:20]
sure. But then again, it goes well beyond an API or an ETL pipeline. We are basically able to handle a much larger portion of the workflow.
This is essentially work that the engineers were previously doing; now it’s becoming part of the platform. So essentially, think of these as more agentic workflows. Yeah.
[00:23:40]
All right. So would you say this would democratize the entire process and bring down the dependency that people have on IT teams when it comes to document extraction as well?
I think so, because today most of the heat is faced by the business teams, and they have a lot of dependence on the engineering teams. And we do believe that, because of how good
[00:24:00]
large language models have become, as far as instruction following is concerned, as far as their ability to follow prompts, build workflows, and use tools is concerned,
these are capabilities that Unstract will be able to leverage, and that’s what we want to expose to users for automation as well. Okay. So as
[00:24:20]
a final segment, we’re coming toward the final set of questions. Where do you see LLMs heading? Because there’s, again, a lot of buzz about MCPs and so on.
So where are LLMs heading when it comes to document extraction? And what does the future roadmap for Unstract look like in terms of the capabilities that we are looking to integrate
[00:24:40]
into the platform? I think LLMs are becoming better and better, and I would especially like to highlight what is known as instruction-following ability.
Now the model providers are specifically training large language models to follow instructions very closely. And this is important because, when it comes to building workflows, and when it comes to extracting structured data from very
[00:25:00]
complex documents where a lot of business rules are involved.
We are seeing, for example, prompts that are close to a thousand lines. And out of those thousand lines, there could be hundreds of instructions given to the large language model: instructions for extracting data, for transforming data as it is extracted, or for following certain rules of extraction, and things like that,
[00:25:20]
right?
So unless these large language models have an amazing ability to follow rules, they wouldn’t succeed in these particular use cases. And mainly because large language models are used in coding-related use cases, these model providers do train them specifically for the instruction-following ability that they
[00:25:40]
have, right?
This, I believe, will only get better, and as agentic use cases become more and more popular, this ability will only improve. And also, the cost is falling. If you take a look at the past 18 months, I think even just for OpenAI, the cost of the large language models has fallen
[00:26:00]
by more than 90%.
And that is that is like pretty significant, right? So combining all of this I do believe that a lot more people will try agent workflows. Data extraction workflows, unstructured data extraction workflows. So I do believe that we will get to see a lot more automation happening in, in enterprises using large
[00:26:20]
language models.
Okay. I think that sums up the questions I had for you, Shuveb. Is there anything you would like to tell us as a closing note before we end the session? We are super excited, and like I said, these models are becoming better, and we are taking on more and more challenging tasks, tasks that even we sometimes thought we could not automate
[00:26:40]
using large language models.
So that is changing and I think that is exciting and this can only get better and we, I can only imagine. How large language models, one a year from now, two years from now, the what kind of capabilities they’ll have and we are waiting on it. I think it’s not about, cutting jobs or anything, but it’s about a, how humans can do more interesting work.
And
[00:27:00]
then the mundane work can absolutely be cut down, no doubt about that. Alright. Okay then. Thank you so much for joining this conversation today. It was really nice talking to you, and hopefully we can connect sometime in the future as well. Thank you. Thanks for having me here.
And thank you, everybody, as well for joining the session. If you would like to
[00:27:20]
learn more about Unstract and how you can deploy it in your own organization, you can always sign up for a free demo with one of our experts. I’ll just ask my team to drop the link in the chat, and you can check that out.
So yeah, thank you so much. Thank you then.
[00:27:40]
Bye.