Unstructured document processing strategies

2024.12.05

In our opinion, there are three approaches worth considering when looking for a solution to process unstructured documents:

  • Cloud-based document processing services
  • Proprietary LLMs
  • Open source LLMs

Cloud-based document processing services
Cloud providers offer specialized tools for extracting data from documents:

  • AWS (Textract)
  • Azure (AI Document Intelligence)
  • Google (Document AI)

These solutions are similar, but we recommend AWS Textract, as it is the most mature and most comprehensive. These services have many features such as OCR, table extraction and filters built in, and they support many document formats out of the box. We can integrate these services into Dbase through APIs, which means the source documents must be submitted to the cloud.
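To illustrate, here is a minimal sketch of such an API integration using the AWS SDK for Python (boto3). The bucket and document names are placeholders, and multi-page PDFs would need Textract's asynchronous APIs rather than the synchronous call shown here.

```python
import boto3

# Textract client; credentials and region come from the standard AWS configuration.
textract = boto3.client("textract")

def extract_text_and_tables(bucket: str, key: str) -> list[str]:
    """Run Textract on a document stored in S3 and return the detected text lines.

    Note: analyze_document works synchronously on single-page documents;
    multi-page PDFs require the asynchronous start_document_analysis API.
    """
    response = textract.analyze_document(
        Document={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES", "FORMS"],  # OCR plus table and key-value extraction
    )
    # Each detected element comes back as a "block"; LINE blocks carry the raw text.
    return [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]

# Hypothetical usage with placeholder names:
# lines = extract_text_and_tables("my-documents-bucket", "invoices/sample.pdf")
```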

Pros:

  • Getting started is easier
  • Accuracy
  • Scalability
  • No local server infrastructure is needed

Cons:

  • Vendor lock-in
  • Limited flexibility
  • Data privacy concerns

Proprietary LLMs
The well-known AI providers offer APIs to their large language models. With this approach, we submit the prepared document to the model through its API, and the processing takes place on the provider's infrastructure.

Proprietary LLMs:

  • OpenAI GPT
  • Anthropic Claude
  • Google Gemini
  • Cohere Command R+

We recommend OpenAI, because it is still the state-of-the-art AI solution on the market. OpenAI runs on Microsoft Azure data centers, it is possible to opt out of data retention, and EU data centers can be requested.
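As an illustration, a minimal sketch of sending already-extracted document text to the OpenAI API; the model name, prompt and field list are placeholder assumptions rather than a final design.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_fields(document_text: str) -> str:
    """Ask the model to pull a few example fields out of already-OCR'd text.

    The model name and the field list are illustrative placeholders.
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Extract the invoice number, date and total amount "
                           "from the document and answer as JSON.",
            },
            {"role": "user", "content": document_text},
        ],
        temperature=0,  # keep the extraction as deterministic as possible
    )
    return response.choices[0].message.content
```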

Pros:

  • Flexibility
  • Better understanding of the context
  • Rapid model development
  • Multimodal (text and image input)
  • Can be used for additional features such as a customer service chatbot
  • No local server infrastructure is needed

Cons:

  • Data privacy concerns
  • Vendor lock-in (although OpenAI’s API format is also supported by many other LLM providers)
  • Acts as a black box

Open source LLMs
An open source large language model is a generative AI solution that can be deployed and trained locally. The base model receives thorough initial training that enables it to understand context, and it can then be trained further (fine-tuned) for a specific field.

Open source/weight LLMs:

  • Meta Llama
  • Google Gemma


We recommend Meta’s Llama model, which is being developed rapidly and has multimodal capabilities (Llama 3.2 can process images).
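As a hedged sketch, this is roughly how a Llama model could be served locally with the Hugging Face transformers library, assuming a recent transformers version (with the accelerate package for device placement) and access to the gated model weights; the model ID and prompt are illustrative only.

```python
from transformers import pipeline

# Load a locally hosted Llama model; the model ID is illustrative and the
# weights must be downloaded (and licensed) from Hugging Face beforehand.
generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    device_map="auto",  # place the model on GPU(s) if available (needs accelerate)
)

def extract_fields_locally(document_text: str) -> str:
    """Prompt the local model to extract example fields from OCR'd text."""
    messages = [
        {"role": "system", "content": "Extract the invoice number, date and "
                                      "total amount from the document as JSON."},
        {"role": "user", "content": document_text},
    ]
    result = generator(messages, max_new_tokens=256)
    # The pipeline returns the full chat; the last message is the model's answer.
    return result[0]["generated_text"][-1]["content"]
```
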
Pros:

  • No data privacy concerns
  • No vendor lock-in
  • Unique solution (in the long term, by training the model)
  • Code transparency
  • Full control

Cons:

  • Higher upfront costs because local server infrastructure is needed
  • Accuracy depends on the size of the model; a bigger model requires more server capacity
  • Scaling has to be handled in-house
  • Server utilization has to be managed

Conclusion
It is difficult to predict the performance of an AI model for a specific use case. On the one hand, the models are extremely complex and non-deterministic; on the other hand, the input PDF documents are heterogeneous: some have been scanned, and their formatting can also differ. We also have to take into account that this area changes very quickly, so it is not worth committing to a single supplier or solution.

We propose a hybrid solution:

  1. Use AWS Textract for document preprocessing and text retrieval. 
    It makes sense to take advantage of Textract's excellent OCR and text processing capabilities.
  2. Then use an LLM to extract the final information and for additional features like question answering, summarization, etc. 
    Efforts should be made to integrate the language model in an easily interchangeable way (a sketch follows below), so that the benefits of new versions or more advanced models released later can be utilized more easily, thereby reducing dependency on vendors.
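To make the second step concrete, here is a rough sketch of such an interchangeable integration in Python. The LlmExtractor protocol, the OpenAiExtractor adapter and the model name are illustrative assumptions, not a finished design; the point is only that the pipeline depends on a thin interface rather than on a specific vendor, and that the Textract output (step 1, e.g. the extract_text_and_tables helper sketched earlier) is what gets handed to the model.

```python
from typing import Protocol

class LlmExtractor(Protocol):
    """The only interface the pipeline depends on; any concrete model
    (OpenAI, a local Llama, a future provider) just has to implement it."""
    def extract(self, document_text: str) -> str: ...

class OpenAiExtractor:
    """Adapter wrapping the proprietary-LLM sketch shown earlier (hypothetical)."""
    def __init__(self, model: str = "gpt-4o") -> None:  # illustrative model choice
        from openai import OpenAI
        self._client = OpenAI()
        self._model = model

    def extract(self, document_text: str) -> str:
        response = self._client.chat.completions.create(
            model=self._model,
            messages=[
                {"role": "system", "content": "Extract the key fields from the document as JSON."},
                {"role": "user", "content": document_text},
            ],
            temperature=0,
        )
        return response.choices[0].message.content

def process_document(textract_lines: list[str], llm: LlmExtractor) -> str:
    """Step 2 of the hybrid pipeline: take the text lines produced by Textract
    in step 1 and hand them to whichever LLM adapter is currently configured."""
    return llm.extract("\n".join(textract_lines))

# Swapping providers later only means passing a different LlmExtractor
# implementation (e.g. a local Llama adapter) without touching the pipeline code.
```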

by András Kenéz, Head of Software Development, Software Architect
