From Unstructured to Understandable: Enhancing LLMs with RAG using Unstructured API

Lakshmi narayana .U
Published in Stackademic
10 min read · May 5, 2024


Image generated by the author with DALL·E 3

Role of Retrieval-Augmented Generation in Enhancing Large Language Models and the Importance of Data Preprocessing

Retrieval-Augmented Generation (RAG) is a novel approach designed to enhance the capabilities of Large Language Models (LLMs). LLMs are artificial intelligence systems that can comprehend and generate text. RAG augments these systems by integrating a search function that can retrieve the most relevant and up-to-date information from extensive databases. This feature enables the AI to utilize the latest information when answering queries or making decisions.

RAG was initially proposed in a research paper by Facebook in 2020. The concept of RAG was conceived as a blend of two types of memory: one resembling the AI’s existing knowledge, and the other akin to a search engine. This combination allows the AI to access and use information more effectively, particularly when answering complex questions. This concept has since been developed further and is now employed in many AI applications.

In the domain of Natural Language Processing (NLP), which is about enabling computers to understand and generate human language, RAG has been a game-changer. Traditional language models could generate text, but they often couldn’t incorporate extra, specific information while generating text. RAG addresses this problem by merging the search capabilities of retrieval models with the text-generating abilities of generative models. This has unlocked new possibilities in NLP, making RAG an essential tool for tasks that require detailed, informed responses.

RAG operates using two main components: the retrieval model and the generative model. The retrieval model acts like a librarian, extracting relevant information from databases or collections of documents. This information is then handed over to the generative model, which functions like a writer. The generative model uses the retrieved data to compose coherent and informative text, ensuring that the responses are accurate and full of context.
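As a rough mental model (not the code used later in this post), the flow can be sketched in a few lines of Python; the retriever and llm objects here are purely illustrative stand-ins rather than any particular library's API:

# A minimal, illustrative sketch of retrieve-then-generate (interfaces assumed)
def answer_with_rag(question, retriever, llm, k=4):
    # 1. Retrieval model: pull the k most relevant chunks for the question.
    context_chunks = retriever.search(question, top_k=k)
    context = "\n\n".join(context_chunks)
    # 2. Generative model: compose an answer grounded in that context.
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)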

While RAG significantly enhances the capabilities of LLMs, it’s important to acknowledge its dual nature. On the one hand, RAG mitigates issues like false information generation and data leakage, improving the trustworthiness of AI interactions. On the other hand, the quality of RAG’s responses depends heavily on the quality of the retrieved data, underscoring the need for robust and reliable data sources.

This highlights the essential role of data processing in AI and ML applications, such as RAG. The quality, accuracy, and relevance of the data retrieved by the retrieval model directly influence the output generated by the generative model. Hence, data must be carefully processed and curated, ensuring it is accurate, relevant, and free from bias or inconsistencies.

In this context, DeepLearning.AI’s short course on preprocessing unstructured data becomes particularly relevant. It delves into the crucial aspects of data preprocessing, a fundamental step in any AI or ML project, covering techniques such as data cleaning, normalization, transformation, and feature extraction that prepare data for further analysis and modeling. By preprocessing data effectively, one can make AI models more accurate and reliable, so the course provides essential knowledge and skills for anyone working in AI or ML.

Simple Implementation of Unstructured (Open-Source) with Langchain

Langchain is an essential tool in the implementation of Retrieval Augmented Generation (RAG) models, particularly in the context of unstructured data. It plays a critical role in the ingestion phase of the RAG pipeline, assisting in the transformation of raw, unstructured data into a format that can be effectively utilized by the model.

The first step in this process is loading the data, which is where document loaders come into play. Document loaders read the unstructured data and divide it into manageable chunks. This matters for efficiency and quality: smaller chunks can be embedded and retrieved quickly, and they keep the retrieved context within the model’s input limits.
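A minimal sketch of this step, assuming a local ./docs folder of mixed files (the folder name and chunk sizes are illustrative):

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Read the unstructured files, then divide them into manageable chunks
raw_docs = DirectoryLoader("./docs").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(raw_docs)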

Once the data is loaded, Langchain then assists in the preprocessing of the data. This involves normalizing the data and converting it into a format that can be understood by the RAG model. This step is particularly important when dealing with unstructured data, as it can come in many different forms, such as text, images, tables, and more.
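As a small example of this normalization step, unstructured’s cleaning helpers can tidy up the text of each element in place. This is a sketch, assuming raw_elements holds the text elements returned by one of the library’s partition_* functions:

from unstructured.cleaners.core import clean, group_broken_paragraphs

for element in raw_elements:  # raw_elements: assumed output of a partition_* call
    element.apply(
        group_broken_paragraphs,  # re-join lines split by hard wraps
        lambda text: clean(text, extra_whitespace=True, dashes=True),
    )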

After preprocessing, Langchain helps to generate embeddings for each chunk of data. These embeddings are numerical representations that capture the semantic meaning of each chunk. The embeddings are then loaded into an index, typically backed by a vector store, which serves as a searchable database that the RAG model can query to retrieve relevant chunks of data.

Langchain’s document loaders are designed to import data from various sources, offering a “load” method for instant data transfer and a “lazy load” function for gradual memory loading. This versatility enables the Unstructured package to handle a wide range of file types, including text files, PowerPoints, HTML, PDFs, and images. The Unstructured library offers open-source components for ingesting and preprocessing both images and text documents, transforming unstructured data into a format suitable for LLMs. These modular functions and connectors streamline the data processing workflow, making it adaptable to different platforms and efficient in producing structured outputs.
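A quick sketch of the lazy-loading style mentioned above (the file path is illustrative): lazy_load yields Documents one at a time instead of reading everything into memory at once.

from langchain_community.document_loaders import UnstructuredFileLoader

loader = UnstructuredFileLoader("./example_data/large_report.pdf")
for doc in loader.lazy_load():  # iterate document by document
    print(doc.metadata.get("filename"))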

Simple implementation of the open source version of Unstructured and Langchain.

Data loading for different file formats (.txt, .pptx, etc.)

from langchain_community.document_loaders import (
    UnstructuredFileLoader,
    UnstructuredPowerPointLoader,
)

# Load a plain-text file
loader = UnstructuredFileLoader("./example_data/state_of_the_union.txt")
docs = loader.load()

# Load a PowerPoint file, returning individual elements
loader = UnstructuredPowerPointLoader("some_test.pptx", mode="elements")
docs = loader.load()

Basic RAG with the source as a .pdf file

The code snippets below demonstrate a straightforward implementation of RAG. It utilizes the open-source Unstructured library to process a basic PDF file and subsequently enables chat based on the content of the file.

# Warning control
import warnings
warnings.filterwarnings('ignore')
##
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.pdf import partition_pdf
from unstructured.staging.base import dict_to_elements
##
import chromadb
##

# Pre-process the pdf file
from langchain_community.document_loaders import UnstructuredFileLoader

filename = "bigita.pdf"  # source PDF (Bhagavad Gita)
loader = UnstructuredFileLoader(filename, strategy="fast", mode="elements")
docs = loader.load()
docs[:5]  # sample check

# Partition the same PDF into elements for title-based chunking
pdf_elements = partition_pdf(filename=filename)
##

# Load the Documents into the Vector DB
elements = chunk_by_title(pdf_elements)

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    del metadata["languages"]
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))

embeddings = OpenAIEmbeddings(api_key="your_key")
vectorstore = Chroma.from_documents(documents, embeddings)

## Set up the retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)
from langchain.prompts.prompt import PromptTemplate
from langchain_openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain

## prompt template
template = """You are an AI assistant for answering questions about the Bhagavad Gita document.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
If the question is not about the document, politely inform them that you are tuned to only answer questions about the Bhagavad Gita.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(template=template, input_variables=["question", "context"])
llm = OpenAI(api_key="your_key",temperature=0.7)

doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
question_generator_chain = LLMChain(llm=llm, prompt=prompt)
qa_chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)
qa_chain.invoke({
    "question": "What is the summary of Gnana Yoga?",
    "chat_history": [],
})["answer"]

New Features and Learning Resources for Unstructured Data Processing

Unstructured has recently introduced a hosted API, giving users the flexibility to make direct API calls; quickstart instructions are available in its GitHub repository. They have also launched a beta version of their Chipper model, which delivers superior performance when processing high-resolution, complex documents.
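Assuming the beta model is exposed through the same hi_res_model_name parameter used later in this post, switching to it would look roughly like this (the file name and model name are assumptions; verify against the current Unstructured docs):

from unstructured_client.models import shared

with open("complex_report.pdf", "rb") as f:  # illustrative file name
    files = shared.Files(content=f.read(), file_name="complex_report.pdf")

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    hi_res_model_name="chipper",  # assumed beta model name; check current docs
)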

A few related concepts of Unstructured covered in the deeplearning.ai course:

Note: Please check the References section for related technical papers.

- Data Handling Techniques to manage different file types and data formats, such as numeric data in Excel, reports in Word or PDF, and presentations in PowerPoint (a minimal sketch follows this list).
- Data Normalization for parsing and normalizing data to ensure accessibility and usability for LLM RAG systems.
- Metadata Utilization to enhance the system’s ability to retrieve and interpret information effectively.
- Document Image Analysis and the application of vision transformers to understand complex document layouts and tables.
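For the first point, the open-source library’s auto-partitioner offers a single entry point across file types. A minimal sketch, assuming the file names are illustrative and the format-specific extras (e.g., unstructured[all-docs]) are installed:

from unstructured.partition.auto import partition

for path in ["report.pdf", "slides.pptx", "notes.docx", "data.xlsx"]:
    elements = partition(filename=path)  # dispatches to the right partitioner by file type
    print(path, len(elements), "elements")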

Detailed Implementation: Handling Multiple Data Types with Unstructured API and Langchain

Note: Please sign up for a free Unstructured API key and API URL. As per its GitHub page, the free tier serves up to 1000 pages.

For this implementation, we will use a .pdf file of the phi-3 technical report along with its press release note saved in a .md file format.

# Warning control
import warnings
warnings.filterwarnings('ignore')

##
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.chunking.title import chunk_by_title
from unstructured.partition.md import partition_md
from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements

import chromadb
# Set-up Unstructured API credentials
DLAI_API_KEY = "your_key"
DLAI_API_URL = "https://self-yourvalue.api.unstructuredapp.io/general/v0/general"
s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)

Now we process the elements of the .pdf file through the Unstructured API, using the hi_res strategy with the YOLOX document-layout detection model. Executing this snippet may take a few minutes, depending on the size of the file.

filename = "2404.14219v2.pdf"

with open(filename, "rb") as f:
files=shared.Files(
content=f.read(),
file_name=filename,
)

req = shared.PartitionParameters(
files=files,
strategy="hi_res",
hi_res_model_name="yolox",
pdf_infer_table_structure=True,
skip_infer_table_types=[],
)

try:
resp = s.general.partition(req)
pdf_elements = dict_to_elements(resp.elements)
except SDKError as e:
print(e)
pdf_elements[0].to_dict()
# Output
{'type': 'Title',
'element_id': '9c5b86014d80d0c58dbb38f6d0c9fcf4',
'text': 'Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone',
'metadata': {'filetype': 'application/pdf',
'languages': ['eng'],
'page_number': 1,
'filename': '2404.14219v2.pdf'}}

We can check the total number of elements and how many belong to a particular category, in this case ‘Table’, as shown below.

tables = [el for el in pdf_elements if el.category == "Table"]
len(pdf_elements)
# output = 103
len(tables)
# output = 2
table_html = tables[0].metadata.text_as_html
# Pretty-print the extracted table HTML for reference
from io import StringIO
from lxml import etree

parser = etree.XMLParser(remove_blank_text=True)
file_obj = StringIO(table_html)
tree = etree.parse(file_obj, parser)
print(etree.tostring(tree, pretty_print=True).decode())
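As an optional follow-up (assuming pandas is installed), the extracted HTML table can be turned into a DataFrame for further analysis:

import pandas as pd

df = pd.read_html(StringIO(table_html))[0]  # parse the HTML table into a DataFrame
print(df.head())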

Further processing…

reference_title = [
    el for el in pdf_elements
    if el.text == "References"
    and el.category == "Title"
][0]
reference_title.to_dict()
#output
{'type': 'Title',
'element_id': 'a0d7deccf89e42d02a9d66b0c1889689',
'text': 'References',
'metadata': {'filetype': 'application/pdf',
'languages': ['eng'],
'page_number': 8,
'filename': '2404.14219v2.pdf'}}
references_id = reference_title.id

for element in pdf_elements:
    if element.metadata.parent_id == references_id:
        print(element)
        break
# output
# [AI23] Meta AI. Introducing meta llama 3: The most capable openly available llm to date, 2023.

# Drop every element that belongs to the References section
pdf_elements = [el for el in pdf_elements if el.metadata.parent_id != references_id]
len(pdf_elements)
# output = 94

Preprocessing the phi-3 launch note

# Actual weblink: https://azure.microsoft.com/en-us/blog/introducing-phi-3-redefining-whats-possible-with-slms/
filename = "Introducing Phi-3 Redefining what’s possible with SLMs.md"
md_elements = partition_md(filename=filename)

Loading the Documents into the Vector DB and setting up for chat

elements = chunk_by_title(pdf_elements + md_elements)

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

documents = []
for element in elements:
    metadata = element.metadata.to_dict()
    del metadata["languages"]
    metadata["source"] = metadata["filename"]
    documents.append(Document(page_content=element.text, metadata=metadata))

embeddings = OpenAIEmbeddings(api_key="your-key")
vectorstore = Chroma.from_documents(documents, embeddings)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 6},
)
from langchain.prompts.prompt import PromptTemplate
from langchain_openai import OpenAI
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
template = """You are an AI assistant for answering questions about the phi-3 model technical document.
You are given the following extracted parts of a long document and a question. Provide a conversational answer.
If you don't know the answer, just say "Hmm, I'm not sure." Don't try to make up an answer.
If the question is not about phi-3 model, politely inform them that you are tuned to only answer questions about phi-3 model.
Question: {question}
=========
{context}
=========
Answer in Markdown:"""
prompt = PromptTemplate(template=template, input_variables=["question", "context"])
llm = OpenAI(api_key="your-key", temperature=0.7)

doc_chain = load_qa_with_sources_chain(llm, chain_type="map_reduce")
question_generator_chain = LLMChain(llm=llm, prompt=prompt)
qa_chain = ConversationalRetrievalChain(
    retriever=retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)
qa_chain.invoke({
    "question": "How did phi-3 perform on academic benchmarks?",
    "chat_history": [],
})["answer"]
##
# output
# ' Phi-3 models performed well on academic benchmarks, particularly in reasoning and logic capabilities. However, they did not perform as well on factual knowledge benchmarks due to their smaller model size. It is worth noting that further investigation is being conducted on some benchmarks. \nSOURCES: Introducing Phi-3 Redefining what’s possible with SLMs.md, 2404.14219v2.pdf'
# Limiting the source to one file type
filter_retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 1,
        "filter": {"source": "Introducing Phi-3 Redefining what’s possible with SLMs.md"},
    },
)
filter_chain = ConversationalRetrievalChain(
    retriever=filter_retriever,
    question_generator=question_generator_chain,
    combine_docs_chain=doc_chain,
)
filter_chain.invoke({
    "question": "What is Krishi Mitra and what role did phi-3 play in it?",
    "chat_history": [],
})["answer"]
#
# Output
# ' "Krishi Mitra is a farmer-facing app that reaches over a million farmers. Phi-3 is being used as part of the collaboration between ITC and Microsoft for the Krishi Mitra copilot, with the goal of improving efficiency while maintaining the accuracy of a large language model."\nSOURCES: Introducing Phi-3 Redefining what’s possible with SLMs.md'

In conclusion, the integration of the Unstructured API, the YOLOX layout-detection model, and a vector database forms a powerful toolchain for handling various data types in machine learning applications. By processing a diverse set of documents, from PDF files to markdown notes, we can extract useful information and load it into the vector database. This setup not only allows for efficient data retrieval but also serves as a robust foundation for building sophisticated applications such as chatbots. Using these tools and techniques, we can effectively navigate the complexities of unstructured data, paving the way for more impactful AI solutions.
