Advanced Text Analysis and Retrieval System Using Python Libraries
Building an advanced text analysis and retrieval system can be a complex task, but with the right Python libraries, it becomes much more manageable. In this article, we’ll explore how to leverage libraries like langchain, pymupdf, cohere, pinecone-client, PyPDF2, openai, datasets, and ragas to create a powerful system for processing and analyzing textual data.
Step 1: Install and Import Libraries
The first step is to install and import the necessary libraries. This ensures that all the required tools are available for the analysis and retrieval tasks.
# pip install langchain pymupdf cohere pinecone-client PyPDF2 openai datasets ragas
# pip install --upgrade --quiet langchain-google-genai pillow
# pip install python-dotenv
import os
import random

from dotenv import load_dotenv
from datasets import Dataset
from pinecone import Pinecone, ServerlessSpec
from PyPDF2 import PdfReader

from langchain.vectorstores import Pinecone as PineconeStore
from langchain.embeddings import CohereEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai.chat_models import AzureChatOpenAI

import google.generativeai as genai

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
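The code in this article assumes the API keys live in a local keys.env file loaded with python-dotenv. As a minimal sketch, that file might look like this (the variable names match the os.getenv calls used below; the values are placeholders, not real credentials):

# keys.env -- placeholder values, substitute your own credentials
PINECONE_API_KEY=your-pinecone-key
COHERE_API_KEY=your-cohere-key
AZURE_OPENAI_API_KEY=your-azure-openai-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
GOOGLE_API_KEY=your-google-api-key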
Step 2: Create Pinecone Index and Load PDF
The next step involves initializing a Pinecone index for efficient vector search and loading a PDF document for analysis. This sets up the data storage and retrieval system.
load_dotenv('keys.env')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
COHERE_API_KEY = os.getenv('COHERE_API_KEY')
INDEX_NAME = "quickstart"

pc = Pinecone(api_key=PINECONE_API_KEY)
embeddings = CohereEmbeddings(model="embed-multilingual-v3.0", cohere_api_key=COHERE_API_KEY)

# Extract and chunk the PDF text up front; the chunks are reused in Step 4
# whether or not the index already exists.
reader = PdfReader('example.pdf')
pages = ""
for page in reader.pages:
    pages += page.extract_text()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(pages)

# pc.delete_index(INDEX_NAME)  # uncomment to drop and rebuild the index from scratch
if INDEX_NAME not in [index.name for index in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=1024,  # embed-multilingual-v3.0 produces 1024-dimensional vectors
        spec=ServerlessSpec(cloud='aws', region='us-west-2'),
    )
    docsearch = PineconeStore.from_texts(texts, embeddings, index_name=INDEX_NAME)
else:
    # Reuse the existing index; "text" is the metadata field that stores each chunk.
    index = pc.Index(INDEX_NAME)
    docsearch = PineconeStore(index, embeddings, "text")
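As a quick sanity check (not part of the original pipeline), you can run a similarity search against the index before going further. similarity_search is the standard LangChain vector-store method, and because the embeddings are multilingual, an English query can retrieve Hebrew chunks:

# Sanity check: fetch the two chunks most similar to a test query.
sample_hits = docsearch.similarity_search("What is this document about?", k=2)
for hit in sample_hits:
    print(hit.page_content[:200])  # first 200 characters of each match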
Step 3: Create the Large Language Model Client Using Azure
This step connects to a chat model deployed on Azure OpenAI. The model will generate the evaluation questions and process the textual content extracted from the documents.
GPT_DEPLOYMENT_NAME = "chatgpt_16k"
AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY')
AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    openai_api_key=AZURE_OPENAI_API_KEY,
    azure_deployment=GPT_DEPLOYMENT_NAME,
    validate_base_url=False,
)
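A one-line smoke test is a cheap way to confirm the deployment name, endpoint, and key are wired up correctly before building anything on top of the model (note that chatgpt_16k is the author's deployment name; yours will differ):

# Smoke test: fails fast if the endpoint, key, or deployment name is wrong.
print(llm.invoke("Reply with the single word: ok").content)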
Step 4: Retrieve Random Documents and Generate Questions
In this step, we sample random chunks from the document and generate questions whose answers lie within each chunk's content. These questions are stored in a list for later evaluation.
retriever = docsearch.as_retriever()
# Sample two distinct chunks; random.sample avoids drawing the same chunk twice.
random_documents = random.sample(texts, k=2)
questions = []
documents = []

template = "Generate a question in Hebrew whose answer is contained in the following text: {doc}"
prompt = ChatPromptTemplate.from_template(template)
# The chunk is the only input the prompt needs, so the chain is simply
# prompt -> model -> string parser.
question_chain = prompt | llm | StrOutputParser()

for doc in random_documents:
    questions.append(question_chain.invoke({"doc": doc}))
    documents.append(doc)
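Printing the generated questions before moving on is a useful checkpoint, both to verify they are in Hebrew and to confirm each one is actually answerable from its source chunk:

for q, d in zip(questions, documents):
    print("Q:", q)
    print("Source chunk:", d[:120], "...")  # first 120 characters for context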
Step 5: Create Chat Prompt Using Gemini
Using each document as context, we prompt Gemini to answer the corresponding question. The responses are saved in a list; they will serve as the ground-truth answers during evaluation.
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

generation_config = {
    "temperature": 1,
    "top_p": 1,
    "top_k": 1,
    "max_output_tokens": 1024,
}
safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"},
]
gemini = genai.GenerativeModel(
    model_name="gemini-pro",
    safety_settings=safety_settings,
    generation_config=generation_config,
)

answers = []
chat = gemini.start_chat()
for question, document in zip(questions, documents):
    template = f"""You are an assistant for question-answering tasks.
Use the retrieved context below to answer the question in Hebrew.
Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {document}
Answer:
"""
    response = chat.send_message(template)
    answers.append(response.text)
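A quick inspection of the question-answer pairs verifies that Gemini respects the language and two-sentence constraints from the prompt:

for q, a in zip(questions, answers):
    print("Q:", q)
    print("A:", a)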
Step 6: Prepare Dataset for Metrics Evaluation
Here, we assemble the dataset used to evaluate the retrieval and question-answering system. Each row pairs a question with the contexts the retriever returns for it, a fresh answer from Gemini (asked without the source chunk this time), and the context-grounded answer from Step 5 as the ground truth.
# ragas expects each ground truth as a list of reference answers.
ground_truths = [[answer] for answer in answers]

contexts = []
rag_answers = []
for query in questions:
    # Retrieve the chunks the system considers relevant to the question.
    contexts.append([doc.page_content for doc in retriever.get_relevant_documents(query)])
    # Ask Gemini the question again, this time without handing it the source chunk.
    response = chat.send_message(query)
    rag_answers.append(response.text)

data = {
    "question": questions,
    "answer": rag_answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
}
dataset = Dataset.from_dict(data)
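Before running the evaluation, a quick look at the Dataset object confirms the four columns are aligned, one row per question:

print(dataset)                 # column names and number of rows
print(dataset[0]["question"])  # spot-check the first example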
Step 7: Evaluate Metrics Using Dataset
Finally, we evaluate various metrics using the prepared dataset. This evaluation helps us understand the strengths and weaknesses of our system, providing insights into areas for improvement.
from IPython.display import display

result = evaluate(
    dataset=dataset,
    llm=llm,
    metrics=[
        context_precision,
        context_recall,
        answer_relevancy,
        faithfulness,
    ],
)
df = result.to_pandas()
display(df)
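result.to_pandas() yields one row per sample with a column per metric, so the scores can be persisted and compared across runs; for example (the file name here is arbitrary):

df.to_csv('ragas_results.csv', index=False)  # save scores for later comparison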
Conclusion
Building an advanced text analysis and retrieval system with Python is far more manageable than it first appears. By leveraging libraries like langchain, pymupdf, cohere, pinecone-client, PyPDF2, openai, datasets, and ragas, we can create a robust system capable of extracting insights and retrieving relevant information from documents, and of measuring how well it does so. This step-by-step guide provides a solid foundation for understanding the process and implementing your own text analysis and retrieval system.