Advanced Text Analysis and Retrieval System Using Python Libraries
Building an advanced text analysis and retrieval system can be a complex task, but with the right Python libraries, it becomes much more manageable. In this article, we’ll explore how to leverage libraries like langchain, pymupdf, cohere, pinecone-client, PyPDF2, openai, datasets, and ragas to create a powerful system for processing and analyzing textual data.
Step 1: Install and Import Libraries
The first step is to install and import the necessary libraries. This ensures that all the required tools are available for the analysis and retrieval tasks.
# pip install langchain pymupdf cohere pinecone-client PyPDF2 openai datasets ragas
# pip install --upgrade --quiet langchain-google-genai pillow
# pip install python-dotenv
import os
import random

from dotenv import load_dotenv
from datasets import Dataset
from pinecone import Pinecone, ServerlessSpec
from PyPDF2 import PdfReader

from langchain.vectorstores import Pinecone as PineconeStore
from langchain.embeddings import CohereEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai.chat_models import AzureChatOpenAI

import google.generativeai as genai

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
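The code in this article assumes the API keys live in a local keys.env file loaded with python-dotenv. As a minimal sketch, that file might look like this (the variable names match the os.getenv calls used below; the values are placeholders, not real credentials):

# keys.env -- placeholder values, substitute your own credentials
PINECONE_API_KEY=your-pinecone-key
COHERE_API_KEY=your-cohere-key
AZURE_OPENAI_API_KEY=your-azure-openai-key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
GOOGLE_API_KEY=your-google-api-key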
Step 2: Create Pinecone Index and Load PDF
The next step involves initializing a Pinecone index for efficient vector search and loading a PDF document for analysis. This sets up the data storage and retrieval system.
load_dotenv('keys.env')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
COHERE_API_KEY = os.getenv('COHERE_API_KEY')
INDEX_NAME = "quickstart"

pc = Pinecone(api_key=PINECONE_API_KEY)
embeddings = CohereEmbeddings(model="embed-multilingual-v3.0", cohere_api_key=COHERE_API_KEY)

# Extract and chunk the PDF text up front; the chunks are reused in Step 4
# whether or not the index already exists.
reader = PdfReader('example.pdf')
pages = ""
for page in reader.pages:
    pages += page.extract_text()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(pages)

# pc.delete_index(INDEX_NAME)  # uncomment to drop and rebuild the index from scratch
if INDEX_NAME not in [index.name for index in pc.list_indexes()]:
    pc.create_index(
        name=INDEX_NAME,
        dimension=1024,  # embed-multilingual-v3.0 produces 1024-dimensional vectors
        spec=ServerlessSpec(cloud='aws', region='us-west-2'),
    )
    docsearch = PineconeStore.from_texts(texts, embeddings, index_name=INDEX_NAME)
else:
    # Reuse the existing index; "text" is the metadata field that stores each chunk.
    index = pc.Index(INDEX_NAME)
    docsearch = PineconeStore(index, embeddings, "text")
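As a quick sanity check (not part of the original pipeline), you can run a similarity search against the index before going further. similarity_search is the standard LangChain vector-store method, and because the embeddings are multilingual, an English query can retrieve Hebrew chunks:

# Sanity check: fetch the two chunks most similar to a test query.
sample_hits = docsearch.similarity_search("What is this document about?", k=2)
for hit in sample_hits:
    print(hit.page_content[:200])  # first 200 characters of each match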
Step 3: Create the Large Language Model Client Using Azure
This step connects to a chat model deployed on Azure OpenAI. The model will generate the evaluation questions and process the textual content extracted from the documents.
GPT_DEPLOYMENT_NAME = "chatgpt_16k"
AZURE_OPENAI_API_KEY = os.getenv('AZURE_OPENAI_API_KEY')
AZURE_OPENAI_ENDPOINT = os.getenv('AZURE_OPENAI_ENDPOINT')

llm = AzureChatOpenAI(
    openai_api_version="2023-05-15",
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
    openai_api_key=AZURE_OPENAI_API_KEY,
    azure_deployment=GPT_DEPLOYMENT_NAME,
    validate_base_url=False,
)
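A one-line smoke test is a cheap way to confirm the deployment name, endpoint, and key are wired up correctly before building anything on top of the model (note that chatgpt_16k is the author's deployment name; yours will differ):

# Smoke test: fails fast if the endpoint, key, or deployment name is wrong.
print(llm.invoke("Reply with the single word: ok").content)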
Step 4: Retrieve Random Documents and Generate Questions
In this step, we sample random chunks from the document and generate questions whose answers lie within each chunk's content. These questions are stored in a list for later evaluation.
retriever = docsearch.as_retriever()
# Sample two distinct chunks; random.sample avoids drawing the same chunk twice.
random_documents = random.sample(texts, k=2)
questions = []
documents = []

template = "Generate a question in Hebrew whose answer is contained in the following text: {doc}"
prompt = ChatPromptTemplate.from_template(template)
# The chunk is the only input the prompt needs, so the chain is simply
# prompt -> model -> string parser.
question_chain = prompt | llm | StrOutputParser()

for doc in random_documents:
    questions.append(question_chain.invoke({"doc": doc}))
    documents.append(doc)
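Printing the generated questions before moving on is a useful checkpoint, both to verify they are in Hebrew and to confirm each one is actually answerable from its source chunk:

for q, d in zip(questions, documents):
    print("Q:", q)
    print("Source chunk:", d[:120], "...")  # first 120 characters for context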
Step 5: Create Chat Prompt Using Gemini
Using each document as context, we prompt Gemini to answer the corresponding question. The responses are saved in a list; they will serve as the ground-truth answers during evaluation.
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

generation_config = {
    "temperature": 1,
    "top_p": 1,
    "top_k": 1,
    "max_output_tokens": 1024,
}
safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_ONLY_HIGH"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_ONLY_HIGH"},
]
gemini = genai.GenerativeModel(
    model_name="gemini-pro",
    safety_settings=safety_settings,
    generation_config=generation_config,
)

answers = []
chat = gemini.start_chat()
for question, document in zip(questions, documents):
    template = f"""You are an assistant for question-answering tasks.
Use the retrieved context below to answer the question in Hebrew.
Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {document}
Answer:
"""
    response = chat.send_message(template)
    answers.append(response.text)
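A quick inspection of the question-answer pairs verifies that Gemini respects the language and two-sentence constraints from the prompt:

for q, a in zip(questions, answers):
    print("Q:", q)
    print("A:", a)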
Step 6: Prepare Dataset for Metrics Evaluation
Here, we assemble the dataset used to evaluate the retrieval and question-answering system. Each row pairs a question with the contexts the retriever returns for it, a fresh answer from Gemini (asked without the source chunk this time), and the context-grounded answer from Step 5 as the ground truth.
# ragas expects each ground truth as a list of reference answers.
ground_truths = [[answer] for answer in answers]

contexts = []
rag_answers = []
for query in questions:
    # Retrieve the chunks the system considers relevant to the question.
    contexts.append([doc.page_content for doc in retriever.get_relevant_documents(query)])
    # Ask Gemini the question again, this time without handing it the source chunk.
    response = chat.send_message(query)
    rag_answers.append(response.text)

data = {
    "question": questions,
    "answer": rag_answers,
    "contexts": contexts,
    "ground_truths": ground_truths,
}
dataset = Dataset.from_dict(data)
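Before running the evaluation, a quick look at the Dataset object confirms the four columns are aligned, one row per question:

print(dataset)                 # column names and number of rows
print(dataset[0]["question"])  # spot-check the first example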
Step 7: Evaluate Metrics Using Dataset
Finally, we evaluate various metrics using the prepared dataset. This evaluation helps us understand the strengths and weaknesses of our system, providing insights into areas for improvement.
from IPython.display import display

result = evaluate(
    dataset=dataset,
    llm=llm,
    metrics=[
        context_precision,
        context_recall,
        answer_relevancy,
        faithfulness,
    ],
)
df = result.to_pandas()
display(df)
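result.to_pandas() yields one row per sample with a column per metric, so the scores can be persisted and compared across runs; for example (the file name here is arbitrary):

df.to_csv('ragas_results.csv', index=False)  # save scores for later comparison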
Conclusion
Building an advanced text analysis and retrieval system with Python is far more manageable than it first appears. By leveraging libraries like langchain, pymupdf, cohere, pinecone-client, PyPDF2, openai, datasets, and ragas, we can create a robust system capable of extracting insights and retrieving relevant information from documents, and of measuring how well it does so. This step-by-step guide provides a solid foundation for understanding the process and implementing your own text analysis and retrieval system.