Q/A information retrieval, also known as Question-Answering systems (QA systems), is a type of technology that focuses on understanding the content of a user’s question and retrieve relevant information or knowledge from a textual database to generate concise and accurate responses. Some use cases where Q/A information retrieval can be employed are:
- Customer Support : many businesses use Q/A information retrieval to power their customer support chatbots. Customers can ask questions, report issues, or seek assistance, and the chatbot responds with relevant answers or guides them to the appropriate resources.
- Finance and Investment : financial analysts and investors can use Q/A information retrieval to quickly access financial reports, market data, and investment insights. This enables them to make informed decisions and respond to market changes more effectively.
- IT Technical Support : IT professionals can employ Q/A information retrieval to troubleshoot technical issues and provide solutions to common problems. This reduces downtime and improves overall IT support efficiency.
For our use case — Q/A system applied for ESG data extraction, we implemented this technique, in order to extract useful information for ESG (Environmental, Social, Governance) assessment from the database of ESG reportings. For more comprehensive information about our Q&A information retrieval application, please visit the article “Information retrieval system for ESG data extraction”.
In this article, we will provide a detailed explanation of the implementation process for Q&A information retrieval within the context of this use case.
Setting Up a Pinecone Vector Database:
To create a Pinecone vector database for the first time, you will need to supply both an API key value and an environment value to the Pinecone server. You can obtain these necessary credentials by visiting the Pinecone sign-in page.
Once the connection is successfully established, you can proceed to create a new Pinecone index with a name of your choice. For example, you can name it ‘project_esg.’
Managing the Index:
To examine the properties of the index, you can execute the following command:
To add new data, you can utilize the index.insert( ) method. For each document, it must first undergo transformation into both a dense vector and a sparse vector before the data becomes eligible for insertion into the index:
After the data insertion process is completed, you have the option to recheck the index for verification:
For code refinement, we propose the creation of two distinct classes: one for the dense vector embedding model and another for the sparse vector embedding model.
Additionally, we suggest the development of a “Retriever” class to define essential functions, including ‘reset_index_namespace,’ ‘delete_index,’ ‘upsert_batch,’ and ‘retrieve_with_query.’
Q&A Information Retrieval Model:
In our Q&A information retrieval system, we employ the GPT-3.5 model. To facilitate this process, we have defined a class called “ChatDoc,” which offers a range of useful functions. One such function is “prompt_with_context(),” which enables the submission of a prompt text question to the API and subsequently retrieves the corresponding answer:
Pipeline for Q&A Information Retrieval:
Once all the necessary classes are defined, we can assemble them into a comprehensive pipeline:
Creating an Instance for the Retriever and uploading data into its Index:
The input data is prepared in the adapted format (list of dictionary containing the champs: content, metadata)for the method upsert_batch
Creating a Q&A Model for Information Retrieval:
The Q/A model provides the answer for the given question along with the provided context:
Here is one example of a query containning the questions for seeking ESG information and their answers provided by Q/A system.
In this article, we have explained how to utilize two valuable Python libraries, Pinecone and OpenAI, to construct a Q/A information retrieval pipeline. With straightforward code, we can rapidly establish a benchmark for the Q/A system. These modules are developed in a flexible manner, i.e., the choice of the model for the ChatDoc model can be easily modified.
Several critical choices must be made when configuring the pipeline, including parameters like EMBEDDING_MODEL, SPARSE_MODEL_FILE_NAME, and OPENAI_MODEL. In our specific use case, these values were carefully selected by optimizing the results of the answers retrieved from our testing dataset. Additionally, we chose the parameter alpha = 0.5 for Pinecone’s index hybrid search algorithm, giving equal importance to dense and sparse vectors. For new use cases, we recommend conducting tests with different models for embedding dense and sparse vectors.
Regarding text prompt techniques, we employed both the one-shot prompt and few-shot prompt techniques to formulate questions. It is worthwhile to explore other techniques, such as the chain-of-through prompt or iterative prompt, when dealing with complex information retrieval challenges.
Thanks to our colleagues Laura RIANO BERMUDEZ, Lu WANG and Jean-Baptiste BARDIN for the article review.
About the Author
Alexandre DO is a senior Data Scientist at La Javaness and has conducted many R&D projects on NLP applications and interested in ESG topics.