Information retrieval system for ESG data extraction
Introduction:
ESG (Environmental, Social, and Governance) criteria are a framework used to assess an organization’s performance on sustainability and ethical issues across these three areas. Important topics include the carbon footprint of an organization’s business, the ethical treatment of employees, and maintaining fair relationships with stakeholders and partners.
One of the most challenging aspects of ESG assessment is collecting the data in a complete and consistent manner: sources covering ESG topics are plentiful, often unstructured, and contain massive amounts of information.
In this article, we will explain how to leverage AI to extract useful information from ESG data sources for the construction of ESG databases.
The AI system can be viewed as a tool to help individuals answer ESG-related questions using contextual datasets. There are two main parts of this program:
- Constructing an adaptive database to store ESG data.
- Constructing a Question-Answering model that can effectively comprehend the content of the data and the questions posed, enabling the extraction of accurate answers.
The historical dataset and newly arriving data must be transformed and stored in the database in step 1. Then, for each question, the Q/A model searches the database for the most relevant information to answer it. For detailed information about the model pipeline, we encourage you to visit the article “Q/A information retrieval system implementation with the GPT-3.5”.
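The two steps above can be sketched in a few lines. This is a minimal toy, not the real pipeline: the bag-of-words "embedding" stands in for a neural encoder, the function names are our own, and the `qa_model` is a placeholder for the LLM call described later.

```python
import re
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would use a neural encoder."""
    return Counter(re.findall(r"\w+", text.lower()))

def similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ingest(documents, store):
    """Step 1: transform each document into a vector and store both."""
    for doc_id, text in documents.items():
        store[doc_id] = (embed(text), text)

def answer(question, store, qa_model, top_k=1):
    """Step 2: retrieve the most relevant documents, then ask the Q/A model."""
    q_vec = embed(question)
    ranked = sorted(store.values(), key=lambda v: similarity(q_vec, v[0]), reverse=True)
    context = "\n".join(text for _, text in ranked[:top_k])
    return qa_model(question, context)

store = {}
ingest({"r1": "Company XX reports a gender parity of 48% among employees.",
        "r2": "Company XX reduced its carbon footprint by 12% in 2022."}, store)
# The qa_model below is a stand-in; a real system would call an LLM here.
result = answer("What is the gender parity among employees at company XX?",
                store, qa_model=lambda q, ctx: ctx)
print(result)  # → Company XX reports a gender parity of 48% among employees.
```

With real embeddings, the same two-step shape holds; only `embed` and `qa_model` change.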
Vector database:
First, we need to establish an indexed data storage system. Each indexing technique provides its own way to organize and retrieve documents. Well-known indexed data storage solutions include Elasticsearch, Solr, and MongoDB.
With the latest AI revolution, driven by large language models and generative AI, the need for efficient data processing has become more crucial than ever. Recently, a new indexed database technique called the vector database has emerged. Vector databases excel in adapting to the requirements of AI deep-learning models, because the data is pre-transformed into vectors, often referred to as embedding vectors. Such a representation seamlessly aligns with the input needs of deep-learning models and facilitates semantic similarity searches.
Popular vector database solutions include Pinecone, Chroma, and Weaviate. In our demonstration, we used the Pinecone solution. Pinecone enables us to effortlessly establish a database and conduct searches within it.
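Since the real Pinecone client talks to a hosted index (and needs an API key), here is a minimal in-memory stand-in that only mirrors the shape of an upsert/query interface; the `TinyVectorIndex` class and the toy vectors are our own invention, not Pinecone's API.

```python
from math import sqrt

class TinyVectorIndex:
    """In-memory stand-in that mimics the upsert/query shape of a
    vector-database client. A real client queries a hosted index instead."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, vectors):
        # vectors: a list of (id, embedding) pairs.
        for vec_id, values in vectors:
            self._vectors[vec_id] = values

    def query(self, vector, top_k=3):
        # Return the top_k stored ids ranked by cosine similarity to `vector`.
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sqrt(sum(x * x for x in a))
            nb = sqrt(sum(x * x for x in b))
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(self._vectors.items(),
                        key=lambda item: cosine(vector, item[1]),
                        reverse=True)
        return [vec_id for vec_id, _ in ranked[:top_k]]

index = TinyVectorIndex()
index.upsert([("doc-1", [1.0, 0.0]), ("doc-2", [0.0, 1.0]), ("doc-3", [0.7, 0.7])])
print(index.query([1.0, 0.1], top_k=2))  # → ['doc-1', 'doc-3']
```

A managed service adds persistence, approximate-nearest-neighbor search, and scaling on top of this basic contract.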
Question-Answering for ESG Assessment:
The process of extracting values for ESG assessment can be likened to a Question-Answering or Document Chat application. For example, when seeking information about ethical work conditions, such as gender parity among employees or average salaries based on gender and education level, we can frame these criteria as questions for the Q/A model to extract answers:
Q: What is the gender parity among employees at company XX, with reference to context YY?
A: The gender parity at the company is ZZ%.
In this process, the model encodes the question to understand the inquiry and subsequently retrieves the relevant paragraph or text from the document, which pertains to the context of the question. This stage is known as data retrieval or context retrieval. There are two types of Q/A models:
- Retriever-Reader model: after retrieving the context, this model reads the context and extracts a portion of it. The model then returns this portion as the answer to the question. Various techniques, such as Named Entity Recognition (NER) and prediction models for identifying the start and end positions of the answer in the text, are employed. This technique is referred to as the “Reader” model because it reads the context and provides an answer that is a part of the context.
- Retriever-Generative model: following the retrieval of context, this model uses the context as an optional, informative source to generate the answer in a free-form manner, using its own choice of words. Well-known techniques for generative models include transformer-based models or GPT (Generative Pre-trained Transformer) models. The answer is generated by the model based on the input context and is not necessarily an excerpt from the context itself.
In our approach, we employ a generative model. Specifically, we utilize a GPT-3.5-based model for the Q&A information retrieval process.
Test results:
To put our pipeline to the test, we employed a dataset comprising approximately 10 ESG reports, specifically Universal Registration Documents. We selected organizations at random and formulated a list of 50 questions covering various ESG topics. Our evaluation encompassed three levels:
- +++ The model returned a correct answer.
- ++ The model returned a partially correct answer.
- + The model returned an incorrect answer.
Additionally, we assessed the complexity of each task on three levels:
- + Easy task: The answer could be obtained with a straightforward prompt text question.
- ++ Intermediate task: The answer required providing some examples within the prompt text question.
- +++ Difficult task: The answer was either not found or only partially discovered, even with extensive prompt text rewriting.
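An "intermediate" task in the sense above can be handled with few-shot prompting: worked Q/A examples are prepended so the model can imitate the expected answer format. A minimal sketch, in which the example questions and answers are entirely invented:

```python
def few_shot_prompt(question, examples):
    """Prepend worked Q/A examples (few-shot prompting) so the model
    can imitate the expected answer format."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"

fs_prompt = few_shot_prompt(
    "What share of company XX's energy comes from renewables?",
    [("What is the gender parity at company YY?",
      "Gender parity at company YY is 45%."),
     ("What is company YY's annual CO2 footprint?",
      "Company YY emitted 1.2 Mt CO2e in 2022.")],
)
print(fs_prompt)
```

The examples both clarify the task and pin down the answer style (a single short sentence with a figure), which is what made intermediate tasks tractable in our tests.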
Our findings highlighted that the performance of the Q/A Engine is strongly influenced by the complexity of the question and the context. The model exhibited a high level of accuracy when dealing with straightforward questions and simple contexts, such as raw text. However, for more intricate questions and complex contexts, such as those involving calculations within text tables or images, the model’s performance diminished, and the effort required for effective prompting increased.
Perspectives on improving model performance:
To enhance the performance of our Q/A model, several strategies and ideas warrant consideration:
- Re-training the LLM Model with Specific Domain Examples:
This approach involves retraining the Large Language Model (LLM) with a curated set of domain-specific examples. By exposing the model to a broader range of domain knowledge, it can provide more accurate answers to questions that pertain to specific areas of expertise within the ESG domain. This method aims to make the Q/A model more familiar with nuanced domain-specific terminology and contexts.
- Re-writing Question Technique:
The technique of re-writing questions is a valuable tool to improve the Q/A model’s understanding of questions, particularly complex ones. It involves reformulating questions to make them more explicit and easier for the model to comprehend. This can be especially useful for questions that reference past discussions or involve proper names, which might require additional context for accurate answers.
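In practice, the rewrite itself is often delegated to an LLM; the glossary-based sketch below only illustrates the idea of expanding shorthand names into the explicit wording a report uses. The function name and glossary entries are invented for the example.

```python
def rewrite_question(question, entity_glossary):
    """Naive question rewrite: expand shorthand names into their full,
    explicit descriptions so the retriever matches the report's wording."""
    for short, full in entity_glossary.items():
        question = question.replace(short, full)
    return question

rewritten = rewrite_question(
    "What is the carbon footprint of XX?",
    {"XX": "company XX, as named in its Universal Registration Document"},
)
print(rewritten)
# → What is the carbon footprint of company XX, as named in its
#   Universal Registration Document?
```

An LLM-based rewriter generalizes this further, e.g. resolving "it" or "the previous year" from earlier turns of a conversation.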
These strategies aim to address the challenges associated with complex questions and diverse data types, ultimately leading to enhanced model performance and more accurate information retrieval in the context of ESG assessment.
Conclusion
In this post, we have delved into key aspects of Question-Answering (Q/A) information retrieval techniques. We explored how to leverage innovative methods like vector databases (e.g., Pinecone) and harnessed the power of cutting-edge AI developments like GPT-3.5 models to extract answers to questions.
For our specific needs, we employed these techniques to extract valuable insights related to ESG topics. This approach proved invaluable in the creation and maintenance of a robust and automated ESG database.
Acknowledgements
Thanks to our colleagues Laura RIANO BERMUDEZ, Lu WANG and Jean-Baptiste BARDIN for the article review.
About the Author
Alexandre DO is a senior Data Scientist at La Javaness. He has conducted many R&D projects on NLP applications and is interested in ESG topics.