Context and Motivation
Companies increasingly disclose their greenhouse gas emissions through initiatives like CDP. However, Scope 3 emissions, and in particular Category 1 (Purchased Goods and Services), remain highly uncertain. Reported data are often incomplete, inconsistent, or methodologically ambiguous, with errors ranging from boundary misinterpretations to simple typos. These inconsistencies limit comparability across firms and reduce confidence in corporate carbon data.
Limitations of Current Approaches
To compensate for missing or unreliable disclosures, estimation methods often rely on sector-level emission intensity factors, applying the same average value to every company in a sector. While pragmatic, this approach ignores firm-level differences such as size, capital intensity, or supplier engagement strategies, all of which can strongly influence emissions. The result is a process that is simple but lacks explanatory power and precision.
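The sector-average approach described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the sector names, the intensity values, and the function name `sector_average_estimate` are hypothetical placeholders, and the choice of revenue as the activity measure is an assumption.

```python
# Hypothetical sector intensities in tCO2e per million USD of revenue.
# The values are illustrative placeholders, not real emission factors.
SECTOR_INTENSITY = {
    "food_products": 450.0,
    "software": 40.0,
}


def sector_average_estimate(sector: str, revenue_musd: float) -> float:
    """Estimate emissions as revenue times the sector's average intensity."""
    return SECTOR_INTENSITY[sector] * revenue_musd


# Two firms in the same sector with equal revenue receive the same estimate,
# whatever their capital intensity or supplier engagement may be.
firm_a = sector_average_estimate("food_products", 200.0)
firm_b = sector_average_estimate("food_products", 200.0)
```

Because the estimate depends only on sector and revenue, any firm-level variation within a sector is invisible to it, which is the limitation the proposed framework targets.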
Methodology Overview
This project proposes a data-driven framework to improve the accuracy, scalability, and interpretability of Scope 3 Category 1 emission estimates. It combines:
1. Machine Learning (ML) techniques to model emissions based on company-level variables.
2. Large Language Models (LLMs) to extract additional insights from unstructured textual data.
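As a rough sketch of the first component, the example below fits emissions against a single company-level variable. The source does not specify a model class or feature set, so an ordinary least-squares fit on log-transformed revenue stands in here purely for illustration; the function names and the synthetic data are assumptions.

```python
import math


def fit_log_linear(revenues, emissions):
    """Fit log(emissions) = intercept + slope * log(revenue) by least squares."""
    xs = [math.log(r) for r in revenues]
    ys = [math.log(e) for e in emissions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, revenue):
    """Map a revenue figure back to an emissions estimate on the original scale."""
    return math.exp(intercept + slope * math.log(revenue))


# Synthetic toy data generated from emissions = 300 * revenue ** 0.9.
# It exists only to show the mechanics and carries no empirical meaning.
revenues = [10.0, 50.0, 200.0, 1000.0]
emissions = [300.0 * r ** 0.9 for r in revenues]
intercept, slope = fit_log_linear(revenues, emissions)
```

In practice the model would take several firm-level variables at once, and a more flexible learner could replace the linear fit; the sketch only shows how firm-level inputs, rather than a sector average, drive the estimate.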
Leveraging LLMs for Variable Extraction
Many relevant variables, such as supply chain integration, sourcing strategies, or decarbonization practices, are described in unstructured text (e.g., sustainability reports, supplier documents, ESG disclosures). To capture this missing information, LLMs are employed to extract high-signal qualitative indicators that can then be integrated into the ML models. This step aims to enrich the feature space with contextual and behavioral dimensions that conventional datasets overlook, enhancing the model’s explanatory depth and predictive precision.
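One way the extraction step could feed the ML models is sketched below. The indicator names, the prompt wording, and the JSON response format are all assumptions made for illustration; the source does not define a schema or name a model, and the actual LLM call is deliberately left out and replaced by a hard-coded example response.

```python
import json

# Hypothetical indicator names; the source does not fix a schema.
INDICATORS = ("has_supplier_engagement_program", "has_scope3_target")


def build_prompt(report_text: str) -> str:
    """Build an instruction asking the model for a JSON object of booleans."""
    keys = ", ".join(INDICATORS)
    return (
        "Read the report excerpt and answer with a JSON object whose keys are "
        f"{keys} and whose values are true or false.\n\n"
        f"Excerpt:\n{report_text}"
    )


def parse_indicators(llm_response: str) -> dict:
    """Convert the model's JSON answer into 0/1 features.

    Missing or non-boolean values become None so that downstream code can
    treat them as unknown rather than as a negative answer.
    """
    raw = json.loads(llm_response)
    features = {}
    for key in INDICATORS:
        value = raw.get(key)
        features[key] = int(value) if isinstance(value, bool) else None
    return features


# Stand-in for a real model response; no LLM is called in this sketch.
example_response = (
    '{"has_supplier_engagement_program": true, "has_scope3_target": false}'
)
features = parse_indicators(example_response)
```

Restricting the model to a fixed set of keys and validating the types keeps the extracted indicators in a form that can be appended directly to the structured feature set.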
Expected Outcomes and Impact
By integrating structured and unstructured data through ML and LLMs, this project seeks to establish a more reliable and scalable methodology for estimating Scope 3 Category 1 emissions. Beyond improving numerical accuracy, it demonstrates how modern tools can strengthen corporate carbon transparency, foster better comparability between firms, and ultimately support more effective climate action across global supply chains.