Industrial emissions
A new gold-standard dataset from LMU’s Social Data Science & AI Lab (SODA Lab) and collaborators offers a trustworthy ground truth for extracting greenhouse gas (GHG) figures from corporate reports, and exposes just how threadbare many sustainability reports still are.
Across the EU, large companies are legally obliged to report their GHG emissions—and procurement teams, consultancies, regulators and banks increasingly pipe those figures into risk models, due-diligence workflows and climate disclosures.
But the source material is usually a sprawling, unstructured PDF, uploaded to a corporate website rather than a registry. To cope at scale, teams are turning to large language models (LLMs) to “read” reports and extract tables of Scope 1, Scope 2 and (sometimes) Scope 3 emissions.
That speed comes with a catch. As project coordinator Dr Malte Schierholz puts it, “With automatic extraction methods, it’s easy to fully trust the LLM’s output and overlook measurement errors that occur frequently.”
In other words: if you don’t have a trustworthy ground truth, you can end up automating your way into precisely worded nonsense.
Enter the Greenhouse Gas Insights and Sustainability Tracking (GIST) group’s new benchmark, published in Scientific Data.
It’s a carefully curated “gold standard” for extracting company-level emissions from sustainability reports sampled across the MSCI World Small Cap index and Germany’s DAX.
On paper, the task sounded simple: pull total company emissions, by scope, into a tidy table. In practice, it was anything but. The LMU–Deutsche Bundesbank team built a multi-stage pipeline:
LLM-assisted extraction (retrieval-augmented generation, or RAG, over report pages) to propose values.
Dual non-expert annotation to accept, correct or flag those values against strict rules (whole-company totals, scope-consistent, absolute CO₂e, not sub-components).
Paired expert review wherever the non-experts disagreed or were unsure.
In-person expert adjudication to settle the hard cases.
The result is a dataset with explicit assumptions, transparent decision trails and—crucially—documented uncertainty. It’s designed to validate both human workflows and automated systems, enabling fair comparisons between methods rather than leaderboard theatre.
For environmental data users, a few findings should focus the mind.
About half the reports contained no usable GHG data at all under the project’s quality criteria. That’s not a quibble over formatting; it’s an indictment of the current reporting landscape.
Where data exists, it’s heavily skewed to Scope 1 and Scope 2 (location-based). Scope 2 (market-based) is far less consistently reported, and Scope 3 is rarely complete, even though that is exactly where financed emissions, supply chains and product use dominate the footprint.
Ambiguity is routine, not exceptional: many entries required expert discussion to classify correctly due to inconsistent protocols, missing context or departures from Greenhouse Gas Protocol guidance.
Even sophisticated LLM pipelines hallucinate structure or misread graphics without strong guardrails; the error taxonomy compiled by annotators will be gold dust for anyone tuning prompts or building validators.
Sustainable finance researcher Dr Andreas Dimmelmeier’s verdict is blunt: the knottiest problems “stem not only from complex and partly inconsistent reporting protocols, but also from missing context and incomplete disclosures in company reports.”
This is the core provocation for our sector. We are quick to say “AI will scale what we already do,” but if what we already do is extract inconsistent numbers from ambiguous PDFs, then AI will scale inconsistency and ambiguity—faster.
The LMU benchmark is valuable precisely because it separates two problems that are often conflated:
Information extraction risk (can a system correctly lift what’s on the page?).
Accounting/standards risk (is what’s on the page the right thing, at the right boundary, in the right units, for the right scope?).
Without a gold standard, you can’t tell which failure you’re looking at. With one, you can tune your pipeline to the former while lobbying, specifying or auditing for the latter.
Here’s a pragmatic playbook for instrumentation users, data teams and compliance leads integrating automated GHG extraction into their monitoring stacks:
Treat PDFs as hostile data sources. Assume tables are images, captions are misleading, footnotes change the meaning and units vary by page. Your pipeline should extract text, numbers and provenance (page, figure/table ID, display type).
LLM/RAG propose candidates.
Two human reviewers validate against a clear rulebook (whole-entity, scope-consistent, absolute CO₂e).
Expert escalation on disagreement or flagged ambiguity.
Keep an audit trail: candidate value, correction, reason code, annotator IDs.
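As a rough sketch of how that review loop and its audit trail might be represented in code: the class names, field names and reason codes below are illustrative assumptions, not the GIST team's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Review:
    annotator_id: str
    value: Optional[float]  # reviewer's value in tCO2e, or None if nothing usable
    reason_code: str        # hypothetical codes: "accepted", "corrected", "unsure"


@dataclass
class AuditRecord:
    candidate_value: Optional[float]  # value proposed by the LLM/RAG step
    reviews: list[Review] = field(default_factory=list)
    status: str = "pending"
    final_value: Optional[float] = None


def resolve(record: AuditRecord) -> AuditRecord:
    """Accept when two reviewers agree; otherwise escalate to an expert."""
    if len(record.reviews) < 2:
        record.status = "pending"
        return record
    first, second = record.reviews[:2]
    flagged = "unsure" in (first.reason_code, second.reason_code)
    # Exact equality is a simplification; a real rulebook might allow rounding.
    if not flagged and first.value == second.value:
        record.status = "agreed"
        record.final_value = first.value
    else:
        record.status = "escalate_to_expert"
    return record


record = AuditRecord(
    candidate_value=1200.0,
    reviews=[
        Review("annotator-1", 1250.0, "corrected"),
        Review("annotator-2", 1250.0, "corrected"),
    ],
)
print(resolve(record).status)
```

Because the candidate value, both reviews and the outcome sit in one record, the decision trail can be stored and replayed later.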
Normalise units and scopes deterministically. Maintain a controlled vocabulary and a normalisation dictionary (e.g., tonnes vs tCO₂e, kilo- vs mega- prefixes). Default Scope 2 to location-based unless market-based is explicitly stated.
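A minimal sketch of deterministic normalisation, assuming a small hand-maintained dictionary. The unit spellings, conversion targets and function names here are hypothetical, and a production vocabulary would be far longer.

```python
import re

# Hypothetical normalisation dictionary: cleaned unit -> multiplier to tonnes CO2e.
# Keys are case-sensitive on purpose, because "Mt" (mega) and "mt" are not the same.
UNIT_FACTORS = {
    "tCO2e": 1.0,
    "tonnesCO2e": 1.0,
    "ktCO2e": 1_000.0,
    "MtCO2e": 1_000_000.0,
    "kgCO2e": 0.001,
}


def clean_unit(raw: str) -> str:
    """Remove whitespace and map the subscript two to ASCII, preserving case."""
    return re.sub(r"\s+", "", raw.replace("₂", "2"))


def to_tonnes_co2e(value: float, raw_unit: str) -> float:
    """Convert to tonnes CO2e, refusing to guess when the unit is unknown."""
    unit = clean_unit(raw_unit)
    if unit not in UNIT_FACTORS:
        raise ValueError(f"Unknown unit {raw_unit!r}; route to a reviewer")
    return value * UNIT_FACTORS[unit]


def scope2_method(label: str) -> str:
    """Default to location-based unless the label explicitly says market-based."""
    return "market-based" if "market" in label.lower() else "location-based"


print(to_tonnes_co2e(1.5, "kt CO₂e"))
print(scope2_method("Scope 2 emissions"))
```

Raising on an unrecognised unit, rather than falling back to a default, keeps ambiguous cases visible to reviewers.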
Score uncertainty per record. Don’t output a single number; output the value plus an uncertainty label (e.g., “gold” = dual-review agreement; “silver” = single-review with heuristics; “bronze” = LLM-only). Downstream models should weight accordingly.
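One way this labelling could look in practice is sketched below. The tier names follow the example above, but the numeric weights and the function and class names are purely illustrative assumptions.

```python
from dataclasses import dataclass

# Illustrative weights only; choose values that suit the downstream model.
TIER_WEIGHTS = {"gold": 1.0, "silver": 0.6, "bronze": 0.2}


def assign_tier(human_reviews: int, reviewers_agree: bool) -> str:
    """Map the amount of human validation onto an uncertainty label."""
    if human_reviews >= 2 and reviewers_agree:
        return "gold"    # dual-review agreement
    if human_reviews == 1:
        return "silver"  # single review, supported by heuristics
    if human_reviews == 0:
        return "bronze"  # LLM-only
    raise ValueError("Reviewers disagree; escalate instead of labelling")


@dataclass(frozen=True)
class LabelledValue:
    value_tco2e: float
    tier: str

    @property
    def weight(self) -> float:
        """Weight a downstream model could apply to this record."""
        return TIER_WEIGHTS[self.tier]


record = LabelledValue(value_tco2e=1500.0, tier=assign_tier(1, True))
print(record.tier, record.weight)
```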
Exploit the error taxonomy. If an LLM repeatedly fails on “graphic misread” or “sub-category not total,” fix the prompt or the chunking, or add structure-aware extractors (table parsers, OCR on vector graphics). Don’t accept recurring classes of error as “noise.”
Push suppliers for machine-readable reports. In RFPs and supplier codes, require emissions data that is aligned with the Corporate Sustainability Reporting Directive (CSRD), consistent with the GHG Protocol and machine-readable, with page-level citations. A simple CSV/JSON appendix with scope, year, unit and calculation method would eliminate half the pain.
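To make the idea concrete, here is a hedged sketch of what such a JSON appendix and a basic completeness check might look like. The field names and the sample figures are invented for illustration and do not come from any standard or real report.

```python
import json

# Hypothetical required fields for each row of a supplier's emissions appendix.
REQUIRED_FIELDS = {"scope", "year", "unit", "value", "method", "page"}

# Made-up sample data, shown only to illustrate the shape of a record.
SAMPLE_APPENDIX = """
[
  {"scope": "1", "year": 2024, "unit": "tCO2e", "value": 1250.0,
   "method": "GHG Protocol, operational control", "page": 42}
]
"""


def validate_appendix(text: str) -> list[dict]:
    """Parse a JSON appendix and reject rows that lack a required field."""
    rows = json.loads(text)
    for index, row in enumerate(rows):
        missing = REQUIRED_FIELDS - set(row)
        if missing:
            raise ValueError(f"Row {index} is missing fields: {sorted(missing)}")
    return rows


rows = validate_appendix(SAMPLE_APPENDIX)
print(len(rows), rows[0]["scope"], rows[0]["page"])
```

A check this simple is only possible because the structure is declared up front, which is the argument for asking suppliers to provide it.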
Design for Scope 3 scarcity. Where Scope 3 is missing, define a policy: model with transparent assumptions, use sectoral intensity proxies, or treat as unknown and escalate. Don’t silently backfill.
Make validation a team sport. Your sustainability, finance and legal teams should agree the annotation rules up front. Publish them internally. If you change the rules, version them and re-run your pipeline.
This benchmark also strengthens the case for reporting infrastructure, not just reporting obligations. Three policy-facing asks our readers can champion:
Central registries with schemas. If CSRD is to work, filings should be lodged with a registry in a common, machine-readable schema (XBRL-like for GHG), with public page-linked PDFs as evidence, not the other way round.
Scope 2 alignment by default. Make dual reporting (location- and market-based) routine and clearly tagged; penalise ambiguous labelling.
Scope 3 phase-in with guardrails. Prioritise high-materiality categories per sector, mandate method notes and require uncertainty disclosures. “Not estimated” should be an allowed value, but one that carries consequences.
For those building AI-assisted monitoring workflows, the LMU/GIST dataset is a welcome north star. It won’t make corporate reports less messy overnight.
But it does something arguably more important: it makes our own systems accountable. By externalising rules, documenting disagreements and publishing adjudications, it turns “AI did it” into “here’s exactly how we decided.”
That’s the cultural shift environmental monitoring needs as AI moves deeper into compliance and risk: less mystique, more method. Automate by all means; just make sure you can explain yourself when it matters.
IET 36.2 Mar/Apr 2026