
SEC EDGAR and XBRL: Financial Document AI at Scale

By Zakaria El Houda


XBRL was supposed to solve financial data comparability. The SEC launched a voluntary XBRL program in 2005 and mandated tagging for financial statements starting in 2009; the Inline XBRL (iXBRL) rule adopted in 2018 phased in the requirement for all filers by 2021. Every number in every financial statement now carries a machine-readable tag: concept name, reporting period, units, dimensional qualifiers. Structured, standardized, comparable.

Except it isn't. An analysis of 32,240 SEC filings found that for any three distinct companies' XBRL financial statements, only 11.92% of their data is directly comparable [1]. Even restricting to standard taxonomy elements, interoperability reaches just 17.35%. The problem isn't the format. The problem is what companies do with it.

We built a fund intelligence platform that needed comparable financial metrics across thousands of fund filings. XBRL covered about 67% of the structured data points, but the raw tags were not comparable without a mapping layer. Every section of a filing (Statement of Operations, Financial Highlights, notes, fee tables) required its own extraction and mapping strategy. This post walks through one of them in detail: how we used NLP to map taxonomy variation on the Statement of Operations. It also covers how XBRL serves as a validation layer for document AI extraction, and the pipeline that handles everything XBRL doesn't reach.

Why 12% Comparability After Two Decades of XBRL

Three structural problems keep XBRL from delivering on its comparability promise.

Extension proliferation. When the standard US GAAP taxonomy doesn't have a concept for a specific line item, companies create custom extensions. Company A reports cloud revenue using the standard us-gaap:RevenueFromContractWithCustomerExcludingAssessedTax. Company B creates companyb:CloudServicesRevenue. Both report the same economic reality, but a machine sees two unrelated concepts. Across EDGAR, there are hundreds of thousands of unique custom extensions, many duplicating standard concepts with different names.

Label variation on standard tags. Even when companies use standard taxonomy elements, the same economic concept gets tagged with different elements. The FASB taxonomy contains thousands of elements [2], and accounting teams pick whichever one feels closest. Revenue alone maps to at least four common tags: RevenueFromContractWithCustomerExcludingAssessedTax, Revenues, RevenueFromContractWithCustomerIncludingAssessedTax, and SalesRevenueNet. In the fund universe, management fees on the Statement of Operations show the same pattern: four fund families, four different XBRL tags for the same line item.
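As a small illustration of handling that variation, a resolver can walk the known revenue variants in preference order and take the first one a filing actually reports. This is a minimal sketch, not our production code; the tag list comes from above, but the priority order and the shape of the facts dictionary are assumptions.

# Known tag variants for one canonical concept, in an assumed preference order.
REVENUE_TAGS = [
    "RevenueFromContractWithCustomerExcludingAssessedTax",
    "Revenues",
    "RevenueFromContractWithCustomerIncludingAssessedTax",
    "SalesRevenueNet",
]

def resolve_revenue(facts: dict[str, float]) -> float | None:
    """Return the first revenue variant a filing tags, if any."""
    for tag in REVENUE_TAGS:
        if tag in facts:
            return facts[tag]
    return None  # likely a custom extension; falls through to the mapping layer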

Granularity compounds the problem. Company A reports OperatingIncomeLoss as a single line item. Company B reports GrossProfit, SellingGeneralAndAdministrativeExpense, and ResearchAndDevelopmentExpense separately but never tags an operating income total. Comparing operating income requires reconstructing it from Company B's components, and those components may include or exclude items differently than Company A defines them.
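A minimal sketch of that reconstruction, assuming only the three component tags named above. Real statements often tag more components, so a rebuilt total may not line up with a filer that tags OperatingIncomeLoss directly.

def operating_income(facts: dict[str, float]) -> float | None:
    """Prefer the tagged total; otherwise rebuild it from components."""
    if "OperatingIncomeLoss" in facts:
        return facts["OperatingIncomeLoss"]
    components = (
        "GrossProfit",
        "SellingGeneralAndAdministrativeExpense",
        "ResearchAndDevelopmentExpense",
    )
    if all(tag in facts for tag in components):
        # Reconstructed total; may include or exclude items differently than
        # a company that tags OperatingIncomeLoss as a single line item.
        return facts["GrossProfit"] - sum(facts[tag] for tag in components[1:])
    return None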

Bag-of-Words Taxonomy Mapping: Statement of Operations

Each section of a fund filing presents different extraction challenges. Financial Highlights have multi-period tables with per-share calculations. Notes contain nested disclosures mixing narrative and numbers. Fee tables follow SEC-mandated structures but vary in how fund families break down expense categories. Each requires its own mapping and extraction strategy.

To show what solving one of these looks like concretely, here's a walkthrough of the Statement of Operations. Its vocabulary is finite: management fees, distribution fees, interest income, realized gains, unrealized appreciation. The line items repeat across fund families, but every family tags and structures them differently. A pre-LLM NLP approach works here: tokenize the XBRL element names and match them against a canonical concept dictionary.

import re

FINANCIAL_STOP_WORDS = {
    "total", "net", "gross", "other", "additional", "certain",
    "related", "and", "the", "of", "for", "from", "including",
    "excluding", "current", "noncurrent", "accumulated",
}

def tokenize_xbrl_element(element_name: str) -> set[str]:
    """Split a CamelCase XBRL element name into lowercase tokens."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", element_name)
    # Also split before a digit run ("Fees12b1" -> "Fees 12b1") so tokens
    # like "12b1" survive as their own unit.
    spaced = re.sub(r"(?<!\d)([A-Za-z])(\d)", r"\1 \2", spaced)
    tokens = spaced.lower().split()
    return {t for t in tokens if t not in FINANCIAL_STOP_WORDS}

def map_to_canonical(element_name: str, label: str, concepts: dict) -> tuple:
    """
    Score element against canonical concept dictionary using
    Jaccard similarity on token sets. Returns (concept, score).
    """
    element_tokens = tokenize_xbrl_element(element_name)
    label_tokens = {t.lower() for t in label.split()} - FINANCIAL_STOP_WORDS

    # union of element name tokens and human label tokens
    combined = element_tokens | label_tokens

    best_concept, best_score = None, 0.0
    for concept_name, concept_tokens in concepts.items():
        jaccard = len(combined & concept_tokens) / len(combined | concept_tokens)
        if jaccard > best_score:
            best_concept, best_score = concept_name, jaccard

    return best_concept, best_score

The canonical concept dictionary maps each target concept to its expected token set. "Management Fees" maps to {"management", "advisory", "fees", "investment", "adviser"}. Here's what that resolution looks like across fund families, starting with a single line item from the Statement of Operations:

XBRL Element                 | Human Label                     | Tokens (after stop words)        | Canonical Concept          | Jaccard
InvestmentAdvisoryFeesPaid   | "Investment advisory fees"      | advisory, fees, investment, paid | Management Fees            | 0.57
ManagementFeeExpense         | "Management fee"                | management, fee, expense         | Management Fees            | 0.40
AdvisoryFeePaid              | "Advisory fee paid to adviser"  | advisory, fee, paid, adviser     | Management Fees            | 0.57
fundname:AdviserCompensation | "Compensation to adviser"       | adviser, compensation            | Management Fees            | 0.22
DistributionFees12b1         | "12b-1 distribution fees"       | distribution, fees, 12b1         | Distribution (12b-1) Fees  | 0.60
fundname:ServiceFeesPaid     | "Shareholder service fees"      | shareholder, service, fees, paid | Shareholder Servicing Fees | 0.40

The first four rows show four fund families tagging the same economic concept with four different XBRL elements. The mapper resolves all four to "Management Fees." The last two rows show how the same technique separates distribution fees and servicing fees from management fees, even when the token sets partially overlap.
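Here's a minimal version of that dictionary and a call against the first row of the table. The Management Fees token set is the one quoted above; the other two entries are illustrative assumptions, and exact scores depend on the dictionary's token sets, which is why they differ slightly from the table.

CANONICAL_CONCEPTS = {
    "Management Fees": {"management", "advisory", "fees", "investment", "adviser"},
    # The next two token sets are illustrative, not our production dictionary.
    "Distribution (12b-1) Fees": {"distribution", "fees", "12b1", "plan"},
    "Shareholder Servicing Fees": {"shareholder", "service", "servicing", "fees"},
}

concept, score = map_to_canonical(
    "InvestmentAdvisoryFeesPaid", "Investment advisory fees", CANONICAL_CONCEPTS
)
# concept == "Management Fees", score == 0.5 with this illustrative dictionary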

The low Jaccard score on fundname:AdviserCompensation (0.22) is where the mapper needs help. When multiple concepts score below a confidence threshold, the XBRL structure itself disambiguates. A tag appearing under "Operating Expenses" in the presentation linkbase is an expense concept regardless of its name. The calculation linkbase provides further signal: if a tag rolls up under Total Expenses with a weight of 1.0, it's an expense. These structural cues resolve ambiguities that text similarity alone cannot.
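A sketch of that structural fallback, assuming the presentation-linkbase parent has already been parsed out of the filing (not shown). The parent element names, the fallback buckets, and the threshold are all illustrative assumptions, not our exact production values.

# Presentation-linkbase parent -> coarse bucket used when text similarity is weak.
# Parent names here are illustrative.
STRUCTURAL_FALLBACKS = {
    "OperatingExpensesAbstract": "Operating Expenses (unclassified)",
    "InvestmentIncomeAbstract": "Investment Income (unclassified)",
}

def map_with_structure(element_name, label, concepts, parent_element, threshold=0.35):
    """Use linkbase context when the bag-of-words score is too low to trust."""
    concept, score = map_to_canonical(element_name, label, concepts)
    if score >= threshold:
        return concept, score
    # Low-confidence text match: let the statement structure decide the bucket.
    return STRUCTURAL_FALLBACKS.get(parent_element, concept), score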

The final layer is temporal consistency. If the same fund family used InvestmentAdvisoryFeesPaid in their last four filings and it mapped to "Management Fees" each time, the mapping carries forward without re-scoring. Fund families are remarkably consistent in their tag choices. Once a mapping resolves correctly for one filing, it holds across that family's entire history.
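A sketch of that carry-forward, keyed on filer and element name. The in-memory dict is an assumption for illustration; in a real pipeline the history would live in a database.

# (filer CIK, XBRL element name) -> canonical concept learned from prior filings.
MAPPING_HISTORY: dict[tuple[str, str], str] = {}

def map_with_history(cik: str, element_name: str, label: str, concepts: dict):
    """Reuse the previous period's mapping before re-scoring from scratch."""
    key = (cik, element_name)
    if key in MAPPING_HISTORY:
        return MAPPING_HISTORY[key], 1.0  # carried forward without re-scoring
    concept, score = map_to_canonical(element_name, label, concepts)
    if concept is not None:
        MAPPING_HISTORY[key] = concept
    return concept, score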

This approach mapped line items across fund families with near-perfect accuracy on the Statement of Operations. The consistency surprised us. Despite the XBRL standardization problems across the broader EDGAR corpus, within the fund universe, the same economic concepts recur with predictable tag variations. Management fees, distribution fees, interest income, realized gains: the vocabulary is finite, and the bag-of-words mapper covered it.

The edgartools library later systematized a similar approach across the full EDGAR corpus. By analyzing 32,240 filings, the project mapped 2,770 distinct XBRL tags to 234 standardized financial concepts [1]. Each mapping carries confidence scores, company coverage counts, and temporal consistency metrics. Of these, 42 tags have industry-dependent meanings, resolved through the Fama-French 48-industry classification with 769 industry-specific overrides.

[Figure] XBRL tag standardization pipeline: 2,770 tags mapped to 234 concepts through confidence scoring and Fama-French industry disambiguation.

An LLM could replace the bag-of-words layer entirely. Given the element name, its label, the presentation linkbase context, and a few example values, a model resolves even opaque custom extensions like fundname:FeeWaiverReimbursement reliably. But the bag-of-words mapper processes thousands of tags per second at zero marginal cost, runs deterministically, and produces the same mapping every time. For high-volume pipelines during filing season, that matters. We use LLMs for the long tail of genuinely ambiguous extensions. The bag-of-words layer handles the bulk.
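One way to wire that split is to put the model behind a callable and invoke it only below a confidence threshold. This is a sketch under assumptions: the threshold value and the llm_resolve interface are hypothetical, not our exact production setup.

def resolve_tag(element_name, label, concepts, llm_resolve, threshold=0.35):
    """Deterministic bag-of-words first; LLM only for the ambiguous long tail."""
    concept, score = map_to_canonical(element_name, label, concepts)
    if score >= threshold:
        return concept
    # Genuinely ambiguous custom extension: hand the name and label to the model.
    return llm_resolve(element_name, label)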

XBRL as Validation Layer

The tag-mapping problem is worth solving because XBRL, once mapped to canonical concepts, becomes the strongest validation source for document AI extraction on financial filings.

Post 3 described database cross-validation: checking LLM-extracted values against source-of-truth databases like drug formularies and CUSIP registries. XBRL is the equivalent for financial filings. If an LLM extracts "Management Fees: 0.75%" from a fund prospectus table, and the XBRL tag InvestmentAdvisoryFeesPaid for the same fund and period reports 0.75%, you have cross-validation. If the LLM returns 0.57%, the XBRL value flags it.

This works because XBRL and document AI extract from different representations of the same data. The XBRL value comes from a tagged element in the filing's structured layer. The document AI value comes from visual or textual extraction of the rendered document. They are independent measurements of the same fact.
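A minimal version of that check, with a relative tolerance to absorb rounding differences between the rendered document and the tagged value. The tolerance here is an assumption for illustration.

def cross_validate(llm_value: float, xbrl_value: float, rel_tol: float = 0.005) -> str:
    """Compare an LLM-extracted figure to the XBRL-tagged value for the same fact."""
    if xbrl_value == 0:
        agree = abs(llm_value) <= rel_tol
    else:
        agree = abs(llm_value - xbrl_value) / abs(xbrl_value) <= rel_tol
    return "confirmed" if agree else "needs_review"

cross_validate(0.75, 0.75)  # 'confirmed'
cross_validate(0.57, 0.75)  # 'needs_review': mismatch flagged for a human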

XBRL filings are not error-free. The SEC's Data Quality Committee publishes 177 validation rules for detecting common XBRL errors, and roughly 12% of filings contain at least one negative-value error or invalid axis combination [3]. Cross-validation catches errors in both directions: LLM hallucinations flagged by correct XBRL tags, and XBRL filing errors flagged by correct LLM extraction.

On the fund intelligence platform, the extraction pipeline runs in three tiers:

Tier           | Data Source                       | Extraction Method                       | % of Data Points
XBRL + mapping | Financial statements              | Bag-of-words mapping + edgartools       | ~55%
Tables         | Supplementary schedules, exhibits | Table detection + layout extraction     | ~25%
Narrative      | MD&A, risk factors, compensation  | LLM extraction + XBRL cross-validation  | ~20%

XBRL serves double duty: extraction source for tier 1 and validation layer for tiers 2 and 3. When the LLM extracts a revenue figure from the MD&A narrative, the pipeline checks it against the XBRL-tagged revenue for the same period. A mismatch goes to review; agreement raises confidence. For EDGAR filings specifically, you skip the OCR layer entirely (the documents are HTML, not scanned images) and send text directly to the model, which makes the coordinate cross-validation from Post 3 unnecessary. You still cross-validate against XBRL wherever both sources cover the same data point.

Frequently Asked Questions

How do you make XBRL financial data comparable across companies?

Raw XBRL tags are not standardized enough for direct comparison. You need a mapping layer that resolves different tags to canonical concepts. A bag-of-words approach (tokenize element names, score against a concept dictionary) handles the bulk automatically with high accuracy. LLMs resolve the long tail of ambiguous custom extensions. edgartools provides a pre-built mapping of 2,770 tags to 234 concepts.

What percentage of a 10-K filing does XBRL actually cover?

XBRL tags the financial statements: balance sheet, income statement, cash flow statement, and notes. By volume, that's less than 10% of the document. A typical 10-K is 7.6% actual text and 55% HTML markup [4]. By structured data points, XBRL covers roughly 55-67% of what a financial analysis pipeline needs. The remaining data sits in narrative sections, supplementary schedules, and exhibits that require document AI extraction.

Can XBRL replace document AI for financial data extraction?

No. XBRL covers financial statements only. Management Discussion & Analysis, risk factors, legal proceedings, executive compensation, and supplementary schedules contain critical data points that XBRL does not tag. XBRL's strongest role is as a validation layer: cross-checking document AI extractions against the tagged values to catch errors in both the AI output and the XBRL filing itself.

Why do companies create custom XBRL extension tags?

The standard US GAAP taxonomy cannot cover every line item for every industry. When a company has a disclosure concept not represented in the standard taxonomy, it creates a custom extension. The problem is that many extensions duplicate standard concepts with different names, and custom extensions are not comparable across companies. This drives the need for NLP-based taxonomy mapping.

How accurate are XBRL filings themselves?

Roughly 12% of SEC filings contain at least one XBRL error: negative values where positives are expected, invalid dimensional combinations, or incorrect period references. The SEC's Data Quality Committee publishes 177 validation rules for detecting these errors. Using XBRL as a cross-validation layer catches errors in both directions: AI extraction errors flagged by correct XBRL, and XBRL filing errors flagged by correct AI extraction.

References

  1. EdgarTools: XBRL Standardized Financials
  2. FASB US GAAP Financial Reporting Taxonomy
  3. XBRL US Data Quality Committee Validation Rules
  4. Notre Dame SRAF: Stage One 10-X Parsing Documentation

Tags: SEC EDGAR, XBRL, Financial Data Extraction, Document AI, Python