Why your AI pricing model will lie to you: a field guide to procurement data cleanup.

Seven dirty data problems kill AI pricing and cost-intelligence projects: currency drift, unit-of-measure inconsistency, vendor aliases, country-specific tax handling, period misalignment, supplier-hierarchy collapse, and missing-not-at-random gaps. Here is the order to fix them in — before you spend a cent on models.

Pricing intelligence built on dirty data is more dangerous than no pricing intelligence. A model trained on procurement records full of currency drift, supplier aliases, and country-specific tax handling will confidently predict the wrong price — and the team that trusted it will price the next bid accordingly.

Seven dirty data problems consistently kill SMB AI pricing projects. Below: the taxonomy, the order to fix them in, and how long each one takes when you've never done it before.

The seven problems, in priority order

1. Currency drift

A line item recorded in EUR in 2022 and another in EUR in 2024 are not directly comparable, even before you normalise to a base currency. Inflation, FX shifts, and supplier-side currency clauses all compound. Counter: convert to a single base currency at the rate in force on the transaction date; store the FX rate used alongside the converted amount; do not back-convert.
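
A minimal Python sketch of that counter, assuming a hypothetical in-memory rate table (`FX_TO_USD`) with placeholder rates and USD as the base currency; none of these names come from a real library. It covers only the FX step, not inflation adjustment. The rate and its date travel with the converted record so the figure can be audited later.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

# Hypothetical rate table: (currency, transaction date) -> USD per 1 unit.
# The figures are illustrative placeholders, not real market rates.
FX_TO_USD = {
    ("EUR", date(2022, 3, 1)): Decimal("1.10"),
    ("EUR", date(2024, 3, 1)): Decimal("1.08"),
}


@dataclass(frozen=True)
class ConvertedAmount:
    original_amount: Decimal
    original_currency: str
    fx_rate: Decimal      # stored so the conversion stays auditable
    rate_date: date
    base_amount: Decimal  # amount in the base currency (USD in this sketch)


def to_base(amount: Decimal, currency: str, txn_date: date) -> ConvertedAmount:
    """Convert at the transaction-date rate; fail loudly if no rate is on file."""
    if currency == "USD":
        rate = Decimal("1")
    else:
        try:
            rate = FX_TO_USD[(currency, txn_date)]
        except KeyError:
            raise LookupError(f"no FX rate for {currency} on {txn_date}") from None
    base = (amount * rate).quantize(Decimal("0.01"))
    return ConvertedAmount(amount, currency, rate, txn_date, base)


line_2022 = to_base(Decimal("1000.00"), "EUR", date(2022, 3, 1))
line_2024 = to_base(Decimal("1000.00"), "EUR", date(2024, 3, 1))
```

The same EUR 1,000 lands at two different base amounts because each line uses its own transaction-date rate, and a missing rate raises an error instead of falling back to a guess.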

2. Unit-of-measure inconsistency

Concrete is recorded in m³ here, cubic yards there, bagged units somewhere else. Doors are recorded by leaf, by frame, by aperture. Counter: a UOM dictionary. Every line item normalises to a canonical unit. Items the dictionary cannot resolve get flagged for human review and are never silently mapped.
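
A UOM dictionary can be sketched in a few lines of Python. The table name `UOM_DICTIONARY`, the unit codes, and the function below are hypothetical; the only hard fact is the cubic-yard factor, which follows from 1 yd = 0.9144 m. Units the dictionary does not know are returned unchanged with a review flag.

```python
from decimal import Decimal

# Hypothetical UOM dictionary: source unit -> (canonical unit, multiplier).
# 1 cubic yard = 0.9144^3 m3 = 0.764554857984 m3.
UOM_DICTIONARY = {
    "m3": ("m3", Decimal("1")),
    "yd3": ("m3", Decimal("0.764554857984")),
}


def normalise_quantity(quantity, unit):
    """Return (quantity, unit, needs_review) in the canonical unit if known."""
    key = unit.strip().lower()
    if key not in UOM_DICTIONARY:
        # Unresolved units pass through untouched and are flagged, never guessed.
        return quantity, unit, True
    canonical_unit, factor = UOM_DICTIONARY[key]
    return quantity * factor, canonical_unit, False


converted = normalise_quantity(Decimal("10"), "yd3")
flagged = normalise_quantity(Decimal("40"), "bag")
```

Bagged concrete is deliberately absent from the table in this sketch: converting bags to volume needs a bag size, which is exactly the kind of decision a human reviewer should make.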

3. Vendor aliases

"Acme Co", "Acme Corp", "Acme Limited", "ACME, Ltd." — the same supplier in four spellings produces four supplier records, four price histories, and a model that thinks the supplier base is denser than it is. Counter: a supplier-resolution table built once and maintained continuously.

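As an illustration, a supplier-resolution table can be as simple as a curated mapping from normalised alias to canonical supplier ID. The IDs, table name, and helper functions below are hypothetical. The sketch uses exact lookup on a cleaned key in place of fuzzy matching, on the assumption that a person approves each alias before it enters the table.

```python
import re

# Hypothetical resolution table: normalised alias -> canonical supplier ID.
SUPPLIER_ALIASES = {
    "acme co": "SUP-0001",
    "acme corp": "SUP-0001",
    "acme limited": "SUP-0001",
    "acme ltd": "SUP-0001",
}


def alias_key(raw_name):
    """Lower-case the name, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", raw_name.lower())
    return " ".join(cleaned.split())


def resolve_supplier(raw_name):
    """Return the canonical supplier ID, or None when the alias is not on file."""
    return SUPPLIER_ALIASES.get(alias_key(raw_name))


spellings = ["Acme Co", "Acme Corp", "Acme Limited", "ACME, Ltd."]
resolved = {resolve_supplier(name) for name in spellings}
```

All four spellings collapse to one ID, while a name that is not in the table resolves to None and can be queued for review.
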
4. Country-specific tax handling

A line item in one country is gross of tax; the same item in another is net. A model that treats them as the same will conclude the second country is cheaper. Counter: normalise to net-of-tax across the whole dataset, with country-specific rules documented and reviewed annually.
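
A minimal sketch of net-of-tax normalisation in Python. The country codes, rates, and gross/net conventions in `TAX_RULES` are placeholders, not real tax rules, and a single flat rate per country is a simplifying assumption; real rules often vary by item category.

```python
from decimal import Decimal

# Hypothetical per-country rules. "AA" and "BB" are placeholder country codes
# and the rates are illustrative, not real tax rates.
TAX_RULES = {
    "AA": {"recorded_gross": True, "rate": Decimal("0.20")},
    "BB": {"recorded_gross": False, "rate": Decimal("0.10")},
}


def net_of_tax(amount, country):
    """Return the net amount; raises KeyError for a country with no documented rule."""
    rule = TAX_RULES[country]
    if rule["recorded_gross"]:
        # Gross = net * (1 + rate), so divide to strip the tax back out.
        return (amount / (Decimal("1") + rule["rate"])).quantize(Decimal("0.01"))
    return amount


net_a = net_of_tax(Decimal("120.00"), "AA")
net_b = net_of_tax(Decimal("100.00"), "BB")
```

Once both lines are net, the item that looked 20 percent dearer in the gross-recording country turns out to cost the same.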

5. Period misalignment

Some prices are quoted, some are contracted, some are paid; each has a different effective date. Counter: pick one timestamp as canonical (we usually use date-of-contract) and use it consistently. Document the choice.
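
One way to enforce a single canonical timestamp is to name the field once and route every read through it. The field names and records below are hypothetical. The sketch does not fall back to the quote or payment date when the contract date is missing, since a silent fallback would reintroduce the misalignment.

```python
from datetime import date

# The documented choice of canonical timestamp (field name is hypothetical).
CANONICAL_DATE_FIELD = "contract_date"


def canonical_date(record):
    """Return the canonical date, or None if the record does not have one."""
    return record.get(CANONICAL_DATE_FIELD)


records = [
    {"id": "A", "quote_date": date(2024, 1, 10),
     "contract_date": date(2024, 2, 1), "paid_date": date(2024, 4, 15)},
    {"id": "B", "quote_date": date(2024, 1, 20),
     "contract_date": None, "paid_date": None},
]

# Records without the canonical date are set aside for review, not re-dated.
usable = [r for r in records if canonical_date(r) is not None]
needs_review = [r["id"] for r in records if canonical_date(r) is None]
```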

6. Supplier-hierarchy collapse

A supplier conglomerate appears as five separate vendors; consolidation muddies the price history. Counter: maintain a parent-subsidiary map at the supplier level. Decide explicitly whether models train on the parent or the subsidiary view.
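
A parent-subsidiary map can be sketched as a dictionary from supplier ID to immediate parent, with the training view made an explicit argument. All IDs and function names here are hypothetical.

```python
# Hypothetical map: supplier ID -> immediate parent ID (None for a top-level entity).
PARENT_OF = {
    "SUP-0001": "GRP-01",
    "SUP-0002": "GRP-01",
    "GRP-01": None,
    "SUP-0003": None,
}


def ultimate_parent(supplier_id):
    """Walk up the map to the top-level entity; unknown IDs resolve to themselves."""
    seen = set()
    current = supplier_id
    while PARENT_OF.get(current) is not None:
        if current in seen:
            raise ValueError(f"cycle in supplier hierarchy at {current}")
        seen.add(current)
        current = PARENT_OF[current]
    return current


def model_key(supplier_id, view):
    """Return the grouping key for modelling under an explicitly chosen view."""
    if view == "parent":
        return ultimate_parent(supplier_id)
    if view == "subsidiary":
        return supplier_id
    raise ValueError(f"unknown view: {view!r}")


parent_view = model_key("SUP-0002", "parent")
subsidiary_view = model_key("SUP-0002", "subsidiary")
```

Requiring the caller to pass `view` means the parent-versus-subsidiary decision is written down in code and cannot be made by accident.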

7. Missing-not-at-random gaps

Cheap deals get logged; failed deals go unrecorded, so models learn from a biased sample. Counter: audit the data-collection pipeline; document the missing-data mechanism; correct for sample bias when modelling.
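
One common correction is inverse-probability weighting, offered here as an illustration, not as the method this guide prescribes. The sketch assumes you can get total deal counts per outcome from an independent source such as a bid register; the counts and names below are hypothetical.

```python
from fractions import Fraction

# Hypothetical counts. TOTAL_DEALS is assumed to come from an independent
# source; LOGGED_DEALS counts the deals that have a price record.
TOTAL_DEALS = {"won": 80, "lost": 120}
LOGGED_DEALS = {"won": 72, "lost": 12}


def logging_rates(total, logged):
    """Share of deals in each outcome group that made it into the price data."""
    return {k: Fraction(logged.get(k, 0), total[k]) for k in total}


def inverse_probability_weights(rates):
    """Weight each logged record by 1 / P(logged | outcome)."""
    weights = {}
    for outcome, rate in rates.items():
        if rate == 0:
            raise ValueError(f"no logged records for outcome {outcome!r}")
        weights[outcome] = 1 / rate
    return weights


rates = logging_rates(TOTAL_DEALS, LOGGED_DEALS)
weights = inverse_probability_weights(rates)
```

With these counts, each logged lost deal stands in for ten real ones. The audit step is the `rates` table itself: a large gap between groups is the documented missing-data mechanism.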

Build the cleanup before you build the model. The cleanup will outlive the model.

The order matters

Currency, UOM, and supplier resolution are foundational — fix them first. Period normalisation and tax handling are the next layer. Hierarchy collapse and missing-data corrections come last because they require the foundations to be stable. Most teams want to start with hierarchy collapse because it is the most intellectually interesting problem; that is the wrong order.

Realistic timing for a firm of 50 to 200 people: six months for a first cleanup pass, then continuous quality monitoring forever. Any less and the cleanup is incomplete; any more and you have built an ETL platform when what you needed was a data-quality discipline.
