Seven dirty data problems kill AI pricing and cost-intelligence projects: currency drift, unit-of-measure inconsistency, vendor aliases, country-specific tax handling, period misalignment, supplier-hierarchy collapse, and missing-not-at-random gaps. Here is the order to fix them in — before you spend a cent on models.
Pricing intelligence built on dirty data is more dangerous than no pricing intelligence. A model trained on procurement records full of currency drift, supplier aliases, and country-specific tax handling will confidently predict the wrong price — and the team that trusted it will price the next bid accordingly.
Seven dirty data problems consistently kill SMB AI pricing projects. Below: the taxonomy, the order to fix them in, and how long each one takes when you've never done it before.
The seven problems, in priority order
1. Currency drift
A line item recorded in EUR in 2022 and another in EUR in 2024 are not directly comparable, even before you normalise to a base currency. Inflation, FX shifts, and supplier-side currency clauses all compound. Counter: convert to a single base currency at the rate on the date of the transaction; store the FX rate used alongside each converted amount; do not back-convert or re-convert stored amounts at a later rate.
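A minimal sketch of transaction-date conversion, assuming USD as the base currency and a hypothetical `FX_TO_BASE` lookup. The rates shown are illustrative placeholders, not market data; in practice they would come from whatever rate source your finance team already trusts.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal

BASE_CURRENCY = "USD"  # assumption: pick whichever base your firm reports in

# Hypothetical rate table: (currency, transaction date) -> units of base per unit.
# Values are illustrative only.
FX_TO_BASE = {
    ("EUR", date(2022, 3, 1)): Decimal("1.10"),
    ("EUR", date(2024, 3, 1)): Decimal("1.08"),
}


@dataclass(frozen=True)
class NormalisedAmount:
    base_amount: Decimal  # amount in the base currency, rounded to cents
    fx_rate: Decimal      # the rate actually applied, kept for audit
    fx_date: date         # the transaction date the rate belongs to


def to_base(amount: Decimal, currency: str, txn_date: date) -> NormalisedAmount:
    """Convert at the transaction-date rate and record which rate was used."""
    if currency == BASE_CURRENCY:
        return NormalisedAmount(amount, Decimal("1"), txn_date)
    # A missing rate raises KeyError on purpose: better to fail than to guess.
    rate = FX_TO_BASE[(currency, txn_date)]
    converted = (amount * rate).quantize(Decimal("0.01"))
    return NormalisedAmount(converted, rate, txn_date)
```

Storing `fx_rate` and `fx_date` next to the converted amount is what makes the conversion auditable later without touching the original record.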
2. Unit-of-measure inconsistency
Concrete is in m³ here, yd³ there, and bagged units somewhere else. Doors are recorded by leaf, by frame, by aperture. Counter: a UOM dictionary. Every line item normalises to a canonical unit. Items the dictionary cannot resolve get flagged for human review, never silently mapped.
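One way a UOM dictionary with a review flag might look. The `UOM_DICTIONARY` contents and the `normalise_uom` name are hypothetical; the only hard fact in it is the cubic-yard conversion factor.

```python
# Hypothetical dictionary: (item category, recorded unit) -> (canonical unit, factor).
UOM_DICTIONARY = {
    ("concrete", "m3"): ("m3", 1.0),
    ("concrete", "yd3"): ("m3", 0.764554857984),  # 1 cubic yard in cubic metres
}


def normalise_uom(category: str, unit: str, quantity: float):
    """Return (unit, quantity, needs_review).

    Unknown (category, unit) pairs are passed through unchanged and flagged,
    so nothing is ever mapped silently.
    """
    key = (category.lower(), unit.lower())
    if key not in UOM_DICTIONARY:
        return (unit, quantity, True)
    canonical_unit, factor = UOM_DICTIONARY[key]
    return (canonical_unit, quantity * factor, False)
```

Bagged concrete is deliberately absent from the dictionary here: a bag size varies by product, so it falls through to human review rather than getting a guessed factor.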
3. Vendor aliases
"Acme Co", "Acme Corp", "Acme Limited", "ACME, Ltd." — the same supplier in four spellings produces four supplier records, four price histories, and a model that thinks the supplier base is denser than it is. Counter: a supplier-resolution table built once and maintained continuously.
4. Country-specific tax handling
A line item in one country is gross of tax; the same item in another is net. A model that treats them as the same will conclude the second country is cheaper. Counter: normalise to net-of-tax across the whole dataset, with country-specific rules documented and reviewed annually.
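A minimal sketch of net-of-tax normalisation. The country codes `AA` and `BB` are placeholders and the rates are invented; real rules vary by item type and change over time, which is why the annual review matters.

```python
from decimal import Decimal

# Hypothetical rules per country: are recorded prices gross of tax, and at what rate?
# "AA" and "BB" are placeholder codes; the rates are illustrative only.
TAX_RULES = {
    "AA": {"gross": True, "rate": Decimal("0.20")},
    "BB": {"gross": False, "rate": Decimal("0.10")},
}


def net_of_tax(amount: Decimal, country: str) -> Decimal:
    """Return the net-of-tax amount under the documented rule for the country."""
    rule = TAX_RULES[country]  # KeyError for undocumented countries, on purpose
    if not rule["gross"]:
        return amount  # already recorded net of tax
    return (amount / (Decimal("1") + rule["rate"])).quantize(Decimal("0.01"))
```

A gross price of 120.00 in `AA` and a net price of 100.00 in `BB` both come out as 100.00, which is the comparison the model should have been making all along.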
5. Period misalignment
Some prices are quoted, some are contracted, some are paid; each has a different effective date. Counter: pick one timestamp as canonical (we usually use date-of-contract) and use it consistently. Document the choice.
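Encoding the timestamp choice in one place keeps it from drifting. A small sketch, assuming records are dicts with hypothetical `quote_date`, `contract_date`, and `paid_date` fields:

```python
# The documented choice of canonical timestamp lives in exactly one constant.
CANONICAL_DATE_FIELD = "contract_date"  # hypothetical field name


def split_by_canonical_date(records):
    """Separate records that carry the canonical date from those that do not.

    Records without it are set aside rather than falling back to a quote or
    payment date, which would quietly mix timestamps.
    """
    usable, excluded = [], []
    for record in records:
        if record.get(CANONICAL_DATE_FIELD) is not None:
            usable.append(record)
        else:
            excluded.append(record)
    return usable, excluded
```

Whether excluded records are dropped or sent back for enrichment is a separate decision; the point is that they never enter the training set under a different date.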
6. Supplier-hierarchy collapse
A supplier conglomerate appears as five separate vendors; consolidation muddies the price history. Counter: maintain a parent-subsidiary map at the supplier level. Decide explicitly whether models train on the parent or the subsidiary view.
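A sketch of a parent-subsidiary map with an explicit view switch. The `PARENT_OF` contents and the supplier IDs are hypothetical.

```python
# Hypothetical map: subsidiary supplier ID -> immediate parent supplier ID.
PARENT_OF = {
    "SUP-0002": "SUP-0001",
    "SUP-0003": "SUP-0002",
}


def ultimate_parent(supplier_id: str) -> str:
    """Walk up the map to the top-level parent, guarding against cycles."""
    seen = set()
    while supplier_id in PARENT_OF:
        if supplier_id in seen:
            raise ValueError(f"cycle in supplier hierarchy at {supplier_id}")
        seen.add(supplier_id)
        supplier_id = PARENT_OF[supplier_id]
    return supplier_id


def training_key(supplier_id: str, view: str = "parent") -> str:
    """Return the supplier key a model should group on for the chosen view."""
    if view == "parent":
        return ultimate_parent(supplier_id)
    if view == "subsidiary":
        return supplier_id
    raise ValueError(f"unknown view: {view}")
```

Making `view` an explicit argument forces the parent-versus-subsidiary decision to be written down at the call site instead of being implied by whichever table happened to be joined.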
7. Missing-not-at-random gaps
Cheap deals get logged; failed deals get ignored. Models learn from biased samples. Counter: audit the data-collection pipeline; document the missing-data mechanism; correct for sample bias when modelling.
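A first audit can be as simple as measuring how often each kind of deal ends up with a price record. The sketch below assumes a separate bid register that lists every bid with a hypothetical `outcome` field, and uses inverse-coverage weighting as one possible correction. That weighting only helps if the logged records are representative within each outcome, which is itself an assumption to check.

```python
from collections import Counter


def coverage_by_outcome(all_bids, logged_ids):
    """Share of bids with a logged price record, per outcome.

    all_bids: dicts with "id" and "outcome" keys (hypothetical bid register).
    logged_ids: set of bid IDs that have a price record.
    """
    total = Counter(bid["outcome"] for bid in all_bids)
    logged = Counter(bid["outcome"] for bid in all_bids if bid["id"] in logged_ids)
    return {outcome: logged[outcome] / total[outcome] for outcome in total}


def inverse_coverage_weights(coverage):
    """Weight each logged record by 1 / coverage of its outcome group.

    Groups with zero coverage get None: no weighting can recover data
    that was never collected.
    """
    return {o: (1.0 / c if c > 0 else None) for o, c in coverage.items()}
```

If won deals are logged every time and lost deals a quarter of the time, each logged lost deal gets four times the weight of a won one. The coverage table is also the documentation of the missing-data mechanism.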
Build the cleanup before you build the model. The cleanup will outlive the model.
The order matters
Currency, UOM, and supplier resolution are foundational — fix them first. Period normalisation and tax handling are the next layer. Hierarchy collapse and missing-data corrections come last because they require the foundations to be stable. Most teams want to start with hierarchy collapse because it is the most intellectually interesting problem; that is the wrong order.
Realistic timing for a firm of 50 to 200 people: six months for a first cleanup pass, then continuous quality monitoring forever. Any less and the cleanup is incomplete; any more and you have built an ETL platform when what you needed was a data-quality discipline.