The Model Obsession
The default assumption across business teams and AI consultancies alike is that a more complex model produces better results. More parameters. Larger context windows. Fancier architecture. This assumption is almost always wrong for real-world business use cases.
The model is rarely the bottleneck. The data is. And yet the industry’s attention — its headlines, its benchmark races, its conference talks — is almost entirely focused on the model side of the equation. Meanwhile, in production, systems fail quietly because no one validated the input.
On a 10,000-row business dataset with 15% duplicate records, a gradient boosted tree trained on the clean version will outperform GPT-4 fed the dirty one. Every time.
This is not an argument against sophisticated models. It is an argument for sequencing. You do not optimise the engine before you fix the fuel supply. Data quality is the fuel. The model is the engine.
What Bad Data Looks Like
Bad data rarely announces itself. It hides inside normal-looking tables. It passes initial inspection. It only reveals itself when the model starts producing nonsense — or worse, plausible-looking nonsense. Here is what we find most often:
- Duplicate customer records inflating churn and lifetime value calculations
- Missing values filled with zeros, skewing numeric distributions
- Dates stored as plain text in inconsistent regional formats
- Inconsistent category labels — “SME”, “sme”, “Small-Medium Enterprise” as separate classes
- Outliers from data entry errors — a salary of 99999 meaning “not provided”
None of these are exotic edge cases. They are standard findings in almost every client dataset we have ever opened. The only variable is proportion.
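Two of the findings above, inconsistent category labels and sentinel values standing in for missing data, can be repaired in a few lines. A minimal sketch, using a made-up `segment` column and the 99999 salary sentinel from the examples:

```python
import numpy as np
import pandas as pd

# Illustrative frame reproducing the two problems
df = pd.DataFrame({
    "segment": ["SME", "sme", "Small-Medium Enterprise", "Enterprise"],
    "salary": [42000, 99999, 55000, 99999],
})

# Collapse label variants into one canonical class
canonical = {"sme": "SME", "small-medium enterprise": "SME", "enterprise": "Enterprise"}
df["segment"] = df["segment"].str.strip().str.lower().map(canonical)

# Replace the "not provided" sentinel with a real missing value
df["salary"] = df["salary"].replace(99999, np.nan)
```

The `.map(canonical)` step deliberately produces NaN for any label not in the agreed vocabulary, so unmapped classes surface immediately instead of slipping through as new categories.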
The 80/20 of Data Work
Ask any practitioner who has shipped ML in production what the actual time split looks like. The answer is consistent: 80% of the project is data work. Ingestion, profiling, cleaning, normalisation, validation, feature engineering. The remaining 20% is the modelling, evaluation, and deployment.
Yet most project proposals are written in reverse — heavy on model selection, light on data pipeline. This misalignment between effort and narrative creates chronic timeline underestimation and persistent quality problems at the model layer.
A basic data quality audit at project start pays back its cost in the first week. Here is the minimal version we run on every new dataset:
```python
import pandas as pd

# Load dataset
df = pd.read_csv("dataset.csv")

# 01 -- Null audit
print("=== NULL AUDIT ===")
print(df.isnull().sum().sort_values(ascending=False))

# 02 -- Dtype check
print("\n=== DTYPE CHECK ===")
print(df.dtypes)

# 03 -- Duplicate rows
print("\n=== DUPLICATES ===")
n_dupes = df.duplicated().sum()
print(f"Duplicate rows: {n_dupes} ({n_dupes/len(df)*100:.1f}%)")

# 04 -- Cardinality on object columns
print("\n=== CARDINALITY (OBJECT COLS) ===")
for col in df.select_dtypes(include="object").columns:
    print(f"{col}: {df[col].nunique()} unique values")
```
This takes under two minutes to run. It consistently surfaces problems that would have cost days later in the pipeline. Run it before writing a single line of model code.
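Once the audit flags issues, the first fixes are usually mechanical. A sketch of the two most common, dropping exact duplicate rows and parsing text dates, with illustrative column names rather than anything from a real dataset:

```python
import pandas as pd

# Illustrative frame: one exact duplicate, one off-format date
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "signup_date": ["2023-01-15", "2023-01-15", "15/02/2023", "2023-03-01"],
})

# Exact duplicate rows: keep the first occurrence
df = df.drop_duplicates()

# Parse text dates; rows that do not match the expected format become NaT
df["signup_date"] = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")

# NaT count = rows needing a manual format review, not silent guessing
print(df["signup_date"].isna().sum())
```

Using an explicit `format` with `errors="coerce"` is a deliberate choice: letting pandas guess mixed regional formats is exactly how 15/02 and 02/15 get silently swapped.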
What Clean Data Enables
The benefits of clean data are invisible in the same way that good infrastructure is invisible — you only notice it when it is absent. When data is clean, the rest of the project flows:
- Faster iteration — no debugging time lost to data artefacts
- Interpretable results — patterns reflect reality, not noise
- Models that generalise — trained on signal, not on systematic error
- Cheaper compute — no wasted cycles on junk records
- Trustworthy outputs — stakeholders can act with confidence
“Clean data is not glamorous. It does not make conference presentations. But it is the single variable with the highest ROI in any analytics project.”
The compounding effect is significant. A project that starts with clean data reaches a qualitatively different level of reliability. Edge cases that would have caused silent failures are surfaced and handled. Stakeholder trust is earned from the first demo rather than rebuilt after the first failure.
Practical Checklist
Before any model training begins, the following checklist should be completed and documented. Not as a bureaucratic exercise — as a technical hygiene requirement:
- Schema validation — all columns present, correct types, expected ranges
- Null handling policy — defined per column, not a global fill strategy
- Deduplication strategy — primary key audit, fuzzy match for name fields
- Date standardisation — single timezone, consistent format, parsed not stored as string
- Categorical encoding — agreed vocabulary, unknown class handling documented
- Outlier audit — statistical bounds checked, domain expert sign-off on edge values
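The first checklist item, schema validation, is easy to automate. A minimal sketch in plain pandas; the expected columns, dtypes, and range bounds here are invented for illustration:

```python
import pandas as pd

# Hypothetical expected schema and value ranges
EXPECTED = {
    "customer_id": "int64",
    "salary": "float64",
    "segment": "object",
}
RANGES = {"salary": (0, 500_000)}  # illustrative bounds

def validate_schema(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable schema violations (empty = pass)."""
    problems = []
    for col, dtype in EXPECTED.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    for col, (lo, hi) in RANGES.items():
        if col in df.columns and not df[col].dropna().between(lo, hi).all():
            problems.append(f"{col}: values outside [{lo}, {hi}]")
    return problems

df = pd.DataFrame({
    "customer_id": [1, 2],
    "salary": [42000.0, 55000.0],
    "segment": ["SME", "Enterprise"],
})
print(validate_schema(df))  # empty list when the frame matches
```

Returning a list of violations rather than raising on the first failure means one run of the check documents every schema problem at once, which is what the audit document needs.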
This is what Engrammers does before writing a single line of model code. Every engagement. Every dataset, at every size. Without exception. The checklist is not optional when the outputs are going to be used to make business decisions. It is the foundation on which good inference is built.