Stop Scattering Metadata. Start Shipping Contracts
Pipelines don’t break because data changed. They break because change was undefined.
For a long time, our pipeline framework relied on a “metadata system” made of many JSON files.
It wasn’t one clear definition of a dataset. It was metadata duplicated across folders and files, sometimes in the repo, sometimes mirrored in S3, spread through a structure that only made sense if you already knew the framework.
And it worked… until we needed to evolve it.
The pain: metadata duplication becomes drift
The biggest problem wasn’t that JSON is bad. The problem was duplication.
The same dataset definition would show up in multiple places:
- schema here,
- ingestion config there,
- naming rules somewhere else,
- extra options in another folder.
As the platform grew, it became easy to update one file and forget the others. And that’s the worst kind of bug: not a crash, but silent inconsistency.
Sometimes the pipeline would still run, but now two different “sources of truth” disagreed. Debugging that felt like archaeology.
Where it lived: one repo, one S3… too many places
Technically, it was “centralized”: one repo, one S3 bucket.
Practically, it was distributed: many folders, many JSON files, many conventions.
So onboarding was harder than it should have been. New people didn’t struggle with Spark; they struggled with figuring out where the metadata for a given dataset was supposed to live.
The shift: replace config sprawl with a data contract
That’s when I started advocating for data contracts, specifically based on the Open Data Contract Standard (ODCS).
To be fair, I wasn’t the one who introduced the idea first. Another senior engineer on my team brought data contracts to us, and it immediately clicked for me. But after he left the company, the topic lost momentum, and I realized that if nobody kept pushing, we’d drift back to the old metadata sprawl. So I decided to pick it up and advocate for it myself.
The goal wasn’t to add governance. It was to remove ambiguity.
Instead of maintaining metadata in multiple files, we moved to:
- one contract per dataset
- versioned in Git
- reviewed like code
- easy to locate
- easy to explain
And because ODCS gives a standard structure, the contract stops being “our internal format” and becomes something the ecosystem understands.
ODCS + real life: standard fields and platform-specific metadata
In practice, we used ODCS as the foundation, but we also added a few platform-specific fields that matter for execution: things like S3 paths and ingestion hints.
That’s important: a contract should be standard and useful.
If it can’t drive automation, it becomes documentation that drifts again.
A concrete example: sales.customer contract (YAML)
Below is a simplified contract for a dataset named sales with schema customer. It follows an ODCS-style structure, plus a small custom section where we keep platform-specific metadata we used to spread across multiple JSON configs.
# sales.customer — Data Contract (ODCS-style, simplified)
kind: DataContract
apiVersion: v3.1.0

# What's this data contract about?
domain: sales
dataProduct: customer
version: 1.0.0
status: active
id: 2c6e3e44-8d7a-4e4d-9f7a-6c1d8f0b2a11

authoritativeDefinitions:
  - type: canonical
    url: https://github.com/my-org/data-contracts/blob/main/sales/customer.yaml
    description: Canonical URL to the latest version of this contract.

description:
  purpose: Curated customer dataset for the Sales domain.
  limitations: Represents the sales view of a customer (not a full CRM/MDM golden record).
  usage: Used for sales dashboards, segmentation, and forecasting.

team:
  name: sales-data-platform
  description: Team owning this dataset and contract.
  members:
    - username: kenji
      role: owner

# Infrastructure & servers (where the dataset is served from)
servers:
  - server: lakehouse-prod
    type: s3
    description: "Bronze/Silver storage on S3"

# Platform-specific metadata (kept in one place instead of scattered JSON configs)
customProperties:
  - property: s3_path_bronze
    value: s3://company-datalake/bronze/sales/customer/
  - property: s3_path_silver
    value: s3://company-datalake/silver/sales/customer/
  - property: primary_key
    value: [customer_id]
  - property: merge_strategy
    value: scd1
  - property: ordering_column
    value: lsn  # or event_time if you don't have LSN/SCN

# Dataset schema + quality expectations
schema:
  - name: customer
    physicalName: customer
    physicalType: table
    businessName: Customer (Sales)
    description: Current state of customers in the Sales domain.
    tags: ["sales", "customer"]
    properties:
      - name: customer_id
        businessName: Customer Identifier
        logicalType: string
        physicalType: string
        required: true
        primaryKey: true
        description: Unique customer identifier.
        quality:
          - metric: nullValues
            mustBe: 0
            dimension: completeness
            severity: error
          - metric: uniqueness
            mustBe: 1
            dimension: uniqueness
            severity: error
      - name: full_name
        logicalType: string
        physicalType: string
        required: true
        description: Customer full name.
      - name: email
        logicalType: string
        physicalType: string
        required: false
        description: Customer email address (when available).
      - name: country
        logicalType: string
        physicalType: string
        required: true
        description: Customer country code.
      - name: status
        logicalType: string
        physicalType: string
        required: true
        description: Business status of the customer.
        quality:
          - metric: acceptedValues
            values: ["ACTIVE", "INACTIVE"]
            dimension: validity
            severity: error
      - name: updated_at
        logicalType: timestamp
        physicalType: timestamp
        required: true
        description: Last update time for this record.

    # Dataset-level quality (volume/freshness examples)
    quality:
      - metric: freshness
        maxDelayMinutes: 1440
        dimension: timeliness
        severity: warning
      - metric: rowCount
        mustBeGreaterThan: 1
        dimension: completeness
        severity: warning

# Change management rules (keeps evolution explicit)
changeManagement:
  maturity: evolving
  compatibleChanges:
    - add_optional_field
    - add_enum_value
  breakingChanges:
    - rename_field
    - change_field_type
    - remove_field

contractCreatedTs: "2026-02-17T00:00:00-03:00"
What matters here is not the exact YAML shape; it’s the outcome.
This single file replaces a whole ecosystem of scattered JSON configs:
- schema + semantics in one place
- platform hints (S3 paths, PK, merge strategy) in one place
- quality expectations in one place
- ownership and change management in one place
And because it’s versioned and reviewed like code, change becomes explicit:
- downstream teams can see what changed, when, and why.
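To make that concrete, here’s a rough sketch of what consuming the contract at runtime could look like. It’s a minimal sketch, assuming PyYAML and the layout shown above; the helper names (load_contract, custom_properties) and the file path are illustrative, not part of any specific framework.

```python
from pathlib import Path

import yaml


def load_contract(path: str) -> dict:
    """Parse a contract YAML file into a plain dict."""
    return yaml.safe_load(Path(path).read_text())


def custom_properties(contract: dict) -> dict:
    """Flatten the customProperties list into a simple key -> value lookup."""
    return {p["property"]: p["value"] for p in contract.get("customProperties", [])}


# Hypothetical path inside the data-contracts repo.
contract = load_contract("sales/customer.yaml")
props = custom_properties(contract)

bronze_path = props["s3_path_bronze"]       # s3://company-datalake/bronze/sales/customer/
silver_path = props["s3_path_silver"]       # s3://company-datalake/silver/sales/customer/
primary_key = props["primary_key"]          # ["customer_id"]
merge_strategy = props["merge_strategy"]    # "scd1"
ordering_column = props["ordering_column"]  # "lsn"
```

The point is that the ingestion job and the documentation now read the same file, so they can’t drift apart.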
The part that unlocked the next step: data quality
Once the contract became the single source of truth, the next move was obvious:
If the contract defines what the dataset is, it can also define what “good data” means.
My plan was to use the contract to drive SparkDQ checks:
- nullability rules (what can and can’t be null)
- uniqueness constraints (primary keys that must be unique)
- accepted values (enums / domains)
- freshness expectations
The key idea: don’t create a separate universe of checks.
Attach quality rules to the dataset definition and enforce them as part of publishing the data.
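As a sketch of that idea, the snippet below derives checks directly from the property-level rules in the contract. It uses plain PySpark aggregations rather than SparkDQ’s own API, and it reuses the contract dict from the previous sketch; silver_df stands in for whatever DataFrame is about to be published.

```python
from pyspark.sql import DataFrame, functions as F


def check_properties(df: DataFrame, contract: dict) -> list[str]:
    """Evaluate required / primaryKey / acceptedValues rules from the contract."""
    failures = []
    table = contract["schema"][0]  # the single "customer" object in this contract
    for prop in table["properties"]:
        col = prop["name"]
        if prop.get("required"):
            nulls = df.filter(F.col(col).isNull()).count()
            if nulls:
                failures.append(f"{col}: {nulls} null values, but required is true")
        if prop.get("primaryKey"):
            dupes = df.groupBy(col).count().filter(F.col("count") > 1).count()
            if dupes:
                failures.append(f"{col}: {dupes} duplicated keys, but primaryKey is true")
        for rule in prop.get("quality", []):
            if rule.get("metric") == "acceptedValues":
                bad = df.filter(~F.col(col).isin(*rule["values"])).count()
                if bad:
                    failures.append(f"{col}: {bad} rows outside {rule['values']}")
    return failures


# Enforce at publish time: refuse to promote the data if the contract is violated.
failures = check_properties(silver_df, contract)  # silver_df: the DataFrame being published
if failures:
    raise ValueError("Contract violations:\n" + "\n".join(failures))
```

Whether the checks eventually run through SparkDQ or something else, the rules themselves live in the contract, right next to the schema they constrain.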
A contract is not just documentation.
A contract is a promise you can automate.
Results: less confusion, easier changes, better onboarding
Even before going deep on automated DQ, the contract alone made a difference:
- less confusion (“where do I change this?”)
- changes lived in one place, which made them easier to understand and review
- onboarding improved because the dataset definition was explicit and discoverable
It wasn’t a huge re-platforming effort. It was a simplification.
Data contracts (and ODCS in particular) are often presented as governance.
But what I saw in practice is simpler: contracts are infrastructure.
They reduce fragility, make expectations explicit, and turn “tribal knowledge” into something versioned, reviewable, and automatable.
If you’re starting, keep it small:
- ownership
- schema + semantics
- change policy
- a handful of quality rules
One contract per dataset. One source of truth. Everything else can evolve from there.