
Stop Scattering Metadata. Start Shipping Contracts


Pipelines don’t break because data changed. They break because change was undefined.

For a long time, our pipeline framework relied on a “metadata system” made of many JSON files.

It wasn’t one clear definition of a dataset. It was metadata duplicated across folders and files, sometimes in the repo, sometimes mirrored in S3, spread through a structure that only made sense if you already knew the framework.

And it worked… until we needed to evolve it.

The pain: metadata duplication becomes drift

The biggest problem wasn’t that JSON is bad. The problem was duplication.

The same dataset definition would show up in multiple places:

  • schema here,
  • ingestion config there,
  • naming rules somewhere else,
  • extra options in another folder.

As the platform grew, it became easy to update one file and forget the others. And that’s the worst kind of bug: not a crash, but silent inconsistency.

Sometimes the pipeline would still run, but now two different “sources of truth” disagreed. Debugging that felt like archaeology.

Where it lived: one repo, one S3… too many places

Technically, it was “centralized”: one repo, one S3 bucket.

Practically, it was distributed: many folders, many JSON files, many conventions.

So onboarding was harder than it should have been. New people didn’t struggle with Spark; they struggled with figuring out where the metadata for a given dataset was supposed to live.

The shift: replace config sprawl with a data contract

That’s when I started advocating for data contracts, specifically based on the Open Data Contract Standard (ODCS).

To be fair, I wasn’t the one who introduced the idea first. Another senior engineer on my team brought data contracts to us, and it immediately clicked for me. But after he left the company, the topic lost momentum, and I realized that if nobody kept pushing, we’d drift back to the old metadata sprawl. So I decided to pick it up and advocate for it myself.

The goal wasn’t to add governance. It was to remove ambiguity.

Instead of maintaining metadata in multiple files, we moved to:

  • one contract per dataset
  • versioned in Git
  • reviewed like code
  • easy to locate
  • easy to explain

And because ODCS gives a standard structure, the contract stops being “our internal format” and becomes something the ecosystem understands.

ODCS + real life: standard fields and platform-specific metadata

In practice, we used ODCS as the foundation, but we also added a few platform-specific fields that matter for execution, such as S3 paths and ingestion hints.

That’s important: a contract should be both standard and useful.

If it can’t drive automation, it becomes documentation that drifts again.
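
For example, the ingestion job can read the contract at startup and pull those execution hints straight out of it. A minimal sketch in Python (the data-contracts/<domain>/<dataset>.yaml path is a hypothetical layout; the customProperties shape matches the example contract in the next section):

# Minimal sketch: the contract is the only file the pipeline reads
import yaml  # PyYAML

def load_contract(domain: str, dataset: str) -> dict:
    # Hypothetical repo layout: data-contracts/<domain>/<dataset>.yaml
    with open(f"data-contracts/{domain}/{dataset}.yaml") as f:
        return yaml.safe_load(f)

def custom_properties(contract: dict) -> dict:
    # Flatten the ODCS-style customProperties list into a plain dict
    return {p["property"]: p["value"] for p in contract.get("customProperties", [])}

contract = load_contract("sales", "customer")
props = custom_properties(contract)

bronze_path = props["s3_path_bronze"]      # where raw data lands
merge_strategy = props["merge_strategy"]   # e.g. "scd1"
primary_key = props["primary_key"]         # e.g. ["customer_id"]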

A concrete example: sales.customer contract (YAML)

Below is a simplified contract for the sales.customer dataset (the customer schema in the sales domain). It follows an ODCS-style structure, plus a small custom section for the platform-specific metadata we used to spread across multiple JSON configs.

# sales.customer — Data Contract (ODCS-style, simplified)
kind: DataContract
apiVersion: v3.1.0

# What's this data contract about?
domain: sales
dataProduct: customer
version: 1.0.0
status: active
id: 2c6e3e44-8d7a-4e4d-9f7a-6c1d8f0b2a11

authoritativeDefinitions:
  - type: canonical
    url: https://github.com/my-org/data-contracts/blob/main/sales/customer.yaml
    description: Canonical URL to the latest version of this contract.

description:
  purpose: Curated customer dataset for the Sales domain.
  limitations: Represents the sales view of a customer (not a full CRM/MDM golden record).
  usage: Used for sales dashboards, segmentation, and forecasting.

team:
  name: sales-data-platform
  description: Team owning this dataset and contract.
  members:
    - username: kenji
      role: owner

# Infrastructure & servers (where the dataset is served from)
servers:
  - server: lakehouse-prod
    type: s3
    description: "Bronze/Silver storage on S3"

# Platform-specific metadata (kept in one place instead of scattered JSON configs)
customProperties:
  - property: s3_path_bronze
    value: s3://company-datalake/bronze/sales/customer/
  - property: s3_path_silver
    value: s3://company-datalake/silver/sales/customer/
  - property: primary_key
    value: [customer_id]
  - property: merge_strategy
    value: scd1
  - property: ordering_column
    value: lsn   # or event_time if you don't have LSN/SCN

# Dataset schema + quality expectations
schema:
  - name: customer
    physicalName: customer
    physicalType: table
    businessName: Customer (Sales)
    description: Current state of customers in the Sales domain.
    tags: ["sales", "customer"]
    properties:
      - name: customer_id
        businessName: Customer Identifier
        logicalType: string
        physicalType: string
        required: true
        primaryKey: true
        description: Unique customer identifier.
        quality:
          - metric: nullValues
            mustBe: 0
            dimension: completeness
            severity: error
          - metric: uniqueness
            mustBe: 1
            dimension: uniqueness
            severity: error

      - name: full_name
        logicalType: string
        physicalType: string
        required: true
        description: Customer full name.

      - name: email
        logicalType: string
        physicalType: string
        required: false
        description: Customer email address (when available).

      - name: country
        logicalType: string
        physicalType: string
        required: true
        description: Customer country code.

      - name: status
        logicalType: string
        physicalType: string
        required: true
        description: Business status of the customer.
        quality:
          - metric: acceptedValues
            values: ["ACTIVE", "INACTIVE"]
            dimension: validity
            severity: error

      - name: updated_at
        logicalType: timestamp
        physicalType: timestamp
        required: true
        description: Last update time for this record.

    # Dataset-level quality (volume/freshness examples)
    quality:
      - metric: freshness
        maxDelayMinutes: 1440
        dimension: timeliness
        severity: warning
      - metric: rowCount
        mustBeGreaterThan: 1
        dimension: completeness
        severity: warning

# Change management rules (keeps evolution explicit)
changeManagement:
  maturity: evolving
  compatibleChanges:
    - add_optional_field
    - add_enum_value
  breakingChanges:
    - rename_field
    - change_field_type
    - remove_field

contractCreatedTs: "2026-02-17T00:00:00-03:00"

What matters here is not the exact YAML shape; it’s the outcome.

This single file replaces a whole ecosystem of scattered JSON configs:

  • schema + semantics in one place
  • platform hints (S3 paths, PK, merge strategy) in one place
  • quality expectations in one place
  • ownership and change management in one place
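
To make “schema + semantics in one place” concrete: the pipeline can derive its Spark schema straight from the contract instead of maintaining a second schema file. A sketch, reusing the contract dict from the loading snippet above, with an illustrative (not official) mapping from contract types to Spark types:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative mapping from the contract's physicalType to Spark types
_TYPE_MAP = {"string": StringType(), "timestamp": TimestampType()}

def spark_schema(contract: dict, table: str) -> StructType:
    # Find the table definition inside the contract's schema section
    table_def = next(t for t in contract["schema"] if t["name"] == table)
    return StructType([
        StructField(
            col["name"],
            _TYPE_MAP[col["physicalType"]],
            nullable=not col.get("required", False),
        )
        for col in table_def["properties"]
    ])

# e.g. spark.read.schema(spark_schema(contract, "customer")).parquet(bronze_path)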

And because it’s versioned and reviewed like code, change becomes explicit: downstream teams can see what changed, when, and why.

The part that unlocked the next step: data quality

Once the contract became the single source of truth, the next move was obvious:

If the contract defines what the dataset is, it can also define what “good data” means.

My plan was to use the contract to drive SparkDQ checks:

  • nullability rules (what can and can’t be null)
  • uniqueness constraints (primary keys that must be unique)
  • accepted values (enums / domains)
  • freshness expectations

The key idea: don’t create a separate universe of checks.

Attach quality rules to the dataset definition and enforce them as part of publishing the data.
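
Here’s a library-agnostic sketch of that idea (plain PySpark, not the SparkDQ API, and reusing the contract dict from the snippets above): walk the quality rules attached to each column and refuse to publish when one of them fails.

from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def contract_violations(df: DataFrame, contract: dict, table: str) -> list[str]:
    # Returns human-readable violations; an empty list means "safe to publish"
    table_def = next(t for t in contract["schema"] if t["name"] == table)
    violations = []
    total = df.count()

    for col in table_def["properties"]:
        name = col["name"]
        for rule in col.get("quality", []):
            metric = rule["metric"]
            if metric == "nullValues":
                nulls = df.filter(F.col(name).isNull()).count()
                if nulls > rule["mustBe"]:
                    violations.append(f"{name}: {nulls} null values, expected {rule['mustBe']}")
            elif metric == "uniqueness":
                distinct = df.select(name).distinct().count()
                if total > 0 and distinct / total < rule["mustBe"]:
                    violations.append(f"{name}: only {distinct} distinct values in {total} rows")
            elif metric == "acceptedValues":
                bad = df.filter(~F.col(name).isin(rule["values"])).count()
                if bad > 0:
                    violations.append(f"{name}: {bad} rows outside {rule['values']}")
    return violations

# Publish step: if contract_violations(df, contract, "customer") is non-empty,
# fail the job instead of writing to the silver path.

Whatever check engine ends up running these rules, the point stays the same: the rules come from the contract, not from a separate config.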

A contract is not just documentation.

A contract is a promise you can automate.

Results: less confusion, easier changes, better onboarding

Even before going deep on automated DQ, the contract alone made a difference:

  • less confusion (“where do I change this?”)
  • changes lived in one place, so they were easier to understand and review
  • onboarding improved because the dataset definition was explicit and discoverable

It wasn’t a huge re-platforming effort. It was a simplification.


Data contracts (and ODCS in particular) are often presented as governance.

But what I saw in practice is simpler: contracts are infrastructure.

They reduce fragility, make expectations explicit, and turn “tribal knowledge” into something versioned, reviewable, and automatable.

If you’re starting, keep it small:

  • ownership
  • schema + semantics
  • change policy
  • a handful of quality rules

One contract per dataset. One source of truth. Everything else can evolve from there.

Part of the series: under-the-hood

Tags: #platform-engineering