Data Lineage

The tracking of data's origins, movements, and transformations throughout its lifecycle, enabling understanding of where data comes from and how it changes.

Also known as: Data Provenance, Data Flow

What is Data Lineage?

Data lineage is the practice of tracking data from its origin through every transformation and movement. It shows where data comes from, how it is processed, and where it ends up, which makes it essential for governance, compliance, and debugging.

Lineage Components

Source: Where data originates.

Transformations: How data is changed.

Destination: Where data ends up.

Metadata: Context about each step.
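
The four components above can be pictured as a single record per pipeline step. The following is a minimal sketch, not a standard schema: the `LineageStep` class, its field names, and the table and job names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    """One hop in a lineage chain (hypothetical structure)."""
    source: str                 # where the data originates
    transformation: str         # how the data is changed
    destination: str            # where the data ends up
    metadata: dict = field(default_factory=dict)  # context about the step


def describe(step: LineageStep) -> str:
    """Render a step as 'source -> [transformation] -> destination'."""
    return f"{step.source} -> [{step.transformation}] -> {step.destination}"


# Illustrative data only; these tables and jobs are made up.
steps = [
    LineageStep(
        source="crm.customers",
        transformation="deduplicate on email",
        destination="staging.customers",
        metadata={"job": "dedupe_customers", "owner": "data-eng"},
    ),
    LineageStep(
        source="staging.customers",
        transformation="aggregate by region",
        destination="reports.customers_by_region",
        metadata={"job": "region_rollup"},
    ),
]

for step in steps:
    print(describe(step))
```

Chaining steps so that one step's destination is the next step's source is what turns individual records into an end-to-end lineage path.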

Types of Lineage

Technical Lineage

  • Column-level tracking
  • ETL jobs
  • Database queries
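
As a rough illustration of column-level tracking, each derived column can be mapped to the columns it is computed from, and the mapping followed back to the original inputs. This is a sketch under stated assumptions: the `COLUMN_LINEAGE` mapping and all column names are hypothetical, and the mapping is assumed to contain no cycles.

```python
# Hypothetical mapping: derived column -> columns it is computed from.
COLUMN_LINEAGE = {
    "reports.revenue.net_revenue": [
        "staging.orders.gross_amount",
        "staging.orders.discount",
    ],
    "staging.orders.gross_amount": ["raw.orders.amount"],
    "staging.orders.discount": ["raw.orders.discount_code"],
}


def origin_columns(column: str, lineage: dict) -> set:
    """Return the columns with no recorded parents that feed `column`.

    Assumes the lineage mapping is acyclic.
    """
    parents = lineage.get(column)
    if not parents:
        # No recorded parents: treat this column as an origin.
        return {column}
    origins = set()
    for parent in parents:
        origins |= origin_columns(parent, lineage)
    return origins


print(sorted(origin_columns("reports.revenue.net_revenue", COLUMN_LINEAGE)))
```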

Business Lineage

  • Business process flow
  • Report dependencies
  • KPI derivation

Benefits

Compliance

  • Audit trails
  • Regulatory requirements
  • Data subject requests

Data Quality

  • Root cause analysis
  • Impact assessment
  • Trust verification
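
Root cause analysis and impact assessment can both be viewed as walks over a lineage graph: upstream to find where a bad value may have come from, downstream to find what a change may affect. The sketch below assumes lineage is available as a list of (source, destination) edges; the `EDGES` data and the `reachable` helper are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical dataset-level lineage edges: (source, destination).
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "reports.revenue"),
    ("staging.customers", "reports.revenue"),
    ("reports.revenue", "dashboards.kpi"),
]


def reachable(start: str, edges: list, direction: str = "downstream") -> set:
    """Breadth-first walk from `start`.

    direction="downstream" follows edges forward (impact assessment);
    direction="upstream" follows them backward (root cause analysis).
    """
    adjacency = defaultdict(list)
    for src, dst in edges:
        if direction == "downstream":
            adjacency[src].append(dst)
        else:
            adjacency[dst].append(src)

    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    seen.discard(start)  # exclude the starting node even if a cycle leads back to it
    return seen


# What is affected if raw.orders changes?
print(sorted(reachable("raw.orders", EDGES)))
# Which inputs could explain a bad value in reports.revenue?
print(sorted(reachable("reports.revenue", EDGES, direction="upstream")))
```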

Operations

  • Debugging pipelines
  • Change management
  • Migration planning

Implementation Approaches

Manual Documentation

  • Recorded in spreadsheets or wikis
  • Labor intensive
  • Often out of date

Automated Collection

  • Parse code/queries
  • Monitor pipelines
  • Real-time updates
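
To give a sense of what parsing queries for lineage involves, the sketch below pulls the target table and source tables out of a simple SQL statement with regular expressions. It is deliberately naive and only an illustration: it ignores subqueries, CTEs, comments, and quoted identifiers, and a real collector would more likely rely on a full SQL parser. The function name and sample query are hypothetical.

```python
import re

# Naive patterns: the table written to, and the tables read from.
TARGET_RE = re.compile(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(sql: str) -> dict:
    """Return the target table and sorted source tables found in `sql`."""
    target = TARGET_RE.search(sql)
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return {
        "target": target.group(1) if target else None,
        "sources": sources,
    }


# Made-up query for illustration.
sql = """
INSERT INTO reports.revenue
SELECT o.region, SUM(o.amount)
FROM staging.orders AS o
JOIN staging.customers AS c ON c.id = o.customer_id
GROUP BY o.region
"""

print(extract_table_lineage(sql))
```

Each extracted (source, target) pair becomes one edge in the lineage graph, so running this kind of extraction over every job's queries yields technical lineage without manual documentation.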

Hybrid

  • Automated collection of technical lineage
  • Manually added business context

Tools

  • Apache Atlas
  • Collibra
  • Alation
  • Informatica
  • dbt (data build tool)