Executive summary
Lineage in Kafka-based systems is essential for change impact analysis and compliance. W3C PROV defines a model for provenance and a family of standards for provenance interchange; OpenLineage defines an open standard model built around dataset, job, and run entities, designed to record metadata for a job as it executes. These standards enable consistent lineage collection and analysis across platforms.
- Modeling lineage: topics, producers/consumers, jobs
- Governance: ownership, classification, retention
- Standards: W3C PROV-DM and OpenLineage
Building the lineage chain
Data lineage answers: "Where did this data come from, how was it transformed, and where did it go?" In Kafka-based systems, lineage tracking is both easier (every event is recorded) and harder (data flows through multiple topics and stream processors, making the end-to-end path complex).
Source-to-topic lineage: Track which systems produce to which topics. This is the simplest layer: each producer is registered in the architecture repository with its source system, topic, and schema version. CDC connectors (Debezium) automatically document the database-to-topic mapping.
Topic-to-topic lineage: Stream processors (Kafka Streams, Flink, ksqlDB) consume from input topics, transform data, and produce to output topics. Each processor is a lineage node that must be documented: input topics, transformation logic, output topics. Model these as Application Components in ArchiMate with Access relationships to input and output Data Objects.
Topic-to-sink lineage: Kafka Connect sink connectors push data from topics to external systems (data warehouses, search indexes, notification services). Each sink connector is a lineage endpoint documenting where Kafka data lands.
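The three layers above compose into a single directed graph, which makes end-to-end tracing a simple traversal. A minimal sketch in Python (all node names are hypothetical):

```python
from collections import defaultdict, deque

# Hypothetical lineage edges covering the three layers:
# source -> topic, topic -> processor -> topic, topic -> sink.
EDGES = [
    ("postgres.orders", "orders"),            # source-to-topic (CDC connector)
    ("orders", "enrich-orders-app"),          # topic-to-processor
    ("enrich-orders-app", "orders-enriched"), # processor-to-topic
    ("orders-enriched", "snowflake-sink"),    # topic-to-sink
]

def downstream(node, edges):
    """Breadth-first forward trace: every node reachable from `node`."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([node])
    while queue:
        current = queue.popleft()
        for nxt in graph[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(sorted(downstream("postgres.orders", EDGES)))
# ['enrich-orders-app', 'orders', 'orders-enriched', 'snowflake-sink']
```

The same traversal answers both lineage questions: run it forward from a source for impact analysis, or invert the edges and run it from a sink to find origins.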
Automated lineage collection
Manual lineage documentation drifts within weeks. Automate lineage collection by: extracting producer/consumer metadata from Kafka's admin API, parsing Kafka Streams topology descriptions, querying Kafka Connect connector configurations, and integrating with data catalog tools (Apache Atlas, DataHub, Marquez). The architecture repository (Sparx EA) should reference the data catalog for runtime lineage while maintaining the logical data architecture.
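As one example of this automation, connector descriptions from the Kafka Connect REST API (GET /connectors/{name} returns the connector's name, type, and config) can be turned into lineage edges. A hedged sketch; the connector names and configs below are hypothetical, and real connectors use more config keys than the common ones handled here:

```python
def connector_edges(connector):
    """Derive lineage edges from a Kafka Connect connector description.

    `connector` is the dict shape returned by the Connect REST API
    (GET /connectors/{name}): {"name": ..., "type": ..., "config": {...}}.
    Only the most common config keys are handled.
    """
    name, cfg = connector["name"], connector["config"]
    if connector["type"] == "sink":
        # Sink connectors list their input topics under "topics".
        topics = [t.strip() for t in cfg.get("topics", "").split(",") if t.strip()]
        return [(topic, name) for topic in topics]
    # Source connectors (e.g. Debezium 2.x) prefix topic names with "topic.prefix".
    prefix = cfg.get("topic.prefix", name)
    return [(name, f"{prefix}.*")]

# Hypothetical connector descriptions:
sink = {"name": "warehouse-sink", "type": "sink",
        "config": {"topics": "orders, payments"}}
source = {"name": "orders-cdc", "type": "source",
          "config": {"topic.prefix": "shop"}}

print(connector_edges(sink))    # [('orders', 'warehouse-sink'), ('payments', 'warehouse-sink')]
print(connector_edges(source))  # [('orders-cdc', 'shop.*')]
```

A collector job would fetch these descriptions on a schedule and push the derived edges into the lineage store, so the graph tracks the deployed connectors rather than a wiki page.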
The four phases of lineage management
Enterprise data lineage in Kafka-based systems requires a structured approach across four phases. Each phase has specific tooling and governance requirements.
Phase 1: Capture. Collect lineage metadata at every point where data enters, transforms, or exits the Kafka ecosystem. CDC connectors (Debezium) automatically capture which database tables feed which topics. Producer applications emit metadata headers (source system, timestamp, correlation ID). Kafka Connect configurations document which topics feed which sink destinations. The key principle: lineage capture must be automatic, not manual. Any lineage that depends on developers remembering to document is lineage that will be incomplete within weeks.
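The metadata headers mentioned above might look like the following. The header keys are illustrative, not a standard; Kafka record headers are (key, bytes) pairs that most client libraries accept directly on produce():

```python
import uuid
from datetime import datetime, timezone

def lineage_headers(source_system):
    """Build Kafka record headers carrying lineage metadata.

    Header key names are illustrative placeholders, not a standard.
    """
    return [
        ("lineage.source", source_system.encode()),
        ("lineage.timestamp", datetime.now(timezone.utc).isoformat().encode()),
        ("lineage.correlation_id", str(uuid.uuid4()).encode()),
    ]

headers = lineage_headers("billing-service")
print([key for key, _ in headers])
# ['lineage.source', 'lineage.timestamp', 'lineage.correlation_id']
```

Because the headers are built by a shared library rather than hand-written per application, capture stays automatic: every producing team gets lineage metadata by calling one function.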
Phase 2: Track. Follow data as it moves through stream processing pipelines. Kafka Streams applications have queryable topologies that describe their input topics, processing steps, and output topics. Flink jobs have execution plans that document the same. ksqlDB queries are SQL statements that explicitly declare source and destination. Parse these topology descriptions to build a directed graph of data flow. Track schema evolution through the Schema Registry's version history: when a schema changes, the lineage graph must reflect which version of the data each consumer sees.
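For Kafka Streams, the textual output of `Topology#describe()` can be parsed with a couple of regular expressions. A hedged sketch; the sample text mirrors the describe() output shape, but the exact format varies across Streams versions, so treat the patterns as a starting point:

```python
import re

# Sample shape of Kafka Streams Topology#describe() output
# (the exact format may differ between Streams versions).
TOPOLOGY = """\
Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [orders])
      --> KSTREAM-MAPVALUES-0000000001
    Processor: KSTREAM-MAPVALUES-0000000001 (stores: [])
      --> KSTREAM-SINK-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Sink: KSTREAM-SINK-0000000002 (topic: orders-enriched)
      <-- KSTREAM-MAPVALUES-0000000001
"""

def topics_from_topology(text):
    """Return (input_topics, output_topics) parsed from a topology description."""
    inputs = []
    for match in re.findall(r"Source: \S+ \(topics: \[(.*?)\]\)", text):
        inputs += [t.strip() for t in match.split(",")]
    outputs = re.findall(r"Sink: \S+ \(topic: (\S+?)\)", text)
    return inputs, outputs

print(topics_from_topology(TOPOLOGY))
# (['orders'], ['orders-enriched'])
```

Each parsed (inputs, application, outputs) triple becomes one processor node plus its edges in the lineage graph.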
Phase 3: Store. Persist lineage metadata in a dedicated lineage store. Apache Atlas provides a graph-based metadata store designed for lineage. DataHub (by LinkedIn) and Marquez (by WeWork) offer similar capabilities with different API styles. The EA repository (Sparx EA) stores the logical lineage model: the ArchiMate views showing how data flows across business domains. The lineage store and EA repository should reference each other: the lineage store provides runtime accuracy, the EA repository provides architectural context.
Phase 4: Visualize. Make lineage accessible and actionable. Lineage graphs show the complete path from source to destination for any data element. Impact analysis shows which downstream consumers would be affected by a change to a source schema. Data flow diagrams in the EA repository show the architecture-level view. Compliance reports demonstrate that data handling meets regulatory requirements (GDPR data subject access requests, for example, require knowing everywhere a customer's data exists).
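A lightweight way to produce the lineage graphs described above is to emit Graphviz DOT, which most visualization tooling can render. A minimal sketch with hypothetical edge data:

```python
def to_dot(edges):
    """Render lineage edges as a Graphviz DOT digraph (left-to-right layout)."""
    lines = ["digraph lineage {", "  rankdir=LR;"]
    for src, dst in edges:
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)

# Hypothetical lineage edges.
edges = [
    ("postgres.orders", "orders"),
    ("orders", "enrich-orders-app"),
    ("enrich-orders-app", "orders-enriched"),
    ("orders-enriched", "warehouse-sink"),
]
print(to_dot(edges))
```

Piping the output through `dot -Tsvg` yields a diagram suitable for embedding in architecture documentation or compliance reports.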
Automated lineage with OpenLineage
OpenLineage is an open standard for lineage metadata collection. It defines a common event format that data processing tools emit as they run. Spark, Flink, Airflow, and dbt all have OpenLineage integrations that emit lineage events automatically. These events capture: which datasets were read (inputs), which datasets were written (outputs), which transformation was applied (job), and when it happened (timestamp).
For Kafka-based systems, implement OpenLineage at three integration points: Kafka Connect emits lineage events when connectors start and when they process data. Stream processing applications emit lineage events describing their topology. Custom producers and consumers emit lineage events via a lightweight library. Collect all lineage events in a central store (Marquez or DataHub) and integrate with the EA repository for architectural-level lineage views.
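An OpenLineage run event is plain JSON. The sketch below builds a minimal event by hand to show the shape (namespaces and the producer URI are placeholders); in practice the OpenLineage client library constructs and sends these for you, typically by POSTing to the lineage backend such as Marquez:

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(job_name, inputs, outputs, event_type="COMPLETE"):
    """Build a minimal OpenLineage-style run event.

    Namespaces and the producer URI are placeholders; a real deployment
    would use the OpenLineage client library rather than raw dicts.
    """
    return {
        "eventType": event_type,
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "kafka-pipelines", "name": job_name},
        "inputs": [{"namespace": "kafka://broker:9092", "name": t} for t in inputs],
        "outputs": [{"namespace": "kafka://broker:9092", "name": t} for t in outputs],
        "producer": "https://example.com/lineage-emitter",  # placeholder URI
    }

event = run_event("enrich-orders", inputs=["orders"], outputs=["orders-enriched"])
print(json.dumps(event, indent=2))
```

Because every tool emits the same event shape, the lineage store can join events from Connect, stream processors, and custom clients into one graph without per-tool adapters.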
Lineage for compliance: GDPR and data subject requests
Regulatory compliance is the highest-value use case for enterprise data lineage. When a customer exercises their right to access (GDPR Article 15) or right to erasure (Article 17), the organization must know everywhere that customer's data exists, including every Kafka topic, stream processing pipeline, and downstream data store.
The lineage graph answers these questions. Given a customer ID, trace forward through the lineage to find every topic that carries their data, every processor that transforms it, and every sink that stores it. For a right-to-erasure request, this trace identifies every system that must delete or anonymize the customer's records. Without lineage, this trace requires manual investigation across every team, taking weeks instead of hours.
Implement this by tagging data lineage nodes with data classification metadata: which topics carry personally identifiable information (PII), which processors access PII fields, and which sinks store PII. Build a "GDPR trace" query that, given a customer identifier, returns the complete list of systems and topics that must be addressed for a data subject request. Automate the erasure process where possible: produce tombstone records to Kafka topics (triggering compaction-based deletion), send deletion commands to downstream sinks, and log the entire process as compliance evidence.
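The GDPR trace described above amounts to a forward traversal filtered by classification tags. A minimal sketch, with hypothetical node names and PII tags:

```python
from collections import defaultdict, deque

# Hypothetical lineage graph and PII classification tags.
EDGES = [
    ("customers", "mask-pii-app"),
    ("mask-pii-app", "customers-masked"),
    ("customers", "crm-sink"),
    ("customers-masked", "analytics-sink"),
]
PII_NODES = {"customers", "mask-pii-app", "crm-sink"}

def gdpr_trace(entry_topic, edges, pii_nodes):
    """Return every PII-carrying node reachable from the topic that first
    receives the data subject's records; each must act on an erasure request."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = {entry_topic}, deque([entry_topic])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen & pii_nodes)

print(gdpr_trace("customers", EDGES, PII_NODES))
# ['crm-sink', 'customers', 'mask-pii-app']
```

Note that nodes carrying only masked data (here, customers-masked and analytics-sink) drop out of the result: the classification tags, not the graph alone, decide which systems an erasure request must touch.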
If you'd like hands-on training tailored to your team (Sparx Enterprise Architect, ArchiMate, TOGAF, BPMN, SysML, Apache Kafka, or the Archi tool), you can reach us via our contact page.
Frequently Asked Questions
What is enterprise architecture?
Enterprise architecture is a discipline that aligns an organisation's strategy, business operations, information systems, and technology infrastructure. It provides a structured framework for understanding how an enterprise works today, where it needs to go, and how to manage the transition.
How is ArchiMate used in enterprise architecture practice?
ArchiMate is used as the standard modeling language in enterprise architecture practice. It enables architects to create consistent, layered models covering business capabilities, application services, data flows, and technology infrastructure, all traceable from strategic goals to implementation.
What tools are used for enterprise architecture modeling?
Common enterprise architecture modeling tools include Sparx Enterprise Architect (Sparx EA), Archi, BiZZdesign Enterprise Studio, LeanIX, and Orbus iServer. Sparx EA is widely used for its ArchiMate, UML, BPMN and SysML support combined with powerful automation and scripting capabilities.