Articles

How to Choose the Best Data Lineage Tools for Your Organization's Needs

June 19, 2025
Greg Shoup

Data lineage is the process of tracking data usage within your organization. This includes how data originates, how it is transformed, how it is calculated, its movement between different systems, and ultimately how it is utilized in applications, reporting, analysis, and decision-making. This is a crucial capability for any modern ecosystem, as the amount of data businesses generate and store increases every year. 

As of 2024, 64% of organizations manage at least one petabyte of data — and 41% have at least 500 petabytes of information within their systems. In many industries, like banking and insurance, this includes legacy data that spans not just systems but eras of technology.

As the data volume grows, so does the need to aid the business with trust in access to that data. Thus, it is important for companies to invest in data lineage initiatives to improve data governance, quality, and transparency. If you’re shopping for a data lineage tool, there are many cutting-edge options. The cloud-based Zengines platform uses an innovative artificial intelligence-powered model that includes data lineage capabilities to support clean, consistent, and well-organized data.

Whether you go with Zengines or something else, though, it’s important to be strategic in your decision-making. Here is a step-by-step process to help you choose the best data lineage tools for your organization’s needs.

Understanding Data Lineage Tool Requirements

Start by ensuring your selection team has a thorough understanding of not just data lineage as a concept but also the requirements that your particular data lineage tools must have.

First, consider core data lineage tool functionalities that every company needs. For example, you want to be able to access a clear visualization of the relationship between complex data across programs and systems at a glance. Impact analysis also provides a clear picture of how change will influence your current data system.

In addition, review technology-specific data-lineage needs, such as the need to ingest legacy codebases like COBOL. Compliance and regulatory requirements vary from one industry to the next, too. They also change often. Make sure you’re aware of both business operations needs and what is expected of the business from a compliance and legal perspective.

Also, consider future growth. Can the tool you select support the data as you scale? Don’t hamstring momentum down the road by short-changing your data lineage capabilities in the present.

Key Features to Look for in Data Lineage Tools

When you begin to review specific data lineage tools, you want to know what features to prioritize. Here are six key areas to focus on:

  1. Automated metadata collection: Automation should be a feature throughout any data lineage tool at this point. However, the specific ability to automate the collection of metadata from internal solutions and data catalogs is critical to sustainable data lineage activity over time.
  2. Integration capabilities: Data lineage tools must be able to comprehensively integrate across entities that store data — past, present, and future. This is where a tool like Zengines shines, where accessing legacy data in legacy technology has been the #1 challenge for most organizations. 
  3. End-to-end visibility: Data lineage must provide a clear picture of the respective data paths from beginning to end. This is a fundamental element of quality data lineage analysis.
  4. Impact analysis and research capabilities: Leading solutions make it easy for users to obtain and understand impact analysis for any data path or data changes showing data relationships and dependencies.. Further, the ability to seamlessly research across data entities assists in confidence for such analysis.
  5. Tracking and monitoring: Data lineage is an ongoing activity, thus data lineage tools must be able to keep up with ongoing data change within an organization.
  6. Visualization features: Visualizations - such as logic graphs - should provide comprehensive data paths across the data life cycle.

Keep these factors in mind and make sure whatever tool you choose satisfies these basic requirements.

Implementation and User Experience Considerations

Along with specific features, you want to assess how easy it is to implement the tool and how easy it is to use the tool.  

Start with setup. Consider how well each data lineage software solution is designed to implement within and configure to your system. For businesses that  built technology solutions before the 1980s, you may have critical business operations that run on mainframes. Make sure a data lineage tool will be able to easily integrate into a complex system before signing off on it.

Consider the learning curve and usability too. Does the tool have an intuitive interface? Are there complex training requirements? Is the information and operation accessible? 

Cost Analysis and ROI

When considering the cost of a data lineage software solution, there are a few factors to keep in mind. Here are the top elements that can influence expenses when implementing and using a tool like this over time:

  • Direct cost: Take the time to compare the up-front cost of each option. Recognize and account for capability differentiation e.g., ability to integrate with legacy technology, necessary to generate a comprehensive end-to-end data lineage. The lowest-cost option is unlikely to include this necessary capability. . 
  • Benefit analysis: : Consider benefits across both quantitative (e.g., time savings for automated data lineage versus manual data lineage search) and qualitative (e.g., cost avoidance for compliance and regulatory fines). 
  • Total cost of ownership (TCO): Consider the big picture with your investment. Are there licensing and subscription fees? Implementation costs? Ongoing maintenance and support expenses? These should be clarified before making a decision.
  • Expected return on investment: Your ROI will depend on things like speed of implementation, reduced costs, and accelerated digital transformations. Automated AI/ML solutions, like those that power Zengines, can provide meaningful benefit to TCO over time and  should be factored into the cost analysis.

Make sure to consider costs, benefits, TCO and ROI when assessing your options.

The Zengines Advantage

If you’re looking for a comprehensive assessment of what makes the Zengines platform stand out from other data lineage solutions, here it is in a nutshell:

  • Zengines Mainframe Data Lineage offers a unique approach to data lineage with its robust ability to ingest and parse COBOL programs so that businesses can now have data lineage inclusive of legacy technology.  Zengines de-risks the mainframe “black box” and enables business to better and more quickly manage, modernize, or migrate mainframes.  
  • Zengines comes backed by a wide range of software companies, businesses, consulting firms, and other enterprises that have successfully used our software solutions.
  • Our data lineage is part of our larger frictionless data conversion and integration solutions designed to speed up the notoriously slow process of proper data management. We use AI/ML to automate and accelerate the process, from implementation through ongoing use, helping your data work for you rather than get in the way of your core functions.
  • Our ZKG (Zengines Knowledge Graph) is a proprietary, industry-specific database that is always growing and providing more detailed information for our algorithms.

Our automated solutions create frictionless, sped-up solutions that reduce risk, lower costs, and create more accessible data lineage solutions.

Making the Final Decision

As you assess your data lineage tool choices, keep the above factors in mind. What are your industry and organizational requirements? Focus on key features like automation and integration capabilities. Consider implementation, training, user experience, ROI, and comprehensive cost analyses. 

Use this framework to help create stakeholder buy-in for your strategy. Then, select your tool with confidence, knowing you are organizing your data’s past to improve your present and lay the groundwork for a more successful future.

If you have any follow-up questions about data lineage and what makes a software solution particularly effective and relevant in this field, our team at Zengines can help. Reach out for a consultation, and together, we can explore how to create a clean, transparent, and effective future for your data.

You may also like

For Chief Risk Officers and Chief Compliance Officers at insurance carriers, ORSA season brings a familiar tension: demonstrating that your organization truly understands its risk exposure -- while knowing that critical calculations still run through systems nobody fully understands anymore.

The Own Risk and Solvency Assessment (ORSA) isn't just paperwork. It's a commitment to regulators that you can trace how capital adequacy gets calculated, where stress test assumptions originate, and why your models produce the outputs they do. For carriers still running policy administration, actuarial calculations, or claims processing on legacy mainframes, that commitment gets harder to keep every year.

The Documentation Problem Nobody Talks About

Most large insurers have mainframe systems that have been running -- and evolving -- for 30, 40, even 50+ years. The original architects retired decades ago. The business logic is encoded in millions of lines of COBOL across thousands of modules. And the documentation? It hasn’t been updated in years.

This creates a specific problem for ORSA compliance: when regulators ask how a particular reserve calculation works, or where a risk factor originates, the honest answer is often "we'd need to trace it through the code."

That trace can take weeks. Sometimes months. And even then, you're relying on the handful of mainframe specialists who can actually read the logic -- specialists who are increasingly close to retirement themselves.

What Regulators Actually Want to See

ORSA requires carriers to demonstrate effective risk management governance. In practice, that means showing:

  • Data lineage: Where do the inputs to your risk models actually come from? Which systems touch them along the way?
  • Calculation transparency: How does a policy record become a reserve estimate? What business rules apply?
  • Change traceability: When you modify a calculation, what downstream impacts does that create?

For modern cloud-based systems, this is straightforward. Metadata catalogs, audit logs, and documentation are built in. But for mainframe systems -- where the business logic is the documentation, buried in procedural code -- this level of transparency requires actual investigation.

The Regulatory Fire Drill

Every CRO knows the scenario: an examiner asks a pointed question about a specific calculation. Your team scrambles to trace it back through the systems. The mainframe team pulls in their most senior developer (who was already over-allocated with other work). Days pass. The answer finally emerges -- but the process exposed just how fragile your institutional knowledge has become.

These fire drills are getting more frequent, not less. Regulators have become more sophisticated about data governance expectations. And the talent pool that understands legacy COBOL systems shrinks every year.

The question isn't whether you'll face this challenge. It's whether you'll face it reactively -- during an exam -- or proactively, on your own timeline.

Extracting Lineage from Legacy Systems

The good news: you don't have to modernize your entire core system to solve the documentation problem. New AI-powered tools can parse legacy codebases and extract the data lineage that's been locked inside for decades.

This means:

  • Automated tracing of how data flows through COBOL/RPG modules, job schedulers, and database schemas
  • Visual mapping of calculation logic, branching conditions, and downstream dependencies
  • Searchable documentation that lets compliance teams answer regulator questions in hours instead of weeks
  • Preserved institutional knowledge that doesn't walk out the door when your mainframe experts retire

The goal isn't to replace your legacy systems overnight. It's to shine a light into the black box -- so you can demonstrate governance and control over systems that still run critical functions.

From Reactive to Proactive

The carriers who navigate ORSA most smoothly aren't the ones with the newest technology. They're the ones who can clearly articulate how their risk management processes work -- including the parts that run on 40-year-old infrastructure.

That clarity doesn't require a multi-year modernization program. It requires the ability to extract and visualize what your systems already do, in a format that satisfies both internal governance requirements and external regulatory scrutiny.

For CROs and CCOs managing legacy technology estates, that capability is becoming less of a nice-to-have and more of a prerequisite for confident compliance.

Zengines helps insurance carriers extract data lineage and governance controls from legacy mainframe systems. Our AI-powered platform parses COBOL code and related infrastructure to deliver the transparency regulators expect -- without requiring a rip-and-replace modernization.

TL;DR: The Quick Answer

LLM code analysis tools like ChatGPT and Copilot excel at explaining and translating specific COBOL programs you've already identified. Mainframe data lineage platforms like Zengines excel at discovering business logic across thousands of programs when you don't know where to look. Most enterprise modernization initiatives need both: data lineage to find what matters, LLMs to accelerate the work once you've found it.

---------------

When enterprises tackle mainframe modernization and legacy COBOL code analysis, two technologies dominate the conversation: Large Language Models (LLMs) and mainframe data lineage platforms. Both promise to reveal what your code does—but they solve fundamentally different problems.

LLMs like ChatGPT, GitHub Copilot, and IBM watsonx Code Assistant excel at interpreting and translating code you paste into them. Data lineage platforms like Zengines excel at discovering and extracting business logic across enterprise codebases—often millions of lines of COBOL—when you don't know where that logic lives.

Understanding this distinction determines whether your modernization initiative succeeds or stalls. This guide clarifies when each approach fits your actual need.

What LLMs and Data Lineage Platforms Actually Do

LLM code analysis tools provide deep explanations of specific code. They rewrite programs in modern languages, optimize algorithms, and tutor developers. If you know which program to analyze, LLMs accelerate understanding and translation.

Mainframe data lineage platforms find business logic you didn't know existed. They search across thousands of programs, extract calculations and conditions at enterprise scale, and prove completeness for regulatory compliance like BCBS-239.

The overlap matters: Both can show you what calculations do. The critical difference is scale and discovery. Zengines extracts calculation logic from anywhere in your codebase without knowing where to look. LLMs explain and transform specific code once you identify it.

Most enterprise teams need both: data lineage to discover scope and extract system-wide business logic, LLMs to accelerate understanding and translation of specific programs.

How Each Tool "Shows You How Code Works"

The phrase "shows you how code works" means different things for each tool—and the distinction matters for mainframe modernization projects.

Traditional (schema-based) lineage tools show that Field A flows to Field B, but not what happens during that transformation. They map connections without revealing logic.

Code-based lineage platforms like Zengines extract the actual calculation:

PREMIUM = BASE_RATE * RISK_FACTOR * (1 + ADJUSTMENT)

...along with the conditions that govern when it applies:

IF CUSTOMER_TYPE = 'COMMERCIAL' AND REGION = 'EU'

This reveals business rules governing when logic applies across your entire system.

LLMs explain code line-by-line, clarify algorithmic intent, suggest optimizations, and generate alternatives—but only for code you paste into them.

The key difference: Zengines shows you calculations across 5,000 programs without needing to know where to look. LLMs explain calculations in depth once you know which program matters. Both "show how code works," but at different scales for different purposes.

When to Use LLMs vs. Data Lineage Platforms

The right tool depends on the question you're trying to answer. Use this table to identify whether your challenge calls for an LLM, a data lineage platform, or both.

Notice the pattern: LLMs shine when you've already identified the code in question. Zengines shines when you need to find or trace logic across an unknown scope.

Your Question Use an LLM When... Use Zengines When...
Scope "Explain what Program_X does" "What programs are in scope for this modernization initiative?"
Discovery "I'm looking at InterestCalc.cbl - explain the algorithm" "Find all interest rate logic across the codebase - I don't know which programs contain it"
Extraction "Take this one formula and optimize it" "Extract all premium calculation formulas across 200 programs and show me the variations"
Dependencies "Refactor this code to handle the new data structure" "What breaks if I change this copybook? Show me the actual code that will fail."
Data Flow "Walk me through the logic within this single program" "Trace how data flows from File A through all programs to Report Z"
Business Rules "Explain this nested IF-THEN-ELSE logic and suggest a cleaner approach" "What business rules govern when calculation X applies vs calculation Y across the entire system?"
Root Cause "Why does this specific function return unexpected values? Debug this." "Why do System A and System B produce different results? Show me where the calculations diverge."
Compliance "Document what this legacy code does for knowledge transfer" "Prove to auditors complete data lineage with actual business logic for this regulatory metric"

LLM vs. Data Lineage Platform: Feature Comparison

Beyond specific use cases, it helps to understand how these tools differ in design and outcomes. This comparison highlights what each tool is built for—and where each falls short.

Dimension LLM Code Analysis Zengines Data Lineage
Core Use Case Explain, translate, or refactor specific code you've already identified Discover, trace, and document data flows across entire enterprise codebase
User Experience Interactive Q&A - paste code, get explanations, iterate Query-based research - search indexed codebase, visualize dependencies
Primary Output Code explanations, translations, refactored snippets Complete lineage maps, impact analysis, dependency graphs, regulatory docs
Success Outcome Faster understanding and porting of known programs Comprehensive scope, validated completeness, regulatory compliance proof
What You Must Know First Which programs/files to analyze Nothing - designed for discovery when you don't know where logic resides
Proves Completeness? No - limited to what you ask about; may hallucinate details Yes - systematic indexing enables audit trail; deterministic extraction

How to Use LLMs and Data Lineage Together

Successful enterprise modernization initiatives use both tools strategically. Here's the workflow that works:

  1. Zengines discovers scope: "Find all programs touching customer credit calculation" — returns 47 programs with actual calculation logic extracted.
  1. Zengines diagnoses issues: "Why do System A and System B produce different results?" — shows where logic diverges across programs.
  1. LLM accelerates implementation: Take specific programs identified by Zengines and use an LLM to explain details, generate Java equivalents, and create tests.
  1. Zengines validates completeness: Prove to auditors that the initiative covered all logic paths and transformations.

Why Teams Confuse LLMs with Data Lineage Tools

Many teams successfully use LLMs to port known programs and assume this scales to enterprise-wide COBOL modernization. The confusion happens because:

  • 80% of programs may be straightforward — well-documented, isolated, known scope.
  • LLMs work great on this 80% — fast translation, helpful explanations.
  • The 20% with hidden complexity stops initiatives — cross-program dependencies, undocumented business rules, conditional logic spread across multiple files.

Teams don't realize they have a system-level problem until deep into the initiative when they discover programs or dependencies they didn't know existed.

The Bottom Line: Choose Based on Your Problem

LLM code analysis and mainframe data lineage platforms solve different problems:

  • LLMs excel at code-level interpretation and generation for known programs.
  • Data lineage platforms excel at system-scale discovery and extraction across thousands of programs.

The critical distinction isn't whether they can show you what code does—both can. The distinction is scale, discovery, and proof of completeness.

For enterprise mainframe modernization, regulatory compliance, and large-scale initiatives, you need both. Data lineage platforms like Zengines find what matters across your entire codebase and prove you didn't miss anything. LLMs then accelerate the mechanical work of understanding and translating what you found.

The question isn't "which tool should I use?", it's "which problem am I solving right now?".

See How Zengines Complements Your LLM Tools

If you're planning a mainframe modernization initiative, regulatory compliance project, or enterprise-wide code analysis, we'd love to show you how Zengines works alongside your existing LLM tools.

Schedule a demo to see our mainframe data lineage platform in action with your use case.

For nearly a decade, global banks have treated BCBS 239 compliance as an aspirational goal rather than a regulatory mandate. That era is ending.

Since January 2016, the Basel Committee's Principles for Effective Risk Data Aggregation and Risk Reporting (BCBS 239) have required global systemically important banks to maintain complete, accurate, and timely risk data. Yet enforcement was inconsistent, and banks routinely pushed back implementation timelines.

Now regulators are done waiting. According to KPMG, banks that fail to remediate BCBS 239 deficiencies are "playing with fire."

At the heart of BCBS 239 compliance sits data lineage - the complete, auditable trail of data from its origin through all transformations to final reporting. Despite being mandatory for nearly nine years, it remains the most consistently unmet requirement.

The Data Lineage Challenge: Why Banks Deferred Implementation

From 2016 through 2023, comprehensive data lineage proved extraordinarily difficult to verify and enforce. The numbers tell the story: as of November 2023, only 2 out of 31 assessed global systemically important banks fully complied with all BCBS 239 principles. Not a single principle has been fully implemented by all banks (PwC).

Even more troubling? Progress has been glacial. Between 2019 and 2022, the average compliance level across all principles barely moved - from 3.14 to 3.17 on a scale of 1 ("non-compliant") to 4 ("fully compliant") (PwC).

Throughout this period, banks submitted implementation roadmaps extending through 2019, 2021, and beyond, citing the technical complexity of establishing end-to-end lineage across legacy systems. Many BCBS 239 programs were underfunded and lacked attention from boards and senior management (PwC). For seven years past the compliance deadline, data lineage requirements remained particularly challenging to implement and even harder to validate.

The Turning Point: Escalating Enforcement and Explicit Guidance

The Basel Committee's November 2023 progress report marked a shift in tone. Banks' progress was deemed "unsatisfactory," and regulators signaled that increased enforcement measures - including capital surcharges, restrictions on capital distribution, and other penalties would follow (PwC).

Then came the ECB's May 2024 Risk Data Aggregation and Risk Reporting (RDARR) Guide, which provides unprecedented specificity on what compliant data lineage actually looks like - requirements that were previously open to interpretation (EY).

Daily Fines on the Table

In public statements, ECB leaders have hinted that BCBS 239 could be the next area for periodic penalty payments (PPPs)—daily fines that accrue as long as a bank remains noncompliant (KPMG). These penalties can reach up to 5% of average daily turnover for every day the infringement continues, for a maximum of six months (European Central Bank).

This enforcement mechanism is no longer theoretical. In November 2024, the ECB imposed €187,650 in periodic penalty payments on ABANCA for failing to comply with climate risk requirements—demonstrating the regulator's willingness to deploy this tool (European Banking Authority).

Capital Consequences are already here

European enforcement now includes ECB letters with findings, Pillar 2 requirement (P2R) add-ons, and fines (McKinsey & Company). These aren't hypothetical consequences.

ABN AMRO's Pillar 2 requirement increased by 0.25% to 2.25% in 2024, with the increase "mainly reflecting improvements required in BCBS 239 compliance" (ABN AMRO). That's a tangible capital cost for risk data aggregation deficiencies.

The ECB's May 2024 RDARR Guide goes further, warning that banks must "step up their efforts" or face "escalation measures." It explicitly states that deficiencies may lead to reassessment of the suitability of responsible executives—and in severe cases, their removal (EY).

U.S. Regulators Taking Similar Action

American regulators have demonstrated equal resolve on data management failures. The OCC assessed a $400 million civil money penalty against Citibank in October 2020 for deficiencies in data governance and internal controls (Office of the Comptroller of the Currency). When Citi's progress proved insufficient, regulators added another $136 million in penalties in July 2024 for failing to meet remediation milestones (FinTech Futures).

Deutsche Bank felt the consequences in 2018, failing the Federal Reserve's CCAR stress test specifically due to "material weaknesses in data capabilities and controls supporting its capital planning process"—deficiencies examiners explicitly linked to weak data management practices (CNBC, Risk.net).

Data Lineage: Explicit Requirements and Rigorous Testing

The ECB's May 2024 RDARR Guide exceeds even the July 2023 consultation draft in requiring rigorous data governance and lineage frameworks (KPMG). The specificity is unprecedented: banks need complete, attribute-level data lineage encompassing all data flows across all systems from end to end—not just subsets or table-level views.

The ECB is testing these requirements through on-site inspections that typically last up to three months and involve as many as 15 inspectors. These examinations often feature risk data "fire drills" requiring banks to produce large quantities of data at short notice with little warning (KPMG). Banks without comprehensive automated data lineage simply cannot respond adequately.

The regulatory stance continues to intensify. The ECB has announced targeted reviews of RDARR practices, on-site inspections, and annual questionnaires as key activities in its supervisory priorities work program (EY). With clearer guidance on what constitutes compliant data lineage and explicit warnings of enforcement escalation, deficiencies that were difficult to verify in previous years have become directly testable.

Solving the Hardest Part: Legacy Mainframe Lineage

BCBS 239 data lineage requirements are mandatory and now explicitly defined in regulatory guidance. But here's the uncomfortable truth: for most banks, the biggest gap isn't in modern cloud systems with well-documented APIs. It's in the legacy mainframes that still process the majority of core banking transactions.

These systems—built on COBOL, RPG, and decades-old custom code—are the "black boxes" that make BCBS 239 compliance so difficult. They hold critical risk data, but their logic is buried in thousands of modules written by engineers who retired years ago. When regulators ask "where did this number come from?", banks often cannot answer with confidence.

Zengines' AI-powered platform solves this specific challenge. We deliver complete, automated, attribute-level lineage for legacy mainframe systems - parsing COBOL code, tracing data flows through job schedulers, and exposing the calculation logic that determines how risk data moves from source to regulatory report.

This isn't enterprise-wide metadata management. It's targeted, deep lineage for the systems that have historically been impossible to document—the same systems that trip up banks during ECB fire drills and on-site inspections. Zengines produces the audit-ready evidence that satisfies examination requirements, with the granularity regulators now explicitly demand.

For banks facing P2R capital add-ons, the cost of addressing mainframe lineage gaps is minimal compared to ongoing capital charges for non-compliance - let alone the risk of periodic penalty payments accruing at up to 5% of daily turnover.

The time to act is now

BCBS 239 has required comprehensive data lineage since January 2016. With the May 2024 RDARR Guide providing explicit requirements and regulators signaling enforcement escalation, banks can no longer defer implementation—especially for legacy systems.

Zengines provides the proven technology to shine a light into mainframe black boxes, enabling banks to demonstrate compliance when regulators arrive with data requests and their enforcement toolkit.

Learn more today.

Subscribe to our Insights