Investors often overestimate the value of proprietary data. Simply having exclusive data doesn’t guarantee a competitive edge, especially as advances in AI and synthetic data have made replication faster and cheaper. The real question is: Can competitors replicate the insights, and how long would it take?
This article introduces a 5-part framework to assess the strength of a company’s data moat:
- Source: Is the data hard to access or replicate (e.g., from sensors, exclusive workflows)?
- Compounding: Does more usage improve the data, creating a self-reinforcing loop?
- Switching Costs: How hard is it for customers to leave due to workflow integration?
- Replication Timeline: Would it take competitors years or months to rebuild the data advantage?
- Monetization: Can the data generate revenue independently (e.g., benchmarks or indexes)?
Key takeaway: Data moats depend on exclusivity, integration, and long-term advantages, not just the volume of data. This framework helps distinguish lasting competitive edges from short-lived claims.
Source Defensibility: Where Does the Data Come From?
When evaluating a data moat, the first question isn’t simply, "Do you have proprietary data?" Instead, it’s, "Where does the data come from, and can competitors access it?" This aspect – source defensibility – forms the bedrock of any evaluation. The way data is acquired determines whether it creates a strong competitive advantage or leaves the business vulnerable.
Data sourced from physical sensors, regulated access, or exclusive workflows provides a solid moat. For example, Tesla’s fleet generates proprietary driving data from billions of miles – making replication nearly impossible due to the astronomical costs. Similarly, in October 2025, Treefera raised $30 million in Series B funding on the strength of its proprietary satellite and drone data, which gives it visibility into first-mile supply chains. Another standout is Bloomberg’s AI terminal, which relies on 40 years of exclusive financial data and news, allowing it to command a premium price. These examples highlight how exclusive and hard-to-replicate data sources create a strong foundation for defensibility.
On the other hand, data gathered through web crawling or API scraping offers little defensibility. Advances in large language models have made it far cheaper and faster for competitors to duplicate once-exclusive datasets. What used to require a team of hundreds can now be achieved by AI at a fraction of the cost. As Abraham Thomas points out, while having unique data is important, it’s not enough – what matters is whether the dataset is truly rivalrous. If a competitor with $5 million can replicate your data in under 18 months, your moat is weak.
The Harvard Business Review (HBR) framework provides a handy test: Data creates a competitive edge only when it delivers substantial value, is rivalrous (meaning one party’s use limits another’s ability to benefit), and has no easy substitutes. Publicly available data often fails this test. For instance, one company might scrape social media to infer buyer intent, while another might achieve similar insights using credit card transactions or search histories – reducing the uniqueness of the original data source.
Data defensibility can be thought of as a hierarchy. At the base is exhaust data (like logs), followed by operational data (transactions), then interactional data (choices or behaviors), and at the top, learning data (feedback or corrections). Companies that capture data higher up this hierarchy, using methods competitors can’t easily replicate, are far more attractive to investors. This naturally leads into discussions about how compounding mechanisms can further enhance defensibility.
Compounding Mechanism: Scale Effects vs. True Flywheels
When evaluating a business, the real focus shouldn’t just be on how much data it has. Instead, the question should be: Does every new customer interaction generate fresh insights, or does it simply add to an existing dataset? This distinction separates companies that hit a plateau (scale effects) from those that develop self-reinforcing flywheels.
Many companies talk about "data network effects", but most are actually experiencing scale effects with diminishing returns. Martin Casado and Peter Lauten from a16z explain that true network effects occur when each new user directly increases the product’s value for everyone else through interaction within a defined system. Scale effects, on the other hand – like Netflix’s recommendation engine – improve with more data but lack this direct user-to-user interaction. Research into customer support chatbots highlights this limitation: intent coverage tends to max out around 40%, meaning additional data beyond that point adds little value.
True flywheels avoid this plateau by leveraging specific mechanisms. Take credit bureau networks as an example: banks must contribute customer credit performance data to access a shared database. This "give-to-get" model creates a flywheel where every new participant enhances the database’s accuracy and usefulness for all users. Similarly, Tesla’s fleet collects proprietary driving data from billions of real-world miles. This data captures rare scenarios and human interventions, which incrementally improve Tesla’s autonomous driving systems across the entire fleet. For competitors, replicating this advantage would require deploying millions of vehicles with similar sensors over many years, making Tesla’s lead difficult to close. These examples show how true flywheels generate compounding value through continuous learning, unlike simple data accumulation.
This difference has major implications for valuation. According to the Opagio Moat Assessment Matrix, businesses with true flywheel-driven network moats can command a 40–80% valuation premium. In contrast, companies relying on scale-based data moats typically see only a 25–50% premium. The critical factor is whether the business focuses on acquiring unique, hard-to-replicate data or merely adds to an existing pool. True flywheels create a defensible data moat, which is one of the five key dimensions investors evaluate.
As Lee Sanderson from Codurance explains, investors should prioritize systems where user activity produces unique, proprietary data that directly enhances the core model. The litmus test is straightforward: Does the data become more valuable to all users as the customer base grows, or does it just expand an existing dataset? If every interaction generates unique insights that competitors would struggle to replicate without years of effort, the advantage remains secure.
This distinction between scale effects and true flywheels is crucial for identifying sustainable competitive advantages in the AI-driven landscape. From here, the next logical step is to assess structural switching costs as another critical factor.
Structural Switching Costs: System of Record vs. System of Action
When evaluating a company’s long-term defensibility, one critical factor is the structural switching costs that keep customers tied to its platform. Investors need to ask: Is this company a System of Record, a System of Action, or neither? Abraham Thomas’s taxonomy offers a clear way to assess this.
A System of Record acts as the central hub for essential data – think Salesforce for sales data or Workday for HR. These systems are incredibly sticky due to what Thomas calls "data viscosity": it is extremely difficult to export the data, recreate the integrations, and transition users without disruption. Over time, these systems build up "workflow barnacles" – layers of rules, integrations, and custom features (like app stores and APIs) that make them nearly irreplaceable once scaled.
On the other hand, Systems of Action take things a step further by not just storing data but enabling critical actions. A great example is GitHub: its value isn’t just in hosting code but in its deeply embedded version control features that developers rely on daily. Ivan Gowan, CEO of Opagio, highlights this dynamic:
A technically mediocre AI system that is deeply integrated into customer workflows can be more defensible than a technically superior system that is loosely coupled.
This distinction matters because it directly impacts valuation. The Opagio Moat Assessment Matrix shows that companies with strong integration moats can see a 15–35% valuation premium, while those with data moats might achieve 25–50%.
The Matters Graph framework (2025) offers a way to quantify these moats. A data moat exists when:
- Customers believe it would take 3+ years for competitors to replicate the data advantage.
- Insight uniqueness scores exceed competitors by at least 2 points.
- Customers show clear willingness to pay for data-driven features.
Take Bloomberg, for instance. Its AI terminal leverages 40 years of proprietary financial data, fully integrated into traders’ workflows. This creates a replication timeline of 3–10 years for competitors, illustrating a structural moat that goes far beyond a simple feature advantage.
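For readers who want to operationalize these criteria, here is a minimal sketch of the Matters Graph test expressed as a due-diligence checklist. The field names, threshold values, and the Bloomberg-style example are illustrative assumptions layered on the published criteria, not an official implementation:

```python
# Illustrative sketch only: the attribute names and thresholds restate the three
# Matters Graph conditions described above; nothing here is an official tool.
from dataclasses import dataclass


@dataclass
class MoatSignals:
    replication_years: float       # customer-estimated years for a rival to rebuild the data
    uniqueness_gap: float          # insight-uniqueness score minus the closest competitor's
    pays_for_data_features: bool   # customers demonstrably pay for data-driven features


def has_data_moat(signals: MoatSignals) -> bool:
    """A data moat exists only when all three conditions hold."""
    return (
        signals.replication_years >= 3
        and signals.uniqueness_gap >= 2
        and signals.pays_for_data_features
    )


# Hypothetical Bloomberg-style profile: long replication timeline, clear willingness to pay.
print(has_data_moat(MoatSignals(replication_years=5, uniqueness_gap=3, pays_for_data_features=True)))  # True
```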
However, a new challenge is emerging: the "agentic shift." Thomas describes this as the rise of Systems of Agents, where AI systems, like LLMs, can act autonomously on data. These agents threaten traditional Systems of Record by automating the once-daunting task of data migration. What used to be a massive barrier to switching could now become a solvable problem for competitors with deep pockets willing to subsidize the transition. As Thomas puts it:
Data viscosity is your friend, until it isn’t.
For investors, the takeaway is clear: focus on companies with deeply integrated workflows and data that has achieved "industry standard" status – like FICO scores or Nielsen ratings. The real test of stickiness is whether removing the system would force a complete overhaul of core business operations. This is where true structural switching costs lie.
Replication Timeline: How Long Would It Take a Competitor to Rebuild This?
One of the simplest ways to evaluate data defensibility is by asking this: could a competitor with $5 million replicate your dataset in under 12 months? If the answer is yes, your advantage may not hold up long-term. This benchmark separates lasting structural advantages from early-mover perks that fade quickly when competitors decide to catch up. Let’s break down what factors either stretch or shrink this replication timeline.
Some elements naturally make replication harder and take longer. Physical infrastructure often creates significant hurdles. Similarly, regulatory requirements can act as barriers in industries like healthcare or finance. Compliance with standards like HIPAA or the EU AI Act doesn’t just govern data collection – it also grants the legal right to process that data. This creates what’s often called a "compliance moat." For example, a healthcare company with exclusive access to HIPAA-compliant patient data would force competitors to spend years navigating regulations and building similar partnerships before they could even begin to compete. Additionally, when data collection is deeply woven into essential business workflows, replication becomes even tougher, as it may require rebuilding entire operational systems from scratch.
However, the rise of AI has changed the game, drastically shortening replication timelines. Abraham Thomas, co-founder of Quandl, puts it bluntly:
Companies that spent years building complex human-mediated data pipelines must now contend with upstarts who can replicate 99% of their work for 1% of the cost.
A 2025 Bowmark Capital roundtable highlighted this shift, with 73% of participants in the Data & Insight sector pointing to tasks like aggregation, normalization, and interpretation as especially vulnerable to AI-driven automation. What once required large teams and significant investment can now often be accomplished quickly and cheaply. This has put traditional data advantages under pressure and has had a direct impact on company valuations.
Replication timelines play a critical role in shaping valuations by distinguishing short-lived advantages from enduring data moats. According to the Opagio Moat Assessment Matrix, datasets that would take 3–10 years to replicate typically earn a 25–50% valuation premium. In contrast, model-based moats, which offer only 12–36 months of protection, command a smaller premium of 10–25%. This difference highlights which companies are built on sustainable competitive positions and which ones are at risk of losing their edge as competitors close the gap.
For investors, understanding the replication timeline is essential. Key questions include: What would it actually take for a competitor to match your data scale? Could synthetic data serve as an alternative, or does the need for real-world data streams make replication nearly impossible? If replication depends on costly physical assets, lengthy regulatory approvals, or exclusive agreements, the moat is solid. On the other hand, if a competitor could achieve parity with just a small engineering team and less than a year’s effort, the advantage might be fleeting rather than durable.
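Those questions can be turned into a rough triage, sketched below. The attribute names and bucket labels are assumptions chosen to mirror the factors discussed above; this is not a standard diligence tool, just one way to make the timeline test explicit:

```python
# Rough triage of replication timelines based on the factors discussed above.
# Attribute names and bucket labels are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class DataAsset:
    needs_physical_infrastructure: bool        # sensors, fleets, hardware deployments
    needs_regulatory_approval: bool            # e.g., HIPAA or EU AI Act processing rights
    has_exclusive_agreements: bool             # contractual or partnership exclusivity
    replicable_by_small_team_in_a_year: bool   # the "$5 million in under 12 months" question


def replication_bucket(asset: DataAsset) -> str:
    if asset.replicable_by_small_team_in_a_year:
        return "0-12 months: a temporary edge, not a durable moat"
    if (asset.needs_physical_infrastructure
            or asset.needs_regulatory_approval
            or asset.has_exclusive_agreements):
        return "3-10+ years: a durable structural moat"
    return "1-3 years: moderate, model-style protection"


# Example: a healthcare dataset gated by physical collection and regulatory approval.
print(replication_bucket(DataAsset(True, True, False, False)))
```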
Monetization Surface: Can the Data Become a Standalone Product?
When data evolves into a standalone product rather than just supporting a core application, it can dramatically boost a company’s valuation. In fact, this shift can lead to valuation increases of 25–50%, or even multiples of 5–10x in certain cases. Investors looking at data-driven businesses should ask a key question: Can this dataset be sold independently as a benchmark, index, or scoring model? The answer helps determine whether the data serves a single revenue purpose or acts as a multiplier, generating diverse income streams. This distinction is critical, as seen in companies that transform their core datasets into independent revenue powerhouses.
Take S&P Global as an example. The company earns about $1 billion annually by licensing the S&P 500 Index to asset managers, banks, and exchanges. Although the stock data it’s based on is publicly available, the index has become indispensable – a universal benchmark that market players can’t ignore. This is the gold standard for data monetization: turning a dataset into an industry staple that essentially collects a "tax" on every related transaction. Similarly, FICO scores have become so entrenched in banking and consumer credit that replacing them is nearly impossible. They’ve become critical tools for both banks and consumers, locking in their value as standalone assets.
ZoomInfo offers another compelling case. By 2021, the company had achieved 90% profit margins and a 15:1 lifetime value-to-customer acquisition cost (LTV/CAC) ratio, thanks to its highly curated corporate contact directory. Initially designed to improve sales software, this dataset evolved into a primary product. It’s sold through APIs, with pricing based on factors like data freshness, access levels, and specific use cases. ZoomInfo’s success aligns with Harvard Business Review’s framework for actionable insights: their data delivers lasting value, is protected by proprietary verification methods, offers unique improvements, and provides immediate benefits to sales teams. These examples highlight how smart data monetization strategies can create new revenue streams while significantly enhancing a company’s overall valuation.
The financial impact of standalone data products is tangible. According to the Opagio Moat Assessment Matrix, companies offering datasets as industry benchmarks can command valuation premiums of 40–80%, compared to just 10–25% for data that merely enhances a core product. This difference is critical because standalone data often acts as catalyst data – data that not only holds value on its own but also amplifies the worth of related datasets. Such data can be licensed as benchmarks or scoring models, generating additional revenue streams.
However, not all data makes this leap. Data used solely to improve in-product functionality – like Amazon’s recommendation engine or Zendesk’s support ticket analysis – offers operational benefits but lacks the broader appeal needed for standalone monetization. These datasets are tailored for specific use cases and don’t become industry standards. The test is simple: Would someone outside your product ecosystem – like a competitor or adjacent industry player – pay for access to this data? If the answer is no, the data is part of a learning loop rather than an independent product. If the answer is yes, it represents a second revenue stream and a strong competitive edge, likely driving higher valuation premiums.
Assessing whether data can stand alone as a product is essential when evaluating its role in a company’s defensibility. It complements other factors like switching costs, replication timelines, and compounding effects, making it a critical part of building a durable data moat. For more insights on creating impactful moats, join our AI Acceleration Newsletter.
Framework Summary: Weak vs. Strong Defensibility Indicators

This framework offers a clear way to evaluate whether a company’s data moat is truly defensible or just a marketing story. For investors, understanding this distinction is critical – it can mean the difference between an ordinary valuation and one that’s 5–10× higher. The table below distills the five key dimensions of due diligence, drawing on frameworks such as Casado & Lauten’s and Matters Graph (2025). According to commercial due diligence norms, a true data advantage typically takes competitors over three years to replicate. As Martin Casado and Peter Lauten caution:
Treating data as a magical moat can misdirect founders from focusing on what’s really needed to win.
This framework helps cut through the noise, showing where genuine defensibility lies and where it doesn’t.
| Dimension | Weak Indicator (Red Light) | Moderate Indicator (Yellow Light) | Strong Indicator (Green Light) |
|---|---|---|---|
| Source Defensibility | Public, scraped, or purchasable datasets | Proprietary but replicable; exclusive partnerships with risks | Unique real-world data (e.g., bio, physics, sensors) or "give-to-get" models |
| Compounding Mechanism | Static historical data; volume-based advantages | First-mover advantage in collection (12–18 month edge) | Self-reinforcing flywheels; usage improves models automatically |
| Switching Costs | Loosely coupled; easy to replace | System of Record (SoR); sticky but migratable via AI | System of Action (SoA); deeply integrated into workflows |
| Replication Timeline | 0–12 months | 1–3 years (model-based moats) | 3–10+ years (proprietary data or network moats) |
| Monetization Surface | Commodity info; marginal advantage over models | Data as a by-product (exhaust data) | Industry-standard "currency" or clearinghouse for fragmented data |
Companies that fall into the green-light category often see valuation premiums of 25–50%, with an even larger boost – 40–80% – when network effects are in play. The replication timeline is a critical measure: if a competitor can rebuild the data advantage in under a year, it’s a temporary edge, not a durable moat. Real defensibility usually stems from elements like physical infrastructure, regulatory barriers, or deeply embedded workflows that are difficult to replicate in less than three years.
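As a minimal sketch, the rubric can also be encoded so that each dimension receives a red, yellow, or green rating and the overall read is mapped to the indicative premium bands quoted above. The dimension names follow the table; the scoring rule itself (counting green and red lights) is an assumption made for illustration, not a published valuation method:

```python
# Illustrative scorer for the five-dimension rubric above. The aggregation rule is an
# assumption; the premium bands echo the figures quoted in this article.
DIMENSIONS = [
    "source_defensibility",
    "compounding_mechanism",
    "switching_costs",
    "replication_timeline",
    "monetization_surface",
]


def summarize(ratings: dict[str, str]) -> str:
    """ratings maps each dimension to 'red', 'yellow', or 'green'."""
    greens = sum(ratings[d] == "green" for d in DIMENSIONS)
    reds = sum(ratings[d] == "red" for d in DIMENSIONS)
    if greens >= 4 and reds == 0:
        return "Strong moat: indicative 25-50% premium (40-80% with network effects)"
    if reds >= 3:
        return "Weak moat: a temporary edge rather than durable defensibility"
    return "Mixed: probe replication timeline and switching costs further"


example = {
    "source_defensibility": "green",
    "compounding_mechanism": "green",
    "switching_costs": "green",
    "replication_timeline": "green",
    "monetization_surface": "yellow",
}
print(summarize(example))
```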
For more insights on building and assessing defensible data assets, join our AI Acceleration Newsletter.
Conclusion: The Data Layer Is the Basis of Value
The companies that will thrive in the shifting AI landscape won’t be those with the flashiest interfaces or even the most advanced models. Success will hinge on one thing: a defensible data layer. As Ivan Gowan, CEO of Opagio, aptly explains:
The most important question in AI valuation is not ‘Does this company use AI?’ but ‘Can a competitor replicate this AI capability, and how long would it take?’
Software features can be copied with relative ease. But proprietary data systems, deeply integrated into essential workflows, present a far greater challenge – often taking years, sometimes even a decade, to replicate.
This difficulty in replication highlights why our five-dimensional framework is so critical. By focusing on source defensibility, compounding mechanisms, structural switching costs, replication timeline, and monetization surface, investors can separate the companies with genuine data moats from those relying on marketing hype. With a staggering 73% of AI startups lacking strong moats, identifying the 27% that do is more than just a skill – it’s a necessity.
The real takeaway? Defensibility isn’t about the code; it’s about the data. When data moves beyond passive storage and becomes part of active, dynamic processes, it transforms into an irreplaceable asset. This evolution, from Systems of Record to Systems of Action – and now to Systems of Agents – proves one thing: static data is at risk of disruption, but data that fuels decision-making becomes indispensable.
At M Studio, we specialize in uncovering and building these defensible data assets within our venture portfolio. By blending strategic evaluation with hands-on execution, we ensure that data advantages directly translate into measurable revenue growth and higher valuations. Learn more about our approach to creating data-driven companies that stand out to investors and dominate their markets.
FAQs
What evidence proves the data is truly rivalrous and non-substitutable?
The strongest evidence comes from proprietary data collection, a well-designed feedback loop, and deep domain expertise. Together, these create advantages that are tough for competitors to match, often requiring over three years to replicate.
What makes this even stronger is the use of real-time, continuously updated data. Unlike static datasets, which can become outdated and vulnerable within 12–18 months, dynamic data remains relevant and harder to replace. This approach ensures the information stays distinct, reliable, and resistant to being easily substituted.
How can I tell a real data flywheel from a scale effect that will plateau?
A real data flywheel generates entirely new insights from every customer interaction. This creates a self-reinforcing cycle that not only enhances the product but also strengthens its competitive edge over time. On the other hand, a scale effect merely increases the amount of data collected. While useful, it tends to hit a ceiling and can often be replicated within a few months. To spot a genuine data flywheel, focus on insights that competitors can’t easily replicate and feedback loops that provide lasting advantages.
What due diligence tests best estimate a competitor’s replication timeline?
Competitors can often replicate a dataset in less than a year if the data comes from sources like public APIs or scraped content. On the other hand, datasets that depend on physical infrastructure or regulatory permissions are much harder to duplicate, often requiring three years or more. These aspects play a crucial role in evaluating the strength of a data moat and the difficulty of replication.



