Five Questions Every Investor Should Ask Before Backing an AI Company

Alessandro Marianantoni
Thursday, 12 March 2026 / Published in Entrepreneurship


AI startups are booming, but most fail to deliver lasting value. Why? Many rely on third-party APIs or generic data, making them vulnerable to competition from tech giants. Investors need to focus on companies with strong "data moats" – proprietary datasets that grow more valuable with every customer interaction. Here’s what to ask before investing:

  1. What data does the company generate? Proprietary data is key. Startups relying on public or licensed datasets lack a long-term edge.
  2. How hard is it to replicate their dataset? If a competitor could replicate it in under a year, the company’s advantage is weak.
  3. Is it a model play or a data play? Companies relying on third-party models face commoditization. Data-driven businesses have better staying power.
  4. How integrated is it in customer workflows? Products deeply embedded in workflows create high switching costs, making them indispensable.
  5. Can the data generate independent revenue? Monetizable datasets add extra layers of resilience and valuation upside.
5 Critical Questions Investors Must Ask Before Backing AI Companies

What Data Does This Company Actually Generate?

Proprietary Data vs. Public Data

A company’s competitive edge often lies in the type of data it generates. Data uniquely created through customer interactions provides a lasting advantage, unlike publicly sourced or licensed datasets. If a company says, "We train on publicly available datasets" or "We fine-tune GPT-4 on industry documents", that’s more of a feature than a differentiator. On the other hand, when every transaction produces unique behavioral signals that competitors can’t replicate without matching customer volume, the company is building a true data advantage.

This distinction is critical because proprietary data creates a self-reinforcing cycle: each customer interaction improves the model, attracting more customers and generating even more valuable data. Public datasets, however, pose scalability challenges, often requiring constant manual updates to maintain relevance. For more insights on evaluating data moats, check out our AI Acceleration Newsletter.

Take Incode Technologies, for example. They grew their revenue from $6 million to $170 million by creating a proprietary biometric verification dataset. Every identity check they process generates new training data, which competitors can’t duplicate without achieving the same transaction scale. Similarly, Brigit scaled to $150 million in revenue by building a unique dataset of spending patterns from millions of user transactions, allowing their AI to predict financial distress with increasing accuracy over time. Both companies pass what Dan Gray at Equidam calls the "model swap test." This means their competitive edge isn’t tied to a specific foundation model but to their ability to generate and leverage unique data.

This ability to produce irreplaceable data not only builds a strong competitive moat but also ensures a more secure position in the market over the long term.

Impact on Long-Term Positioning

The SEG 2026 Report highlights that businesses leveraging proprietary data to drive sustainable revenue growth tend to command higher valuations, while those dependent on generic data see lower ones. By 2026, venture capital firms had already shifted focus, screening out the roughly 70% of AI startups that lacked differentiation and prioritizing companies with genuine data moats.

Legal risks further emphasize the value of proprietary data. In September 2025, Anthropic faced a $1.5 billion copyright settlement, the largest in U.S. history, underscoring the dangers of relying on unauthorized or non-proprietary datasets.

"AI does not work without accurate data. Period."

  • Brian MacMahon, CEO of Expert Dojo

For investors reviewing countless pitch decks, the key question becomes: does this company generate irreplaceable data through deep customer integration, or are they simply processing interchangeable data via third-party APIs? As the industry undergoes rapid shifts, only companies with defensible, unique data assets are likely to maintain their value.


How Long Would It Take a Competitor to Replicate This Dataset?

Structural Barriers to Replication

The real question isn’t whether a competitor can replicate your dataset – it’s how long it would take them with $50 million in funding. If they could do it in less than 12 months, your dataset is likely just a feature, not a long-term advantage. But if the process would take over three years due to inherent challenges, then you’ve got an asset worth investing in.

One of the biggest hurdles? Regulatory compliance. Take the EU AI Act, for example. By August 2, 2026, it will fully apply, and non-compliance could lead to fines of up to €35 million or 7% of global annual revenue. Companies with established compliance systems already have a head start, while competitors face steep costs and lengthy timelines just to enter the market. This creates a significant delay for anyone trying to catch up.

Another challenge arises when a product becomes deeply embedded in workflows, especially in areas tied to compliance, audits, or operational procedures. If a dataset is tied to these workflows, switching isn’t just inconvenient – it can be prohibitively expensive. For a competitor, it’s not just about replicating the data but convincing organizations to overhaul their entire systems. And when data is generated from these tightly integrated workflows, it can’t simply be scraped, purchased, or reverse-engineered.

Take Cursor as an example. They developed a proprietary mixture-of-experts model that’s four times faster than frontier models. By early 2026, Cursor had hit $1 billion in annualized revenue, with gross margins projected to rise from 74% to 85% by 2027. Replicating their success isn’t as simple as copying the interface. Competitors would face years of work to match their proprietary feedback loops, task-specific quality checks, and custom infrastructure – all requiring significant capital with no guarantee of success.

These structural challenges don’t just slow down replication; they’re also critical factors for investors to evaluate.

Investor Evaluation Criteria

For investors, the key question is whether the core functionality of a product can be replicated quickly. If a well-funded, AI-savvy team could replicate it in under 12 months, the business might lack long-term defensibility. But if replication would take years, it’s worth digging into what’s creating that barrier. Is it lengthy regulatory approval processes, exclusive partnerships, or the accumulation of data that creates high switching costs? These elements separate businesses with staying power from those that are just polished prototypes.

Elad Gil (2023) points out that defensibility often develops over time, especially with proprietary datasets and deeply ingrained workflows. True defensibility comes from making it exponentially harder for competitors to catch up. Proprietary feedback loops are a big part of this. When performance depends on private data – like support tickets, claims, or operational logs – and task-specific feedback, it becomes locked to the vendor. On the other hand, if performance is tied only to the choice of foundation model, it’s fragile and easy to replicate.

Here’s a practical test for founders: can they swap their foundation model within 72 hours without seeing a drop in performance? If they can’t, the real advantage lies with the model provider, not the company. But if they can, it means the company’s strength comes from its proprietary data and workflow integration. And that’s exactly where defensibility should be.
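The swap test can be made concrete. Below is a minimal sketch, in Python, of the provider-agnostic adapter pattern that makes a 72-hour swap feasible; all class, function, and backend names here are hypothetical, not from any vendor SDK. The point is architectural: if the application's proprietary enrichment lives above an interface like this, the underlying model is a configuration choice, not a dependency.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical adapter layer: the application depends on this interface,
# never on a specific vendor SDK, so the foundation model is swappable.
@dataclass
class ModelAdapter:
    name: str
    complete: Callable[[str], str]  # prompt -> completion

# Stand-in backends; a real system would wrap vendor API calls here.
def backend_a(prompt: str) -> str:
    return f"[model-A] {prompt}"

def backend_b(prompt: str) -> str:
    return f"[model-B] {prompt}"

REGISTRY: Dict[str, ModelAdapter] = {
    "model-a": ModelAdapter("model-a", backend_a),
    "model-b": ModelAdapter("model-b", backend_b),
}

def answer(prompt: str, model: str = "model-a") -> str:
    """Application logic: proprietary data enrichment happens here,
    independent of which model generates the final text."""
    enriched = f"context:<proprietary signals> question:{prompt}"
    return REGISTRY[model].complete(enriched)

# Swapping the model is a one-argument change, not a rebuild:
print(answer("Is this account at risk?", model="model-a"))
print(answer("Is this account at risk?", model="model-b"))
```

A company built this way passes the swap test because its edge (the enrichment step) survives the change of backend; a company whose value lives in the `complete` call does not.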

Is This a Model Play or a Data Play?

When looking at AI investments, the real question isn’t just whether a company uses AI, but how they’re using it. Are they building their business on someone else’s intelligence, or are they creating their own? Model-focused companies aim to outpace competitors with better algorithms, while data-focused companies rely on owning the unique inputs that make any model valuable. Here’s the thing: one of these strategies has a much shorter lifespan than the other. Companies that generate and protect proprietary data are better positioned to maintain value, even as models improve rapidly. This distinction is crucial when evaluating whether a startup’s edge lies in its exclusive data or its dependence on third-party models.

Commoditization in the Model Layer

The model layer is starting to look a lot like the cloud infrastructure wars – where scale dominates, competition is fierce, and only a few players come out on top. According to the 2026 Adaline Labs investor panel, foundation models are converging in performance much faster than user experiences, leaving companies caught in the middle facing tough valuation challenges.

By 2026, the market had already eliminated 70% of low-differentiation AI startups from serious funding consideration. These so-called "model wrappers" have seen their valuation multiples drop from 9–12x ARR to 3–4x ARR. Why? Companies like OpenAI, Anthropic, and Google continue to expand their feature sets, often releasing new capabilities for free – capabilities that entire startups had built their businesses around.

"Most AI startups don’t fail because they’re bad, but because they’re building something that OpenAI, Anthropic, or Google can eventually ship ‘for free’ as a feature."

  • Jenny Xiao, Cofounder, Leonis Capital

The financials paint a similar picture. OpenAI’s inference costs hit $2.3 billion in 2024, a figure 15 times higher than the training cost for GPT-4. Unlike traditional SaaS companies with gross margins of 80–90%, AI-native businesses struggle with 50–60% margins because inference costs grow with the complexity of the queries. In fact, 84% of enterprises report that AI infrastructure costs have reduced gross margins by 6% or more.
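The margin arithmetic behind those figures is worth making explicit. A quick sketch with illustrative numbers (the cost splits are assumptions, not from the article): per-query inference spend sits in cost of goods sold, so it erodes gross margin directly as usage grows.

```python
def gross_margin(revenue: float, cogs: float) -> float:
    """Gross margin as a fraction of revenue."""
    return (revenue - cogs) / revenue

# Illustrative numbers: a classic SaaS product whose serving cost is
# mostly fixed hosting, vs. an AI-native product paying per-query
# inference that scales with usage and query complexity.
saas = gross_margin(100.0, 15.0)  # lands in the 80-90% band
ai = gross_margin(100.0, 45.0)    # lands in the 50-60% band cited above
print(f"{saas:.0%} {ai:.0%}")
```

Unlike hosting, inference COGS rises with each harder query, which is why the gap tends to widen rather than close as an AI product scales.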

When a company’s value depends entirely on accessing someone else’s model through an API, it’s not building a sustainable business – it’s essentially renting space in someone else’s store. And the landlord can change the terms at any time.

Advantages of Application and Data Layers

While the model layer continues to face commoditization, the data layer offers a path to lasting value by strengthening customer relationships. Data-driven companies operate under very different rules. When AI labs release model upgrades, these companies can adopt the improvements seamlessly without rebuilding their systems. Their competitive edge doesn’t come from the reasoning engine itself – it comes from proprietary data, workflow integration, and behavioral insights that no foundation model provider can replicate.

The application layer holds the most valuable asset: the user relationship. This includes daily usage patterns, trust, workflow integration, and switching costs. As the 2026 Adaline Labs panel pointed out, while models are converging in capability, the real differentiation lies in user experiences shaped by domain expertise and the compounding effects of data.

Here’s a key contrast: a model-focused startup must constantly re-engineer its offering to stay competitive. A data-focused company, on the other hand, can simply plug in a better model and instantly deliver improved results – without touching its application code.

"If you’re using someone else’s model for your inference, you might have zero gross margins."

  • Steve Schlenker, Managing Partner, DN Capital

The companies creating lasting value are those that generate proprietary data through usage – like industrial workflow traces, compliance logs, or domain-specific feedback loops. Every interaction with their system produces unique signals that competitors can’t replicate. This is the power of a data flywheel. Without it, even the best interface is just a polished façade. Understanding whether a company is focused on models or data is the first step in assessing how deeply its solution is embedded into customer workflows.

How Deeply Is This Embedded in the Customer’s Workflow?

The depth of a product’s integration into a customer’s daily workflow often determines whether it’s a nice-to-have or an absolute necessity. For AI companies, the focus shouldn’t just be on what the product does, but on how central it is to the customer’s operations. Tools that sit on the edges of workflows are easy to replace. In contrast, deeply embedded systems create high switching costs, making them harder to displace and more valuable. To assess this, it’s crucial to understand how the product transitions from merely storing data to actively managing workflows.

System of Record vs. System of Action

Abraham Thomas provides a useful way to evaluate workflow integration through two categories: System of Record and System of Action.

  • A System of Record acts as the definitive source for specific data, such as compliance logs, audit trails, or standard operating procedures. Its strength lies in "data gravity" – the immense cost and effort required for an organization to migrate years of accumulated data elsewhere. As Jay Zhao, Cofounder of Leonis Capital, explains:

"Structural advantages – such as being a system of record or being embedded into compliance, audits, or standard operating procedures – are much harder to replicate."

  • A System of Action, on the other hand, doesn’t just store information – it actively executes tasks within workflows. It controls the full cycle of "trigger → input data → decision rules → action", creating daily reliance and making it more integral to operations. Zhao highlights the trade-off involved:

"We chose to execute actions rather than merely suggest, which increased liability but created real lock-in. We became the system of record instead of a thin layer, slowing early integrations and sales but making later switching prohibitively expensive."

Combining both approaches – acting as a System of Record while executing actions – results in the most defensible product. A great example is Incode Technologies, which grew its revenue from $6 million to $170 million between 2023 and 2025. By serving as both a System of Record (storing biometric data) and a System of Action (executing real-time authentication), Incode created a product that competitors couldn’t easily replicate. Customers faced major disruptions if they tried to switch, as competitors lacked the same transaction volume to match Incode’s accuracy.

Evaluating Workflow Stickiness

To truly assess how well an AI product integrates into daily operations, the focus should be on its role in the workflow. Dan Gray from Equidam offers an insightful test:

If the vendor only owns the chat interface and not the data ingestion, enrichment, and actioning layers, they can be swapped out. Market pull couples to depth of workflow ownership.

A product’s integration depth can often be measured by the number of bi-directional connectors it has. These connectors allow the product to both read from and write to upstream and downstream systems, ensuring it plays a central role in the workflow. Products that go beyond providing recommendations and actually close the loop on actions create much higher switching costs, making them indispensable.

Take Brigit, for example. By 2025, the financial app reached $150 million in revenue at exit. Instead of just analyzing spending patterns, Brigit actively executes overdraft protection based on its proprietary predictions of financial distress. This level of automation embeds the product into the user’s financial routine. Switching to another tool would mean losing both its predictive accuracy, built from millions of transactions, and the seamless automation it provides.

Metrics that reveal integration depth include:

  • The percentage of tasks fully automated versus partially assisted
  • Retention rates tied to deeper integration
  • The ability to pass a "72-hour re-platform drill", which tests whether a company can maintain operations if its primary provider’s terms suddenly change

If a product’s performance collapses when its underlying model is swapped out, its defensibility is weak. However, if outcomes remain stable, it’s a sign of strong workflow integration – an advantage that can weather even challenging market conditions.
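Those three metrics can be folded into a rough scorecard. The sketch below is purely illustrative – the weights, thresholds, and function name are hypothetical choices, not a published rubric – but it shows how an investor might turn the qualitative checklist into a comparable number across a portfolio.

```python
def stickiness_score(pct_fully_automated: float,
                     net_retention: float,
                     passes_replatform_drill: bool) -> float:
    """Combine the three integration-depth signals into a 0-100 score.
    Weights (40/40/20) and thresholds are illustrative assumptions."""
    score = 0.0
    # Share of customer tasks fully automated (0.0-1.0).
    score += 40 * min(pct_fully_automated, 1.0)
    # Net revenue retention: map the 80%-120% range onto 0-1.
    score += 40 * min(max(net_retention - 0.8, 0.0) / 0.4, 1.0)
    # Pass/fail on the 72-hour re-platform drill.
    score += 20 * (1.0 if passes_replatform_drill else 0.0)
    return round(score, 1)

print(stickiness_score(0.7, 1.15, True))   # deeply embedded product -> 83.0
print(stickiness_score(0.1, 0.85, False))  # edge-of-workflow tool   -> 9.0
```

The spread between the two calls is the point: products that close the loop on actions score in a different band than tools that merely sit beside the workflow.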

Can the Data Generate Revenue Independently?

A dataset becomes exponentially more valuable when it can generate revenue outside of its core product. This isn’t about spinning off a new business but about creating a "defensibility multiplier" that enhances valuation. The real question is: would other companies, industries, or decision-makers pay for access to your data even if they never used your primary product?

One clear way to achieve this is through licensing the dataset itself. Take Transport for London (TfL), for example. TfL provides free and open data across 80 feeds, powering over 600 apps. This initiative contributes up to $130 million annually to London’s economy by driving efficiency and enabling third-party innovation. For proprietary datasets, the same principle applies: if your data solves larger problems, it can be packaged and licensed, creating an independent revenue stream. Gartner does this effectively by selling premium access to its proprietary datasets, which are used to rank and evaluate data analytics tools and vendors. This transforms their research infrastructure into a revenue-generating asset.

These examples show how proprietary data can evolve into a standalone revenue-generating asset.

Examples of Monetizable Data Streams

Defensible datasets turn raw data into actionable insights, such as credit scores, ESG ratings, or NPS. These insights are built using expert methodologies and specific thresholds, converting raw figures into meaningful metrics.

Meta’s "Meta for Business" exemplifies this approach. By analyzing customer interactions with ads, Meta provides guided analytics to help marketing teams optimize campaign performance. Similarly, a single data product at a national U.S. bank supports 60 different use cases, generating $60 million in incremental revenue annually. This demonstrates how one dataset can power multiple applications, creating a significant revenue multiplier.

Intercom’s AI agent, "Fin", offers another example. In 2025, Intercom began charging clients based on outcomes – specifically, the number of customer issues resolved. This model capitalizes on a vast dataset of resolved support interactions, transforming it into a standalone revenue stream.

These monetization strategies not only generate independent revenue but also strengthen the overall defensibility of the business.

Dual Revenue Stream Advantages

Monetizing data independently provides two major benefits. First, it adds an extra layer of defensibility. Even if competitive forces reduce margins in the core product, the data asset retains its intrinsic value. Research from IBM and McKinsey shows that data monetization can drive industry-leading performance, contributing over 20% to a company’s profitability.

Second, standalone data assets can boost valuation multiples. According to the SEG 2026 Report, buyers are increasingly selective, favoring companies with durable revenue growth, strong net revenue retention, and clear AI positioning. With median SaaS revenue growth dropping to 12.2% by Q4 2025 and only 17% of public SaaS companies meeting the Rule of 40 (down from 30% in 2015), diversified revenue streams have become a key differentiator. The ultimate test is simple: if a competitor replicated your product tomorrow, would your dataset still attract buyers? If the answer is yes, you’ve built a standalone asset. If not, you’re relying solely on product differentiation – a far riskier position in a volatile market.
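The Rule of 40 referenced above is simple arithmetic: revenue growth rate plus profit margin should sum to at least 40%. A minimal sketch:

```python
def rule_of_40(growth_pct: float, margin_pct: float) -> bool:
    """Rule of 40: revenue growth % + profit margin % should reach 40."""
    return growth_pct + margin_pct >= 40.0

# At the Q4 2025 median SaaS growth of 12.2%, a company needs roughly
# a 28% profit margin to clear the bar:
print(rule_of_40(12.2, 28.0))  # True
print(rule_of_40(12.2, 20.0))  # False
```

With median growth at 12.2% and only 17% of public SaaS companies clearing the bar, most of the pack is being carried neither by growth nor by margin – which is exactly why a second, data-driven revenue stream moves the needle.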

Conclusion: The AI Correction Is Coming – Evaluate Accordingly

The market is already sorting the strong contenders from the rest. By early 2026, around 70% of low-differentiation AI startups were no longer considered viable for serious investment. This shift is happening now. The companies that survive are the ones generating data that grows in value with every interaction they have with their customers.

So, which AI companies are you backing – and how are you evaluating them? The questions above define the traits that set successful companies apart. Sign up for our AI Acceleration Newsletter to get weekly updates on data defensibility, investment strategies, and insights that help you separate meaningful opportunities from the noise.

The five-question framework we’ve outlined serves as a practical tool to distinguish companies with deep workflow integration and proprietary data advantages from those that compete purely on price. As Steve Schlenker, Managing Partner at DN Capital, explains:

"Ultimately none of us are investing in AI companies. We’re investing in great companies that just use AI to be even better."

The key difference lies in defensibility.

As the market correction continues, valuation gaps will widen. Generic AI wrappers are trading at 3x–4x ARR, while vertical AI companies with strong data moats command 9x–12x ARR. As buyers become more selective and weaker players face margin pressure, that divide will only grow. Companies that can confidently answer all five questions – through unique data generation, long replication timelines, ownership of the application layer, deep workflow integration, and independent data monetization – will hold their value. Those relying on flashy demos or third-party models won’t.

At M Studio, we focus on ventures that align with this data defensibility framework, prioritizing go-to-market systems that deliver compounding advantages. Learn more about our approach to building AI companies with structural moats at https://maccelerator.la/en/#eluid1e3e2401.

FAQs

What qualifies as proprietary data versus generic data?

Proprietary data refers to exclusive, company-generated information that provides a competitive edge. For example, it might include unique behavioral insights gathered from customer interactions. In contrast, generic data is publicly accessible or easy to replicate, such as datasets commonly used to fine-tune models. While proprietary data sets a business apart, generic data lacks that distinctiveness.

How can I tell whether a competitor could replicate the dataset in under 12 months?

When evaluating a dataset, it’s crucial to examine both its structural barriers and the collection process. For instance, datasets built on proprietary customer interactions, distinct behavioral patterns, or those safeguarded by regulatory measures can be tough to replicate, often extending the time it takes for others to match their value. On the other hand, if the data is publicly accessible or can be easily aggregated, a resource-rich competitor could catch up relatively fast.

What sets durable datasets apart is their reliance on elements like trust, integration into established workflows, and exclusive partnerships. These aspects are not easily duplicated and often require years of effort to build, creating a strong competitive edge.

What are the fastest signs an AI startup is just a ‘model wrapper’?

The most obvious indicators that an AI startup is merely a "model wrapper" are its dependence on open-source or widely available models and the absence of unique, defensible data. If their business revolves around fine-tuning publicly available datasets or repurposing existing models, their approach can be easily duplicated. Startups that concentrate solely on model performance, without integrating deeply into user workflows or creating distinctive data assets, face significant risks of being outpaced by competitors or falling into commoditization.

Related Blog Posts

  • How corporations approach deep tech startup acquisition
  • Why Your $50M AI Investment Will Fail (And the 3 Questions That Would Have Saved It)
  • The State of AI Adoption in Startups: 2025 Research Report
  • The Funding Reality Check: What 500+ Founders Taught Us About Raising Capital in 2024-2025
