What a Lakehouse-Native CRM Actually Means

Two architectural decisions separate a CRM that "connects to Databricks" from one that is genuinely lakehouse-native. We walk through both (federated reads via MCP, and outbound Delta Sharing) and what the difference looks like in production.

Factory Labs
Lakehouse, Databricks, Snowflake, MCP, Delta Sharing

"Lakehouse integration" appears on the marketing pages of every CRM that has shipped a feature in the last three years. What it means in practice ranges from "we have a Fivetran connector" to "the CRM is a Spark notebook with a pretty UI." Most implementations on that spectrum are not what an engineer would call lakehouse-native.

This post walks through what we think the term should mean, the two architectural decisions that separate the real thing from the marketing claim, and the practical implications for distributors evaluating CRMs with a Databricks or Snowflake estate in play.

The shape of the problem

A B2B distributor has data in roughly three places. The ERP holds the operational record: customers, orders, inventory, pricing. The CRM holds the relationship layer: accounts, contacts, leads, opportunities, conversations. The lakehouse (Databricks or Snowflake) holds the analytical store: historical aggregates, ML features, denormalized fact tables, the data science notebook landscape, the BI semantic layer.

The three systems pull on each other. The sales team wants to ask "what is the lifetime value of accounts in the Cleveland territory," a question the CRM cannot answer because the data is in the lakehouse, and the lakehouse cannot answer because it does not know what "the Cleveland territory" means in the CRM's sense. The analytics team wants to push a propensity score back into the CRM to drive a workflow. The data engineering team is tired of writing one-off pipelines between the three.

The traditional answer is to copy data around. The CRM sync writes a copy of accounts into the lakehouse. The lakehouse ETL writes a copy of aggregates back into the CRM. A third sync copies ERP data into both. Three copies of the same record, three drift surfaces, three reconciliation reports.

A lakehouse-native CRM rejects this. The data stays where it lives; the CRM reads it where it lives.

Decision 1: federated read at the query layer

The first decision is whether the CRM can read from the lakehouse directly, without a copy. In production this means: when a sales rep asks the AI assistant "show me lifetime revenue by account for the Cleveland territory," the assistant translates that question into a query against Databricks (or Snowflake), runs it federated, and renders the answer in the CRM's UI. No sync window. No staleness. No copy.

The mechanism we use is the Model Context Protocol (MCP), which is becoming the de facto standard for LLM tool use. Databricks ships a Genie Spaces MCP server. Factory's Assistant is a bidirectional MCP client and server. It speaks MCP outbound to Genie, Snowflake's Snowpark MCP server, or any third-party MCP server, and it speaks MCP inbound so that Claude, Cursor, or Mosaic AI can drive the CRM with the same guardrails as a logged-in user.

The MCP layer is the part most "Databricks integration" CRMs do not have. Without it, the CRM can only ask the lakehouse questions the integration vendor anticipated. With it, the assistant can synthesize new questions on the fly because the lakehouse is exposed as a generic tool, not a hard-coded API.

What this looks like for the user:

  • Rep types "what is the lifetime gross margin by account in the Cleveland territory" in the assistant.
  • Assistant calls the Databricks MCP server with a generated SQL query against the Cleveland-mart schema.
  • Databricks runs the query, returns rows.
  • Assistant renders the table in the CRM, attached to the account context the rep is viewing.
  • Rep clicks one of the accounts; that opens the account record in the CRM, which itself live-reads the ERP.

Three live reads (CRM, ERP, lakehouse) composed in one answer. No copy of anything.
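The second step above can be sketched at the wire level. MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method with a tool name and an arguments object. The sketch below only builds that request; the tool name `execute_sql`, the `query` argument, and the `cleveland_mart.order_lines` table are hypothetical placeholders for illustration, not the names any particular Databricks or Factory server exposes.

```python
import json
from itertools import count

# Monotonic JSON-RPC request ids for this client session.
_request_ids = count(1)


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` request (MCP messages are JSON-RPC 2.0)."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical SQL the assistant might generate for the rep's question.
# Schema, table, and column names are assumptions for illustration.
sql = (
    "SELECT account_id, SUM(gross_margin) AS lifetime_gross_margin "
    "FROM cleveland_mart.order_lines "
    "GROUP BY account_id "
    "ORDER BY lifetime_gross_margin DESC"
)

# "execute_sql" is a hypothetical tool name; a real client would first
# discover the server's tools with a `tools/list` request.
request = build_tool_call("execute_sql", {"query": sql})
print(json.dumps(request, indent=2))
```

The point of the generic envelope is that nothing in it is specific to the question being asked: the same request shape carries any query the assistant synthesizes.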

Decision 2: outbound sharing without ETL

The second decision is how the CRM exposes its own data back to the lakehouse. The default is a Fivetran connector that batch-extracts CRM records into the lakehouse on a schedule. That is the same sync-engine pattern this post is arguing against, just running in the other direction.

The lakehouse-native answer is outbound Delta Sharing. The CRM exposes its records (accounts, contacts, opportunities, activities) as Delta tables that the lakehouse subscribes to. The Delta Sharing protocol handles the wire format; the consumer reads the tables directly from any Delta-compatible engine (Databricks, DuckDB, Trino, Apache Spark). No connector to install, no schedule to operate, no connector-side schema mapping to drift, no API quota.
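From the consumer side, a read can be sketched with the open-source `delta-sharing` Python connector, which addresses a table as `<profile-file>#<share>.<schema>.<table>`. The profile file name and the share, schema, and table names below are hypothetical; the names a given CRM actually publishes would come from its own share listing.

```python
def table_url(profile_path: str, share: str, schema: str, table: str) -> str:
    """Build a Delta Sharing table URL: <profile-file>#<share>.<schema>.<table>."""
    return f"{profile_path}#{share}.{schema}.{table}"


def load_shared_table(profile_path: str, share: str, schema: str, table: str):
    """Read a shared table into a pandas DataFrame.

    The import is deferred so the URL helper works without the package
    installed (pip install delta-sharing).
    """
    import delta_sharing

    return delta_sharing.load_as_pandas(table_url(profile_path, share, schema, table))


# Hypothetical profile file and share/schema/table names, for illustration only.
url = table_url("factory.share", "crm", "core", "accounts")
print(url)  # factory.share#crm.core.accounts
```

Calling `load_shared_table("factory.share", "crm", "core", "accounts")` would then fetch the table over the sharing protocol, with no extract job on either side.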

For a Databricks-first shop, this means the CRM appears in Unity Catalog as a set of governed tables, just like every other data source. The data engineering team queries them with SQL. The data science team trains models on them with PySpark. The BI team builds dashboards on them in Lakeview. Nobody operates a connector.

For a Snowflake shop, the Iceberg REST catalog conformance provides the equivalent. Factory's Delta tables are simultaneously readable as Iceberg tables via the same standard catalog protocol Snowflake supports natively. One protocol, two consumer shapes.

What "lakehouse-native" specifically excludes

These two decisions, federated read via MCP and outbound share via Delta, are the test. If a CRM does both, it is lakehouse-native. If it does neither, it has a Databricks integration in the marketing sense. The intermediate cases are interesting:

  • Federated read only. The assistant can ask the lakehouse questions, but the CRM's own data is not in the lakehouse without a sync. This is a real engineering posture and is genuinely useful, but it leaves the analytics team running a CRM-to-lakehouse pipeline.
  • Outbound share only. The lakehouse has the CRM's data, but the CRM cannot synthesize answers from the lakehouse without a developer round-trip. The assistant is limited to the questions a developer pre-baked.
  • Neither, plus a Genie hyperlink. The CRM has a button that opens a Databricks Genie chat in a new tab, pre-populated with the account ID. This is not integration; it is a hyperlink with extra steps.

The first two are reasonable mid-states. The third is what "Databricks integration" frequently means and is worth flagging in evaluation.

Why this matters for distributors specifically

Distributors tend to have unusually rich lakehouse estates relative to their headcount because the analytical questions matter so much to the business: segment profitability, contract-price drift, inventory turn, freight cost trending. A distributor with $300M in revenue can have a Databricks bill larger than its CRM bill because the analytical work is where the margin decisions get made.

The lakehouse-native CRM closes the loop between the analytical view and the operational view. The propensity score that the data science team built is queryable from the assistant. The freight-cost trend the analytics team is watching is one click away from the account record where the rep is about to quote. The pricing analyst's "which accounts are paying off-contract" report becomes a CRM-side worklist for the reps who own those accounts.

None of that requires a sync. It only requires the CRM to be a first-class lakehouse consumer.

What the surface looks like

If you are evaluating CRMs against this test, three questions cut through the marketing:

  1. Can the AI assistant write a SQL query against my warehouse, run it, and render the result inline? Not "can it open a notebook," not "can it link to a dashboard." Can it actually execute a federated query at conversation time? If yes, federated read is real.
  2. Does my data engineering team need to install a connector to get CRM data into the lakehouse? If yes, the CRM does not have outbound Delta Sharing. If no, and the data just appears as Unity Catalog tables under the CRM's catalog, outbound is real.
  3. Does the CRM hold a local copy of any of my warehouse data? If yes, the integration is the sync-engine pattern in disguise.

Three questions, three concrete answers. They cut through the marketing because they are not asking "do you integrate." They are asking how.

The Factory Labs implementation

Factory's Lakehouse module is the implementation of both decisions:

  • The Assistant is a bidirectional MCP client + server. Outbound MCP to Databricks Genie, Snowpark, or any third-party MCP server. Inbound MCP so Claude, Cursor, and Mosaic AI can drive the CRM as a tool.
  • Outbound Delta Sharing exposes accounts, contacts, opportunities, activities, conversations as Delta tables, governed by per-tenant HKDF-derived encryption keys and a single-SELECT query-guard.
  • Iceberg REST catalog conformance is read-only and lives next to the Delta surface, so Snowflake-native shops get the same data without leaving Snowflake's tooling.
  • Federated vector search via Databricks Vector Search + Mosaic AI embeddings completes the loop for the semantic-search side of the assistant.
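The two guardrails named in the Delta Sharing bullet can be illustrated with the standard library alone. This is a minimal sketch under stated assumptions, not Factory's implementation: it derives one key per tenant with HKDF-SHA256 (RFC 5869) from a hypothetical master secret, and it applies a deliberately naive guard that accepts exactly one `SELECT` statement. The `tenant:` info prefix, the empty salt, and the rule of rejecting any interior semicolon (and `WITH` queries) are all illustrative choices.

```python
import hashlib
import hmac

HASH_LEN = 32  # SHA-256 output size in bytes


def hkdf_sha256(ikm: bytes, salt: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256: extract a PRK, then expand it."""
    if length > 255 * HASH_LEN:
        raise ValueError("requested key material is too long for HKDF")
    # Extract: an empty salt defaults to HASH_LEN zero bytes.
    prk = hmac.new(salt or b"\x00" * HASH_LEN, ikm, hashlib.sha256).digest()
    # Expand: T(i) = HMAC(PRK, T(i-1) | info | i)
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def tenant_key(master_key: bytes, tenant_id: str) -> bytes:
    """Derive a 32-byte key bound to one tenant (the info label is an assumption)."""
    return hkdf_sha256(master_key, salt=b"", info=b"tenant:" + tenant_id.encode())


def is_single_select(sql: str) -> bool:
    """Naive guard: allow one statement, and only if it starts with SELECT.

    Conservative by design: any interior semicolon is rejected, even inside
    a string literal, and WITH queries are rejected too.
    """
    body = sql.strip().rstrip(";").strip()
    if not body or ";" in body:
        return False
    return body.split(None, 1)[0].upper() == "SELECT"


master = b"\x01" * 32  # placeholder secret for the example only
print(tenant_key(master, "acme").hex())
print(is_single_select("SELECT * FROM accounts;"))      # True
print(is_single_select("SELECT 1; DROP TABLE accounts"))  # False
```

Deriving keys rather than storing one per tenant means a single master secret needs protecting, while each tenant's data is still encrypted under a distinct key. A production query guard would more likely parse the statement than inspect its text.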

The /docs/lakehouse/ section has the technical reference. The short version: if your data lives in a lakehouse, the CRM should be a tenant of that lakehouse, not a system that copies from it.

What we expect to change

The lakehouse-native CRM is currently a small category. We expect that to change over the next 18 to 24 months as the MCP ecosystem matures and Delta Sharing becomes the default way data is shared between SaaS systems. CRM vendors that are not on this trajectory will have a credibility problem with technical buyers; the question "do you have an MCP server" is becoming the new "do you have an API."

Distributors evaluating now have an asymmetric option: pick the architecture that the rest of the market is heading toward, before the laggards catch up and turn it into table stakes. The buyers who picked Salesforce in 2008 because it was the cloud-native CRM had the same window.