Lakehouse Governance
Query-guard, encrypted credentials, schema-per-tenant isolation, recipient revocation, and audit trails — the security model that makes federated reads and bidirectional MCP safe to ship.
Federated reads, bidirectional MCP, and Delta Sharing only make sense if the security model holds. This page documents the guardrails that stand between an external agent (or a chatty AI Assistant) and your warehouse.
Threat model
The Lakehouse stack is designed against three concrete threats:
| Threat | Mitigation |
|---|---|
| Prompt injection writes destructive SQL ("Drop the orders table") | Query-guard — single-SELECT enforcement, DDL/DML rejected before reaching your warehouse |
| Prompt injection exfiltrates a column ("Select all customer emails into a CSV") | Per-source allow-list + row filter — un-allow-listed columns rejected; row filters merged into every query |
| Stolen Factory database snapshot exposes other tenants' credentials | Encrypted credentials with per-tenant HKDF — leaked master key still can't decrypt other tenants' secrets |
The model is deny-by-default. Adding a tool to an Assistant requires explicit registration; no implicit "give the AI everything" path exists.
Query-guard
Every SQL query the Assistant emits against a federated warehouse passes through query-guard.ts before reaching your Databricks or Snowflake account. Five guards, applied in order:
| # | Guard | Reject code |
|---|---|---|
| 1 | Single-SELECT enforcement — exactly one statement, must be SELECT | MULTI_STATEMENT, DDL_FORBIDDEN, DML_FORBIDDEN |
| 2 | Table allow-list — table referenced in FROM / JOIN must be a registered data source | TABLE_NOT_ALLOW_LISTED |
| 3 | Column allow-list — every column referenced (incl. inside SELECT, WHERE, ORDER BY) must be in the data source's exposed_columns | COLUMN_NOT_ALLOW_LISTED |
| 4 | Row filter merge — the data source's row_filter SQL is AND-merged into the WHERE clause | (no rejection — silent enforcement) |
| 5 | LIMIT clamp + payload cap — LIMIT clamped to ≤ 200 (default); query aborted if scan exceeds 1 GiB (default, tunable per source) | LIMIT_CLAMPED (warning), BYTES_EXCEEDED (rejection) |
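To make guard 1 concrete, here is a minimal sketch assuming the node-sql-parser package; Factory's actual query-guard.ts is more thorough, but the reject codes map the same way:

```ts
import { Parser } from "node-sql-parser";

const parser = new Parser();

type GuardResult = { ok: true } | { ok: false; code: string };

// Guard 1: exactly one statement, and that statement must be a SELECT.
function enforceSingleSelect(sql: string): GuardResult {
  const ast = parser.astify(sql);
  const statements = Array.isArray(ast) ? ast : [ast];
  if (statements.length !== 1) return { ok: false, code: "MULTI_STATEMENT" };
  const type = (statements[0] as { type: string }).type;
  if (["create", "drop", "alter", "truncate"].includes(type))
    return { ok: false, code: "DDL_FORBIDDEN" };
  if (["insert", "update", "delete", "replace"].includes(type))
    return { ok: false, code: "DML_FORBIDDEN" };
  // Anything else that isn't a SELECT is treated as forbidden DDL here.
  return type === "select" ? { ok: true } : { ok: false, code: "DDL_FORBIDDEN" };
}
```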
Rejections are surfaced to the Assistant as structured tool errors:
```json
{
  "error": "Column 'customer_ssn' is not allow-listed for warehouse_table__abc123.",
  "code": "COLUMN_NOT_ALLOW_LISTED",
  "details": { "rejected_token": "customer_ssn" }
}
```

The Assistant interprets these gracefully — typically replying "I can't query that column. The admin would need to add it to the allow-list."
Every rejection is logged with the rewritten SQL, the rejected token, and the principal; see Audit trail below.
Allow-list & row filter
Each warehouse data source carries two governance fields:
| Field | What it does |
|---|---|
| `exposed_columns` | Array of column names the Assistant may SELECT. Set in wizard Step 5. |
| `row_filter` | A SQL WHERE clause AND-merged into every Assistant query. Useful for hiding rows belonging to other tenants, regions, or sensitive segments. |
Examples:
```sql
-- exposed_columns: ["account_id", "order_total", "ordered_at"]
-- row_filter:      "region = 'NA' AND ordered_at > '2024-01-01'"

-- Assistant emits:
SELECT account_id, SUM(order_total)
FROM orders
WHERE account_id IN ('acc_1', 'acc_2')
GROUP BY account_id

-- Becomes (after guard rewrite):
SELECT account_id, SUM(order_total)
FROM orders
WHERE (account_id IN ('acc_1', 'acc_2'))
  AND (region = 'NA' AND ordered_at > '2024-01-01')
GROUP BY account_id
LIMIT 200
```

The row filter is invisible to the Assistant — it can't observe or evade it. If the model emits SQL referencing a column outside `exposed_columns` (e.g. `SELECT customer_ssn FROM orders`), guard 3 rejects it before the rewrite even happens.
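For illustration, a simplified AST-level version of guards 4 and 5, again assuming node-sql-parser (the production rewrite also handles subqueries, dialect differences, and the bytes-scanned cap):

```ts
import { Parser } from "node-sql-parser";

const parser = new Parser();

// Guards 4 and 5 in miniature: AND-merge the row filter, then clamp LIMIT.
function rewriteQuery(sql: string, rowFilter: string, maxLimit = 200): string {
  const ast: any = parser.astify(sql);
  // Parse the row filter by wrapping it in a throwaway SELECT.
  const filter: any = parser.astify(`SELECT 1 FROM t WHERE ${rowFilter}`);
  ast.where = ast.where
    ? { type: "binary_expr", operator: "AND", left: ast.where, right: filter.where }
    : filter.where;
  // Clamp LIMIT to min(requested, maxLimit); "seperator" is node-sql-parser's spelling.
  const requested = ast.limit?.value?.[0]?.value ?? maxLimit;
  ast.limit = {
    seperator: "",
    value: [{ type: "number", value: Math.min(requested, maxLimit) }],
  };
  return parser.sqlify(ast);
}
```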
Encrypted credentials
Every credential stored by the Lakehouse stack is encrypted at rest:
- Databricks PATs and OAuth M2M secrets (federated reads)
- Snowflake RSA private keys (Snowflake federated reads)
- Bearer tokens for inbound MCP servers (MCP bridge)
- Bearer tokens for outbound Delta Sharing recipients (Delta Sharing)
The encryption is AES-256-GCM with a per-tenant key derived via HKDF from a deployment-wide master key:
```text
tenant_key = HKDF-SHA256(
  master_key,
  salt   = tenant_id,
  info   = "factory-lakehouse-credentials-v1",
  length = 32
)

ciphertext, iv, tag = AES-256-GCM(tenant_key, plaintext)
```

What this buys you:
- Compromise of one tenant's key does not compromise others — each tenant's secrets decrypt only with that tenant's derived key.
- Master key leak is not catastrophic — an attacker would still need to derive each tenant's key (cheap individually, but rate-limitable in practice).
- Credentials never round-trip to the browser — the wizard uploads them once over HTTPS; the `WAREHOUSE_REVEAL_CREDENTIAL` capability is intentionally absent. Editing a credential means re-uploading it.
The master key is sourced from WAREHOUSE_MASTER_KEY (32-byte base64) on the deployment. Rotate by setting WAREHOUSE_MASTER_KEY_NEXT, redeploying, then running the migration script that re-encrypts all rows under the new key — see the runbook for the procedure (operational, not user-facing).
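A minimal sketch of the derive-then-encrypt step using Node's built-in crypto; the function names (`deriveTenantKey`, `encryptForTenant`) are illustrative, not Factory's internal API:

```ts
import { hkdfSync, randomBytes, createCipheriv } from "node:crypto";

const INFO = "factory-lakehouse-credentials-v1";

// HKDF-SHA256 with the tenant id as salt yields a distinct 32-byte key per tenant.
function deriveTenantKey(masterKey: Buffer, tenantId: string): Buffer {
  return Buffer.from(hkdfSync("sha256", masterKey, Buffer.from(tenantId), INFO, 32));
}

function encryptForTenant(masterKey: Buffer, tenantId: string, plaintext: string) {
  const key = deriveTenantKey(masterKey, tenantId);
  const iv = randomBytes(12); // 96-bit nonce, the conventional size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { ciphertext, iv, tag: cipher.getAuthTag() }; // all three are stored at rest
}
```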
Schema-per-tenant isolation
Beneath the Lakehouse stack, Factory uses schema-per-tenant Postgres isolation. Every tenant's data — including:

- `warehouse_connections`
- `warehouse_data_sources`
- `warehouse_query_log`
- `warehouse_vector_indexes`
- `warehouse_vector_call_log`
- `integration_connections` (for inbound MCP clients)
- `share_recipients`
- `share_publish_runs`

…lives in its own Postgres schema. Cross-tenant SELECTs are not just disallowed by application logic — they're physically impossible, because the application's per-request connection sets `search_path` to the tenant's schema only. No row-level-security retrofits, no risk of a forgotten `WHERE tenant_id = ?`.
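A sketch of what that per-request pinning looks like, assuming node-postgres; the schema-name mapping and escaping are simplified here:

```ts
import { Pool, type PoolClient } from "pg";

const pool = new Pool();

// Run a request's queries with the connection pinned to the tenant's schema.
async function withTenantSchema<T>(
  tenantSchema: string, // assumed pre-validated against [a-z0-9_]+
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query(`SET search_path TO "${tenantSchema}"`);
    return await fn(client);
  } finally {
    // Reset before returning the connection to the pool so the next
    // request can never inherit another tenant's search_path.
    await client.query("SET search_path TO DEFAULT");
    client.release();
  }
}
```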
This applies to every API surface: the Custom Agent builder in tenant B never sees tools from tenant A; the dynamic tool catalog (GET /api/v1/admin/agents/tool-catalog) returns only the tenant's own connections; Iceberg REST and Delta Sharing endpoints reject cross-tenant bearer tokens with 401.
Recipient revocation
Delta Sharing bearer tokens are not cached. The Delta Sharing endpoint validates every request against the `share_recipients` table, so revocation propagates within seconds:

- Admin clicks Revoke on a recipient row.
- `share_recipients.status` flips to `revoked` (a single Postgres update, < 50 ms).
- The next read attempt from the consumer fails with `HTTP 401: invalid_bearer`.
There is no grace period by design. If you need temporary access, create a new recipient with the desired lifetime rather than relying on token TTL.
The same model applies to Iceberg REST — same bearer tokens, same instant revocation.
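A hypothetical shape of that per-request check — `bearer_token_hash` and the exact column names are assumptions; only `share_recipients` and its status flip are documented above:

```ts
import type { PoolClient } from "pg";

// Validate a Delta Sharing / Iceberg REST bearer token on every request.
async function validateBearer(client: PoolClient, tokenHash: string): Promise<string> {
  const { rows } = await client.query(
    `SELECT id FROM share_recipients
      WHERE bearer_token_hash = $1 AND status != 'revoked'`,
    [tokenHash]
  );
  if (rows.length === 0) {
    throw Object.assign(new Error("invalid_bearer"), { status: 401 });
  }
  return rows[0].id; // recipient id, used for share_recipients_access_log
}
```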
MCP client revocation
For inbound MCP clients, "revocation" is two-pronged:
| Action | Effect |
|---|---|
| Disable connection in the MCP Clients page | Toggle off — Factory stops invoking the remote server. Existing tools disappear from the Custom Agent builder. |
| Delete connection | Encrypted bearer token is purged from integration_connections. The remote server is unaware (no token revocation API exists in MCP); any other holder of that token can still use it on the remote side. |
Best practice: rotate the underlying bearer token on the remote side (Databricks, Mosaic AI, custom server) when offboarding an integration. Then delete the Factory-side connection to clean up.
Audit trail
Every byte that crosses the boundary is logged:
| Table | What gets logged |
|---|---|
| `warehouse_query_log` | Federated SQL queries — rewritten SQL, principal, latency, rows returned, bytes scanned, status (ok / rejected), structured rejection code |
| `warehouse_vector_call_log` | Vector Search calls — query text, result count, bytes returned, latency, budget-exceeded flag |
| `mcp_client_call_log` | Inbound MCP tool calls — tool name, args (truncated), result (truncated), error code if any |
| `share_publish_runs` | Delta Sharing publish runs — rows per table, bytes written, watermark, error if any |
| `share_recipients_access_log` | Delta Sharing reads — recipient, table, bytes egressed, IP (X-Forwarded-For), user agent |
| `audit_log` | Schema changes — connection created/edited/deleted, recipient added/revoked, allow-list edited |
All log tables live in the per-tenant schema (no cross-tenant leakage even in audit data). Retention defaults to 90 days; tunable per deployment.
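As an example, a triage query over the last day's guard rejections might look like the following; the column names are inferred from the table above, not a documented schema:

```ts
import type { PoolClient } from "pg";

// List yesterday's rejected federated queries, newest first.
async function recentRejections(client: PoolClient) {
  const { rows } = await client.query(`
    SELECT created_at, principal, rejection_code, rewritten_sql
      FROM warehouse_query_log
     WHERE status = 'rejected'
       AND created_at > now() - interval '1 day'
     ORDER BY created_at DESC
  `);
  return rows;
}
```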
Same auth on outbound MCP
Factory's outbound MCP server (/api/mcp) uses OAuth 2.0 authorization_code + PKCE — the same auth surface as the rest of the API. External agents (Claude, Cursor, Mosaic AI) authenticate against your existing OAuth clients with their assigned scopes. There is no "API key for AI agents" parallel auth path — your tenant boundary, your roles, your audit trail, regardless of who's calling.
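For reference, generating the RFC 7636 verifier/challenge pair the flow expects takes a few lines of Node crypto:

```ts
import { randomBytes, createHash } from "node:crypto";

// PKCE, S256 method: send code_challenge on the /authorize request,
// then present code_verifier when exchanging the authorization code.
const codeVerifier = randomBytes(32).toString("base64url");
const codeChallenge = createHash("sha256").update(codeVerifier).digest("base64url");
```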
See MCP Authentication for the full PKCE walkthrough and Connect Databricks Mosaic AI Agents for the outbound-direction setup.
Token budget
Vector Search results can be arbitrarily large. To keep the Assistant's context window manageable (and prompt costs predictable), every federated retrieval call is bounded:
| Limit | Default | Tunable per index? |
|---|---|---|
| `num_results` | 5 | Yes — max 20 |
| `max_chars_per_result` | 2,000 | No (deployment-wide) |
| `total_chars_per_call` | 8,000 | No (deployment-wide) |
When total_chars_per_call is exceeded, the lowest-scoring hits drop until under budget. The truncation is logged in warehouse_vector_call_log.budget_exceeded for observability.
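A sketch of the drop-until-under-budget rule, using the defaults from the table above (the `Hit` shape and function name are illustrative):

```ts
interface Hit { text: string; score: number }

const MAX_CHARS_PER_RESULT = 2000;
const TOTAL_CHARS_PER_CALL = 8000;

// Clamp each hit, then drop the lowest-scoring hits until the call fits.
function applyTokenBudget(hits: Hit[]): { kept: Hit[]; budgetExceeded: boolean } {
  const kept = hits
    .map(h => ({ ...h, text: h.text.slice(0, MAX_CHARS_PER_RESULT) }))
    .sort((a, b) => b.score - a.score);
  let total = kept.reduce((sum, h) => sum + h.text.length, 0);
  let budgetExceeded = false;
  while (total > TOTAL_CHARS_PER_CALL && kept.length > 0) {
    total -= kept.pop()!.text.length; // lowest score is last after the sort
    budgetExceeded = true;
  }
  return { kept, budgetExceeded };
}
```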
Feature flags
The Lakehouse stack is gated by deployment-wide flags, so you can turn individual capabilities on or off per environment:
| Variable | Default | Controls |
|---|---|---|
| `WAREHOUSE_INTEGRATION_ENABLED` | false | Federated reads, Vector Search, inbound MCP — flip to true to expose the wizards and tools |
| `DELTA_SHARING_BASE_URL` | unset | Outbound Delta Sharing — when unset, the publish UI shows "not configured" and the cron skips |
| `ICEBERG_REST_ENABLED` | false | Iceberg REST catalog discovery — when false, the endpoint returns 404 |
| `BLOB_READ_WRITE_TOKEN` | unset | Required for Delta Sharing publish — destination for parquet + log files |
| `CRON_SECRET` | unset | Required for the publish cron trigger endpoint |
| `WAREHOUSE_MASTER_KEY` | unset | 32-byte base64 — master key for HKDF-derived per-tenant credential encryption |
| `WAREHOUSE_MASTER_KEY_NEXT` | unset | Optional — staging key for rotation |
All flags are documented in .env.example and synced to Vercel via the vercel-env-sync procedure (internal).
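As an illustration of the gating pattern, the Iceberg REST behavior described above amounts to a one-line check in a route handler; this sketch assumes a fetch-style handler and is not Factory's actual code:

```ts
// Return 404 when the Iceberg REST catalog is disabled for this deployment.
export async function GET(): Promise<Response> {
  if (process.env.ICEBERG_REST_ENABLED !== "true") {
    return new Response(null, { status: 404 });
  }
  return Response.json({ catalogs: [] /* ...catalog discovery payload... */ });
}
```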
Compliance posture
The Lakehouse stack inherits Factory Labs's broader compliance posture:
- Data residency — federated reads never copy your warehouse data; Delta Sharing files are written to your configured Vercel Blob region (typically the same as your CRM database).
- PII handling — Delta Sharing publish hashes `email` and `phone` columns with SHA-256; the `bodies` column on `activities` is excluded entirely.
- Audit retention — per-tenant log tables, 90-day default, customizable per deployment.
- SOC 2 / ISO 27001 — same controls apply; the Lakehouse stack adds no new in-scope subprocessors beyond what's already disclosed.
For the full compliance dossier, see Security & Trust.
Related guides
- Connect Databricks (warehouse) — guard behavior in context.
- Connect Snowflake — guards apply identically across providers.
- Delta Sharing publish — revocation procedure.
- MCP bridge — structured error codes returned by inbound MCP calls.