Lakehouse Governance
Query-guard, encrypted credentials, schema-per-tenant isolation, recipient revocation, and audit trails — the security model that makes federated reads and bidirectional MCP safe to ship.
Federated reads, bidirectional MCP, and Delta Sharing only make sense if the security model holds. This page documents the guardrails that stand between an external agent (or a chatty AI Assistant) and your warehouse.
Threat model
The Lakehouse stack is designed against three concrete threats:
| Threat | Mitigation |
|---|---|
| Prompt injection writes destructive SQL ("Drop the orders table") | Query-guard — single-SELECT enforcement, DDL/DML rejected before reaching your warehouse |
| Prompt injection exfiltrates a column ("Select all customer emails into a CSV") | Per-source allow-list + row filter — un-allow-listed columns rejected; row filters merged into every query |
| Stolen Factory database snapshot exposes other tenants' credentials | Encrypted credentials with per-tenant HKDF — leaked master key still can't decrypt other tenants' secrets |
The model is deny-by-default. Adding a tool to an Assistant requires explicit registration; no implicit "give the AI everything" path exists.
Query-guard
Every SQL query the Assistant emits against a federated warehouse passes through query-guard.ts before reaching your Databricks or Snowflake account. Five guards, applied in order:
| # | Guard | Reject code |
|---|---|---|
| 1 | Single-SELECT enforcement — exactly one statement, must be SELECT | MULTI_STATEMENT, DDL_FORBIDDEN, DML_FORBIDDEN |
| 2 | Table allow-list — table referenced in FROM / JOIN must be a registered data source | TABLE_NOT_ALLOW_LISTED |
| 3 | Column allow-list — every column referenced (incl. inside SELECT, WHERE, ORDER BY) must be in the data source's exposed_columns | COLUMN_NOT_ALLOW_LISTED |
| 4 | Row filter merge — the data source's row_filter SQL is AND-merged into the WHERE clause | (no rejection — silent enforcement) |
| 5 | LIMIT clamp + payload cap — LIMIT clamped to ≤ 200 (default); query aborted if scan exceeds 1 GiB (default, tunable per source) | LIMIT_CLAMPED (warning), BYTES_EXCEEDED (rejection) |
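To make guard 1 concrete, here is a minimal sketch assuming the node-sql-parser package; Factory's actual query-guard.ts is more thorough, but the reject codes map the same way:

```ts
import { Parser } from "node-sql-parser";

const parser = new Parser();

type GuardResult = { ok: true } | { ok: false; code: string };

// Guard 1: exactly one statement, and that statement must be a SELECT.
function enforceSingleSelect(sql: string): GuardResult {
  const ast = parser.astify(sql);
  const statements = Array.isArray(ast) ? ast : [ast];
  if (statements.length !== 1) return { ok: false, code: "MULTI_STATEMENT" };
  const type = (statements[0] as { type: string }).type;
  if (["create", "drop", "alter", "truncate"].includes(type))
    return { ok: false, code: "DDL_FORBIDDEN" };
  if (["insert", "update", "delete", "replace"].includes(type))
    return { ok: false, code: "DML_FORBIDDEN" };
  // Anything else that isn't a SELECT is treated as forbidden DDL here.
  return type === "select" ? { ok: true } : { ok: false, code: "DDL_FORBIDDEN" };
}
```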
Rejections are surfaced to the Assistant as structured tool errors:
```json
{
  "error": "Column 'customer_ssn' is not allow-listed for warehouse_table__abc123.",
  "code": "COLUMN_NOT_ALLOW_LISTED",
  "details": { "rejected_token": "customer_ssn" }
}
```

The Assistant interprets these gracefully — typically replying "I can't query that column. The admin would need to add it to the allow-list."
Every rejection is logged with the rewritten SQL, the rejected token, and the principal; see Audit trail below.
Allow-list & row filter
Each warehouse data source carries two governance fields:
| Field | What it does |
|---|---|
| `exposed_columns` | Array of column names the Assistant may SELECT. Set in wizard Step 5. |
| `row_filter` | A SQL WHERE clause AND-merged into every Assistant query. Useful for hiding rows belonging to other tenants, regions, or sensitive segments. |
Examples:
```sql
-- exposed_columns: ["account_id", "order_total", "ordered_at"]
-- row_filter:      "region = 'NA' AND ordered_at > '2024-01-01'"

-- Assistant emits:
SELECT account_id, SUM(order_total)
FROM orders
WHERE account_id IN ('acc_1', 'acc_2')
GROUP BY account_id

-- Becomes (after guard rewrite):
SELECT account_id, SUM(order_total)
FROM orders
WHERE (account_id IN ('acc_1', 'acc_2'))
  AND (region = 'NA' AND ordered_at > '2024-01-01')
GROUP BY account_id
LIMIT 200
```

The row filter is invisible to the Assistant — it can't observe or evade it. If the model emits SQL referencing a column outside `exposed_columns` (e.g. `SELECT customer_ssn FROM orders`), guard 3 rejects it before the rewrite even happens.
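For illustration, a simplified AST-level version of guards 4 and 5, again assuming node-sql-parser (the production rewrite also handles subqueries, dialect differences, and the bytes-scanned cap):

```ts
import { Parser } from "node-sql-parser";

const parser = new Parser();

// Guards 4 and 5 in miniature: AND-merge the row filter, then clamp LIMIT.
function rewriteQuery(sql: string, rowFilter: string, maxLimit = 200): string {
  const ast: any = parser.astify(sql);
  // Parse the row filter by wrapping it in a throwaway SELECT.
  const filter: any = parser.astify(`SELECT 1 FROM t WHERE ${rowFilter}`);
  ast.where = ast.where
    ? { type: "binary_expr", operator: "AND", left: ast.where, right: filter.where }
    : filter.where;
  // Clamp LIMIT to min(requested, maxLimit); "seperator" is node-sql-parser's spelling.
  const requested = ast.limit?.value?.[0]?.value ?? maxLimit;
  ast.limit = {
    seperator: "",
    value: [{ type: "number", value: Math.min(requested, maxLimit) }],
  };
  return parser.sqlify(ast);
}
```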
Encrypted credentials
Every credential stored by the Lakehouse stack is encrypted at rest:
- Databricks PATs and OAuth M2M secrets (federated reads)
- Snowflake RSA private keys (Snowflake federated reads)
- Bearer tokens for inbound MCP servers (MCP bridge)
- Bearer tokens for outbound Delta Sharing recipients (Delta Sharing)
The encryption is AES-256-GCM with a per-tenant key derived via HKDF from a deployment-wide master key:
```text
tenant_key = HKDF-SHA256(
  master_key,
  salt   = tenant_id,
  info   = "factory-lakehouse-credentials-v1",
  length = 32
)

ciphertext, iv, tag = AES-256-GCM(tenant_key, plaintext)
```

What this buys you:
- Compromise of one tenant's key does not compromise others — each tenant's secrets decrypt only with that tenant's derived key.
- Master key leak is not catastrophic — an attacker would still need to derive each tenant's key (cheap individually, but rate-limitable in practice).
- Credentials never round-trip to the browser — the wizard uploads them once over HTTPS; the `WAREHOUSE_REVEAL_CREDENTIAL` capability is intentionally absent. Editing a credential means re-uploading it.
The master key is sourced from WAREHOUSE_MASTER_KEY (32-byte base64) on the deployment. Rotate by setting WAREHOUSE_MASTER_KEY_NEXT, redeploying, then running the migration script that re-encrypts all rows under the new key — see the runbook for the procedure (operational, not user-facing).
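A minimal sketch of the derive-then-encrypt step using Node's built-in crypto; the function names (`deriveTenantKey`, `encryptForTenant`) are illustrative, not Factory's internal API:

```ts
import { hkdfSync, randomBytes, createCipheriv } from "node:crypto";

const INFO = "factory-lakehouse-credentials-v1";

// HKDF-SHA256 with the tenant id as salt yields a distinct 32-byte key per tenant.
function deriveTenantKey(masterKey: Buffer, tenantId: string): Buffer {
  return Buffer.from(hkdfSync("sha256", masterKey, Buffer.from(tenantId), INFO, 32));
}

function encryptForTenant(masterKey: Buffer, tenantId: string, plaintext: string) {
  const key = deriveTenantKey(masterKey, tenantId);
  const iv = randomBytes(12); // 96-bit nonce, the conventional size for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return { ciphertext, iv, tag: cipher.getAuthTag() }; // all three are stored at rest
}
```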
Schema-per-tenant isolation
Beneath the Lakehouse stack, Factory uses schema-per-tenant Postgres isolation. Every tenant's data — including:

- `warehouse_connections`
- `warehouse_data_sources`
- `warehouse_query_log`
- `warehouse_vector_indexes`
- `warehouse_vector_call_log`
- `integration_connections` (for inbound MCP clients)
- `share_recipients`
- `share_publish_runs`

…lives in its own Postgres schema. Cross-tenant SELECTs are not just disallowed by application logic — they're physically impossible, because the application's per-request connection sets `search_path` to the tenant's schema only. No row-level-security retrofits, no risk of a forgotten `WHERE tenant_id = ?`.
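A sketch of what that per-request pinning looks like, assuming node-postgres; the schema-name mapping and escaping are simplified here:

```ts
import { Pool, type PoolClient } from "pg";

const pool = new Pool();

// Run a request's queries with the connection pinned to the tenant's schema.
async function withTenantSchema<T>(
  tenantSchema: string, // assumed pre-validated against [a-z0-9_]+
  fn: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query(`SET search_path TO "${tenantSchema}"`);
    return await fn(client);
  } finally {
    // Reset before returning the connection to the pool so the next
    // request can never inherit another tenant's search_path.
    await client.query("SET search_path TO DEFAULT");
    client.release();
  }
}
```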
This applies to every API surface: the Custom Agent builder in tenant B never sees tools from tenant A; the dynamic tool catalog (GET /api/v1/admin/agents/tool-catalog) returns only the tenant's own connections; Iceberg REST and Delta Sharing endpoints reject cross-tenant bearer tokens with 401.
Recipient revocation
Delta Sharing bearer tokens are not cached. The Delta Sharing endpoint validates every request against the `share_recipients` table, so revocation propagates within seconds:

- Admin clicks Revoke on a recipient row.
- `share_recipients.status` flips to `revoked` (a single Postgres update, < 50 ms).
- The next read attempt from the consumer fails with `HTTP 401: invalid_bearer`.
There is no grace period by design. If you need temporary access, create a new recipient with the desired lifetime rather than relying on token TTL.
The same model applies to Iceberg REST — same bearer tokens, same instant revocation.
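A hypothetical shape of that per-request check — `bearer_token_hash` and the exact column names are assumptions; only `share_recipients` and its status flip are documented above:

```ts
import type { PoolClient } from "pg";

// Validate a Delta Sharing / Iceberg REST bearer token on every request.
async function validateBearer(client: PoolClient, tokenHash: string): Promise<string> {
  const { rows } = await client.query(
    `SELECT id FROM share_recipients
      WHERE bearer_token_hash = $1 AND status != 'revoked'`,
    [tokenHash]
  );
  if (rows.length === 0) {
    throw Object.assign(new Error("invalid_bearer"), { status: 401 });
  }
  return rows[0].id; // recipient id, used for share_recipients_access_log
}
```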
MCP client revocation
For inbound MCP clients, "revocation" is two-pronged:
| Action | Effect |
|---|---|
| Disable connection in the MCP Clients page | Toggle off — Factory stops invoking the remote server. Existing tools disappear from the Custom Agent builder. |
| Delete connection | Encrypted bearer token is purged from integration_connections. The remote server is unaware (no token revocation API exists in MCP); any other holder of that token can still use it on the remote side. |
Best practice: rotate the underlying bearer token on the remote side (Databricks, Mosaic AI, custom server) when offboarding an integration. Then delete the Factory-side connection to clean up.
Audit trail
Every byte that crosses the boundary is logged:
| Table | What gets logged |
|---|---|
| `warehouse_query_log` | Federated SQL queries — rewritten SQL, principal, latency, rows returned, bytes scanned, status (ok / rejected), structured rejection code |
| `warehouse_vector_call_log` | Vector Search calls — query text, result count, bytes returned, latency, budget-exceeded flag |
| `mcp_client_call_log` | Inbound MCP tool calls — tool name, args (truncated), result (truncated), error code if any |
| `share_publish_runs` | Delta Sharing publish runs — rows per table, bytes written, watermark, error if any |
| `share_recipients_access_log` | Delta Sharing reads — recipient, table, bytes egressed, IP (X-Forwarded-For), user agent |
| `audit_log` | Schema changes — connection created/edited/deleted, recipient added/revoked, allow-list edited |
All log tables live in the per-tenant schema (no cross-tenant leakage even in audit data). Retention defaults to 90 days; tunable per deployment.
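As an example, a triage query over the last day's guard rejections might look like the following; the column names are inferred from the table above, not a documented schema:

```ts
import type { PoolClient } from "pg";

// List yesterday's rejected federated queries, newest first.
async function recentRejections(client: PoolClient) {
  const { rows } = await client.query(`
    SELECT created_at, principal, rejection_code, rewritten_sql
      FROM warehouse_query_log
     WHERE status = 'rejected'
       AND created_at > now() - interval '1 day'
     ORDER BY created_at DESC
  `);
  return rows;
}
```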
Same auth on outbound MCP
Factory's outbound MCP server (/api/mcp) uses OAuth 2.0 authorization_code + PKCE — the same auth surface as the rest of the API. External agents (Claude, Cursor, Mosaic AI) authenticate against your existing OAuth clients with their assigned scopes. There is no "API key for AI agents" parallel auth path — your tenant boundary, your roles, your audit trail, regardless of who's calling.
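For reference, generating the RFC 7636 verifier/challenge pair the flow expects takes a few lines of Node crypto:

```ts
import { randomBytes, createHash } from "node:crypto";

// PKCE, S256 method: send code_challenge on the /authorize request,
// then present code_verifier when exchanging the authorization code.
const codeVerifier = randomBytes(32).toString("base64url");
const codeChallenge = createHash("sha256").update(codeVerifier).digest("base64url");
```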
See MCP Authentication for the full PKCE walkthrough and Connect Databricks Mosaic AI Agents for the outbound-direction setup.
Token budget
Vector Search results can be arbitrarily large. To keep the Assistant's context window manageable (and prompt costs predictable), every federated retrieval call is bounded:
| Limit | Default | Tunable per index? |
|---|---|---|
| `num_results` | 5 | Yes — max 20 |
| `max_chars_per_result` | 2,000 | No (deployment-wide) |
| `total_chars_per_call` | 8,000 | No (deployment-wide) |
When total_chars_per_call is exceeded, the lowest-scoring hits drop until under budget. The truncation is logged in warehouse_vector_call_log.budget_exceeded for observability.
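A sketch of the drop-until-under-budget rule, using the defaults from the table above (the `Hit` shape and function name are illustrative):

```ts
interface Hit { text: string; score: number }

const MAX_CHARS_PER_RESULT = 2000;
const TOTAL_CHARS_PER_CALL = 8000;

// Clamp each hit, then drop the lowest-scoring hits until the call fits.
function applyTokenBudget(hits: Hit[]): { kept: Hit[]; budgetExceeded: boolean } {
  const kept = hits
    .map(h => ({ ...h, text: h.text.slice(0, MAX_CHARS_PER_RESULT) }))
    .sort((a, b) => b.score - a.score);
  let total = kept.reduce((sum, h) => sum + h.text.length, 0);
  let budgetExceeded = false;
  while (total > TOTAL_CHARS_PER_CALL && kept.length > 0) {
    total -= kept.pop()!.text.length; // lowest score is last after the sort
    budgetExceeded = true;
  }
  return { kept, budgetExceeded };
}
```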
Feature flags
The Lakehouse stack is gated by deployment-wide flags, so you can turn individual capabilities on or off per environment:
| Variable | Default | Controls |
|---|---|---|
| `WAREHOUSE_INTEGRATION_ENABLED` | false | Federated reads, Vector Search, inbound MCP — flip to true to expose the wizards and tools |
| `DELTA_SHARING_BASE_URL` | unset | Outbound Delta Sharing — when unset, the publish UI shows "not configured" and the cron skips |
| `ICEBERG_REST_ENABLED` | false | Iceberg REST catalog discovery — when false, the endpoint returns 404 |
| `BLOB_READ_WRITE_TOKEN` | unset | Required for Delta Sharing publish — destination for parquet + log files |
| `CRON_SECRET` | unset | Required for the publish cron trigger endpoint |
| `WAREHOUSE_MASTER_KEY` | unset | 32-byte base64 — master key for HKDF-derived per-tenant credential encryption |
| `WAREHOUSE_MASTER_KEY_NEXT` | unset | Optional — staging key for rotation |
All flags are documented in .env.example and synced to Vercel via the vercel-env-sync procedure (internal).
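As an illustration of the gating pattern, the Iceberg REST behavior described above amounts to a one-line check in a route handler; this sketch assumes a fetch-style handler and is not Factory's actual code:

```ts
// Return 404 when the Iceberg REST catalog is disabled for this deployment.
export async function GET(): Promise<Response> {
  if (process.env.ICEBERG_REST_ENABLED !== "true") {
    return new Response(null, { status: 404 });
  }
  return Response.json({ catalogs: [] /* ...catalog discovery payload... */ });
}
```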
Compliance posture
The Lakehouse stack inherits Factory Labs's broader compliance posture:
- Data residency — federated reads never copy your warehouse data; Delta Sharing files are written to your configured Vercel Blob region (typically the same as your CRM database).
- PII handling — Delta Sharing publish hashes `email` and `phone` columns with SHA-256; the `bodies` column on `activities` is excluded entirely.
- Audit retention — per-tenant log tables, 90-day default, customizable per deployment.
- SOC 2 / ISO 27001 — same controls apply; the Lakehouse stack adds no new in-scope subprocessors beyond what's already disclosed.
For the full compliance dossier, see Security & Trust.
Related guides
- Connect Databricks (warehouse) — guard behavior in context.
- Connect Snowflake — guards apply identically across providers.
- Delta Sharing publish — revocation procedure.
- MCP bridge — structured error codes returned by inbound MCP calls.