Factory Labs

Connect Databricks (Warehouse)

Expose Databricks SQL warehouse tables as governed AI Assistant skills via federated reads. OAuth M2M or PAT, ~5 minutes through the wizard.

This guide sets up the federated reads stream — the AI Assistant queries your Databricks SQL warehouse live as governed skills, with no copy of your data.

Two complementary modes. This page is the inbound direction (Factory → Databricks SQL warehouse). For the outbound direction (a Databricks Mosaic AI agent calling into Factory), see Connect Databricks Mosaic AI Agents.

Prerequisites

  • A Databricks workspace with at least one SQL Warehouse running.
  • One of:
    • A Personal Access Token (PAT) scoped to the schema(s) you want to expose. Generate it in User Settings → Developer → Access tokens.
    • OAuth M2M credentials (a service principal with client_id / client_secret and the warehouse permissions). Recommended for production.
  • Admin role in your Factory Labs workspace.

The PAT or service principal needs USE CATALOG + USE SCHEMA + SELECT on every table you plan to expose, and CAN_USE on the SQL warehouse.
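As a sketch of that minimum grant set, the helper below builds the Unity Catalog GRANT statements for one schema (the function name and example identifiers are illustrative; note that CAN_USE on the warehouse itself is assigned through warehouse permissions in the Databricks UI or Permissions API, not through SQL):

```python
def grant_statements(catalog, schema, tables, principal):
    """Build the minimal Unity Catalog grant set for one exposed schema."""
    stmts = [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{principal}`;",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{principal}`;",
    ]
    # SELECT is granted per table, not schema-wide, to keep the blast radius small.
    stmts += [
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{t} TO `{principal}`;"
        for t in tables
    ]
    return stmts

for stmt in grant_statements("main", "sales", ["orders"], "svc_warehouse"):
    print(stmt)
```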

Step 1 — Open the wizard

Go to Settings → Integrations → Databricks (Warehouse) → Connect, or directly to:

/settings/integrations/warehouse/databricks

The 6-step wizard renders. Total wall-clock target: ≤ 5 minutes for the first table, < 2 min for each additional table on the same connection.

Step 2 — Drop in credentials

Where to find each field:

  • Workspace URL: https://<workspace>.cloud.databricks.com (or your custom subdomain)
  • Warehouse ID: SQL Warehouses → your warehouse → "Connection details" tab
  • Auth: pick PAT or OAuth M2M
  • PAT: the token you generated in User Settings
  • OAuth M2M: client_id, client_secret, and the OAuth token_endpoint (defaults to https://<workspace>/oidc/v1/token)
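If you want to sanity-check the OAuth M2M credentials before the wizard does, the sketch below builds (but does not send) the standard client_credentials exchange against the token endpoint. The helper name is illustrative; the endpoint path and all-apis scope follow Databricks' M2M OAuth convention:

```python
import base64
import urllib.parse
import urllib.request

def build_token_request(workspace_url, client_id, client_secret):
    """Build (but do not send) the client_credentials request for an M2M token."""
    token_endpoint = f"{workspace_url}/oidc/v1/token"  # the wizard's default
    body = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": "all-apis"}
    ).encode()
    basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        token_endpoint,
        data=body,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )

req = build_token_request("https://acme.cloud.databricks.com", "my-client-id", "my-secret")
# urllib.request.urlopen(req) would return JSON containing access_token / expires_in
```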

Click Save & continue. A row is created in warehouse_connections with status untested. The credentials are encrypted at rest with AES-256-GCM and a per-tenant key derived via HKDF — never round-tripped to the browser. See Governance → Encrypted credentials.
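For intuition on the per-tenant key, here is a stdlib-only sketch of the HKDF derivation half (function names and the salt label are illustrative assumptions; the actual AES-256-GCM encryption would sit on top of the derived 32-byte key):

```python
import hashlib
import hmac

def hkdf_sha256(ikm, salt, info, length=32):
    """RFC 5869 HKDF-SHA256: extract a PRK, then expand it to `length` bytes."""
    prk = hmac.new(salt, ikm, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]

def tenant_key(master_key, tenant_id):
    # A 32-byte AES-256 key bound to one tenant; the salt label is illustrative.
    return hkdf_sha256(master_key, salt=b"warehouse-credentials", info=tenant_id.encode())
```

Because the tenant id is mixed in as HKDF info, two tenants never share a key even under the same master key.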

Step 3 — Test the connection

Click Run test. The platform issues a single SELECT 1 through your warehouse and:

  • Verifies auth resolves (OAuth M2M issues a fresh token; PAT is checked directly).
  • Captures the round-trip latency.
  • Surfaces the principal that authenticated (e.g. principalHint = svc_warehouse).
  • Flips the connection status from untested → ok.
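The probe itself can be approximated against any DB-API style connection factory (the function name is illustrative):

```python
import time

def run_connection_test(connect):
    """Probe a warehouse through a connection factory: one SELECT 1, timed."""
    start = time.perf_counter()
    conn = connect()
    try:
        cur = conn.cursor()
        cur.execute("SELECT 1")
        row = cur.fetchone()
    finally:
        conn.close()
    latency_ms = (time.perf_counter() - start) * 1000
    return {"ok": row == (1,), "latency_ms": round(latency_ms, 1)}
```

With the official databricks-sql-connector you would pass something like `lambda: sql.connect(server_hostname=..., http_path=..., access_token=...)` as the factory.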

If the test fails, the wizard shows a structured error code and a fix hint:

  • AUTH_FAILED: PAT expired or revoked; OAuth client lacks the warehouse permission
  • WAREHOUSE_OFFLINE: the SQL Warehouse is STOPPED — start it from the Databricks UI
  • CATALOG_MISSING: default catalog/schema couldn't be resolved with the principal's grants
  • RATE_LIMITED: too many concurrent test calls — wait 30s

Step 4 — Pick a table

Browse catalog → schema → table using SHOW CATALOGS / SCHEMAS / TABLES. The picker only shows what the principal can see, so a least-privilege PAT will surface a focused shortlist.

Pick a table, and Factory Labs:

  • Renders the first 3 rows in a preview grid (sampled, with type badges).
  • Captures the column metadata snapshot — name, type, nullability — for the allow-list editor.
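A rough equivalent of both bullets with a DB-API cursor (the helper name is illustrative; the platform's actual sampling may differ):

```python
def preview_and_snapshot(cursor, table, rows=3):
    """Sample the first few rows and capture the column metadata snapshot."""
    cursor.execute(f"SELECT * FROM {table} LIMIT {rows}")
    # DB-API cursor.description is a 7-tuple: (name, type_code, ..., null_ok)
    columns = [
        {"name": d[0], "type": d[1], "nullable": d[6]} for d in cursor.description
    ]
    return cursor.fetchmany(rows), columns
```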

Step 5 — Define the Assistant skill

Each warehouse table becomes one AI Assistant skill keyed warehouse_table__<uuid>. This step controls what the Assistant can see and do with the table:

  • Description: free-text summary the AI uses to decide when to call this skill. Click Suggest with AI for a one-paragraph draft based on the column metadata.
  • Allow-listed columns: tick which columns the Assistant may SELECT. Anything unticked is invisible — and any SQL that mentions an un-allow-listed column is rejected by query-guard before it reaches your warehouse.
  • Row filter (optional): a SQL WHERE clause that's AND-merged into every Assistant query — e.g. region = 'NA' or tenant_id = 'acme'. Useful for multi-tenant tables you want to expose only one slice of.

Trim columns to the minimum needed (typically 5–10) — narrower allow-lists make the AI's queries faster and safer.
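To see how the row filter composes with whatever the model writes, here is a string-level sketch of the AND-merge (illustrative only; a production guard would rewrite a parsed AST rather than splice strings):

```python
import re

_TAIL = re.compile(r"\b(GROUP\s+BY|ORDER\s+BY|LIMIT)\b", re.IGNORECASE)

def merge_row_filter(sql, row_filter):
    """AND-merge a row filter into a single SELECT (string-level sketch only)."""
    if not row_filter:
        return sql
    m = _TAIL.search(sql)
    head, tail = (sql[: m.start()].rstrip(), " " + sql[m.start():]) if m else (sql, "")
    w = re.search(r"\bWHERE\b", head, re.IGNORECASE)
    if w:
        i = w.end()
        # Parenthesize the existing predicate so an OR can't bypass the filter.
        head = f"{head[:i]} ({head[i:].strip()}) AND ({row_filter})"
    else:
        head = f"{head} WHERE {row_filter}"
    return head + tail
```

Parenthesizing both sides is the important part: a model-written `a = 1 OR b = 2` cannot escape the tenant slice.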

Step 6 — Link to CRM data (semantic link)

This step is optional but high-leverage. Click Get an AI suggestion and the platform:

  1. Runs a heuristic pass — looks for column names like account_id, customer_uuid, account_email that obviously map to a CRM entity. Confidence ≥ 0.5 surfaces a recommendation.
  2. Falls through to an AI pass if no heuristic match — the LLM proposes a join target (accounts, contacts, opportunities, etc.) and the join key.
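The heuristic pass might look something like this sketch (the pattern map and confidence scores are illustrative, not the platform's actual rules):

```python
import re

# Illustrative map: column-name regex -> CRM entity it likely joins to.
_PATTERNS = [
    (re.compile(r"account.*(id|uuid)$|^account_email$", re.IGNORECASE), "accounts"),
    (re.compile(r"customer.*(id|uuid)$", re.IGNORECASE), "contacts"),
    (re.compile(r"opportunit.*id$", re.IGNORECASE), "opportunities"),
]

def heuristic_link(columns):
    """Return a link suggestion when a column obviously maps to a CRM entity."""
    for col in columns:
        for pattern, entity in _PATTERNS:
            if pattern.search(col):
                # Exact conventional names score above the 0.5 surfacing threshold.
                confidence = 0.9 if col.lower() in ("account_id", "customer_uuid") else 0.5
                return {"entity": entity, "column": col, "confidence": confidence}
    return None  # no heuristic match: fall through to the AI pass
```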

Accept the suggestion to teach the Assistant how to JOIN this warehouse table with CRM data:

"Show me opportunities for accounts in the warehouse with > $10k LTV." → Assistant emits a single SELECT joining crm.accounts (Postgres) with the warehouse table on the accepted link key, all guarded.

You can skip this step and accept the link later — call PATCH /api/v1/admin/warehouse/data-sources/<id> with { semantic_link: ... }.
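A sketch of that later call's URL and payload (the body shape inside semantic_link is an assumption for illustration):

```python
import json

def semantic_link_patch(data_source_id, entity, local_column, remote_column):
    """Build the URL and JSON body for the later-link PATCH call."""
    url = f"/api/v1/admin/warehouse/data-sources/{data_source_id}"
    body = json.dumps(
        {
            "semantic_link": {
                "entity": entity,
                "local_column": local_column,
                "remote_column": remote_column,
            }
        }
    )
    return url, body
```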

Step 7 — Activate as Assistant skill

Click Activate as Assistant skill. The platform:

  • Inserts a row in warehouse_data_sources with enabled_for_assistants = 'all'.
  • Generates a typed AI tool keyed warehouse_table__<uuid> with the description, columns, row filter, and semantic link baked in.
  • Deep-links you to Settings → AI Agents so you can scope the new skill per agent.

The tool is now live for every default agent. To restrict it to specific agents, untick enabled_for_assistants = 'all' on the data source and tick the tool individually inside each custom agent.
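The generated tool definition can be pictured roughly as follows (the field names are assumptions; only the warehouse_table__<uuid> key format comes from the wizard):

```python
import uuid

def build_tool_spec(table, description, allowed_columns, row_filter=None):
    """Assemble one typed Assistant tool for a warehouse table."""
    return {
        "name": f"warehouse_table__{uuid.uuid4()}",
        "description": description,
        "table": table,
        "allowed_columns": list(allowed_columns),
        "row_filter": row_filter,
    }
```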

Repeat for more tables

Steps 4–7 take < 2 min per additional table on the same connection. Each Databricks connection can host any number of data sources; each becomes its own Assistant skill.

What the Assistant sees

When the Assistant decides a question needs warehouse data, it picks the relevant warehouse_table__* tool, generates SQL from the description + columns + semantic link, and submits it. Before the SQL touches your warehouse, query-guard enforces:

  • Single SELECT statement (no DDL, DML, or multi-statement)
  • Allow-listed table & columns only
  • Row filter AND-merged into the WHERE
  • LIMIT clamp (default 200 rows)
  • Payload cap (default 1 GiB scanned per call)
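Taken together, the first four rules can be sketched as one validation pass (a string-level approximation with illustrative names; the real guard also merges the row filter and enforces the bytes-scanned cap server-side):

```python
import re

_KEYWORDS = {"select", "from", "where", "and", "or", "not", "limit",
             "order", "by", "group", "as", "asc", "desc"}

def guard(sql, table, allowed_columns, limit_cap=200):
    """Validate and rewrite one candidate query (bare table name assumed)."""
    stripped = sql.strip().rstrip(";").strip()
    # 1. Single SELECT only: no DDL, DML, or stacked statements.
    if ";" in stripped or not stripped.lower().startswith("select"):
        raise ValueError("SQL_REJECTED: single SELECT statements only")
    # 2. Allow-listed identifiers only (crude tokenizer; quoted literals not handled).
    for tok in re.findall(r"[a-z_][a-z0-9_]*", stripped.lower()):
        if tok not in _KEYWORDS and tok != table and tok not in allowed_columns:
            raise ValueError(f"SQL_REJECTED: identifier not allow-listed: {tok}")
    # 3. Clamp any trailing LIMIT to the cap, or append one if missing.
    m = re.search(r"\blimit\s+(\d+)\s*$", stripped, re.IGNORECASE)
    if m:
        stripped = stripped[: m.start()] + f"LIMIT {min(int(m.group(1)), limit_cap)}"
    else:
        stripped += f" LIMIT {limit_cap}"
    return stripped
```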

Every call is logged to warehouse_query_log with the rewritten SQL, principal, latency, rows returned, and bytes scanned. See Governance → Audit trail.

Connected Warehouses dashboard

Open Data Lake in the sidebar (or /data-lake) for the operator view:

  • Connection cards with status, last-tested time, and a one-click Test button.
  • KPI strip: connections, tables, last 24h queries, last 24h rejections.
  • Per-data-source query log with SQL, latency, and rejection reasons.
  • "Disable" toggle to instantly revoke an entire connection from the Assistant.

Troubleshooting

Wizard step 3 reports AUTH_FAILED
The PAT is expired or scoped too narrowly. Generate a fresh PAT in User Settings → Developer → Access tokens with no expiration (or rotate it via the wizard's Edit credentials button without re-running steps 4–7).

Wizard step 4 shows no schemas
The principal has USE CATALOG but lacks USE SCHEMA on every schema in the catalog. Grant USE SCHEMA on the specific schema(s) you want to expose, or pick a catalog where the principal does have access.

Assistant tool exists but every call returns SQL_REJECTED
The model emitted SQL that violated the guard — usually mentioning an un-allow-listed column or trying a multi-statement. Inspect the rejection reason in warehouse_query_log (or the Data Lake page) and either widen the allow-list or tighten the skill description so the model picks a different approach.

Tool calls succeed but counts don't match the Databricks UI
The row filter (Step 5) is AND-merged into every query — verify the filter matches your expectation.