Factory Labs

Vector Search RAG (Federated)

Federate the AI Assistant to your Databricks Vector Search or Mosaic AI embeddings index — RAG fallback without re-embedding or maintaining a parallel store.

The Vector Search RAG stream lets the AI Assistant retrieve from your existing Databricks Vector Search (or Mosaic AI embeddings) index as a RAG fallback — without re-embedding documents into Factory's local Pinecone/pgvector store. Use it when you already have a substantial domain corpus indexed in Databricks (product manuals, contracts, runbooks, support tickets) and don't want a parallel embeddings investment.

When to use it

| Situation | Use Vector Search RAG? |
| --- | --- |
| You have a Databricks Vector Search index with ≥ 10k indexed chunks | Yes — federate, don't duplicate |
| You're starting from scratch with a small (~1k chunk) corpus | No — use Factory's built-in knowledge base (faster, cheaper) |
| You need < 100 ms p95 latency on retrieval | No — federated calls are ~200–500 ms; use the local index |
| Your corpus is in S3 / Postgres / Pinecone, not Databricks | No — register that source separately, or wait for the generic Vector Search adapter |

The federated path is strictly a fallback layered alongside the local search_knowledge tool. The Assistant calls both and ranks results by similarity score regardless of source.

Prerequisites

  • A Databricks workspace with Mosaic AI Vector Search enabled.
  • A Vector Search index of either type:
    • Direct Access — you control the embeddings, Factory queries the index by vector
    • Delta Sync — Databricks maintains the index from a Delta table, Factory queries by text or vector
  • The index must have:
    • A text column (the source chunk text)
    • An embedding column (Databricks-managed or BYO)
    • Optional metadata columns to surface as return_columns (e.g. doc_id, title, url, published_at)
  • A Databricks PAT (or service principal) with SQL + Vector Search privileges.
  • An existing federated warehouse data source (Phase 1 of the Connect Databricks flow) — vector indexes are scoped to a warehouse connection so they share auth + per-tenant encryption.

Step 1 — Register the index

Vector indexes are registered via API today (UI shipping next). With the connectionId of an active warehouse connection:

```bash
curl -X POST https://app.factorylabs.ai/api/v1/admin/warehouse/vector-indexes \
  -H "Cookie: <session>" \
  -H "Content-Type: application/json" \
  -d '{
    "connectionId": "<conn_uuid>",
    "indexName": "main.knowledge.product_specs_idx",
    "description": "Product specifications, install guides, and runbooks indexed by Databricks Vector Search.",
    "textColumn": "text",
    "returnColumns": ["doc_id", "title", "url"],
    "enabled_for_assistants": true
  }'
```
| Field | Purpose |
| --- | --- |
| connectionId | Existing warehouse connection (uses its encrypted PAT for auth) |
| indexName | Fully qualified catalog.schema.index_name from Databricks |
| description | Free-text summary the Assistant uses to decide when to call this index |
| textColumn | Column name that holds the source chunk text |
| returnColumns | Metadata columns to include alongside text in each hit (used for citations) |
| enabled_for_assistants | If true, the index becomes a tool for every default agent |

Response includes data.id — a row in warehouse_vector_indexes with status='active'.

Step 2 — Tick the tool in the agent

Open Settings → AI Agents → <your agent>. Under Warehouse vector indexes, the new index appears keyed warehouse_vector__<uuid>. Tick it.

Default agents auto-include all enabled_for_assistants=true indexes. Custom agents start empty — tick explicitly.

Step 3 — Use from the Assistant

"Find product spec sheets for SKU XYZ-123."

The Assistant:

  1. Sees the question is grounding-style and picks the relevant retrieval tools — typically both search_knowledge (local) and warehouse_vector__<uuid> (federated).
  2. Calls them in parallel — cumulative latency stays sub-second because the local index returns immediately.
  3. Receives:
    • From search_knowledge: top-k chunks from Pinecone/pgvector with score, text, citation anchor.
    • From warehouse_vector__<uuid>: top-k chunks from Databricks with score, text, and the returnColumns you configured.
  4. Ranks results by score regardless of source, synthesizes an answer, cites the highest-scoring chunks (with the url column from your warehouse hits as the citation link).
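The source-agnostic ranking in step 4 can be sketched as follows. This is an illustrative sketch, not Factory's actual internals: the hit shape (a dict with at least `score` and `text`) and the `merge_and_rank` helper are assumptions for the example.

```python
def merge_and_rank(local_hits, warehouse_hits, top_k=5):
    """Pool hits from both retrieval tools, tag each with its source,
    and rank purely by similarity score (no per-source preference)."""
    pooled = (
        [{**h, "source": "search_knowledge"} for h in local_hits]
        + [{**h, "source": "warehouse_vector"} for h in warehouse_hits]
    )
    # Highest similarity first, regardless of which store produced the hit.
    return sorted(pooled, key=lambda h: h["score"], reverse=True)[:top_k]
```

Because ranking only looks at the score, citations from the two sources interleave naturally when both return strong hits.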

RAG ranking behavior

The Assistant doesn't pre-emptively pick one source over the other — both are tools, both get called, the model ranks. Three sub-cases worth knowing:

  1. Local + warehouse both have hits — both surfaced, model prefers higher-similarity. Citations interleave naturally.
  2. Only warehouse has hits — Assistant answers from warehouse only; the trace shows zero search_knowledge results.
  3. You disable search_knowledge for this agent — the Assistant relies solely on the federated index. Useful for agents whose domain is exclusively in your lakehouse corpus.

To force one source only, untick the other in the Custom Agent builder.

How the federated call works


The Factory tool wraps the Databricks Vector Search REST API. For Delta Sync indexes, the call uses query_text (Databricks generates the embedding server-side). For Direct Access indexes, you can either pass query_text (if Factory has an embedder configured for the index's model) or query_vector (Factory generates the embedding locally first).
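The query_text / query_vector branching can be sketched as below. The payload field names (`query_text`, `query_vector`, `columns`, `num_results`) mirror the public Databricks Vector Search query API; the `build_query_payload` helper and its `embedder` callable are illustrative assumptions, not Factory code.

```python
def build_query_payload(query_text, columns, num_results=5,
                        embedder=None, index_type="DELTA_SYNC"):
    """Build the body for a Vector Search query call.

    Delta Sync indexes accept query_text (Databricks embeds server-side).
    Direct Access indexes need a vector unless a compatible local
    embedder is available; `embedder` stands in for that local model."""
    payload = {"columns": columns, "num_results": num_results}
    if index_type == "DELTA_SYNC" or embedder is None:
        payload["query_text"] = query_text        # embedded server-side
    else:
        payload["query_vector"] = embedder(query_text)  # embedded locally first
    return payload
```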

Token budget guard

Vector Search indexes can return arbitrarily large chunks — a single hit could be a 50 KB document. To keep the Assistant's context window manageable, Factory enforces a per-call token budget:

| Limit | Default | Notes |
| --- | --- | --- |
| num_results | 5 | Tunable per index; max 20 |
| max_chars_per_result | 2000 | Truncates each chunk's text field |
| total_chars_per_call | 8000 | Across all hits combined |

When total_chars_per_call is exceeded, the lowest-scoring hits are dropped until the call fits the budget. Truncated calls set the budget_exceeded flag in warehouse_vector_call_log.
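The guard's truncate-then-drop logic can be sketched as follows, using the defaults from the table above. This is a sketch of the described behavior, not the production implementation; the hit shape is assumed.

```python
def apply_budget(hits, max_chars_per_result=2_000, total_chars_per_call=8_000):
    """Truncate each chunk's text, then keep hits best-score-first
    until the combined text would exceed the per-call budget.
    Returns (kept_hits, budget_exceeded)."""
    trimmed = [{**h, "text": h["text"][:max_chars_per_result]}
               for h in sorted(hits, key=lambda h: h["score"], reverse=True)]
    kept, used = [], 0
    for h in trimmed:
        if used + len(h["text"]) > total_chars_per_call:
            break  # every remaining hit scores lower, so drop the rest
        kept.append(h)
        used += len(h["text"])
    return kept, len(kept) < len(hits)
```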

Per-index allow-list

Each index registration carries its own enabled_for_assistants flag and per-agent ticking. To restrict an index to one or two custom agents (e.g. only the "Support specialist" agent gets the warranty docs index):

  1. Set enabled_for_assistants = false on the index registration:

    ```bash
    curl -X PATCH https://app.factorylabs.ai/api/v1/admin/warehouse/vector-indexes/<id> \
      -H "Cookie: <session>" -H "Content-Type: application/json" \
      -d '{ "enabled_for_assistants": false }'
    ```
  2. Tick the tool only inside the agents that should have it.

Audit trail

Every federated vector call is logged to warehouse_vector_call_log:

  • query_text (the prompt sent to Vector Search)
  • result_count and bytes_returned
  • latency_ms and HTTP status
  • budget_exceeded flag if guard truncated results
  • principal (the encrypted PAT's resolved user, surfaced from the test step)

The same data surfaces in the Data Lake operator dashboard alongside warehouse SQL queries.
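For quick operational checks outside the dashboard, the log can be aggregated directly. A minimal sketch, assuming rows shaped like the fields listed above (each with a latency_ms integer); the nearest-rank p95 helper is illustrative, not part of Factory:

```python
def p95_latency(rows):
    """Nearest-rank p95 over warehouse_vector_call_log-style rows."""
    xs = sorted(r["latency_ms"] for r in rows)
    return xs[int(0.95 * (len(xs) - 1))]
```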

Cost considerations

Federated Vector Search calls hit your Databricks Vector Search endpoint — usage shows up in your Databricks bill, not Factory's. Rough orders of magnitude:

  • Direct Access index: ~$0.0001 per query (mostly serving compute)
  • Delta Sync index: same query cost + the standing sync compute (varies by source table churn)
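A back-of-envelope estimate makes the scale concrete. The ~$0.0001/query figure above is an order of magnitude only, and the traffic number here is invented for illustration:

```python
# Hypothetical workload: 2,000 federated queries/day against a
# Direct Access index at ~$0.0001 per query (serving compute only).
queries_per_day = 2_000
cost_per_query = 0.0001
monthly_query_cost = queries_per_day * 30 * cost_per_query  # dollars/month
```

Delta Sync indexes add the standing sync compute on top, which depends on source-table churn rather than query volume.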

If you don't want every Assistant chat to hit Databricks, restrict the index to specific agents (set enabled_for_assistants to false and tick the tool only in the intended agents, per the allow-list steps above) so only intentional flows trigger federated retrieval.

Troubleshooting

**Tool registers but every call returns 0 results**
The textColumn value is wrong — verify the column actually contains text in Databricks: SELECT <textColumn> FROM main.knowledge.product_specs_idx LIMIT 5. Also confirm the index is READY (not still building).

**AUTH_FAILED on every call**
The warehouse connection's PAT lacks Vector Search privileges. In Databricks, grant the PAT user USE CATALOG + USE SCHEMA + SELECT on the index's catalog/schema. Re-test the warehouse connection (the vector index reuses its credentials — no separate test step).

**Hits come back but citations show no url**
The index doesn't have a url column, or you didn't include it in returnColumns at registration. Re-register with the corrected returnColumns list.

**Latency > 2 seconds per call**
Either the index is on a serverless endpoint that's cold-started, or num_results is too high for the embedding model. Lower num_results to 5 (default) and confirm the endpoint type in Databricks. For consistent latency, switch to a provisioned serving endpoint.