Salesforce AI January 5, 2026 15 min read

Data Cloud RAG: Grounding Salesforce AI in Real Customer Data

An LLM without context is a confident guesser. Retrieval Augmented Generation gives it your actual customer data before it responds. Here is how the RAG pipeline works inside Salesforce Data Cloud, from ingestion to vector search to response.

Tyler Colby · Founder, Colby's Data Movers

The Problem RAG Solves

A large language model knows what it was trained on. It does not know your customers. It does not know that Account ID 001XX000003DGFF has a $2.4M pipeline with three open opportunities and a pending support escalation. If you ask the LLM to summarize that account, it will either refuse ("I don't have access to that data") or hallucinate something plausible but wrong.

RAG fixes this by retrieving relevant data from your systems and injecting it into the prompt before the LLM generates a response. The model is not guessing. It is reading your data and summarizing it. The quality of its response is bounded by the quality of what you retrieve.

In Salesforce's implementation, Data Cloud is the retrieval layer. It ingests data from CRM, external systems, and unstructured sources, unifies it through identity resolution, indexes it for semantic search, and serves it to Einstein and Agentforce when they need context.

The Full RAG Pipeline

┌───────────────┐   ┌───────────────┐   ┌───────────────┐
│  Data Sources │   │  Data Sources │   │  Data Sources │
│ (CRM Objects) │   │ (External DB) │   │ (Unstructured)│
│ Account, Opp, │   │ ERP, billing, │   │ PDFs, emails, │
│ Case, Contact │   │ inventory     │   │ chat logs     │
└───────┬───────┘   └───────┬───────┘   └───────┬───────┘
        │                   │                   │
        └──────────────┬────┴───────────────────┘
                       ▼
              ┌─────────────────┐
              │  DATA CLOUD     │
              │  Ingestion      │
              │  (connectors +  │
              │   data streams) │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  Data Lake      │
              │  Objects (DLOs) │
              │  Raw ingested   │
              │  data           │
              └────────┬────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  Data Model     │
              │  Objects (DMOs) │
              │  Mapped and     │
              │  harmonized     │
              └────────┬────────┘
                       │
            ┌──────────┼──────────┐
            │          │          │
            ▼          ▼          ▼
      ┌──────────┐ ┌────────┐ ┌──────────┐
      │ Identity │ │ Vector │ │ Semantic │
      │ Resoln.  │ │ Index  │ │ Search   │
      └──────────┘ └────────┘ └──────────┘
                       │
                       ▼
              ┌─────────────────┐
              │  Retrieval      │
              │  (Einstein /    │
              │   Agentforce)   │
              └─────────────────┘

Let me walk through each stage with real examples.

Stage 1: Data Ingestion

Data Cloud ingests data through connectors. There are native connectors for Salesforce CRM objects (zero config), and external connectors for databases, cloud storage, and APIs.

Native CRM Connector (automatic):
  Source: Account, Contact, Opportunity, Case, Lead
  Sync: Real-time (CDC - Change Data Capture)
  Latency: Under 60 seconds from CRM change to Data Cloud

External Connector (configured):
  Source: Snowflake, BigQuery, S3, Azure Blob, custom API
  Sync: Batch (hourly, daily) or streaming
  Latency: Batch = up to 1 hour, Streaming = minutes

Unstructured Connector:
  Source: S3 bucket with PDFs, email archives, chat transcripts
  Sync: Batch (typically nightly)
  Processing: Extraction + chunking + embedding (more below)

The CRM connector is the easiest. Turn it on. Select which objects to sync. Data flows in real time via Change Data Capture. When a rep updates an opportunity stage, that change is in Data Cloud within a minute.

External connectors require more setup. You define the schema mapping, configure authentication, and set the sync schedule. For RAG purposes, the critical decision is sync frequency. If your AI is answering questions about order status and orders live in an external ERP, batch sync every hour means your AI could be up to 60 minutes stale. For latency-sensitive use cases, use streaming ingestion or query the source directly at retrieval time.
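That staleness decision can be made explicit in the retrieval layer. A minimal sketch, assuming you track each connector's last sync time; the function name and tolerance values are illustrative, not Data Cloud settings:

```python
from datetime import datetime, timedelta, timezone

# Freshness check: serve from Data Cloud when the last batch sync is
# recent enough, otherwise fall back to querying the source system
# directly. Thresholds here are illustrative assumptions.
def choose_retrieval_path(last_sync: datetime, max_staleness: timedelta) -> str:
    age = datetime.now(timezone.utc) - last_sync
    return "data_cloud" if age <= max_staleness else "live_source"

# An order-status agent that tolerates at most 5 minutes of staleness
# against an hourly batch sync ends up querying the ERP directly:
last_sync = datetime.now(timezone.utc) - timedelta(minutes=45)
print(choose_retrieval_path(last_sync, timedelta(minutes=5)))  # live_source
```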

Stage 2: Data Lake Objects and Data Model Objects

Ingested data lands in Data Lake Objects (DLOs). These are raw, source-specific tables. The DLO for Salesforce Accounts looks different from the DLO for ERP customer records, even if they represent the same entities.

Data Model Objects (DMOs) are the harmonized layer. You map DLO fields to a standard schema. The Salesforce Account and the ERP customer record both map to the "Unified Individual" or "Unified Account" DMO.

Data Lake Objects (raw):
  sf_account_dlo:
    - Id, Name, Industry, AnnualRevenue, OwnerId, ...
  erp_customer_dlo:
    - customer_id, company_name, vertical, arr, sales_rep, ...

Data Model Objects (harmonized):
  UnifiedAccount:
    - account_name    <- sf_account_dlo.Name, erp_customer_dlo.company_name
    - industry        <- sf_account_dlo.Industry, erp_customer_dlo.vertical
    - revenue         <- sf_account_dlo.AnnualRevenue, erp_customer_dlo.arr
    - source_system   <- "salesforce" | "erp"

The mapping is manual and requires domain knowledge. Which field in the ERP corresponds to which field in Salesforce? What happens when they conflict? Data Cloud lets you define merge rules: "take the Salesforce value if both exist" or "take the most recently updated value." These decisions affect RAG quality directly. If the merge rule picks stale data, the AI gets stale context.
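The merge logic fits in a few lines. This is not Data Cloud's engine, just the rule it encodes, using the field names from the example above and a "most recently updated record wins" rule:

```python
# DLO -> DMO harmonization sketch: map two raw records onto the
# UnifiedAccount schema, letting the most recently updated source win.
def merge_records(sf: dict, erp: dict) -> dict:
    newest_is_sf = sf["last_modified"] >= erp["last_modified"]
    field_map = {  # harmonized field: (Salesforce field, ERP field)
        "account_name": ("Name", "company_name"),
        "industry": ("Industry", "vertical"),
        "revenue": ("AnnualRevenue", "arr"),
    }
    source = sf if newest_is_sf else erp
    unified = {
        dmo: source[sf_f if newest_is_sf else erp_f]
        for dmo, (sf_f, erp_f) in field_map.items()
    }
    unified["source_system"] = "salesforce" if newest_is_sf else "erp"
    return unified

sf = {"Name": "Acme Corp", "Industry": "Manufacturing",
      "AnnualRevenue": 2_400_000, "last_modified": "2026-01-04"}
erp = {"company_name": "ACME Corporation", "vertical": "mfg",
       "arr": 2_100_000, "last_modified": "2026-01-02"}
print(merge_records(sf, erp)["account_name"])  # Acme Corp (Salesforce is newer)
```

Flip the two `last_modified` dates and the ERP values win instead, which is exactly how a bad merge rule silently feeds stale context to the AI.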

Stage 3: Identity Resolution

This is the feature that makes Data Cloud's RAG different from a generic vector database. Identity resolution matches records from different sources that represent the same real-world entity.

Identity Resolution Example:

  Salesforce CRM:
    Contact: "John Smith", john@acme.com, (555) 123-4567

  Marketing Cloud:
    Subscriber: "J. Smith", john.smith@acme.com

  ERP:
    Customer: "Jonathan Smith", jsmith@acme.com, (555) 123-4567

  After Identity Resolution:
    Unified Individual:
      Names: ["John Smith", "J. Smith", "Jonathan Smith"]
      Emails: ["john@acme.com", "john.smith@acme.com", "jsmith@acme.com"]
      Phone: "(555) 123-4567"
      Matched by: Phone (exact) + Email domain (fuzzy) + Name (fuzzy)

Identity resolution uses probabilistic matching. It compares fields across records using exact match, fuzzy match (Levenshtein distance, soundex), and domain-specific rules. The match confidence is scored, and records above a threshold are unified.
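A toy version of that scoring, using Python's difflib for the fuzzy component. The weights and the 0.8 threshold are illustrative assumptions; Data Cloud's match rules are configured, not hand-coded:

```python
from difflib import SequenceMatcher

# Match-scoring sketch: exact phone match, fuzzy name match, and an
# email-domain rule, combined into one confidence score in [0, 1].
def match_score(a: dict, b: dict) -> float:
    score = 0.0
    if a.get("phone") and a.get("phone") == b.get("phone"):
        score += 0.5  # exact match on a strong key
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    score += 0.3 * name_sim  # fuzzy name similarity
    if a["email"].split("@")[-1] == b["email"].split("@")[-1]:
        score += 0.2  # same email domain
    return score

crm = {"name": "John Smith", "email": "john@acme.com", "phone": "(555) 123-4567"}
erp = {"name": "Jonathan Smith", "email": "jsmith@acme.com", "phone": "(555) 123-4567"}
print(match_score(crm, erp) > 0.8)  # True: unify these records
```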

For RAG, this means an Agentforce agent can retrieve all data about a customer regardless of which system it originated from. A service agent asking "What's the full picture on this customer?" gets CRM data, marketing engagement data, ERP order history, and support tickets, all unified under one identity.

The match rules are configurable. Too loose and you merge distinct people. Too strict and you miss valid matches. Tuning match rules is a separate discipline, but for RAG purposes, precision matters more than recall. It is worse to retrieve wrong data (merged from the wrong person) than to miss some data (a valid record that was not matched).

Stage 4: Chunking and Embedding Unstructured Content

Structured data (CRM fields, ERP records) does not need embedding for most RAG use cases. A SOQL query retrieves it precisely. But unstructured data (PDFs, emails, knowledge articles, chat transcripts) cannot be queried with SQL. It needs to be chunked, embedded, and indexed for semantic search.

Chunking Strategy:
  1. Extract text from source (PDF parsing, email body extraction)
  2. Split into chunks of ~500 tokens (roughly 375 words)
  3. Overlap adjacent chunks by 50 tokens (context preservation)
  4. Attach metadata: source document, page number, timestamp,
     related record ID (e.g., Case ID for a case email)
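Steps 2 and 3 can be sketched as a sliding window. This approximates tokens with whitespace-split words; a real pipeline would use the embedding model's tokenizer:

```python
# Fixed-size chunking with overlap: each chunk shares its last `overlap`
# words with the start of the next chunk, so a sentence that spans a
# boundary survives intact in at least one chunk.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
        start += size - overlap  # advance by a full chunk, minus the overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))  # a 1000-word document
print(len(chunk_text(doc)))  # 3 chunks: words 0-499, 450-949, 900-999
```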

Embedding:
  Model: Salesforce's embedding model (based on BERT architecture)
  Dimensions: 768
  Input: Chunk text + metadata
  Output: Dense vector representation

Example chunk:
  Text: "The customer reported intermittent connectivity issues
    with the X9000 unit. Troubleshooting revealed the firmware
    was on version 2.1.3, which has a known bug with WiFi
    reconnection after power cycling. Updated to firmware 3.0.1.
    Issue resolved."
  Metadata: {
    source: "Case 00045678",
    type: "case_comment",
    product: "X9000",
    date: "2025-09-15"
  }
  Vector: [0.023, -0.117, 0.891, ...] (768 dimensions)

The chunk size of 500 tokens is a trade-off. Smaller chunks are more precise (the retrieved text is closely related to the query) but lack context. Larger chunks have more context but may include irrelevant information that dilutes the signal. 500 tokens with 50-token overlap works well for support case data and knowledge articles. For highly structured documents (contracts, specifications), larger chunks of 800-1000 tokens preserve the structural context better.

Stage 5: Semantic Search vs SOSL Keyword Search

Data Cloud supports two retrieval methods: vector (semantic) search and SOSL keyword search. They serve different purposes.

Vector Search:
  Query: "customer having trouble connecting X9000 to wifi"
  Matches: Chunks about WiFi issues, connectivity problems,
    X9000 troubleshooting, firmware updates
  Strength: Finds conceptually related content even if
    exact keywords don't match
  Weakness: Can retrieve tangentially related content
    that dilutes context

SOSL Keyword Search:
  Query: "X9000 WiFi firmware 3.0.1"
  Matches: Chunks containing those exact terms
  Strength: Precise. Returns exactly what you asked for.
  Weakness: Misses relevant content that uses different terms
    ("wireless" instead of "WiFi", "software update" instead
    of "firmware")

In practice, we use both. The retrieval step runs a vector search for semantic relevance and a keyword search for precision, then merges and deduplicates the results. This hybrid approach consistently outperforms either method alone.

Hybrid Retrieval Strategy:
  1. Vector search: top 5 chunks by cosine similarity
  2. SOSL search: top 5 chunks by keyword relevance
  3. Merge results, deduplicate by chunk ID
  4. Re-rank by combined score
  5. Take top 5 chunks for prompt injection

  Typical result: 3 chunks from vector, 2 from SOSL
  (with 1-2 appearing in both)
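The merge-and-re-rank steps above can be sketched as follows, assuming both searches return (chunk_id, score) pairs normalized to [0, 1]. The 0.6/0.4 weighting is an illustrative assumption:

```python
# Hybrid retrieval merge: combine vector and keyword hits, deduplicate
# by chunk ID (a chunk found by both searches accumulates both scores),
# then re-rank by the combined score.
def hybrid_merge(vector_hits, keyword_hits, top_k=5):
    combined = {}
    for cid, score in vector_hits:
        combined[cid] = combined.get(cid, 0.0) + 0.6 * score
    for cid, score in keyword_hits:
        combined[cid] = combined.get(cid, 0.0) + 0.4 * score
    ranked = sorted(combined, key=combined.get, reverse=True)
    return ranked[:top_k]

vec = [("c1", 0.9), ("c2", 0.8), ("c3", 0.7)]  # semantic hits
kw = [("c2", 1.0), ("c4", 0.6)]                # keyword hits
print(hybrid_merge(vec, kw, top_k=3))  # ['c2', 'c1', 'c3']
```

Note that c2, found by both searches, jumps to the top: agreement between the two methods is itself a relevance signal.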

Stage 6: FLS Enforcement in AI Context

This is the part that every other RAG system ignores. Salesforce enforces Field-Level Security (FLS) and record-level sharing during retrieval. If the user asking the question does not have access to a field or record, the retrieval step will not return it.

FLS Enforcement in RAG:

  User: Sales Rep (Profile: Standard User)
  Question: "Summarize the Acme account"

  Retrieval returns:
    - Account name, industry, owner (user has access)
    - Open opportunities (user has access to their own)
    - Recent activities (user has access)

  Retrieval does NOT return:
    - AnnualRevenue field (hidden by FLS for this profile)
    - Opportunities owned by other reps (sharing rules)
    - Case details (user does not have Case read access)
    - Internal notes field (hidden by FLS)

  The LLM can only summarize what was retrieved.
  It cannot hallucinate the revenue or other reps' deals
  because that data never entered the prompt.

This is fundamentally different from building RAG on a generic vector database. In a custom RAG system, you would need to implement access control in your retrieval layer. In Data Cloud, it is automatic. The retrieval respects the same permission model that the Salesforce UI enforces. This matters enormously for regulated industries. HIPAA, SOX, and GDPR do not care about your AI's capabilities. They care about who can access what data. Data Cloud's RAG pipeline inherits the existing access control model.
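For contrast, here is roughly what a custom RAG system would have to build itself: a post-retrieval filter over record access and visible fields. All names here are hypothetical; Data Cloud applies this logic automatically at retrieval time.

```python
# Custom-RAG access filter sketch: drop chunks the requesting user
# cannot read, and strip fields hidden by field-level security, before
# anything reaches the prompt.
def enforce_access(chunks: list[dict], user: dict) -> list[dict]:
    visible = []
    for chunk in chunks:
        if chunk["record_id"] not in user["readable_records"]:
            continue  # record-level sharing: skip the whole chunk
        fields = {k: v for k, v in chunk["fields"].items()
                  if k in user["readable_fields"]}  # FLS: strip fields
        visible.append({**chunk, "fields": fields})
    return visible

chunks = [
    {"record_id": "001A", "fields": {"Name": "Acme", "AnnualRevenue": 2400000}},
    {"record_id": "500B", "fields": {"Subject": "Escalation"}},  # a Case
]
rep = {"readable_records": {"001A"}, "readable_fields": {"Name"}}
print(enforce_access(chunks, rep))  # [{'record_id': '001A', 'fields': {'Name': 'Acme'}}]
```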

When RAG Helps

RAG is not universally good. It adds latency, complexity, and cost. Use it when the value justifies the overhead.

RAG is valuable when:
  - The AI needs to reference specific customer data
  - The answer depends on recent or dynamic information
  - Accuracy is more important than speed
  - The knowledge base changes frequently
  - Users need citations ("where did this come from?")

Examples:
  "Summarize this account's recent activity"  -> RAG (needs real data)
  "What's our return policy?"                 -> RAG (knowledge article)
  "Draft an email to this contact"            -> RAG (needs contact context)
  "Generate a case summary"                   -> RAG (needs case details)

When RAG Slows Things Down

RAG adds latency:
  Without RAG: User query -> LLM -> Response (1-2 seconds)
  With RAG: User query -> Embed query -> Vector search ->
            Retrieve chunks -> Inject into prompt -> LLM ->
            Response (3-5 seconds)

RAG is NOT needed when:
  - The answer is static and well-known
  - The LLM's training data already contains the answer
  - The question is about general knowledge, not customer-specific data
  - Speed is more important than accuracy

Examples:
  "How do I reset my password?"     -> No RAG needed (static FAQ)
  "What are your business hours?"   -> No RAG needed (static info)
  "Hello, I need help"              -> No RAG needed (greeting)

For Agentforce agents, the decision to invoke RAG should be made at the action level. Topic classification routes the user's query to a topic, and the topic's actions determine whether retrieval runs. A "Greeting" action does not need RAG. An "Account Summary" action does. This selective approach keeps simple interactions fast while grounding complex ones in real data.
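A sketch of that gating logic. The action names and the mapping are illustrative, not Agentforce configuration:

```python
# Action-level RAG gating: only actions that need customer data
# trigger the retrieval pipeline; everything else answers immediately.
RAG_ACTIONS = {"account_summary", "case_summary", "draft_email"}

def needs_retrieval(action: str) -> bool:
    return action in RAG_ACTIONS

print(needs_retrieval("greeting"))         # False: respond in 1-2 seconds
print(needs_retrieval("account_summary"))  # True: ground in Data Cloud first
```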

Retrieval Quality: The Make-or-Break Factor

The quality of the AI's response is bounded by the quality of what you retrieve. If you retrieve the wrong chunks, the LLM will confidently summarize the wrong data. This is worse than hallucination because the response looks grounded ("Based on the case history...") while being grounded in the wrong case history.

Retrieval Quality Checklist:
  [ ] Chunk size appropriate for content type
  [ ] Overlap prevents context loss at chunk boundaries
  [ ] Metadata attached to every chunk (source, date, record ID)
  [ ] Embedding model matches query language (English vs multilingual)
  [ ] Top-k parameter tuned (too few = missing context, too many = noise)
  [ ] Re-ranking applied after initial retrieval
  [ ] Stale content purged from index on source deletion
  [ ] FLS and sharing enforced at retrieval time

Test retrieval quality separately from generation quality. Give the system 50 representative queries. For each query, evaluate whether the top-5 retrieved chunks contain the information needed to answer correctly. If retrieval precision is below 80%, improve chunking, embeddings, or metadata before tuning the LLM prompt. The best prompt in the world cannot compensate for retrieving the wrong data.
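That test loop fits in a few lines. "Precision" here means hit rate at k: the fraction of queries whose top-k retrieved chunks include at least one known-relevant chunk. The `retrieve` callable stands in for whatever your pipeline exposes; the fake one below is only for illustration:

```python
# Retrieval evaluation sketch: labeled_queries pairs each query with the
# set of chunk IDs known to answer it; retrieve(query, k) returns the
# top-k chunk IDs from the pipeline under test.
def retrieval_precision(labeled_queries, retrieve, k=5):
    hits = sum(1 for query, relevant in labeled_queries
               if relevant & set(retrieve(query, k)))
    return hits / len(labeled_queries)

labeled = [("order status for Acme", {"c1"}),
           ("X9000 wifi fix", {"c9"})]
fake_retrieve = lambda q, k: ["c1", "c2", "c3", "c4", "c5"]
print(retrieval_precision(labeled, fake_retrieve))  # 0.5: below the 0.8 bar
```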

Data Cloud RAG is not a plug-and-play feature. It is a pipeline with six stages, each of which affects the final output quality. Getting it right requires understanding every stage, tuning each one, and monitoring the end-to-end result. But when it works, it transforms Salesforce AI from a generic chatbot into a system that actually knows your customers. Need help building your RAG pipeline? We have implemented it across CRM, ERP, and unstructured data sources.