Prompt Templates at Scale: When Your Template Meets 50,000 Records
A prompt template that works beautifully on ten records can fail catastrophically on ten thousand. Variable data quality, token budget overflows, and missing field values turn a demo into a production incident. Here is how to build templates that survive real data.
The Variable Data Quality Problem
Every prompt template demo uses perfect data. The Account has a name, industry, annual revenue, description, and a clean set of related opportunities. The template merges the fields, the LLM generates a beautiful summary, and the demo audience applauds.
Then you deploy the template to an org with 50,000 accounts, and you discover that 38% of accounts have no industry set, 62% have no annual revenue, 59% have an empty description field, and 12% have names like "Test Account DO NOT USE" or "Acme Corp (DUPLICATE)". The template that looked flawless in the demo now generates garbage for half your records.
Data Quality Reality Check (Typical Org):
==========================================
Field                     Population Rate   Quality Issues
Account.Name              100%              12% have "test", "duplicate", etc.
Account.Industry          62%               8% are "Other"
Account.AnnualRevenue     38%               15% are clearly wrong ($1, $999999999)
Account.Description       41%               30% are copy-pasted boilerplate
Account.Website           55%               5% are broken links
Contact.Title             47%               20% are outdated
Contact.Email             89%               3% are bouncing
Opportunity.Description   34%               Most are empty or "TBD"
Opportunity.Amount        72%               Some are placeholder values
A template that starts with "Given that {Account.Name} is a {Account.Industry} company with ${Account.AnnualRevenue} in annual revenue..." will produce "Given that Acme Corp is a company with $ in annual revenue..." for the 62% of your accounts with no revenue set. The LLM will try to work with this, and the results will be embarrassing.
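To see why, it helps to make the merge step concrete. The sketch below is a hypothetical stand-in for Salesforce's merge-field resolution (the real engine behaves similarly for empty fields): missing values become empty strings, and the prompt silently degrades.

```python
import re

# Hypothetical sketch of naive merge-field substitution: empty fields
# render as empty strings, so the prompt silently degrades.
def naive_merge(template: str, record: dict) -> str:
    """Substitute {Object.Field} tokens; missing values become ''."""
    return re.sub(
        r"\{([\w.]+)\}",
        lambda m: str(record.get(m.group(1), "") or ""),
        template,
    )

template = ("Given that {Account.Name} is a {Account.Industry} company "
            "with ${Account.AnnualRevenue} in annual revenue...")

# A sparse record, as most accounts in the table above would look:
sparse = {"Account.Name": "Acme Corp"}
print(naive_merge(template, sparse))
# Given that Acme Corp is a  company with $ in annual revenue...
```

The output is syntactically valid text, which is exactly the problem: nothing errors, and the LLM is left to improvise around the gaps.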
The Four Template Types
Salesforce offers four types of prompt templates, each with different use cases and different failure modes at scale.
Template Types:
1. Sales Email - Generate outbound email copy
2. Record Summary - Summarize a record and related data
3. Field Generation - Generate a value for a specific field
4. Flex - Free-form, custom instructions
Failure modes at scale:
Sales Email:      Personalization fails when contact data is sparse.
                  "Dear {Contact.FirstName}" becomes "Dear" for 2% of records.
Record Summary:   Summary is shallow when related records are empty.
                  "Account has no recent activity" x 40,000 records.
Field Generation: Generates plausible but wrong values when source fields
                  are empty. "Industry: Technology" for a healthcare
                  company with no Industry field set.
Flex:             Most powerful, most dangerous. Custom logic can
                  reference fields that do not exist on all record types.
Merge Field Resolution Order
When a prompt template is invoked, Salesforce resolves merge fields in a specific order. Understanding this order is essential for building reliable templates.
Merge Field Resolution Order:
1. Direct field on the source record
{Account.Name} -> "Acme Corp"
2. Related record fields (parent relationships)
{Account.Owner.Name} -> "Jane Doe"
{Opportunity.Account.Industry} -> "Technology"
3. Related list aggregations (child relationships)
{Account.Opportunities} -> List of opportunity records
{Case.CaseComments} -> List of comment records
4. Custom formula fields
{Account.Health_Score__c} -> 78
5. Cross-object references (via lookup)
{Opportunity.Account.Owner.Manager.Name} -> traverses 3 lookups
Resolution limits:
- Maximum 5 relationship traversals
- Maximum 2,000 tokens of merged data per template
- Related lists limited to most recent 20 records
- FLS enforced: fields the user cannot see resolve to empty
The 2,000-token limit on merged data is the one that bites at scale. A record summary template that pulls in related opportunities, cases, and activities can easily exceed this. When it does, the merged data is truncated silently. The template gets partial context and generates a partial summary.
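Because the truncation is silent, it is worth a pre-flight estimate before deploying. The sketch below uses the rough 4-characters-per-token heuristic (an assumption, not the platform's real tokenizer) to flag merge sections that would blow past the 2,000-token limit.

```python
# Pre-flight check for the merged-data token limit. The 4-chars-per-token
# heuristic is an approximation, not the platform's actual tokenizer.
MERGED_DATA_TOKEN_LIMIT = 2000

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def check_merged_data(merged_sections: dict) -> list:
    """Return warnings for sections that push past the limit."""
    warnings, running = [], 0
    for name, text in merged_sections.items():
        running += estimate_tokens(text)
        if running > MERGED_DATA_TOKEN_LIMIT:
            warnings.append(
                f"'{name}' pushes merged data to ~{running} tokens "
                f"(limit {MERGED_DATA_TOKEN_LIMIT}); later sections truncate"
            )
    return warnings

sections = {
    "account_fields": "x" * 1200,   # ~300 tokens
    "opportunities": "x" * 6000,    # ~1,500 tokens
    "cases": "x" * 4000,            # ~1,000 tokens -> over budget
}
for w in check_merged_data(sections):
    print(w)
```

Running this against representative records before go-live turns a silent truncation into a visible, fixable warning.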
Defensive Templates for Missing Data
The core technique for building scale-ready templates is defensive prompting. You instruct the LLM how to handle missing data explicitly, rather than hoping it figures it out.
// BAD: Assumes all data exists
Template: Sales Email
---
"Write a personalized email to {Contact.FirstName} {Contact.LastName},
who is the {Contact.Title} at {Account.Name}, a {Account.Industry}
company with ${Account.AnnualRevenue} in annual revenue. Reference
their recent {Account.Last_Activity_Type__c} on {Account.Last_Activity_Date__c}."
// Problem: When fields are empty, the LLM sees:
"Write a personalized email to ,
who is the at Acme Corp, a company with $ in annual revenue.
Reference their recent on ."
// The LLM generates a terrible, generic email.
// GOOD: Handles missing data explicitly
Template: Sales Email
---
"Write a personalized sales email using the following context.
Some fields may be empty. If a field is empty, do NOT mention it
or make up a value. Adjust the email to focus on what we know.
Contact Name: {Contact.FirstName} {Contact.LastName}
Contact Title: {Contact.Title}
Company: {Account.Name}
Industry: {Account.Industry}
Annual Revenue: {Account.AnnualRevenue}
Recent Activity: {Account.Last_Activity_Type__c} on {Account.Last_Activity_Date__c}
Open Opportunities: {Account.Open_Opp_Count__c}
Rules:
- If Contact Title is empty, do not reference their role.
- If Industry is empty, keep the email industry-agnostic.
- If Annual Revenue is empty, do not mention company size.
- If no recent activity, open with a cold introduction instead.
- If Contact FirstName is empty, use 'Hi there' as greeting.
- Keep the email under 150 words.
- Include one specific value proposition relevant to
the available context."
The defensive template works for 100% of records, not just the well-populated ones. When data is missing, the LLM adapts the output rather than generating nonsense. The explicit rules ("If Industry is empty, keep the email industry-agnostic") are more reliable than implicit expectations.
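You can push the same defense one step earlier by building the context block server-side (in Apex or Flow) before it reaches the template. A minimal Python sketch of the idea, assuming a hypothetical field-collection step upstream:

```python
# Sketch: build the context block server-side, dropping empty fields
# entirely so the LLM never sees dangling labels like "Industry: ".
def build_context(fields: dict) -> str:
    lines = []
    for label, value in fields.items():
        if value not in (None, "", "TBD"):  # treat placeholders as empty
            lines.append(f"{label}: {value}")
    if not lines:
        return "No reliable data available for this record."
    return "\n".join(lines)

print(build_context({
    "Contact Name": "Pat O'Brien",
    "Contact Title": "",       # empty -> omitted
    "Company": "Acme Corp",
    "Industry": None,          # empty -> omitted
    "Annual Revenue": 5000000,
}))
```

Omitting the label entirely is stronger than sending "Industry:" with a blank value: the model cannot fixate on a field it never saw.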
Token Budgets
Every prompt template has a token budget. The budget is split between the template instructions, the merged data, and the expected output. If you do not manage this budget, you will hit truncation or LLM errors at scale.
Token Budget Breakdown:
┌─────────────────────────────────────────┐
│ Model Context Window: 8,192 tokens │
│ │
│ System Prompt (Trust Layer): ~500 │
│ Template Instructions: ~400 │
│ Merged Data: ~2,000 │
│ Conversation History (if agent): ~1,000 │
│ ─────────────────────────────────────── │
│ Available for Output: ~4,292 │
│ Target Output Length: ~500 │
│ Safety Buffer: ~3,792 │
└─────────────────────────────────────────┘
For record summaries with related lists:
Each related record ≈ 50-100 tokens
20 related opportunities ≈ 1,000-2,000 tokens
20 related cases ≈ 800-1,600 tokens
10 activities ≈ 300-500 tokens
Total merged data for a "full picture" template:
2,100-4,100 tokens (often exceeds the 2,000 limit)
The practical solution is to be selective about what data you merge. Do not pull all related records. Pull the most relevant ones.
// Instead of merging ALL opportunities:
Related Opportunities: {Account.Opportunities}
// This pulls up to 20 opportunities with all fields. Easily 2,000+ tokens.
// Merge a pre-filtered summary (via formula or Apex):
Open Opportunities Summary:
Count: {Account.Open_Opp_Count__c}
Total Pipeline: {Account.Open_Pipeline_Amount__c}
Nearest Close Date: {Account.Next_Close_Date__c}
Largest Deal: {Account.Largest_Open_Opp__c}
// This is 4 fields. Under 50 tokens. Contains the essential information.
Using roll-up summary fields or formula fields to pre-compute aggregates is the single most effective token optimization. Instead of merging 20 opportunity records (1,000+ tokens), you merge 4 aggregate fields (50 tokens) that contain the same decision-relevant information.
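A quick back-of-envelope comparison makes the savings tangible. The record contents below are invented for illustration, and the 4-chars-per-token heuristic is again an approximation:

```python
# Compare merging 20 raw opportunity records vs. four roll-up fields,
# using the rough 4-chars-per-token heuristic. Record text is invented.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

raw_records = "\n".join(
    f"Opp {i}: Stage=Negotiation, Amount=125000, CloseDate=2024-09-{i:02d}, "
    f"Owner=Jane Doe, NextStep=Schedule security review with procurement"
    for i in range(1, 21)
)
rollup = ("Count: 7\nTotal Pipeline: $1,240,000\n"
          "Nearest Close Date: 2024-09-12\nLargest Deal: $450,000")

print(estimate_tokens(raw_records), "tokens vs", estimate_tokens(rollup))
```

The aggregate version is more than an order of magnitude smaller while still answering the questions the summary actually needs.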
Chaining Templates in Flows
A single template has limits. Chaining templates in a Flow lets you build multi-step AI workflows where the output of one template feeds into the next.
Flow: Account Review Preparation
==========================================
Step 1: Record Summary Template
Input: Account record
Output: 200-word account summary
-> Store in variable: accountSummary
Step 2: Risk Assessment Template
Input: accountSummary + open Cases + recent Activities
Output: Risk level (Low/Medium/High) + reasoning
-> Store in variable: riskAssessment
Step 3: Recommended Actions Template
Input: accountSummary + riskAssessment + Account.Owner preferences
Output: 3-5 prioritized action items
-> Store in variable: recommendations
Step 4: Email Draft Template
Input: recommendations + Account Owner name + Account contact info
Output: Email to Account Owner with review summary
-> Send or store as draft
Chaining has three advantages over a single mega-template. First, each template has its own token budget. A 4-step chain has 4x the effective context window. Second, intermediate outputs are focused. The risk assessment template sees the summary, not the raw data. The signal-to-noise ratio is better. Third, you can test and iterate each step independently.
The trade-off is latency. Each template invocation takes 2-4 seconds. A 4-step chain takes 8-16 seconds. For user-facing features where the user waits for the output, this is too slow. For background batch operations (prepare account reviews overnight), it is fine.
// Flow with error handling for template chains
Decision: Did Step 1 succeed?
Yes -> Proceed to Step 2
No -> Log error, use fallback summary from formula field
Decision: Did Step 2 output contain a risk level?
Yes -> Proceed to Step 3
No -> Default to "Medium" risk, proceed with caution note
Decision: Is total latency under 30 seconds?
Yes -> Proceed to Step 4
No -> Skip email draft, deliver partial results
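The chain-with-fallbacks pattern above can be sketched in code. This assumes a hypothetical `invoke_template()` standing in for the real prompt template API call, and the fallback field name `Fallback_Summary__c` is an invented example:

```python
import time

def invoke_template(name: str, inputs: dict) -> str:
    raise NotImplementedError  # stand-in for the real template API call

def run_chain(account: dict, deadline_s: float = 30.0) -> dict:
    start = time.monotonic()
    results = {}

    # Step 1: summary, with a formula-field fallback on failure
    try:
        results["summary"] = invoke_template("Record Summary", account)
    except Exception:
        results["summary"] = account.get("Fallback_Summary__c", "No summary.")

    # Step 2: risk level, defaulting to Medium if the output is malformed
    try:
        risk = invoke_template("Risk Assessment",
                               {"summary": results["summary"]})
    except Exception:
        risk = ""
    if not any(level in risk for level in ("Low", "Medium", "High")):
        risk = "Medium (defaulted: assessment unavailable)"
    results["risk"] = risk

    # Later steps only run inside the latency budget; otherwise deliver
    # partial results rather than timing out.
    if time.monotonic() - start < deadline_s:
        try:
            results["recommendations"] = invoke_template(
                "Recommended Actions", results)
        except Exception:
            results["recommendations"] = "Review manually."
    else:
        results["recommendations"] = "Skipped: latency budget exceeded."
    return results
```

The key design choice is that every step produces *something*: a fallback summary, a defaulted risk level, or an explicit "skipped" marker, so downstream steps and users never see a hard failure.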
Monitoring Which Outputs Users Edit
The most valuable signal for improving prompt templates is what users change. When a user generates a sales email and then edits 80% of it before sending, the template is failing. When they edit 10%, it is succeeding.
Monitoring Strategy:
1. Store the raw template output (before user edits)
2. Store the final version (after user edits)
3. Compute edit distance (Levenshtein or diff ratio)
4. Track over time by template, by user, by record type
Metrics:
Edit Rate: % of outputs that users modify at all
Edit Depth: Average % of text changed when edited
Reject Rate: % of outputs regenerated or discarded
Time to Send: Seconds between generation and send/save
Targets:
Edit Rate: < 60% (most outputs used as-is or lightly tweaked)
Edit Depth: < 30% (edits are minor adjustments, not rewrites)
Reject Rate: < 10% (output is usable on first generation)
Red Flags:
Edit Depth > 50%: Template is not generating useful output.
Review instructions and merge field quality.
Reject Rate > 20%: Template is producing harmful or wrong content.
Check for data quality issues in source records.
Certain users edit 90%+: Those users may not trust AI output.
Training issue, not template issue.
Implementing this monitoring requires capturing the generated text before and after editing. In Salesforce, you can do this with a before-update trigger on the field being populated (e.g., Email body) that compares the old value (template output) with the new value (edited version). Store the diff metrics on a custom object for reporting.
// Trigger to track template output edits (simplified sketch)
trigger TrackTemplateEdits on EmailMessage (before update) {
    List<AI_Template_Metric__c> metrics = new List<AI_Template_Metric__c>();
    for (EmailMessage msg : Trigger.new) {
        EmailMessage oldMsg = Trigger.oldMap.get(msg.Id);
        // Only track AI-generated messages whose body changed
        if (oldMsg.AI_Generated__c && msg.HtmlBody != oldMsg.HtmlBody) {
            String original = oldMsg.HtmlBody == null ? '' : oldMsg.HtmlBody;
            String edited = msg.HtmlBody == null ? '' : msg.HtmlBody;
            // Apex String has a built-in Levenshtein distance method
            Integer editDistance = original.getLevenshteinDistance(edited);
            Decimal editRatio = (Decimal) editDistance /
                Math.max(original.length(), 1);
            msg.AI_Edit_Ratio__c = editRatio;
            msg.AI_Was_Edited__c = (editRatio > 0.05);
            // Collect metrics; never perform DML inside the loop
            metrics.add(new AI_Template_Metric__c(
                Template_Name__c = oldMsg.AI_Template_Name__c,
                Edit_Ratio__c = editRatio,
                User__c = UserInfo.getUserId(),
                Record_Type__c = 'EmailMessage',
                Timestamp__c = DateTime.now()
            ));
        }
    }
    // Single bulk insert after the loop (governor-limit safe)
    if (!metrics.isEmpty()) {
        insert metrics;
    }
}
Template Testing Strategy
Testing prompt templates is different from testing code. The output is non-deterministic. The same input can produce different outputs across invocations. You cannot assert exact string matches. You need a different approach.
Testing Framework for Prompt Templates:
==========================================
Level 1: Structural Tests (automated)
- Output is not empty
- Output is within expected length range (100-300 words)
- Output does not contain merge field syntax ({Account.Name})
- Output does not contain "null", "undefined", or "N/A" in weird places
- Output does not contain PII that should have been masked
Level 2: Content Tests (semi-automated)
- Output mentions the account/contact name
- Output references data that was in the merged context
- Output does NOT reference data that was NOT in the context (hallucination)
- Output follows the template's style instructions (formal/casual)
- Output respects the length constraint
Level 3: Edge Case Tests (manual review)
- Record with all fields empty
- Record with extremely long field values (4,000 char description)
- Record with special characters in name (O'Brien, Muller & Sons)
- Record with non-English content in fields
- Record with contradictory data (industry = Healthcare, description mentions software)
- Record with "test" or "duplicate" in the name
For Level 1 and 2 tests, build a test harness that invokes the template against a set of 50 representative records (10 well-populated, 10 sparse, 10 with edge cases, 10 from different record types, 10 random). Score each output against the criteria. Flag outputs that fail any criterion for manual review.
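The Level 1 checks are pure string functions and easy to automate. A sketch, where the word-count thresholds are assumptions you would tune per template:

```python
import re

# Level 1 structural checks as a pure function; thresholds mirror the
# criteria above and are assumptions to tune per template.
def structural_issues(output: str, min_words=100, max_words=300):
    issues = []
    if not output or not output.strip():
        return ["empty output"]
    words = len(output.split())
    if not (min_words <= words <= max_words):
        issues.append(f"length {words} words outside {min_words}-{max_words}")
    if re.search(r"\{[\w.]+\}", output):
        issues.append("unresolved merge field syntax")
    if re.search(r"\b(null|undefined|N/A)\b", output):
        issues.append("literal null/undefined/N/A in output")
    return issues

bad = "Dear {Contact.FirstName}, your revenue of null is impressive."
print(structural_issues(bad, min_words=5, max_words=50))
```

Anything these checks flag goes straight to the manual-review queue; outputs that pass move on to the Level 2 content checks.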
Test Record Categories (50 records):
Well-populated (10):
Full data across all merge fields. Baseline quality check.
Sparse (10):
50-70% of merge fields empty. Tests defensive template logic.
Edge cases (10):
Special characters, very long values, contradictions.
Different record types (10):
Different industries, sizes, stages. Tests template versatility.
Random production sample (10):
Randomly selected from production data. Reality check.
Scoring:
Each output scored 0-5 on: Accuracy, Relevance, Completeness,
Tone, No Hallucination
Aggregate score per template:
> 4.0 average: Ready for production
3.0-4.0: Needs instruction tuning
< 3.0: Fundamental template redesign needed
Batch Operations: Templates at 50,000 Records
When you need to run a prompt template against thousands of records (e.g., generate account summaries for all accounts before a quarterly review), you face three constraints: API rate limits, token costs, and quality variance.
Batch Template Execution Constraints:
API Rate Limits:
Einstein API: ~100 requests/minute (varies by org edition)
50,000 records at 100/min = 500 minutes = 8.3 hours
Token Costs (approximate):
Input: ~500 tokens/record (template + merged data)
Output: ~200 tokens/record
Total: 700 tokens x 50,000 = 35,000,000 tokens
Cost at GPT-4o-mini pricing: ~$5-10
Cost at GPT-4o pricing: ~$100-200
Quality Variance:
5% of outputs will need manual review (edge cases, bad data)
50,000 x 5% = 2,500 records needing human review
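These estimates generalize into a small calculator. The per-million-token prices below are assumptions based on the figures above; check current pricing before relying on them:

```python
# Back-of-envelope batch estimator; prices per million tokens are
# assumptions (roughly GPT-4o-class) and should be checked against
# current pricing before use.
def batch_estimate(records, req_per_min=100,
                   in_tokens=500, out_tokens=200,
                   in_price_per_m=2.50, out_price_per_m=10.00):
    hours = records / req_per_min / 60
    cost = (records * in_tokens / 1e6 * in_price_per_m +
            records * out_tokens / 1e6 * out_price_per_m)
    return round(hours, 1), round(cost, 2)

hours, cost = batch_estimate(50_000)
print(f"{hours} hours, ${cost}")  # 8.3 hours, $162.5
```

Running the numbers before kicking off a batch prevents the two classic surprises: an overnight job that is still running at 9 a.m., and a token bill nobody budgeted for.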
The practical approach for batch operations:
Batch Strategy:
1. Pre-filter: Run a report to identify records with sufficient data
quality. Skip records where > 50% of merge fields are empty.
Result: 50,000 -> 35,000 viable records.
2. Segment: Group by record type or data quality tier.
Tier A (well-populated): Use the standard template.
Tier B (sparse): Use the defensive template with simplified output.
Tier C (poor data): Skip or use a "data quality alert" template.
3. Throttle: Process in batches of 200 with 2-second delays.
Respects API limits. Takes ~6 hours for 35,000 records.
4. Validate: Run Level 1 structural tests on all outputs.
Flag failures for manual review.
5. Sample review: Manually review 2% random sample (700 records).
Extrapolate quality to the full batch.
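Steps 1-3 of that strategy can be sketched as a tier-then-throttle loop. The field names and tier thresholds here are assumptions standing in for real data-quality rules:

```python
import time

# Tier a record by how many key merge fields are populated. Field names
# and thresholds are illustrative assumptions.
def tier_for(record: dict) -> str:
    merge_fields = ["Industry", "AnnualRevenue", "Description", "Website"]
    populated = sum(1 for f in merge_fields if record.get(f))
    ratio = populated / len(merge_fields)
    if ratio >= 0.75:
        return "A"   # well-populated: standard template
    if ratio >= 0.5:
        return "B"   # sparse: defensive template, simplified output
    return "C"       # poor data: skip (or data quality alert template)

def process_in_batches(records, batch_size=200, delay_s=2.0, run=None):
    """Filter out tier C, then process throttled batches."""
    for i in range(0, len(records), batch_size):
        batch = [r for r in records[i:i + batch_size] if tier_for(r) != "C"]
        if run:
            run(batch)        # caller picks the tier A/B template per record
        time.sleep(delay_s)   # stay under the API rate limit
```

In production the `run` callback would invoke the tier-appropriate prompt template and collect outputs for the Level 1 validation pass.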
The Template Quality Flywheel
Build Template
│
▼
Test Against 50 Records ─── Fix Issues ──┐
│ │
▼ │
Deploy to Production │
│ │
▼ │
Monitor Edit Rate + Reject Rate │
│ │
▼ │
Identify Patterns │
- Which records produce bad output? │
- Which users edit the most? │
- Which fields are most often missing? │
│ │
▼ │
Improve Data OR Improve Template ──────────┘
The flywheel works in two directions. Sometimes the template needs better instructions. Sometimes the data needs better quality. Monitoring tells you which. If 90% of bad outputs come from records with empty Industry fields, the fix is a data quality campaign to populate Industry, not a template redesign.
Templates at scale are a data quality problem disguised as an AI problem. The LLM is the easy part. The hard part is ensuring that 50,000 records have enough usable data to generate meaningful output. Invest in data quality before you invest in template sophistication. Need help building templates that survive production data? We have deployed them across orgs with 100K+ records.