Engineering notes

How the FAQ Widget actually works.

One Next.js deploy. No vector database. No embedding model. A small keyword retrieval algorithm, a strictly grounded Gemini prompt, and a sandboxed iframe. Below: every step of the pipeline, the security decisions, and the gaps that are honestly disclosed.

← back to playground

Pipeline — step by step

From question to grounded answer.

01
CORS preflight (OPTIONS)
The widget runs on third-party domains, so every request is cross-origin. An OPTIONS handler returns the CORS headers (Access-Control-Allow-Origin: *) so browsers permit the preflight. This is an explicit, documented trade-off — the open policy is mitigated by rate limiting. See the security stance below.
02
Rate limit by IP
50 questions per IP per 24 hours via Upstash sliding window (prefix: rl:faq). The open CORS policy makes this endpoint especially abuse-prone — rate limiting is the primary control. When Upstash is not configured, the app degrades gracefully: no rate limiting, but a console warning is emitted.
03
Zod validation
question: 3–500 chars; kb: 20–40 000 chars. Both fields are required. Validation runs before any retrieval or AI call — invalid input is rejected with a 400 and a typed error code. No stack traces reach the client.
04
Keyword retrieval (kb.ts)
The KB is split on double-newlines, then chunked to ≤800 chars at sentence boundaries. Both the question and each chunk are tokenised into lowercase word sets with stop words removed. Jaccard token overlap scores each chunk. Top-5 passages are returned. If overlap is zero for all chunks (stop-word-only query), the first 5 chunks are returned as a fallback so the model always has context to work with.
05
Prompt construction (prompt.ts)
The system instruction contains: (a) the assistant persona + rules, and (b) the top-5 KB passages as TRUSTED CONTEXT. The user question is placed in the user turn only as "VISITOR QUESTION: …" — it is NEVER interpolated into the system instruction. This separation is the prompt-injection defence: even if the question contains "Ignore previous instructions", the system rules still apply.
06
Gemini 2.5 Flash call
Temperature 0.1 (near-deterministic), maxOutputTokens 512. The grounding rule is explicit in the system prompt: answer only from context; say "I don't have that in my knowledge base" when the answer is not there. maxDuration: 30s on the route. Errors are caught and a typed MODEL_ERROR code is returned — the raw exception message never reaches the client.
07
Response + passages
The API returns { ok: true, answer, passages, remaining }. passages is the list of KB chunks the retrieval selected — the frontend can display them as "sources used". remaining tells the widget how many queries the user has left today. CORS headers are applied to every response, including error responses.

At a glance

ASCII pipeline diagram.

POST /api/ask  { question, kb }
       │
       ├─── OPTIONS preflight ──▶ CORS headers (204)
       │
       ▼
 ┌─────────────────────────────────────────────────┐
 │  1. Rate limit                                  │
 │     Upstash sliding window · 50/IP/day          │
 │     prefix: rl:faq                              │
 │     no-op (warn) if Redis not configured        │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  2. Zod validate                                │
 │     question: 3–500 chars                       │
 │     kb:       20–40 000 chars                   │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  3. Keyword retrieval  (kb.ts)                  │
 │                                                 │
 │  kb ──▶ split paragraphs ──▶ chunk (≤800 chars) │
 │                                                 │
 │  question ──▶ tokenise ──▶ stop-word filter     │
 │                                                 │
 │  ∀ chunk: Jaccard overlap score                 │
 │  sort desc ──▶ top-5 passages                   │
 │  fallback: first-5 when all scores = 0          │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  4. Build prompt  (prompt.ts)                   │
 │                                                 │
 │  system instruction:                            │
 │    · persona + grounding rules                  │
 │    · KB passages  ← TRUSTED CONTEXT             │
 │                                                 │
 │  user turn:                                     │
 │    "VISITOR QUESTION: {question}"               │
 │                       ↑ UNTRUSTED DATA          │
 │                                                 │
 │  Separation = prompt-injection defence          │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  5. Gemini 2.5 Flash                            │
 │     temp 0.1 · maxOutputTokens 512              │
 │                                                 │
 │     grounding rule in system prompt:            │
 │     "answer ONLY from context;                  │
 │      say 'I don't have that in my KB'           │
 │      if answer not found"                       │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 { ok: true, answer, passages, remaining }
 + CORS headers on every response

Security stance

What’s defended, what’s not.

Open CORS policy — deliberate trade-off

Access-Control-Allow-Origin: * is required for the widget to work on arbitrary third-party sites. This means any site can POST questions. Mitigations: (1) rate limiting by IP; (2) question length cap; (3) the KB is sent by the operator, not stored server-side — there is no persistent data to exfiltrate; (4) the model cannot perform actions — it only returns text.

Prompt injection

The user question is placed in the user turn only, never in the system instruction. The system prompt explicitly instructs the model to treat the question as pure data and ignore injection attempts. KB passages are the trusted context; the question is untrusted input.

Hallucination / ungrounded answers

The grounding rule is hard-coded in the system instruction: answer only from the provided KB passages; say "I don't have that in my knowledge base" when the answer is absent. Temperature 0.1 reduces creative variation. The retrieval fallback (first 5 chunks when overlap = 0) ensures the model always has context to evaluate against.

Rate limiting / abuse

50 questions per IP per day via Upstash Redis sliding window. Graceful no-op when Upstash is not configured — but deploying without Redis in production is a known gap that must be documented and accepted by the operator.

Input validation

Zod validates both inputs at the API boundary. question: 3–500 chars. kb: 20–40 000 chars. Oversized requests are rejected before the retrieval and model calls.

No server-side KB storage

The KB is sent with every request and never persisted server-side. There is no database to breach. The operator's KB content is in transit only — protected by TLS.

No stack traces to client

All errors are returned as typed { ok: false, error: AskErrorCode, message } objects. Raw exception messages and stack traces are logged server-side only.

Sandboxed iframe

embed.js creates the iframe with sandbox="allow-scripts allow-same-origin". allow-same-origin is needed for the /api/ask fetch; allow-scripts is needed for React. The parent page cannot access the iframe's DOM cross-origin. The KB is passed via postMessage, not URL query params, to avoid it appearing in server/referrer logs.

No Redis in production

If UPSTASH_REDIS_REST_* is not configured, rate limiting silently becomes a no-op. The app logs a warning, but operators must configure Redis before exposing this to the open web. This is a deployment responsibility gap, not a code gap.

KB confidentiality in transit

The KB is sent with every API request. On production HTTPS deployments this is TLS-protected. On HTTP deployments (local dev, misconfigured prod) the KB content is visible in transit. The embed.js snippet passes KB via postMessage, not URL params, reducing exposure in logs.

Design notes

Why no embeddings?

Simplicity

Keyword retrieval needs zero infrastructure. No vector database, no embedding API, no chunking pipeline. The KB is processed in-memory per request. For FAQs (short, well-scoped questions), keyword overlap is highly competitive with semantic similarity.

Latency

Keyword scoring over a 40k-char KB takes ~5ms. An embedding call adds 100–500ms and a vector DB lookup adds more. For a widget that must feel instant, in-memory retrieval wins.

Cost

No embedding API means no per-call embedding cost. The only variable cost is the Gemini generation call — ~$0.0003 per question at Gemini 2.5 Flash pricing. For a widget doing 50 questions/IP/day this is negligible.

From question to grounded answer.

CORS preflight (OPTIONS)

Rate limit by IP

Zod validation

Keyword retrieval (kb.ts)

Prompt construction (prompt.ts)

Gemini 2.5 Flash call

Response + passages

ASCII pipeline diagram.

Why no embeddings?

Simplicity

Latency

Cost

Get an embeddable FAQ widget for your product.