Engineering notes

How the FAQ Widget actually works.

One Next.js deploy. No vector database. No embedding model. A small keyword retrieval algorithm, a strictly grounded Gemini prompt, and a sandboxed iframe. Below: every step of the pipeline, the security decisions, and the gaps that are honestly disclosed.

Pipeline — step by step

From question to grounded answer.

  1. 01

    CORS preflight (OPTIONS)

    The widget runs on third-party domains, so every request is cross-origin. An OPTIONS handler returns the CORS headers (Access-Control-Allow-Origin: *) so browsers permit the preflight. This is an explicit, documented trade-off — the open policy is mitigated by rate limiting. See the security stance below.

  2. 02

    Rate limit by IP

    50 questions per IP per 24 hours via Upstash sliding window (prefix: rl:faq). The open CORS policy makes this endpoint especially abuse-prone — rate limiting is the primary control. When Upstash is not configured, the app degrades gracefully: no rate limiting, but a console warning is emitted.

  3. 03

    Zod validation

    question: 3–500 chars; kb: 20–40 000 chars. Both fields are required. Validation runs before any retrieval or AI call — invalid input is rejected with a 400 and a typed error code. No stack traces reach the client.

  4. 04

    Keyword retrieval (kb.ts)

    The KB is split on double-newlines, then chunked to ≤800 chars at sentence boundaries. Both the question and each chunk are tokenised into lowercase word sets with stop words removed. Jaccard token overlap scores each chunk. Top-5 passages are returned. If overlap is zero for all chunks (stop-word-only query), the first 5 chunks are returned as a fallback so the model always has context to work with.

  5. 05

    Prompt construction (prompt.ts)

    The system instruction contains: (a) the assistant persona + rules, and (b) the top-5 KB passages as TRUSTED CONTEXT. The user question is placed in the user turn only as "VISITOR QUESTION: …" — it is NEVER interpolated into the system instruction. This separation is the prompt-injection defence: even if the question contains "Ignore previous instructions", the system rules still apply.

  6. 06

    Gemini 2.5 Flash call

    Temperature 0.1 (near-deterministic), maxOutputTokens 512. The grounding rule is explicit in the system prompt: answer only from context; say "I don't have that in my knowledge base" when the answer is not there. maxDuration: 30s on the route. Errors are caught and a typed MODEL_ERROR code is returned — the raw exception message never reaches the client.

  7. 07

    Response + passages

    The API returns { ok: true, answer, passages, remaining }. passages is the list of KB chunks the retrieval selected — the frontend can display them as "sources used". remaining tells the widget how many queries the user has left today. CORS headers are applied to every response, including error responses.

At a glance

ASCII pipeline diagram.

POST /api/ask  { question, kb }
       │
       ├─── OPTIONS preflight ──▶ CORS headers (204)
       │
       ▼
 ┌─────────────────────────────────────────────────┐
 │  1. Rate limit                                  │
 │     Upstash sliding window · 50/IP/day          │
 │     prefix: rl:faq                              │
 │     no-op (warn) if Redis not configured        │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  2. Zod validate                                │
 │     question: 3–500 chars                       │
 │     kb:       20–40 000 chars                   │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  3. Keyword retrieval  (kb.ts)                  │
 │                                                 │
 │  kb ──▶ split paragraphs ──▶ chunk (≤800 chars) │
 │                                                 │
 │  question ──▶ tokenise ──▶ stop-word filter     │
 │                                                 │
 │  ∀ chunk: Jaccard overlap score                 │
 │  sort desc ──▶ top-5 passages                   │
 │  fallback: first-5 when all scores = 0          │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  4. Build prompt  (prompt.ts)                   │
 │                                                 │
 │  system instruction:                            │
 │    · persona + grounding rules                  │
 │    · KB passages  ← TRUSTED CONTEXT             │
 │                                                 │
 │  user turn:                                     │
 │    "VISITOR QUESTION: {question}"               │
 │                       ↑ UNTRUSTED DATA          │
 │                                                 │
 │  Separation = prompt-injection defence          │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 ┌─────────────────────────────────────────────────┐
 │  5. Gemini 2.5 Flash                            │
 │     temp 0.1 · maxOutputTokens 512              │
 │                                                 │
 │     grounding rule in system prompt:            │
 │     "answer ONLY from context;                  │
 │      say 'I don't have that in my KB'           │
 │      if answer not found"                       │
 └──────────────────────┬──────────────────────────┘
                        │
                        ▼
 { ok: true, answer, passages, remaining }
 + CORS headers on every response
Design notes

Why no embeddings?

Simplicity

Keyword retrieval needs zero infrastructure. No vector database, no embedding API, no chunking pipeline. The KB is processed in-memory per request. For FAQs (short, well-scoped questions), keyword overlap is highly competitive with semantic similarity.

Latency

Keyword scoring over a 40k-char KB takes ~5ms. An embedding call adds 100–500ms and a vector DB lookup adds more. For a widget that must feel instant, in-memory retrieval wins.

Cost

No embedding API means no per-call embedding cost. The only variable cost is the Gemini generation call — ~$0.0003 per question at Gemini 2.5 Flash pricing. For a widget doing 50 questions/IP/day this is negligible.

Want this for your site?

Get an embeddable FAQ widget for your product.

This is a portfolio demo — the architecture is the real client build. If you have a support knowledge base and want a drop-in AI FAQ widget, get in touch.