Mar 2026 – Mar 2026

AI · SWE

Last edited May 23, 2026

Type-Aware Hybrid RAG for Factoid QA

Built a RAG system for short-answer factoid QA over UC Berkeley EECS pages. The pipeline crawls EECS subdomains into a JSONL corpus, generates candidate QA pairs with LLM assistance, and indexes chunks with enriched retrieval text (page title, URL host/path tokens, and content) to improve BM25 lexical matching.

At query time the system builds weighted query variants per question type (person, email, location, date/year, etc.), fuses BM25 scores, applies domain-specific reranking, and answers with an instruction-tuned LLM under strict short-answer formatting. A deterministic extractive fallback (regex + relation patterns + overlap scoring) takes over when the LLM is disabled.

Achieved token-level F1 of 76% on the validation set and 92% on the holdout-mini set with the LLM enabled. Inter-annotator agreement was 86.7% Exact Match and 92.4% token-level F1. Ablations isolate the LLM as the largest single contributor to end-to-end accuracy.

Affiliation

UC Berkeley

Partners

Report

Manuscript

Keywords

NLP
RAG
LLMs
BM25
BFS Crawl
Query Expansion
Slot Filling
Python
Slurm

▸ Deepdive

Introduction

This project is a retrieval-augmented question-answering system built for UC Berkeley’s CS288 (Natural Language Processing), targeting short-form factoid questions over the UC Berkeley EECS domain. The pipeline crawls EECS pages into a JSONL corpus, indexes enriched chunks with BM25, retrieves with type-aware reranking, and answers with an instruction-tuned LLM under strict short-form formatting. Two components sit beside that main path: a deterministic regex/entity fallback for when the LLM is disabled or unavailable, and a soft “question-prior” module that biases retrieval toward URLs that similar past questions resolved to. The headline number on the validation set is a token-level $F_1$ of $\mathbf{0.7579}$. The more interesting numbers are in the ablations: switching the LLM off drops $F_1$ to $0.2844$ on validation but only to $0.6496$ on holdout-mini (from $0.9239$ with the LLM enabled), a gap between splits that says more about the system’s failure modes than the headline does.

Problem Definition

A factoid QA system over a corpus $\mathcal{C}$ takes a natural-language question $q$ and produces a short-form answer string $\hat{a}$. Each $(q, a^\star, u^\star)$ triple in the evaluation set is grounded in a specific source URL $u^\star \in \mathcal{C}$, and the system is judged on two metrics computed over the normalised predicted and gold answers, exact match (EM) and token-level $F_1$:

$$\mathrm{EM} \;=\; \mathbf{1}\!\left[\, \mathrm{norm}(\hat{a}) = \mathrm{norm}(a^\star) \,\right], \qquad F_1 \;=\; \frac{2 \cdot |T_{\hat{a}} \cap T_{a^\star}|}{|T_{\hat{a}}| + |T_{a^\star}|},$$

where $T_x$ is the bag of normalised tokens in $x$ and $\mathrm{norm}(\cdot)$ lowercases, strips punctuation, and collapses whitespace. Two retrieval-side diagnostics are reported alongside the headline metric: URL recall@$k$ (does $u^\star$ appear among the top-$k$ retrieved chunks) and the answer-in-context rate (does $a^\star$ appear verbatim in any retrieved chunk). These separate retrieval failures from generation failures. The system can score zero on a question for at least four very different reasons, and the headline $F_1$ doesn’t tell them apart.
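The two metrics can be sketched in a few lines of Python. This is a minimal illustration of the definitions above, not the project's grader: the function names are hypothetical, punctuation stripping is assumed to cover ASCII punctuation only, and returning $1.0$ when both answers are empty is an assumed convention.

```python
import string
from collections import Counter


def norm(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def exact_match(pred: str, gold: str) -> int:
    """1 if the normalised strings are identical, else 0."""
    return int(norm(pred) == norm(gold))


def token_f1(pred: str, gold: str) -> float:
    """2 * |overlap| / (|pred tokens| + |gold tokens|) over token bags."""
    pred_tokens = Counter(norm(pred).split())
    gold_tokens = Counter(norm(gold).split())
    overlap = sum((pred_tokens & gold_tokens).values())  # multiset intersection
    total = sum(pred_tokens.values()) + sum(gold_tokens.values())
    if total == 0:
        return 1.0  # assumed convention: two empty answers agree
    return 2 * overlap / total
```

For instance, predicting "387 Soda Hall" against a gold answer of "Soda Hall" fails exact match but shares two of the five tokens counted across both strings, giving $F_1 = 4/5$.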

The validation set contains $100$ questions drawn from $33$ unique URLs with an average question length of $9.7$ tokens (median $10$) and an average answer length of $2.19$ tokens (median $2$, range $1{-}7$). A blind inter-annotator agreement subset of $30$ questions reached $86.7\,\%$ EM and $92.4\,\%$ token-level $F_1$, which is the ceiling that any system competing on this evaluation could reasonably expect to hit. Three holdout sets ($30 + 31 + 31$ questions) sit on top for generalisation checks.

Background

A modern QA system can in principle just feed the question to a large language model and let it generate an answer from its parametric knowledge, but for any domain whose facts post-date the model’s training cut-off, or whose facts simply aren’t memorised at high enough fidelity, that approach fails open: the model confidently produces an answer that is fluent, plausible, and wrong. Retrieval-augmented generation (RAG) inserts a retrieval stage in front of the generator so that the model conditions its answer on a small set of documents pulled from a corpus at query time rather than on its parametric memory. The shape of the pipeline is invariant across most RAG systems: index a corpus, retrieve $k$ candidate chunks per question, and condition a generator on those chunks. The design choices live entirely in what is indexed, how it is retrieved, and how the generator is constrained.

Sparse Retrieval: BM25 in One Equation

The retrieval baseline is BM25, a sparse, term-frequency-based ranker that scores a query $q$ against a document $d$ as

$$\mathrm{BM25}(q, d) \;=\; \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \!\left(1 - b + b \cdot \frac{|d|}{\overline{|d|}}\right)},$$

where $f(t, d)$ is the term frequency of $t$ in $d$, $\mathrm{IDF}(t)$ is the inverse document frequency, $|d|$ is the document length, $\overline{|d|}$ is the average length over the corpus, and $k_1$, $b$ are tuning constants (typically $k_1 \approx 1.5$, $b \approx 0.75$). The intuition is simple: rare terms count more, terms that appear many times in a document count more but with saturation, and long documents get a length penalty so they can’t dominate by accident. BM25 also has the practical virtue of needing no training: apart from the two constants, it has no learned parameters. Critically for this project, BM25 is lexical: it rewards exact token overlap, which is why enriching each chunk with its URL host/path and title tokens improves retrieval. A question about “the Director of External Relations” lexically overlaps the URL path /people/staff/external-relations-staff/ more cleanly than it overlaps the page body, and BM25 catches that overlap for free.
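A minimal pure-Python sketch of this scoring function follows. The source does not say which IDF variant or BM25 library the project uses, so the smoothed form $\log\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)$ here is an assumption (one common, non-negative variant), and `bm25_scores` is a hypothetical name.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against a tokenised query."""
    if not docs_tokens:
        return []
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue  # terms absent from the document contribute nothing
            # Assumed IDF variant (smoothed, always non-negative).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

On a toy corpus, a document containing both "external" and "relations" outranks a longer one that merely repeats "external", which is the saturation and length-penalty behaviour described above.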

Dense Retrieval and the Transformer Encoder

The natural complement to BM25 is dense retrieval, where both query and chunk are projected into a shared semantic vector space by a transformer encoder and nearest-neighbour search is performed in that space. The canonical encoders are Sentence-BERT (Reimers & Gurevych, 2019) and Dense Passage Retrieval (Karpukhin et al., 2020), both of which fine-tune a BERT-family transformer so that semantically related pairs end up close in embedding space (Sentence-BERT with siamese classification, regression, or triplet objectives compared under cosine similarity; DPR with a contrastive loss over in-batch negatives compared under dot-product similarity). The encoder itself is just a stack of self-attention layers whose outputs are pooled into a fixed-dimensional embedding per input; the design choice that matters is whether to use a bi-encoder (independent embeddings for $q$ and $d$, a similarity score afterwards, fast and cacheable) or a cross-encoder (joint $q$ + $d$ input, full self-attention across both, much slower but typically more accurate). Bi-encoders are practical for first-pass retrieval over a corpus the size of an EECS crawl; cross-encoders are typically reserved for reranking the top tens of candidates because their per-pair cost forbids running them over the full corpus.

Dense retrieval is not free: it captures synonymy and paraphrase that BM25 misses, but it also loses some of BM25’s exact-string rigour, which on a corpus that contains identifiers like gradadmissions@eecs.berkeley.edu or 387 Soda Hall is exactly the kind of thing you’d rather not lose. In practice, hybrid sparse+dense retrieval with reciprocal rank fusion is the more robust answer for factoid QA over a structured domain, and that’s the natural next step for this project (see Future Work).
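Reciprocal rank fusion itself is small enough to sketch. The code below illustrates the proposed future-work step, not anything the current system runs; `reciprocal_rank_fusion` is a hypothetical name and $k = 60$ is the constant commonly used in the RRF literature, assumed here rather than tuned.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending fused score.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Break ties on the id so the output is deterministic.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


sparse_ranking = ["a", "b", "c"]  # e.g. from BM25
dense_ranking = ["b", "c", "a"]   # e.g. from a bi-encoder
fused_ranking = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because only ranks enter the score, the fusion needs no calibration between BM25 scores and embedding similarities, which is a large part of its appeal for combining sparse and dense retrievers.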

Retrieve-Then-Generate and Why Generators Hallucinate Less With Context

Once $k$ chunks have been retrieved, the generation stage concatenates them with the question into a single prompt and asks an instruction-tuned LLM to produce a short answer. The LLM is itself a transformer, a stack of self-attention blocks with causal masking, and what makes RAG work is the simple observation that conditioning the model on the actual passage containing the answer substantially reduces hallucination: instead of generating from priors over what the answer “should” look like, the model is mostly performing extractive selection over the visible context, with light paraphrasing. The cost is that the generator’s output is now bottlenecked on retrieval quality: if the right chunk isn’t in the prompt, no amount of clever decoding will produce the right answer.

For factoid QA specifically, the generator is doing very little work: the answers are short (averaging $2.19$ tokens in the validation set), highly templated by question type (a date, an email, a person’s name), and grounded in a single span of the retrieved context. This is why a deterministic regex/entity-detector fallback gets much closer to the LLM on holdout-mini ($F_1$ of $0.6496$ against $0.9239$) than on validation ($0.2844$ against $0.7579$). For clean factoid types, the LLM’s main contribution is robust span selection under context distractors and consistent short-form formatting, not deep reasoning. The validation set is harder for the fallback because it includes more disambiguation cases (multiple candidate emails on the same contact page, multiple staff members in adjacent paragraphs), and there the LLM’s context-aware selection is what closes the gap.

Approach

Figure: System architecture of the type-aware hybrid RAG pipeline. An offline crawler over eecs.berkeley.edu builds a JSONL corpus of 8417 documents, which are chunked with overlap and enriched with title, URL host/path, and body before indexing with BM25. At query time the question is typed (person, email, location, date, program, binary) and expanded into per-type weighted query variants whose BM25 scores are fused; the top $k$ chunks are retrieved and reranked with type-aware bonuses, and an instruction-tuned LLM produces a short-form answer that is postprocessed for course-code, degree, date, and unknown normalisation. Solid teal is the primary retrieve-then-generate path. Dashed orange is the offline-built BM25 index and the deterministic regex and entity-detector fallback that activates when the LLM is disabled. Dashed purple is the question-prior module, derived from curated past QA pairs, which biases retrieval with a URL hint and Jaccard-similarity transfer (always as a soft bias, never as a hard filter) and can short-circuit the pipeline on an exact normalised-question match.

The system decomposes into a crawl/corpus build, a type-aware retrieval stage, a generation/fallback stage, and a question-prior module that sits beside the main path.

Corpus Construction

A URL crawler walks links whose host contains eecs.berkeley.edu, skipping non-HTML and unsupported file types, and dumps the extracted page text into a JSONL corpus. The primary corpus contains $8\,417$ documents at an average of $292.9$ words each; chunks are overlapping windows over the page body. At retrieval time each chunk is represented by enriched text rather than its raw body: title terms, URL host and path tokens, and canonical URL features (e.g., flags for legacy www2-style hosts and PDF-like paths) are prepended to the chunk content. This is the single highest-leverage choice in the retrieval pipeline: a question that asks about a specific role typically overlaps lexically with the URL path of that role’s page (e.g., /people/staff/external-relations-staff/) more cleanly than with the page body, and BM25 over the enriched representation catches that.
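The enrichment step might look roughly like the sketch below. The exact field order, tokenisation, and flag vocabulary are not given in the source, so `url_tokens`, `enrich_chunk`, and the flag tokens `legacyhost` and `pdfdoc` are hypothetical stand-ins.

```python
import re
from urllib.parse import urlparse


def url_tokens(url: str) -> list[str]:
    """Split the URL host and path into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path}".lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", raw) if tok]


def enrich_chunk(title: str, url: str, body: str) -> str:
    """Prepend title, URL tokens, and URL feature flags to the chunk body."""
    parsed = urlparse(url)
    flags = []
    if parsed.netloc.lower().startswith("www2."):
        flags.append("legacyhost")  # hypothetical flag for legacy hosts
    if parsed.path.lower().endswith(".pdf"):
        flags.append("pdfdoc")      # hypothetical flag for PDF-like paths
    parts = [title, " ".join(url_tokens(url)), " ".join(flags), body]
    return "\n".join(part for part in parts if part)
```

With this representation, the path /people/staff/external-relations-staff/ contributes the tokens `external` and `relations` to the indexed text even when the page body never uses that exact phrase.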

Counterintuitively, the less aggressively the page text is cleaned, the better BM25 performs: heavy cleaning strips exact strings, formatting markers, and token patterns that BM25 relies on for lexical matching. The crawl-as-is corpus outperformed the cleaned variants in side-by-side comparison.

Type-Aware Retrieval

For each question, the system first classifies the answer type (person, email, location, date/year, program, binary), then expands the question into a small bundle of weighted query variants tailored to that type. The variants share most of the original query tokens but lean on type-specific cues (e.g., for email questions the variants over-weight @, host tokens, and contact-page n-grams; for person questions they over-weight role/title tokens). BM25 scores for each variant are fused, and the top-$k$ candidates are then run through a type-aware reranker that applies domain-specific bonuses (e.g., reward chunks whose URL path matches the question’s typed cue) and penalties (e.g., suppress obvious distractor pages). The default is $k = 3$: small enough to keep distractor spans out of the LLM context window, but large enough to clear the URL-recall bar.
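The typing, expansion, and fusion steps can be sketched as follows. Everything specific here is an assumption made for illustration: the regex rules, the cue lists, the variant weights, and the choice of a weighted sum as the fusion rule (the source only says the scores are fused). `score_fn` stands in for any scorer that maps a token list to one score per chunk, such as BM25.

```python
import re

# Hypothetical keyword rules; the first matching rule wins.
TYPE_RULES = [
    ("email", re.compile(r"\b(e-?mail|contact address)\b", re.I)),
    ("date", re.compile(r"\b(when|what year|which year|deadline)\b", re.I)),
    ("location", re.compile(r"\b(where|which room|which building)\b", re.I)),
    ("person", re.compile(r"\b(who|whose)\b", re.I)),
    ("binary", re.compile(r"^(is|are|does|do|did|can|was|were)\b", re.I)),
]

# Hypothetical type-specific cue tokens appended to the boosted variant.
TYPE_CUES = {
    "email": ["email", "contact"],
    "person": ["staff", "director"],
    "location": ["room", "hall"],
    "date": ["deadline", "year"],
}


def classify_question(question):
    for qtype, pattern in TYPE_RULES:
        if pattern.search(question):
            return qtype
    return "other"


def query_variants(question, qtype):
    """Return (weight, tokens) pairs: the plain query plus a cue-boosted one."""
    base = re.findall(r"[a-z0-9@]+", question.lower())
    variants = [(1.0, base)]
    cues = TYPE_CUES.get(qtype, [])
    if cues:
        variants.append((0.5, base + cues))
    return variants


def fuse_scores(variants, score_fn, n_docs):
    """Weighted sum of per-variant scores (assumed fusion rule)."""
    fused = [0.0] * n_docs
    for weight, tokens in variants:
        for i, score in enumerate(score_fn(tokens)):
            fused[i] += weight * score
    return fused
```

The fused scores would then feed the top-$k$ selection and the type-aware reranker; keeping the scorer pluggable makes it easy to compare variant bundles without touching the index.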

Generation and Deterministic Fallback

In generation mode the LLM sees the question plus the top-ranked chunks, prompted to return only the answer text under strict short-form rules (Yes/No for binary questions, unknown when the context does not support an answer, no surrounding prose). Whichever string the model returns is then postprocessed for consistency, because the grader is strict about formatting: course-code canonicalisation (CS 70 ↔ CS70), degree-format normalisation, date/year/number cleanup, and unknown-string normalisation.
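Two of these postprocessing rules are easy to sketch. The source does not state which course-code form is canonical, so the spaced form ("CS 70") is assumed here, and the department prefixes and the list of unknown phrasings are hypothetical.

```python
import re

# Hypothetical phrasings that should collapse to the literal "unknown".
UNKNOWN_FORMS = {"unknown", "n/a", "not specified", "cannot be determined"}

# Hypothetical department prefixes; a real list would be longer.
COURSE_RE = re.compile(r"\b(CS|EE|EECS)\s*(\d+[A-Za-z]?)\b", re.I)


def postprocess(answer: str) -> str:
    """Normalise whitespace, unknown strings, and course codes."""
    text = " ".join(answer.split()).strip("\"'")
    if not text or text.lower() in UNKNOWN_FORMS:
        return "unknown"
    # Assumed canonical form: upper-case prefix, one space, upper-case number.
    return COURSE_RE.sub(
        lambda m: f"{m.group(1).upper()} {m.group(2).upper()}", text
    )
```

Whichever direction the canonicalisation runs, the point is that predictions and gold answers are pushed to the same surface form before the strict grader compares them.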

When the LLM is disabled or its call fails, the system swaps in a deterministic extractor that combines regex/entity detectors (emails, phone numbers, dates, named entities), relation patterns, and overlap-based candidate scoring over the same retrieved chunks. The fallback is dramatically weaker than the LLM on validation but, as the ablations below show, much closer to parity on holdout.
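For a single answer type, the fallback idea can be sketched like this: detect candidates with a regex, then prefer the candidate whose surrounding text overlaps the question most. This is an illustrative reduction that covers emails only; the real extractor also uses relation patterns and other detectors. The regex, the 60-character window, and the function name are assumptions.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
WINDOW = 60  # assumed number of context characters on each side


def word_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_email(question, chunks):
    """Return the email whose local context best overlaps the question."""
    question_words = word_set(question)
    best, best_score = "unknown", -1
    for chunk in chunks:
        for match in EMAIL_RE.finditer(chunk):
            start = max(0, match.start() - WINDOW)
            context = chunk[start:match.end() + WINDOW]
            score = len(question_words & word_set(context))
            if score > best_score:  # strict: earlier chunks win ties
                best, best_score = match.group(0), score
    return best
```

The sketch also shows why disambiguation is hard for this kind of extractor: when two emails sit in the same paragraph their context windows overlap, and the overlap score can no longer tell them apart.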

Question-Prior Module

After the initial system underperformed on the hidden dev set, a lightweight question-prior module was added. It does two things with the curated past-QA file. First, an optional exact normalised-question lookup short-circuits the pipeline when an identical past question exists. Second, and more importantly, a soft Jaccard-similarity transfer over previously seen questions picks out the most similar past question with a matching answer type, and uses the URL it resolved to as a retrieval bias (not a hard filter) on the current query’s BM25 reranking.
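The soft transfer can be sketched as a Jaccard lookup followed by an additive bonus. The record layout of the past-QA file, the similarity threshold, and the bonus size are all assumptions; the property being illustrated is that the prior only reorders candidates and never removes one.

```python
import re


def word_set(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def url_prior(question, qtype, past_qa, min_sim=0.5):
    """Find the most similar past question of the same answer type.

    past_qa: list of dicts with 'question', 'type', and 'url' keys
    (an assumed layout). Returns (url or None, similarity).
    """
    query_words = word_set(question)
    best_url, best_sim = None, 0.0
    for item in past_qa:
        if item["type"] != qtype:
            continue
        sim = jaccard(query_words, word_set(item["question"]))
        if sim > best_sim:
            best_url, best_sim = item["url"], sim
    if best_sim < min_sim:
        return None, best_sim
    return best_url, best_sim


def apply_prior(candidates, prior_url, sim, bonus=2.0):
    """Add a similarity-scaled bonus to matching URLs, then re-sort.

    candidates: list of (url, score). Nothing is filtered out.
    """
    rescored = [
        (url, score + (bonus * sim if url == prior_url else 0.0))
        for url, score in candidates
    ]
    return sorted(rescored, key=lambda pair: -pair[1])
```

Scaling the bonus by the similarity means that a near-duplicate past question pulls hard toward its URL while a loosely related one barely moves the ranking.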

The trade-off here is explicit and noted in the report: the prior module substantially improves dev-set accuracy but risks overfitting if the evaluation distribution is close to the curated QA file. To keep the two operating modes separable, all ablations are reported both with and without the prior, and the headline number on validation is the prior-disabled $F_1$ .

Results

The validation $F_1$ of $0.7579$ leaves a gap of $\approx 17$ points to the IAA ceiling of $0.924$. Two retrieval diagnostics localise where that gap comes from: URL recall@4 is $73\,\%$ (the correct source URL is among the top four retrieved chunks for $73$ of $100$ questions) and answer-in-context is $81\,\%$ (the gold answer string appears verbatim in some retrieved chunk for $81$ of $100$). The gap between $81\,\%$ answer-in-context and $65\,\%$ EM is the generation-side cost: on roughly $16$ of the $100$ questions the LLM and the postprocessor have the right context but pick the wrong span or canonicalise it incorrectly.
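Both diagnostics are cheap to compute per run. The sketch below assumes a hypothetical per-question record with `retrieved_urls`, `gold_url`, `chunks`, and `gold_answer` fields, and treats "verbatim" as a case-insensitive, whitespace-normalised substring match, which is an assumption about the original harness.

```python
def url_recall_at_k(retrieved_urls, gold_url, k):
    """1 if the gold URL is among the first k retrieved URLs, else 0."""
    return int(gold_url in retrieved_urls[:k])


def answer_in_context(chunks, gold_answer):
    """1 if the gold answer occurs as a substring of any retrieved chunk."""
    gold = " ".join(gold_answer.lower().split())
    return int(any(gold in " ".join(c.lower().split()) for c in chunks))


def diagnose(examples, k=3):
    """Average both diagnostics over a non-empty list of example records."""
    n = len(examples)
    recall = sum(
        url_recall_at_k(e["retrieved_urls"], e["gold_url"], k) for e in examples
    )
    in_context = sum(
        answer_in_context(e["chunks"], e["gold_answer"]) for e in examples
    )
    return {"url_recall_at_k": recall / n, "answer_in_context": in_context / n}
```

Reading the two rates next to EM is what attributes an error to a stage: a miss on answer-in-context points at retrieval, while a hit on answer-in-context with a wrong prediction points at generation or postprocessing.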

Ablation 1 · LLM vs. fallback-only

Set	LLM enabled	LLM disabled	Δ $F_1$
Validation	$F_1 = 0.7579$ , EM $= 0.6500$	$F_1 = 0.2844$ , EM $= 0.1800$	$-0.474$
Holdout-mini	$F_1 = 0.9239$ , EM $= 0.8333$	$F_1 = 0.6496$ , EM $= 0.6000$	$-0.274$

The LLM is the largest single contributor on both splits, but the gap to fallback is much larger on validation than on holdout-mini. Holdout-mini’s questions are more amenable to the regex/entity extractor, likely because the curated questions skew toward clean factoid types (emails, phone numbers, dates) that the extractor handles natively. Validation is harder for the extractor because more of its questions involve disambiguation between similar entities on the same page, which the LLM resolves via context and the extractor can’t.

Ablation 2 · Retrieval depth in fallback-only mode

Set	$k = 3$	$k = 8$
Validation (fallback)	$F_1 = 0.2867$	$F_1 = 0.2698$
Holdout-mini (fallback)	$F_1 = 0.6718$	$F_1 = 0.6496$

Going from $k = 3$ to $k = 8$ hurts the fallback extractor on both splits. The mechanism is straightforward: the deterministic extractor scores candidate spans across all retrieved chunks, so adding lower-ranked chunks adds distractor spans without adding many new correct ones, and the extractor’s precision drops. This is also weak evidence that the type-aware reranker’s top-3 is already capturing most of the recoverable answers; a deeper retrieve+rerank stack would need to be paired with a smarter extractor to be worth it.

Error Analysis

The $20$ zero-$F_1$ cases break down as: $9$ retrieval misses (the corpus has the answer, the retriever didn’t surface the right chunk, the LLM defaulted to unknown), $5$ wrong-entity selections (right page, wrong person/email), $4$ metric-driven false negatives, and $2$ postprocessing artefacts. The $4$ metric false negatives are worth singling out because they are not bugs in the system; they are bugs in the headline metric. Adding a normalisation-aware secondary metric (alphanumeric spacing, number-form equivalence, minor spelling variants) would catch all four without weakening the strict EM/$F_1$ contract.
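One possible shape for such a secondary metric is sketched below. It covers only spacing and number-form equivalence; spelling variants would need an edit-distance tolerance and are left out. The number-word table and the function names are hypothetical, and dropping all whitespace is a deliberately lenient assumption that suits a diagnostic metric but not the headline one.

```python
import re

# Hypothetical small table; a fuller one would go beyond ten.
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
    "ten": "10",
}


def lenient_norm(text):
    """Lowercase, drop punctuation, map number words, remove all spaces."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    tokens = [NUMBER_WORDS.get(tok, tok) for tok in text.split()]
    return "".join(tokens)  # "cs 70" and "cs70" now compare equal


def lenient_match(pred, gold):
    """1 if the answers agree after lenient normalisation, else 0."""
    return int(lenient_norm(pred) == lenient_norm(gold))
```

Reported next to strict EM rather than in place of it, the difference between the two scores directly counts how many misses are formatting disagreements rather than wrong answers.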

Future Work

Retrieval is the dominant remaining bottleneck. On the $9$ retrieval-miss zero-$F_1$ cases, the right URL never makes it into the top-$k$, so no amount of generation-side cleverness can recover the answer. Two complementary fixes attack this: (1) augment the BM25 sparse retriever with a dense retriever (sentence-BERT-style encodings of chunks and queries) and fuse the two with reciprocal rank fusion, which would help in particular on questions whose lexical overlap with the answer-bearing chunk is low; and (2) add a learned reranker on top of the fused candidates (a cross-encoder over (question, chunk) pairs), which could also reduce the wrong-entity selections by jointly attending to the question and chunk before the LLM ever sees the context.

A more focused improvement is on legacy/www2-style pages and document-like (PDF-derived) pages, which the system consistently underperforms on. The crawler treats these as the same shape of input as modern EECS pages, but their chunk structure, formatting, and URL conventions all differ; per-host chunking parameters and per-host URL feature engineering would close most of that gap without changing the rest of the pipeline.

The fallback extractor is the weakest part of the system and the one most worth strengthening, because how well the system holds up with the LLM disabled is the single biggest determinant of where it can run. The current extractor combines regex/entity detectors with relation patterns and overlap-based scoring; adding a small distilled QA model (e.g., a SQuAD-trained extractive head) as an alternate extractor could lift the no-LLM $F_1$ closer to the LLM-enabled number, narrowing the $0.474$-point gap on validation that is the most uncomfortable number in the report.

Finally, the evaluation harness itself is worth investing in. The current pipeline reports the headline $F_1$ and EM and almost nothing else; making URL recall@ $k$ , answer-in-context, per-question-type breakdowns, and a normalisation-aware secondary metric first-class diagnostics on every run would let future experiments be much more honest about which part of the pipeline an improvement came from. The four metric false negatives in the error analysis aren’t surprising once seen, but they wouldn’t have surfaced at all without a manual look at the zero- $F_1$ cases.