Mar 2026 – Mar 2026

AI · SWE

Type-Aware Hybrid RAG for Factoid QA

Built a RAG system for short-answer factoid QA over UC Berkeley EECS pages. The pipeline crawls EECS subdomains into a JSONL corpus, generates candidate QA pairs with LLM assistance, and indexes chunks with enriched retrieval text (page title, URL host/path tokens, and content) to improve BM25 lexical matching.
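As a rough illustration of the enriched retrieval text described above, the sketch below joins a page title, tokens split out of the URL host and path, and the chunk content into one string for lexical indexing. The function names, the tokenization rule, and the example URL are assumptions made for illustration, not the project's actual code.

```python
import re
from urllib.parse import urlparse


def url_tokens(url: str) -> list[str]:
    """Split a URL's host and path into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path}".lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", raw) if tok]


def build_retrieval_text(title: str, url: str, content: str) -> str:
    """Concatenate title, URL tokens, and content (hypothetical layout)."""
    parts = [title.strip(), " ".join(url_tokens(url)), content.strip()]
    # Drop empty parts so pages without a title do not produce blank lines.
    return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    # Placeholder URL, not a real crawled page.
    print(build_retrieval_text(
        "Faculty Directory",
        "https://eecs.example.edu/people/faculty.html",
        "Office hours are posted each semester.",
    ))
```

Putting URL tokens into the indexed text lets a query term such as "faculty" match a page whose body never uses that word but whose path does.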

At query time the system builds weighted query variants per question type (person, email, location, date/year, etc.), fuses BM25 scores, applies domain-specific reranking, and answers with an instruction-tuned LLM under strict short-answer formatting. A deterministic extractive fallback (regex + relation patterns + overlap scoring) takes over when the LLM is disabled.
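The query-variant and score-fusion step can be sketched as follows. This is a minimal, self-contained example that assumes a standard Okapi BM25 scorer and a weighted sum as the fusion rule. The templates, weights, and `k1`/`b` values are hypothetical, and the reranking and LLM answering stages are left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Small Okapi BM25 scorer over an in-memory list of documents."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        tokens = [tokenize(d) for d in docs]
        self.tf = [Counter(t) for t in tokens]
        self.doc_len = [len(t) for t in tokens]
        self.avgdl = sum(self.doc_len) / max(len(docs), 1)
        df = Counter(term for t in tokens for term in set(t))
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def scores(self, query: str) -> list[float]:
        out = [0.0] * len(self.tf)
        for term in tokenize(query):
            idf = self.idf.get(term)
            if idf is None:
                continue  # term never appears in the corpus
            for i, tf in enumerate(self.tf):
                f = tf.get(term, 0)
                if f:
                    norm = f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
                    out[i] += idf * f * (self.k1 + 1) / norm
        return out


# Hypothetical per-type templates: (query template, fusion weight).
VARIANT_TEMPLATES = {
    "email": [("{q}", 1.0), ("{q} email contact", 0.6)],
    "location": [("{q}", 1.0), ("{q} office room building", 0.5)],
}


def query_variants(question: str, qtype: str) -> list[tuple[str, float]]:
    templates = VARIANT_TEMPLATES.get(qtype, [("{q}", 1.0)])
    return [(tpl.format(q=question), w) for tpl, w in templates]


def fused_scores(index: BM25, question: str, qtype: str) -> list[float]:
    """Weighted sum of BM25 scores across all variants of one question."""
    total = [0.0] * len(index.tf)
    for variant, weight in query_variants(question, qtype):
        for i, score in enumerate(index.scores(variant)):
            total[i] += weight * score
    return total
```

With such a scheme, a question typed as `email` gives extra weight to chunks that mention contact vocabulary, even when the question itself never uses those words.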

Achieved token-level F1 of 0.76 on the validation set and 0.92 on the holdout-mini set with the LLM enabled. Inter-annotator agreement was 86.7% Exact Match and 0.924 token-level F1. Ablations identify the LLM as the largest single contributor to end-to-end accuracy.
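For reference, Exact Match and token-level F1 for short answers are commonly computed as in the sketch below. It assumes SQuAD-style normalization (lowercasing, removing punctuation and English articles); the project's exact normalization is not stated here, so treat these details as assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, then split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty scores zero.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a prediction with one extra token, such as "306 Soda Hall" against the gold answer "Soda Hall", fails Exact Match but still earns partial F1 credit.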

Affiliation

UC Berkeley

Report

  • Manuscript

Keywords

  • NLP
  • RAG
  • LLMs
  • BM25
  • BFS Crawl
  • Query Expansion
  • Slot Filling
  • Python
  • Slurm

Deepdive

Under development.