Mar 2026 – Mar 2026
AI · SWE
Type-Aware Hybrid RAG for Factoid QA
Built a RAG system for short-answer factoid QA over UC Berkeley EECS pages. The pipeline crawls EECS subdomains into a JSONL corpus, generates candidate QA pairs with LLM assistance, and indexes chunks with enriched retrieval text (page title, URL host/path tokens, and content) to improve BM25 lexical matching.
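The enriched retrieval text described above could be assembled roughly as follows. This is a minimal sketch, not the project's actual code: the function names `url_tokens` and `build_retrieval_text`, the tokenization rule, and the example URL are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url: str) -> list[str]:
    """Split a URL's host and path into lowercase alphanumeric tokens.

    Hypothetical helper: the real pipeline may tokenize URLs differently.
    """
    parts = urlsplit(url)
    raw = f"{parts.netloc} {parts.path}".lower()
    return [tok for tok in re.split(r"[^a-z0-9]+", raw) if tok]


def build_retrieval_text(title: str, url: str, content: str) -> str:
    """Concatenate page title, URL tokens, and chunk content.

    The combined string is what a lexical index such as BM25 would see,
    so terms that appear only in the title or URL can still match.
    """
    return " ".join([title, " ".join(url_tokens(url)), content])


if __name__ == "__main__":
    # Illustrative input only; not taken from the project's corpus.
    text = build_retrieval_text(
        "Faculty Directory",
        "https://www2.eecs.berkeley.edu/Faculty/Lists/",
        "Office hours are posted on each faculty page.",
    )
    print(text)
```

Putting title and URL tokens in the indexed text means a query such as "faculty list" can match a chunk whose body never uses those words.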
At query time the system builds weighted query variants per question type (person, email, location, date/year, etc.), fuses BM25 scores, applies domain-specific reranking, and answers with an instruction-tuned LLM under strict short-answer formatting. A deterministic extractive fallback (regex + relation patterns + overlap scoring) takes over when the LLM is disabled.
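One way to picture the weighted query variants and score fusion is the sketch below. It assumes a weighted sum as the fusion rule and uses a small self-contained Okapi BM25 implementation; the `VARIANT_TEMPLATES` table, its weights, the `BM25` class, and `fused_scores` are hypothetical names and values chosen for illustration, not the system's actual configuration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lens = [sum(tf.values()) for tf in self.tfs]
        self.avgdl = sum(self.lens) / len(self.lens)
        df = Counter(term for tf in self.tfs for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def scores(self, query: str) -> list[float]:
        terms = tokenize(query)
        out = []
        for tf, dl in zip(self.tfs, self.lens):
            norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            out.append(sum(
                self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in terms if t in tf
            ))
        return out


# Hypothetical (template, weight) pairs per question type.
VARIANT_TEMPLATES = {
    "email": [("{q}", 1.0), ("{q} email contact", 0.6)],
    "person": [("{q}", 1.0), ("{q} professor faculty", 0.5)],
}


def fused_scores(index: BM25, question: str, qtype: str) -> list[float]:
    """Sum BM25 scores across query variants, scaled by each variant's weight."""
    variants = VARIANT_TEMPLATES.get(qtype, [("{q}", 1.0)])
    total = [0.0] * len(index.tfs)
    for template, weight in variants:
        for i, s in enumerate(index.scores(template.format(q=question))):
            total[i] += weight * s
    return total
```

With this shape, an "email" question boosts chunks that mention contact details without discarding the original query, and unknown question types fall back to plain BM25. Domain-specific reranking would then operate on the fused list.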
Achieved token-level F1 of 0.76 on the validation set and 0.92 on the holdout-mini set with the LLM enabled. Inter-annotator agreement was 86.7% Exact Match and 92.4% token-level F1. Ablations identify the LLM as the largest single contributor to end-to-end accuracy.
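For reference, Exact Match and token-level F1 for short answers are commonly computed along the lines below. This sketch assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the project's actual scorer may normalize differently, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and articles, and split on whitespace.

    Assumed normalization; note that stripping punctuation also alters
    answers such as email addresses.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        # Both empty counts as a match; one empty counts as a miss.
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

A prediction with one extra token, such as "in Soda Hall" against the gold answer "Soda Hall", keeps full recall but loses precision, which is why token-level F1 tends to sit above Exact Match on the same data.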
Affiliation
UC Berkeley
Partners
Report
- Manuscript
Keywords
- NLP
- RAG
- LLMs
- BM25
- BFS Crawl
- Query Expansion
- Slot Filling
- Python
- Slurm
▸ Deepdive
Under development.