Four Stages, Two ISAs: A Pipelined RV32IF Core on PYNQ-Z1 FPGA

This project implements a pipelined RISC-V SoC on the Digilent PYNQ-Z1 FPGA. The core is RV32I with CSR support, paired with a pipelined RV32F floating-point unit.

The four-stage in-order pipeline includes hazard detection, data forwarding, and precise control-flow handling, covering end-to-end hardware-software co-design from RTL to a processor that boots and runs C benchmarks.

The system was verified using the official RISC-V ISA tests and end-to-end workloads executed directly on FPGA hardware. Final numbers: 58 MHz operating frequency, ~1.16 integer CPI, ~1.83 floating-point CPI, and an FOM of 12.3.

Developed within a UC Berkeley hardware course supported by Apple's New Silicon Initiative and co-taught with industry researchers from NVIDIA.

Affiliation

UC Berkeley

Partners

Lawrence Rhee

Keywords

Verilog
Xilinx Vivado
RISC-V
FPGA
PYNQ-Z1
RTL Design
Hardware-Software Co-Design
ISA Compliance Testing

▸ Deepdive

Introduction

This project is a self-contained RV32I + RV32F SoC built for the Digilent PYNQ-Z1 (Xilinx Zynq-7020) as the final lab of UC Berkeley’s EE151. The deliverable is a four-stage in-order pipeline with single-cycle forwarding, a multi-stage pipelined floating-point unit, UART-tethered boot through an on-chip BIOS, local instruction/data BRAMs, and MMIO performance counters. It is small enough to fit on a $100 board, yet complete enough to run the riscv-isa-tests harness, a C benchmark suite (matrix multiply, sorts, a BDD evaluator, a UART parser), and a floating-point matrix-multiply benchmark. The interesting part of the writeup isn’t that it works (RV32I cores are a solved problem) but the specific micro-architectural choices that fall out of fitting it on a Zynq-7020 with on-chip memories only.

Problem Definition

The core implements RV32I + the Zicsr extension (with tohost at CSR 0x51E as the test-completion signal) plus a subset of RV32F covering fadd / fsub / fmul / fmadd / fmsub / fnmadd / fnmsub, the sign-injection family, register-to-register moves, and int ↔ fp conversion. A program is a sequence of 32-bit instructions in $\mathcal{I}_{\text{int}} \cup \mathcal{I}_{\text{fp}}$ executed under the standard RISC-V architectural model: integer state $(x_0, \dots, x_{31})$ with $x_0 \equiv 0$, FP state $(f_0, \dots, f_{31})$, a single program counter, and the CSR file. The execution model is precise: at any commit boundary the architectural state must be exactly what an in-order interpretation of the program produces, which constrains how a pipelined FPU can interleave with the integer pipe (see Approach).
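
As an illustration of the sign-injection family mentioned above, the following sketch models the three RV32F operations on raw 32-bit patterns. It follows the RISC-V specification's definitions; the function names mirror the instruction mnemonics, and nothing here is taken from the project's RTL.

```python
SIGN_MASK = 0x8000_0000  # bit 31 of an IEEE 754 single-precision word
MAG_MASK = 0x7FFF_FFFF   # exponent and mantissa bits


def fsgnj(rs1: int, rs2: int) -> int:
    """Magnitude of rs1 with the sign bit of rs2."""
    return (rs1 & MAG_MASK) | (rs2 & SIGN_MASK)


def fsgnjn(rs1: int, rs2: int) -> int:
    """Magnitude of rs1 with the inverted sign bit of rs2."""
    return (rs1 & MAG_MASK) | (~rs2 & SIGN_MASK)


def fsgnjx(rs1: int, rs2: int) -> int:
    """rs1 with its sign bit XORed with the sign bit of rs2."""
    return rs1 ^ (rs2 & SIGN_MASK)


ONE = 0x3F80_0000      # 1.0f
NEG_TWO = 0xC000_0000  # -2.0f

# fsgnj with both sources equal is a plain move, fsgnjn is negate,
# and fsgnjx is absolute value.
negated = fsgnjn(ONE, ONE)
absolute = fsgnjx(NEG_TWO, NEG_TWO)
```

Because none of these operations touch the exponent or mantissa, they need no rounding or normalization, which is why they sit naturally in a cheap combinational stage.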

The performance figure of merit is cycles per instruction over a workload $W$ ,

$$\mathrm{CPI}(W) \;=\; \frac{\sum_{w \in W} \text{cycles}(w)}{\sum_{w \in W} \text{committed-instructions}(w)},$$

computed on-chip from the four memory-mapped counters at 0x8000_0010 through 0x8000_0020. The instruction counter increments only on committed (non-bubble, non-killed) instructions, so flushed instructions and load-use bubbles add cycles without adding committed instructions and correctly raise $\mathrm{CPI}$. The metric reflects useful work, not raw cycles per fetched word.
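
A minimal sketch of the formula above, assuming each benchmark yields one (cycles, committed instructions) pair read back from the counters. The `CounterSample` type and the sample numbers are hypothetical and are not measurements from this design.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CounterSample:
    """One benchmark's counter readout (hypothetical container)."""
    cycles: int
    instructions: int  # committed instructions only


def workload_cpi(samples: Iterable[CounterSample]) -> float:
    """CPI(W): total cycles divided by total committed instructions."""
    samples = list(samples)
    total_cycles = sum(s.cycles for s in samples)
    total_instructions = sum(s.instructions for s in samples)
    if total_instructions == 0:
        raise ValueError("no committed instructions in workload")
    return total_cycles / total_instructions


# Made-up readouts for two benchmarks.
samples = [CounterSample(cycles=120, instructions=100),
           CounterSample(cycles=300, instructions=200)]
cpi = workload_cpi(samples)  # 420 / 300
```

Note that this is a ratio of sums, so long-running benchmarks weigh more than short ones; it is not the mean of per-benchmark CPIs.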

The system’s hard constraints are the Zynq-7020’s resources (BRAMs, LUTs, DSP slices), the PYNQ-Z1’s $125\,\text{MHz}$ FPGA clock that the PLL must derive the CPU clock from, and the requirement that everything from BIOS to user programs must boot through a $115\,200\,\text{baud}$ UART without an external memory controller, i.e., no DDR, no SD card boot.

Approach

Figure: Full CPU datapath. Four pipeline stages (fetch, decode, execute/memory, writeback) with forwarding muxes feeding the ALU operand network, a dedicated FP register file feeding a multi-stage FPU on the side, and a write port from EX to IMEM that lets the BIOS install user code over UART before jumping to it. The diagram shows a parameterised PC feeding an IMEM / BIOS mux into the fetch stage; an instruction-decode and immediate-generation block in the decode stage with the integer and FP register files; an execute/memory stage with ALU, branch comparator, forwarding muxes, DMEM, and MMIO ports; and a writeback stage with load extension, MMIO read mux, and integer/FP register-file writes.

The system decomposes into the integer pipeline, the floating-point unit that hangs off it, the on-chip memory hierarchy, and the UART-tethered boot path. Each subsection below covers one of these.

Four-Stage Integer Pipeline

Fetch is parameterised on a base PC and muxes between IMEM and the BIOS ROM based on PC[31:28] (0x4 → BIOS, 0x1 → IMEM); both memories are synchronous BRAMs. Decode emits a control word, generates immediates, reads the integer and FP register files, and runs the hazard-detection unit that produces the kill signals for mispredicted branches and the stall signals for load-use bubbles and FP-busy. Execute/Memory contains the ALU, the branch comparator, the forwarding network (EX→EX and WB→EX, both single-cycle), the DMEM port with byte enables, the MMIO bus, and a write port back into IMEM, which is what lets the BIOS load user code (more on that below). Writeback handles load sign/zero extension, the MMIO read mux, and the integer/FP register-file writes plus CSR commits.
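
The byte enables on the DMEM port and the load extension in writeback can be sketched as below. This is a behavioural model, not the project's RTL: it uses the standard RV32I funct3 encodings, assumes naturally aligned accesses, and the function names are made up for illustration.

```python
def store_byte_enables(funct3: int, addr: int) -> int:
    """4-bit write mask for sb (funct3=0), sh (1), sw (2)."""
    base_mask = {0: 0b0001, 1: 0b0011, 2: 0b1111}[funct3]
    # Shift the mask to the byte lane selected by addr[1:0].
    return (base_mask << (addr & 0x3)) & 0xF


def load_extend(funct3: int, addr: int, word: int) -> int:
    """Extract and extend a load result from a 32-bit memory word.

    funct3: lb=0, lh=1, lw=2, lbu=4, lhu=5.
    """
    if funct3 == 2:
        return word & 0xFFFF_FFFF
    if funct3 not in (0, 1, 4, 5):
        raise ValueError(f"unsupported load funct3: {funct3}")
    bits = 8 if funct3 in (0, 4) else 16
    value = (word >> (8 * (addr & 0x3))) & ((1 << bits) - 1)
    is_signed = funct3 in (0, 1)
    if is_signed and value >> (bits - 1):
        # Replicate the sign bit into the upper bits.
        value |= (0xFFFF_FFFF << bits) & 0xFFFF_FFFF
    return value


# sb to byte lane 3 enables only the top byte of the word.
mask = store_byte_enables(0, 0x1000_0003)
# lb of byte 1 from 0x0000_8000 sign-extends 0x80.
loaded = load_extend(0, 0x1000_0001, 0x0000_8000)
```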

Forwarding is single-cycle by construction: a result computed in EX is available for the next instruction’s operand mux without going through writeback, and a result already in WB is available for any instruction two slots behind it. That covers every back-to-back ALU dependency. The one case it can’t cover is a load followed immediately by a use, since the load result isn’t available until the end of the same cycle the dependent instruction would need it in EX; that injects one bubble via the hazard unit. Branches are resolved in EX and kill IF and ID; the pipeline takes one bubble per mispredict.
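
The hazard and forwarding rules described above can be captured in a small behavioural model. This is a simplified sketch keyed on instruction distance rather than on the actual pipeline registers. The `Inst` record and both function names are hypothetical, and a bubble is represented as `None`.

```python
from typing import NamedTuple, Optional


class Inst(NamedTuple):
    """Register fields of one instruction (None means unused)."""
    rd: Optional[int]
    rs1: Optional[int]
    rs2: Optional[int]
    is_load: bool = False


def load_use_stall(older: Inst, younger: Inst) -> bool:
    """True if `younger` directly follows a load whose result it reads."""
    if not older.is_load or older.rd in (None, 0):
        return False  # x0 is never a real dependency
    return older.rd in (younger.rs1, younger.rs2)


def forward_select(src: Optional[int],
                   prev: Optional[Inst],
                   prev2: Optional[Inst]) -> str:
    """Pick the operand source for register `src`.

    prev is the instruction one slot ahead, prev2 two slots ahead.
    Assumes the load-use bubble has already been inserted, so a load
    can only appear in the prev2 position when its result is needed.
    """
    if src in (None, 0):
        return "regfile"
    if prev is not None and prev.rd == src:
        return "EX"   # youngest producer wins
    if prev2 is not None and prev2.rd == src:
        return "WB"
    return "regfile"


add = Inst(rd=5, rs1=1, rs2=2)
lw = Inst(rd=7, rs1=2, rs2=None, is_load=True)
use = Inst(rd=8, rs1=7, rs2=5)
needs_bubble = load_use_stall(lw, use)
```

After the bubble, the load sits two slots ahead of its consumer, so `forward_select(7, None, lw)` resolves through the WB path.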

Multi-Stage Floating-Point Unit

The FPU has its own 3R/1W register file (three reads so a single FMA reads all three operands in one cycle) and is structured as a short pipeline: a combinational Stage 1 for the operations that are cheap (single-cycle multiply, sign-inject, moves, integer↔fp convert), then a Stage 2 align/normalize path for add/sub and the FMA mantissa work, then a retire stage that writes the FP register file. The latency of any given FP operation is tracked explicitly, and the integer pipeline asserts a stall on the writeback stage until the FP result retires.

That last detail is the design’s main constraint and the reason it’s done this way: holding the integer pipe while a long-latency FP op completes preserves precise architectural state, so at any cycle boundary the architectural registers reflect exactly the in-order semantics of the program. Out-of-order retire would have required an explicit re-order buffer and dependency tracking on FP destinations, which is more area and complexity than this design budget allows. The trade-off is throughput: a back-to-back chain of FP adds bottlenecks on FPU latency rather than on the integer pipe’s $\mathrm{CPI}$. The benchmark suite includes fpmmult precisely so this trade-off shows up in the FOM numbers.
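
To make the throughput cost concrete, here is a rough cycle model of a blocking FPU. The latencies in `ASSUMED_FP_LATENCY` are placeholders chosen for illustration, not the measured latencies of this design, and the model ignores every other hazard.

```python
# Hypothetical cycles each FP op occupies the pipe before it retires.
ASSUMED_FP_LATENCY = {"fmul": 1, "fsgnj": 1, "fadd": 2, "fmadd": 3}


def cycles_with_blocking_fpu(ops, fp_latency=ASSUMED_FP_LATENCY):
    """Count cycles when the integer pipe is held until each FP op retires.

    Any op not listed in fp_latency is treated as a one-cycle
    integer instruction.
    """
    return sum(fp_latency.get(op, 1) for op in ops)


# Made-up inner-loop body: two loads, one fused multiply-add,
# a pointer bump, and a branch.
loop_body = ["lw", "lw", "fmadd", "addi", "bne"]
cycles = cycles_with_blocking_fpu(loop_body)
loop_cpi = cycles / len(loop_body)
```

Under these assumed latencies the single fmadd costs two extra cycles per iteration, which is the kind of overhead a blocking design trades for precise state.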

Memory Map and Address Partitioning

Figure: Memory map of the SoC. Top-nibble routing decides where a load/store goes: DMEM, IMEM (write-only), BIOS ROM (read-only), or MMIO. The same numeric address can refer to different physical memories depending on whether it’s used as a PC or as a data address. The diagram shows IMEM and DMEM regions both at base 0x10000000 (PC reads IMEM, data accesses read/write DMEM, IMEM is write-only as a data address), BIOS ROM at 0x40000000 read-only, the MMIO bus at 0x80000000 with eight word-aligned addresses for UART control / RX / TX, the cycle / instruction / branch / branch-correct counters and a counter reset, and the tohost CSR at 0x51E for ISA-test completion signalling. A legend distinguishes the PC and DA (data-address) paths.

The memory map collapses to a single observation: PC[31:28] and DA[31:28] are independent selectors. 0x1000_0000 as a PC fetches from IMEM; 0x1000_0000 as a data address writes to IMEM (loader path) or reads from DMEM (a separate physical BRAM). 0x4000_0000 is BIOS for the PC path only. 0x8000_xxxx is the MMIO bus, which spans the UART control/RX/TX trio, the four free-running performance counters (cycle, instruction, branch, branch-correct), and a store-to-clear at 0x8000_0018 that resets all four counters together. The dual interpretation isn’t a hack: it’s what lets the BIOS, which is itself executing from 0x4000_0000, write incoming UART bytes into IMEM at 0x1000_0000 and then jal to that address to run user code, all without an external memory controller. The IMEM write port from EX is the one piece of hardware that closes that loop.
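
The independent-selector idea can be sketched as two small decode functions. This is an illustrative model only: the text does not spell out how a store in the 0x1 region is split between DMEM and the IMEM write port, so the sketch assumes it reaches both. Data-side reads of the BIOS ROM are left out, and the function names are hypothetical.

```python
def route_fetch(pc: int) -> str:
    """Memory selected on the PC path by PC[31:28]."""
    return {0x1: "IMEM", 0x4: "BIOS"}.get((pc >> 28) & 0xF, "unmapped")


def route_data(addr: int, is_store: bool) -> tuple:
    """Targets selected on the data path by DA[31:28]."""
    nibble = (addr >> 28) & 0xF
    if nibble == 0x1:
        # Assumption: a store drives both write ports; a load reads DMEM.
        return ("DMEM", "IMEM") if is_store else ("DMEM",)
    if nibble == 0x8:
        return ("MMIO",)
    return ("not modelled",)


# The same numeric address resolves differently on the two paths.
as_pc = route_fetch(0x1000_0000)
as_load = route_data(0x1000_0000, is_store=False)
as_store = route_data(0x1000_0000, is_store=True)
```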

UART-Tethered Boot

The BIOS is a UART command shell living in BIOS ROM. The host script hex_to_serial streams .hex images at $115\,200\,\text{baud}$ to the BIOS, which writes them into IMEM and DMEM via the EX-stage IMEM/DMEM write ports. From the BIOS prompt, jal 10000000 jumps to the loaded program; load/store commands (lw / lhu / lbu / sw / sh / sb) let the host inspect or patch any memory location. The performance counters at 0x8000_0010+ are exposed to user code via the 151_library MMIO helpers, so a benchmark can wrap its kernel between a counter-reset store and three loads to read cycles, retired instructions, and branch stats out to the UART before returning to BIOS.
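
Since every image crosses the UART, load time scales directly with image size. A back-of-the-envelope sketch, assuming 8N1 framing (10 bit times per byte on the wire) and counting whatever bytes the host actually sends; the text does not state the framing or wire encoding, and the function name is made up.

```python
def uart_transfer_seconds(wire_bytes: int,
                          baud: int = 115_200,
                          bits_per_frame: int = 10) -> float:
    """Time to move wire_bytes over a UART.

    bits_per_frame=10 assumes 8N1: start bit, 8 data bits, stop bit.
    """
    return wire_bytes * bits_per_frame / baud


# 11,520 bytes on the wire take one second at 115,200 baud under 8N1.
seconds = uart_transfer_seconds(11_520)
```

If the loader sends the image as ASCII hex rather than raw bytes, the wire byte count is at least twice the image size, and the load time grows accordingly.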

Results

The design fits on the Zynq-7020 with the BIOS, IMEM, DMEM, and FPU all inferred as BRAMs, and passes the full regression suite from hardware/run_all_sims: the RISC-V ISA tests in software/riscv-isa-tests/, the C micro-tests in software/c_tests/ (fib, sum, strcmp, cachetest, vecadd, replace), and the directed assembly suite in software/asm/. Five groups of end-to-end workloads run on real hardware:

mmult: integer matrix multiply
bsort, ssort: integer sorting
bdd: binary decision diagram evaluation
fpmmult: single-precision floating-point matrix multiply (exercises the FPU pipeline and the integer-pipe-stall logic)
echo and uart_parse: UART-driven workloads that exercise the MMIO RX/TX path

Each benchmark reports cycles, committed instructions, $\mathrm{CPI}$, and branch-prediction correctness from the on-chip counters via the UART console. Self-modifying loader correctness is verified by the BIOS file command writing into IMEM and jal jumping into the loaded code, the same path used to install every benchmark in the first place. The tohost-based completion signal closes the loop for ISA-test PASS/FAIL automation.
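
For the PASS/FAIL automation, a host-side or testbench-side check might decode tohost as below. This sketch assumes the upstream riscv-tests convention (a value of 1 means pass; a failure writes the failing test number shifted left by one with bit 0 set); this project's harness may use a different encoding, and `decode_tohost` is a hypothetical helper.

```python
from typing import Optional, Tuple


def decode_tohost(value: int) -> Tuple[str, Optional[int]]:
    """Interpret a tohost CSR value under the assumed convention."""
    if value == 0:
        return ("running", None)  # nothing written yet
    if value == 1:
        return ("pass", None)
    # Assumed failure encoding: (test_number << 1) | 1.
    return ("fail", value >> 1)


status = decode_tohost(1)
failure = decode_tohost((5 << 1) | 1)  # test case 5 failed
```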

Future Work

The single biggest performance lever remaining is the load-use bubble: every load followed immediately by a dependent use eats one cycle, and the bubble is unconditional because the hazard unit doesn’t peek inside the load to know whether the value would actually have been forwarded in time. A forwarding-with-stall hybrid that resolves the load address in EX, does the DMEM read in the same cycle (already true for BRAM with a one-cycle read latency), and forwards the result to the next instruction’s operand mux at the end of writeback would eliminate the bubble for the case where the dependent instruction’s EX falls in the same cycle as the load’s writeback. This is a small change to the forwarding mux and the hazard unit.

The FPU’s integer-pipe stall is correct but conservative. A scoreboard-style FP destination tracker would let integer instructions that don’t depend on the in-flight FP destination proceed in parallel, lifting fpmmult throughput at the cost of more elaborate forwarding logic and a slightly bigger hazard unit. The same change opens the door to issuing back-to-back FP adds without serialising on FPU latency. Currently the FPU pipeline is filled and drained between dependent instructions, which is the dominant cost on the fpmmult benchmark.
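
One way such a tracker could behave is sketched below. This models a possible future design, not anything in the current core: the `FpScoreboard` class, its method names, and the latency values are all hypothetical.

```python
class FpScoreboard:
    """Tracks FP destination registers with results still in flight."""

    def __init__(self):
        self.busy = {}  # FP register number -> cycles until retire

    def can_issue(self, fp_srcs, fp_dst=None) -> bool:
        """An instruction may issue if none of its FP registers are busy."""
        regs = set(fp_srcs)
        if fp_dst is not None:
            regs.add(fp_dst)
        return not any(r in self.busy for r in regs)

    def issue(self, fp_dst: int, latency: int) -> None:
        """Mark fp_dst busy for `latency` cycles."""
        self.busy[fp_dst] = latency

    def tick(self) -> None:
        """Advance one cycle and free registers whose results retired."""
        self.busy = {r: n - 1 for r, n in self.busy.items() if n > 1}


sb = FpScoreboard()
sb.issue(fp_dst=3, latency=2)           # e.g. an fadd writing f3
independent_ok = sb.can_issue([1, 2])   # unrelated FP op proceeds
integer_ok = sb.can_issue([])           # integer op has no FP registers
dependent_ok = sb.can_issue([3])        # reader of f3 must wait
```

Checking the destination as well as the sources keeps writes to the same FP register in program order, which a design without a re-order buffer still needs.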

Beyond micro-architecture, the next interesting direction is adding a small instruction cache in front of BIOS and IMEM. The current design has BIOS in one BRAM and IMEM in another, with no caching layer, which is fine for the $115\,200$-baud UART boot path but starts to bite for code with mixed BIOS/user-code execution (e.g., user code that calls back into a BIOS routine for UART I/O). A small direct-mapped I-cache fronting both memories would reduce the visible latency of the boundary crossing without adding a memory controller or DRAM to the system.

Finally, the project’s biggest external opportunity is the empty space between this RV32I+F core and the Zynq-7020’s hard ARM cores: the same FPGA fabric that hosts this CPU sits next to a Cortex-A9 SMP cluster on the PS side. A small AXI bridge between the RISC-V SoC and the PS would let a Linux userspace on the ARM expose this core as an accelerator for whatever workload it’s been profiled on (currently the FOM-favourable ones, e.g., a tight bdd evaluator or a tuned integer matrix multiply). That converts the project from “a complete in-order RV32I+F SoC” into “a measured-yourself accelerator block on a heterogeneous SoC,” which is the more interesting framing for anything downstream.