Implementing a simple SoC in Migen

In this note I’ll write about implementing a simple microcontroller based on an OpenRISC 1000 CPU core using Migen.

Prerequisites

To write this note, I was using:

  • migen commit 3cc73b9f4298a0a375b4bbd1bfc8807dfdb38ffd;
  • misoc commit b53b60c3f9aaa24388cccc96f9be367b92019533;
  • mor1kx commit fb519d011ae2524e3681f07b206df0a6c03f82a8;
  • Yosys version 0.6+205git42a9712;
  • IceStorm version 0~20160913git266e758-2;
  • arachne-pnr version 0.1+20151224git1a4fdf9;
  • uart.py from the earlier note;
  • binutils version 2.26.51.20160313 built with --target=or1k-none;
  • the iCE40-HX8K-B-EVN development board.

Newer versions will probably still work fine.

Implementation

The SoC consists of three main parts: the mor1kx wrapper, the SoC gateware, and the ROM code.

mor1kx wrapper

The mor1kx wrapper simply instantiates the CPU core from an adjacent checkout of the mor1kx git repository, and configures it to remove as many features as realistically possible, with the exception of narrowing the register file. (There is an option OPTION_RF_ADDR_WIDTH, but enabling it broke stores with the address in r2 while stores with address in r1, r3, r4 and r5 worked, for a reason I am unable to comprehend. There is also an option OPTION_RF_WORDS, which does nothing.)

There is a version of this wrapper in MiSoC, but it’s not configurable and enables too many features, e.g. caches, such that the resulting core won’t fit into iCE40-HX8K.

mor1kx.py (download)
import os

from migen import *

from misoc.interconnect import wishbone


class MOR1KX(Module):
    def __init__(self, platform, reset_pc):
        self.ibus = i = wishbone.Interface()
        self.dbus = d = wishbone.Interface()
        self.interrupt = Signal(32)

        ###

        i_adr_o = Signal(32)
        d_adr_o = Signal(32)
        self.specials += Instance("mor1kx",
            p_FEATURE_INSTRUCTIONCACHE="NONE",
            p_FEATURE_DATACACHE="NONE",
            p_FEATURE_TIMER="NONE",
            p_FEATURE_SYSCALL="NONE",
            p_FEATURE_TRAP="NONE",
            p_FEATURE_RANGE="NONE",
            p_FEATURE_OVERFLOW="NONE",
            p_FEATURE_SRA="NONE",
            p_FEATURE_ADDC="NONE",
            p_FEATURE_CMOV="NONE",
            p_FEATURE_FFL1="NONE",
            p_FEATURE_ATOMIC="NONE",
            p_FEATURE_MULTIPLIER="NONE",
            p_FEATURE_DIVIDER="NONE",
            p_FEATURE_STORE_BUFFER="NONE",
            p_OPTION_CPU0="PRONTO_ESPRESSO",
            p_OPTION_RESET_PC=reset_pc,
            p_IBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",
            p_DBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",

            i_clk=ClockSignal(),
            i_rst=ResetSignal(),

            i_irq_i=self.interrupt,

            o_iwbm_adr_o=i_adr_o,
            o_iwbm_dat_o=i.dat_w,
            o_iwbm_sel_o=i.sel,
            o_iwbm_cyc_o=i.cyc,
            o_iwbm_stb_o=i.stb,
            o_iwbm_we_o=i.we,
            o_iwbm_cti_o=i.cti,
            o_iwbm_bte_o=i.bte,
            i_iwbm_dat_i=i.dat_r,
            i_iwbm_ack_i=i.ack,
            i_iwbm_err_i=i.err,
            i_iwbm_rty_i=0,

            o_dwbm_adr_o=d_adr_o,
            o_dwbm_dat_o=d.dat_w,
            o_dwbm_sel_o=d.sel,
            o_dwbm_cyc_o=d.cyc,
            o_dwbm_stb_o=d.stb,
            o_dwbm_we_o=d.we,
            o_dwbm_cti_o=d.cti,
            o_dwbm_bte_o=d.bte,
            i_dwbm_dat_i=d.dat_r,
            i_dwbm_ack_i=d.ack,
            i_dwbm_err_i=d.err,
            i_dwbm_rty_i=0)

        self.comb += [
            self.ibus.adr.eq(i_adr_o[2:]),
            self.dbus.adr.eq(d_adr_o[2:])
        ]

        # add Verilog sources
        vdir = os.path.join(
            os.path.abspath(os.path.dirname(__file__)),
            "mor1kx", "rtl", "verilog")
        platform.add_source_dir(vdir)
        platform.add_verilog_include_path(vdir)

SoC gateware

The overall architecture of the SoC can be seen in this wonderful diagram:

Simple SoC
 +----------+   +-----------------+
 |          |   |     Wishbone    |
 | CPU core +---+    arbiter &    |
 |          |   |     decoder     |
 +----------+   +-+------+------+-+
                  |      |      |
         +--------+-+ +--+---+ ++-----+ +-----+
         | 256B RAM | | LEDs | | UART +-+ PHY |
         +----------+ +------+ +------+ +-----+

The full code can be downloaded; I will describe it part-by-part here.

Digression: the Wishbone bus

All of the components in this SoC are connected using the Wishbone bus, which is actually fairly simple as far as buses go, but it has a number of signals that are bewildering at first, and the specification is very obtuse and probably creates more confusion than it resolves.

When implementing a typical I/O peripheral (without any wait states and without support for any but word granularity access) in Migen, the following list can be used as a reference:

  • The rst and clk inputs are implicit as a part of the clock domain and not part of the Wishbone interface in Migen;
  • The cyc and stb inputs, when asserted together, indicate that there is a valid bus cycle and this peripheral is selected;
  • The adr input is the address of the access; it has the granularity of bus width (i.e. it counts words, not bytes) and includes every bit even if it’s already partially decoded by the decoder. For example, if a CPU is accessing the address 0x10001000 and a peripheral is mapped with a base address 0x10000000 on a 32-bit Wishbone bus, the peripheral will observe adr == 0x04000400, which is the byte address shifted right by two bits.
  • The dat_r output is the word being read from the peripheral; since its value does not affect anything when the peripheral is not selected, it can (and should, for simplicity) be updated regardless of whether there is a bus cycle.
  • The dat_w input is the word being written into the peripheral, and we is the write enable strobe; dat_w is only valid when all of cyc, stb and we are asserted.
  • The ack output should be asserted once a transaction successfully completes, i.e. in response to cyc and stb being asserted.
  • The err output can be asserted to abort the transaction; this will raise a bus error exception or the like.
  • The cti and bte signals are related to burst transfers and can be ignored.
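The relation between CPU byte addresses and the adr value in the list above can be checked with a few lines of plain Python. This is only a sketch on integers, not Migen code; the helper name wishbone_adr is made up for illustration.

```python
def wishbone_adr(byte_address, bus_width_bits=32):
    """Return the word-granular address that appears on `adr`."""
    bytes_per_word = bus_width_bits // 8
    return byte_address // bytes_per_word


adr = wishbone_adr(0x10001000)
print(hex(adr))        # 0x4000400
print(hex(adr << 2))   # 0x10001000, the original byte address

# A peripheral that only looks at the low bits of adr sees a register index:
# byte offsets 0 and 4 become word indices 0 and 1.
print(wishbone_adr(0x20000000) & 1)   # 0
print(wishbone_adr(0x20000004) & 1)   # 1
```

Note that the upper bits are still present in adr, so a peripheral has to mask them off itself if it only cares about the register index.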

The peripheral should also use registered feedback, i.e. the dat_r, ack and (if any) err outputs should be driven from synchronous logic (registers) rather than combinatorial logic. This improves timing closure: accesses take two cycles instead of one, but the critical path for the Wishbone bus then includes only signals in that bus, and not the rest of your design as well.

GPIO peripheral

The GPIO peripheral maps consecutive 32-bit write-only registers (with one bit used) to the output array, of no more than 8 outputs:

class SimpleGPIO(Module):
    def __init__(self, outputs):
        self.bus = bus = wishbone.Interface()

        ###

        self.sync += [
            bus.ack.eq(0),
            If(bus.cyc & bus.stb & ~bus.ack,
                bus.ack.eq(1),
                If(bus.we,
                    Array(outputs)[bus.adr & 0x7].eq(bus.dat_w[0])
                )
            )
        ]

Note this basic pattern:

self.sync += [
    bus.ack.eq(0),
    If(bus.cyc & bus.stb & ~bus.ack,
        bus.ack.eq(1),
        If(bus.we,
            ...)
    )
]

This makes sure that the code replaced by ... will execute only during a valid bus transaction (the bus.cyc & bus.stb part), execute exactly once per bus cycle (the “negative ack feedback”), and execute only during write transactions (the bus.we part).
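To see why the “negative ack feedback” yields exactly one write per transaction, here is a rough software model of the pattern in plain Python, where one call to tick() stands for one clock edge. The class name and the way the bus master is driven are assumptions made for this sketch; it is not how Migen simulates anything.

```python
class AckPattern:
    """Software model of the registered-ack write pattern."""

    def __init__(self):
        self.ack = 0       # models the bus.ack register
        self.writes = []   # every execution of the `...` part

    def tick(self, cyc, stb, we, dat_w):
        # As in synchronous logic, the condition uses the value of ack
        # from before this clock edge.
        if cyc and stb and not self.ack:
            next_ack = 1
            if we:
                self.writes.append(dat_w)
        else:
            next_ack = 0
        self.ack = next_ack
        return self.ack


p = AckPattern()
# The master keeps cyc/stb asserted until it has observed ack,
# so the request is still visible on the second edge.
acks = [
    p.tick(cyc=1, stb=1, we=1, dat_w=0xAB),
    p.tick(cyc=1, stb=1, we=1, dat_w=0xAB),
    p.tick(cyc=0, stb=0, we=0, dat_w=0),
]
print(acks)       # [1, 0, 0]
print(p.writes)   # [171]
```

Without the ~bus.ack term, the second edge would execute the write a second time, because the master has not yet had a chance to deassert its request.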

UART peripheral

The UART peripheral has two 32-bit registers with the following layout:

(speaking of layout, do you have any idea how ridiculously hard it is to lay this out in HTML?)

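Address 0 (RX Status/Data Register)
 +----------+-------+----------+----------+---------+
 | Bit      | 31:10 | 9        | 8        | 7:0     |
 | Type     | N/A   | R        | R/C1     | R       |
 | Function | N/A   | RX Error | RX Full  | RX Data |
 +----------+-------+----------+----------+---------+

Address 4 (TX Command/Data Register)
 +----------+-------+----------+----------+---------+
 | Bit      | 31:10 | 9        | 8        | 7:0     |
 | Type     | N/A   | R        | W        | W       |
 | Function | N/A   | TX Empty | TX Start | TX Data |
 +----------+-------+----------+----------+---------+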

The bit types that are reasonable to use in peripheral registers are:

  • R means read-only (writes do nothing);
  • W means write-only (reads return zeroes);
  • R/W means read-write (reads return what was written, and any written value is valid);
  • R/C1 means read-only, cleared by writing one (writing zero does nothing; writing one clears the bit if it was set, and does nothing otherwise);
  • N/A means reserved and should be written as zero (reads return garbage, zero writes do nothing, non-zero writes result in unpredictable behavior).

The R/C1 bit type is particularly useful for event flags. The reason it’s specifically cleared by writing one is that this allows updating unrelated bits in the same register without accidentally clearing some interesting flags, even if the set of flags is not known beforehand and thus they cannot be explicitly masked; in general, it is otherwise impossible to safely update a part of the register using a read-modify-write cycle without introducing a race condition.
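The argument for R/C1 can be illustrated with a small plain-Python model. The register layout here (two R/C1 flags in bits 9:8 next to an R/W field in bits 7:0) is hypothetical and chosen only for the sketch; it is not the UART register described above.

```python
FLAGS_MASK = 0x300   # bits 9:8: R/C1 event flags (hypothetical layout)
FIELD_MASK = 0x0FF   # bits 7:0: ordinary R/W field


class Rc1Register:
    def __init__(self):
        self.value = 0

    def set_flag(self, bit):
        """Hardware side: an event raises a flag."""
        self.value |= (1 << bit) & FLAGS_MASK

    def read(self):
        return self.value

    def write(self, data):
        field = data & FIELD_MASK                # R/W: takes the written value
        flags = self.value & FLAGS_MASK & ~data  # R/C1: written ones clear
        self.value = flags | field


reg = Rc1Register()
reg.set_flag(8)
snapshot = reg.read()      # software sees flag 8
reg.set_flag(9)            # a new event arrives before the write-back

# Update only the field: write zeros to every flag position.
reg.write((snapshot & ~FLAGS_MASK) | 0x42)
print(hex(reg.read()))     # 0x342, both flags survived

# Acknowledge flag 8 explicitly.
reg.write(0x100 | 0x42)
print(hex(reg.read()))     # 0x242
```

With flags that are cleared by writing zero instead, the first write-back would have had to guess the state of flag 9, and would have either lost it or set it spuriously.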

The N/A bit type is specified in a way that lets the bit gain new functionality later without breaking older software, if the software respects the associated restrictions.

Of course, when implementing a peripheral you have complete freedom over its behavior; and you could implement odd things, like registers that self-clear on reads, or bits that have completely different meaning when reading and writing, or somesuch. But this is error-prone and also annoys software developers, so maybe don’t do that.

The implementation of the peripheral is essentially the same as for the GPIO one, though it also has readable registers:

class SimpleUART(Module):
    def __init__(self, serial, clk_freq, baud_rate):
        self.bus = bus = wishbone.Interface()
        self.submodules.phy = phy = UART(serial, clk_freq, baud_rate)

        ###

        self.sync += [
            If((bus.adr & 1) == 0,
                bus.dat_r.eq(Cat(phy.rx_data, phy.rx_ready, phy.rx_error))
            ).Elif((bus.adr & 1) == 1,
                bus.dat_r.eq(Cat(Replicate(0, 9), phy.tx_ack))
            ),

            phy.rx_ack.eq(0),
            phy.tx_ready.eq(0),

            bus.ack.eq(0),
            If(bus.cyc & bus.stb & ~bus.ack,
                bus.ack.eq(1),
                If(bus.we,
                    If((bus.adr & 1) == 0,
                        phy.rx_ack.eq(bus.dat_w[8])
                    ).Elif((bus.adr & 1) == 1,
                        phy.tx_data.eq(bus.dat_w[:8]),
                        phy.tx_ready.eq(bus.dat_w[8])
                    )
                )
            )
        ]

Note how the strobe bits of the PHY (rx_ack and tx_ready) are assigned zero by default; this ensures that they are asserted for exactly one cycle after a write transaction that has the corresponding bits set.
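For checking software against the register layout, the bit packing done by Cat (whose first argument is the least significant) can be mirrored on plain integers. The function names below are made up for this sketch; only the bit positions come from the peripheral code above.

```python
def rx_register(rx_data, rx_ready, rx_error):
    """Value read at address 0: Cat(rx_data, rx_ready, rx_error)."""
    return (rx_data & 0xFF) | ((rx_ready & 1) << 8) | ((rx_error & 1) << 9)


def tx_register(tx_ack):
    """Value read at address 4: nine zero bits, then tx_ack in bit 9."""
    return (tx_ack & 1) << 9


def decode_tx_write(dat_w):
    """What a write to address 4 hands to the PHY."""
    return {"tx_data": dat_w & 0xFF, "tx_ready": (dat_w >> 8) & 1}


print(hex(rx_register(0x41, 1, 0)))   # 0x141
print(hex(tx_register(1)))            # 0x200
print(decode_tx_write(0x141))         # {'tx_data': 65, 'tx_ready': 1}
```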

SoC interconnect

The last part of the SoC is the one that brings it all together and in the darkness binds them. It consists of the system reset generator and the Wishbone interconnect:

def mem_decoder(address):
    return lambda a: (a << 2) & 0xf0000000 == address


class SimpleSoC(Module):
    def __init__(self, platform, code):
        clk12  = platform.request("clk12")
        serial = platform.request("serial")

        self.clock_domains.cd_por = ClockDomain(reset_less=True)
        self.clock_domains.cd_sys = ClockDomain()
        reset_delay = Signal(10, reset=1023)
        self.comb += [
            self.cd_por.clk.eq(clk12),
            self.cd_sys.clk.eq(clk12),
            self.cd_sys.rst.eq(reset_delay != 0)
        ]
        self.sync.por += \
            If(reset_delay != 0,
                reset_delay.eq(reset_delay - 1)
            )

        self.submodules.cpu = MOR1KX(platform, 0x00000000)
        self.submodules.ram = wishbone.SRAM(0x100, init=code)
        self.submodules.leds = SimpleGPIO([
            platform.request("user_led") for _ in range(8)
        ])
        self.submodules.uart = SimpleUART(serial, 12000000, 9600)
        self.submodules.wishbonecon = wishbone.InterconnectShared(
            masters=[
                self.cpu.ibus,
                self.cpu.dbus
            ],
            slaves=[
                (mem_decoder(0x00000000), self.ram.bus),
                (mem_decoder(0x10000000), self.leds.bus),
                (mem_decoder(0x20000000), self.uart.bus),
            ],
            register=True)

The reset generator is necessary because uploading a bitstream into the FPGA only performs the basic functions of a reset. It initializes the registers to known values and then ungates the clock along with all other I/O pins, but it does not do the latter deterministically. In practice, simple designs would still work without a reset generator, but a large one such as this SoC will break.
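As a quick sanity check of how long the generated reset lasts, the counter can be replayed in plain Python. This is a sketch that only assumes the values visible in the code above (a counter starting at 1023 and a 12 MHz clock); the function name is made up.

```python
def reset_cycles(initial=1023):
    """Count the clock edges during which cd_sys.rst is asserted."""
    reset_delay = initial
    cycles = 0
    while reset_delay != 0:   # cd_sys.rst = (reset_delay != 0)
        reset_delay -= 1      # the decrement in the por clock domain
        cycles += 1
    return cycles


cycles = reset_cycles()
microseconds = cycles / 12_000_000 * 1e6
print(cycles)                   # 1023
print(round(microseconds, 2))   # 85.25
```

So the rest of the SoC is held in reset for roughly 85 µs after the clock starts running.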

The Wishbone interconnect in this case includes an arbiter and a decoder, connected back-to-back inside of the MiSoC built-in wishbone.InterconnectShared module. The arbiter is necessary because the CPU has separate instruction and data buses; since we have only one SRAM block used for both instructions and data, if we want to be able to perform any loads and stores to RAM, it should be shared. The decoder allows us to simplify peripherals, as they can be ignorant of the exact address they are mapped at.
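The decoder functions can be tried out on plain integers to see which slave a given CPU address selects. This is a sketch: the select helper and the slave names are illustrative, while the decoding expression is the one from mem_decoder above (with explicit parentheses).

```python
def mem_decoder(address):
    # Same expression as in the SoC, applied to integers instead of signals.
    return lambda a: ((a << 2) & 0xF0000000) == address


SLAVES = [
    (mem_decoder(0x00000000), "ram"),
    (mem_decoder(0x10000000), "leds"),
    (mem_decoder(0x20000000), "uart"),
]


def select(byte_address):
    """Return the name of the slave selected for a CPU byte address."""
    adr = byte_address >> 2   # the bus carries word addresses
    for matches, name in SLAVES:
        if matches(adr):
            return name
    return None


print(select(0x00000010))   # ram
print(select(0x10000000))   # leds
print(select(0x20000004))   # uart
print(select(0x30000000))   # None
```

Only the top four address bits take part in the decoding, so each slave is selected for an entire 256 MiB region regardless of how few registers it actually implements.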

Software

The software I wrote as an example is very straightforward, and perhaps representative of the inner desires of us all: you can say anything to it through UART, and it will scream in response. It is implemented in OR1K assembly:

    l.xor   r0, r0, r0
    l.movhi r1, 0x1000
    l.movhi r2, 0x2000
    l.ori   r10, r0, 1
0:  l.lwz   r3, 0(r2)
    l.andi  r4, r3, 0x100
    l.sfeqi r4, 0
    l.bf    0b
    l.sw    0(r1), r10
    l.ori   r11, r0, 0x100
    l.sw    0(r2), r11
    l.ori   r12, r0, 16
1:  l.ori   r11, r0, 0x141
    l.sw    4(r2), r11
2:  l.lwz   r11, 4(r2)
    l.andi  r11, r11, 0x200
    l.sfeqi r11, 0
    l.bf    2b
    l.addi  r12, r12, -1
    l.sfnei r12, 0
    l.bf    1b
    l.sw    0(r1), r0
    l.j     0b

Note the absence of delay slots; the PRONTO_ESPRESSO pipeline used in our mor1kx instantiation does not include those, unlike the two other ones.
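For readers who don’t speak OR1K assembly, the visible behavior of the program can be approximated in plain Python. This sketch ignores the LED and all polling and only models which bytes are sent for which input; the function and constant names are made up, while the values come from the assembly above.

```python
RX_FULL = 0x100    # bit 8 of the RX register, written back to clear it
TX_START = 0x100   # bit 8 of the TX register
TX_EMPTY = 0x200   # bit 9 of the TX register, polled after each byte


def scream(received, repeats=16):
    """Return the bytes the ROM code transmits for the given input bytes."""
    out = bytearray()
    for _byte in received:            # label 0: wait for RX Full, clear it
        for _count in range(repeats):     # label 1: r12 counts down from 16
            command = TX_START | ord("A")     # 0x141 in the assembly
            out.append(command & 0xFF)        # the PHY sends bits 7:0
    return bytes(out)


print(hex(TX_START | ord("A")))   # 0x141
print(scream(b"?"))               # b'AAAAAAAAAAAAAAAA'
```

The content of the received byte is never looked at; every input character produces the same sixteen-character response.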

An interesting exercise would be to implement a UART bootloader, so that running a newly compiled program would no longer require rebuilding the bitstream, which takes about a minute.

Demonstration

