Implementing a simple SoC in Migen
In this note I’ll write about implementing a simple microcontroller based on an OpenRISC 1000 CPU core using Migen.
Prerequisites
To write this note, I was using:
- migen commit 3cc73b9f4298a0a375b4bbd1bfc8807dfdb38ffd;
- misoc commit b53b60c3f9aaa24388cccc96f9be367b92019533;
- mor1kx commit fb519d011ae2524e3681f07b206df0a6c03f82a8;
- Yosys version 0.6+205git42a9712;
- IceStorm version 0~20160913git266e758-2;
- arachne-pnr version 0.1+20151224git1a4fdf9;
- uart.py from the earlier note;
- binutils version 2.26.51.20160313 built with --target=or1k-none;
- the iCE40-HX8K-B-EVN development board.
Newer versions will probably still work fine.
Implementation
The SoC consists of three main parts: the mor1kx wrapper, the SoC gateware, and the ROM code.
mor1kx wrapper
The mor1kx wrapper simply instantiates the CPU core from an adjacent checkout of the mor1kx git repository, and configures it to remove as many features as realistically possible, with the exception of narrowing the register file. (There is an option OPTION_RF_ADDR_WIDTH, but enabling it broke stores with the address in r2 while stores with the address in r1, r3, r4 and r5 worked, for a reason I am unable to comprehend. There is also an option OPTION_RF_WORDS, which does nothing.)
There is a version of this wrapper in MiSoC, but it’s not configurable and enables too many features, e.g. caches, such that the resulting core won’t fit into iCE40-HX8K.
import os

from migen import *
from misoc.interconnect import wishbone


class MOR1KX(Module):
    def __init__(self, platform, reset_pc):
        self.ibus = i = wishbone.Interface()
        self.dbus = d = wishbone.Interface()
        self.interrupt = Signal(32)

        ###

        i_adr_o = Signal(32)
        d_adr_o = Signal(32)
        self.specials += Instance("mor1kx",
            p_FEATURE_INSTRUCTIONCACHE="NONE",
            p_FEATURE_DATACACHE="NONE",
            p_FEATURE_TIMER="NONE",
            p_FEATURE_SYSCALL="NONE",
            p_FEATURE_TRAP="NONE",
            p_FEATURE_RANGE="NONE",
            p_FEATURE_OVERFLOW="NONE",
            p_FEATURE_SRA="NONE",
            p_FEATURE_ADDC="NONE",
            p_FEATURE_CMOV="NONE",
            p_FEATURE_FFL1="NONE",
            p_FEATURE_ATOMIC="NONE",
            p_FEATURE_MULTIPLIER="NONE",
            p_FEATURE_DIVIDER="NONE",
            p_FEATURE_STORE_BUFFER="NONE",
            p_OPTION_CPU0="PRONTO_ESPRESSO",
            p_OPTION_RESET_PC=reset_pc,
            p_IBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",
            p_DBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",

            i_clk=ClockSignal(),
            i_rst=ResetSignal(),

            i_irq_i=self.interrupt,

            o_iwbm_adr_o=i_adr_o,
            o_iwbm_dat_o=i.dat_w,
            o_iwbm_sel_o=i.sel,
            o_iwbm_cyc_o=i.cyc,
            o_iwbm_stb_o=i.stb,
            o_iwbm_we_o=i.we,
            o_iwbm_cti_o=i.cti,
            o_iwbm_bte_o=i.bte,
            i_iwbm_dat_i=i.dat_r,
            i_iwbm_ack_i=i.ack,
            i_iwbm_err_i=i.err,
            i_iwbm_rty_i=0,

            o_dwbm_adr_o=d_adr_o,
            o_dwbm_dat_o=d.dat_w,
            o_dwbm_sel_o=d.sel,
            o_dwbm_cyc_o=d.cyc,
            o_dwbm_stb_o=d.stb,
            o_dwbm_we_o=d.we,
            o_dwbm_cti_o=d.cti,
            o_dwbm_bte_o=d.bte,
            i_dwbm_dat_i=d.dat_r,
            i_dwbm_ack_i=d.ack,
            i_dwbm_err_i=d.err,
            i_dwbm_rty_i=0)

        self.comb += [
            self.ibus.adr.eq(i_adr_o[2:]),
            self.dbus.adr.eq(d_adr_o[2:])
        ]

        # add Verilog sources
        vdir = os.path.join(
            os.path.abspath(os.path.dirname(__file__)),
            "mor1kx", "rtl", "verilog")
        platform.add_source_dir(vdir)
        platform.add_verilog_include_path(vdir)
SoC gateware
The overall architecture of the SoC can be seen on this wonderful diagram:
The full code can be downloaded; I will describe it part-by-part here.
Digression: the Wishbone bus
All of the components in this SoC are connected using the Wishbone bus. It is actually fairly simple as far as buses go, but it has a number of signals that are bewildering at first, and the specification is very obtuse and probably creates more confusion than it resolves.
When implementing a typical I/O peripheral (without any wait states and without support for any but word granularity access) in Migen, the following list can be used as a reference:
- The rst and clk inputs are implicit as a part of the clock domain and not part of the Wishbone interface in Migen;
- The cyc and stb inputs, when asserted together, indicate that there is a valid bus cycle and this peripheral is selected;
- The adr input is the address of the access; it has the granularity of bus width and includes every bit even if it’s already partially decoded by the decoder. For example, if a CPU is accessing the address 0x10001000 and a peripheral is mapped with a base address 0x10000000 on a 32-bit Wishbone bus, the peripheral will observe adr == 0x04000400.
- The dat_r output is the word being read from the peripheral; since its value does not affect anything when the peripheral is not selected, it can (and should, for simplicity) be updated regardless of whether there is a bus cycle.
- The dat_w input is the word being written into the peripheral, and we is the write enable strobe; dat_w is only valid when all of cyc, stb and we are asserted.
- The ack output should be asserted once a transaction successfully completes, i.e. in response to cyc and stb being asserted.
- The err output can be asserted to abort the transaction; this will raise a bus error exception or the like.
- The cti and bte signals are related to burst transfers and can be ignored.
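The address arithmetic in the adr item can be checked with a few lines of plain Python. This is only a sketch: wb_word_address is a hypothetical helper name, not something provided by Migen or MiSoC, and it assumes that the bus width is a whole number of bytes.

```python
def wb_word_address(byte_address, bus_width_bits=32):
    """Convert a CPU byte address into the word address carried on adr.

    Hypothetical helper for illustration; not part of Migen or MiSoC.
    """
    bytes_per_word = bus_width_bits // 8
    return byte_address // bytes_per_word


# The example from the list above: a CPU access to byte address 0x10001000
# on a 32-bit bus. The base address is not subtracted, because adr carries
# every address bit even after decoding.
observed = wb_word_address(0x10001000)
print("adr == 0x{:08x}".format(observed))
```

This is the same conversion that the CPU wrapper performs in hardware by dropping the two low bits of the byte address (i_adr_o[2:]).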
The peripheral should also use registered feedback, i.e. the dat_r, ack and (if any) err outputs should be driven from registers (synchronous logic) rather than through combinatorial logic. Accesses then take two cycles instead of one, but timing closure improves, because the critical path for the Wishbone bus includes only the signals in that bus and not the rest of your design as well.
GPIO peripheral
The GPIO peripheral maps consecutive 32-bit write-only registers (with one bit used) to the output array, of no more than 8 outputs:
class SimpleGPIO(Module):
    def __init__(self, outputs):
        self.bus = bus = wishbone.Interface()

        ###

        self.sync += [
            bus.ack.eq(0),
            If(bus.cyc & bus.stb & ~bus.ack,
                bus.ack.eq(1),
                If(bus.we,
                    # Three index bits select one of up to 8 outputs;
                    # only bit 0 of the written word is used.
                    Array(outputs)[bus.adr & 0x7].eq(bus.dat_w[0])
                )
            )
        ]
Note this basic pattern:
self.sync += [
    bus.ack.eq(0),
    If(bus.cyc & bus.stb & ~bus.ack,
        bus.ack.eq(1),
        If(bus.we,
            ...)
    )
]
This makes sure that the code replaced by ... will execute only during a valid bus transaction (the bus.cyc & bus.stb part), execute exactly once per bus cycle (the “negative ack feedback”), and execute only during write transactions (the bus.we part).
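To see why the “negative ack feedback” yields exactly one execution, the pattern can be modelled cycle by cycle in plain Python. This is a rough sketch, not Migen code: AckPatternModel and its clock method are made-up names, and the model assumes a master that keeps cyc, stb and we asserted for two clock edges (the one where the peripheral accepts the access and the one where the master observes ack).

```python
class AckPatternModel:
    """Cycle-level model of the ack pattern; names are illustrative only."""

    def __init__(self):
        self.ack = 0      # models the registered bus.ack output
        self.writes = []  # (adr, dat_w) pairs the peripheral acted on

    def clock(self, cyc, stb, we, adr=0, dat_w=0):
        """Advance one clock edge, using register values from before the edge."""
        next_ack = 0                      # bus.ack.eq(0): the default
        if cyc and stb and not self.ack:  # valid cycle, not acknowledged yet
            next_ack = 1                  # bus.ack.eq(1)
            if we:
                self.writes.append((adr, dat_w))  # stands in for "..."
        self.ack = next_ack


model = AckPatternModel()
trace = []

# The master holds the strobes for two edges, then releases the bus.
for _ in range(2):
    model.clock(cyc=1, stb=1, we=1, adr=0, dat_w=1)
    trace.append(model.ack)
model.clock(cyc=0, stb=0, we=0)
trace.append(model.ack)

print("ack per edge:", trace)
print("writes performed:", model.writes)
```

In this model ack is high for a single cycle and the write body runs once, even though the strobes were asserted for two edges; without the ~bus.ack term the body would run on both.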
UART peripheral
The UART peripheral has two 32-bit registers with the following layout:
(speaking of layout, do you have any idea how ridiculously hard it is to lay this out in HTML?)
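Register 0 (byte offset 0x0), receive:

| Bit      | 31:10 | 9        | 8       | 7:0     |
|----------|-------|----------|---------|---------|
| Type     | N/A   | R        | R/C1    | R       |
| Function | N/A   | RX Error | RX Full | RX Data |

Register 1 (byte offset 0x4), transmit:

| Bit      | 31:10 | 9        | 8        | 7:0     |
|----------|-------|----------|----------|---------|
| Type     | N/A   | R        | W        | W       |
| Function | N/A   | TX Empty | TX Start | TX Data |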
The bit types that are reasonable to use in peripheral registers are:
- R means read-only (writes do nothing);
- W means write-only (reads return zeroes);
- R/W means read-write (reads return what was written, and any written value is valid);
- R/C1 means read-only, cleared by writing one (writing zero does nothing; writing one clears the bit if it was set, and does nothing if it was already clear);
- N/A means reserved and should be written as zero (reads return garbage, zero writes do nothing, non-zero writes result in unpredictable behavior).
The R/C1 bit type is particularly useful for event flags. The reason it’s specifically cleared by writing one is that this allows updating unrelated bits in the same register without accidentally clearing some interesting flags, even if the set of flags is not known beforehand and thus they cannot be explicitly masked; in general, it is otherwise impossible to safely update a part of the register using a read-modify-write cycle without introducing a race condition.
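The race-avoidance argument can be illustrated with a small software model. The sketch below is plain Python, and write_rc1 is a hypothetical function that only models how hardware would update R/C1 bits; it does not correspond to anything in Migen.

```python
def write_rc1(current, written, rc1_mask):
    """Model a register write where the bits in rc1_mask are R/C1.

    Writing one clears a flag, writing zero leaves it alone.
    Hypothetical helper for illustration only.
    """
    return current & ~(written & rc1_mask)


RC1_MASK = 0b11

flags = 0b01           # event 0 has fired
snapshot = flags       # software reads the register...
flags |= 0b10          # ...and event 1 fires before software writes back

# Software acknowledges only the events it has actually seen.
flags = write_rc1(flags, snapshot, RC1_MASK)
print("flags after acknowledge: {:02b}".format(flags))
```

Event 1 survives the acknowledge because software never wrote a one to its bit. With a scheme where flags are cleared by writing zero, the same write-back would have lost it.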
The N/A bit type is specified in a way that lets the bit gain new functionality later without breaking older software, if the software respects the associated restrictions.
Of course, when implementing a peripheral you have complete freedom over its behavior; and you could implement odd things, like registers that self-clear on reads, or bits that have completely different meaning when reading and writing, or somesuch. But this is error-prone and also annoys software developers, so maybe don’t do that.
The implementation of the peripheral is essentially the same as for the GPIO one, though it also has readable registers:
class SimpleUART(Module):
    def __init__(self, serial, clk_freq, baud_rate):
        self.bus = bus = wishbone.Interface()
        # UART is the PHY from uart.py (see Prerequisites)
        self.submodules.phy = phy = UART(serial, clk_freq, baud_rate)

        ###

        self.sync += [
            # Readable registers: updated regardless of the bus cycle
            If((bus.adr & 1) == 0,
                bus.dat_r.eq(Cat(phy.rx_data, phy.rx_ready, phy.rx_error))
            ).Elif((bus.adr & 1) == 1,
                bus.dat_r.eq(Cat(Replicate(0, 9), phy.tx_ack))
            ),
            # PHY strobes are deasserted by default
            phy.rx_ack.eq(0),
            phy.tx_ready.eq(0),
            bus.ack.eq(0),
            If(bus.cyc & bus.stb & ~bus.ack,
                bus.ack.eq(1),
                If(bus.we,
                    If((bus.adr & 1) == 0,
                        phy.rx_ack.eq(bus.dat_w[8])
                    ).Elif((bus.adr & 1) == 1,
                        phy.tx_data.eq(bus.dat_w[:8]),
                        phy.tx_ready.eq(bus.dat_w[8])
                    )
                )
            )
        ]
Note how the strobe bits of the PHY (rx_ack and tx_ready) are assigned zero by default; this ensures that they are asserted for exactly one cycle after a write transaction that has the corresponding bits set.
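From the software side, the register layout given at the start of this section can be written down as a handful of bit masks. The following is a sketch in plain Python; the constant and function names are invented for illustration and appear nowhere in the gateware.

```python
# Bit positions taken from the register layout; names are hypothetical.
RX_FULL = 1 << 8    # R/C1: a byte has been received
RX_ERROR = 1 << 9   # R: receive error
TX_START = 1 << 8   # W: start transmitting the TX Data byte
TX_EMPTY = 1 << 9   # R: transmitter is idle


def decode_rx(value):
    """Split a value read from register 0 into its fields."""
    return {
        "data": value & 0xff,
        "full": bool(value & RX_FULL),
        "error": bool(value & RX_ERROR),
    }


def encode_tx(byte):
    """Build the word to write to register 1 to send one byte."""
    return TX_START | (byte & 0xff)


print(hex(encode_tx(ord("A"))))
print(decode_rx(RX_FULL | ord("a")))
```

Writing RX_FULL back to register 0 acknowledges the received byte, and polling register 1 for TX_EMPTY tells software when the next byte can be sent.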
SoC interconnect
The last part of the SoC is the one that brings it all together and in the darkness binds them. It consists of the system reset generator and the Wishbone interconnect:
def mem_decoder(address):
    # a is a word address; shift it back to a byte address and
    # compare the top four bits against the region base
    return lambda a: ((a << 2) & 0xf0000000) == address


class SimpleSoC(Module):
    def __init__(self, platform, code):
        clk12 = platform.request("clk12")
        serial = platform.request("serial")

        self.clock_domains.cd_por = ClockDomain(reset_less=True)
        self.clock_domains.cd_sys = ClockDomain()
        reset_delay = Signal(10, reset=1023)
        self.comb += [
            self.cd_por.clk.eq(clk12),
            self.cd_sys.clk.eq(clk12),
            self.cd_sys.rst.eq(reset_delay != 0)
        ]
        self.sync.por += \
            If(reset_delay != 0,
                reset_delay.eq(reset_delay - 1)
            )

        self.submodules.cpu = MOR1KX(platform, 0x00000000)
        self.submodules.ram = wishbone.SRAM(0x100, init=code)
        self.submodules.leds = SimpleGPIO([
            platform.request("user_led") for _ in range(8)
        ])
        self.submodules.uart = SimpleUART(serial, 12000000, 9600)

        self.submodules.wishbonecon = wishbone.InterconnectShared(
            masters=[
                self.cpu.ibus,
                self.cpu.dbus
            ],
            slaves=[
                (mem_decoder(0x00000000), self.ram.bus),
                (mem_decoder(0x10000000), self.leds.bus),
                (mem_decoder(0x20000000), self.uart.bus),
            ],
            register=True)
The reset generator is necessary because, while uploading a bitstream into the FPGA performs the basic functions of a reset (it initializes the registers to known values and then ungates the clock, among all other I/O pins), it does not do the latter deterministically. In practice, simple designs would still work, but large ones such as this SoC will break.
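As a sanity check on how long the system domain is held in reset, the counter can be stepped in plain Python. This is a sketch rather than a simulation of the gateware; reset_cycles is a made-up name, and the 12 MHz clock frequency is an assumption based on the clk12 pin name and the value passed to the UART.

```python
def reset_cycles(initial=1023):
    """Count the clock cycles for which cd_sys.rst would stay asserted."""
    reset_delay = initial
    cycles_in_reset = 0
    while reset_delay != 0:   # rst is asserted while the counter is nonzero
        cycles_in_reset += 1
        reset_delay -= 1      # one decrement per clock edge in the por domain
    return cycles_in_reset


CLK_FREQ_HZ = 12000000        # assumed from the clk12 pin name

cycles = reset_cycles()
duration_us = cycles / CLK_FREQ_HZ * 1e6
print("{} cycles, about {:.2f} us".format(cycles, duration_us))
```

Under these assumptions the reset lasts 1023 cycles, or roughly 85 microseconds, after the clock starts toggling.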
The Wishbone interconnect in this case includes an arbiter and a decoder, connected back-to-back inside of the MiSoC built-in wishbone.InterconnectShared module.

The arbiter is necessary because the CPU has separate instruction and data buses; since we have only one SRAM block used for both instructions and data, if we want to be able to perform any loads and stores to RAM, it should be shared.

The decoder allows us to simplify peripherals, as they can be ignorant of the exact address they are mapped at.
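The effect of the decoder can be tried out without any hardware, since mem_decoder only builds a predicate over the word address. In the sketch below, the predicates are applied to plain Python integers instead of Migen signals; SLAVES and select_slave are hypothetical names used for illustration only.

```python
def mem_decoder(address):
    # Same predicate as in the SoC, applied here to plain integers.
    return lambda a: ((a << 2) & 0xf0000000) == address


# Mirrors the slave list of the interconnect; names are illustrative.
SLAVES = [
    ("ram", mem_decoder(0x00000000)),
    ("leds", mem_decoder(0x10000000)),
    ("uart", mem_decoder(0x20000000)),
]


def select_slave(byte_address):
    """Return the name of the slave a CPU byte address would reach."""
    word_address = byte_address >> 2   # what the bus carries on adr
    for name, decoder in SLAVES:
        if decoder(word_address):
            return name
    return None                        # nothing is mapped there


for address in (0x00000040, 0x10000000, 0x20000004, 0x30000000):
    print("0x{:08x} -> {}".format(address, select_slave(address)))
```

Only the top four address bits take part in the comparison, so each slave occupies a 256 MiB region and sees its registers repeated throughout it.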
Software
The software I wrote as an example is very straightforward, and perhaps representative of the inner desires of us all: you can say anything to it through UART, and it will scream in response. It is implemented in OR1K assembly:
    l.xor   r0, r0, r0
    l.movhi r1, 0x1000
    l.movhi r2, 0x2000
    l.ori   r10, r0, 1
0:  l.lwz   r3, 0(r2)
    l.andi  r4, r3, 0x100
    l.sfeqi r4, 0
    l.bf    0b
    l.sw    0(r1), r10
    l.ori   r11, r0, 0x100
    l.sw    0(r2), r11
    l.ori   r12, r0, 16
1:  l.ori   r11, r0, 0x141
    l.sw    4(r2), r11
2:  l.lwz   r11, 4(r2)
    l.andi  r11, r11, 0x200
    l.sfeqi r11, 0
    l.bf    2b
    l.addi  r12, r12, -1
    l.sfnei r12, 0
    l.bf    1b
    l.sw    0(r1), r0
    l.j     0b
Note the absence of delay slots; the PRONTO_ESPRESSO pipeline used in our mor1kx instantiation does not include those, unlike the two other ones.
An interesting exercise would be to implement a UART bootloader, so that loading a newly compiled program would no longer require rebuilding the bitstream, which takes about a minute.