To write this note, I was using:
- uart.py from the earlier note;
- the iCE40-HX8K-B-EVN development board.
Newer versions will probably still work fine.
The SoC consists of three main parts: the mor1kx wrapper, the SoC gateware, and the ROM code.
The mor1kx wrapper simply instantiates the CPU core from an adjacent checkout of the
git repository, and configures it to remove as many features as realistically possible,
with the exception of narrowing the register file.
(There is an option OPTION_RF_ADDR_WIDTH, but enabling it broke stores with the address in r2 while stores with address in [...] for a reason I am unable to comprehend. There is also an option [...])
There is a version of this wrapper in MiSoC, but it’s not configurable and enables too many features, e.g. caches, such that the resulting core won’t fit into iCE40-HX8K.
```python
import os

from migen import *
from misoc.interconnect import wishbone


class MOR1KX(Module):
    def __init__(self, platform, reset_pc):
        self.ibus = i = wishbone.Interface()
        self.dbus = d = wishbone.Interface()
        self.interrupt = Signal(32)

        ###

        i_adr_o = Signal(32)
        d_adr_o = Signal(32)
        self.specials += Instance("mor1kx",
            p_FEATURE_INSTRUCTIONCACHE="NONE",
            p_FEATURE_DATACACHE="NONE",
            p_FEATURE_TIMER="NONE",
            p_FEATURE_SYSCALL="NONE",
            p_FEATURE_TRAP="NONE",
            p_FEATURE_RANGE="NONE",
            p_FEATURE_OVERFLOW="NONE",
            p_FEATURE_SRA="NONE",
            p_FEATURE_ADDC="NONE",
            p_FEATURE_CMOV="NONE",
            p_FEATURE_FFL1="NONE",
            p_FEATURE_ATOMIC="NONE",
            p_FEATURE_MULTIPLIER="NONE",
            p_FEATURE_DIVIDER="NONE",
            p_FEATURE_STORE_BUFFER="NONE",
            p_OPTION_CPU0="PRONTO_ESPRESSO",
            p_OPTION_RESET_PC=reset_pc,
            p_IBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",
            p_DBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",

            i_clk=ClockSignal(),
            i_rst=ResetSignal(),

            i_irq_i=self.interrupt,

            o_iwbm_adr_o=i_adr_o,
            o_iwbm_dat_o=i.dat_w,
            o_iwbm_sel_o=i.sel,
            o_iwbm_cyc_o=i.cyc,
            o_iwbm_stb_o=i.stb,
            o_iwbm_we_o=i.we,
            o_iwbm_cti_o=i.cti,
            o_iwbm_bte_o=i.bte,
            i_iwbm_dat_i=i.dat_r,
            i_iwbm_ack_i=i.ack,
            i_iwbm_err_i=i.err,
            i_iwbm_rty_i=0,

            o_dwbm_adr_o=d_adr_o,
            o_dwbm_dat_o=d.dat_w,
            o_dwbm_sel_o=d.sel,
            o_dwbm_cyc_o=d.cyc,
            o_dwbm_stb_o=d.stb,
            o_dwbm_we_o=d.we,
            o_dwbm_cti_o=d.cti,
            o_dwbm_bte_o=d.bte,
            i_dwbm_dat_i=d.dat_r,
            i_dwbm_ack_i=d.ack,
            i_dwbm_err_i=d.err,
            i_dwbm_rty_i=0)

        # The core emits byte addresses; Migen's Wishbone uses word addresses.
        self.comb += [
            self.ibus.adr.eq(i_adr_o[2:]),
            self.dbus.adr.eq(d_adr_o[2:])
        ]

        # add Verilog sources
        vdir = os.path.join(
            os.path.abspath(os.path.dirname(__file__)),
            "mor1kx", "rtl", "verilog")
        platform.add_source_dir(vdir)
        platform.add_verilog_include_path(vdir)
```
The overall architecture of the SoC can be seen on this wonderful diagram:
The full code can be downloaded; I will describe it part-by-part here.
Digression: the Wishbone bus
All of the components in this SoC are connected using the Wishbone bus, which is actually fairly simple as far as buses go, but it has a number of signals that are dazzling at first, and the specification is very obtuse and probably creates more confusion than it solves.
When implementing a typical I/O peripheral (without any wait states and without support for any but word granularity access) in Migen, the following list can be used as a reference:
- clk and rst inputs are implicit as a part of the clock domain and not part of the Wishbone interface in Migen;
- cyc and stb inputs, when asserted together, indicate that there is a valid bus cycle and this peripheral is selected;
- adr input is the address of the access; it has the granularity of bus width and includes every bit even if it’s already partially decoded by the interconnect. For example, if a CPU is accessing the address 0x10001000 and a peripheral is mapped with a base address 0x10000000 on a 32-bit Wishbone bus, the peripheral will observe adr == 0x04000400;
- dat_r output is the word being read from the peripheral; since its value does not affect anything when the peripheral is not selected, it can (and should, for simplicity) be updated regardless of whether there is a bus cycle;
- dat_w input is the word being written into the peripheral, and we is the write enable strobe; dat_w is only valid when all of cyc, stb and we are asserted;
- ack output should be asserted once a transaction successfully completes, i.e. in response to cyc and stb being asserted;
- err output can be asserted to abort the transaction; this will raise a bus error exception or the like;
- cti and bte signals are related to burst transfers and can be ignored.
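The address example in the adr item is easy to check with a few lines of plain Python. This is only a sketch of the arithmetic, not Migen code; the helper name wishbone_word_address is made up for this illustration.

```python
def wishbone_word_address(byte_address, bus_width_bits=32):
    """Convert a CPU byte address to the word address seen on adr.

    Hypothetical helper for illustration only; assumes the bus width
    is a power-of-two number of bytes.
    """
    bytes_per_word = bus_width_bits // 8
    shift = bytes_per_word.bit_length() - 1  # 4 bytes per word -> shift by 2
    return byte_address >> shift


# The peripheral base address is not subtracted; only the two
# byte-offset bits are dropped.
print(hex(wishbone_word_address(0x10001000)))
```

Note that nothing here depends on where the peripheral is mapped, which is why a peripheral that only looks at the low bits of adr does not need to know its own base address.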
The peripheral should also use registered feedback, i.e. the ack and (if any) err outputs should be driven from registers rather than through combinatorial logic. This improves timing closure: accesses take two cycles instead of one, but the critical path for the Wishbone bus includes only signals in that bus, and not the rest of your design as well.
The GPIO peripheral maps consecutive 32-bit write-only registers (with one bit used) to the output array, of no more than 8 outputs:
```python
class SimpleGPIO(Module):
    def __init__(self, outputs):
        self.bus = bus = wishbone.Interface()

        ###

        self.sync += [
            bus.ack.eq(0),
            If(bus.cyc & bus.stb & ~bus.ack,
                bus.ack.eq(1),
                If(bus.we,
                    # Three address bits select one of up to 8 outputs;
                    # only bit 0 of the written word is used.
                    Array(outputs)[bus.adr & 0x7].eq(bus.dat_w)
                )
            )
        ]
```
Note this basic pattern:
```python
self.sync += [
    bus.ack.eq(0),
    If(bus.cyc & bus.stb & ~bus.ack,
        bus.ack.eq(1),
        If(bus.we, ...)
    )
]
```
This makes sure that the code replaced by ... will execute only during a valid bus transaction (the bus.cyc & bus.stb part), execute exactly once per bus cycle (the “negative ack feedback”), and execute only during write transactions (the bus.we part).
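To see why the negative ack feedback gives exactly one execution per transaction, here is a rough cycle-level model in plain Python. It is a sketch, not Migen: the class AckModel and its tick method are invented for this illustration, and it assumes the master keeps cyc, stb and we asserted until it observes ack.

```python
class AckModel:
    """Cycle-level model of the registered-ack pattern (illustrative only)."""

    def __init__(self):
        self.ack = 0      # registered output, visible after the clock edge
        self.writes = []  # side effects performed by the "..." body

    def tick(self, cyc, stb, we, dat_w):
        """Advance one clock edge; inputs are sampled just before the edge."""
        next_ack = 0                      # bus.ack.eq(0), the default
        if cyc and stb and not self.ack:  # valid cycle that is not yet acked
            next_ack = 1                  # bus.ack.eq(1)
            if we:
                self.writes.append(dat_w)  # the "..." body
        self.ack = next_ack


periph = AckModel()
# The master holds its signals until it sees ack, i.e. across two edges.
periph.tick(1, 1, 1, 0xAB)  # first edge: the write happens, ack goes high
periph.tick(1, 1, 1, 0xAB)  # second edge: ack is high, nothing repeats
periph.tick(0, 0, 0, 0)     # bus idle
print(periph.writes)
```

Without the ~bus.ack term, the second edge would run the body again, which matters for anything with side effects such as the UART strobes.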
The UART peripheral has two 32-bit registers with the following layout:
(speaking of layout, do you have any idea how ridiculously hard it is to lay this out in HTML?)
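Register at offset 0 (receive):

| Bits | 31:10 | 9 | 8 | 7:0 |
|---|---|---|---|---|
| Function | N/A | RX Error | RX Full | RX Data |

Register at offset 4 (transmit):

| Bits | 31:10 | 9 | 8 | 7:0 |
|---|---|---|---|---|
| Function | N/A | TX Empty | TX Start | TX Data |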
The bit types that are reasonable to use in peripheral registers are:
- R means read-only (writes do nothing);
- W means write-only (reads return zeroes);
- R/W means read-write (reads return what was written, and any written value is valid);
- R/C1 means read-only, cleared by writing one (writing zero does nothing, writing one clears the bit if it was set, or does nothing);
- N/A means reserved and should be written as zero (reads return garbage, zero writes do nothing, non-zero writes result in unpredictable behavior).
The R/C1 bit type is particularly useful for event flags. The reason it’s specifically cleared by writing one is that this allows updating unrelated bits in the same register without accidentally clearing some interesting flags, even if the set of flags is not known beforehand and thus they cannot be explicitly masked; in general, it is otherwise impossible to safely update a part of the register using a read-modify-write cycle without introducing a race condition.
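The race that R/C1 avoids can be shown with a toy model in plain Python. This is a sketch under made-up assumptions: the FlagRegister class and its bit assignment (bit 0 is R/W, bit 1 is an R/C1 event flag) are invented for the example and do not correspond to any register in this SoC.

```python
class FlagRegister:
    """Toy register: bit 0 is R/W, bit 1 is an R/C1 event flag."""

    RW_MASK = 0x1
    FLAG_MASK = 0x2

    def __init__(self):
        self.value = 0

    def read(self):
        return self.value

    def write(self, word):
        rw = word & self.RW_MASK
        # Writing one clears the flag; writing zero leaves it alone.
        flag = self.value & self.FLAG_MASK & ~word
        self.value = rw | flag

    def hardware_event(self):
        self.value |= self.FLAG_MASK


reg = FlagRegister()
snapshot = reg.read()      # software starts a read-modify-write: reads 0
reg.hardware_event()       # the event fires in between
reg.write(snapshot | 0x1)  # software sets the R/W bit, flag bit written as 0
print(hex(reg.read()))     # both the R/W bit and the flag are set

reg.write(0x1 | 0x2)       # acknowledge the event explicitly
print(hex(reg.read()))     # only the R/W bit remains
```

If the flag were an ordinary R/W bit, the stale zero in the snapshot would have wiped out the event before software ever saw it.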
The N/A bit type is specified in a way that lets the bit gain new functionality later without breaking older software, if the software respects the associated restrictions.
Of course, when implementing a peripheral you have complete freedom over its behavior; and you could implement odd things, like registers that self-clear on reads, or bits that have completely different meaning when reading and writing, or somesuch. But this is error-prone and also annoys software developers, so maybe don’t do that.
The implementation of the peripheral is essentially the same as for the GPIO one, though it also has readable registers:
```python
from uart import UART  # uart.py from the earlier note (module name assumed)


class SimpleUART(Module):
    def __init__(self, serial, clk_freq, baud_rate):
        self.bus = bus = wishbone.Interface()
        self.submodules.phy = phy = UART(serial, clk_freq, baud_rate)

        ###

        self.sync += [
            # Read path: dat_r is updated every cycle, selected by adr bit 0.
            If((bus.adr & 1) == 0,
                bus.dat_r.eq(Cat(phy.rx_data, phy.rx_ready, phy.rx_error))
            ).Elif((bus.adr & 1) == 1,
                bus.dat_r.eq(Cat(Replicate(0, 9), phy.tx_ack))
            ),
            # Strobes default to zero and are pulsed by the write path below.
            phy.rx_ack.eq(0),
            phy.tx_ready.eq(0),
            bus.ack.eq(0),
            If(bus.cyc & bus.stb & ~bus.ack,
                bus.ack.eq(1),
                If(bus.we,
                    If((bus.adr & 1) == 0,
                        # RX Full is bit 8; writing one acknowledges the byte.
                        phy.rx_ack.eq(bus.dat_w[8])
                    ).Elif((bus.adr & 1) == 1,
                        phy.tx_data.eq(bus.dat_w[:8]),
                        # TX Start is bit 8.
                        phy.tx_ready.eq(bus.dat_w[8])
                    )
                )
            )
        ]
```
Note how the strobe bits of the PHY (rx_ack and tx_ready) are assigned zero by default; this ensures that they are asserted for exactly one cycle after a write transaction that has the corresponding bits set.
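On the software side, the register words can be packed and unpacked with a handful of masks. The following plain Python sketch is not part of the SoC; the constant and function names are invented here, and the bit positions are inferred from the Cat() order in the read path and from the 0x100 and 0x200 masks used by the example program later in this note.

```python
RX_DATA_MASK = 0xFF  # bits 7:0
RX_FULL = 1 << 8     # 0x100, polled by the example program
RX_ERROR = 1 << 9
TX_START = 1 << 8
TX_EMPTY = 1 << 9    # 0x200, polled by the example program


def decode_rx(word):
    """Split a word read from the receive register into its fields."""
    return {
        "data": word & RX_DATA_MASK,
        "full": bool(word & RX_FULL),
        "error": bool(word & RX_ERROR),
    }


def encode_tx(byte):
    """Build the word that starts transmission of one byte."""
    return TX_START | (byte & 0xFF)


print(decode_rx(0x161))
print(hex(encode_tx(ord("A"))))
```

The second line printed is the same constant the example program writes to the transmit register.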
The last part of the SoC is the one that brings it all together and in the darkness binds them. It consists of the system reset generator and the Wishbone interconnect:
```python
def mem_decoder(address):
    # The decoder sees word addresses; shift back to compare byte addresses.
    return lambda a: ((a << 2) & 0xf0000000) == address


class SimpleSoC(Module):
    def __init__(self, platform, code):
        clk12 = platform.request("clk12")
        serial = platform.request("serial")

        # Power-on reset: hold cd_sys in reset for the first 1023 cycles.
        self.clock_domains.cd_por = ClockDomain(reset_less=True)
        self.clock_domains.cd_sys = ClockDomain()
        reset_delay = Signal(10, reset=1023)
        self.comb += [
            self.cd_por.clk.eq(clk12),
            self.cd_sys.clk.eq(clk12),
            self.cd_sys.rst.eq(reset_delay != 0)
        ]
        self.sync.por += \
            If(reset_delay != 0,
                reset_delay.eq(reset_delay - 1)
            )

        self.submodules.cpu = MOR1KX(platform, 0x00000000)
        self.submodules.ram = wishbone.SRAM(0x100, init=code)
        self.submodules.leds = SimpleGPIO([
            platform.request("user_led") for _ in range(8)
        ])
        self.submodules.uart = SimpleUART(serial, 12000000, 9600)

        self.submodules.wishbonecon = wishbone.InterconnectShared(
            masters=[
                self.cpu.ibus,
                self.cpu.dbus
            ],
            slaves=[
                (mem_decoder(0x00000000), self.ram.bus),
                (mem_decoder(0x10000000), self.leds.bus),
                (mem_decoder(0x20000000), self.uart.bus),
            ],
            register=True)
```
The reset generator is necessary because uploading a bitstream into the FPGA performs only the basic functions of a reset. It initializes the registers to known values and then ungates the clock along with all other I/O pins, but it does not do the latter deterministically; in practice, while this would still work for simple designs, large ones such as this SoC will break.
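How long the generated reset lasts follows directly from the counter. A minimal plain Python model of it is below; the function name reset_cycles is made up, and the 12 MHz figure is simply the clk12 frequency passed to the UART in the SoC.

```python
def reset_cycles(initial=1023):
    """Count the clock cycles during which cd_sys.rst stays asserted."""
    reset_delay = initial
    cycles = 0
    while reset_delay != 0:  # cd_sys.rst.eq(reset_delay != 0)
        reset_delay -= 1     # the decrement in the por clock domain
        cycles += 1
    return cycles


cycles = reset_cycles()
print(cycles, "cycles, about", round(cycles / 12e6 * 1e6, 2), "microseconds")
```

Because cd_por is reset-less, the counter relies only on the register initial value loaded from the bitstream, which is the one part of configuration that can be trusted here.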
The Wishbone interconnect in this case includes an arbiter and a decoder, connected back-to-back inside of the MiSoC built-in wishbone.InterconnectShared.
The arbiter is necessary because the CPU has separate instruction and data buses; since we have
only one SRAM block used for both instructions and data, if we want to be able to perform any
loads and stores to RAM, it should be shared.
The decoder allows us to simplify peripherals, as they can be ignorant of the exact address
they are mapped at.
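The decoding rule can be tried out on plain integers. The sketch below reuses the mem_decoder expression from the SoC, but everything else (the slaves list of names and the select helper) is invented for illustration; in the real design the decoder operates on Migen signals.

```python
def mem_decoder(address):
    # Same expression as in the SoC, applied to plain integers.
    return lambda a: ((a << 2) & 0xf0000000) == address


slaves = [
    ("ram", mem_decoder(0x00000000)),
    ("leds", mem_decoder(0x10000000)),
    ("uart", mem_decoder(0x20000000)),
]


def select(byte_address):
    """Return the names of the slaves selected for a CPU byte address."""
    word_address = byte_address >> 2  # what the bus carries on adr
    return [name for name, match in slaves if match(word_address)]


print(select(0x00000040))
print(select(0x10000000))
print(select(0x20000004))
print(select(0x30000000))
```

Only the top four address bits are compared, so each slave occupies a 256 MiB window and an address outside all three windows selects nothing.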
The software I wrote as an example is very straightforward, and perhaps representative of the inner desires of us all: you can say anything to it through UART, and it will scream in response. It is implemented in OR1K assembly:
```asm
    l.xor   r0, r0, r0        /* r0 = 0 */
    l.movhi r1, 0x1000        /* r1 = GPIO base, 0x10000000 */
    l.movhi r2, 0x2000        /* r2 = UART base, 0x20000000 */
    l.ori   r10, r0, 1
0:  l.lwz   r3, 0(r2)         /* wait until RX Full (bit 8) is set */
    l.andi  r4, r3, 0x100
    l.sfeqi r4, 0
    l.bf    0b
    l.sw    0(r1), r10        /* LED 0 on */
    l.ori   r11, r0, 0x100
    l.sw    0(r2), r11        /* clear RX Full */
    l.ori   r12, r0, 16       /* repeat 16 times */
1:  l.ori   r11, r0, 0x141    /* TX Start | 'A' */
    l.sw    4(r2), r11
2:  l.lwz   r11, 4(r2)        /* wait until TX Empty (bit 9) is set */
    l.andi  r11, r11, 0x200
    l.sfeqi r11, 0
    l.bf    2b
    l.addi  r12, r12, -1
    l.sfnei r12, 0
    l.bf    1b
    l.sw    0(r1), r0         /* LED 0 off */
    l.j     0b
```
Note the absence of delay slots; the
PRONTO_ESPRESSO pipeline used in our mor1kx instantiation
does not include those, unlike the two other ones.
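For readers who do not speak OR1K assembly, the observable behavior of the program can be summarized by a tiny plain Python model. This is a behavioral sketch only: the function name respond is made up, and it ignores timing, the LED, and any bytes that arrive while the program is still busy transmitting.

```python
TX_START = 0x100


def respond(received):
    """Model the UART output produced for a sequence of received bytes."""
    out = bytearray()
    for _ in received:            # one pass of the outer loop per byte
        word = 0x141              # the constant written to the TX register
        assert word & TX_START    # bit 8 starts the transmission
        for _ in range(16):       # r12 counts down from 16
            out.append(word & 0xFF)  # low byte is the character, 'A'
    return bytes(out)


print(respond(b"hi"))
```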
An interesting exercise would be to implement a UART bootloader, such that compiled programs would no longer require rebuilding the bitstream, which takes about a minute.