7 (+4) Things That Don't Work in Zephyr

Photo by Luis Villasmil / Unsplash

The open source Zephyr Project RTOS and build system is a tour de force. It is also highly likely to be a major and unavoidable choice for embedded firmware development, particularly for cellular projects. Like many modern software development projects, however, it hasn't let the pursuit of perfection get in the way of shipping, to put it favourably. Developers may find themselves confused guinea pigs, discovering lots of cracks while trying to navigate a formidable new landscape.

Discovering egregious bugs while you learn something for the first time is a frustrating experience. So, what follows is a record of the top seven cracks I tripped over on my first trip down Zephyr lane.

Two important caveats:

  1. This was my experience targeting the nRF9160 from Nordic Semiconductor. Nordic wrap Zephyr in their nRF Connect SDK (NCS). Given the nature of Zephyr, bugs will vary significantly from target to target and manufacturer to manufacturer.
  2. Some of these bugs were fixed in updates released during the development process! Others may have been fixed since then too. One of the really challenging aspects of developing with Zephyr is that it is a fast moving target, so software repositories, reported issues and forum posts are more likely to be obsolete than not.

On to the sordid list:

1. The Installer doesn't work

Nordic provide Zephyr bundled with a bunch of Nordic-relevant additions in a software development kit called the nRF Connect SDK. It's an epic itself, and evolving just as rapidly as the elements it contains. The installer is a commendable effort at wrangling a very complex process. As is a common symptom of these sorts of projects, it suffers from too many well-meaning but incompatible sets of instructions, rather than one that just works.

The delta to a working installer is really quite small, but hard to ascertain. I restarted from scratch a couple of times to ensure I had a set of working instructions from start to finish, and published them here. They really amount to "follow Nordic's instructions, except for two or three bum-steers, to avoid 10 hours of goose chasing".

In the spirit of these things, I provided them back to Nordic who kindly responded saying that I shouldn't have followed their instructions because they're out of date. As you read on you'll notice this theme cropping up a bit: using what you can find online to guide you in Zephyr land is fraught - you should just know.

Update

The above was true for nRF Connect SDK 1.5.0. In version 1.6.1 the new Toolchain Manager method is recommended. It doesn't work. And when it fails you have a standalone 4GB installation that can't be remedied. I gave up and didn't lose anything. The manual method from 1.5.0 (with my extra notes) works fine and leaves you with a re-usable installation.

2. The IDE can't build a Zephyr project

A modified version of SEGGER Embedded Studio (SES) is offered as an IDE. The modifications allow an NCS project to be imported, and the crucial devicetree and Kconfig components are processed during that import. Alas, if you make a change to those components (which happens a lot), you need to re-import the project, because SES has no native support for them.

A Zephyr project is essentially a devicetree component, a Kconfig component and a C component. Having no support in the IDE for two thirds of those components makes for a lacklustre experience. SES can't even list the files in a project navigator view because it's not aware of the project structure. Code insight, syntax checking and re-builds are all poor enough that it's borderline dangerous to use it as a development IDE. In my work I did retain it for debugging purposes though, since that was neatly handled. Using west from the command line is perfectly sufficient for managing builds, and VSCode works well as an editor that is devicetree and Kconfig aware.

3. Setting the SPI speed for SDHC (SD Card) access doesn't do anything

In Zephyr you configure SD Card access using devicetree. The provided driver establishes comms with the SD Card at a low speed and then negotiates a higher speed according to what you specify for the mandatory spi-max-frequency parameter.

Only it doesn't do anything. It quietly ignores your setting, operating at snail speed.

It turns out this has been fixed in "upstream" Zephyr, but because Nordic bundle their own version, you don't get the fix. So SD Card access is broken. There will be many more variants on this theme to come.

This can be fixed by applying this patch to the Zephyr source in the NCS. Just add .patch to the end of the URL to turn it into a patch file. But be careful! Nordic's version has the old disk API, so you need to add some extra DTS macros to get it to work. With some excruciating needlework (you've never seen misleading error messages until you've seen DTS preprocessor error messages!) I found I could just cherry-pick and apply these two patches first: one and two.

4. One-Shot Timers don't work if you follow the instructions

Being an RTOS, Zephyr supports one-shot timers that only ever fire once by accepting K_FOREVER as the period. Only it doesn't work. The timer will actually fire lots, leaving a bizarre trail of debugging to navigate, until you discover it was because one of the authors was being clever. It seems the definition of K_FOREVER took a big detour at some point (some of that sordid tale here), and given K_FOREVER appears in about 662 files in Zephyr (I counted), the detritus of regressions is far and wide.  One such unassuming victim is timers, where the "clever" trick gets the sign of K_FOREVER wrong and ends up firing your one-shot timer constantly.

Turns out this one has been fixed too. Patch away!

5. I2S corrupts data

Ah, the sweet sound of data glitches. Once you manage to get data off an SD Card at more than 27kB/s, and your one-shot timers stop firing constantly, audio pumped out via I2S will still exhibit bizarre glitches, sometimes, in an ear-grating heisenbug.

Turns out data you pass to the Zephyr I2S driver needs to be word aligned, or the driver will quietly drop bytes. The only place this is documented is deep in the code itself, in an ASSERT which is disabled by default. Unfortunately, enabling the ASSERT falsely triggers another, so best to leave those sleeping tigers alone. You just have to know. So make sure you __attribute__((aligned(4))) your buffers that you're passing to I2S.

And... it still won't work. The other undocumented little trick is that while the tx and rx buffer pointers are double-buffered, their contents are not! So the user must double buffer the data themselves and provide the alternating pointers to the driver (which will double-buffer them, as if doing you a favour).

Now I2S works.
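
The two requirements above (word-aligned buffers, and double buffering done by the caller) can be sketched on a host machine as follows. This is a minimal illustration, not the Zephyr I2S driver API: BLOCK_BYTES, next_tx_block() and is_word_aligned() are hypothetical names, and the actual hand-off of the returned pointer to the driver is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 256 /* hypothetical block size, a multiple of 4 */

/* Two word-aligned blocks: the caller owns both and alternates between them. */
static uint8_t tx_blocks[2][BLOCK_BYTES] __attribute__((aligned(4)));
static unsigned int tx_index;

/* True if ptr sits on a 4-byte boundary. */
int is_word_aligned(const void *ptr)
{
	return ((uintptr_t)ptr % 4u) == 0u;
}

/*
 * Copy the next chunk of audio into the block that was NOT handed out last
 * time, and return a pointer to it. The block handed out previously is left
 * untouched, so it can still be in flight while this one is being filled.
 */
uint8_t *next_tx_block(const uint8_t *src, size_t len)
{
	assert(src != NULL && len <= BLOCK_BYTES);

	uint8_t *block = tx_blocks[tx_index];

	tx_index ^= 1u; /* flip to the other block for the next call */
	memcpy(block, src, len);
	return block;
}
```

Each call returns the other block, so the pointer passed to the driver always alternates and always lands on a 4-byte boundary.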

6. I2C addressing will fight you if you want it to work

To specify the slave address of an I2C peripheral in Zephyr you use the reg devicetree property, obviously. Given there's no intuition to go by, you might be tempted to consult the reference instead. That's a trap. Firstly, there is no reference, you just have to piece it together from contradictory examples and guides. Secondly, it's wrong. It says not to specify hexadecimal with a 0x prefix. But if you want it to work, you do need to, and you need to ignore the build process warnings that appear as a result. Lesson learned: when Zephyr is telling you you're doing it wrong, you might be on the right track.

Also make sure you specify the 7-bit address as your hex value, not the shifted 8-bit value. This is a time-honoured source of confusion, solved over and over again for the last 20 years, that Zephyr has introduced a new spin on.
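
The 7-bit versus 8-bit confusion is easier to see in code. A minimal sketch, assuming a datasheet that quotes the shifted 8-bit write address; both helper names are hypothetical and not part of Zephyr:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert an 8-bit "write address" (as some datasheets quote it) into the
 * 7-bit address that belongs in the devicetree reg property.
 */
uint8_t i2c_addr_7bit_from_8bit(uint8_t addr8)
{
	return (uint8_t)(addr8 >> 1);
}

/* First byte on the wire: 7-bit address shifted left, R/W flag in bit 0. */
uint8_t i2c_address_byte(uint8_t addr7, int is_read)
{
	assert(addr7 <= 0x7F);
	return (uint8_t)((addr7 << 1) | (is_read ? 1u : 0u));
}
```

For the device at 0x74 in item 7, a datasheet quoting shifted addresses would list 0xE8 (write) and 0xE9 (read). The value to specify is still 0x74.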

7. I2C write packets are malformed

Out of the box, the nRF9160 sends the following waveform on I2C when trying to send 0xFFFF to register 0x02 on slave address 0x74. As you can see, it sends 0x74, write (0), 0x02 and then the slave ACKs (0). It then, in gloriously neckbearded, over-complicated, out-of-touch nonsense, decides it can’t fit the rest of the data in the buffer, so terminates the transfer and starts again with 0x74 and the next byte (0xFF). At that point the slave sends the equivalent of WTF (actually a NACK - the red “1”) and it all falls down.

Figure: I2C waveform produced by i2c_burst_write()

In 2018, someone pointed out that this is pretty freaking dumb, in a ticket called "i2c_burst_write is not a burst write". They then looked into it further and made a statement that sums up Zephyr pretty well:

Reviewing the code it appears the I2C API is based on a generalization that does not account for all the ways subordinate devices might use the protocol.

In other words, we abstracted the hardware so far away that even the hardware doesn’t recognise what’s happening.

The author then tried to quantify “all the ways” and found that out of 4 platforms:

Two fail, one succeeds but would fail if the restart flag were added, one succeeds but ignores restart.

For a grand total of zero successfully compliant protocol implementations.

But it gets better.

In 2019 the same author then “fixes” the problem, with a definition of “fix” that must surely be Zephyr's raison d'être:

This has been resolved within the constraints of Zephyr process: the problems with the API including that i2c_burst_write is not globally functional have been documented, and a warning placed in the documentation for the functions that don’t work.

Ladies and gentlemen, I give you that warning:

Warning: The combined write synthesized by this API may not be supported on all I2C devices.

To fix the issue, I did this (some brevity applied):

int write_port_regs(params)
{
  //Commented out because OMG.
  //i2c_burst_write(data);
  i2c_dont_be_fucking_dumb(data);
}

The entire patch (as applied to the gpio_pca95xx I2C driver) appears below, but seriously, that’s about it.

--- ncs/zephyr/drivers/gpio/gpio_pca95xx.c	2021-04-01 22:46:01.000000000 +1100
+++ ncs/zephyr/drivers/gpio/gpio_pca95xx.c	2021-08-23 00:10:30.000000000 +1000
@@ -190,8 +190,15 @@
 
 	port_data = sys_cpu_to_le16(value);
 
+	/* Modified by HR210822 because Zephyr is fucking broken.
+	   See: https://github.com/zephyrproject-rtos/zephyr/issues/11612
+ 	   And the result: https://devzone.nordicsemi.com/f/nordic-q-a/69828/i2c_burst_write-difference-nrf52-and-nrf91-series
+
 	ret = i2c_burst_write(i2c_master, i2c_addr, reg,
 			      (uint8_t *)&port_data, sizeof(port_data));
+	*/
+	uint8_t buf[] = { reg, value & 0xFF, value >> 8 }; //Normal fucking i2c packet, unlike the i2c_burst_write shit.
+	ret = i2c_write(i2c_master, buf, sizeof(buf), i2c_addr);
 	if (ret == 0) {
 		*cache = value;
 	} else {

And voila, no more WTF.
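
The difference between the two approaches is only in how the bytes are grouped into transfers. Here is a host-side sketch of the single-buffer layout used in the patch, with a hypothetical helper name (pack_reg_write16) that is not part of Zephyr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Pack a register address and a 16-bit value (low byte first, as in the
 * patch above) into one contiguous buffer, so that the register and its
 * data travel in a single I2C write rather than two separate transfers.
 * Returns the number of bytes placed in buf.
 */
size_t pack_reg_write16(uint8_t reg, uint16_t value, uint8_t *buf, size_t buf_len)
{
	assert(buf != NULL && buf_len >= 3);

	buf[0] = reg;                     /* register address */
	buf[1] = (uint8_t)(value & 0xFF); /* low byte */
	buf[2] = (uint8_t)(value >> 8);   /* high byte */
	return 3;
}
```

Handing the resulting three bytes to i2c_write(), as the patch does, gives the slave the address, the register and both data bytes without the transfer being terminated and restarted in the middle.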

And that’s the story of why software people shouldn’t have their way with hardware. Thanks for coming to my TED talk.


Zephyr is such a bountiful source of content, this post is in danger of never finishing. Here I'll limit myself to 4 bonus cracks that are also worthy of mention...

8. The bundled FatFs is not re-entrant

Zephyr comes bundled with FatFs ready to use as a file system for external volumes. FatFs is a fine filesystem implementation with a bunch of features that make it suitable for embedded development. One of those features is designed for threaded environments like Zephyr, called _FS_REENTRANT, which ensures the module is re-entrant (thread safe). Only Zephyr decided to quietly disable that one, despite building file system access so deeply into the RTOS (think Linux style fstab mountpoints rather than runtime transactional access) that you're pretty much guaranteed to run into a thread concurrency issue. So every Zephyr user gets to roll their own mutex protection around FatFs calls.
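
A minimal sketch of that roll-your-own locking, written against POSIX mutexes so that it builds on a host machine. In a Zephyr build a k_mutex would play the same role. storage_write() and raw_write() are hypothetical stand-ins for whichever FatFs-backed call is being protected:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One lock guarding every call into the (non re-entrant) filesystem. */
static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical stand-in for an unprotected filesystem call. */
int raw_write(const void *data, size_t len)
{
	(void)data;
	return (int)len; /* pretend every byte was written */
}

/* Thread-safe wrapper: take the lock, call through, release the lock. */
int storage_write(const void *data, size_t len)
{
	assert(data != NULL);

	pthread_mutex_lock(&fs_lock);
	int ret = raw_write(data, len);
	pthread_mutex_unlock(&fs_lock);

	return ret;
}
```

The important part is that every thread goes through the same wrapper and the same lock. A single unprotected call elsewhere brings the concurrency problem straight back.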

9. There is no device for configuring GPIO

Zephyr uses devicetree to configure hardware devices. Long a part of Linux, devicetree has support for hundreds of devices, from generic SPI peripherals to specific ICs like the pi4ioe5v9539. It turns out it doesn't have one for GPIO though... yes, Zephyr abstracts the hardware so far away that there's no native method for configuring a GPIO pin.

Following the sinking-heart documentation trail, you eventually find this as the closest thing to a conclusion. Sheepishly, via long-winded discussions about what a GPIO pin is and what it means to configure it, and what exact line of code the problem is (remember, this is Zephyr, hardware is a figment of your imagination), there's some confirmation that the best option is to re-purpose the generic user-defined node zephyr,user. And it works! Now stop asking about weird hardware stuff like "GPIO".

10. WS2812 LED strips may work, but only by mistake

Our friend devicetree comes to the rescue if you want to drive a WS2812 LED strip (like some obscure weirdo). The worldsemi,ws2812-spi device will do the trick, although you will have to placate the enclosing nordic,nrf-spim device and give it decoy settings for the mandatory sck-pin, miso-pin and reg properties.

But take no inspiration from the suggested timing settings. It seems the authors aren't the oscilloscope and calculator types, and have just softwared it with trial and error until the lights came on. Incorrectly and unreliably, but it's software, so it's done.

The settings are actually vaguely intuitive - spi-one-frame specifies a bit mask for when a "one frame" should be high, and spi-zero-frame does the same for a "zero frame". spi-max-frequency is here to ruin the party again by taking your suggestion on a frequency and deciding better of it. But as long as you know you're not in charge, you can poke it in the right direction. Here's an example that worked for me:

	/*
	 * Sample values from Nordic for these settings are utter garbage. From first principles:
	 * WS2812 requires 1.25us +/- 0.6us frame period. For an 8-bit frame, that's 6.4MHz
	 * Zero frame is 0.4us high, 0.85us low. +/- 0.15us.
	 * One  frame is 0.8us high, 0.45us low. +/- 0.15us.
	 *
	 * Requesting 6.5MHz gets us 4MHz, so let's do 8MHz instead.
	 * At 8 MHz, 3 bits is 0.375us, 6 bits is 0.75us, period is 1.0us,
	 * all of which is good enough for us.
	 */
	spi-max-frequency = <8000000>;
	spi-one-frame = <0xFC>;
	spi-zero-frame = <0xE0>;
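
The arithmetic in that comment can be checked with a few lines of C. This is a sketch only: the helper names are hypothetical, and it assumes each SPI bit lasts exactly one clock period with the frame sent MSB first.

```c
#include <assert.h>
#include <stdint.h>

/* Number of leading 1 bits in an 8-bit frame pattern (sent MSB first). */
unsigned int ws2812_high_bits(uint8_t frame)
{
	unsigned int n = 0;

	while (n < 8 && (frame & (0x80u >> n))) {
		n++;
	}
	return n;
}

/* High time of one frame, in nanoseconds, at the given SPI clock. */
uint32_t ws2812_high_ns(uint8_t frame, uint32_t spi_hz)
{
	assert(spi_hz > 0);
	return (uint32_t)((1000000000ull * ws2812_high_bits(frame)) / spi_hz);
}

/* Total period of one 8-bit frame, in nanoseconds. */
uint32_t ws2812_period_ns(uint32_t spi_hz)
{
	assert(spi_hz > 0);
	return (uint32_t)((1000000000ull * 8u) / spi_hz);
}
```

At 8 MHz this gives 750 ns high for 0xFC, 375 ns high for 0xE0 and a 1000 ns frame period, matching the figures in the comment. At 6.4 MHz the period comes out at the nominal 1250 ns.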

11. Thread start cannot be delayed by condition, only fixed time

When you create a thread in Zephyr you have two options: either it starts then and there, or it starts after a specified timeout value. Nothing says deterministic sequencing like a bunch of fixed timeout values, right? The sorry state of K_FOREVER rears its head again.

Not wanting to tie the act of thread creation to the logic of sequencing, and not having found a way to wrangle join to delay thread start until certain prerequisites are met (despite a long attempt to do so), I resorted to a novel method. It's extremely simple, and works perfectly, but it is entirely undocumented, which leaves me nervous. So to inspire a smidge of confidence for the next wary traveller, here's my method:

#define K_THREAD_NO_START -1

I simply pass that as the "delay" parameter when creating a thread, and the thread doesn't start. Later, at a time of my choosing, I simply call k_thread_start() and off it goes!

Conclusion

Zephyr is a phenomenal piece of work. The adoption by Nordic and others is very encouraging, and it's hard to see how the model of vendor-supported driverware can't be the way forward for the industry. But today Zephyr is complex, with fast moving innards, spotty support, and far too many glaring bugs. It is simply not the reliable OS that embedded engineers have come to expect. Out of the box it is of the quality that would suit a consumer device, where a human is at the ready to mitigate the weirdness, upgrade regularly, and reset the thing every now and then. It's an experience on par with consumer OSes, not with deeply embedded, standalone deployments.

That said, Zephyr is unavoidable. If you're working on a cellular device today, Zephyr really needs to be part of the toolbox. And on that front, the challenges are surmountable. But it takes a skilled developer, with very good appreciation of hardware, to work through the kinks. Having achieved that, the seas are much calmer.

In the past, firmware was written by electronics engineers. Today it can be written by software teams. This is clearly the future that Zephyr is heading towards. But today the abstractions are too full of holes. You need to be able to peek under the covers to make it work.