February 2, 2017 - Derek Foreman
A Curious Wayland Bug
Enlightenment has a slightly unconventional architecture. For many of its internal dialogs (settings, the file manager, etc.) it uses what we call “internal windows.” These are simply regular toolkit windows that use regular rendering back-ends; in the good ol’ days, this meant a connection to the X server, which resulted in a render to a standard X window. In this scenario, the compositor gets the result back the same way as it does for all X client windows, and it composites all internal and external windows together.
Under Wayland, things get a bit scarier because now the compositor is the display server. Enlightenment makes a Wayland connection to itself; this gives us all manner of game-winning capabilities we never had before. For example, let’s say the internal window wants to sleep and wait for a reply from the compositor (such as a released buffer to render the next frame into). Since the compositor is the same process, it can never send that reply: we deadlock and fall down a hole. But I digress.
Internal windows also recently led us to a problem where the compositor exited and left nothing in the log but a cryptic message:
invalid object (3), type (zwp_linux_dmabuf_v1), message enter(uoa)
This prompts two questions:
- What does that even mean?
- How do I debug something like this?
Since both client and compositor errors are logged in the same file, the first surprise was learning that this is a client error message. An internal window connection was closing itself on the client end because it received something from the compositor that didn’t make any sense.
This error message means the client received a message of the form: foo.enter(unsigned integer, object, array) where the object’s id was 3: a zwp_linux_dmabuf_v1 object. The client knows what type “object” should have been because it knows what type “foo” is (even though it doesn’t bother to include this bit of information in the log), and that type isn’t zwp_linux_dmabuf_v1. So, the client doesn’t have any idea about what to do with this event. It then has an existential crisis to ponder: what if that event is important in discerning the meaning of all future events? The only thing left for it to do is quietly disconnect and exit(); since this particular client is the same process as the compositor, it takes the rest of the desktop with it.
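To make the “(uoa)” part less mysterious, here is a minimal sketch of how a Wayland message signature string breaks down into argument types. It's a toy decoder in Python, not anything from libwayland; the function name decode_signature is made up for illustration. The letter codes themselves are the ones the Wayland wire format uses.

```python
# Letter codes used in Wayland message signatures.
ARG_TYPES = {
    "i": "int",
    "u": "uint",
    "f": "fixed",
    "s": "string",
    "o": "object",
    "n": "new_id",
    "a": "array",
    "h": "fd",
}


def decode_signature(signature):
    """Turn a signature such as "uoa" into a list of argument type names.

    Hypothetical helper for illustration only.
    """
    args = []
    nullable = False
    for ch in signature:
        if ch.isdigit():
            # Leading digits encode the "since" version; not an argument.
            continue
        if ch == "?":
            # A "?" marks the next argument as nullable.
            nullable = True
            continue
        kind = ARG_TYPES[ch]
        args.append(kind + " (nullable)" if nullable else kind)
        nullable = False
    return args


print(decode_signature("uoa"))  # ['uint', 'object', 'array']
```

So the message in the log is an “enter” event carrying an unsigned integer, an object, and an array, in that order.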
How Do We Debug This Mess?
Well, a reasonable first step might be to add more instrumentation to libwayland-client to actually tell us the type of the object that received the bad message, but let’s be lazy.
Between the base Wayland protocol and extensions from the wayland-protocols repository, there are currently 6 possible “enter” events in Wayland. It doesn’t take too terribly long to match the signature from the error message in the log (the “(uoa)” part) to the keyboard enter event in the protocol XML file.
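If you'd rather not match signatures by eye, the lookup can be sketched in a few lines. This is only a sketch: the XML below is a heavily abridged fragment covering three of the core protocol's enter events (not all six, and with descriptions and other attributes stripped), and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Abridged fragment in the style of the core protocol XML (assumption:
# trimmed by hand for this example, descriptions and versions omitted).
PROTOCOL_XML = """
<protocol name="wayland">
  <interface name="wl_surface">
    <event name="enter">
      <arg name="output" type="object" interface="wl_output"/>
    </event>
  </interface>
  <interface name="wl_pointer">
    <event name="enter">
      <arg name="serial" type="uint"/>
      <arg name="surface" type="object" interface="wl_surface"/>
      <arg name="surface_x" type="fixed"/>
      <arg name="surface_y" type="fixed"/>
    </event>
  </interface>
  <interface name="wl_keyboard">
    <event name="enter">
      <arg name="serial" type="uint"/>
      <arg name="surface" type="object" interface="wl_surface"/>
      <arg name="keys" type="array"/>
    </event>
  </interface>
</protocol>
"""

TYPE_TO_CODE = {
    "int": "i", "uint": "u", "fixed": "f", "string": "s",
    "object": "o", "new_id": "n", "array": "a", "fd": "h",
}


def event_signatures(xml_text, event_name):
    """Map interface name -> signature for every event called event_name."""
    root = ET.fromstring(xml_text)
    signatures = {}
    for interface in root.iter("interface"):
        for event in interface.findall("event"):
            if event.get("name") != event_name:
                continue
            signature = ""
            for arg in event.findall("arg"):
                if arg.get("allow-null") == "true":
                    signature += "?"
                signature += TYPE_TO_CODE[arg.get("type")]
            signatures[interface.get("name")] = signature
    return signatures


def interfaces_matching(xml_text, event_name, signature):
    """List the interfaces whose event_name event has exactly this signature."""
    found = event_signatures(xml_text, event_name)
    return [name for name, sig in found.items() if sig == signature]


print(interfaces_matching(PROTOCOL_XML, "enter", "uoa"))  # ['wl_keyboard']
```

The same walk over the XML also tells you what the object argument was supposed to be: for the keyboard enter event it's a wl_surface.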
We know the client is exiting, but the client is likely not the location of the bug; rather, it seems to be exiting in response to garbage from the compositor. One of the several wl_keyboard_send_enter() call sites in Enlightenment is probably the root cause.
In fact, what was happening is that under certain circumstances a change of keyboard focus wasn’t cleaning up all internal state, and a keyboard enter event was being sent to the wrong client. In the other client, object id 3 was a wl_surface object, which is what the second argument of a keyboard enter is expected to be. It was actually fairly easy to sort out, but it was made much more difficult by the limited information in the log and the confusing way the client took the hit for a compositor bug.
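The root of the confusion is that object ids are only meaningful per client connection. Here's a toy model of that, assuming two clients whose object tables are invented for the example (only id 3 and the interface names come from the log above). The class and method names are hypothetical, and the message format simply mimics the one in the log; this is not the libwayland-client code.

```python
class ToyClient:
    """A client connection with its own private id -> interface table."""

    def __init__(self, name, objects):
        self.name = name
        self.objects = dict(objects)

    def check_object_arg(self, object_id, expected_interface, message):
        """Return an error string if the id isn't the expected kind of object."""
        actual = self.objects.get(object_id)
        if actual != expected_interface:
            return "invalid object (%d), type (%s), message %s" % (
                object_id, actual, message)
        return None


# Hypothetical object tables: the same id means different things per client.
external = ToyClient("external", {3: "wl_surface"})
internal = ToyClient("internal", {3: "zwp_linux_dmabuf_v1"})

# Delivered to the right client, the keyboard enter event is fine.
print(external.check_object_arg(3, "wl_surface", "enter(uoa)"))  # None

# Delivered to the wrong client, id 3 resolves to a different interface.
print(internal.check_object_arg(3, "wl_surface", "enter(uoa)"))
# invalid object (3), type (zwp_linux_dmabuf_v1), message enter(uoa)
```

Note that the "type" in the message is what the receiving client found at id 3, not what the compositor meant to send, which is exactly why the log was so misleading.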
Now that I’ve explained how to debug this error pattern, I’ll mention in passing that nobody else will have to go to quite these lengths again. We’ve recently landed changes to the wayland-server library which validate events before sending them to ensure it’s not mixing up objects from different clients. Now, the compositor will log an error with a much more coherent message, including the actual object interface name, instead of making it appear that the client is responsible for the wrongdoing. This makes it easier to start the debugging process in the right place, and should help make future debug marathons shorter. The client still gets disconnected, which is a little disruptive, but it’s a clean disconnect rather than an abort(), so the client is free to attempt a reconnect.
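The idea behind that validation can be sketched as follows. This is a toy model under my own assumptions, not the wayland-server implementation: the Resource class, the CrossClientError exception, send_event, and the wording of the error are all invented for illustration.

```python
class Resource:
    """A server-side handle for one object belonging to one client."""

    def __init__(self, client, object_id, interface):
        self.client = client
        self.object_id = object_id
        self.interface = interface


class CrossClientError(Exception):
    """Raised when an event would mix objects from different clients."""


def send_event(target, event_name, args):
    """Validate object arguments, then return the message that would be sent."""
    for arg in args:
        if isinstance(arg, Resource) and arg.client != target.client:
            # Catch the mix-up on the server side, naming both interfaces.
            raise CrossClientError(
                "%s.%s: argument %s@%d belongs to client %r, not %r" % (
                    target.interface, event_name, arg.interface,
                    arg.object_id, arg.client, target.client))
    wire_args = [a.object_id if isinstance(a, Resource) else a for a in args]
    return (target.client, target.object_id, event_name, wire_args)


# Hypothetical resources: a keyboard owned by the internal window's
# connection and a surface owned by some other client.
keyboard = Resource("internal", 7, "wl_keyboard")
surface = Resource("external", 3, "wl_surface")

try:
    send_event(keyboard, "enter", [42, surface, b""])
except CrossClientError as error:
    print("compositor error:", error)
```

With a check like this, the failure shows up where the bug lives, in the compositor, and names the interface that received the bad event.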
About Derek Foreman
Derek Foreman is a Senior Open Source Developer with Samsung's Open Source Group, specializing in graphics work. Previously, he worked on the graphics team at an open source consultancy where his work primarily focused on hardware enablement and software optimization for embedded systems. His career started at a biomedical institute where he developed analysis and control software for medical imaging equipment.
Image Credits: Open Source Way