January 18, 2016 - Cedric Bail
How the User Space Can Help the Linux Kernel Scheduler Improve Energy Efficiency
As I illustrated in my previous article, the Linux Kernel scheduler is currently far from making the most efficient use of the hardware we have, and this needs to be fixed. The Kernel community is hard at work on the problem, and we should understand how they intend to solve it so that user space applications are ready to take advantage of the result.
The Direction Taken by the Kernel Community
Obviously this is easier said than done, but a huge amount of work is under way in the Kernel community to fix the issue. The solution is simple to describe but very hard to implement, as it touches one of the core components of Linux. Essentially, the scheduler should take over the work of cpuidle and cpufreq, and both cpuidle and cpufreq should then be eliminated.
Amit Kucheria offers a great read on this subject, and I highly encourage you to read it, as anything I could say here would simply repeat it. To sum up the work, the scheduler will get an improved oracle that uses task and device history to decide where, when, and how a task should run. This should solve our problem, no?
Adapting the User Space to Help Out
I’m not so sure. Referring back to the hypothetical application from my previous blog post, it would still have a hard time dynamically adapting the CPU profile, because the task is changing constantly from IO bound, to CPU bound, to memory bound! It is quite certain that the user space will be required to adapt to help the Kernel make the right decisions.
The Kernel only keeps history per task, which means user space programs need to split the code they execute into different threads, grouped by what that code is doing. This gives each thread a constant profile from the Kernel’s point of view, allowing the Kernel to make better decisions regarding CPU frequency and idling. I’m referring to real system threads here, not lightweight threads, of course. This change doesn’t require the application to become parallel; the threads can still be scheduled one after the other using some kind of barrier. This should provide major benefits once the Kernel improvements are finished.
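As a minimal sketch of this idea, and not anything taken from EFL or the Kernel work, the Python example below gives each kind of work its own long-lived thread and hands a job from one thread to the next without ever running two of them at once. It assumes CPython, where threading.Thread is backed by a real system thread. The PhaseThread class, the process function, and the three stand-in workloads are hypothetical names invented for illustration.

```python
import queue
import threading


class PhaseThread:
    """Hypothetical helper: one long-lived thread dedicated to one kind of work."""

    def __init__(self, name, func):
        self._func = func
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._loop, name=name)
        self._thread.start()

    def _loop(self):
        while True:
            job = self._jobs.get()          # sleep until a job arrives
            if job is None:                 # None is the shutdown signal
                return
            arg, reply = job
            try:
                reply.put((True, self._func(arg)))
            except Exception as exc:        # report failures to the caller
                reply.put((False, exc))

    def call(self, arg):
        """Run func(arg) on the dedicated thread and wait for the result."""
        reply = queue.Queue(maxsize=1)
        self._jobs.put((arg, reply))
        ok, value = reply.get()             # caller blocks, so phases never overlap
        if not ok:
            raise value
        return value

    def stop(self):
        self._jobs.put(None)
        self._thread.join()


def process(count):
    # Stand-ins for an IO-bound, a CPU-bound, and a memory-bound phase.
    io = PhaseThread("io", lambda n: list(range(n)))
    cpu = PhaseThread("cpu", lambda values: [v * v for v in values])
    mem = PhaseThread("mem", lambda values: list(values))
    try:
        # Still sequential: each phase starts only after the previous one returns.
        return mem.call(cpu.call(io.call(count)))
    finally:
        for phase in (io, cpu, mem):
            phase.stop()
```

Calling process(4) runs the three phases strictly in order, but each phase always executes on the same thread, so each thread shows a single kind of behavior over its lifetime.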
A better approach would be to use this opportunity to reorganize the code around a pipeline and queues, a little bit like GStreamer does, but in a more general way. A lot of code already does its work as a pipeline: a sequence of steps that are more or less dependent on each other. Using a pipeline and queues should make it possible to have an architecture where each thread blocks until the job on which it depends is complete. This kind of approach seems to fit quite well with the scenegraph rendering logic that toolkits like EFL need.
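The pipeline and queue idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not how GStreamer or EFL implement it: every stage is one thread, stages are connected by queue.Queue objects, and the stage functions are assumed not to raise. The names stage, run_pipeline, and _DONE are hypothetical.

```python
import queue
import threading

_DONE = object()  # sentinel that tells each stage to shut down


def stage(func, inbox, outbox):
    """Apply func to every item from inbox and pass the result downstream."""
    while True:
        item = inbox.get()          # blocks until the upstream stage delivers
        if item is _DONE:
            outbox.put(_DONE)       # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(funcs, items):
    """Run items through funcs, one thread per function, preserving order."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for thread in threads:
        thread.start()

    for item in items:
        queues[0].put(item)
    queues[0].put(_DONE)

    results = []
    while True:
        out = queues[-1].get()
        if out is _DONE:
            break
        results.append(out)

    for thread in threads:
        thread.join()
    return results
```

For example, run_pipeline([lambda x: x + 1, lambda x: x * 2], [1, 2, 3]) pushes each number through both stages. A thread with nothing in its queue simply sleeps in get(), which is the blocking behavior described above.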
How EFL is Going to Take Advantage of This Improved Scheduler
It shouldn’t be the responsibility of the application to worry about this problem, as most of them will just do IO-bound tasks in the main loop and rely on libraries to do the heavy lifting. Also, many developers simply don’t have the experience necessary to make the right decisions on this, so we should provide some help. The main loop is becoming a place that developers use to react to outside events, and every other operation is being moved to a specific thread. The toolkit should be the one doing the heavy lifting!
The EFL community has begun the process of rebuilding the internal EFL pipeline to be more threaded. Our primary goal is not simply to improve speed (most of our operations are memory bound and one CPU is enough to reach their limits), but rather, we primarily hope to improve our energy efficiency by making scheduling more efficient. Still, we should see marginally faster performance in a few cases where tasks are CPU bound, like up/down scaling, or vector graphics, and we can now make them run in parallel.
We are also rewriting our immediate rendering code to use a retained rendering design. The idea is to split all of the rendering code into a prepare stage that is CPU intensive and a rendering stage that is memory bound. This way, the canvas can run the two stages in different threads: it can max out performance by running multiple CPU intensive threads at once, and serialize the memory bound jobs so that the memory bandwidth they need matches what the hardware offers.
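The split can be sketched as follows. This is an assumption-laden illustration rather than EFL code: prepare, render, and draw_frame are hypothetical names, the workloads are trivial stand-ins, and the example only shows the structure. In CPython the global interpreter lock keeps pure Python CPU work from truly running in parallel, so the performance benefit would only appear with native code doing the prepare work.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare(obj):
    # Stand-in for the CPU-intensive stage: turn an object into draw commands.
    return {"name": obj, "commands": [ord(ch) for ch in obj]}


def render(prepared, framebuffer):
    # Stand-in for the memory-bound stage: write the result to the target.
    framebuffer.append(prepared["name"])


def draw_frame(objects, workers=4):
    framebuffer = []
    with ThreadPoolExecutor(max_workers=workers) as prepare_pool, \
            ThreadPoolExecutor(max_workers=1) as render_pool:
        # Prepare jobs are all submitted up front and may run concurrently.
        prepared = [prepare_pool.submit(prepare, obj) for obj in objects]
        # Render jobs go to a single worker, so they run one at a time and
        # in submission order; each is queued once its prepare job is done.
        rendered = [
            render_pool.submit(render, job.result(), framebuffer)
            for job in prepared
        ]
        for job in rendered:
            job.result()  # re-raise any error from the render stage
    return framebuffer
```

The single-worker pool is the design choice that matters here: it caps the memory bound stage at one job at a time, while the number of prepare workers can be tuned to the number of CPUs.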
Of course, such a job is likely going to take us years… This should be fine as it should be finished around the same time the Linux Kernel developers are done with their work! Considering how complex this work is, I would encourage anyone working on a graphical toolkit to investigate it to see how they might benefit from this idea. A funny side note: it seems this same design will work quite well when you eventually decide to move to Vulkan. Maybe that’s a story for another blog post.
About Cedric Bail
Cedric has been contributing to EFL for a long time. He is known as the borker due to his work on optimizing the core libraries and triggering side effect bugs which tend to take years to be discovered.