May 20, 2015 - Shuah Khan

How Can IOMMU Event Tracing Help You?

Input/Output Memory Management Unit (IOMMU) event tracing can be extremely beneficial when debugging IOMMU hardware, BIOS, and firmware problems. In addition, it can be used to debug the Linux kernel IOMMU driver, device assignment, and performance problems related to device assignment in virtualized and non-virtualized environments.

If you aren’t familiar with IOMMU Event Tracing, the first article in this series covered the fundamental concepts behind it, in addition to how tracing can be used to track information about devices as they are moved between guest and host environments. This article will focus on how to use IOMMU event tracing effectively, and will provide a few examples of IOMMU event tracing in action.

How to Enable IOMMU Event Tracing at Boot-Time

IOMMU trace events can be enabled using the kernel boot option trace_event. The following enables all IOMMU trace events at boot-time:

The following enables map and unmap events at boot-time:
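As a minimal sketch of the two boot-time cases above, the helper below builds the text that would be appended to the kernel command line. It assumes that trace_event takes a comma-separated list of event or subsystem names, that the subsystem is named iommu, and that the events are named map and unmap. The function name build_trace_event_param is hypothetical and exists only for illustration.

```python
def build_trace_event_param(names):
    """Return a kernel command-line fragment such as 'trace_event=map,unmap'.

    `names` is a list of event or subsystem names. The names used in the
    examples below (iommu, map, unmap) are assumptions about the IOMMU
    tracepoints, not values taken from a running system.
    """
    cleaned = [n.strip() for n in names if n.strip()]
    if not cleaned:
        raise ValueError("at least one event or subsystem name is required")
    return "trace_event=" + ",".join(cleaned)


# Fragment intended to enable every event in the iommu subsystem.
all_iommu = build_trace_event_param(["iommu"])

# Fragment intended to enable only the map and unmap events.
map_unmap = build_trace_event_param(["map", "unmap"])

print(all_iommu)
print(map_unmap)
```

The resulting fragment would go on the kernel command line in the boot loader configuration, and the system would need a reboot for it to take effect.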

How to Enable IOMMU Event Tracing at Run-Time

A single IOMMU event or multiple events can be enabled at run-time. The following enables a single event:

The following will enable all events:
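One way to script the run-time cases above is to write 1 or 0 to the enable files under the tracing directory. This is a sketch under assumptions: it assumes the tracing directory is mounted at /sys/kernel/debug/tracing, that IOMMU events live under events/iommu, and that an event named attach_device_to_domain exists on the running kernel. The helper name set_iommu_events is hypothetical.

```python
from pathlib import Path

# Assumed mount point of the tracing directory; it may differ on some systems.
TRACING_ROOT = Path("/sys/kernel/debug/tracing")


def set_iommu_events(enabled, event=None, tracing_root=TRACING_ROOT):
    """Enable or disable one IOMMU trace event, or all of them.

    With `event` set, writes to events/iommu/<event>/enable.
    With `event` left as None, writes to events/iommu/enable, which is
    assumed to switch every event in the iommu subsystem.
    Returns the path that was written.
    """
    base = Path(tracing_root) / "events" / "iommu"
    target = base / event / "enable" if event else base / "enable"
    target.write_text("1" if enabled else "0")
    return target
```

On a real system this needs root privileges. Calling set_iommu_events(True, event="attach_device_to_domain") would enable a single event, and set_iommu_events(True) would enable all of them. Passing False turns them off again.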

Where Are the Traces?

Traces can be found in the /sys/kernel/debug/tracing/trace file. The following shows the trace format. For more details on tracing, please refer to the Documentation/trace directory in the kernel source tree, which contains the tracing documentation.

Traces provide insight into the state of the CPU on which the task is running. Individual fields such as irqs-off, need-resched, preempt-depth, and delay help debug problems. For example, long runs of a task with need-resched set might indicate problems in code paths that could result in bad response times for other tasks running on the system. These problems could be solved by fixing the relevant code paths.
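To make the per-line fields concrete, here is a rough sketch of a parser for one line of trace output. It assumes the common text layout of task-pid, CPU in brackets, a flags column, a timestamp, then the event name and message, and it decodes only the first two flag characters (irqs-off and need-resched). The sample line is made up for illustration and is not captured output. The exact flags column varies between kernel versions.

```python
import re

# Assumed layout of one trace line:
#   TASK-PID  [CPU]  FLAGS  TIMESTAMP: EVENT: MESSAGE
LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<flags>\S+)\s+(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<msg>.*)$"
)


def parse_trace_line(line):
    """Return a dict of fields, or None for headers and unrecognized lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    flags = m.group("flags")
    return {
        "task": m.group("task"),
        "pid": int(m.group("pid")),
        "cpu": int(m.group("cpu")),
        # First flag character: 'd' (or 'D') is taken to mean interrupts are off.
        "irqs_off": flags[0] in "dD",
        # Second flag character: anything other than '.' is taken to mean
        # some form of need-resched is set.
        "need_resched": len(flags) > 1 and flags[1] != ".",
        "timestamp": float(m.group("ts")),
        "event": m.group("event"),
        "message": m.group("msg"),
    }


# Hypothetical sample line, invented for this sketch.
SAMPLE = (
    "  qemu-kvm-1234  [002] dN.1   100.000001: unmap: "
    "IOMMU: iova=0x0000000000001000 size=4096 unmapped_size=4096"
)

fields = parse_trace_line(SAMPLE)
print(fields["task"], fields["event"], fields["need_resched"])
```

Filtering parsed lines on need_resched is one way to look for the long runs with need-resched set that the paragraph above describes.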

What Do IOMMU Group Event Traces Look Like?

The following group event traces are from a test system with Intel VT-d support. These events show the device and group mapping.

What Does lspci Show?

The following lspci output from the same system gives you information on each of the devices the IOMMU found and classified into groups.

 

[Figures device_topo1, device_topo2, and device_topo3: device topology as depicted by this lspci output]

IOMMU Groups and Device Topology

When comparing IOMMU traces with the lspci device topology and the IOMMU topology, you will notice that some groups contain more than one device. For example, the ISA bridge, SATA controller, and SMBus are placed in group 12, and a PCI bridge and a PCIe root port are placed in group 10. You will also notice that the network controller (02:00.0) is in a separate group, group 13, even though it sits under the PCI bridge (00:1c.0) and PCI root port #1 device hierarchy, which are in group 8.

This is a good example of device isolation under a PCI bridge and PCI root port hierarchy. The PCI bridge and root port are placed in the same group, whereas the network controller is in its own group and can be assigned to a VM. In the case of the PCI bridge (00:1c.3), PCI root port #3, and PCI bridge (04:00.0) hierarchy, all of them are placed in group 10. Devices that have dependencies on each other are usually placed in a group together so they can be isolated as a group. The following shows the IOMMU topology derived from the trace events generated as devices get added to individual groups during boot-time.

[Figure: IOMMU Topology]
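A group-to-device view like the one described above could be derived from group events with a few lines of code. The sketch below assumes each add_device_to_group event carries a message containing groupID=<number> and device=<name>. The sample lines are invented to match the group numbers mentioned in the text and are not captured trace output.

```python
import re
from collections import defaultdict

# Assumed message shape: "add_device_to_group: IOMMU: groupID=<n> device=<name>"
GROUP_RE = re.compile(r"add_device_to_group:.*?groupID=(\d+)\s+device=(\S+)")


def groups_from_trace(lines):
    """Map each IOMMU group ID to the list of devices added to it."""
    groups = defaultdict(list)
    for line in lines:
        m = GROUP_RE.search(line)
        if m:
            groups[int(m.group(1))].append(m.group(2))
    return dict(groups)


# Hypothetical lines; the task, CPU, and timestamps are placeholders.
SAMPLE_LINES = [
    "swapper/0-1 [000] .... 1.000001: add_device_to_group: IOMMU: groupID=8 device=0000:00:1c.0",
    "swapper/0-1 [000] .... 1.000002: add_device_to_group: IOMMU: groupID=10 device=0000:00:1c.3",
    "swapper/0-1 [000] .... 1.000003: add_device_to_group: IOMMU: groupID=13 device=0000:02:00.0",
    "swapper/0-1 [000] .... 1.000004: add_device_to_group: IOMMU: groupID=10 device=0000:04:00.0",
]

for group_id, devices in sorted(groups_from_trace(SAMPLE_LINES).items()):
    print(group_id, devices)
```

A group that lists a single device, such as the network controller here, is a candidate for assignment on its own, while a group with several devices has to be handled as a unit.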

What Do IOMMU Device Event Traces Look Like?

What Do IOMMU map/unmap Event Traces Look Like?

Great, We Have Traces! Using Traces to Solve Problems

Using traces, we can get insight into the IOMMU device topology to see which devices belong to which groups, and into run-time device assignment changes as devices move from host to guests and back to host. In turn, this makes it much easier to debug IOMMU problems, device assignment problems, and BIOS and firmware problems related to the IOMMU hardware and firmware implementation, and to detect and solve performance problems.

VFIO Based Device Assignment Use-Case

Alex Williamson, a VFIO maintainer, enabled IOMMU traces for vfio-based device assignment and found the following VFIO problems:

  • A large number of unmap calls were being made on a VT-d system without IOMMU super-page support. The VFIO unmap path optimization relies on IOMMU super-page support, so without it each page is unmapped individually.
  • Unnecessary single page mappings for invalid and reserved memory regions, like mappings of MMIO BARs.
  • Too many instances of very long tasks being run with need-resched set.

Result 1: VFIO Patch Series to Fix Problems!

The problems above were fixed, reducing the number of unmap calls to 2% of the original on Intel VT-d without IOMMU super-page support. Before the fix, traces showed 472,574 maps and 5,217,244 unmaps: more than 10 times as many unmaps as maps. After the fix, traces showed 9,509 maps and 9,509 unmaps, a very significant reduction. In addition, long task runs with need-resched set became much more sporadic.
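Counts like these can be gathered by tallying map and unmap events in a saved trace. The following is a small sketch that assumes each event appears on its own line as "map: IOMMU: ..." or "unmap: IOMMU: ..." after the timestamp. The sample lines are invented for illustration.

```python
import re
from collections import Counter

# Matches the event name that follows the timestamp, e.g. "...: unmap: IOMMU: ..."
EVENT_RE = re.compile(r":\s+(map|unmap):\s+IOMMU:")


def count_map_unmap(lines):
    """Return (number of map events, number of unmap events)."""
    counts = Counter()
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts["map"], counts["unmap"]


# Hypothetical trace lines: one mapping followed by two single-page unmaps.
SAMPLE_LINES = [
    "qemu-1234 [002] .... 100.000001: map: IOMMU: iova=0x0000000000001000 paddr=0x0000000010000000 size=8192",
    "qemu-1234 [002] .... 100.000002: unmap: IOMMU: iova=0x0000000000001000 size=4096 unmapped_size=4096",
    "qemu-1234 [002] .... 100.000003: unmap: IOMMU: iova=0x0000000000002000 size=4096 unmapped_size=4096",
]

maps, unmaps = count_map_unmap(SAMPLE_LINES)
print(maps, unmaps, unmaps / maps)
```

With the counts quoted above, 5,217,244 / 472,574 works out to roughly 11 unmaps per map before the fix, and 9,509 / 9,509 is exactly 1 after it.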

Result 2: Improvements to the IOMMU Tracing Feature

Alex also found a few bugs in the IOMMU tracing API and suggested improvements to it. I would like to acknowledge Alex for using IOMMU tracing for VFIO based device assignments and for his feedback on improving the IOMMU Event Tracing API. The following fixes and improvements to the IOMMU tracing API went into Linux 4.0:

  • trace_iommu_map() should report the original iova and size the map routine is called with, as opposed to the iova and size at the end of mapping.
  • trace_iommu_unmap() should report the original iova, size, and unmapped size.
  • The size field was handled as an int and could overflow.
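The overflow in the last item is easy to see with a little arithmetic. This sketch uses Python's ctypes to mimic storing a size in a signed 32-bit integer, assuming int is 32 bits wide as it is on common Linux platforms. The sizes are hypothetical examples, not values from the traces discussed here.

```python
import ctypes

GIB = 1 << 30  # bytes in one GiB


def as_int32(value):
    """Return what `value` becomes when stored in a signed 32-bit integer.

    ctypes integer types do no overflow checking, so out-of-range values
    wrap around just as they would in a C int of this width.
    """
    return ctypes.c_int32(value).value


# Hypothetical mapping sizes, chosen only to show the wrap-around.
small = as_int32(4096)         # fits comfortably
two_gib = as_int32(2 * GIB)    # one past the largest positive int32
four_gib = as_int32(4 * GIB)   # wraps all the way around

print(small, two_gib, four_gib)
```

A size of 2 GiB or more no longer fits, so a trace that reported it through an int would show a negative or truncated value rather than the real size.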

Does Run-Time Tracing Add Overhead?

You may be wondering what kind of overhead the IOMMU tracing code introduces. Tracepoint code is inactive unless the tracepoint is enabled. When a tracepoint is enabled, the code is modified at run-time to include the tracepoint code, so no conditional logic is added to the code path to decide whether or not to generate a trace message. Tracepoints use jump labels, which work by patching a branch in the code.
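As a loose analogy only, and not how the kernel implements it, the idea of switching behavior by changing the code path rather than testing a flag can be sketched in Python. Here the call site always calls the same attribute, and enabling the tracepoint rebinds that attribute from a no-op to a probe. All names are hypothetical.

```python
class Tracepoint:
    """Toy model of a tracepoint whose call target is swapped on enable/disable."""

    def __init__(self, name):
        self.name = name
        self.records = []
        self.fire = self._nop  # disabled by default

    def _nop(self, *args):
        """Stand-in for the disabled case: does nothing."""
        return None

    def _probe(self, *args):
        """Stand-in for the enabled case: records the event."""
        self.records.append((self.name, args))

    def enable(self):
        self.fire = self._probe

    def disable(self):
        self.fire = self._nop


def map_page(tracepoint, iova, size):
    """Hypothetical hot path: note there is no 'if enabled' test here."""
    tracepoint.fire(iova, size)
    return iova + size


tp = Tracepoint("map")
map_page(tp, 0x1000, 4096)   # nothing recorded while disabled
tp.enable()
map_page(tp, 0x1000, 4096)   # recorded once enabled
print(tp.records)
```

The analogy stops there: the kernel patches machine instructions in place, whereas this sketch still pays for a Python call even when the tracepoint is disabled.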

When it is disabled, the code path looks like:
When it is enabled, the code path looks like: (notice how the tracepoint code appears in the code path below)

Future Enhancements & Closing Thoughts

At the moment, I/O page faults are the only type of autonomous error that is traced. In the future, additional errors and faults could be traced, but this would largely depend on support for such reports being funneled into the IOMMU drivers from the IOMMU hardware and firmware layers. One area that has been considered is the addition of error reporting and tracing to ARM IOMMU drivers, if the underlying IOMMU hardware and firmware support autonomous error reporting to the kernel.

I hope this article will help Linux kernel developers and users learn about a feature that can aid during development, maintenance, and support activities on systems with IOMMU hardware support. By reading this article, Linux kernel developers and users can learn how to enable IOMMU event tracing to get reports on IOMMU boot-time and run-time events and errors. Please see the following references to learn more about the use of IOMMUs in virtualized Linux environments and the Linux VFIO PCI device assignment feature.

References

Image Credits: OSDC

About Shuah Khan

Shuah Khan is a Senior Linux Kernel Developer at Samsung's Open Source Group. She is a Linux Kernel Maintainer and Contributor who focuses on Linux Media Core and Power Management. She maintains the Kernel Selftest framework and has contributed to the IOMMU and DMA areas. In addition, she is helping with stable release kernel testing. She authored the Linux Kernel Testing and Debugging paper published in Linux Journal and writes Linux Journal kernel news articles. She has presented at several Linux conferences and Linux Kernel Developer Keynote Panels. She served on the Linux Foundation Technical Advisory Board. Prior to joining Samsung, she worked as a kernel and software developer at HP and Lucent.
