November 30, 2015 - Mario Smarduch
An Introduction to Live Migration in NFV Deployment
This article is the first of two on live migration in NFV Deployment.
Live migration is a generic feature that is important in all cloud environments, including Network Functions Virtualization (NFV) applications. This post covers likely NFV use cases for live migration, the types of loads that are likely to be executed on compute nodes, an overview of live migration, and some common issues to consider, with a focus on LTE Evolved Packet Core (EPC). This article uses VNF (Virtual Network Function) and guest interchangeably. A VNF is essentially the guest equivalent of a hardware Network Element.
The Various Uses of Live Migration
Here are some common live migration uses:
- Load Balancing – Migrating VMs between compute nodes to balance the load
- Hardware Maintenance – Routine maintenance, such as hardware upgrades, or IPMI events, such as thermal sensor readings that indicate the node may soon fail (for example, because of cooling issues). This is a common NFV scenario.
- Point of Presence – Migration to a local facility to debug, analyze performance, and more. It’s common to debug a VNF during deployment.
- Enhancing Application Elasticity – For NFV, this might be complicated and require scaling out the VNF graph horizontally along the graph, and vertically across the hardware as illustrated below.
LTE procedures standardized by 3GPP (e.g. attach, service request, paging) are carried out through messages exchanged between network elements. The routes taken by these messages create a VNF graph like the one illustrated above.
New VNF instances and an updated VNF graph are required to scale cloud applications; these are not trivial tasks. For example, in LTE there are endpoints that network elements must attach to; an MME pool area may have several Mobility Management Entities (MME), and the evolved Node B (eNB) must be connected to all of them. As the control plane traffic increases, MMEs must also scale. Similar events must occur between the Serving Gateway (SGW) and eNB for the data plane. Similarly, this holds true when creating a bearer; a GTP-U tunnel is created using the IP address of the VNF destination endpoint for default routes, such as the one between the eNB and SGW.
One approach is to launch all the VNFs to create a complete VNF graph. Elasticity can then be achieved by migrating the VNF graph on demand, or according to predetermined policies based on traffic patterns observed by time of day or special events. The following figure illustrates all VNFs executing on NEa-c; when a VNF’s load rises above a certain threshold it’s migrated to NEa’-c’, and vice versa when the load drops below the threshold.
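The threshold rule described above can be sketched in a few lines of Python. This is only an illustration: the class name, the thresholds, and the use of two separate thresholds (hysteresis, to avoid bouncing back and forth when the load hovers around one value) are assumptions and not part of any NFV orchestrator. Setting both thresholds to the same value gives the single-threshold behavior.

```python
from dataclasses import dataclass


@dataclass
class GraphPlacementPolicy:
    """Hypothetical policy deciding where a VNF graph should run."""
    high_threshold: float          # load above this triggers migration to NEa'-c'
    low_threshold: float           # load below this triggers migration back to NEa-c
    on_scaled_nodes: bool = False  # True while the graph runs on NEa'-c'

    def decide(self, load: float) -> str:
        """Return 'migrate_out', 'migrate_back', or 'stay' for one load sample."""
        if not self.on_scaled_nodes and load > self.high_threshold:
            self.on_scaled_nodes = True
            return "migrate_out"
        if self.on_scaled_nodes and load < self.low_threshold:
            self.on_scaled_nodes = False
            return "migrate_back"
        return "stay"


# Load samples are fractions of the capacity the VNF graph was built for.
policy = GraphPlacementPolicy(high_threshold=0.80, low_threshold=0.60)
decisions = [policy.decide(load) for load in (0.50, 0.85, 0.70, 0.55, 0.90)]
print(decisions)
```

A time-of-day or special-event policy would feed the same decision function with a predicted load instead of a measured one.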
Tuning the System for Live Migration
The advantage of launching all VNFs (built for maximum load) at once is better elasticity. VNFs build up state and run ‘hot’, as opposed to launching a fresh VNF. As the load increases, the VNF graph is migrated to other network elements intact.
However, the loads executed by VNFs tend to be real-time and deterministic with strict message latencies. The VNF is designed to process a certain number of procedures and message types per second. For any load beyond these specs, the VNF will go into load shedding mode where it will drop requests to handle the current load; this is typically based on CPU utilization.
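As a rough illustration of CPU-based load shedding, the sketch below drops a growing fraction of new requests once CPU utilization passes a threshold. The function names, the 85% starting point, and the linear ramp are all assumptions chosen for the example; real VNFs implement their own shedding rules.

```python
def shed_fraction(cpu_utilization: float,
                  shed_start: float = 0.85,
                  shed_full: float = 1.00) -> float:
    """Fraction of new requests to drop at the given CPU utilization (0.0 to 1.0)."""
    if cpu_utilization <= shed_start:
        return 0.0          # within spec: accept everything
    if cpu_utilization >= shed_full:
        return 1.0          # saturated: drop all new requests
    # Linear ramp between the two thresholds (an assumption of this sketch).
    return (cpu_utilization - shed_start) / (shed_full - shed_start)


def requests_admitted(offered_per_s: int, cpu_utilization: float) -> int:
    """Number of requests per second still accepted after shedding."""
    return round(offered_per_s * (1.0 - shed_fraction(cpu_utilization)))


for cpu in (0.50, 0.90, 1.00):
    print(cpu, requests_admitted(1000, cpu))
```

The relevance for live migration is that any extra CPU cost it adds pushes the VNF closer to the point where shedding starts.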
Live migration in this environment requires some detailed knowledge about what happens under the hood in KVM, QEMU, the host kernel, and the hardware. The goal here is to identify the ‘problem area’ so that the correct knobs can be tuned. There is no single magic tuning guide, which is typically the case with wireless infrastructure development because it is an iterative process that requires several development cycles before deployment. Any introduction of new hardware requires the entire cycle to be repeated.
A VNF is composed of several components and all run-time states must be migrated to the destination hardware node. The guest state is stored in QEMU, KVM, and the host kernel. In order to live migrate a VNF, all these states must be migrated with minimum downtime. This involves:
- The guest CPU, interrupt, and emulated or virtual I/O state.
- Memory, which is the most intensive VNF state to migrate, and can be updated in several ways. Virtual I/O may write directly into memory, or in some cases KVM may write into guest memory for para-virtualized operations. However, it’s the guest’s kernel and applications that write to memory the most by far.
A guest must go through two levels of translation in order to access physical memory. First, the guest translates a Virtual Address (VA) to a Guest Physical Address (GPA). Then, the GPA is used as an input to the 2nd stage Memory Management Unit (MMU) tables that translate from a GPA to a Host Physical Address (HPA) that points to physical RAM. KVM sets up 2nd stage tables to cover all of the physical regions of the emulated platform the guest has access to. QEMU interfaces with KVM through an ioctl interface to retrieve a dirty pages map, and to write-protect GPA ranges. When the guest writes to a write-protected memory location, a 2nd stage fault occurs, giving KVM the opportunity to mark the page dirty. In short, the live migration process of memory follows these steps:
1. QEMU enables dirty page logging.
2. KVM walks the 2nd stage page tables and write-protects the page table entries.
3. Upon a guest write, a 2nd stage fault is triggered, allowing KVM to mark the page in the dirty page log.
4. QEMU reads the dirty page log, and KVM write-protects the dirty pages again, clearing the log bits.
5. QEMU walks the dirty page log, transfers the pages to the remote QEMU, and returns to step 4.
Steps 1 and 2 are a one-time initialization. Step 3 is handled by the KVM MMU and executes repeatedly, first as a result of step 2 and then again after each pass through step 4. Finally, steps 4 and 5 are QEMU code that’s executed at a rate determined by other factors, such as the network bandwidth that’s available to transmit the dirty pages.
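The five steps can be modeled with a small toy simulation. This is not the KVM or QEMU API: the class and function names are made up for illustration, memory is a plain Python list with one value per page, and guest writes and page transfers alternate instead of running concurrently as they do in a real migration.

```python
class ToyStage2:
    """Toy stand-in for the 2nd stage tables and the dirty page log."""

    def __init__(self, num_pages):
        self.write_protected = [False] * num_pages
        self.dirty = [False] * num_pages
        self.logging = False
        self.faults = 0

    def enable_dirty_logging(self):
        # Steps 1 and 2: turn logging on and write-protect every entry.
        self.logging = True
        self.write_protected = [True] * len(self.write_protected)

    def guest_write(self, page):
        # Step 3: the first write to a protected page faults and is logged.
        if self.logging and self.write_protected[page]:
            self.faults += 1
            self.dirty[page] = True
            self.write_protected[page] = False  # later writes run at full speed

    def get_and_clear_dirty_log(self):
        # Step 4: hand out the log, clear it, and re-protect the dirty pages.
        dirty_pages = [p for p, is_dirty in enumerate(self.dirty) if is_dirty]
        for p in dirty_pages:
            self.dirty[p] = False
            self.write_protected[p] = True
        return dirty_pages


def precopy(stage2, guest_memory, remote_memory, writes_per_round):
    """Send all pages once, then resend whatever each round of writes dirtied."""
    stage2.enable_dirty_logging()
    remote_memory[:] = guest_memory              # initial full pass
    for round_writes in writes_per_round:
        for page, value in round_writes:         # guest keeps running
            guest_memory[page] = value
            stage2.guest_write(page)
        for page in stage2.get_and_clear_dirty_log():
            remote_memory[page] = guest_memory[page]   # step 5


guest = [0] * 8
remote = [None] * 8
stage2 = ToyStage2(len(guest))
precopy(stage2, guest, remote, [[(1, 5), (1, 6), (3, 7)], [(1, 9)]])
print(remote == guest, stage2.faults)
```

Note that the two back-to-back writes to page 1 in the first round cost only one fault, while the write in the second round faults again because step 4 re-protected the page.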
It’s important to understand that what matters is not how often the guest writes to memory, but rather the range and rate of updates. A guest may write rapidly to just a few pages, and live migration will still converge almost immediately.
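A back-of-the-envelope model makes this concrete. The sketch below is a simplification under stated assumptions: every write is pessimistically assumed to hit a new page until the whole working set is dirty, the link speed is expressed in pages per second, and the function name and all of the numbers are hypothetical.

```python
def precopy_rounds(total_pages, working_set_pages, writes_per_s,
                   link_pages_per_s, downtime_pages, max_rounds=30):
    """Return the number of pre-copy rounds needed before the remaining dirty
    set fits into the allowed downtime, or None if that never happens."""
    remaining = total_pages                      # first round sends everything
    for round_no in range(1, max_rounds + 1):
        send_time = remaining / link_pages_per_s
        # Pages dirtied while sending. Repeated writes to the same page only
        # count once, so the dirty set can never exceed the working set.
        remaining = min(working_set_pages, writes_per_s * send_time)
        if remaining <= downtime_pages:
            return round_no
    return None


common = dict(total_pages=1_000_000, link_pages_per_s=250_000, downtime_pages=1_000)

# Very high write rate, tiny range: converges after the first round.
hot_loop = precopy_rounds(working_set_pages=100, writes_per_s=10_000_000, **common)

# Lower write rate, wide range, dirtying faster than the link: never converges.
wide_fast = precopy_rounds(working_set_pages=800_000, writes_per_s=300_000, **common)

# Wide range, but dirtying slower than the link: converges after several rounds.
wide_slow = precopy_rounds(working_set_pages=800_000, writes_per_s=100_000, **common)

print(hot_loop, wide_fast, wide_slow)
```

In this model the guest with the highest write rate is the easiest to migrate, because the range it touches is small.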
The Biggest Challenges of Live Migration
To close, I’ll take a closer look at some of the challenges live migration presents; the primary challenge is to keep performance and latencies near equal to normal guest execution.
- When live migration is launched, a new thread is created to manage the migration. NFV compute node resources are precisely partitioned between the host and VNFs, but this thread may be scheduled anywhere and interfere with either the QEMU IO thread or the RT-vCPU thread.
- The logging of guest writes requires MMU page table write protection. This means the guest will take a performance hit from processing more 2nd stage faults. The load threshold that triggers live migration must take this into account; that load should be comfortably handled during live migration.
- Regarding the migration thread, its placement is important because it walks the guest memory regions (from the QEMU address space) and will cause a lot of cache line bouncing as guest vCPUs access the same memory. The further away the thread is, the more expensive cache coherency will be, resulting in lower overall performance.
- Another consideration is network bandwidth. A separate, physical subnet should be used for live migration so the overall NFV infrastructure doesn’t take a hit. Again, the migration thread is reading and sending memory over the NIC, so it should be running on a node with a direct connection to the NIC.
- For live migration, many knobs and statistics have been added to QEMU that should be considered for NFV infrastructure:
- Bandwidth limitation – If network bandwidth is exceeded, vCPUs will be throttled or stunned, which means vCPU PREEMPT_RT scheduling won’t matter much.
- Downtime – QEMU calculates a downtime on the fly through auto-convergence; this may lead to indefinite pre-copy. Instead, you can fix the downtime roughly based on ‘(dirty rate/s * page size in bits)/bandwidth in Mbps’ and tell QEMU to enforce it.
- There are some additional options in QEMU to look at, like compression and dirty page bytes tracking. However, these add unpredictability to overall live migration and source VNF execution. Their performance depends on guest execution, and that lack of determinism is not acceptable for NFV; the worst-case scenario should be planned for.
- Huge pages – These are great for performance because they improve TLB reuse and shorten page table walks. However, huge pages are bad for live migration, since a write to a huge page requires a copy of the entire page (2MB and 1MB vs 4KB and 16KB). This needs to be addressed in the KVM MMU code by breaking up huge pages to support memory intensive loads.
- Once huge pages are broken up, bandwidth utilization will improve, but other problems are introduced, such as additional page faults and flushing. Nevertheless, migration speed may be increased by a huge factor.
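The downtime and huge page points above lend themselves to some quick arithmetic. The sketch below is one possible reading of the rough formula: the ratio of dirtied bits per second to link bits per second says whether pre-copy can keep up, and the final stop-and-copy time is the remaining dirty data divided by the bandwidth. The function names and the example numbers (a 10 Gbps migration link, 10,000 dirty pages per second, 1,000 pages left at the end) are assumptions for illustration only.

```python
def dirty_to_bandwidth_ratio(dirty_pages_per_s, page_size_bytes, bandwidth_mbps):
    """Dirtied bits per second divided by link bits per second.
    A value of 1.0 or more means pre-copy cannot keep up with the guest."""
    dirty_bits_per_s = dirty_pages_per_s * page_size_bytes * 8
    return dirty_bits_per_s / (bandwidth_mbps * 1_000_000)


def stop_and_copy_ms(remaining_dirty_pages, page_size_bytes, bandwidth_mbps):
    """Time in milliseconds to send the last dirty pages while the guest is paused."""
    remaining_bits = remaining_dirty_pages * page_size_bytes * 8
    return remaining_bits / (bandwidth_mbps * 1_000_000) * 1000


SMALL_PAGE = 4 * 1024          # 4KB page
HUGE_PAGE = 2 * 1024 * 1024    # 2MB huge page
LINK_MBPS = 10_000             # assumed dedicated 10 Gbps migration subnet

ratio_small = dirty_to_bandwidth_ratio(10_000, SMALL_PAGE, LINK_MBPS)
ratio_huge = dirty_to_bandwidth_ratio(10_000, HUGE_PAGE, LINK_MBPS)
downtime_small = stop_and_copy_ms(1_000, SMALL_PAGE, LINK_MBPS)
downtime_huge = stop_and_copy_ms(1_000, HUGE_PAGE, LINK_MBPS)

print(f"4KB pages: ratio {ratio_small:.3f}, downtime {downtime_small:.1f} ms")
print(f"2MB pages: ratio {ratio_huge:.3f}, downtime {downtime_huge:.1f} ms")
```

With identical guest behavior, tracking dirty memory at 2MB granularity multiplies both figures by 512, which is why breaking up huge pages during migration can pay off despite the extra faults.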
Overcoming The Challenges of Live Migration
If you are building cloud applications, it is almost certain you will need to perform live migration at some point. This is a complicated process that has no single solution, so it is vital to understand the hurdles that must be overcome. This article explains the root of these hurdles to prepare you to understand how to mitigate them. The next article in this series will have a closer look at these challenges and provide practical guidelines to mitigate their impact. It will explore these concepts more deeply to address the various challenges encountered while migrating a guest.
About Mario Smarduch
Mario Smarduch is no longer employed by Samsung. We thank him for all of his great work and wish him the best in his future endeavors. Mario Smarduch was a Senior Virtualization Architect at the Samsung Open Source Group in Silicon Valley. His primary areas of expertise are ETSI, ISG, NFV, and KVM-ARM virtualization extensions, real-time, zero copy, inter-guest communication, and optimized device passthrough. He's been involved with the Linux kernel for the past 15 years and virtualization for the last 5. He has worked mostly on the product side, but also interacts on and off with open source communities as needed. In the past Mario has worked for Huawei Technologies, Motorola Mobility/Solutions, Compaq, Tandem, Qualix/EMC, Sun Microsystems, Integrated Micro Products, and AT&T Bell Labs.
Image Credits: Perspecsys Photos