Traditionally, data centers have used VLANs to enforce Layer 2 isolation. As data centers grow and the need arises to extend Layer 2 networks across, or even beyond, a data center, the shortcomings of VLANs become evident:
- A multi-tenant environment sharing the same L2/L3 infrastructure at a Cloud Service Provider may require thousands of VLANs to partition traffic. The current limit of 4096 VLAN IDs (some of which are reserved) is not enough.
- With server virtualization, each Virtual Machine (VM) requires a unique MAC address and an IP address, resulting in thousands of MAC table entries on upstream switches. This places a much larger demand on the table capacity of those switches.
- VLANs are too restrictive in terms of distance and deployment. VTP can deploy VLANs across L2 switches, but most operators prefer to disable VTP due to its destructive nature.
- Using STP to keep the L2 topology loop-free disables most redundant links, so Equal-Cost Multi-Path (ECMP) is hard to achieve. ECMP, however, is easy to achieve in an IP network.
VXLAN – Virtual eXtensible LAN
VXLAN addresses the above challenges. It is meant to provide the same services to Ethernet-attached end systems that VLANs do today, while also providing a means to stretch an L2 network over an L3 network. The VXLAN ID (called the VXLAN Network Identifier, or VNI) is 24 bits long, compared to the 12-bit VLAN ID, and hence provides over 16 million unique IDs.
VXLAN is defined in this draft document. The draft introduces a new entity called the VXLAN Tunnel End Point (VTEP). VTEPs connect an access switch (currently a virtual switch) to the IP network. The VTEP is located within the hypervisor that houses the VMs. The function of the VTEP is to encapsulate VM traffic within an IP header and send it across an IP network. The VMs themselves are unaware of VXLAN.
As mentioned above, each VTEP has two interfaces: one is a trunk interface to the Bridge Domain on the access (virtual) switch, and the other is an IP interface to the IP network. The VTEP is assigned an IP address and acts as an IP host on the IP network. Each Bridge Domain is associated with a VXLAN ID (sometimes called a Segment ID), and each VXLAN ID is in turn associated with an IP multicast group. Two types of VM-to-VM communication are discussed next.
1. VM-to-VM communication: Unicast traffic
A VM in the left server in the figure above sends traffic to a VM in the right server. Based on the Bridge Domain configuration, the VM's traffic is assigned a VNI. The VTEP then determines whether the destination VM is on the same segment. If so, the VTEP encapsulates the original Ethernet frame with an outer MAC header, an outer IP header, and a VXLAN header. The complete packet is sent out to the IP network with the destination IP address of the remote VTEP connected to the destination VM. The remote VTEP decapsulates the packet and forwards the frame to the connected VM. The remote VTEP also learns the mapping of the inner source MAC address to the outer source IP address.
2. VM-to-VM communication: Broadcast and Unknown Unicast traffic
A VM in the left server wants to communicate with a VM in the right server on the same subnet, so it sends out an ARP broadcast. The VTEP encapsulates this ARP packet with an IP header, a UDP header, and a VXLAN header, but sends the packet to the IP multicast group associated with the VXLAN ID. VTEPs send IGMP Membership Report packets to the upstream router to join or leave the VXLAN-related IP multicast groups. The remote VTEP, a receiver for the same IP multicast group, receives the traffic from the source VTEP. Again, it creates a mapping of the inner source MAC to the outer source IP address, and forwards the traffic to the destination VM. The destination VM sends a standard ARP response using IP unicast, and its VTEP encapsulates the response back to the VTEP connecting the originating VM using IP unicast VXLAN encapsulation.
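The two forwarding cases above can be sketched in a few lines of Python. This is a hypothetical illustration (not Nexus 1000v code, and the class and names are invented for this sketch): a VTEP floods broadcast and unknown-unicast frames to the VNI's multicast group, and learns remote inner-MAC-to-VTEP-IP mappings when decapsulating.

```python
BROADCAST = "ff:ff:ff:ff:ff:ff"

class Vtep:
    """Minimal sketch of a VTEP's forwarding decision for one VNI."""

    def __init__(self, vni, mcast_group):
        self.vni = vni
        self.mcast_group = mcast_group   # IP multicast group for this VNI
        self.mac_to_vtep = {}            # learned: inner MAC -> remote VTEP IP

    def outer_dest_ip(self, dest_mac):
        """Choose the outer destination IP for an encapsulated frame."""
        if dest_mac == BROADCAST or dest_mac not in self.mac_to_vtep:
            # Broadcast / unknown unicast: flood via the multicast group.
            return self.mcast_group
        # Known unicast: send straight to the remote VTEP.
        return self.mac_to_vtep[dest_mac]

    def learn(self, inner_src_mac, outer_src_ip):
        """On decapsulation, map the inner source MAC to the sending VTEP."""
        self.mac_to_vtep[inner_src_mac] = outer_src_ip
```

For example, a frame toward an unknown MAC first goes to the multicast group; once the reply is decapsulated and `learn()` runs, subsequent frames to that MAC are unicast to the remote VTEP.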
VXLAN Frame Format
A VXLAN header is 8 bytes long and contains:
- Flags field (8 bits): The I flag is set to 1, which means the header carries a valid VXLAN ID. The remaining bits are reserved (R) and set to 0.
- VXLAN Network Identifier, VNI (24 bits): Designates the individual VXLAN network on which VMs communicate. VMs in different VXLAN networks cannot communicate with each other.
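A short Python sketch of the 8-byte header layout described above (flags byte with the I bit set, reserved bytes zeroed, 24-bit VNI); the function name is invented for illustration:

```python
import struct

def vxlan_header(vni):
    """Build the 8-byte VXLAN header: flags, reserved bits, 24-bit VNI."""
    assert 0 <= vni < 2**24        # 24-bit VNI: over 16 million unique IDs
    flags = 0x08                   # I flag = 1: the VNI field is valid
    # Bytes 0-3: flags + 24 reserved bits; bytes 4-7: VNI + 8 reserved bits
    return struct.pack("!BBBB", flags, 0, 0, 0) + struct.pack("!I", vni << 8)
```

Unpacking bytes 4-6 of the result recovers the VNI, which is how a receiving VTEP maps the frame back to its Bridge Domain.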
In addition to the IP header and the VXLAN header, the VTEP also inserts a UDP header. A switch/router performing ECMP includes this UDP header in its hash function. The VTEP calculates the source port by hashing the inner Ethernet frame's header, while the destination UDP port is the VXLAN port.
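One way to picture the source-port calculation (an illustrative sketch, not the exact algorithm any vendor uses; the function name, the CRC32 hash, and the 49152-65535 ephemeral range are assumptions here):

```python
import zlib

def vxlan_source_port(inner_eth_header):
    """Derive the outer UDP source port from a hash of the inner Ethernet
    header, so ECMP devices spread distinct inner flows across paths."""
    h = zlib.crc32(inner_eth_header)
    return 49152 + (h % 16384)     # map hash into an ephemeral-port range
```

Because the port is a deterministic function of the inner header, all packets of one flow take the same path (preserving ordering), while different flows hash to different ports and can use different equal-cost paths.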
The outer IP header contains the Source IP address of the VTEP performing the encapsulation. The destination IP address is the remote VTEP IP address or the IP Multicast group address. VXLAN is sometimes also referred to as MAC-in-IP encapsulation technology.
VXLAN adds an overhead of 50 bytes. To avoid fragmentation and reassembly, all physical network devices transporting VXLAN traffic must accommodate this overhead. Therefore, it is recommended that the MTU be adjusted accordingly.
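The 50-byte figure follows directly from the headers described above, as this arithmetic check shows (assuming an untagged outer Ethernet header and IPv4):

```python
# Per-packet encapsulation overhead added by VXLAN:
OUTER_MAC = 14   # outer Ethernet header (no 802.1Q tag)
OUTER_IP  = 20   # outer IPv4 header
OUTER_UDP = 8    # outer UDP header
VXLAN_HDR = 8    # VXLAN header (flags + VNI + reserved)

overhead = OUTER_MAC + OUTER_IP + OUTER_UDP + VXLAN_HDR   # 50 bytes
required_mtu = 1500 + overhead   # MTU needed to carry a standard 1500-byte frame
```

So physical devices carrying VXLAN traffic for standard 1500-byte inner frames need an MTU of at least 1550 bytes (more if the outer header is 802.1Q-tagged or IPv6 is used).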
Deploying VXLAN in Cisco Nexus 1000v
The prerequisites for running VXLAN in Cisco Nexus 1000v with VMware vSphere are:
- Cisco Nexus 1000v release 4.2(1)SV1(5.1a)
- vCenter version 4.1 or 5.0
- ESX version 4.0 or 5.0
The Virtual Ethernet Module (VEM) encapsulates the original Ethernet frame from the VM. Each VEM is assigned an IP address, which is used as the source IP address of the outer IP header when encapsulating the original frame in VXLAN. This is accomplished by creating virtual network adapters called VMKNICs (VMkernel NICs).
There are 6 steps to deploy VXLAN in the above network topology.
Step 1 : Turn on VXLAN feature in Cisco Nexus 1000v VSM
To verify that the feature is enabled, use the show feature command.
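As a sketch, on the Nexus 1000v VSM the VXLAN feature is enabled via the segmentation feature (verify the exact command against your release's documentation):

```
n1kv(config)# feature segmentation
n1kv(config)# exit
n1kv# show feature | include segmentation
```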
Step 2 : Create Port-Profile with Capability VXLAN
port-profile type vethernet N1KV-VXLAN-PP
    vmware port-group
    switchport mode access
    switchport access vlan 10
    capability vxlan
    no shutdown
    state enabled
    end
VLAN 10 in the above configuration carries all VXLAN-encapsulated traffic from the VMKNIC to the upstream switch. The VXLAN capability can be verified using the show port-profile command.
Step 3 : Assign the VMkernel interface to each ESX host
From the vSphere Client, create a new VMkernel interface and assign it the port-group created in Step 2. Also assign the VMkernel interface an IP address from the subnet to which VLAN 10 belongs. Each ESX host is assigned a different IP address from that subnet.
Once the VMkernel interfaces are successfully created on both ESX hosts, the VSM shows 2 vEthernet interfaces assigned to the port-profile. This can be verified using the show port-profile command.
Step 4 : Adjust MTU on Uplink ports and Uplink switches

To accommodate the 50-byte VXLAN overhead, set the MTU to 1550 on the uplink port-profile:

port-profile type ethernet Uplink-Port-Group
    mtu 1550

And on the Uplink switches:

interface port-channel 10
    mtu 1550
Step 5 : Enable IGMP Snooping and IGMP Snooping Querier
On most Cisco switches, IGMP Snooping is enabled by default. When the IGMP Snooping querier is enabled, it sends out periodic IGMP queries that trigger IGMP Report messages from the VTEPs (VMkernel interfaces) that want to receive IP multicast traffic. IGMP Snooping listens to these messages to establish the appropriate forwarding.
interface vlan 10
ip igmp snooping querier 10.10.10.10
Step 6 : Create VXLAN Bridge Domain in VSM

bridge-domain VXLAN-5000-BD
    segment id 5000
    group 225.1.1.1

The segment id is the VXLAN ID, and the group is the IP multicast group assigned to that VXLAN ID (225.1.1.1 here is an example address). The state of the bridge-domain can be verified using the show bridge-domain command.
port-profile type vethernet BD5000-PP
    switchport mode access
    switchport access bridge-domain VXLAN-5000-BD
The bridge-domain is assigned to a port-profile, which is available as a port-group in VMware vCenter. This port-group can now be assigned to VMs.
In the VSM, all the VMs connected to the port-profile can be verified using the show port-profile command.
Thus, VXLAN is a great technology for extending L2 networks without the headaches of L2 loops. However, it requires a multicast-enabled IP network and MTU adjustments on physical devices. Also, the lack of a control plane means MAC addresses are learned dynamically in the data plane, which could cause scalability problems. Nicira's NVP solution is also L2-over-IP, but with the control plane located in a controller; it uses the OpenFlow protocol for communication between Open vSwitch and the controller.