First of all, this blog is based on my own opinion, not the view of any company or anyone else. That's why I call it a "network architect perspective". Therefore, this blog does not aim to promote one vendor's solution over another, but rather to discuss how and why Cisco ACI can be more reliable, using a "Network-based" architectural analysis.
I'd also like to thank Juan Lage for taking the time out of his busy schedule to review this blog.
Before we go into the analysis, let's first define what "reliable" or "reliability" means. A system can be considered reliable when it functions without failure over a certain period of time.
One of the key measures of a system's reliability is the level of its operational quality within a given period of time.
The term operational quality here refers to scenarios where a system might be technically up, yet is not performing its functions at the minimum required or expected level. In other words, the system is not delivering the intended service or performing a function reliably even though it is technically up.
It is important to note that system reliability is one of the primary contributing elements to achieving the ultimate level of system or network availability.
One of the key factors influencing the reliability of a system or technology solution is complexity, mainly operational complexity, which in turn is made up of different sub-factors such as control plane complexity.
Note: there are other aspects that impact the level of reliability, such as MTTR and time to service/upgrade.
Reducing or avoiding operational complexity is key because, no matter how nice and fancy a design you propose and build for your customer, it has no value until it is proven in operation.
For more insight into this topic, refer to my previous blog "Why is OPEX one of the key aspects to consider in your design?"
Now it's time to review the high-level architectural concept of the ACI and NSX forwarding planes. Both solutions employ MAC-in-IP (VxLAN) encapsulation to forward traffic within the data center network, but using different deployment models: network-based (ACI) and host-based (NSX), as discussed below.
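As a quick refresher on what both solutions have in common, the MAC-in-IP encapsulation is simply the original Ethernet frame wrapped in a VXLAN header, then in UDP/IP between the two VTEPs. A minimal sketch of the 8-byte VXLAN header (per RFC 7348; the VNI value below is just an example):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header (RFC 7348).

    Byte 0 carries the flags (0x08 = VNI-valid 'I' bit); bytes 4-6
    carry the 24-bit VXLAN Network Identifier; the rest is reserved.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags_reserved = 0x08 << 24   # 'I' flag set, reserved bits zero
    vni_reserved = vni << 8       # 24-bit VNI followed by 8 reserved bits
    return struct.pack("!II", flags_reserved, vni_reserved)

# The original Ethernet frame is appended after this header, and the
# whole payload travels inside a UDP/IP packet between the two VTEPs:
print(vxlan_header(10042).hex())  # 0800000000273a00
```

Where the two solutions differ is in *where* this encapsulation happens: in the leaf switches (ACI) or in the hypervisor vSwitch (NSX).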
In building construction, the foundation carries the load of the structure on top of it, in addition to any anticipated future loads, such as people and furniture. Everything goes on top of the foundation. In other words, a solid and reliable foundation forms the basis of everything that comes after.
Similarly, in network architectures, the foundation of the network (the underlying network infrastructure) is where the entire traffic load is handled, and it can be a critical influencer in the adoption of certain designs or goals. For instance, the core network nodes in the figure below are interconnected over a physical fiber infrastructure in a ring topology. On top of this physical ring, there is a full mesh of VxLAN overlay peering sessions between the edge nodes.
From the control plane and logical point of view of the edge nodes with the VTEPs, these nodes are seen as directly interconnected. However, the actual traffic load and forwarding is handled by the physical core ring network, which means that if any core router in the path has a congested link, it can affect all the edge nodes communicating over that path.
First conclusion: building an overlay network in isolation from the underlay network will most probably lead to an unreliable solution (this is also referred to as "communication islands"). As a network architect you must always look at the big picture, to see how the different components integrate, interact and communicate.
In any data center environment, to reduce operational complexity you should aim to build a homogeneous data center network, which in turn will optimize the overall reliability of the DC network.
Here I am only referring to the network infrastructure, because it acts as the heart of the entire DC infrastructure. The more vendors you have, the more complex it will be (more integrations, and most probably different SLAs); thus, when you try to troubleshoot an issue you need people with different skill sets, and you have to engage different vendors who typically don't work together. Remember, things will always work smoothly and nicely when you do a demo or a lab; you won't feel the aforementioned complexities until you are in a production environment.
With NSX, the VxLAN VTEP resides at the host/hypervisor level (host based), and the VxLAN tunnels are built among the hosts. In this case, the network is seen as a black box that only forwards packets, with zero visibility into what's going on across the underlay network. In other words, this model does not address any issues related to improving the behavior of the physical infrastructure, yet you must still do so one way or another to build a reliable network.
Typically, hosts connect to the network over at least dual links.
Based on this, there can be at least two different deployment scenarios:
Since a VTEP has a single IP and all traffic will be sourced from and destined to this IP, NSX optimizes forwarding using the concept of multiple VTEPs, where two or more VTEP kernel interfaces can be added to an NSX vSwitch to provide a 1:1 mapping with the physical uplinks of the vSwitch. In this scenario, the NSX vSwitch selects an uplink based on the virtual machine's port ID or MAC address. As you can see in the figure below, a single link can easily get congested, or quality can be impacted for VMs sharing a link with another VM that is sending a large amount of traffic (elephant flows). This behavior may also be replicated when the flows reach the physical network, because NSX has no control over, or visibility into, how the underlay will forward the packets/flows.
For instance, all the traffic between any two hosts may share a single link across the entire underlay DC fabric, as the network will see the traffic between two VTEPs as a single flow (one source and destination IP pair).
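This flow-collapse effect can be illustrated with a toy ECMP hash (the hash function and link count here are hypothetical, but any deterministic 5-tuple hash behaves the same way):

```python
import hashlib

def ecmp_link(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
              proto: int, n_links: int) -> int:
    """Toy 5-tuple hash: the index of the uplink an ECMP switch picks."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n_links

# 100 different VM-to-VM conversations between the same two hosts all
# share one outer 5-tuple (the VTEP IPs plus fixed UDP ports), so the
# fabric hashes every one of them onto the same uplink:
links = {ecmp_link("10.0.0.1", "10.0.0.2", 4789, 4789, 17, 4)
         for _vm_flow in range(100)}
print(links)  # a single uplink index out of the 4 available
```

However many inner VM flows exist, the underlay only ever sees one flow, so one uplink carries everything.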
This means that even if you try to optimize or balance your traffic at the overlay, you still need to fix it at your underlay; otherwise there will be congestion, packet loss, delay, jitter, etc., and depending on your organization this might be done by the same team or by different teams.
What if you have 4 physical ports per server, and 100 servers in your network? My questions for this simple scenario are: does this mean you will have 400 VTEPs "only for the hosts to communicate"? Isn't that control plane complexity? How big will the mesh between these hosts be? And how complex will this approach be when you have a forwarding issue and need to track it over two isolated networks (the overlay and the underlay)?
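A back-of-the-envelope calculation of the mesh size implied by that scenario (an upper bound, counting every possible VTEP pair as a tunnel endpoint pair):

```python
def full_mesh_tunnels(n_vteps: int) -> int:
    """Number of point-to-point tunnels in a full mesh of n endpoints:
    n * (n - 1) / 2."""
    return n_vteps * (n_vteps - 1) // 2

hosts, vteps_per_host = 100, 4        # the scenario from the text
n_vteps = hosts * vteps_per_host      # 400 VTEP interfaces
print(full_mesh_tunnels(n_vteps))     # 79800 potential tunnels
```

Even a modest 100-host pod turns into tens of thousands of potential tunnel relationships the control plane has to account for.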
This mode overcomes some (but not all) of the issues of the previous scenario; with this mode you can have a single VxLAN VTEP per host. However, if the LACP port channel hashing is based on IP, the overlay will suffer from the same issues described above.
If the hashing is configured based on the L4 source port (VxLAN creates a pseudo-random UDP source port value based on the Layer 2, 3 or 4 headers of the original frame), this will offer better load distribution, per source VM flow, across the member links of the LACP port channel between the host and the access/leaf switch. Nevertheless, this does not guarantee end-to-end forwarding behavior, simply because the overlay works independently of the underlay network. For instance, if the underlay network is not aligned to do its hashing based on the L4 source port as well, then a single link will be selected by the underlay switch(es) to forward all the traffic between any VTEP pair (per source and destination VTEP).
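The entropy mechanism described above (deriving the outer UDP source port from a hash of the inner headers, as RFC 7348 suggests) can be sketched as follows; the hash function and the example inner flows are illustrative, not NSX's actual implementation:

```python
import hashlib

def vxlan_src_port(inner_5tuple: tuple) -> int:
    """Derive the outer UDP source port from a hash of the inner flow,
    keeping it in the dynamic port range 49152-65535 per RFC 7348."""
    h = int.from_bytes(
        hashlib.sha256(repr(inner_5tuple).encode()).digest()[:4], "big")
    return 49152 + (h % 16384)

# Two different VM flows between the same two hosts now (with high
# probability) carry different outer source ports, giving any L4-aware
# hash in the path something to distribute on:
p1 = vxlan_src_port(("10.1.1.10", "10.1.2.20", 33000, 80, 6))
p2 = vxlan_src_port(("10.1.1.11", "10.1.2.21", 44000, 443, 6))
print(p1, p2)
```

The catch, as the text notes, is that this entropy only helps if every hop in the underlay actually hashes on the L4 source port.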
In addition, if the access/leaf switches are configured to communicate over L3 (a routed leaf-and-spine architecture) with per-flow ECMP load balancing enabled, then all the traffic between any VTEP pair will use a single path at any given time, even though you have multiple uplinks. This ultimately may lead to inefficient utilization of the available bandwidth and unexpected capacity issues!
In contrast, the ACI fabric provides several load balancing options for balancing traffic among the available uplinks. Static hash load balancing is the traditional mechanism used in networks, where each flow is allocated to an uplink based on a hash of its 5-tuple. This gives a roughly even distribution of flows across the available links. Usually, with a large number of flows, the even distribution of flows results in an even distribution of bandwidth as well. However, if a few flows are much larger than the rest, static load balancing might give suboptimal results.
Dynamic load balancing (DLB) adjusts the traffic allocations according to congestion levels. It measures the congestion across the available paths and places the flows on the least congested paths, which results in an optimal or near optimal placement of the data.
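The core idea behind DLB can be shown with a minimal sketch (the utilization figures are hypothetical, and real DLB operates on hardware congestion feedback, not a static table):

```python
def pick_path(congestion: dict) -> str:
    """Place a new flow on the least congested of the available paths."""
    return min(congestion, key=congestion.get)

# Static hashing could land a new elephant flow on any uplink, including
# the busiest one; congestion-aware placement steers it to the least
# loaded path instead:
uplinks = {"spine1": 0.72, "spine2": 0.15, "spine3": 0.40}  # utilization
print(pick_path(uplinks))  # spine2
```

The difference from static hashing is simply that the placement decision consumes live congestion state rather than only the packet headers.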
The ACI fabric adjusts traffic when the number of available links changes due to a link going off-line or coming on-line. The fabric redistributes the traffic across the new set of links.
Dynamic Packet Prioritization (DPP), while not a load balancing technology, uses some of the same mechanisms as DLB in the switch. DPP configuration is exclusive of DLB. DPP prioritizes short flows higher than long flows; a short flow is less than approximately 15 packets. Because short flows are more sensitive to latency than long ones, DPP can improve overall application performance.
Although we are moving into a DC world that is based on multi-tenancy and end-to-end virtualization, we still need to be realistic, because you will need to:
So how do the network-based and host-based overlays handle such communications?
In a network-based overlay design, typically each leaf/ToR switch is an L2 VxLAN gateway (VxLAN <to> VLAN). In addition, in ACI each leaf/ToR acts as an L3 gateway (distributed anycast gateway), which offers optimal traffic routing and forwarding.
On the other hand, with a host-based overlay design (e.g. NSX), the host (vSwitch) is the L2 VxLAN gateway (VxLAN <to> VLAN) as well as the L3 gateway within the overlay/NSX domain.
What if you need to extend an L2 domain to an external or existing L2 domain using VLANs, or you need to communicate with a physical node such as a firewall, or with an external network such as a physical edge router? How would you achieve this with the host-based overlay/VxLAN?
The answer is by using one or more dedicated hosts to perform the gateway functions toward the external or physical networks; in NSX this function/node is referred to as the Edge Services Gateway (ESG).
The concerns about this model in general are:
Note: as mentioned earlier, the focus of this blog is only on the communication model with regard to "reliability". For instance, this blog will not discuss the limitation of the ESG where, for some reason, you can have either ECMP or stateful inspection capability but NOT both!
For example, let's assume you need to extend the DC tenants to an MPLS L3VPN backbone from the DC LAN. How would you do this with the ESG? Typically you will configure a sub-interface/VLAN per tenant on the LAN/DC-facing side of the ESG, to map it to the corresponding VRF at the DC PE node, because you cannot extend MP-BGP into the NSX domain.
If you have 50 tenants, you will need to build 50 sub-interfaces and 50 routing peering sessions between the ESG and the edge/PE node, which means that any time you add or remove a tenant you need to do this manually.
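To make the per-tenant workload concrete, here is a hypothetical sketch of the repetitive configuration this implies; the interface name, VLAN numbering scheme, and stanza syntax are generic illustrations, not any vendor's exact CLI:

```python
def tenant_stanza(tenant_id: int, parent_if: str = "Gig0/0") -> str:
    """Sketch of the per-tenant config the text describes: one VLAN
    sub-interface plus one routing peering per tenant (generic syntax,
    not a real vendor CLI)."""
    vlan = 100 + tenant_id
    return (f"interface {parent_if}.{vlan}\n"
            f"  encapsulation dot1q {vlan}\n"
            f"  vrf tenant-{tenant_id}\n"
            f"  ! plus a routing peering to the PE for this VRF\n")

# 50 tenants -> 50 sub-interfaces and 50 peerings to keep in sync by
# hand on both the ESG and the PE, every add/remove:
configs = [tenant_stanza(t) for t in range(1, 51)]
print(len(configs))  # 50
```

The point is not the syntax but the touch points: every tenant change means coordinated manual edits on two devices and two teams.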
With the ACI "GOLF" design approach, on the other hand, you may extend the iVxLAN domain to the edge/PE node, where you can carry the corresponding tenant routes over BGP EVPN and then map them to the relevant VRF at the edge/PE node; the automation of configuration within the Cisco ACI fabric helps ensure end-to-end connectivity between the MPLS core and the fabric resources. ACI also uses OpFlex to automate fabric-facing tenant provisioning on the DCI and WAN edge devices.
In other words, the gateway to the external or physical world in a host-based overlay, such as the ESG, can add operational complexity and introduce a bottleneck to the network, which collectively may lead to degraded operational quality.
To sum it up: