Why Cisco ACI Can Be More Reliable Than NSX? – A Network Architect Perspective

First of all, this blog reflects my own opinion, not that of a company or anyone else, which is why I call it a “network architect perspective”. This blog does not aim to promote one vendor’s solution over another; instead, it discusses how and why Cisco ACI can be more reliable, using a network-based architectural analysis.

Also, I’d like to thank Juan Lage for taking the time out of his busy schedule to review this blog.

Before we go into the analysis, let’s first define what “reliable” or “reliability” means. A system can be considered reliable when it functions without failure over a certain period of time.

One of the key measures of a system’s reliability is the level of its operational quality within a given period of time.

The term operational quality here refers to scenarios where a system might be technically up, yet is not performing its functions at the minimum required or expected level. This means that the system is not delivering the intended service or performing a function reliably, even if it is technically up.

It is important to note that system reliability is one of the primary contributing elements to achieving the ultimate level of system or network availability.

One of the key factors influencing the reliability of a system or technology solution is complexity, mainly operational complexity, which in turn is made up of different sub-factors such as control plane complexity.

Note: other aspects also affect the level of reliability, such as MTTR and time to service/upgrade.

Reducing or avoiding operational complexity is key because, no matter how nice and fancy a design you propose and build for your customer, it has no value until it is proven in operation.

For more insight into this topic, refer to my previous blog “Why is OPEX one of the key aspects to consider in your design?”

Now it’s time to review the high-level architectural concept of the ACI and NSX forwarding planes. Both solutions employ MAC-in-IP (VxLAN) encapsulation to forward traffic within the data center network, using different deployment models:

  • Network based: the VTEP is created at the access/leaf switch (used by ACI)
  • Host based: the VTEP is created at the host (used by NSX)


The Myth That the Overlay Has No Interaction With the Underlay

In building construction, the foundation carries the load of the structure on top of it, in addition to any future anticipated loads, such as people and furniture. Everything goes on top of the foundation. In other words, a solid and reliable foundation forms the basis of everything that comes after.

Similarly, in network architectures, the foundation of the network (the underlying network infrastructure) is where the entire traffic load is handled. It can be a critical influencer with regard to the adoption of certain designs or goals. For instance, the core network nodes in the figure below are interconnected over a physical fiber infrastructure in a ring topology. On top of this physical ring, there is a full mesh of VxLAN overlay peering sessions between the edge nodes.


From the control plane and logical point of view of the edge nodes hosting the VTEPs, these nodes are seen as directly interconnected. However, the actual traffic load and forwarding is handled by the physical core ring network, which means that if any core router in the path has a congested link, it can affect all the edge nodes communicating over this path.

First conclusion: building an overlay network in isolation from the underlay network will most probably lead to an unreliable solution (this is also referred to as “communications islands”). As a network architect, you must always look at the big picture to see how the different components integrate, interact and communicate.

In any data center environment, to reduce operational complexity, you should aim to build a homogeneous data center network, which in turn will optimize the overall DC network reliability.

Here I am only referring to the network infrastructure, because it acts as the heart of the entire DC infrastructure. The more vendors you have, the more complex it will be (more integrations and most probably different SLAs); thus, when you troubleshoot an issue, you need different people with different skill sets, and you need to engage different vendors who typically don’t work together. Remember, things will always work smoothly and nicely in a demo or a lab; you won’t feel the aforementioned complexities until you run it in a production environment.

What about forwarding efficiency and recovery time following a physical network failure event (node or link failure)?

With NSX, the VxLAN VTEPs reside at the host/hypervisor level (host based) and build the VxLAN tunnels among the hosts. In this case, the network is seen as a black box that only forwards packets, with zero visibility into what’s going on across the underlay network. In other words, NSX does not address any issues related to improving the behavior of the physical infrastructure, and you must address them anyway to build a reliable network.


Typically, hosts connect to the network over at least two links.

Based on this, there can be at least two different deployment scenarios:

Scenario-1: underlay switches do not support LACP, or are not configured with LACP for whatever reason:

Since a VTEP has a single IP and all traffic is sourced from and destined to this IP, NSX optimizes forwarding with the concept of multiple VTEPs, where two or more VTEP kernel interfaces can be added to an NSX vSwitch to provide a 1:1 mapping with the physical uplinks of the vSwitch. In this scenario, the NSX vSwitch selects an uplink based on the virtual machine port ID or MAC address. As you can see in the figure below, a single link can easily get congested, or its quality can be impacted, if one of the VMs sharing the link with other VMs is sending a large amount of traffic (elephant flows). This behavior may also be replicated when the flows reach the physical network, because NSX has no control over or visibility into how the underlay will forward the packets/flows.

For instance, all the traffic between any two hosts may share a single link across the entire underlay DC fabric, as the network will see the traffic between two VTEPs as a single flow (same source and destination IP).


This means that even if you try to optimize or balance your traffic at the overlay, you still need to fix it at the underlay; otherwise there will be congestion, packet loss, delay, jitter, etc. Depending on your organization, this might be done by the same team or by different teams.

What if you have 4 physical ports per server and 100 servers in your network? My questions about this simple scenario are: does this mean you will have 400 VTEPs only for the hosts to communicate? Isn’t that control plane complexity? How big will the mesh between these hosts be? How complex will this approach be when you have a forwarding issue and need to track it over two isolated networks (the overlay and the underlay)?
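To put rough numbers on these questions, here is a small back-of-the-envelope sketch. It assumes the worst case implied above: one VTEP per physical uplink and a full mesh of VTEP-to-VTEP relationships. Real deployments may build tunnels on demand, so treat the result as an upper bound rather than a measured value. The function name is hypothetical.

```python
def vtep_mesh_size(servers, uplinks_per_server):
    """Return (vtep_count, full_mesh_pairs), assuming one VTEP per uplink."""
    vteps = servers * uplinks_per_server
    # Number of unordered VTEP pairs in a full mesh: n * (n - 1) / 2
    pairs = vteps * (vteps - 1) // 2
    return vteps, pairs


multi_vtep = vtep_mesh_size(servers=100, uplinks_per_server=4)
single_vtep = vtep_mesh_size(servers=100, uplinks_per_server=1)

print("multi-VTEP  (VTEPs, pairs):", multi_vtep)   # (400, 79800)
print("single VTEP (VTEPs, pairs):", single_vtep)  # (100, 4950)
```

Under these assumptions, going from one VTEP per host to one VTEP per uplink multiplies the number of potential VTEP pairs by roughly sixteen, which is the kind of growth you end up troubleshooting across two separate networks.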

Scenario-2: hosts and upstream access/leaf switches are configured with link aggregation (LACP mode):

This mode overcomes some (not all) of the issues of the previous scenario, and it lets you have a single VxLAN VTEP per host. However, if the LACP PortChannel hashing is based on IP, the overlay will suffer from the same issues described above.

If the hashing is configured based on the L4 source port (VxLAN derives the outer UDP source port value from a hash of the layer 2, 3 or 4 headers present in the original frame), this will offer better load distribution, per source VM flow, across the member links of the LACP PortChannel between the host and the access/leaf switch. Nevertheless, this does not guarantee end-to-end forwarding behavior, simply because the overlay works independently of the underlay network. For instance, if the underlay network is not aligned to hash on the L4 source port, then a single link will be selected by the underlay switch(es) to forward the traffic between any VTEP pair (per source and destination VTEP).

In addition, if the access/leaf switches are configured to communicate over L3 (routed leaf and spine architecture) with per-flow ECMP load balancing that hashes only on the outer source and destination IP addresses, then all the traffic between any VTEP pair will use a single path at any given time, even though you have multiple uplinks. This may ultimately lead to inefficient utilization of the available bandwidth and unexpected capacity issues!
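The effect described in both scenarios can be illustrated with a toy model. This is only a sketch: CRC32 stands in for whatever hash a real switch uses, the VTEP and VM addresses are made up, and the helper names are hypothetical. The only real-world details assumed are the VxLAN UDP destination port (4789) and the RFC 7348 recommendation to derive the outer UDP source port from a hash of the inner headers within the 49152-65535 range.

```python
import zlib

UPLINKS = 4          # assumed number of equal-cost uplinks
VXLAN_PORT = 4789    # IANA-assigned VxLAN UDP destination port


def pick_link(fields, links=UPLINKS):
    """Toy hash-based link selection over the chosen header fields."""
    key = "|".join(str(f) for f in fields).encode()
    return zlib.crc32(key) % links


def outer_udp_sport(inner_flow):
    """Derive the outer UDP source port from the inner flow (entropy port)."""
    key = "|".join(str(f) for f in inner_flow).encode()
    return 49152 + zlib.crc32(key) % 16384


VTEP_A, VTEP_B = "10.0.0.1", "10.0.0.2"  # hypothetical VTEP addresses

# 200 inner VM-to-VM TCP flows, all carried between the same VTEP pair.
inner_flows = [
    ("192.168.1.%d" % (i % 50 + 1), "192.168.2.%d" % (i % 40 + 1), 6, 30000 + i, 443)
    for i in range(200)
]

# Underlay hashes on outer source/destination IP only.
ip_only = {pick_link((VTEP_A, VTEP_B)) for _ in inner_flows}

# Underlay hashes on the outer 5-tuple, including the entropy source port.
five_tuple = {
    pick_link((VTEP_A, VTEP_B, 17, outer_udp_sport(flow), VXLAN_PORT))
    for flow in inner_flows
}

print("links used with outer-IP hash:     ", len(ip_only))
print("links used with outer 5-tuple hash:", len(five_tuple))
```

With the IP-only hash, every inner flow lands on the same link no matter how well the overlay spreads its source ports. The entropy in the outer UDP source port only helps when the underlay is configured to look at it, which is exactly the alignment the overlay cannot enforce on its own.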


In contrast, the ACI fabric provides several load balancing options for balancing the traffic among the available uplinks. Static hash load balancing is the traditional load balancing mechanism used in networks where each flow is allocated to an uplink based on a hash of its 5-tuple. This load balancing gives a distribution of flows across the available links that is roughly even. Usually, with a large number of flows, the even distribution of flows results in an even distribution of bandwidth as well. However, if a few flows are much larger than the rest, static load balancing might give suboptimal results.
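A toy example shows why an even spread of flows is not the same as an even spread of bandwidth. The traffic mix below is invented for illustration (400 small flows of 1 Mbps and 2 elephant flows of 400 Mbps), and CRC32 is only a stand-in for a real switch hash.

```python
import zlib
from collections import Counter

LINKS = 4  # assumed number of uplinks


def static_hash(five_tuple, links=LINKS):
    """Toy static hash: map a 5-tuple to one of the uplinks."""
    key = "|".join(str(f) for f in five_tuple).encode()
    return zlib.crc32(key) % links


# (5-tuple, rate in Mbps): many mice plus two elephants (made-up values).
flows = [
    (("10.1.0.%d" % (i % 200 + 1), "10.2.0.%d" % (i % 100 + 1), 6, 20000 + i, 443), 1)
    for i in range(400)
]
flows += [
    (("10.1.1.1", "10.2.1.1", 6, 50000, 2049), 400),
    (("10.1.1.2", "10.2.1.2", 6, 50001, 2049), 400),
]

flow_count = Counter()
load_mbps = Counter()
for five_tuple, mbps in flows:
    link = static_hash(five_tuple)
    flow_count[link] += 1
    load_mbps[link] += mbps

for link in range(LINKS):
    print("link %d: %3d flows, %4d Mbps" % (link, flow_count[link], load_mbps[link]))
```

Whichever links the two elephants hash to, those links carry at least 400 Mbps, while at least two other links share the 400 Mbps of mice between them. The hash has no way to notice this because it never looks at flow size or link load.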

Dynamic load balancing (DLB) adjusts the traffic allocations according to congestion levels. It measures the congestion across the available paths and places the flows on the least congested paths, which results in an optimal or near optimal placement of the data.

The ACI fabric adjusts traffic when the number of available links changes due to a link going off-line or coming on-line. The fabric redistributes the traffic across the new set of links.
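The idea behind congestion-aware placement, and the redistribution after a link change, can be sketched as follows. This is a simplified greedy model, not ACI’s actual implementation: the link names and traffic mix are hypothetical, and “congestion” is reduced to the sum of the rates already placed on a link.

```python
def place_least_congested(flows_mbps, links):
    """Greedily place each flow on the currently least-loaded link."""
    load = {link: 0 for link in links}
    placement = {}
    for name, mbps in flows_mbps:
        target = min(load, key=load.get)  # least congested link right now
        placement[name] = target
        load[target] += mbps
    return placement, load


# Made-up traffic mix: 2 elephants of 300 Mbps and 600 mice of 1 Mbps.
flows = [("elephant-1", 300), ("elephant-2", 300)]
flows += [("mouse-%d" % i, 1) for i in range(600)]

four_links = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]
_, load_before = place_least_congested(flows, four_links)

# One uplink goes offline: redistribute over the remaining set of links.
three_links = four_links[:3]
_, load_after = place_least_congested(flows, three_links)

print("4 links:", load_before)
print("3 links:", load_after)
```

In this model the elephants end up on separate links and the mice fill the remaining capacity, so the load stays level both before and after the link failure. A static hash cannot do this because it never consults the current load.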

Dynamic Packet Prioritization (DPP), while not a load balancing technology, uses some of the same mechanisms as DLB in the switch. DPP configuration is exclusive of DLB. DPP prioritizes short flows higher than long flows; a short flow is less than approximately 15 packets. Because short flows are more sensitive to latency than long ones, DPP can improve overall application performance.
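As a rough illustration of the classification idea only, the sketch below counts packets per flow and treats a flow as “short” until it reaches the threshold. The class name and queue labels are hypothetical, the threshold of 15 is the approximate figure quoted above, and the real mechanism in the switch hardware is certainly more involved.

```python
from collections import defaultdict

SHORT_FLOW_PACKETS = 15  # approximate threshold quoted in the text


class ToyPrioritizer:
    """Give the first packets of every flow a higher priority."""

    def __init__(self, threshold=SHORT_FLOW_PACKETS):
        self.threshold = threshold
        self.seen = defaultdict(int)  # packets seen so far, per flow

    def classify(self, flow_id):
        self.seen[flow_id] += 1
        # While a flow is still below the threshold it is treated as short.
        return "high" if self.seen[flow_id] < self.threshold else "normal"


dpp = ToyPrioritizer()
short_flow = [dpp.classify("dns-lookup") for _ in range(10)]
long_flow = [dpp.classify("backup-job") for _ in range(100)]

print("short flow:", short_flow.count("high"), "high /", short_flow.count("normal"), "normal")
print("long flow: ", long_flow.count("high"), "high /", long_flow.count("normal"), "normal")
```

A flow that finishes within the threshold is prioritized for its whole life, while a long flow only gets the boost for its first few packets and then competes as normal traffic.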

Interaction and Communication With External Networks, Legacy Networks and Physical Nodes

Although we are moving into a DC world that is based on multi-tenancy and end-to-end virtualization, we still need to be realistic, because you will need to:

  • Communicate with physical network(s) at some point, such as firewalls and external internet/WAN edge routers (which could be managed by you or your SP)
  • Communicate with an existing legacy network or physical hosts

So how do the network-based and host-based overlays handle such communications?

In a network-based overlay design, each leaf/ToR switch is typically an L2 VxLAN gateway (VxLAN to VLAN). In addition, in ACI each leaf/ToR switch acts as an L3 gateway (a distributed anycast gateway), which offers optimal traffic routing and forwarding.


On the other hand, with a host-based overlay design (e.g. NSX), the host (vSwitch) is the L2 VxLAN gateway (VxLAN to VLAN) and the L3 gateway within the overlay/NSX domain.


What if you need to extend an L2 domain to an external or existing L2 domain using VLANs, or you need to communicate with a physical node such as a firewall, or with an external network such as a physical edge router? How would you achieve this with a host-based overlay/VxLAN?

The answer is by using dedicated host(s) to perform the gateway functions to the external or physical networks. In NSX, this function/node is referred to as the edge services gateway (ESG).

The concerns about this model in general are:

  • Scalability: the more bandwidth you need, the more hosts/ESGs you require. There is always a limit to this scale, whereas with a network-based overlay you don’t need to worry about such limits, licenses, etc., and you can easily get line-rate 10/40/100 Gbps.
  • Flexibility: in a typical multi-tenant environment, you need to configure the physical LAN switch/router facing the ESG with the sub-interfaces/VLANs to extend the tenants to the outside.

Note: as mentioned earlier, the focus of this blog is only on the communication model with regard to reliability. For instance, this blog will not discuss the limitation of the ESG where you can have either ECMP or stateful inspection capability, but NOT both.

For example, let’s assume you need to extend the DC tenants from the DC LAN to an MPLS L3VPN backbone. How would you do this with the ESG? Typically, you will configure a sub-interface/VLAN per tenant on the LAN/DC-facing side of the ESG and map it to the corresponding VRF at the DC PE node, because you cannot extend MP-BGP to the NSX domain.

If you have 50 tenants, you will need to build 50 sub-interfaces and 50 routing peering sessions between the ESG and the edge/PE node. This means that any time you add or remove a tenant, you need to do this manually.
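To make the linear growth concrete, here is a small sketch that enumerates what has to exist per tenant in this hand-off model. Everything in it is hypothetical (tenant names, the starting VLAN ID, the session labels, the function name); it is not ESG or PE configuration, just a count of the objects someone has to create and keep in sync.

```python
def per_tenant_handoff(tenants, first_vlan=100):
    """List the per-tenant objects needed between the gateway and the PE."""
    plan = []
    for offset, tenant in enumerate(tenants):
        plan.append({
            "tenant": tenant,
            "sub_interface_vlan": first_vlan + offset,   # one sub-interface/VLAN
            "peering_session": "gw<->pe:%s" % tenant,    # one routing session
            "pe_vrf": tenant,                            # mapped to one VRF on the PE
        })
    return plan


tenants = ["tenant-%02d" % i for i in range(1, 51)]
plan = per_tenant_handoff(tenants)

print("sub-interfaces: ", len({p["sub_interface_vlan"] for p in plan}))
print("peering sessions:", len({p["peering_session"] for p in plan}))
```

Every tenant added or removed changes this list on both sides of the hand-off, which is where the manual effort and the room for error come from.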

With the ACI “GOLF” design approach, in contrast, you can extend the iVxLAN domain to the edge/PE node, where you can carry the corresponding tenant routes over BGP EVPN and then map them to the relevant VRF at the edge/PE node. The automation of configuration with the Cisco ACI fabric helps ensure end-to-end connectivity between the MPLS core and the fabric resources. Also, ACI uses OpFlex to automate fabric-facing tenant provisioning on the DCI and WAN edge devices.


In other words, the gateway to the external or physical world in a host-based overlay, such as the ESG, can add operational complexity and introduce a bottleneck to the network, which collectively may lead to degraded operational quality.

To sum it up:

  • The idea that you can focus only on the overlay and its capabilities is a myth; it will not provide what you expect in terms of reliability.
  • Building two independent networks from different vendors for your core data center network (overlay and underlay) will probably introduce operational complexity.
  • You should aim to have a network that is capable of measuring end-to-end (leaf-to-leaf) network health, capacity and utilization in real or near real time, and that is able to take action accordingly.
  • Your gateway to external and legacy networks can be the bottleneck of your DC fabric if the solution cannot provide flexible and scalable connectivity.
  • With the ESG, your routing has more failure vectors: a failure (or maintenance) of the datastore where the ESG resides now impacts the network! Failure scenarios of the ESG include OS failure of the ESG, failure of the hypervisor, failure of the server and/or server links, failure of the storage supporting the datastore, etc. Recovery from these failures isn’t sub-second in any case. In contrast, with ACI you don’t need any special gateway to connect to internal or external endpoints, as you rely on well-known hardware routing at the border leafs with sub-second convergence.

Marwan Al-shawi – CCDE No. 20130066, Google Cloud Certified Architect, AWS Certified Solutions Architect, Cisco Press author (author of the top Cisco certification design books “CCDE Study Guide” and the upcoming “CCDP Arch 4th Edition”). He is an experienced technical architect. Marwan has been in the networking industry for more than 12 years and has been involved in architecting, designing, and implementing various large-scale networks, some of which are global service provider-grade networks. Marwan holds a Master of Science degree in internetworking from the University of Technology, Sydney. Marwan enjoys helping and assessing others; therefore, he was selected as a Cisco Designated VIP by the Cisco Support Community (CSC) (official Cisco Systems forums) in 2012, and by the Solutions and Architectures subcommunity in 2014. In addition, Marwan was selected as a member of the Cisco Champions program in 2015 and 2016.

