Google Cloud Kubernetes Networking – Deep Dive

A fully functional node has little value without the ability to connect to the hosted applications and to interconnect application tiers with each other. That is why networking is crucial when designing and building containerized applications on Kubernetes clusters. To better understand how communication and networking work, this blog will briefly cover how networking happens within a Pod and a node, then look into GKE cluster-wide communication and connectivity (node to node, as well as to external networks and users).

As highlighted in the previous part of this blog series, “POD Design Considerations“, a Pod can technically host one or more containers on a worker node, and these containers share the Pod’s resources, including its networking stack.

Within each Pod, Kubernetes runs a special container, sometimes referred to as the Pod infrastructure container, whose main job is to provide the networking interface for the other containers in the Pod. This container is also referred to as the “pause container” because it sits in a paused state. Nevertheless, it is key: it holds the network namespace (netns) that provides a shared virtual networking interface to the containers in the same Pod, enabling container-to-container communication as well as connectivity outside the Pod.

With the ‘IP per Pod’ approach, communication within the host avoids any IP or port conflicts (no need to track which host port is mapped to which application).

The next question is: how can Pods communicate within the same worker node?

As we know, a Kubernetes worker node in GCP runs Linux (Container-Optimized OS or Ubuntu) on a GCE VM. This VM has its own root network namespace (root netns, eth0) used to connect the VM to the rest of the GCP network in a project.

In a similar fashion, the Pod has its own netns with an eth0 interface; from the Pod’s point of view, it has its own root netns, but in reality this is provided through the underlying host (worker node). Each Pod’s eth0 in a worker node VM is connected to a virtual interface (vethxx) in the worker node’s root netns. As illustrated below, these node-side interfaces are attached to a custom bridge (cbr), and the bridged Pods’ private IP range within a node is on a completely different network than the node’s eth0 IP. Does this indicate there might be a communication implication here? We will find out later in this post.

Note: 1:1 NAT is deployed at the GCP project network level for the VM’s public IP reachability.

So far so good: Pods are connected and able to communicate within a node. However, there is little value in a Pod, or set of Pods, whose containers communicate only among themselves, because this does not give us a complete solution. In reality, the ultimate goal of a Kubernetes cluster is to run containers anywhere in the cluster and to have the different application tiers connect to each other, whether they are hosted on the same node or on different nodes. This takes us to the next step in the Kubernetes networking topic: node-to-node communication.

If we zoom out and look at the life of a packet between two Pods on different nodes, it should look like the flow illustrated in the figure below.

Note: in order for packets to be forwarded and routed between containers/Pods hosted on different GCE VMs, GKE disables anti-spoofing on these VMs so they accept traffic sourced from IPs other than the VM’s own IP. GKE also sets up static routes in the GCP network that point each node’s allocated Pod IP range at the corresponding GCE VM for traffic routing.

Again, so far so good: we have containers residing in different Pods that can communicate with each other, whether within the same worker node or across different nodes.

Still, this is not the ultimate goal, because a Kubernetes cluster is a living, dynamic system where Pods can be torn down and brought up, manually or automatically, very frequently due to various events such as scale-up and scale-down events, Pod crashes, rolling updates, worker node restarts, image updates, and so on.

The main issue with Pod IP communication here is the ephemeral nature of a Pod: a Pod’s IP is not static and can change as a result of any of the aforementioned events. This is a real problem for Pod-to-Pod communication as well as for communication with outside networks or users (especially when there are tens or hundreds of Pods in a cluster).

For internal (backend) Pod-to-Pod communication, Kubernetes addresses this with an object known as a Service, which acts as a service abstraction that automatically maps a virtual IP (VIP) to a set/group of Pods. The selection of which Pods belong to which Service/VIP is driven by metadata called labels. Labels are arbitrary key:value pairs acting as tags that can be assigned to Pods. At the Service configuration level, a selector specifies which Pods should be part of the Service; this loose coupling offers a more flexible and dynamic Pod-to-Service/VIP association. As shown in the figure below, the system admin defines the YAML for the Service, mainly with a ‘Name’ and a ‘Selector’ specifying the labels, and Kubernetes automatically creates a distributed load balancer for this Service with a VIP.
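As a minimal sketch of such a Service definition (the Service name, label, and port values here are hypothetical, not from a specific deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend-svc        # hypothetical Service name
spec:
  selector:
    app: backend           # any Pod labeled app=backend joins this Service
  ports:
    - protocol: TCP
      port: 80             # VIP (ClusterIP) port exposed to clients
      targetPort: 8080     # container port traffic is forwarded to
```

Kubernetes allocates a ClusterIP (the VIP) for this Service and keeps the set of backend Pods up to date as Pods carrying the matching label come and go.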

For a Pod that runs more than one container, these containers typically need to listen on different ports. This can be achieved either with one unified Service abstraction with a single VIP, exposing multiple ports at the Service configuration level, or with one Service per backend service/container (a 1:1 Service:port mapping).
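The first option, a single Service exposing multiple named ports, might be sketched as follows (names and port numbers are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc            # hypothetical Service for a multi-container Pod
spec:
  selector:
    app: web
  ports:
    - name: http           # multi-port Services must name each port
      port: 80
      targetPort: 8080
    - name: metrics
      port: 9090
      targetPort: 9090
```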

Kube-proxy is a component that runs on every worker node (in GKE, as a Pod) and maintains the Linux iptables rules, so all packet processing happens at the kernel level (no user-space proxying, and kube-proxy itself is not in the data path); that is why functions like NAT carry no notable performance implications. As shown in the figure below, iptables keeps track of every NAT session using what is known as conntrack (the Linux kernel connection-tracking table, which holds the address translations).

With Kubernetes Services concept, multitier and micro-services applications architecture can be achieved easily as illustrated below.

Although Kubernetes Services are referred to in this blog as a ‘service abstraction’, a Service is not a single central entity; instead, you can think of it as a way of unifying a service’s reachability information (IP, port, DNS, etc.) while the actual implementation remains distributed, with the aid of kube-proxy/iptables on each node for each Service.

Note: for simplicity, to avoid remembering and hardcoding the required Service VIP, Kubernetes Service DNS can be used, so clients reference the required Service by a stable DNS name instead.
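Cluster DNS gives each Service a name of the form `<service>.<namespace>.svc.cluster.local`. A hedged sketch of a client Pod using such a name (assuming a hypothetical Service called backend-svc in the default namespace):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-client         # hypothetical test Pod
spec:
  containers:
    - name: client
      image: busybox
      # resolved via cluster DNS: <service>.<namespace>.svc.cluster.local
      command: ["wget", "-qO-", "http://backend-svc.default.svc.cluster.local"]
```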

What was covered up until now is ‘inbound and outbound’ communication entirely within the cluster. However, external connectivity is essential, and there are two types of connectivity models:

  • Pod-level connectivity to the outside world
  • Service-level connectivity to the outside world

In order for a Pod to connect to the outside world over the Internet, it typically needs a public IP. We know the worker node hosting the Pod (a GCE VM) can have a 1:1 NATed public IP, which the Pod can use. To expose a Pod through the node’s NATed public IP, you can use a Service of type NodePort: “NodePort: Exposes the service on each Node’s IP at a static port (the NodePort).”
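A NodePort Service might be sketched like this (names and the 30080 port are illustrative; if nodePort is omitted, Kubernetes picks one from its default 30000–32767 range):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-nodeport       # hypothetical name
spec:
  type: NodePort
  selector:
    app: web
  ports:
    - port: 80             # ClusterIP port inside the cluster
      targetPort: 8080     # container port
      nodePort: 30080      # exposed on every node's IP
```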

Connecting certain Pods to the outside world is important, for example for OS/app upgrades over the Internet. From an external, ingress-traffic point of view, however, no one can offer or sell a service at scale with this connectivity model alone.

To overcome this, Kubernetes offers the ability to expose a Kubernetes Service to external traffic over Google’s load balancing services. When the system admin specifies the Service type as ‘LoadBalancer’, Kubernetes API calls automatically create a Google network load balancer with an external public IP, which “Exposes the service externally using a cloud provider’s load balancer. NodePort and ClusterIP services, to which the external load balancer will route, are automatically created.”
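Changing the Service type is all that is required; a minimal sketch (names and ports are hypothetical):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-lb             # hypothetical name
spec:
  type: LoadBalancer       # GKE provisions a GCP network load balancer
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
# once provisioned, `kubectl get service web-lb` shows the EXTERNAL-IP
```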

Technically, the API creates LB forwarding rules pointing at the VMs that are part of the Service. Since this load balancer type is a network load balancer, it acts as a packet forwarder without changing the source (original client) IP, as there is no proxying or session termination at the LB level.

One of the key design considerations here is that the LB is only node (VM) aware, while from a containerized application architecture point of view, the VM:Pod ratio is almost never 1:1. Therefore, this may introduce an imbalanced load distribution.

Consequently, as illustrated in the figure above, if traffic is evenly distributed between the two available nodes (50:50) hosting Pods that are part of the targeted Service, the single Pod on the left node will handle 50% of the traffic, while each Pod hosted on the right node will receive about 25%.

iptables helps here by distributing the traffic so that all Pods that are part of the specific Service, across all nodes, are considered.

As shown in the figure below, because the backend selection (iptables) can randomly pick a Pod that resides on a different node, there will be an extra network hop for the incoming and return traffic. As a result, this creates what is commonly known as a “traffic trombone”.

You may also have noticed that both source and destination NAT are performed. The destination NAT sends the traffic to the selected Pod, while the source NAT ensures the return traffic comes back through the node originally selected by the LB, so that node can translate the source back to the LB IP before sending the traffic to the client; otherwise there would be a session mismatch (the original client request was addressed to the LB IP).
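As an aside, Kubernetes also offers a Service setting that avoids this extra hop and the source NAT, at the cost of the LB only usefully sending traffic to nodes that actually host a backend Pod; a sketch with hypothetical names:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-lb-local       # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # keep traffic on the receiving node; preserves client IP
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```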

Practically, this imbalance issue may not always be a big problem if the VM:Pod ratio is well balanced and the added latency is not an issue.

Using the GCP network load balancer as part of your K8s cluster is an effective and simple method to achieve reliable load balancing for your TCP and UDP workloads without complex configuration. However, keep in mind that the scope of the GCP network load balancer is regional; therefore, it can only balance traffic to Pods running within the same region.

Last but certainly not least, the L7 HTTP(S) Google global load balancer, which offers more advanced and granular forwarding rules (application/URL based) at a global scale, is the recommended load balancing model with GKE.

The good news with HTTP(S) Google global load balancing is that GCP recently announced (at Next ’18) a new capability in GCP load balancing with Kubernetes: ‘container-native load balancing’ using network endpoint groups (NEGs), in which the LB becomes container/Pod aware. This means the LB targets Pods directly (or containers, in the case of a 1:1 Pod-to-container mapping) rather than being limited to the node/VM. This native container support offers more efficient load distribution as well as more accurate health checks (container-level visibility), without the need for multiple layers of NAT. From the external clients’ point of view, this provides a better user experience due to the optimized data path: there is no extra node-level hop in between, which reduces the latency of forwarding packets across multiple hops.
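On GKE, container-native load balancing is typically enabled by annotating the Service so that NEGs are created for it (the Service name and ports below are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-neg            # hypothetical name
  annotations:
    cloud.google.com/neg: '{"ingress": true}'  # create NEGs for Ingress-based LB
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```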

For more details, refer to the following blog:

Google Cloud Global Load Balancer – Deep Dive

What if a canary deployment is required for certain application(s)? How can this be achieved in a controlled manner?

The concept of the canary deployment has become fairly popular in the last few years. The name comes from the “canary in the coal mine” concept: miners used to take a caged canary into the mines to detect whether any dangerous gases were present, because canaries are more susceptible to poisonous gases than humans. The canary would not only entertain the miners with its songs, but if at any point it collapsed off its perch, the miners knew to get out of the coal mine rapidly.

In the application deployment world, a canary deployment refers to the deployment of a new release or version of an application where developers or QA release the new version to only a subset of traffic; it could be users from a certain region, internal employees of the company, or users on a certain type of device or browser, such as Android users.

Following this canary deployment, the technical or QA teams can observe and monitor the behavior of the new release, and if all is good, it can be gradually scaled out to handle more traffic.

Although Kubernetes provides the ability to load balance traffic in a round-robin fashion across the Pods behind a Service, it is not easy (if doable at all) to distribute load by a specific percentage. For example, if you want only 20% of all client traffic to be redirected to the new containerized application release, should you deploy a 4:1 ratio of old Pods to new Pods? How scalable and manageable is that approach?

This is where Istio Service Mesh can help.

With Istio, Kubernetes admins can be much more precise when they need to route/redirect a certain percentage of traffic to certain Pods (e.g. a new application release). The admin can then use Istio to gradually increase the traffic redirected to the new release, achieving a full migration in a gradual, controlled manner.
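A hedged sketch of such a weighted split using Istio’s VirtualService (the host and subset names are hypothetical, and the DestinationRule that defines the subsets is omitted for brevity):

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-canary         # hypothetical name
spec:
  hosts:
    - web-svc              # hypothetical Service host
  http:
    - route:
        - destination:
            host: web-svc
            subset: v1     # current release
          weight: 80
        - destination:
            host: web-svc
            subset: v2     # canary release
          weight: 20       # 20% of traffic goes to the canary
```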

Note: Istio offers more functions and capabilities than just traffic routing, such as security, deep visibility into micro-services, monitoring, and logging.

For more details, refer to:

Marwan Al-shawi – CCDE No. 20130066, Google Cloud Certified Architect, AWS Certified Solutions Architect, Cisco Press author (author of the top Cisco certification design books “CCDE Study Guide” and the upcoming “CCDP Arch 4th Edition”). He is an experienced technical architect. Marwan has been in the networking industry for more than 12 years and has been involved in architecting, designing, and implementing various large-scale networks, some of which are global service provider-grade networks. Marwan holds a Master of Science degree in internetworking from the University of Technology, Sydney. Marwan enjoys helping and assisting others; therefore, he was selected as a Cisco Designated VIP by the Cisco Support Community (CSC) (official Cisco Systems forums) in 2012, and by the Solutions and Architectures subcommunity in 2014. In addition, Marwan was selected as a member of the Cisco Champions program in 2015 and 2016.