How DevOps is affecting NetOps: An Insider's View
DevOps principles are contagious. The benefits of collaboration, automation and open source, which lets the community scrutinise source code, are no doubt game changers. On top of that, the flexibility and resilience of the microservices architecture, which splits applications into smaller chunks of functional services, forced networks to adapt to new requirements and cope with a significant increase in intra-application traffic.
Because of these changes and the success of distributed applications, Network Engineers are increasingly adopting a DevOps-like mindset.
The server world has long used configuration management tools such as Ansible, Puppet and Chef, and these are now extending their reach to network equipment such as routers and switches. For example, Ansible ships with a list of modules dedicated to network automation.
We all know how easy it is to boot up anything in the cloud using Terraform, for example. Why not do something similar for switches and routers: power them up and have them automatically download their configuration and set themselves up? Wouldn't it be nice?
I've witnessed these changes throughout the years both within F5 and with some of our customers.
In this article, I'll point out a few changes that, in my view, have been fuelled by the advent of DevOps.
The networking world is adapting to cloud-native applications
The changes brought forward by DevOps principles are also changing the network world.
In data centres and the enterprise world, the old access-aggregation-core network design is not a good fit for the requirements of distributed applications.
The reason is simple: it was designed around the old client-server model, where most traffic flows between clients and servers rather than between servers.
In today's networks, there's a lot of intra-application traffic. In some cases, it's really A LOT.
I'm not just talking about web server <-> database traffic, for example.
I'm talking about the internal components of a single application that was once monolithic and is now distributed.
These applications usually run in containers (typically Docker) and are orchestrated by a tool such as Kubernetes.
Kubernetes not only abstracts the infrastructure away from applications but also constantly monitors their internal components, in a distributed manner too.
This leads us to the next question...
Why can't we run distributed applications on access-aggregation-core network?
It's not that we can't; we definitely can. However, as internal traffic increases significantly, the complexity becomes unbearable: there is no full mesh between layers, and we have to deal with Spanning Tree Protocol (STP), suboptimal topologies and possibly broadcast storms should the wrong link or device go down. We're now dealing with applications that may require intra-application authentication, authorisation and troubleshooting, with potentially lots of internal network communication. So reducing network complexity is of paramount importance. That includes building a more homogeneous and simple network that is easier to troubleshoot and in which equipment is easier to replace.
This leads us to the modern CLOS design...
The CLOS design
In order to adapt to new application and traffic requirements, we need a topology that is more uniform, simpler to troubleshoot and scale, with mostly homogeneous devices that are possibly cheaper, can be easily replaced, ideally not tied to a specific vendor, and are more adaptable to changes in application and bandwidth requirements.
The CLOS topology, or a variation of it, is now used across larger companies.
In such a topology, every leaf is connected to every spine, and spines connect leaves to one another. No spine connects to another spine, and no leaf connects to another leaf. Leaves connect to servers; for this reason they're also known as Top of Rack (ToR) switches, since they typically sit in the same rack as the servers they serve.
Both leaves and spines can be (and usually are) the same type of device to reduce complexity. Replacing hardware also becomes less operationally expensive: switches of the same type can be more easily stocked and, if they're not tied to any particular vendor, more easily sourced.
If we need more capacity, we can add more servers and leaves. If we need more bandwidth between leaves, we can just add more spines. Notice the similarity with adding/removing nodes and pods in the Kubernetes world. In the access-aggregation-core topology, we'd have to add more CPU or memory to the aggregation switches instead.
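To make the scale-out property concrete, here's a minimal Python sketch of a leaf-spine fabric (the device names are invented for illustration): links exist only between leaves and spines, and each extra spine adds one more equal-cost path between every pair of leaves.

```python
# Minimal leaf-spine fabric model: every leaf connects to every spine,
# and nothing else. Device names are illustrative, not a vendor API.
from itertools import product

def build_fabric(num_leaves, num_spines):
    """Return leaves, spines and the link set of a leaf-spine fabric."""
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    spines = [f"spine{j}" for j in range(num_spines)]
    links = set(product(leaves, spines))  # full mesh between the two tiers
    return leaves, spines, links

def leaf_to_leaf_paths(num_spines):
    # Each spine provides one equal-cost path between any pair of leaves.
    return num_spines

leaves, spines, links = build_fabric(num_leaves=4, num_spines=2)
print(len(links))             # 4 leaves x 2 spines = 8 links
print(leaf_to_leaf_paths(2))  # 2 equal-cost paths between any leaf pair

# Need more bandwidth between leaves? Just add a spine:
leaves, spines, links = build_fabric(num_leaves=4, num_spines=3)
print(leaf_to_leaf_paths(3))  # now 3 equal-cost paths
```

Scaling out by adding devices, rather than scaling up a big aggregation box, is exactly the pattern Kubernetes applies to nodes and pods.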
Multipath Routing instead of Spanning Tree
The CLOS topology can also support multiple spines because it uses IP routing with Equal-Cost Multipath (ECMP) in its favour: a leaf can reach any other leaf via any spine, and the cost is the same through every one of them.
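In practice, ECMP implementations typically hash a flow's 5-tuple and use the result to pick one of the equal-cost next hops, so packets of the same flow stay on the same spine while different flows spread across all of them. Here's an illustrative Python sketch (the hash choice and names are assumptions, not any specific vendor's implementation):

```python
# ECMP next-hop selection sketch: hash the flow 5-tuple, index into the
# list of equal-cost spines. Deterministic per flow, spread across flows.
import hashlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, spines):
    flow = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(flow).digest()
    index = int.from_bytes(digest[:4], "big") % len(spines)
    return spines[index]

spines = ["spine1", "spine2", "spine3", "spine4"]

# The same flow always hashes to the same spine (no packet reordering).
choice = ecmp_next_hop("10.0.0.1", "10.0.1.1", 49152, 443, "tcp", spines)
assert choice == ecmp_next_hop("10.0.0.1", "10.0.1.1", 49152, 443, "tcp", spines)
```

Because every path has equal cost, no link has to be blocked the way STP blocks redundant links in the access-aggregation-core design.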
Embedded Onboarding/Bootstrap and Automation
The idea is that, eventually, all switches will support an API for automation. While that's not yet the case everywhere, there are options. We can use Netmiko, an SSH Python library that connects to network devices (including F5) via SSH/Telnet, issues commands and then logs out. Ansible is another tool, with plenty of modules available for different vendors, including F5. F5 devices can also be configured with templates (see AS3), and their onboarding configuration can be driven by F5 Declarative Onboarding.
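A minimal Netmiko sketch looks like the following; the host, credentials and configuration lines below are placeholders, not a real device (and the import is guarded so the sketch stands alone even where Netmiko isn't installed):

```python
# Driving a switch over SSH with Netmiko. Device details are placeholders.
try:
    from netmiko import ConnectHandler
except ImportError:        # Netmiko not installed; the sketch still parses
    ConnectHandler = None

DEVICE = {
    "device_type": "cisco_ios",  # Netmiko platform driver; pick yours
    "host": "192.0.2.10",        # placeholder management address
    "username": "admin",
    "password": "secret",
}

def push_config(device, config_lines):
    """Open an SSH session, push configuration lines, then disconnect."""
    conn = ConnectHandler(**device)
    try:
        return conn.send_config_set(config_lines)
    finally:
        conn.disconnect()

# Example call (requires a reachable device):
# output = push_config(DEVICE, ["interface Ethernet1", "description uplink"])
```

The same pattern works across vendors by swapping the `device_type` driver, which is what makes these libraries attractive for heterogeneous networks.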
We all know how easy it is to spin up instances in the cloud with templates, e.g. using Terraform. Open Network Install Environment (ONIE), however, is changing the game for physical devices. Endorsed by the Open Compute Project (OCP) and originally developed by Cumulus Networks, ONIE kicks in when no Network OS (NOS) is installed and can bootstrap any OCP-compliant switch by fetching a NOS and/or device configuration from a URL obtained via DHCP. It can also fall back to alternative methods such as a local USB flash drive, a URL resolved via DNS, or even probing IPv4/IPv6 neighbours.
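Conceptually, ONIE works through a waterfall of installer sources until one yields a NOS image. Here's a simplified Python sketch of that idea; the exact order and sources are defined by the ONIE specification, and all URLs below are placeholders:

```python
# Simplified sketch of an ONIE-style installer-discovery waterfall.
# The real discovery order is defined by the ONIE spec; URLs are placeholders.

def candidate_installer_urls(dhcp_default_url=None, dns_server=None):
    """Build an ordered list of places a bare switch might look for a
    NOS installer before giving up."""
    candidates = []
    if dhcp_default_url:                                  # URL from DHCP
        candidates.append(dhcp_default_url)
    candidates.append("file:///mnt/usb/onie-installer")   # local USB stick
    if dns_server:                                        # DNS-based lookup
        candidates.append(f"http://{dns_server}/onie-installer")
    return candidates

urls = candidate_installer_urls(
    dhcp_default_url="http://ztp.example.net/onie-installer",
    dns_server="onie-server.example.net",
)
print(urls[0])  # the DHCP-provided URL is tried first
```

The point is zero-touch provisioning: rack the switch, cable it, power it on, and the fabric configuration arrives from the network itself.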
There's a lot of progress in this area from vendors such as Cisco, Check Point and F5. F5, more specifically, is available everywhere, from the cloud to physical boxes. We're moving towards an era where it's possible to build generic equipment and run the NOS of our preference on it. For example, we can now buy a generic switch with a Broadcom chipset and an Intel CPU and use Cumulus Linux as the network OS.
As we can see, network design also had to change in order to better serve modern application requirements. Change also opens people's minds to other ideas. Network disaggregation (with plug-and-play parts), automated onboarding, template-driven configuration and reduced complexity are perfectly aligned with the fast-paced and efficient DevOps world. In fact, from what I've noticed, the positive changes brought about by DevOps adoption in organisations seem to be positively influencing everything else, including the networking world.