Next Generation Data Center - Spine Leaf Fabric Design

Posted: July 31st, 2014
Authors: Chad Hintz and Cesar Obediente


In this blog we are going to focus on how new data centers are being built and the benefits of the newer platforms on the market today, from modular chassis to top-of-rack switches.  The goal is to give you a clear understanding of the new Data Center architectures and their benefits.  As we are accustomed to doing, we have invited a very special guest for a quick interview at the end of this blog.

Before we start explaining the new ways to design a Data Center, let's take a step back and understand how Data Centers have been built in the past, and why.

In the past, the majority of the networks we've built followed what we call a 3-Tier model, which is represented in the following topology:

[Figure: traditional 3-Tier topology with Core, Distribution, and Access layers]
The idea behind this topology is that almost all of the traffic was North-South, meaning traffic destined to the Data Center was also leaving the Data Center.  That is the reason we were building a 3-Tier topology with Core, Distribution, and Access layers.  If we then had to insert network services such as load balancers, firewalls, etc., those services would be attached at the aggregation layer.  As noted, this architecture was excellent for North-South traffic.  The problem arises when a new set of applications is created that requires communication between PODs, what we are calling East-West traffic or server-to-server communication, and that traffic pattern requires a different type of architecture.

It is estimated that in today's Data Center 76% of the traffic stays within the Data Center, 17% of the traffic leaves the Data Center, and 7% of the traffic flows between Data Centers.  Now the question becomes what kind of topology best addresses today's Data Center requirements.  Because of the server-to-server communication, we had to find an architecture with the following characteristics:


·      An equal hop count between any two devices
·      Consistent latency between any two devices
The best way to accomplish these requirements is by building what is called a Clos network, also referred to as a Spine/Leaf network.  Charles Clos invented the Clos network in 1952; a Clos network has three stages: the ingress stage, the middle stage, and the egress stage, all connected via a crossbar.  In today's Spine/Leaf network every Spine connects to every Leaf, but the Spines do not connect to each other. See the diagram below.
[Figure: Spine/Leaf topology with every Spine connected to every Leaf]
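To make the "equal hop count" and "consistent latency" requirements concrete, here is a minimal Python sketch (our own illustration, with arbitrary spine and leaf counts) that models a Spine/Leaf fabric and checks that every leaf-to-leaf path is exactly two hops, with one equal-cost path per Spine:

from itertools import combinations

NUM_SPINES = 4   # assumption: a small fabric with 4 spines
NUM_LEAVES = 8   # assumption: 8 leaves, each wired to every spine

spines = [f"spine{i}" for i in range(1, NUM_SPINES + 1)]
leaves = [f"leaf{i}" for i in range(1, NUM_LEAVES + 1)]

# Full mesh between tiers: every leaf connects to every spine;
# spines never connect to each other and leaves never connect to each other.
links = {(leaf, spine) for leaf in leaves for spine in spines}

for src, dst in combinations(leaves, 2):
    # Any spine that connects to both leaves gives a valid two-hop path.
    paths = [s for s in spines if (src, s) in links and (dst, s) in links]
    assert len(paths) == NUM_SPINES   # equal-cost paths = number of spines
    print(f"{src} -> {dst}: 2 hops, {len(paths)} equal-cost paths")

Because every pair of leaves sees the same path length through any spine, hop count and latency stay consistent no matter which two servers are talking.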
Now that we understand why we are migrating from a traditional 3-Tier architecture to a Spine/Leaf architecture, it is important to understand that the best way to build this architecture is to choose the right set of hardware components and the right bandwidth/oversubscription ratio.

Leaf Layer

Let us begin by analyzing the Leaf layer, as this is probably the most important layer when deciding how to build your fabric: it is where the servers connect, and it is where the "incast" situation occurs. Before we keep analyzing the Leaf layer, let us understand what "incast" is. Incast is when many devices communicate with a single device: you have a network with 10 nodes, and 9 of those nodes are talking to 1 node.  You may ask, what kind of application is designed that way?  In reality there are several applications that behave this way, for example Hadoop, MapReduce, and multicast applications, to name a few.

Because of this behavior, the Leaf layer needs specific characteristics in order to handle the incast situation.  One requirement to evaluate is how much buffer a Leaf switch provides to absorb incast, but this layer is also where the speed mismatch between host ports and uplink ports occurs: servers will be connected at either 1GbE or 10GbE, but the uplink ports are going to be 40GbE.
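As a back-of-the-envelope illustration of why buffer depth matters here (our own numbers, not measurements from this post), the sketch below estimates how quickly a shared leaf buffer fills when several 10GbE senders converge on a single 10GbE egress port:

def incast_fill_time_ms(senders, link_gbps, buffer_mb):
    """Milliseconds until a shared buffer overflows when `senders` hosts at
    `link_gbps` all burst toward a single egress port of the same speed."""
    arrival_gbps = senders * link_gbps          # aggregate offered load
    drain_gbps = link_gbps                      # one egress port drains the burst
    net_fill_gbps = arrival_gbps - drain_gbps   # rate at which the buffer grows
    buffer_gbits = buffer_mb * 8 / 1000.0       # MB -> gigabits
    return buffer_gbits / net_fill_gbps * 1000  # seconds -> milliseconds

# 9 nodes talking to 1 node at 10GbE, with a ~9 MB Trident+-class shared buffer:
print(f"buffer exhausted in ~{incast_fill_time_ms(9, 10, 9):.2f} ms")   # ~0.90 ms

At these rates a Trident+-class buffer absorbs well under a millisecond of sustained incast, which is why both buffer depth and the uplink speed-up discussed later matter so much at the Leaf.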

If we compare typical Leaf switches, they are built from one of three categories of ASIC: "Merchant Silicon", "Custom Silicon", and the latest category, "Merchant+".

Here is a table with the main differences between the different types of ASICs:


                    Merchant                     Custom                Merchant+
Companies using     Cisco, JNPR, HP, Arista      Cisco                 Cisco
Buffer              Trident+: 9 MB               Depends on the ASIC   52 MB
                    Trident2: 12 MB
                    Alta: 9.5 MB
VxLAN Routing       Alta: Yes                    Depends on the ASIC   Yes
                    Trident family: No


Spine Layer

Next we take a closer look at the Spine layer; this is where the Leaves connect.  Depending on the size of your Data Center fabric, you could choose to build this layer with a modular chassis or a fixed switch.  The placement of this layer in your Data Center is very important: you want to make sure it is centrally located so that the leaves sit at roughly the same distance from the spines.

The contrast with the Leaf layer is that the Spine layer only requires "enough" buffer to sustain small bursts in the network, because every link in this layer runs at the same speed, i.e. there is no speed mismatch.


Oversubscription Ratio

Finally, we are going to take a closer look at the two most common questions we get asked: "Should I use 40GbE or 10GbE as my uplink ports, and how much oversubscription should I have in my Fabric?" As you can imagine, every networking question has its "it depends" answer. Let's start by answering the first question: should we use 10GbE or 40GbE?  At the cost point of today's 40GbE optics, there is no doubt we should be building our fabric links with 40GbE. Another reason we recommend 40GbE uplinks is the "speed-up" effect at the uplink.  Historically there has always been a speed-up between the server connection and the uplink; servers connected at 1GbE with uplinks at 10GbE.  The main reason for this speed-up is to avoid congestion at the uplink when multiple servers are sending sustained amounts of data.

The second question, regarding oversubscription, depends on the number of servers you attach to the leaf, and more importantly on the type of leaf you decide to purchase.  For example, your "typical" leaf today, built around the Broadcom Trident+ ASIC, provides 48 x 1/10GbE plus 4 x 40GbE ports; the newer Trident 2 family switches come in different form factors, from 96 x 1/10GbE plus 8 x 40GbE to 48 x 1/10GbE plus 6 x 40GbE; and finally the Merchant+ from Cisco gives you 48 x 1/10GbE plus 12 x 40GbE, to name a few.

Here is the formula for calculating the oversubscription ratio:
Oversubscription Ratio = (Host ports * host port bandwidth) / (Uplink ports * uplink port bandwidth)

Different scenarios:

Trident+
48 servers connected at 10GbE with 4 uplinks at 40GbE. You would have a 3:1 oversubscription ratio.

Trident 2
48 servers connected at 10GbE with 6 uplinks at 40GbE. You would have a 2:1 oversubscription ratio.

Merchant+
48 servers connected at 10GbE with 12 uplinks at 40GbE. You would have a 1:1 oversubscription ratio.
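Putting the formula and the three leaf profiles together, a few lines of Python (our own helper, using the port counts listed above) reproduce these ratios:

def oversubscription(host_ports, host_gbps, uplink_ports, uplink_gbps):
    # Total host-facing bandwidth divided by total uplink bandwidth.
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)

profiles = {
    "Trident+  (48 x 10GbE down, 4 x 40GbE up)":  (48, 10, 4, 40),
    "Trident 2 (48 x 10GbE down, 6 x 40GbE up)":  (48, 10, 6, 40),
    "Merchant+ (48 x 10GbE down, 12 x 40GbE up)": (48, 10, 12, 40),
}

for name, ports in profiles.items():
    print(f"{name}: {oversubscription(*ports):g}:1")   # 3:1, 2:1, 1:1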

As you can see, the right oversubscription ratio varies and depends on a couple of factors:

·      Application resiliency
·      Overall Budget

Once you have decided on the right oversubscription ratio, the next question we need to address is how wide the Spine is going to be: 4, 6, 8, or 12 Spines?  In order to answer this question we need to look at two options:

1) One uplink per Spine
2) Multiple uplinks per Spine

Our recommendation is to map one uplink per Spine.  This means that if you were using a traditional Trident+ box, which has four uplinks, you would have four Spine boxes in your fabric, and each uplink from the Leaf would connect to a different Spine.
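To see how that recommendation sizes a fabric, here is a rough sketch (our own illustration, assuming a Trident+-class leaf and a hypothetical 32-port 40GbE spine) showing how the leaf's uplink count fixes the spine width and how the spine's port count caps the number of leaves:

def fabric_size(leaf_uplinks, leaf_host_ports, spine_ports):
    spines = leaf_uplinks          # one uplink per spine
    max_leaves = spine_ports       # each leaf consumes one port on every spine
    host_ports = max_leaves * leaf_host_ports
    return spines, max_leaves, host_ports

# Assumption: 48 x 10GbE + 4 x 40GbE leaf, 32 x 40GbE spine.
spines, leaves, hosts = fabric_size(leaf_uplinks=4, leaf_host_ports=48, spine_ports=32)
print(f"{spines} spines, up to {leaves} leaves, {hosts} x 10GbE host ports")

With these assumed port counts the fabric tops out at 4 spines, 32 leaves, and 1,536 x 10GbE host ports; growing beyond that means either wider leaves (more uplinks, hence more spines) or bigger spine boxes.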

Closing

This post has introduced several key components of building a next generation data center, from the evolution of the 3-Tier Data Center to the Spine/Leaf architecture, and the different components of this architecture.

Bonus Material

Here is our interview with Dr. Mohammad Alizadeh. Dr. Alizadeh works for Cisco in the office of the CTO with the INSBU.  He has a PhD from Stanford University and has concentrated his research on Data Center Congestion Control.  His work includes the Data Center TCP (DCTCP) congestion control algorithm, which has been implemented in the Windows Server 2012 operating system.


Dr. Alizadeh is going to cover his latest research on Data Center Congestion Control and his findings.