Posted: July 31st, 2014
Authors: Chad Hintz and Cesar Obediente
Next Generation Data Center - Spine Leaf Fabric Design
In this blog we are going to
focus on how new data centers are being built and the benefits of the newer
platforms on the market today, from modular chassis to top-of-rack
switches. The goal is to give you a
clear understanding of the new Data Center architectures and their
benefits. As we are accustomed to doing, we have invited a very special guest for a quick interview at the end of
this blog.
Before we start explaining
the new ways to design a Data Center, let's take a step back and understand
how Data Centers have been built in the past and why.
In the past, the majority of the
networks we built followed what we call the 3-Tier model, which is represented in
the following topology.
The idea behind this topology
was that almost all of the traffic was North-South, meaning that traffic
destined to the Data Center was also leaving the Data Center. That is the reason we built a 3-Tier
topology with Core, Distribution and Access layers. If we had to insert network services,
such as load balancers, firewalls, etc., those services would be attached to the aggregation
layer. As noted, this architecture
was excellent for North-South traffic.
The problem arises when new applications are created that
require communication between PODs, what we are calling East-West or
server-to-server traffic, and that traffic pattern requires a different type of
architecture.
It is estimated that in
today's Data Center 76% of the traffic stays within the Data Center, 17%
of the traffic leaves the Data Center and 7% of the traffic flows between Data
Centers. Now the question becomes what
kind of topology best addresses today's Data Center
requirements. Because of the
server-to-server communication, we had to find an architecture with the
following characteristics:
- An equal hop count between any two devices
- Consistent latency between any two devices
The best way to accomplish
these requirements is to build what is called a Clos network, also
referred to as a Spine/Leaf network. Charles
Clos formalized the Clos network in 1953; a Clos network has three stages:
the ingress stage, the middle stage and the egress stage, all connected via a
crossbar. In today's Spine/Leaf network
every Spine connects to every Leaf, but the Spines do not connect to each other.
See the diagram below.
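To make that wiring rule concrete, here is a minimal sketch (Python, with hypothetical switch names and counts) that builds the adjacency of a Spine/Leaf fabric: every Spine links to every Leaf, and there are no Spine-to-Spine or Leaf-to-Leaf links.

```python
# Minimal sketch of Spine/Leaf adjacency (hypothetical names and counts).
# Every Spine connects to every Leaf; Spines never connect to each other.

def build_spine_leaf(num_spines, num_leaves):
    spines = [f"spine{i + 1}" for i in range(num_spines)]
    leaves = [f"leaf{i + 1}" for i in range(num_leaves)]

    links = {switch: [] for switch in spines + leaves}
    for spine in spines:
        for leaf in leaves:
            links[spine].append(leaf)   # one link per Spine/Leaf pair
            links[leaf].append(spine)
    return links

if __name__ == "__main__":
    fabric = build_spine_leaf(num_spines=4, num_leaves=8)
    # Every Leaf sees every Spine, so any Leaf-to-Leaf path is exactly
    # two hops (Leaf -> Spine -> Leaf) with consistent latency.
    print(fabric["leaf1"])   # ['spine1', 'spine2', 'spine3', 'spine4']
```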
Now that we understand
why we are migrating from a traditional 3-Tier architecture to a
Spine/Leaf architecture, it is important to understand that the best way to
build this architecture is to choose the right set of hardware
components and the right bandwidth/oversubscription ratio.
Leaf Layer
Let us begin by analyzing the
Leaf layer, as this is probably the most important layer when deciding how to
build your fabric: this is the layer where the servers connect,
and it is also where the "incast" situation occurs.
Before we continue analyzing the Leaf layer, let us understand what
"incast" is. Incast occurs when many devices communicate with a single device. Picture a network with 10 nodes where 9 of
those nodes are talking to 1 node. You
may ask, what kind of application is designed that way? In reality there are several applications
that behave that way, for example Hadoop, MapReduce and multicast applications, to
name a few. Because of this behavior, the Leaf layer needs specific
characteristics in order to handle this incast situation. One requirement to evaluate
is how much buffer a Leaf switch provides for this incast problem;
this layer is also where the speed mismatch occurs between host ports
and uplink ports. Servers connect at either 1GbE or 10GbE, but the
uplink ports are going to be 40GbE.
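As a rough illustration of why buffer depth matters here, the sketch below uses a simple fluid model with assumed numbers (9 senders, 10GbE ports, a 1 ms burst; none of these figures come from the original post) to estimate how much data the Leaf has to absorb during an incast burst.

```python
# Rough fluid-model estimate of incast buffering (illustrative numbers only).
# N senders burst toward one receiver whose access port drains at port_gbps;
# whatever arrives faster than the port can drain must be buffered or dropped.

def incast_buffer_bytes(num_senders, sender_gbps, port_gbps, burst_ms):
    arrival_gbps = num_senders * sender_gbps        # aggregate offered load
    excess_gbps = max(arrival_gbps - port_gbps, 0)  # what the port cannot drain
    return excess_gbps * 1e9 / 8 * (burst_ms / 1000)  # Gbit/s -> bytes over the burst

# 9 senders at 10GbE bursting for 1 ms into a single 10GbE receiver port
needed = incast_buffer_bytes(num_senders=9, sender_gbps=10,
                             port_gbps=10, burst_ms=1)
print(f"{needed / 1e6:.1f} MB of buffer needed for this burst")  # ~10.0 MB
```

Ten megabytes for a single one-millisecond burst is already on the order of the shared buffer sizes in the table below, which is why buffer is one of the first things to check on a Leaf switch.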
If we compare typical
Leaf switches, they are built from what is called "Merchant Silicon", "Custom
Silicon" or the latest category, "Merchant+".
Here is a table with the main
differences between these types of ASICs:
|                 | Merchant                                 | Custom              | Merchant+ |
| Companies using | Cisco, JNPR, HP, Arista                  | Cisco               | Cisco     |
| Buffer          | Trident+ 9MB, Trident2 12MB, Alta 9.5MB  | Depends on the ASIC | 52MB      |
| VxLAN Routing   | Alta: yes; Trident family: no            | Depends on the ASIC | Yes       |
Spine Layer
Next we take a closer look at
the Spine layer; this is where the Leaves connect. Depending on the size of your Data Center
fabric, you could choose to build this layer with a modular chassis or a fixed
switch. The placement of this layer in
your Data Center is very important: you want it centrally located
so that every Leaf is roughly the same distance away.
The contrast with the Leaf layer is that the Spine layer only requires "enough" buffer
to sustain small bursts in the network, because in this layer every link
runs at the same speed, i.e. there is no speed mismatch.
Oversubscription Ratio
Finally we are going to take
a closer look at the two most common questions we get asked: "Should I use
40GbE or 10GbE as my uplink ports, and how much oversubscription should I have
in my Fabric?" As you can imagine, every networking question has its "it depends"
answer. Let's start by answering the first question: should
we use 10GbE or 40GbE? With the cost
point of today's 40GbE optics, there is no doubt we should be building our
Fabric links with 40GbE. Another reason we recommend 40GbE
uplinks is the "speed-up" effect at the uplink. Historically there has always been a speed-up
between the server connection and the uplink; servers connected at 1GbE with
uplinks at 10GbE. The main reason for this speed-up
is to avoid congestion at the uplink when multiple servers are sending a
sustained amount of data.
The second question, regarding oversubscription, is going to depend on the number of servers you
attach to the Leaf, but more importantly on the type of Leaf you decide to
purchase. For example, your "typical"
Leaf today, built from the Broadcom Trident+ ASIC, provides 48 x 1/10GbE plus
4 x 40GbE ports; the newer Trident2 family comes in
different form factors, from 96 x 1/10GbE plus 8 x 40GbE to 48 x 1/10GbE plus 6
x 40GbE; and finally there is the Merchant+ from Cisco with 48 x 1/10GbE
plus 12 x 40GbE, to name a few.
Here is the formula for calculating the oversubscription ratio:
Oversubscription Ratio = (Host ports * bandwidth) / (Uplink ports * bandwidth)
Different scenarios:
Trident+
48 servers connected at 10GbE
with 4 uplinks at 40GbE. You would have a 3:1 oversubscription ratio.
Trident 2
48 servers connected at 10GbE with 6 uplinks
at 40GbE. You would have a 2:1 oversubscription ratio.
Merchant+
48 servers connected at 10GbE
with 12 uplinks at 40GbE. You would have a 1:1 oversubscription ratio.
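To sanity-check those numbers, here is a short Python sketch that plugs each scenario's port counts into the formula above; the port counts come from the scenarios in the post, while the function name is just illustrative.

```python
# Oversubscription ratio = (host ports * host speed) / (uplink ports * uplink speed)

def oversubscription(host_ports, host_gbps, uplink_ports, uplink_gbps):
    return (host_ports * host_gbps) / (uplink_ports * uplink_gbps)

scenarios = {
    "Trident+":  (48, 10, 4, 40),    # 48 x 10GbE hosts, 4 x 40GbE uplinks
    "Trident2":  (48, 10, 6, 40),    # 48 x 10GbE hosts, 6 x 40GbE uplinks
    "Merchant+": (48, 10, 12, 40),   # 48 x 10GbE hosts, 12 x 40GbE uplinks
}

for name, ports in scenarios.items():
    print(f"{name}: {oversubscription(*ports):g}:1")
# Trident+: 3:1, Trident2: 2:1, Merchant+: 1:1
```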
As you can see, the right
oversubscription ratio is variable and depends on a couple of factors:
- Application resiliency
- Overall budget
Once you have decided on the
right oversubscription ratio, the next question we need to address is how wide
the Spine layer is going to be: will it be 4, 6, 8 or 12 Spines? In order to answer this question we need to
look at two options:
1) One uplink per Spine
2) Multiple uplinks per Spine
Our recommendation is to map
one uplink per Spine. This means that if
you were using a traditional Trident+ box, which has four uplinks, you would
have four Spine boxes in your fabric.
Each uplink from the Leaf would connect to a different Spine.
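Following that one-uplink-per-Spine guideline, the Spine width simply equals the uplink count of the Leaf model you chose; the tiny sketch below (using the Leaf port counts mentioned earlier) makes that mapping explicit.

```python
# With one uplink per Spine, the number of Spines equals the number of
# uplink ports on the chosen Leaf model (port counts as described above).
leaf_uplinks = {"Trident+": 4, "Trident2 (48-port)": 6, "Merchant+": 12}

for leaf, uplinks in leaf_uplinks.items():
    print(f"{leaf}: {uplinks} uplinks -> {uplinks} Spines, one uplink to each")
```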
Closing
This post has introduced
several key components on how to build a next generation data center, from the evolution
of a 3 Tier Data Center to a Spine/Leaf architecture, and the different
components in this architecture.
Bonus Material
Here is our interview with
Dr. Mohammad Alizadeh. Dr. Alizadeh works for Cisco in the office of the CTO with
the INSBU. He has a PhD from Stanford
University, where his research concentrated on Data Center
congestion control. Some of his work
includes the Data Center TCP (DCTCP) congestion control algorithm, which has been
implemented in the Windows Server 2012 operating system.
Dr. Alizadeh is going to
cover his latest research on Data Center congestion control and his findings.