Layer 2 Fabrics
This blueprint item primarily covers the following topics:
Note
Both Virtual Chassis and Virtual Chassis Fabric are hardware-based architectures. Unfortunately, I do not have access to real hardware; all of my labbing is based on the vQFX10000. This means that although configurations may be presented, I don’t currently have a way of validating them. If you have access to physical gear and can validate the configurations (or present more interesting topologies/configurations), I would love to see them contributed!
Virtual Chassis
The EX4300, QFX3500, QFX3600, and QFX5100 can form a mixed mode Virtual Chassis. If you are familiar with mixed mode Virtual Chassis from the Enterprise track (EX4200/EX4500/EX4550), the concept is very similar. Up to 10 switches are supported in a stack, although some switches, such as the QFX5100-96S, may not be supported.
The first step in creating a mixed mode Virtual Chassis is to tell each individual switch that it will be participating in a mixed mode VC with the
request virtual-chassis mode mixed reboot
command. When building a new stack, you can instead use the
request virtual-chassis mode mixed all-members reboot
command to set all members in the stack to mixed mode at the same time.
Note
When operating in a mixed mode Virtual Chassis, the scaling numbers are reduced to the lowest common denominator. This severely limits the scalability of a deployment. The lowest common denominator is the lowest scaling factor of the smallest possible PFE. This means that if you have a mixed Virtual Chassis of QFX3500 and QFX5100, even if there is no EX4300, each device will still be limited to the maximum scale of an EX4300.
The benefits of a mixed mode Virtual Chassis are the same as those of a regular Virtual Chassis:
- Redundant Routing Engines
- NSR and NSB
- One control plane for multiple data planes
- Potential elimination of xSTP
By default, the 40G QSFP+ ports on an EX4300 are enabled for Virtual Chassis; on the QFX5100, however, these ports are disabled for Virtual Chassis. When a mixed mode VC contains QFX5100 switches, only a QFX5100 can become an RE. As with many protocols that have primary and secondary nodes, VC mastership is decided by an election. The first criterion is priority, which is 128 by default; higher is better. If the priority values are the same, the member that was the master prior to a reboot is preferred. Next, the member with the highest uptime is preferred, provided that the difference in uptime is more than 60 seconds. Finally, all else being equal (and with the difference in uptime being less than 60 seconds), the member with the lowest MAC address is elected as the master. The backup is elected according to the same criteria.
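The election order described above can be sketched as a pairwise comparison. This is only an illustration of the tie-breakers as listed here, not Junos code; the `Member` class, its field names, and the sample MAC addresses are hypothetical, and the handling of an uptime gap of exactly 60 seconds is an assumption.

```python
from dataclasses import dataclass

UPTIME_WINDOW_S = 60  # uptime only decides when the gap exceeds this


@dataclass(frozen=True)
class Member:
    """Hypothetical view of one Virtual Chassis member."""
    member_id: int
    mac: str
    priority: int = 128        # default mastership priority
    was_master: bool = False   # was the master before the last reboot
    uptime_s: int = 0


def mac_value(mac: str) -> int:
    """Turn aa:bb:cc:dd:ee:ff into an integer so MACs can be compared."""
    return int(mac.replace(":", ""), 16)


def prefer(a: Member, b: Member) -> Member:
    """Return whichever member wins the election between a and b."""
    if a.priority != b.priority:                         # 1: higher priority
        return a if a.priority > b.priority else b
    if a.was_master != b.was_master:                     # 2: previous master
        return a if a.was_master else b
    if abs(a.uptime_s - b.uptime_s) > UPTIME_WINDOW_S:   # 3: uptime, gap > 60 s
        return a if a.uptime_s > b.uptime_s else b
    return a if mac_value(a.mac) <= mac_value(b.mac) else b  # 4: lowest MAC


def elect(members):
    """Fold the pairwise comparison over all candidates."""
    winner = members[0]
    for candidate in members[1:]:
        winner = prefer(winner, candidate)
    return winner


def elect_master_and_backup(members):
    """Elect the master, then repeat the process for the backup."""
    master = elect(members)
    remaining = [m for m in members if m.member_id != master.member_id]
    backup = elect(remaining) if remaining else None
    return master, backup


stack = [
    Member(0, "00:00:5e:00:53:0a", uptime_s=1000),
    Member(1, "00:00:5e:00:53:01", uptime_s=1030),
    Member(2, "00:00:5e:00:53:05", priority=200),
]
master, backup = elect_master_and_backup(stack)
print(master.member_id, backup.member_id)  # prints "2 1"
```

Member 2 wins on priority alone. Members 0 and 1 are within 60 seconds of each other in uptime, so the lower MAC address decides the backup.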
When implementing Virtual Chassis, you can use the
virtual-chassis auto-sw-upgrade
configuration to automatically
upgrade members with the LC (Line Card)
state when their software
version does not match that of the Master RE. Additionally, Split
Detection can be disabled with the
virtual-chassis no-split-detection
configuration. However, this
should only be done when there are only two members in the Virtual
Chassis.
On a QFX5100, you will need to set ports as Virtual Chassis Ports with
the request virtual-chassis vc-port set pic-slot 0 port 48
operational command.
Virtual Chassis supports a special variant of ISSU called NSSU:
Non-Stop Software Upgrade. For NSSU to work, the physical topology must
be a ring – it cannot be a braid. The master and backup must be
adjacent; this means that the roles of each switch must be
deterministic. For this reason, only pre-provisioned Virtual Chassis is
supported. Additionally, both NSR and GRES must be configured; NSB is
optional. To initiate an NSSU, issue the
request system software nonstop-upgrade [<path_platform1> <path_platform2>]
command.
Virtual Chassis Fabric
Virtual Chassis Fabric is an extension of Virtual Chassis. It works with QFX3500, QFX3600, QFX5100, and EX4300 series switches. New switches added to a VCF are automatically discovered and brought online.
Note
The node limit is unclear. In Chapter 5 of the O’Reilly QFX5100 Series book [1], a passage indicates that the maximum number of switches is 32; however, more recent Juniper documentation [2] indicates that the limit is 20. Because the documentation is more recent than the book, and even though there is no confirmed, published erratum [3], readers should assume a limit of 20 nodes.
Like Virtual Chassis, VCF uses IS-IS between switches. The links between switches, however, come up as “Smart Trunks.” This is part of what enables VCF to perform unequal cost multipath load balancing in certain designs and failure scenarios. The other technology that enables unequal cost multipath is Adaptive Load Balancing (ALB), which hashes TCP flowlets to different links.
These flowlets are tracked in a hash bucket table. This table can hold hundreds of thousands of entries – enough to prevent “elephant flows” from overloading a given link. When the flowlet egresses the switch, the hash table is updated with a timestamp and the link via which it egressed. When a new packet for the same flowlet egresses, it is checked against an expiration or inactivity timer. If the time since the last packet was seen is greater than the inactivity timer, then the flowlet is hashed to a new uplink. The egress link selection is also based on a moving average of the load and queue depth on each interface.
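The flowlet behaviour described above can be sketched as follows. This is a simplified model, not the hardware implementation: the `FlowletBalancer` class, the CRC-based bucket hash, the 0.9/0.1 averaging weights, and the link names are all assumptions, and queue depth is left out so that only a moving average of load drives link selection.

```python
import zlib


class FlowletBalancer:
    """Toy model of flowlet-based adaptive load balancing."""

    def __init__(self, links, inactivity_s, buckets=4096):
        self.links = list(links)
        self.inactivity_s = inactivity_s   # inactivity timer, in seconds
        self.buckets = buckets             # size of the hash bucket table
        self.table = {}                    # bucket -> (link, last_seen)
        self.load = {name: 0.0 for name in self.links}  # moving average

    def _bucket(self, flow):
        # Deterministic hash of the flow tuple into the bucket table.
        return zlib.crc32(repr(flow).encode()) % self.buckets

    def pick_link(self, flow, now, size=1.0):
        bucket = self._bucket(flow)
        entry = self.table.get(bucket)
        if entry is None or now - entry[1] > self.inactivity_s:
            # New flowlet, or the gap exceeded the timer: rehash to the
            # link with the lowest load average.
            link = min(self.links, key=self.load.get)
        else:
            # Same flowlet: stay on the same link to preserve packet order.
            link = entry[0]
        # Record the egress link and the timestamp of this packet.
        self.table[bucket] = (link, now)
        # Update the moving average of load on every link.
        for name in self.links:
            sample = size if name == link else 0.0
            self.load[name] = 0.9 * self.load[name] + 0.1 * sample
        return link


lb = FlowletBalancer(["vcp-0", "vcp-1"], inactivity_s=1.0)
flow = ("10.0.0.1", "10.0.0.2", 6, 49152, 80)
first = lb.pick_link(flow, now=0.0)    # new flowlet
second = lb.pick_link(flow, now=0.5)   # gap of 0.5 s: same flowlet
third = lb.pick_link(flow, now=2.0)    # gap of 1.5 s: new flowlet
print(first, second, third)  # prints "vcp-0 vcp-0 vcp-1"
```

The third packet arrives after the inactivity timer has expired, so the flow is treated as a new flowlet and moves to the less loaded link. The one-second timer is chosen only to keep the example readable.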
ALB is disabled by default; to enable it in a VCF, use the
set fabric-load-balance flowlet
configuration command.
Not all switches can be spine switches, but all switches can be leaf switches. A general rule of thumb is that a fiber-based QFX5100 can be a leaf switch or a spine switch; any other switch can only be a leaf switch.
Provisioning Options
When configuring a VCF, you have three options: auto-provisioned, pre-provisioned, and non-provisioned. Each has its own benefits and drawbacks; auto-provisioned is less secure, while the non-provisioned mode is more configuration-intensive and less predictable.
With an auto-provisioned VCF, you must specify the role and serial number for each spine switch; the leaf switches are automatically added. The Virtual Chassis Ports are automatically discovered and added.
With a pre-provisioned VCF, you specify each spine and leaf member. Virtual Chassis Ports are also automatically discovered and added. Configuring a VCF in this mode is the same as configuring a Virtual Chassis in pre-provisioned mode.
Note
If you do not want links between switches to be converted to VCPs automatically, delete the LLDP configuration before powering on additional switches.
Note
If you’re using a mixed mode Virtual Chassis Fabric, you need to disable the VCPs on any EX4300 switch in order for the VCPs to autonegotiate successfully. Converting VCPs to network interfaces is covered in the Data Plane section.
The non-provisioned mode is similar to the pre-provisioned mode, except that the Virtual Chassis Ports are not automatically discovered and added, and the roles are not automatically defined; instead, a priority-based election process occurs.
To create a VCF, you need to set the master RE switch into VCF mode with the
request virtual-chassis mode fabric reboot
operational command. At least one leaf switch needs to be installed next, and it should be cabled to the second spine switch before bringing up the second spine switch.
Note
If you need a mixed mode Virtual Chassis Fabric, such as when building a fabric with the QFX5100 and EX4300, you need to use the
request virtual-chassis mode fabric mixed reboot
operational command. Alternatively, you can set the master’s mode to mixed, add all members, and then set all switches to mixed mode at the same time with the
request virtual-chassis mode fabric mixed all-members reboot
operational command.
Mastership Election
In auto-provisioned and pre-provisioned modes, the QFX5100 that has the highest uptime is elected the master. The QFX5100 with the second-highest uptime is elected the backup. Any other QFX5100s in the spine role are line cards. If the master or the backup fails, one of the QFX5100 spines operating as a line card is elected the new backup following the same uptime rules.
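As a rough illustration of the uptime-based ranking for provisioned fabrics, the sketch below assigns roles to spine switches. The function name and the dictionary of uptimes are hypothetical, and re-running the ranking after a failure is only an approximation of the behaviour described above.

```python
def assign_spine_roles(spine_uptimes):
    """Map each spine member ID to a role, ranked by uptime (highest first)."""
    ranked = sorted(spine_uptimes, key=spine_uptimes.get, reverse=True)
    roles = {}
    for position, member_id in enumerate(ranked):
        if position == 0:
            roles[member_id] = "master"
        elif position == 1:
            roles[member_id] = "backup"
        else:
            roles[member_id] = "linecard"
    return roles


# Uptime in seconds per spine member ID (made-up values).
spines = {0: 500, 1: 900, 2: 100}
roles = assign_spine_roles(spines)

# Simulate the loss of the current master and rank the survivors again.
survivors = {m: u for m, u in spines.items() if roles[m] != "master"}
roles_after_failure = assign_spine_roles(survivors)
```

In this example member 1 starts as the master; after it fails, member 0 takes over and member 2 is promoted from line card to backup.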
For a non-provisioned VCF, the following rules dictate master selection:
- 1: Highest priority (default is 128)
- 2: QFX5100 operating as master prior to reboot
- 3: QFX5100 with longest uptime (greater than one minute)
- 4: QFX5100 with lowest MAC address
For the backup RE, the process is repeated.
Note
You might notice that this is the same mastership election process as for Virtual Chassis.
Control Plane
vccpd
runs on all nodes and is based on IS-IS. It is responsible
for topology discovery. It also distributes any VCCP-specific state
information. For unicast traffic, shortest path first is used; however,
to support BUM traffic, bidirectional multicast trees are used. Finally,
for control plane traffic, a unique Class of Service queue is
automatically created and used. All of this operational complexity is
abstracted by Virtual Chassis Fabric.
When deploying a VCF, GRES, NSR, and NSB are used to keep the master and backup REs in sync.
For console access, each switch runs a virtual console server. When you
attach to the console of any member switch, this virtual console server
software automatically redirects your connection to the master RE. Once
you’re on the master RE, you can access a specific node with the
request session member <id>
command.
As with Virtual Chassis, the OOB management interface becomes a vme
interface.
When a switch is removed, its member ID does not get released
automatically. If you want to release the member ID to be used by the
next switch attached, you can use the
request virtual-chassis recycle member-id <id>
operational command.
When adding a new switch, the software versions must be compatible. You
can either upgrade the devices manually, or you can use the
auto-sw-upgrade
configuration. When using this, you must have the
images for each series (EX4300, QFX3500, QFX5100) in your fabric on the
master RE or a remote URL. Use the
set virtual-chassis auto-sw-upgrade ex-4300 <path>
configuration
command to set the path for an EX4300. Replace ex-4300 with qfx-3 or qfx-5 for the QFX3500 or QFX5100, respectively.
When performing a software upgrade, the Non-Stop Software Upgrade (NSSU) feature can be used if the fabric is in preprovisioned mode. Additionally, no-split-detection (covered in the Fabric Partition section) must be configured.
Data Plane
Virtual Chassis Fabric has a concept of “Smart Trunks.” When two or more links between two devices are connected, they will automatically form a LAG. Each path is weighted based on the bandwidth ratio. Traffic is distributed across multiple unequal paths, taking into account the minimum possible bandwidth on any links in the path.
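One way to picture the bandwidth-ratio weighting is shown below. It assumes that a path's usable bandwidth is that of its slowest link, which is one reading of "minimum possible bandwidth"; the function name, path names, and link speeds are made up for the example.

```python
from fractions import Fraction


def path_weights(paths):
    """Weight each path by its bottleneck bandwidth.

    `paths` maps a path name to the list of link bandwidths (in Gbps,
    as integers) that the path traverses.
    """
    # Assumption: a path can carry no more than its slowest link.
    bottleneck = {name: min(links) for name, links in paths.items()}
    total = sum(bottleneck.values())
    return {name: Fraction(bw, total) for name, bw in bottleneck.items()}


weights = path_weights({
    "via-spine-0": [40, 40],   # two 40G hops
    "via-spine-1": [40, 10],   # second hop is only 10G
})
print(weights["via-spine-0"], weights["via-spine-1"])  # prints "4/5 1/5"
```

Under this assumption, the path through the 10G link receives one fifth of the traffic rather than half, which is the unequal cost behaviour the paragraph describes.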
A 16-byte Fabric Header, similar in concept to an MPLS label, is added to each packet that an ingress device sends across the fabric to an egress device. It contains the incoming member ID, incoming port ID, destination member ID, and destination port ID, among other fields.
For load balancing hashing, the following fields are used:
Layer 2+Fabric Header:
- Source MAC
- Destination MAC
- Ethertype
- VLAN ID
- Incoming Port ID
- Incoming Member ID
Layer 3+4:
- Source IP
- Destination IP
- Source Port
- Destination Port
- Protocol
- Incoming Port ID
- Incoming Member ID
- Next Header (IPv6 Only)
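The two field lists above can be read as two different hash keys. The sketch below builds the key for a packet and uses it to pick one link of a trunk. The dictionary keys, the CRC-32 hash, and the link names are assumptions for illustration; the hash function used by the hardware is not described here.

```python
import zlib

# Field names are hypothetical labels for the fields listed above.
L2_FIELDS = ("src_mac", "dst_mac", "ethertype", "vlan_id",
             "in_port", "in_member")
L3_L4_FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port", "protocol",
                "in_port", "in_member")


def hash_key(pkt):
    """Return the tuple of fields that feeds the load-balancing hash."""
    if "src_ip" in pkt:
        fields = L3_L4_FIELDS
        if pkt.get("ip_version") == 6:
            fields = fields + ("next_header",)   # IPv6 only
    else:
        fields = L2_FIELDS
    return tuple(pkt.get(name) for name in fields)


def pick_member_link(pkt, links):
    """Hash the key onto one of the available links (CRC-32 is a stand-in)."""
    digest = zlib.crc32(repr(hash_key(pkt)).encode())
    return links[digest % len(links)]


arp = {"src_mac": "aa", "dst_mac": "bb", "ethertype": 0x0806,
       "vlan_id": 10, "in_port": 3, "in_member": 1}
tcp6 = {"src_ip": "2001:db8::1", "dst_ip": "2001:db8::2", "src_port": 49152,
        "dst_port": 443, "protocol": 6, "ip_version": 6, "next_header": 6,
        "in_port": 3, "in_member": 1}
links = ["vcp-0", "vcp-1", "vcp-2"]
chosen = pick_member_link(tcp6, links)
```

Because the key includes the incoming port and member IDs, the same flow entering the fabric at a different point may hash to a different link.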
If you need to convert an interface to a VCP, the
request virtual-chassis vc-port set pic-slot <id> port <id> member <id>
command can be used. The member <id> corresponds to the FPC number in the interface’s representation. To do the opposite, replace set with delete. For example:
request virtual-chassis vc-port delete pic-slot 0 port 1 member 7
Finally, MAC learning is similar to a Virtual Chassis: when a member learns a new MAC address, it notifies the master of the MAC address. The master then programs all other members with the MAC-to-interface entry.
BUM Traffic
BUM traffic is distributed according to a Multicast Distribution Tree (MDT). There are multiple trees in a Virtual Chassis Fabric, one rooted at each switch. Therefore, there are N MDTs, where N is the number of switches in the Virtual Chassis Fabric. Each switch can load balance across all of the available MDTs when sending BUM traffic. This traffic is hashed based on the VLAN ID.
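A minimal sketch of VLAN-based tree selection follows. The modulo operation is only a placeholder for the real hash, which is not specified here, and the function name and member IDs are hypothetical.

```python
def select_mdt_root(vlan_id, member_ids):
    """Pick the root switch of the MDT used for BUM traffic in a VLAN.

    There is one tree per switch, so choosing a tree is the same as
    choosing a root among the N member IDs.
    """
    roots = sorted(member_ids)
    # Placeholder hash: a plain modulo over the VLAN ID.
    return roots[vlan_id % len(roots)]


members = [0, 1, 2, 3]
root_vlan_100 = select_mdt_root(100, members)
root_vlan_101 = select_mdt_root(101, members)
print(root_vlan_100, root_vlan_101)  # prints "0 1"
```

All BUM traffic in one VLAN follows the same tree, while different VLANs spread across the N available trees.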
Note
In a Virtual Chassis Fabric, all members receive a copy of all BUM traffic.
Fabric Partition
Sometimes, a fabric may become partitioned or “split.” This occurs when one or more switches become isolated from one or more other switches in the fabric. When this happens, one of the new fabrics will remain active, and the others will be deactivated.
Note
“Isolated” refers to communications via the Virtual Chassis Ports. Even if IP connectivity would otherwise exist, the fabric is considered partitioned if it cannot communicate over the VCPs.
To determine which fabric will remain active, the following rules are evaluated, in order:
- 1: The fabric contains both the master and the backup RE from the previous fabric
- 2: The fabric contains the original master RE and at least half of the members from the previous fabric
- 3: The fabric contains the original backup RE and at least half of the members from the previous fabric
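The ordered rules above can be sketched as a small function that returns the partition that stays active. It is an illustration only: the function name and the representation of partitions as sets of member IDs are assumptions, and "at least half" is taken literally as half or more of the previous member count.

```python
def active_partition(partitions, master, backup):
    """Return the partition that remains active after a split, or None.

    `partitions` is a list of sets of member IDs that together make up
    the previous fabric; `master` and `backup` are the member IDs of the
    previous master and backup REs.
    """
    total = sum(len(p) for p in partitions)

    def at_least_half(p):
        return 2 * len(p) >= total

    rules = [
        lambda p: master in p and backup in p,         # rule 1
        lambda p: master in p and at_least_half(p),    # rule 2
        lambda p: backup in p and at_least_half(p),    # rule 3
    ]
    # Rules are evaluated in order; the first partition to match wins.
    for rule in rules:
        for partition in partitions:
            if rule(partition):
                return partition
    return None


# Master is member 0 and backup is member 1 in each case.
case_both = active_partition([{0, 1, 2}, {3}], master=0, backup=1)
case_even = active_partition([{0, 2}, {1, 3}], master=0, backup=1)
case_backup = active_partition([{0}, {1, 2, 3}], master=0, backup=1)
case_none = active_partition([{0}, {1}, {2, 3}], master=0, backup=1)
```

In the even split, both halves hold an RE and half of the members, but rule 2 is evaluated before rule 3, so the half containing the original master stays active.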
If your design can function when a partition happens, you can disable
the default behavior with the set virtual-chassis no-split-detection
configuration command. This disables the deactivation of partitioned
fabrics described above.
Footnotes
[1] Juniper QFX5100 Series
[2] Planning a Virtual Chassis Fabric Deployment
[3] Errata for Juniper QFX5100 Series