Juniper QFX10000 HMC Failures
I’ve spent the past several years managing a fleet of several hundred Juniper QFX10000 switches. These have primarily been QFX10002 fixed-form-factor switches, but there have been a few chassis switches in the mix too. As more and more of these were deployed, I started to notice an increasing number of hardware failures, specifically with the HMC. (HMC stands for “Hybrid Memory Cube”, which is the memory used by the forwarding plane.)
Detecting HMC failures
HMC failures will generate a “red” chassis alarm, and many experienced organizations are already monitoring for these alarms. If you’re interested in detecting HMC failures specifically, I’ve found it’s easiest to simply monitor syslog for the string “HMC” and use that to generate whatever alerts you need.
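If your switch syslogs are forwarded to a central collector, a small script can do this filtering for you. Here’s a minimal sketch in Python; the log path is hypothetical, so adapt it to your own logging pipeline.

#!/usr/bin/env python3
# Minimal sketch: scan collected syslog for HMC-related messages.
# Assumes switch logs land in a flat file on a collector; the path
# below is hypothetical.
import re
import sys

SYSLOG_PATH = "/var/log/network/qfx.log"  # hypothetical collector path
HMC_PATTERN = re.compile(r"\bHMC\b")

def find_hmc_events(path):
    """Return all syslog lines that mention HMC."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip() for line in f if HMC_PATTERN.search(line)]

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else SYSLOG_PATH
    for event in find_hmc_events(path):
        print(event)  # hand these to your alerting system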
HMC Versions
- HMC 2.0 is indicated by FW_Set: 0x0090
- HMC 2.1 is indicated by FW_Set: 0x009a
- HMC 2.2A is indicated by FW_Set: 0x009b
- HMC 2.2 is indicated by FW_Set: 0x009c
- HMC 2.3 is indicated by FW_Set: 0x0100
How to determine your HMC version
You’ll need to access a PFE shell. If you have a chassis switch, you’ll also need to specify an FPC.
Here’s an example from my QFX10002-36Q:
user@qfx10002-36q> start shell pfe network fpc0
Switching platform (2499 Mhz Pentium processor, 3071MB memory, 0KB flash)
FPC0(qfx10002-36q vty)# show hmc asic
chip ID   chip name     FW_Set   Product_Rev   chip num
6         HMC06-06-10   0x0100   0x0025        00
7         HMC07-06-11   0x0100   0x0025        01
8         HMC08-06-12   0x0100   0x0025        02
9         HMC09-07-10   0x0100   0x0025        03
10        HMC0a-07-11   0x0100   0x0025        04
11        HMC0b-07-12   0x0100   0x0025        05
12        HMC0c-08-10   0x0100   0x0025        06
13        HMC0d-08-11   0x0100   0x0025        07
14        HMC0e-08-12   0x0100   0x0025        08
FPC0(qfx10002-36q vty)# exit
{master:0}
user@qfx10002-36q>
If you look at the FW_Set column, you’ll see that all nine chips show 0x0100, so this switch has HMC 2.3.
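If you’re auditing an entire fleet, eyeballing this output box by box gets tedious. Here’s a minimal Python sketch that maps the FW_Set column of a captured “show hmc asic” output to the version table above; it assumes you’ve already saved the vty output to a file with whatever automation you normally use.

#!/usr/bin/env python3
# Minimal sketch: map FW_Set values from a captured "show hmc asic"
# output to HMC versions, using the table above. The capture file is
# assumed to come from your existing automation.
import re
import sys

FW_SET_TO_VERSION = {
    "0x0090": "HMC 2.0",
    "0x009a": "HMC 2.1",
    "0x009b": "HMC 2.2A",
    "0x009c": "HMC 2.2",
    "0x0100": "HMC 2.3",
}

# Matches data rows such as: "6  HMC06-06-10  0x0100  0x0025  00"
ROW = re.compile(r"^\s*\d+\s+\S+\s+(0x[0-9a-fA-F]{4})\s+0x[0-9a-fA-F]+\s+\d+\s*$")

def hmc_versions(output):
    """Return the set of HMC versions seen in a 'show hmc asic' capture."""
    versions = set()
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            fw_set = match.group(1).lower()
            versions.add(FW_SET_TO_VERSION.get(fw_set, "unknown (" + fw_set + ")"))
    return versions

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(", ".join(sorted(hmc_versions(f.read()))))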
Failure Rates
The following failure rates were shared by Juniper in mid-2024. It’s disappointing [but unsurprising] that they don’t publish this in their own KB.
- HMC 2.0 has a failure rate of 8%
- HMC 2.1 has a failure rate of 4%
- HMC 2.2A has a failure rate of 4%
- HMC 2.2 has a failure rate of 1%
- HMC 2.3 has a failure rate of 1%
While Juniper didn’t go into much detail, my personal suspicion is that newer HMC versions may see their failure rates increase as they’re deployed for longer periods of time.
(With failure rates this high, it’s easy to see why Juniper is looking to EOL all hardware that uses HMC.)
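To put these rates in perspective, here’s a quick back-of-the-envelope calculation. The fleet size of 300 switches is a hypothetical number for illustration, and Juniper didn’t specify whether the rates are per switch or per HMC part (the output above shows nine HMC chips in a single QFX10002-36Q), so treat the results as rough orders of magnitude.

#!/usr/bin/env python3
# Back-of-the-envelope expected failures per HMC version. The fleet
# size of 300 is hypothetical; rates are the mid-2024 figures above.

FAILURE_RATES = {
    "HMC 2.0": 0.08,
    "HMC 2.1": 0.04,
    "HMC 2.2A": 0.04,
    "HMC 2.2": 0.01,
    "HMC 2.3": 0.01,
}

FLEET_SIZE = 300  # hypothetical

for version, rate in FAILURE_RATES.items():
    expected = FLEET_SIZE * rate
    print(f"{version}: ~{expected:.0f} expected failures across {FLEET_SIZE} units")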
Recovering from HMC failures
Upon detecting an HMC failure, I generally reboot the switch as quickly as possible, commonly within an hour of the HMC alarm. When rebooted this quickly, every such failure I’ve seen has cleared after the reboot.
There was a single instance where business reasons prevented an urgent reboot: I had to wait approximately three days from the initial HMC alert before I could reboot the switch. The reboot worked as I had come to expect, and the switch recovered. However, just a few days later, the HMC failure returned; back-to-back failures like this were new to me, and I haven’t seen them since. Upon rebooting the switch for this second HMC failure, the FPC wouldn’t come online, and I couldn’t recover it. The end result was a full hardware replacement.
It’s difficult to say whether waiting several days contributed to the permanent hardware failure, but my personal suspicion is that it did.