Juniper QFX10000 HMC Failures
I’ve spent the past several years managing a fleet of several hundred Juniper QFX10000 switches. These have primarily been QFX10002 fixed-form-factor switches, but there have been a few chassis switches in the mix too. As more and more of these were deployed, I started to notice an increasing number of hardware failures, specifically with the HMC. (HMC stands for “Hybrid Memory Cube”, which is the memory used by the forwarding plane.)
Detecting HMC failures
HMC failures will generate a “red” chassis alarm, and many experienced organizations are already monitoring for these alarms. If you’re interested in detecting HMC failures specifically, I’ve found it’s easiest to simply monitor syslog for the string “HMC” and use that to generate whatever alerts you need.
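If your switch syslogs are forwarded to a central collector, a small script can do this filtering for you. Here’s a minimal sketch in Python; the log path is hypothetical, so adapt it to your own logging pipeline.

#!/usr/bin/env python3
# Minimal sketch: scan collected syslog for HMC-related messages.
# Assumes switch logs land in a flat file on a collector; the path
# below is hypothetical.
import re
import sys

SYSLOG_PATH = "/var/log/network/qfx.log"  # hypothetical collector path
HMC_PATTERN = re.compile(r"\bHMC\b")

def find_hmc_events(path):
    """Return all syslog lines that mention HMC."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.rstrip() for line in f if HMC_PATTERN.search(line)]

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else SYSLOG_PATH
    for event in find_hmc_events(path):
        print(event)  # hand these to your alerting system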
HMC Versions
- HMC 2.0 is indicated by FW_Set: 0x0090
- HMC 2.1 is indicated by FW_Set: 0x009a
- HMC 2.2A is indicated by FW_Set: 0x009b
- HMC 2.2 is indicated by FW_Set: 0x009c
- HMC 2.3 is indicated by FW_Set: 0x0100
How to determine your HMC version
You’ll need to access a PFE shell. If you have a chassis switch, you’ll also need to specify an FPC.
Here’s an example from my QFX10002-36Q:
user@qfx10002-36q> start shell pfe network fpc0
Switching platform (2499 Mhz Pentium processor, 3071MB memory, 0KB flash)
FPC0(qfx10002-36q vty)# show hmc asic
chip ID   chip name     FW_Set   Product_Rev   chip num
6         HMC06-06-10   0x0100   0x0025        00
7         HMC07-06-11   0x0100   0x0025        01
8         HMC08-06-12   0x0100   0x0025        02
9         HMC09-07-10   0x0100   0x0025        03
10        HMC0a-07-11   0x0100   0x0025        04
11        HMC0b-07-12   0x0100   0x0025        05
12        HMC0c-08-10   0x0100   0x0025        06
13        HMC0d-08-11   0x0100   0x0025        07
14        HMC0e-08-12   0x0100   0x0025        08
FPC0(qfx10002-36q vty)# exit
{master:0}
user@qfx10002-36q>
If you look at the FW_Set column, you’ll see that all nine chips show 0x0100, so this switch has HMC 2.3.
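If you’re auditing an entire fleet, eyeballing this output box by box gets tedious. Here’s a minimal Python sketch that maps the FW_Set column of a captured “show hmc asic” output to the version table above; it assumes you’ve already saved the vty output to a file with whatever automation you normally use.

#!/usr/bin/env python3
# Minimal sketch: map FW_Set values from a captured "show hmc asic"
# output to HMC versions, using the table above. The capture file is
# assumed to come from your existing automation.
import re
import sys

FW_SET_TO_VERSION = {
    "0x0090": "HMC 2.0",
    "0x009a": "HMC 2.1",
    "0x009b": "HMC 2.2A",
    "0x009c": "HMC 2.2",
    "0x0100": "HMC 2.3",
}

# Matches data rows such as: "6  HMC06-06-10  0x0100  0x0025  00"
ROW = re.compile(r"^\s*\d+\s+\S+\s+(0x[0-9a-fA-F]{4})\s+0x[0-9a-fA-F]+\s+\d+\s*$")

def hmc_versions(output):
    """Return the set of HMC versions seen in a 'show hmc asic' capture."""
    versions = set()
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            fw_set = match.group(1).lower()
            versions.add(FW_SET_TO_VERSION.get(fw_set, "unknown (" + fw_set + ")"))
    return versions

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(", ".join(sorted(hmc_versions(f.read()))))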
Failure Rates
The following failure rates were shared by Juniper in mid-2024. It’s disappointing [but unsurprising] that they don’t publish this in their own KB.
- HMC 2.0 has a failure rate of 8%
- HMC 2.1 has a failure rate of 4%
- HMC 2.2A has a failure rate of 4%
- HMC 2.2 has a failure rate of 1%
- HMC 2.3 has a failure rate of 1%
While Juniper didn’t go into much detail, my personal suspicion is that newer HMC versions may see their failure rates increase as they’re deployed for longer periods of time.
(With failure rates this high, it’s easy to see why Juniper is looking to EOL all hardware that uses HMC.)
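To put these rates in perspective, here’s a quick back-of-the-envelope calculation. The fleet size of 300 switches is a hypothetical number for illustration, and Juniper didn’t specify whether the rates are per switch or per HMC part (the output above shows nine HMC chips in a single QFX10002-36Q), so treat the results as rough orders of magnitude.

#!/usr/bin/env python3
# Back-of-the-envelope expected failures per HMC version. The fleet
# size of 300 is hypothetical; rates are the mid-2024 figures above.

FAILURE_RATES = {
    "HMC 2.0": 0.08,
    "HMC 2.1": 0.04,
    "HMC 2.2A": 0.04,
    "HMC 2.2": 0.01,
    "HMC 2.3": 0.01,
}

FLEET_SIZE = 300  # hypothetical

for version, rate in FAILURE_RATES.items():
    expected = FLEET_SIZE * rate
    print(f"{version}: ~{expected:.0f} expected failures across {FLEET_SIZE} units")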
Recovering from HMC failures
Upon detecting an HMC failure, I generally reboot the switch as quickly as possible, commonly within an hour of the HMC alarm. When rebooted this quickly, every such failure I’ve seen has cleared after the reboot.
There was a single instance where business reasons prevented an urgent reboot: I had to wait approximately three days from the initial HMC alert before I could reboot the switch. The reboot worked as I had come to expect, and the switch recovered. However, just a few days later, the HMC failure returned; back-to-back failures like this were new to me, and I haven’t seen them since. Upon rebooting the switch for this second HMC failure, the FPC wouldn’t come online, and I couldn’t recover it. The end result was a full hardware replacement.
It’s difficult to say whether waiting several days contributed to the permanent hardware failure, but my personal suspicion is that it did.