
ESXi 6.0.x host doesn’t register Cisco ACI’s ARP responses with Mellanox 10/40 Gb NICs and nmlx4_en driver loaded

August 8, 2016

I’m currently working on a project to design and deliver a private cloud platform based on VMware vRealize, with Cisco ACI as the SDN solution.

For almost two days we couldn’t ping from the ESXi host (Mellanox NIC) to its default gateway, which is provided by a subnet within the Cisco ACI Bridge Domain (BD). However, a physical Windows box (Broadcom NIC) in the same EPG as the ESXi hosts was able to ping that same gateway. This was odd, because pings between members of the same EPG worked fine, both between ESXi hosts and between the ESXi hosts and the physical Windows machine.

ACI

The first thought that comes to mind is that you’re missing some setting in ACI. Why? Because with SDN solutions the philosophy and logic behind networking change radically. Now you need to know about multi-tenancy, bridge domains, endpoint groups, contracts and so on, so it’s really easy to miss something during the configuration.

Environment

  • ESXi host.
    • HP DL360 Gen9
    • Mellanox 10/40 Gb – MT27520 Family (affected by the ARP bug)
      • NIC Driver info:
        • Driver: nmlx4_en
        • Firmware Version: 2.35.5100
        • Version: 3.1.0.0
  • Cisco ACI version 2.0(1n)
  • VMware ESXi 6.0.x
    • Update 1
    • Update 2
    • VMware and HPE OEM ISOs tested

Symptom

  • ESXi host doesn’t reach its default gateway (ACI BD IP).
  • Any traffic routed through the gateway doesn’t reach its destination.
  • ACI replies to the ARP requests from the ESXi host, but the host doesn’t register the replies.

tcpdump-uw on the ESXi host didn’t show the ACI responses. When we ran Wireshark on the physical machine, we could see ACI replying to the ARP requests from the ESXi host.
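To narrow a capture like this down to just the ARP exchange with the gateway, a small wrapper along the following lines may help. It is only a sketch: the function name `capture_gateway_arp` and the `TCPDUMP` override variable are made up for illustration, and the interface and address you pass in are your own.

```shell
# Hypothetical helper (not part of ESXi): show only ARP frames that involve
# the gateway on a single VMkernel interface.
#   $1 = VMkernel interface, $2 = gateway IP address
# TCPDUMP can be overridden for dry runs; it defaults to ESXi's tcpdump-uw.
capture_gateway_arp() {
    # -n: no name resolution, -e: print MAC addresses,
    # "arp and host <ip>": keep only ARP packets mentioning that address
    "${TCPDUMP:-tcpdump-uw}" -i "$1" -n -e arp and host "$2"
}
```

For example, `capture_gateway_arp vmk0 10.0.0.1` (both values are placeholders). If requests show up but no replies ever do, the host is in the situation described above.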


Resolution

After we installed the latest version of the Mellanox driver available on the VMware website, the ESXi host began to see the ARP responses. The responses were registered, and communication from the ESXi hosts to the default gateway and other networks worked properly.
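Before and after a driver update it is handy to confirm which driver and firmware a NIC is actually running. The sketch below filters the output of `esxcli network nic get` down to the driver lines; the function name `nic_driver_summary` is hypothetical, and it assumes the output contains `Driver:`, `Firmware Version:` and `Version:` lines like the ones listed in the Environment section.

```shell
# Hypothetical helper: reads `esxcli network nic get -n <vmnic>` output on
# stdin and prints only the driver name, firmware version and driver version.
nic_driver_summary() {
    awk '
        { sub(/^[ \t]+/, "") }                            # drop leading indentation
        /^(Driver|Firmware Version|Version):/ { print }   # keep the driver lines
    '
}
```

On the host it could be used as `esxcli network nic get -n vmnic0 | nic_driver_summary`, where `vmnic0` is a placeholder for the Mellanox uplink.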

Troubleshooting Commands

The following commands were used to perform the troubleshooting from the ESXi host side.

# Display physical network adapter information (counters, ring and driver)
/usr/lib/vmware/vm-support/bin/nicinfo.sh

# Display ARP table
esxcli network ip neighbor list

# Display VMkernel network interfaces
esxcli network ip interface list

# Display the virtual switches
esxcli network vswitch standard list

# Verify port connection
nc -z <IP> <port>

# Capture traffic
tcpdump-uw -vv
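
The ARP table command above can also be turned into a quick yes/no check for the symptom. This is a rough sketch, not something we ran during the incident: the function name `gateway_in_neighbor_table` is made up, and it assumes that each resolved neighbor appears as a row whose first column is its IP address, and that an unresolved gateway simply has no row.

```shell
# Hypothetical helper: reads `esxcli network ip neighbor list` output on
# stdin and succeeds (exit 0) only if the given IP has a row in the table.
#   $1 = gateway IP address
gateway_in_neighbor_table() {
    awk -v gw="$1" '
        $1 == gw { found = 1 }   # first column holds the neighbor IP
        END { exit !found }      # exit 0 if found, 1 otherwise
    '
}
```

For example: `esxcli network ip neighbor list | gateway_in_neighbor_table 10.0.0.1 || echo "gateway not in ARP table"`, with the address replaced by your own gateway.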