Monday, November 18, 2013

Suricata capture.kernel_drops caused by interrupt problems from single queue network cards

(update: added an even simpler solution)

For quite some time I was confronted with a huge amount of kernel_drops in Suricata. After a lot of debugging, and with the help of the Suricata developers, I was able to pinpoint the problem to the NIC and the e1000e driver.

Confirmation of the problem

The output of top -H shows that only a single AFPacketeth thread is processing incoming traffic.
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND             
28769 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.49 Suricata-Main       
28770 root      20   0 2792m 1.8g 822m S   65 22.9   1:49.98 AFPacketeth11       
28771 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth12       
28772 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth13       
28773 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth14       
28774 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth15       
28775 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth16       
28776 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth17       
28777 root      20   0 2792m 1.8g 822m S    0 22.9   0:00.04 AFPacketeth18 

Looking at the stats.log file also confirms that only a single thread receives the traffic:
# tail -f /var/log/suricata/stats.log  | fgrep kernel_packet
capture.kernel_packets    | AFPacketeth11             | 42477691
capture.kernel_packets    | AFPacketeth12             | 609
capture.kernel_packets    | AFPacketeth13             | 283
capture.kernel_packets    | AFPacketeth14             | 408
capture.kernel_packets    | AFPacketeth15             | 436
capture.kernel_packets    | AFPacketeth16             | 464
capture.kernel_packets    | AFPacketeth17             | 613
capture.kernel_packets    | AFPacketeth18             | 307

The problem is that only one CPU core is receiving the interrupts of the network card, as /proc/interrupts shows:
           CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   CPU6   CPU7       
 54:   14302207      0      0      0      0      0      0      0   PCI-MSI-edge   eth1-rx-0
 55:          6      0      0      0      0      0      0      0   PCI-MSI-edge   eth1-tx-0
 56:          5      0      0      0      0      0      0      0   PCI-MSI-edge   eth1

Note: it's possible that other cores 'sometimes' get a few interrupts, but the vast majority will go to one core. This is caused by a network card that has only one receive queue; cards driven by e1000e are one example. On the e1000e mailing list an Intel developer confirms this:
On Thursday 25 December 2008 17:59:40, Jeff Kirsher wrote:
> While the hardware supports 2 Tx and 2 Rx queues we do not have the
> full implementation in our Linux e1000e driver to support this.
> The performance to be gained from multiple queues is very small, and
> was not a requirement for our Linux software.
> So its not supported right now, and I doubt it will be implemented for
> e1000e as there is likely to be very little benefit.
> Although we may reconsider based on customer feedback.
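You can quickly verify how many receive queues your NIC exposes via sysfs; if you only see an rx-0 entry, the card (or driver) gives you a single receive queue (eth1 is just the interface name from my setup):
ls /sys/class/net/eth1/queues/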

Solution 1: Set AF_PACKET cluster_type to cluster_flow

In the Suricata configuration file you can specify the technique used for load balancing.
The documentation states:
Default AF_PACKET cluster type. AF_PACKET can load balance per flow or per hash.
This is only supported for Linux kernel > 3.1
possible values are:
  * cluster_round_robin: round robin load balancing
  * cluster_flow: all packets of a given flow are sent to the same socket
  * cluster_cpu: all packets treated in kernel by a CPU are sent to the same socket
To benefit from an effect similar to that of solution 2 (RPS and RFS), you should set cluster-type: cluster_flow.
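For reference, a minimal af-packet section in suricata.yaml could then look like this (the interface name, thread count and cluster-id are just example values from my setup):
af-packet:
  - interface: eth1
    threads: 8
    cluster-id: 99
    cluster-type: cluster_flow
    defrag: yes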

Solution 2: RPS and RFS

After hours of searching and reading I ended up on the FreeBSD wiki, where multiqueue support is compared between Linux and FreeBSD. It is also explained in the Linux kernel documentation. In short, activating RPS (Receive Packet Steering) and/or RFS (Receive Flow Steering) could solve my problem, as they provide packet distribution functionality for mono-queue NICs.

The graphs below (taken from the FreeBSD wiki) make it a little bit more visual.
[Figure: Receive Packet Steering]
[Figure: Receive Flow Steering]



Configuration

The trick is to first tell irqbalance to stop balancing the specific IRQs. Edit /etc/default/irqbalance and set IRQBALANCE_BANNED_INTERRUPTS to a space-separated list of the IRQs it should leave alone.
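For the eth1 interrupts shown in /proc/interrupts above that would, for example, look like this (IRQs 54, 55 and 56 in my case; adjust the numbers to your own output):
IRQBALANCE_BANNED_INTERRUPTS="54 55 56"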
Don't forget to restart irqbalance: stop irqbalance; start irqbalance

Then I set the interrupt affinity so that the eth1 interrupts are pinned to one specific CPU core.
# pin this IRQ to the first CPU core (CPU0)
echo 1 > /proc/irq/${irq}/smp_affinity
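Applied to the eth1 interrupts above this becomes a small loop (a sketch assuming IRQs 54, 55 and 56; use your own numbers):
for irq in 54 55 56; do
  echo 1 > /proc/irq/${irq}/smp_affinity
done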

And activate RPS and RFS for the interface as explained in the article. 
echo "fe" > /sys/class/net/${iface}/queues/rx-0/rps_cpus
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 4096 > /sys/class/net/${iface}/queues/rx-0/rps_flow_cnt
Notice that "fe" is a hexadecimal bitmask of CPU cores; the lowest core (CPU0) is the rightmost bit. With 8 cores:
mask  CPU7 ... CPU0
ff    1111 1111     (all cores)
fe    1111 1110     (all cores except CPU0)
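If you prefer to compute the mask instead of writing it by hand, a quick shell sketch (assuming 8 cores and excluding CPU0, which keeps handling the hardware interrupts):
# all 8 cores would be (1 << 8) - 1 = 0xff; subtracting one more clears the CPU0 bit
printf '%x\n' $(( (1 << 8) - 2 ))    # prints: fe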

Confirmation

Without stopping Suricata we now see with top -H (press > a few times to sort on COMMAND) that all the receive threads are busy:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND      
28769 root      20   0 2792m 1.9g 822m S    0 24.4   0:00.51 Suricata-Main
28770 root      20   0 2792m 1.9g 822m S    1 24.4   1:58.39 AFPacketeth11
28771 root      20   0 2792m 1.9g 822m S   11 24.4   0:01.42 AFPacketeth12
28772 root      20   0 2792m 1.9g 822m S    9 24.4   0:01.31 AFPacketeth13
28773 root      20   0 2792m 1.9g 822m S   14 24.4   0:01.48 AFPacketeth14
28774 root      20   0 2792m 1.9g 822m S   10 24.4   0:01.26 AFPacketeth15
28775 root      20   0 2792m 1.9g 822m S   10 24.4   0:01.32 AFPacketeth16
28776 root      20   0 2792m 1.9g 822m S   10 24.4   0:01.35 AFPacketeth17
28777 root      20   0 2792m 1.9g 822m S   10 24.4   0:01.34 AFPacketeth18
And a second confirmation from the stats.log file:
capture.kernel_packets    | AFPacketeth11             | 73211942
capture.kernel_packets    | AFPacketeth12             | 5193446
capture.kernel_packets    | AFPacketeth13             | 5176265
capture.kernel_packets    | AFPacketeth14             | 4997172
capture.kernel_packets    | AFPacketeth15             | 5059325
capture.kernel_packets    | AFPacketeth16             | 5727925
capture.kernel_packets    | AFPacketeth17             | 5499172
capture.kernel_packets    | AFPacketeth18             | 4503364

Do note that /proc/interrupts will stay the same, unbalanced. That is because one CPU core still handles the hardware interrupts and then places the packets in the software steering queues.
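If you want an extra check that the load really is spread in software, the NET_RX softirq counters in /proc/softirqs should now be increasing on all cores in the RPS mask:
grep NET_RX /proc/softirqs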

Of course, do not forget to fine-tune the hardware buffers and offloading as explained in Eric's blog post here.
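As a rough example (the exact values depend on your NIC, driver and traffic; Eric's post has the details), this kind of tuning usually comes down to ethtool commands like:
# enlarge the receive ring buffer (check the supported maximum with: ethtool -g eth1)
ethtool -G eth1 rx 4096
# disable offloading features that get in the way of IDS inspection
ethtool -K eth1 gro off lro off tso off gso off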

Now the IRQ problem is gone and we see (almost) no drops anymore (the counters are no longer rising).
capture.kernel_drops      | AFPacketeth11             | 250
capture.kernel_drops      | AFPacketeth12             | 322
capture.kernel_drops      | AFPacketeth13             | 37
capture.kernel_drops      | AFPacketeth14             | 147
capture.kernel_drops      | AFPacketeth15             | 184
capture.kernel_drops      | AFPacketeth16             | 130
capture.kernel_drops      | AFPacketeth17             | 358
capture.kernel_drops      | AFPacketeth18             | 91