I just discovered a major performance regression in the socfpga-3.18 kernel git branch. When a large number of small packets is sent through the HPS Ethernet (stmmac Linux driver), 3.18 is more than 10 times slower than the 3.10 kernel. I tested the current 3.10-ltsi and also an almost year-old 3.10-ltsi-rt.
With our userspace application, we can do 1000 packets (60 to 200 bytes each) in 0.25 seconds on 3.10. But on 3.18, 1000 packets need more than 6 seconds in the same scenario! This is probably caused by some latency in the kernel: the application works on a request-response pattern, with a single TCP connection from one remote client.
The same test using the TSE Ethernet in the FPGA gives the same (and expectedly good) performance on both the 3.10 and 3.18 kernels, so I guess it is an stmmac-driver-specific issue, not a general issue in the network stack.
I tried copying the whole drivers/net/ethernet/stmicro directory from the current socfpga-3.10-ltsi into socfpga-3.18. The kernel compiled after a few fixes, but that does not fix the problem; it is still bad on 3.18. A bit weird…
Has anyone seen such an issue? Any idea how to debug this?
The problem can be demonstrated using the Hpcbench/tcp tool from http://hpcbench.sourceforge.net/. Run tcpserver on some <host> computer (192.168.1.1 for me), and from HPS Linux:
./tcptest -N -m32 -h 192.168.1.1
(1) : 0.033703 Mbps
(2) : 0.033334 Mbps
(3) : 0.033295 Mbps
(4) : 0.033276 Mbps
(5) : 0.033319 Mbps
(6) : 0.033313 Mbps
(7) : 0.033310 Mbps
(8) : 0.033299 Mbps
(9) : 0.256513 Mbps
0.03 Mbps is a very poor result; with the 3.10 kernel, you get about 1.3 Mbps.
I also found a workaround for this issue: setting the TX coalesce parameter with
ethtool -C eth0 tx-frames 0
Then tcptest gives the expected performance:
./tcptest -N -m32 -h 192.168.1.1
(1) : 1.528941 Mbps
(2) : 1.523969 Mbps
(3) : 1.597586 Mbps
(4) : 1.517090 Mbps
(5) : 1.540663 Mbps
(6) : 1.585813 Mbps
(7) : 1.541587 Mbps
(8) : 1.963557 Mbps
(9) : 1.559859 Mbps
(10) : 1.554648 Mbps
and our application also works OK. But this setting hurts high-throughput loads, such as TCP iperf, which reaches only about 50% of its throughput after the tx-frames setting.
Performance without the tx-frames=0 setting is comparable to the situation when TCP_NODELAY is not set on the socket (tcptest without the -N option). It seems as if the newer kernel no longer honours the TCP_NODELAY behaviour at the driver/NAPI level.