HPS Ethernet (stmmac) performance regression in 3.18 compared to 3.10-ltsi

Hi,
I just discovered a major performance regression in the socfpga-3.18 kernel git branch. When a large number of small packets is exchanged through the HPS Ethernet (stmmac Linux driver), 3.18 is more than 10 times slower than the 3.10 kernel. I tested the current 3.10-ltsi and also an almost year-old 3.10-ltsi-rt.

With our userspace application, we are able to exchange 1000 packets (60 to 200 bytes in size) in 0.25 seconds on 3.10. But on 3.18, 1000 packets need more than 6 seconds in the same scenario! That is roughly 6 ms per packet instead of 0.25 ms, so this is probably caused by some added latency in the kernel - the application uses a request-response pattern over a single TCP connection from one remote client.

The same test using the TSE Ethernet in the FPGA gives the same (and expectedly good) performance on both the 3.10 and 3.18 kernels. So I guess it is an stmmac driver specific issue, not some general issue in the network stack.

I tried copying the whole drivers/net/ethernet/stmicro directory from the current socfpga-3.10-ltsi into socfpga-3.18. It was possible to compile the kernel with a few fixes, but it does not fix the problem - still bad on 3.18. A bit weird…

Has anyone seen such an issue? Any idea how to debug this?

Thanks.


Update:

The problem can be demonstrated using the Hpcbench/tcp tool from http://hpcbench.sourceforge.net/

Run tcpserver on some <host> computer (192.168.1.1 for me), and from the HPS Linux run:

./tcptest -N -m32 -h 192.168.1.1
 (1) : 0.033703 Mbps
 (2) : 0.033334 Mbps
 (3) : 0.033295 Mbps
 (4) : 0.033276 Mbps
 (5) : 0.033319 Mbps
 (6) : 0.033313 Mbps
 (7) : 0.033310 Mbps
 (8) : 0.033299 Mbps
 (9) : 0.256513 Mbps

0.03 Mbps is a very poor result. With the 3.10 kernel, you can get about 1.3 Mbps.

I also found a workaround for this issue: setting the coalesce parameters with ethtool:

ethtool -C eth0 tx-frames 0

Then tcptest gives the expected performance:

./tcptest -N -m32 -h 192.168.1.1
 (1) : 1.528941 Mbps
 (2) : 1.523969 Mbps
 (3) : 1.597586 Mbps
 (4) : 1.517090 Mbps
 (5) : 1.540663 Mbps
 (6) : 1.585813 Mbps
 (7) : 1.541587 Mbps
 (8) : 1.963557 Mbps
 (9) : 1.559859 Mbps
 (10) : 1.554648 Mbps

and our application also works OK.
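
To check what the driver is currently using (and to confirm the change took effect), the coalescing parameters can be read back with ethtool's lowercase -c option:

ethtool -c eth0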

But this setting hurts high-throughput loads, such as a TCP iperf test, which reaches only about 50% of its previous throughput after the tx-frames change.
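
The high-throughput case here is just a plain TCP iperf run; something like the following (default options) is enough to see the drop:

iperf -s                 # on the remote host
iperf -c 192.168.1.1     # on the HPS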

Performance without the tx-frames=0 setting is comparable to the situation when TCP_NODELAY is not set on the socket (tcptest without the -N option). It seems as if TCP_NODELAY behaviour is no longer being forced at the driver/NAPI level in the newer kernel.
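
For comparison, the run without TCP_NODELAY is simply the same command without the -N option:

./tcptest -m32 -h 192.168.1.1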