Stale data read issue due to improper cache invalidation in Altera Cyclone V SOC running Linux

Hi all,

We are conducting a feasibility study on our Altera Cyclone V SOC.
Streaming data arrives at the FPGA at 300 megabits per second.
We have to transfer this data to a server machine over a WiFi connection without any packet loss.
The UDP client application runs on the Altera SOC and the server application runs on an Ubuntu desktop machine.

Initially we tried a sample UDP_client and UDP_Server application that sends a hardcoded buffer of data (from user space only, not data read from the FPGA via kernel space) to measure the maximum data rate over WiFi.
We got a range of 250 Mbps to 500 Mbps (without tuning, and after tuning some network parameters as suggested in this link: https://opensourceforu.com/2016/10/n...e-monitoring/).

We are running Angstrom Linux supplied by Altera (Linux kernel version 4.9).

We are using an interrupt-driven design, with the FPGA writing data to a ring buffer in DDR.
This ring buffer is mmapped to user space when the application starts.

We are losing data, and we later found that data is already lost between kernel space and user space. Hence we limited our analysis to the client side by commenting out the send() call to the server.

On analysing the sequence numbers, we found that some of them are stale data left over in memory. We confirmed this with a simple exercise.

After reading each data packet, including its sequence number, we write zero back to the sequence number in memory. When we run this code, no sequence number mismatch is seen.
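
A minimal user-space sketch of the exercise described above. The slot layout, the names (struct slot, consume_slot) and the convention that 0 means "already consumed" (so real sequence numbers start at 1) are assumptions for illustration; the actual packet format is not given in this post. It only shows the bookkeeping and does not by itself make the memory coherent with the FPGA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical slot layout; the real packet format is not given in the post. */
struct slot {
    volatile uint32_t seq;   /* written by the producer; 0 means "consumed" */
    uint8_t payload[32];
};

/* Result of inspecting one slot. */
enum slot_state { SLOT_OK, SLOT_EMPTY, SLOT_MISMATCH };

/*
 * Compare the slot's sequence number with the expected one. On a match,
 * clear the sequence number so the same packet cannot be accepted twice.
 */
static enum slot_state consume_slot(struct slot *s, uint32_t expected)
{
    uint32_t seq;

    assert(s != NULL);
    seq = s->seq;

    if (seq == 0)
        return SLOT_EMPTY;      /* nothing new, or already consumed */
    if (seq != expected)
        return SLOT_MISMATCH;   /* stale or out-of-order data */

    s->seq = 0;                 /* mark as consumed */
    return SLOT_OK;
}
```

A reader loop would call consume_slot() with the next expected sequence number and treat SLOT_MISMATCH as the stale-data case described above.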

We would like to know how we can implement this in software using cache invalidation routines.
We have tried APIs like __cpuc_flush_kern_all() and flush_cache_all() every time before telling the FPGA to write data to DDR, but these APIs do not work as we expected.

Could you please help us with the cache invalidation APIs and the exact workflow in which cache invalidation APIs are to be used in Altera Linux SOC?

Thank you,
Lullaby

-------- Reply to below response from Roberbot----------------------------

Hi,

Thank you very much for your valuable response.
I am trying out dma_alloc_coherent() as you suggested.

Some questions from my side.

Do you have any example code showing how to use dma_alloc_coherent() together with an mmap implementation in the driver to remap the buffer to user space? If so, please post it. I have not yet found a good example of this.

Have you tried this on the Altera Cyclone V SOC platform?

Do you see any performance tradeoff when caching is entirely disabled? We are also concerned about our 300 Mb/s data rate. Each packet is 30 KB in size.
We have not yet tried enabling DMA for the data transfer from FPGA DDR to user space.

Please share your valuable thoughts on this.

Please note: due to an unknown technical issue, I am not able to post a reply to this thread, so I edited this post instead.

Thank you,
Lullaby

Hi,

Thank you very much for your valuable response.
I have tried out dma_alloc_coherent() as you suggested.

My code extract looks like this:

/* Allocate a non-cached memory area with dma_alloc_coherent. */
printk(KERN_INFO "Use dma_alloc_coherent\n");
alloc_ptr = dma_alloc_coherent(NULL, (NPAGES + 2) * PAGE_SIZE, &dma_handle, GFP_KERNEL);

if (!alloc_ptr) 
{
    printk(KERN_ERR "mmap_alloc: dma_alloc_coherent error\n");
    ret = -ENOMEM;
    goto out;
}

In the mmap implementation in my driver, I have written:

int mmap_kmem(struct file *filp, struct vm_area_struct *vma)
{
    int ret;
    long length = vma->vm_end - vma->vm_start;

    /* check length - do not allow larger mappings than the number of pages allocated */
    if (length > NPAGES * PAGE_SIZE) {
        printk(KERN_ERR "mmap_kmem Error: Length %ld\n", length);
        return -EIO;
    }

    printk(KERN_ALERT "Mmap length: %ld\n", length);

    /* map the buffer into user space as non-cached */
    vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);

    if ((ret = remap_pfn_range(vma, vma->vm_start,
                               virt_to_phys((void *)alloc_ptr) >> PAGE_SHIFT,
                               length, vma->vm_page_prot)) < 0) {
        return ret;
    }

    return 0;
}

But I am seeing a large performance drop in this case. Could you please tell me if there is anything wrong in this usage?

Thank you,
Lullaby

Hi, have you tried dma_alloc_coherent to allocate the ring buffer? This function allocates a buffer that lives in SDRAM and is not cached. The OS automatically bypasses the cache when this buffer is accessed.

What I do to send images very fast is to have two buffers that the FPGA alternates between: the CPU reads the one that is full while the FPGA writes the new image into the other. If the CPU is not fast enough, that buffer is skipped by the processor and one image is lost. In my case that is no big deal, just fewer frames per second.
In your case you can use more than two buffers, or stall the FPGA, so the processor never misses a buffer.
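
The alternating-buffer scheme above can be sketched roughly as follows. This is a single-threaded model of the bookkeeping only; the names (struct pingpong, producer_acquire and so on) are hypothetical, and in this variant a frame is dropped at the producer when no buffer is free. On real hardware the producer side would be the FPGA plus an interrupt handler, and the shared flags would need proper synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUF 2   /* two buffers as in the reply; raise it to tolerate a slower reader */

/* Hypothetical bookkeeping shared by producer (FPGA) and consumer (CPU). */
struct pingpong {
    bool full[NBUF];     /* true while a buffer holds unread data */
    size_t wr;           /* next buffer the producer will fill */
    size_t rd;           /* next buffer the consumer will read */
    unsigned dropped;    /* frames lost because no buffer was free */
};

/* Producer: return the buffer index to fill, or -1 if the frame is dropped. */
static int producer_acquire(struct pingpong *p)
{
    assert(p != NULL);
    if (p->full[p->wr]) {
        p->dropped++;            /* reader too slow: skip this frame */
        return -1;
    }
    return (int)p->wr;
}

/* Producer: mark the current buffer as full and move to the next one. */
static void producer_commit(struct pingpong *p)
{
    p->full[p->wr] = true;
    p->wr = (p->wr + 1) % NBUF;
}

/* Consumer: return the index of a full buffer, or -1 if none is ready. */
static int consumer_acquire(const struct pingpong *p)
{
    return p->full[p->rd] ? (int)p->rd : -1;
}

/* Consumer: hand the buffer back to the producer. */
static void consumer_release(struct pingpong *p)
{
    p->full[p->rd] = false;
    p->rd = (p->rd + 1) % NBUF;
}
```

To stall instead of dropping, the producer would wait until producer_acquire() stops returning -1 rather than counting a lost frame.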

Hope it helps,
regards

Hi all,

We are waiting for any help from you. We are almost blocked now and are not able to solve our issues with the APIs we found.

Please just point us to the cache invalidation APIs that can be used on the Altera Cyclone V SOC running Linux.
Or do we need to configure the ACP (which we just read about in the manual)? Which approach is more effective?
Could you please share your thoughts on this? Are there any examples we can refer to for either case?

Thank you,
Lullaby

You don't need to bypass the cache yourself. The OS does it for you automatically when you use dma_alloc_coherent in your driver. Use that function to get a buffer. You pass the physical address to the hardware, and you use the virtual address inside the driver.
It is better to use this kind of buffer than a cacheable one (accessed through the ACP), because with large data sizes (above 128 kB or so) accessing SDRAM directly is faster than going through the ACP.
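
A minimal sketch of the workflow described in this reply, assuming a driver that has a valid struct device (for example &pdev->dev from a platform driver's probe()). The names ring_dev, ring_cpu, ring_dma, RING_BYTES and the three functions are hypothetical. This is kernel code meant to be built as part of a driver, not a standalone program, and it has not been tested on Cyclone V. It uses dma_mmap_coherent() for the user-space mapping, because the address returned by dma_alloc_coherent() is not guaranteed to be one that virt_to_phys() can translate.

```c
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/mm.h>

#define RING_BYTES (16 * PAGE_SIZE)   /* hypothetical ring buffer size */

static struct device *ring_dev;   /* device that owns the buffer */
static void *ring_cpu;            /* kernel virtual address, used inside the driver */
static dma_addr_t ring_dma;       /* bus address, programmed into the FPGA */

/* Call from probe() with the driver's device, e.g. &pdev->dev. */
static int ring_alloc(struct device *dev)
{
	ring_cpu = dma_alloc_coherent(dev, RING_BYTES, &ring_dma, GFP_KERNEL);
	if (!ring_cpu)
		return -ENOMEM;
	ring_dev = dev;
	return 0;
}

/* .mmap handler: expose the coherent buffer to user space. */
static int ring_mmap(struct file *filp, struct vm_area_struct *vma)
{
	size_t length = vma->vm_end - vma->vm_start;

	if (length > RING_BYTES)
		return -EINVAL;

	/* The DMA layer chooses the page protection and the pages to map. */
	return dma_mmap_coherent(ring_dev, vma, ring_cpu, ring_dma, length);
}

/* Call from remove() to release the buffer. */
static void ring_free(void)
{
	dma_free_coherent(ring_dev, RING_BYTES, ring_cpu, ring_dma);
}
```

The value in ring_dma is what would be written to the FPGA as the destination address, while ring_cpu is only used by the driver itself.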