ACP port setup, Linux, cache coherent from FPGA

Hello guys,

Does anybody know how to set up the ACP port so that an FPGA master can access SDRAM data with cache coherency?
I have an FPGA module that does something like a DMA transfer. It works through the FPGA-SDRAM ports, but I would like to use the FPGA-HPS bridge with the ACP port instead.
Someone told me to OR the physical RAM address of the data I want to handle with 0x80000000 (“| 0x80000000”) so the access goes through the ACP window, but after doing that I got a kernel panic.
I am not sure whether the ACP port needs some initial setup. The FPGA master module is connected through the FPGA-HPS bridge, and I have changed the AWCACHE/ARCACHE attributes of the AXI port to enable caching for both writes and reads.

I have a DE1-SoC board with Linux 3.10 LTSI, built with Yocto following a guide from rocketboards.org.

I am using some ideas from this project: http://rocketboards.org/foswiki/view/Projects/Datamover. It has an axi_conduit_merger module, which allowed me to change the AXI attributes from the Linux application.

Not sure what I am missing.
Any help would be appreciated.

Thanks.

Hi, I am working with the DE1-SoC, doing DMA transfers with the DMA Controller in the HPS.
I was able to copy between SDRAM locations using the baremetal DMA example from the Altera website. Now I have modified the example to run with the caches enabled and to copy data through the ACP using coherent transfers. The DMA works with HPS On-Chip RAM locations, so the basic configuration is fine. But when I tell the program to read from coherent memory (setting the source buffer to the source array pointer created in software + 0x80000000), it does not work properly: the DMA transfer reports no errors, but the data copied is not what is in the source buffer. It looks to me as if the access is not coherent.

What I think one has to configure is:

  1. ACP ID mapper: either dynamic or fixed ID mapping, using the ACP functions of the Altera HW library. For dynamic mapping, DYNRD and DYNWR of the ID mapper must be configured; for fixed mapping, VIDxRD and VIDxWR must be configured. I tried both:
    - dynamic: page 0, user bits 0b11111 (as page 9-34 of the Cyclone V SoC Handbook states)
    - fixed: force=1, mid=DMA ID = 0x1000001, page=0, user=0b11111

  2. ARCACHE and AWCACHE signals. I used ARCACHE[3:0]=0111, so the operation is cacheable, write-back, and read misses are allocated (see page 5-3 of the AMBA3 AXI protocol specification), and AWCACHE[3:0]=1011. (The bits that are 0 are forced by the DMAC; I cannot change them.)

Then make sure that masters on the L3 interconnect add 0x80000000 to their addresses. I think that is all.

If you solved the problem, please tell me the configuration you used for these signals. I will keep trying, and if I solve something I will post it.

Cheers

SOLVED IT!

All the key points are on page 9-29 of the Cyclone V SoC Handbook.

What I did was:

  • Switch on the caches: I used the LegUp functions (http://legup.eecg.utoronto.ca/wiki/doku.php?id=using_arm_caches).
    A small change is needed to set the S bit in the translation table descriptors that define the normal memory region attributes. The original value of the L1_NORMAL_111_11 constant in arm_cache.s is 0x00007c0e. I changed it to 0x00017c0e to set the S bit and define normal memory as shareable (coherent).

  • Initialize the ID mapper. I set dynamic ID 3 for reads and ID 4 for writes using the Altera HW library functions. The accesses go to page 0, which is where the 1 GB off-chip RAM sits on the DE1-SoC board. The ARUSER and AWUSER bits are set to 1 to perform coherent cacheable accesses (as page 9-34 of the Cyclone V SoC Handbook explains):
    const uint32_t ARUSER = 0b11111; //coherent cacheable reads
    const uint32_t AWUSER = 0b11111; //coherent cacheable writes
    status = alt_acp_id_map_dynamic_read_set(ALT_ACP_ID_OUT_DYNAM_ID_3);
    status = alt_acp_id_map_dynamic_write_set(ALT_ACP_ID_OUT_DYNAM_ID_4);
    status = alt_acp_id_map_dynamic_read_options_set(ALT_ACP_ID_MAP_PAGE_0, ARUSER);
    status = alt_acp_id_map_dynamic_write_options_set(ALT_ACP_ID_MAP_PAGE_0, AWUSER);

  • The L3 master (the DMA Controller in my case) accesses the ACP through the buffer pointer + 0x80000000. All the ARCACHE and AWCACHE bits that the DMAC lets you program are set to 1. Since I was using the Altera HW library functions for this (alt_dma_memory_to_memory from alt_dma.c), I opened the file and replaced ALT_DMA_CCR_OPT_SC_DEFAULT with 0x00003800 and ALT_DMA_CCR_OPT_DC_DEFAULT with 0x0E000000 everywhere they appear in the file. This way the accesses target write-back cacheable memory.

So, to solve yours, @norxander, check the S (shareable) bit in the memory region definitions (section B3.8 of the ARMv7-A/R Architecture Reference Manual). You should do it from kernel space (a driver); I don't think you can access that information from user space (a regular application). If it is 0, that is definitely the problem.

Hi @roberbot,

Thanks for all that information. So you used the code from LegUp and the ALT ACP functions from the Altera HW library without an operating system, right? I am using Linux kernel 3.10-LTSI and have developed a kernel driver, so I can access special registers from there. From my driver I allocate some physically contiguous memory and pass its start address to the user application, which then calls mmap to map virtual addresses onto that physical memory.

From the arm_cache.s code I see that you set the S bit in all the translation table entries, but how can I do something similar under Linux?

So where do I have to check that shareable bit?

Thanks for your help!

Hi @norxander,

Thanks for your information too. Yes, right now I am working baremetal (no OS), but in a few days I am going to try the same thing in Angstrom 12, also with Linux kernel 3.10-LTSI, so I am very interested in what you do. Most of the ACP setup should be the same in baremetal and under an OS.

I used the LegUp arm_cache.c functions to configure the processor's memory. They create 4096 pages of 1 MB, each with its own entry holding its access properties. They set the first half of the memory as normal memory and the rest as device memory. I don't really know where these entries are stored. To store each entry, LegUp uses an STR instruction (writing the new entry to the translation table); I quickly looked at page B3-1384 of the ARMv7-A/R Architecture Reference Manual. I suppose the entries are stored in the MMU because they use specific instructions to store them (plus I don't think the memory properties and access permissions are stored in memory).

What the LegUp code does for me, the OS does for you at startup. The first step would be to check whether the OS sets the S bit in the entries of the pages it creates at startup. I don't know which instruction to use to read the entries. Once we find out how to read them, if we are lucky the S bit is set in the pages of the region you want to access, and your problem is something else. Otherwise the S bit needs to be set. That is a problem, because the entries would have to be modified either by reading, modifying and rewriting the entries in the TLB (it might not work, but it looks easiest) or by modifying the OS source code and recompiling it (I know how to compile Angstrom, but I have no clue which file configures memory, and it looks hard).

Anyway, all the information seems to be in chapter B3 of the ARMv7-A/R Architecture Reference Manual. This week I will finish the baremetal work. Next week I will start working on accessing the ACP under an OS. I will post my findings.

Thanks again. Cheers

Hello @norxander and @roberbot,

This thread looks very interesting to us :smiley: I would like to know whether any further progress has been made here. In particular on the Linux front: have you been able to write the right kernel module/driver to map buffers as Outer Shareable, so that external cacheable accesses performed through the ACP are coherent for those buffers?

We are currently working on some modifications to this project: https://rocketboards.org/foswiki/Projects/Arria10PCIeRootPortWithMSI
Our modifications consist of connecting the Root Port to the FPGA2HPS port instead of FPGA2SDRAM as in the original project. Our goal is for DMA accesses from PCIe endpoints to be cache coherent with the CPUs. We are working with the Arria 10 SoC, but I believe it is very similar to your project.
Like norxander, our system is based on Linux. At this point we have routed the Root Port accesses through the FPGA2HPS port, but we see the same behavior as through FPGA2SDRAM. We have to follow the three bullets in your previous messages, but I have no real idea how to get Linux to set those shareability attributes in the MMU descriptors.

Any help would be much appreciated, since you seem to be working on the same topic.
Thank you,
Cheers !

Hi,

In the end I could not get my implementation to use the cached path.
I finished my project with that limitation (no cache coherency).
I tried implementing all the suggestions from roberbot, with no luck.
Sorry :frowning:

Hi @norxander and @Jean-Paul_PONCELET:

Hello. Yes, I was able to do it. In Linux the memory attributes are already configured correctly, so the only things needed to write into a cacheable buffer through the ACP are: activate the ACP, allocate a buffer with kmalloc() in the driver, get its physical address with virt_to_phys(), and pass the physical address + 0x80000000 to the DMA so that the access goes through the ACP.

I have done all the combinations (HPS-FPGA and FPGA-HPS, through ACP or SDRAM, baremetal and Linux). I am still working on the FPGA-HPS ones, and they are not yet uploaded to the master branch of my repo https://github.com/robertofem/CycloneVSoC-examples.
But you can find the HPS-FPGA examples there. They should work for you, since they include the whole buffer allocation process. You can start with this driver (https://github.com/robertofem/CycloneVSoC-examples/tree/soft_dma_baremetal/Linux-modules/DMA_PL330_LKM_basic), a basic driver that transfers from HPS to FPGA using the HPS DMA, where you can select ACP or SDRAM. After that driver I wrote a more complex one (https://github.com/robertofem/CycloneVSoC-examples/tree/master/Linux-modules/DMA_PL330_LKM), a generic driver to move data between a program and the FPGA. It is the same as the basic one but with a character-device interface for programs, with open(), read(), write() and close() functions.

They work perfectly. I had several problems with alignment. I don't remember exactly, but for reads from the ACP, for example, the buffer you read must be aligned to the transfer size. I also had problems with the DMA program: I had to place it in HPS On-Chip RAM (I could not put it in SDRAM), aligned to 32. So my advice is to first try a small, aligned transfer, such as 4 B, and to start with writes, which cause fewer problems than reads. Once that works, you can modify it step by step until it breaks.

Hope it helps.
Regards