Why is mmap memory access slow?

I have a design loaded in the FPGA that collects data and writes it to the upper 512M of memory, and I have reserved the top 512M of memory on my Atlas SoC board using “mem=512M”. In my application, I open a socket and then listen (error-handling code removed for clarity):

sockfd = socket(p->ai_family, p->ai_socktype, p->ai_protocol);
setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(int));
bind(sockfd, p->ai_addr, p->ai_addrlen);
listen(sockfd, BACKLOG);
new_fd = accept(sockfd, (struct sockaddr *)&their_addr, &sin_size);

I then gain access to the upper memory:

devmem_fd = open("/dev/mem", O_RDWR | O_SYNC);
upper_map = (uint32_t*)mmap(NULL, 0x20000000, PROT_READ | PROT_WRITE, MAP_SHARED, devmem_fd, 0x20000000);
upper_map_ptr = (uint32_t*)(upper_map);
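
Since the error handling was stripped from the snippets above, here is a minimal sketch of the same mapping with the checks left in. The helper name `map_region` is made up for illustration and is not part of my application; it simply wraps the `open`/`mmap` pair and returns NULL on failure.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of `path`, starting at byte
 * `offset`, as a shared read/write region. Returns NULL on failure. */
static uint32_t *map_region(const char *path, off_t offset, size_t len)
{
    int fd = open(path, O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, offset);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED) {
        perror("mmap");
        return NULL;
    }
    return (uint32_t *)p;
}
```

With this helper, the mapping above would read `upper_map_ptr = map_region("/dev/mem", 0x20000000, 0x20000000);` followed by a NULL check.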

When a message is received to send data:

Offset = 0;
SdramReadLen = 0x8000000;               /* lengths are in 32-bit words */
DataStruct[0].bDataReady = 0;
DataStruct[1].bDataReady = 0;
NextBuffer = 0;
while (SdramReadLen != 0) {
  BlockLen = SdramReadLen;
  if (BlockLen > 4000000) BlockLen = 4000000;
  /* wait until the next buffer has been sent, then refill it */
  while (1) {
    if (!DataStruct[NextBuffer].bDataReady) {
      memcpy(&DataStruct[NextBuffer].DataBuffer[0], &upper_map_ptr[Offset], BlockLen * 4);
      DataStruct[NextBuffer].DataLength = BlockLen * 4;
      DataStruct[NextBuffer].bDataReady = 1;
      NextBuffer = (NextBuffer + 1) & 1;
      break;
    } else {
      usleep(100);
    }
  }
  Offset += BlockLen;
  SdramReadLen -= BlockLen;
}

To actually send the data, I have the following thread:

void *thread_function(void *ptr) {
  int NextBuffer;

  NextBuffer = 0;
  while (1) {
    if (DataStruct[NextBuffer].bDataReady) {
      write(new_fd, &DataStruct[NextBuffer].DataBuffer[0], DataStruct[NextBuffer].DataLength);
      DataStruct[NextBuffer].bDataReady = 0;
      NextBuffer = (NextBuffer + 1) & 1;
    } else {
      usleep(1000);
    }
  }
  return NULL;
}

By copying the data to the double buffer and using the thread, the data is transmitted at about 500 Mb/s (roughly 60 MB/s). As a test, if I remove the memcpy and just send whatever data is in DataStruct, the data is transferred at 800 Mb/s. So the memcpy is slowing me down.

Then it hit me… Why move data from upper memory to data buffers? Why not just send data directly from upper memory? So I deleted all my fancy code and the thread, and replaced it with:

write(new_fd, upper_map_ptr, 0x20000000);

And guess what my transmission speed was… 250 Mb/s!

Even if I broke the data into 16 MB chunks, I still only got 250 Mb/s. Why? That is the question.
memcpy can transfer the data from upper memory to an internal buffer at 950 Mb/s, so why can’t the write call?
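
For reference, the chunked variant can be sketched roughly as below. The helper name `send_all` and its `chunk` parameter are made up for illustration and this is not my exact code; the loop only adds handling for short writes and `EINTR`, since `write` on a socket may accept fewer bytes than requested.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `len` bytes from `buf` to `fd` in pieces of
 * at most `chunk` bytes, continuing after short writes.
 * Returns 0 on success, -1 on error. */
static int send_all(int fd, const void *buf, size_t len, size_t chunk)
{
    const uint8_t *p = buf;

    if (chunk == 0)
        return -1;

    while (len > 0) {
        size_t want = len < chunk ? len : chunk;
        ssize_t n = write(fd, p, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        p += n;                     /* advance past the bytes accepted */
        len -= (size_t)n;
    }
    return 0;
}
```

The 16 MB case corresponds to `send_all(new_fd, upper_map_ptr, 0x20000000, 16u << 20);`.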

Oh, and one other thing I found… The speed of the memcpy changes based on the lower 4 bits of the addresses being copied. After doing some testing, I found the following combinations (low 4 bits of the destination and source addresses, in hex) to be approximately 2.5 times slower:

DES=0 SRC=0
DES=0 SRC=8
DES=8 SRC=0
DES=8 SRC=8
DES=4 SRC=4
DES=4 SRC=C
DES=C SRC=4
DES=C SRC=C
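
In every pair listed above, both offsets are multiples of 4 and agree modulo 8. A test of this kind can be sketched as follows. This is a simplified illustration rather than my exact test code: the function name `time_memcpy` is made up, and it copies between ordinary heap buffers, so to reproduce the numbers above the source pointer would have to be replaced with the /dev/mem mapping.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: time `reps` memcpy calls of `len` bytes where the
 * low 4 address bits of destination and source are `dst_off` and `src_off`.
 * Returns elapsed seconds, or -1.0 on bad arguments or failure. */
static double time_memcpy(unsigned dst_off, unsigned src_off,
                          size_t len, int reps)
{
    void *dst_mem = NULL, *src_mem = NULL;
    /* Calling through a volatile pointer keeps the compiler from
     * dropping the repeated, identical copies. */
    void *(*volatile copy)(void *, const void *, size_t) = memcpy;
    struct timespec t0, t1;
    double secs;

    if (dst_off > 15 || src_off > 15 || len == 0 || reps <= 0)
        return -1.0;
    /* 16-byte aligned bases, so the offsets set the low 4 bits exactly */
    if (posix_memalign(&dst_mem, 16, len + 16) != 0)
        return -1.0;
    if (posix_memalign(&src_mem, 16, len + 16) != 0) {
        free(dst_mem);
        return -1.0;
    }
    memset(src_mem, 0xA5, len + 16);
    memset(dst_mem, 0, len + 16);

    uint8_t *dst = (uint8_t *)dst_mem + dst_off;
    const uint8_t *src = (const uint8_t *)src_mem + src_off;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        copy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    if (memcmp(dst, src, len) != 0)     /* sanity check on the copy */
        secs = -1.0;

    free(dst_mem);
    free(src_mem);
    return secs;
}
```

Calling `time_memcpy(d, s, len, reps)` for every `d` and `s` from 0 to 15 and dividing `len * reps` by the result gives a copy rate for each combination.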

Any ideas?