We have a problem with SSD devices connected to the Altera PCIe root port. The SSD is detected and can even be mounted. Read/write performance is lower than expected for an SSD, but it works. After starting multiple parallel read/write jobs, however, the device becomes unavailable, and only a reboot or a power cycle brings it back.
We can reproduce the problem on a Cyclone V development board using the reference design and kernel from here. We have tested it with a SATA/AHCI drive (Plextor M6e) and an NVMe drive (Samsung 950 Pro).
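For reference, the failing load is nothing special: a handful of parallel writer/reader processes on the mounted drive is enough. The sketch below is only illustrative; the mount point, job count, and file sizes are placeholders, not the exact workload we ran.

import os
import multiprocessing

MOUNT = "/mnt/ssd"                 # assumed mount point of the drive under test
JOB_COUNT = 4                      # number of parallel read/write jobs
FILE_SIZE = 256 * 1024 * 1024      # 256 MiB per job
BLOCK = 1024 * 1024                # 1 MiB per I/O

def rw_job(idx):
    path = os.path.join(MOUNT, "stress_%d.bin" % idx)
    buf = os.urandom(BLOCK)
    # sequential write phase, flushed to the device
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE // BLOCK):
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())
    # sequential read-back phase
    with open(path, "rb") as f:
        while f.read(BLOCK):
            pass

if __name__ == "__main__":
    procs = [multiprocessing.Process(target=rw_job, args=(i,)) for i in range(JOB_COUNT)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

With a single job the transfer usually completes; with several jobs running concurrently the drive drops out as described below.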
For the Plextor the error message is:
ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/00:00:00:87:1a/01:00:06:00:00/40 tag 0 ncq 131072 out
res 40/00:00:e1:01:80/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
Resetting does not work:
ata1: hard resetting link
ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1: limiting SATA link speed to 3.0 Gbps
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata1.00: qc timeout (cmd 0xec)
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
ata1.00: revalidation failed (errno=-5)
ata1.00: disabled
ata1.00: device reported invalid CHS sector 0
ata1: hard resetting link
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
ata1: EH complete
sd 0:0:0:0: [sda] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
sd 0:0:0:0: [sda] tag#16 CDB: opcode=0x2a 2a 00 06 1a 81 c0 00 05 40 00
The error for the NVMe drive is reported as controller fatal status (CFS), but a controller reset does not bring the device back (see the CSTS check sketched after the log):
nvme 0000:01:00.0: Cancelling I/O 718 QID 1
nvme 0000:01:00.0: Cancelling I/O 719 QID 1
------------[ cut here ]------------
WARNING: CPU: 1 PID: 614 at lib/percpu-refcount.c:324 percpu_ref_reinit+0xb8/0xc4()
Modules linked in: nvme gpio_altera altera_sysid altera_rpdma(O)
CPU: 1 PID: 614 Comm: kworker/u4:2 Tainted: G W O 4.1.0-00203-g6de99ee-dirty #1
Hardware name: Altera SOCFPGA
Workqueue: nvme nvme_reset_workfn [nvme]
[<c0018898>] (unwind_backtrace) from [<c00139e8>] (show_stack+0x20/0x24)
[<c00139e8>] (show_stack) from [<c057fcb4>] (dump_stack+0x90/0xa0)
[<c057fcb4>] (dump_stack) from [<c0027260>] (warn_slowpath_common+0x94/0xc4)
[<c0027260>] (warn_slowpath_common) from [<c002734c>] (warn_slowpath_null+0x2c/0x34)
[<c002734c>] (warn_slowpath_null) from [<c02e2998>] (percpu_ref_reinit+0xb8/0xc4)
[<c02e2998>] (percpu_ref_reinit) from [<c02c0154>] (blk_mq_unfreeze_queue+0x64/0xac)
[<c02c0154>] (blk_mq_unfreeze_queue) from [<bf020fa0>] (nvme_dev_resume+0x74/0xf0 [nvme])
[<bf020fa0>] (nvme_dev_resume [nvme]) from [<bf0210bc>] (nvme_reset_failed_dev+0x30/0x130 [nvme])
[<bf0210bc>] (nvme_reset_failed_dev [nvme]) from [<bf01d0cc>] (nvme_reset_workfn+0x1c/0x20 [nvme])
[<bf01d0cc>] (nvme_reset_workfn [nvme]) from [<c003da48>] (process_one_work+0x15c/0x3dc)
[<c003da48>] (process_one_work) from [<c003dd1c>] (worker_thread+0x54/0x4f4)
[<c003dd1c>] (worker_thread) from [<c0043470>] (kthread+0xec/0x104)
[<c0043470>] (kthread) from [<c000faa8>] (ret_from_fork+0x14/0x2c)
---[ end trace 52c30efa12417f12 ]---
nvme 0000:01:00.0: Failed status: 3, reset controller
nvme 0000:01:00.0: Cancelling I/O 225 QID 2
...
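For what it's worth, the CFS condition can also be read directly from the controller's CSTS register (offset 0x1C in BAR0, bit 1) once the failure has occurred. The sketch below is only an assumption about how one might check it: it takes the BDF 0000:01:00.0 from the log, and it requires root and a kernel that allows mmap of the PCI resource file in sysfs.

import mmap
import struct

# BAR0 of the NVMe controller, exposed by sysfs (BDF taken from the log above)
BAR0 = "/sys/bus/pci/devices/0000:01:00.0/resource0"

with open(BAR0, "r+b") as f:
    regs = mmap.mmap(f.fileno(), 4096)
    # CSTS lives at offset 0x1C: bit 0 = RDY, bit 1 = CFS (controller fatal status)
    csts = struct.unpack_from("<I", regs, 0x1C)[0]
    print("CSTS = 0x%08x, RDY=%d, CFS=%d" % (csts, csts & 1, (csts >> 1) & 1))
    regs.close()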
As mentioned, after a reboot both SSD drives are functional again. Could these problems be caused by the PCIe driver, and what could go wrong in the PCIe communication that makes the SSD drives stop working?