nycerrrrrrrrrr 3 hours ago [-]
This might be orthogonal to the TLB miss overhead you found, but have you looked at using P2PDMA to transfer directly from the NVMe SSDs to the NIC? Not sure how the CRC calculation would play into that.
serious_angel 5 hours ago [-]
> Yes, it is true. NOT A SINGLE LINE!!
Considering that you "do not write a single line" and that the article, likely slop, is missing the actual script used for the benchmarking, it's impossible to know what benchmarking was actually done or what requirements it was based on.
> But digging into those principles has always been one of my small obsessions as a programmer.
No, it hasn't been, I believe. It doesn't show. What shows is that the article was written by your agent. The only part you did was name the flamegraph files, where there is a typo in the title.
Why do you ask someone else to invest their precious, finite lifetime in this generative output, surrounded by lies and fuss, indicating an infantile attitude towards technology, possibly highlighting the fact that you still have no idea how wonderful and miraculous these technologies are...
You are not a developer, or if you are, please do consider your future of dependency on LLM-vendors, your lack of experience and in-depth knowledge, anxiety, accountability, and self-confidence.
I hope you'll reconsider the time you waste on sloppery slops instead of actually reading about the technologies and discussing subjects with accountable professionals to learn from and discover together... yet indeed... currently, you chose a lone life of generative output built on robbed articles, like yours, now defaced in the datasets of trained LLMs sold to you by vendors for money... yet the actual genius people who are in the datasets of models are now unknown... The dear sorrow you could not care less about...
Regardless, you do you, and I wish you safety, stability, and peace...
MasterScrat 4 hours ago [-]
This sounds like a strong statement with little backing. The author does infra at DeepSeek if his LinkedIn is to be trusted, and is the author of Foyer.
ozgrakkurt 4 hours ago [-]
If you have moderate knowledge of the subject and read the blog post, it is obvious that it is good quality.
flipped 2 hours ago [-]
[flagged]
jeffbee 5 hours ago [-]
It seems that you could have reached this conclusion faster by elaborating on your use of the profiler. Don't assume that cycles are spent on instructions. Look at your IPC and drill down into what CPU-bound means for your workload. In your case I think a standard top-down analysis would have made the virtual memory management cost jump right out.
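To make the IPC and top-down suggestion concrete, here is a minimal sketch of the level-1 breakdown as it is commonly published for 4-wide Intel cores. Everything in it is an assumption for illustration: the field names are descriptive rather than exact perf event names, the pipeline width of 4 depends on the microarchitecture, and the counter values are made up, not measurements from the post.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Raw counter readings; field names are illustrative, not exact event names."""
    cycles: int
    instructions: int
    uops_issued: int
    uops_retired_slots: int
    idq_uops_not_delivered: int
    recovery_cycles: int


def ipc(c: Counters) -> float:
    """Instructions retired per core cycle."""
    return c.instructions / c.cycles


def top_down_level1(c: Counters, width: int = 4) -> dict:
    """Split all issue slots into the four level-1 top-down categories.

    `width` is the assumed number of issue slots per cycle.
    """
    slots = width * c.cycles
    frontend = c.idq_uops_not_delivered / slots
    bad_spec = (c.uops_issued - c.uops_retired_slots
                + width * c.recovery_cycles) / slots
    retiring = c.uops_retired_slots / slots
    # Whatever is left over was stalled in the backend (memory or core).
    backend = 1.0 - (frontend + bad_spec + retiring)
    return {
        "frontend_bound": frontend,
        "bad_speculation": bad_spec,
        "retiring": retiring,
        "backend_bound": backend,
    }


# Hypothetical readings for a workload that looks "CPU-bound" in top.
sample = Counters(
    cycles=1_000_000,
    instructions=600_000,
    uops_issued=720_000,
    uops_retired_slots=700_000,
    idq_uops_not_delivered=200_000,
    recovery_cycles=5_000,
)

print(f"IPC: {ipc(sample):.2f}")
for name, share in top_down_level1(sample).items():
    print(f"{name:>16}: {share:.1%}")
```

A low IPC combined with a dominant backend-bound share is the signal to drill further into memory-side causes such as TLB misses, rather than assuming the cycles are going to arithmetic.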
MrCroxx 4 days ago [-]
Author here. This post is a write-up of a performance-debugging rabbit hole I hit while trying to saturate NICs with NVMe reads using io_uring and RDMA.
The short version: READ_FIXED fixed the obvious per-I/O GUP overhead in a small demo, but the larger deployment still got stuck at roughly half of line rate. After ruling out io-wq backlog, request splitting, fd lookup, and CRC arithmetic, the actual wall turned out to be dTLB misses from scanning 1,028 KiB buffers backed by 4 KiB pages. Moving the read arena to hugepages brought the system close to NIC saturation.
The funny part is that an AI agent suggested hugepages early and got the optimization right, but its explanation was wrong. This post is mostly about reconstructing the evidence for why it worked.
I’d be very interested in feedback from people who have used AI to debug performance issues in a complex system.
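The page-count arithmetic behind the dTLB explanation above can be sketched roughly as follows. The 1,028 KiB buffer and 4 KiB page size come from the summary; the 2 MiB hugepage size is an assumption (the common x86-64 default), and `pages_touched` is a hypothetical helper, not code from the post.

```python
KIB = 1024


def pages_touched(buffer_bytes: int, page_bytes: int, offset: int = 0) -> int:
    """Count the distinct pages a linear scan of the buffer touches.

    `offset` is where the buffer starts relative to a page boundary.
    """
    if buffer_bytes <= 0:
        return 0
    first = offset // page_bytes
    last = (offset + buffer_bytes - 1) // page_bytes
    return last - first + 1


buffer_bytes = 1028 * KIB   # buffer size from the summary above
small_page = 4 * KIB        # regular page size
huge_page = 2048 * KIB      # assumed 2 MiB hugepage

# Each distinct page needs its own translation, so this is the number of
# dTLB entries one pass over a single buffer can demand.
print("4 KiB pages per buffer:", pages_touched(buffer_bytes, small_page))
print("2 MiB pages per buffer:", pages_touched(buffer_bytes, huge_page))
print("2 MiB pages if the buffer straddles a boundary:",
      pages_touched(buffer_bytes, huge_page, offset=1536 * KIB))
```

With 4 KiB pages, every buffer scan walks through hundreds of translations; with hugepages it needs one or two, which is why the same scan puts far less pressure on the dTLB.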
ozgrakkurt 4 hours ago [-]
I disagree with the AI part, because hugepages are one of the things anyone could guess would improve performance when working with a substantial amount of data.
So anyone familiar with the space could have suggested something like that without knowing the details of the problem. Hence it is not useful advice IMO.
That aside, the blog post was really cool to read and an instant favorite; I wish there were more English posts on the blog.
I especially like the hardware-limit-based expectations, the detailed measurements, and the writing style.