The times seem sublinear, 10k files is less than 10x 1k files.
I remember getting into a situation during the ext2 and spinning-rust days where production directories had 500k files. ls processes were slow enough to overload everything. ls -F saved me there.
And filesystems got a lot better at lots of files. What filesystem was used here?
It's interesting how well busybox fares, it's written for size not speed iirc?
> The times seem sublinear, 10k files is less than 10x 1k files
Two points are not enough to say it's sublinear. It might very well be some constant factor that becomes less and less important the bigger the linear factor becomes.
The article has data points for n=10, 100, 1,000, 10,000. Taking
(time at n=10,000 - time at n=10) / (time at n=1,000 - time at n=10) would eliminate the constant factor, and we'd expect a ratio of about 10.09x for a linear algorithm.
But for lsr, it's 9.34. The other tools have factors close to 10.09 or higher. Since ls has to sort its output (unless -F is specified), I'd not be too surprised by a little superlinearity.
Ext2 never got better with large directories even with SSDs (and that holds all the way up through ext4). The benchmarks don't include the filesystem type, which is actually extremely important when it comes to the performance of reading directories.
Lovely, I might try doing this for some other "classic" utility!
A bit off-topic too, but I'm new to Zig and curious. This here:
```
const allocator = sfb.get();
var cmd: Command = .{ .arena = allocator };
```
Does this mean that all allocations need to be written with an allocator in mind? I.e., does one have to pick an allocator for each memory allocation? Or is there a default one?
Allocator is an interface so you write library code only once, and then the caller decides which concrete implementation to use.
There are cases where you do want to change your code based on the expectation that you will be provided a special kind of allocator (e.g. arenas), but that's a more niche thing, and in any case it all comes together pretty well in practice.
Caveat emptor, I don't write Zig but followed its development closely for awhile. A core design element of zig is that you shouldn't be stuck with one particular memory model. Zig encourages passing an allocator context around, where those allocators conform to a standardized interface. That means you could pass in different allocators with different performance characteristics at runtime.
But yes, there is a default allocator, std.heap.page_allocator
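Not Zig, but the shape of the idea translated into C terms may help (my own analogy with invented names, not the actual std.mem.Allocator API): library code is written once against a small allocator "interface", and the caller decides which concrete implementation sits behind it.

```
// Rough C analogy (invented names, not Zig's std.mem.Allocator):
// library code takes an allocator "interface"; the caller picks the backing.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    void *(*alloc)(void *ctx, size_t n);
    void *ctx;
} allocator_t;

// Library code, written once against the interface.
static char *dup_string(allocator_t a, const char *s) {
    char *p = a.alloc(a.ctx, strlen(s) + 1);
    if (p) strcpy(p, s);
    return p;
}

// One possible concrete implementation: plain malloc.
static void *malloc_alloc(void *ctx, size_t n) { (void)ctx; return malloc(n); }

int main(void) {
    allocator_t heap = { malloc_alloc, NULL };  // the caller's choice
    char *copy = dup_string(heap, "hello");
    if (copy) { puts(copy); free(copy); }
    return 0;
}
```

In Zig the same split exists, except the interface (std.mem.Allocator) is part of the standard library, and things like arenas or fixed buffers are drop-in implementations of it.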
It's a shame to see uutils doing so poorly here. I feel like they're our best hope for an organization to drive this sort of core modernization forward, but 2x slower than GNU isn't a good start.
I am curious what would happen if ls and other commands were replaced using io_uring and kernel.io_uring_disabled was set to 1. Would it fall back to an older behavior or would the ability to disable it be removed?
I just realized that one could probably write a userspace io_uring emulator in a library that spawns a thread to read the ringbuffer and a worker pool of threads to do the blocking operations. You'd need to get the main software to make calls to your library instead of the io_uring syscalls, that's it; the app logic could remain the same.
Then all the software wanting to use io_uring wouldn't need to write their low-level things twice.
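A hypothetical sketch of that shape, with invented names (emu_req, emu_submit) and none of the real ring mechanics: a mutex-protected queue stands in for the submission queue, worker threads perform the blocking syscalls, and a flag stands in for the completion entry. Backpressure and shutdown are omitted.

```
// Hypothetical emulation sketch (not a real library): worker threads drain a
// "submission queue" and perform the blocking syscall on the caller's behalf.
#include <pthread.h>
#include <unistd.h>

typedef struct {
    int fd; void *buf; size_t len; off_t off;   /* request */
    ssize_t res; int done;                      /* "completion" */
} emu_req;

#define QCAP 64
static emu_req *q[QCAP];
static unsigned q_head, q_tail;
static pthread_mutex_t q_mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t q_cv = PTHREAD_COND_INITIALIZER;

/* Worker pool: pop a request, do the blocking pread, mark it complete. */
static void *emu_worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&q_mu);
        while (q_head == q_tail)
            pthread_cond_wait(&q_cv, &q_mu);
        emu_req *r = q[q_head++ % QCAP];
        pthread_mutex_unlock(&q_mu);

        r->res = pread(r->fd, r->buf, r->len, r->off);
        __atomic_store_n(&r->done, 1, __ATOMIC_RELEASE);
    }
    return NULL;
}

/* "Submit": enqueue instead of calling io_uring_enter (no full-queue check). */
static void emu_submit(emu_req *r) {
    pthread_mutex_lock(&q_mu);
    q[q_tail++ % QCAP] = r;
    pthread_cond_signal(&q_cv);
    pthread_mutex_unlock(&q_mu);
}
```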
You would have to write your IO to have a fallback. The Ghostty project uses `io_uring`, but on kernels where it isn't available it falls back to an `epoll` model. That's all handled at the library level by libxev.
There used to be lsring by Jens Axboe (author of io_uring), but it no longer exists. This is more extreme than abandoning the project. Perhaps there is some issue with using io_uring this way, perhaps vulnerabilities are exposed.
I'm sure there are uses in Bash scripts that could benefit from it, but most people would use it directly in a compiled program, I suppose, if the performance was a recurring need.
Explicit Vulnerabilities (Documented CVEs and Exploits)
These are actual discovered vulnerabilities, typically assigned CVEs and often exploited in sandbox escapes or privilege escalations:
1. CVE-2021-3491 (Kernel 5.11+)
Type: Privilege escalation
Mechanism: Failure to check CAP_SYS_ADMIN before registering io_uring restrictions allowed unprivileged users to bypass sandboxing.
Impact: Bypass of security policy mechanisms.
2. CVE-2022-29582
Type: UAF (Use-After-Free)
Mechanism: io_uring allowed certain memory structures to be freed and reused improperly.
Impact: Local privilege escalation.
3. CVE-2023-2598
Type: Race condition
Mechanism: A race in the io_uring timeout code could lead to memory corruption.
Impact: Arbitrary code execution or kernel crash.
4. CVE-2022-2602, CVE-2022-1116, etc.
Type: UAF and out-of-bounds access
Impact: Escalation from containers or sandboxed processes.
5. Exploit Tooling:
Tools like io_uring_shock and custom kernel exploits often target io_uring in container escape scenarios (esp. with Docker or LXC).
Implicit Vulnerabilities (Architectural and Latent Risks)
These are not necessarily exploitable today, but reflect deeper systemic design risks or assumptions.
1. Shared Memory Abuse
io_uring uses shared rings (memory-mapped via mmap) between kernel and user space.
Risk: If ring buffer memory management has reference count bugs, attackers could force races, data corruption, or misuse stale pointers.
2. User-Controlled Kernel Pointers
Some features allow user-specified buffers, SQEs, and CQEs to reference arbitrary memory (e.g. via IORING_OP_PROVIDE_BUFFERS, IORING_OP_MSG_RING).
Risk: Incomplete validation could allow crafting fake kernel structures or triggering speculative attacks.
3. Speculative Execution & Side Channels
Since io_uring relies on pre-submitted work queues and long-lived kernel threads, it opens timing side channels.
Risk: Predictable scheduling or timing leaks, esp. combined with hardware speculation (Spectre-class).
4. Bypassing seccomp or AppArmor Filters
io_uring operations can effectively batch or obscure syscall behavior.
Example: A program restricted from calling sendmsg() directly might still use io_uring to perform similar actions.
Risk: Policy enforcement tools become less effective, requiring explicit io_uring filtering.
5. Poor Auditability
The batched and asynchronous nature makes logging or syscall audit trails incomplete or confusing.
Risk: Harder for defenders or monitoring tools to track intent or detect misuse in real time.
6. Ring Reuse + Threaded Offload
With IORING_SETUP_SQPOLL or IORING_SETUP_IOPOLL, I/O workers can run in kernel threads detached from user context.
Risk: Desynchronized security context can lead to privileged operations escaping sandbox context (e.g., post-chroot but pre-fork).
7. File Descriptor Reuse and Lifecycle Mismatch
Some operations in io_uring rely on fixed file descriptors or registered files. Race conditions with FD reuse or closing can cause inconsistencies.
Risk: UAF, type confusion, or logic bombs triggered by kernel state confusion.
Emerging Threat Vectors
eBPF + io_uring
Some exploits chain io_uring with eBPF to do arbitrary memory reads or writes. e.g., io_uring to perform controlled allocations, then eBPF to read or write memory.
io_uring + userfaultfd
Combining userfaultfd with io_uring allows very fine-grained control over page faults during I/O — great for fuzzing, also for exploit primitives.
> 35x less system calls = others wait less for the kernel to handle their system calls
That isn't how it works. There isn't a fixed syscall budget distributed among running programs. Internally, the kernel is taking many of the same locks and resources to satisfy io_uring requests as ordinary syscall requests.
More system calls mean more overall OS overhead, e.g. more context switches, or as you say more contention on internal locks etc.
Also, more fs-related system calls mean less available kernel threads to process these system calls. E.g. XFS can parallelize mutations only up to its number of allocation groups (agcount).
> More system calls mean more overall OS overhead [than the equivalent operations performed with io_uring]
Again, this just isn't true. The same "stat" operations are being performed one way or another.
> Also, more fs-related system calls mean less available kernel threads to process these system calls.
Generally speaking sync system calls are processed in the context of the calling (user) thread. They don't consume kernel threads generally. In fact the opposite is true here -- io_uring requests are serviced by an internal kernel thread pool, so to the extent this matters, io_uring requests consume more kernel threads.
Syscalls are expensive, and their relative latency compared with the rest of the code only grows, especially in view of mitigations against cache-related and other hardware bugs.
Why isn’t it possible — or is it — to make libc just use uring instead of syscall?
Yes I know uring is an async interface, but it’s trivial to implement sync behavior on top of a single chain of async send-wait pairs, like doing a simple single threaded “conversational” implementation of a network protocol.
It wouldn’t make a difference in most individual cases but overall I wonder how big a global speed boost you’d get by removing a ton of syscalls?
Or am I failing to understand something about the performance nuances here?
You don't need to start spawning new threads to use io_uring as a backend for synchronous IO APIs. You just need to set up the rings once, then when the program does an fwrite or whatever, that gets implemented as sending a submission queue entry followed by a single io_uring_enter syscall that informs the kernel there's something in the submission queue, and using the arguments indicating that the calling process wants to block until there's something in the completion queue.
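For illustration, a minimal liburing sketch of that pattern (my own sketch, not how any libc actually does it): one SQE is prepared, a single syscall submits it and waits, and the lone CQE carries the result.

```
// Sketch: a blocking write() lookalike built on io_uring (illustration only).
#include <liburing.h>
#include <sys/types.h>

static struct io_uring ring;                     // set up once at startup

int uring_sync_init(void) {
    return io_uring_queue_init(8, &ring, 0);
}

ssize_t uring_sync_write(int fd, const void *buf, size_t len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    if (!sqe)
        return -1;
    io_uring_prep_write(sqe, fd, buf, len, -1);  // -1: use/advance file position

    // One syscall: submit the SQE and wait for one completion.
    if (io_uring_submit_and_wait(&ring, 1) < 0)
        return -1;

    struct io_uring_cqe *cqe;
    if (io_uring_wait_cqe(&ring, &cqe) < 0)
        return -1;
    ssize_t res = cqe->res;                      // bytes written, or -errno
    io_uring_cqe_seen(&ring, cqe);
    return res;
}
```

As the reply below notes, this is still one syscall per operation, so by itself it mostly moves the cost around; the win only shows up once submissions are batched or chained.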
> using the arguments indicating the calling process wants to block
Nice to know io_uring has facilities for backwards compatibility with blocking code here. But yeah, that's still a syscall, and given that the whole benefit of io_uring is in avoiding (or at least, coalescing) syscalls, I doubt having libc "just" use io_uring is going to give any tangible benefit.
Not speaking of ls which is more about metadata operations, but general file read/write workloads:
io_uring requires API changes because you don't call it like the old read(please_fill_this_buffer). You maintain a pool of buffers that belong to the ring buffer, and reads take buffers from the pool. You consume the data from the buffer and return it to the pool.
With the older style, you're required to maintain O(pending_reads) buffers. With the io_uring style, you have a pool of O(num_reads_completing_at_once) (I assume with backpressure but haven't actually checked).
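Roughly, that pool style looks like this with liburing's "provided buffers" (a sketch assuming a reasonably recent kernel; error handling and re-providing consumed buffers are omitted):

```
// Sketch: the kernel picks a buffer from a pre-registered pool for each read.
#include <liburing.h>
#include <stdio.h>

#define NBUFS  8
#define BUF_SZ 4096
#define GROUP  1

int main(void) {
    static char pool[NBUFS][BUF_SZ];
    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    // Hand the pool to the kernel as buffer group GROUP.
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_provide_buffers(sqe, pool, BUF_SZ, NBUFS, GROUP, 0);
    io_uring_submit_and_wait(&ring, 1);
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    io_uring_cqe_seen(&ring, cqe);

    // A read that names no buffer: the kernel selects one from the group.
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, 0 /* stdin */, NULL, BUF_SZ, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = GROUP;
    io_uring_submit_and_wait(&ring, 1);
    io_uring_wait_cqe(&ring, &cqe);

    if (cqe->res >= 0 && (cqe->flags & IORING_CQE_F_BUFFER)) {
        int bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;  // which buffer was used
        printf("read %d bytes into pool buffer %d\n", cqe->res, bid);
        // A real program would now re-provide buffer `bid` back to the group.
    }
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```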
In a single threaded flow your buffer pool is just the buffer you were given, and you don't return until the call completes. There are no actual concurrent calls in the ring. All you're doing is using io_uring to avoid syscall.
Other replies lead me to believe it's not worth doing though, that it would not actually save syscalls and might make things worse.
Can you use io_uring in a way that doesn't gain the benefits of using it? Yes. Does the traditional C/POSIX API force you into that pattern? Almost certainly.
In addition to sibling's concern about syscall amplification, the async just isn't useful to the application (from a latency perspective) if you just serialize a bunch of sync requests through it.
Is there any actual focus on ATProto as a decentralized protocol? So far it seems like its only purpose is building Bluesky as a centralized service, which I have no interest in at all.
Why does this require inventing lsr as an alternative to ls instead of making ls use io_uring? It seems pretty annoying to have to install replacements for the most basic command line tools. And especially in this case, where you do not even do it for additional features, just for getting the exact same thing done a bit faster.
Depending on the implementation (and I don't know which `ls` is being referred to), modifying `ls` might mean modifying an FSF project which require copyright assignment as a condition of patch submissions.
That's only the case if the author would want to upstream their changes. If they wanted to only fork ls then they would only be required to comply with the license, without assigning copyright over.
> Why does this require inventing lsr as an alternative to ls instead of making ls use io_uring?
Good luck getting that upstreamed and accepted. The more foundational the tools (and GNU coreutils definitely is foundational), the more difficult that process will be.
Releasing a standalone utility makes iteration much faster, partially because one is not bound to the release cycles of distributions.
In the history of Unix it's also a common way to propose tool replacements, for instance how `less` became `more` on most systems, or `vim` became the new `vi` which in its day became the new `ed`.
Yes and no. We don't really have the equivalent of comp.sources.unix nowadays, which is where the early versions of those occurred, and comp.sources.unix did not take just anything. Rich Salz had rules.
Plus, since I actually took stevie and screen and others from comp.sources.unix and worked on them, and wasn't able to even send my improvements to M. Salz or the original authors at all, from my country, I can attest that contributing improvements had hurdles just as large to overcome back then as there exist now. They're just different.
> instance how `less` became `more` on most systems
How `more` became `less`.
The name of 'more' was from paging - rather than having text scroll off the screen, it would show you one page, then ask if you wanted to see 'more' and scroll down.
'less' is a joke by the less authors. 'less is more' etc.
One can even get pg still, with Illumos-based systems; even though that was actually taken out of the SUS years ago. This goes to show that what's standard is not the same as what exists, of course.
> Releasing a standalone utility makes iteration much faster, partially because one is not bound to the release cycles of distributions.
which certainly is a valid way of prioritizing. Similarly, distros/users may prioritize stability, which means the theoretical improvement would now be stuck in not-used-land.
the value of software appears when it's run, not when it's written
> the value of software appears when it's run, not when it's written
Have you ever tried to contribute to open source projects?
The question was why someone writing software wouldn't take the route likely to end in rejection/failure. I don't know about you, but if I write software, I am not going to write it for a project whose managers will make it difficult for my PR to be accepted, and where it's 99% likely it never will be.
I will always contribute to the project likely to appreciate my work and incorporate it.
I'll share an anecdote: I got involved with a project, filed a couple PRs that were accepted (slowly), and then I talked about refactoring something so it could be tested better and wasn't so fragile and tightly coupled to IO. "Sounds great" was the response.
So I did the refactor. Filed a PR and asked for code review. The response was (after a long time waiting) "thanks but no, we don't want this." PR closed. No feedback, nothing.
I don't even use the software anymore. I certainly haven't tried to fix any bugs. I don't like being jerked around by management, especially when I'm doing it for free.
(For the record, I privately forked the code and run my own version that is better because by refactoring and then writing tests, I discovered a number of bugs I couldn't be arsed to file with the original project.)
> Have you ever tried to contribute to open source projects?
yes, and it was often painful enough to make me consider very well whether I want to bother contributing. I can only imagine how terrible the experience must be at a core utility such as ls.
> The question was why wouldn't someone writing software not take the route likely to end in rejection/failure
Obviously they wouldn't - in my comment I assumed that the lsr author aimed for providing a better ls for people and tried to offer a perspective with a different definition of what success is.
> I don't like being jerked around by management, especially when I'm doing it for free
I get that. The older OSS projects become, the more they fossilize too - and that makes it more annoying to contribute. But you can try to see it from the maintainers' perspective too: they have actual people relying on the program being stable and are often also not paid. No one is forcing you to contribute to their project, but if you don't want to deal with existing maintainers, you won't have their users enjoying your patchset. Know what you want to achieve and act accordingly, is all I'm trying to say.
A little offtop, but do you know a number in usecs that io_uring can save on enterprise grade servers, with 10G NICs, for socket latency overheads vs LD_PRELOAD when hardware supports that? Let's say it's Mellanox 4 or 5. My understanding is that each gives around 10us savings, maybe less. Based on some benchmarking, which was not focused on any of those explicitly but had some imprecise experiments. It also looks like they do not add up. Do you have a number based on real experience?
Author of the project here! I have a little write up on this here: https://rockorager.dev/log/lsr-ls-but-with-io-uring
Nice writeup. I suspect you're measuring the cost of abstraction. Specifically, routines that can handle lots of things (like locale based strings and utf8 character) have more things to do before they can produce results. This was something I ran into head on at Sun when we did the I18N[1] project.
In my experience there was a direct correlation between the number of different environments where a program would "just work" and its speed. The original UNIX ls(1) which had maximum sized filenames, no pesky characters allowed, all representable by 7-bit ASCII characters, and only the 12 bits of meta data that God intended[2] was really quite fast. You add things like a VFS which is mapping the source file system into the parameters of the "expected" file system that adds delay. You're mapping different character sets? adds delay. Colors for the display? Adds delay. Small costs that add up.
1: The first time I saw a long word like 'internationalization' reduced to first and last letter and the count of letters in between :-).
2: Those being Read, Write, and eXecute for user, group, and other, setuid, setgid, and 'sticky' :-)
At those time scales, you would be better off using `tim` ( https://github.com/c-blake/bu/blob/main/doc/tim.md ) than hyperfine { and not just because that is your name! Lol. That is just a happy coincidence by clipping one letter off of the word "time". :-) } even though being in Nim might make it more of a challenge.
This is fantastic stuff. I'm doing a C++ project right now that I'm doing with an eye to eventual migration in whole or in part to Zig. My little `libevring` thing is pretty young and I'd be very open to replacing it with `ourio`.
What's your feeling on having C/C++ bindings in the project as a Zig migration path for such projects?
I think exposing a C lib would be very nice. Feel free to open a discussion or issue on the Github.
(Thanks - we'll make that the main link (since it has more background info) and include the repo thread at the top as well.)
My bfs project also uses io_uring: https://github.com/tavianator/bfs/blob/main/src/ioq.c
I'm curious how lsr compares to bfs -ls for example. bfs only uses io_uring when multiple threads are enabled, but maybe it's worth using it even for bfs -j1
Oh that's cool. `find` is another tool I thought could benefit from io_uring like `ls`. I think it's definitely worth enabling io_uring for single threaded applications for the batching benefit. The kernel will still spin up a thread pool to get the work done concurrently, but you don't have to manage that in your codebase.
I did try it a while ago and it wasn't profitable, but that was before I added stat() support. Batching those is probably good
and grep / ripgrep. Or did ripgrep migrate to using io_uring already?
No, ripgrep doesn't use io_uring. Idk if it ever will.
Curious: Why? Is it not a good fit for what ripgrep does? Isn't the sort of "streaming" "line at a time" I/O that ripgrep does a good fit for async io?
For many workloads, ripgrep spends the vast majority of its time searching through files.
But more practically, it would be a terror to implement. ripgrep is built on top of platform specific standard file system APIs. io_uring would mean a whole heap of code to work with a different syscall pattern in addition to the existing code pattern for non-Linux targets.
So to even figure out whether it would be worth doing that, you would need to do a whole bunch of work just to test it. And because of my first point above, there is a hard limit on how much of an impact it could even theoretically have.
Where I would expect this to help is to batch syscalls during directory tree traversal. But I have no idea how much it would help, if at all.
How much of the speedup over GNU ls is due to lacking localization features? Your results table is pretty much consistent with my local observations: in a dir with 13k files, `ls -al` needs 33ms. But 25% of that time is spent by libc in `strcoll`. Under `LC_ALL=C` it takes just 27ms, which is getting closer to the time of your program.
I didn't include `busybox` in my initial table, so it isn't on the blog post but the repo has the data...but I am 99% sure busybox does not have locale support, so I think GNU ls without locale support would probably be closer to busybox.
Locales also bring in a lot more complicated sorting - so that could be a factor also.
I wonder how it performs against an NFS server with lots of files, especially one over a kinda-crappy connection. Putting an unreliable network service behind blocking POSIX syscalls is one of the main reasons NFS is a terrible design choice (as can be seen by anyone who's tried to ctrl+c any app that's reading from a broken NFS folder), but I wonder if io_uring mitigates the bad parts somewhat.
The designers of NFS chose to make a distributed system emulate a highly consistent and available system (a hard drive), which was (and is) a reasonable tradeoff. It didn't require every existing tool, such as ls, to deal with things like the server rebooting while listing a directory. (The original NFS protocol is stateless, so clients can survive server reboots.) What does vi do when the server hosting the file you're editing stops responding? None of these tools have that kind of error handling.
I don't know how io_uring solves this - does it return an error if the underlying NFS call times out? How long do you wait for a response before giving up and returning an error?
> The designers of NFS chose to make a distributed system emulate a highly consistent and available system (a hard drive),
> The original NFS protocol is stateless,
The protocol is, but the underlying disk isn’t.
- A stateless emulation doesn’t know of the concept of “open file”, so “open for exclusive access” isn’t possible, and ways to emulate that were bolted on.
- In a stateless system, you cannot open a scratch file for writing, delete it, and continue using it, in the expectation that it will be deleted when you're done using it (The Unix-Haters Handbook (https://web.mit.edu/~simsong/www/ugh.pdf) says there are hacks inside NFS to make this work, but that makes the protocol stateful)
> It didn't require every existing tool, such as ls, to deal with things like the server rebooting while listing a directory
But see above for an example where every tool that wants to do record locking or get exclusive access to a file has to know whether it’s writing to a NFS disk to figure out how to do that.
> The designers of NFS chose to make a distributed system emulate a highly consistent and available system (a hard drive), which was (and is) a reasonable tradeoff
I don't agree that it was a reasonable tradeoff. Making an unreliable system emulate a reliable one is the very thing I find to be a bad idea. I don't think this is unique to NFS, it applies to any network filesystem you try to present as if it's a local one.
> What does vi do when the server hosting the file you're editing stops responding? None of these tools have that kind of error handling.
That's exactly why I don't think it's a good idea to just pretend a network connection is actually a local disk. Because tools aren't set up to handle issues with it being down.
Contrast it with approaches where the client is aware of the network connection (like HTTP/GRPC/etc)... the client can decide for itself how long it should retry failed requests, whether it should bubble up failures to the caller, or work "offline" until it gets an opportunity to resync, etc. With NFS the syscall just hangs forever by default.
Distributed systems are hard, and NFS (and other similar network filesystems) just pretend it isn't hard at all, which is great until something goes wrong, and then the abstraction leaks.
(Also I didn't say io_uring solves this, but I'm curious as to whether its performance would be any better than blocking calls.)
I think it highly depends on your architecture and the scale you are pushing.
The other far edge is S3, where appending has only become possible within the last few years, as far as I can tell. Meanwhile, editing a file requires a full download/upload; not great either.
For the NFS case, I cannot say it's my favorite, but it is certainly easy to set up and run on your own. Obviously a rebooting server may cause certain issues during the unavailability, but the NFS server should be highly available. With NFSv4.1, you may use UDP as the primary transport, which allows you to swap/switch servers pretty quickly (given you connect to a DNS/FQDN rather than the IP address).
Another case is plug and play: with NFS, UNIX permissions, ownership/group details, the execute bit, etc. are all preserved nicely...
Besides, you could always have a "cache" server locally. Similar to GDrive or OneDrive clients, constantly syncing back and forth, caching the data locally, using file-handles to determine locks. Works pretty well _at scale_ (ie. many concurrent users in the case of GDrive or OneDrive).
> Making an unreliable system emulate a reliable one is the very thing I find to be a bad idea.
It's the only idea though. We don't know how to make reliable systems, other than by cobbling together a lot of unreliable ones and hoping the emergent behaviour is more reliable than that of the parts.
I think a difference in magnitude turns into a difference in kind. There's lots of systems where the unreliability of the underlying parts is low enough that it can be a simple matter of retrying quickly once or twice (bit flips in ECC RAM), and others where at least the unreliability is well-known enough that software has all learned to work around the leaky abstraction (like TCP. Although QUIC and other protocols show that maybe it's better to move the unreliability up a layer for more intelligent handling of the edge cases.)
But the unreliability of "the network" compared to "my SATA port" is a whole different ballgame. Filesystems are designed for the latter, and when software uses filesystems it generally expects a reliability guarantee that "the network" can't really provide. Especially on mobile internet, wifi, etc... And that's not even getting into places where NFS just can't do things that local filesystems can do (has anyone figured out how to make inotify/fsevents work?) and all the software that subtly breaks because of it.
I think "making an unreliable system emulate a reliable one = bad" is too simplistic a heuristic.
We do this all the time with things like ECC and retransmissions and packet recovery. This intrinsically is not bad at all, the question is: what abstraction does this expose to the higher layer.
With TCP the abstraction we expect is "pretty robust but has tail latencies, do not use for automotive networks or avionics" and that works out well. The right question IMO is always "what kind of tail behaviors does this expose, and are the consumers of the abstraction prepared for them".
Do you have similar thoughts about iscsi?
> as can be seen by anyone who's tried to ctrl+c any app that's reading from a broken NFS folder
Theoretically "intr" mounts allowed signals to interrupt operations waiting on a hung remote server, but Linux removed the option long ago[1] (FreeBSD still supports it)[2]. "soft" might be the only workaround on Linux.
[1]: https://man7.org/linux/man-pages/man5/nfs.5.html
[2]: https://man.freebsd.org/cgi/man.cgi?query=mount_nfs&sektion=...
Samba too
Kind of fascinating that slashing syscalls by ~35x (versus the `ls -la` benchmark) is "only" worth a 2x speedup
These syscalls are mostly through VDSO, so not very costly
The only VDSO-capable calls are clock_gettime, getcpu, getrandom, gettimeofday, and time. (Other architectures have some more, mostly related to signals and CPU cache flushing.)
I vaguely remember some benchmark I read a while back for some other io_uring project, and it suggested that io_uring syscalls are more expensive than whatever the other syscalls were that it was being used to replace. It's still a big improvement, even if not as big as you'd hope.
I wish I could remember the post, but I've had that impression in the back of my mind ever since.
Really interesting; the difference is real. Though I would just hope that some better coloring support could be added, because I have "eza --icons=always -1" set as my ls command and it looks really good, whereas when I use lsr -1, yes, the fundamental thing is the same; the difference is in the coloring.
Yes lsr also colors the output but it doesn't know as many things as eza does
For example .opus will show up as a music icon and with the right color (green-ish in my case?) in eza whereas it would be shown up as any normal file in lsr.
Really no regrets though, it's quite easy to patch I think, but yes, this is rock solid and really fast I must admit.
Can you please create more such things for cat and other system utilities too?
Also love that it's using tangled.sh, which is using atproto, kinda interesting too.
I also like that it's written in Zig, which imo feels way easier for me to touch as a novice than Rust (sry rustaceans)
“bat” is a pretty good modern “cat”
https://github.com/sharkdp/bat
So I just ran strace -c cat <file> and strace -c bat <file>.
bat did 445 syscalls, cat did 48.
Sure, bat does beautify some things a lot, but still, I just wanted to mention this: I want something that can use io_uring for cat too, I think.
Like, what's the least number of syscalls that you can use for something like cat?
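On the "least number of syscalls" question, a plain large-buffer loop is already close to the floor: open, about one read/write pair per megabyte, close (a sketch, not how cat or bat is actually implemented; io_uring could batch or chain these but can't remove the copies):

```
// Minimal cat sketch: open + (filesize / 1 MiB) read/write pairs + close.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[1 << 20];            // 1 MiB per read keeps the syscall count low
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        if (write(STDOUT_FILENO, buf, (size_t)n) != n) { perror("write"); return 1; }
    }
    close(fd);
    return n < 0 ? 1 : 0;
}
```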
As for coloring support, I think the best way would be to implement LS_COLORS / dircolors. My GNU ls looks nice.
This seems more interesting as a demonstration of the amortized performance increase you'd expect from using io_uring, or as a tutorial for using it. I don't understand why I'd switch from using something like eza. If I'm listing 10,000 files the difference is between 40ms and 20ms. I absolutely would not notice that for a single invocation of the command.
Yeah, I wrote this as a fun little experiment to learn more io_uring usage. The practical savings of using this are tiny, maybe 5 seconds over your entire life. That wasn't the point haha
I'd be curious to know if this helps on supercomputers, which are notorious for frequently hanging for a few seconds on an ls -l.
It could, but important to keep in mind that the filesystem architecture there is also very different with a parallel filesystem with disaggregated data and metadata.
When you run `ls -l` you could potentially be enumerating a directory with one file per rank, or worse, one file per particle or something. You could try making the read fast, but I also think that it makes no sense to have that many files: you can do things to reduce the number of files on disk. Also many are trying to push for distributed object stores instead of parallel filesystems... fun space.
It's a very cool experiment. Just wanted to perhaps steer the conversation towards those things rather than whether or not this was a good ls replacement because like you say that feels like it was missing the point
Well I have a directory with a couple million JSON files and ls/du take minutes.
Most of the coreutils are not fast enough to actually utilize modern SSDs.
What’s the filesystem type? Ext4 suffers terrible lookup performance with large directories, while xfs absolutely flies.
Yup, default ext4 and most files are <4KB, so it's extra bad.
Thanks for the comment, didn't know that!
Love it.
I'm trying to understand why all command line tools don't use io_uring.
As an example, all my NVMe drives on USB 3.2 Gen 2 only reach 740 MB/s peak.
If I use tools with aio or io_uring I get 1005 MB/s.
I know I may not be copying many files simultaneously every time, but the queue length strategies and the fewer locks also help I guess.
Probably the historical preference for portability without a bunch of #ifdefs means platform- and version-specific stuff is very late to get adopted. Though, at this point, the benefit of portability across various posixy platforms is much lower.
Has anyone written an io_uring "polyfill" library with fallback to standard posix-y IO? It could presumably be done via background worker threads - at a perf cost.
Seems like a huge lift, since io_uring is an ever-growing set of interfaces that is encompassing more and more of the kernel surface area. Also, the problem tends to not necessarily be that the io_uring interface isn't available at compile time, but that a) the system you distribute to has a kernel with it disabled, or you don't have permission to use it, meaning you need to do LD_PRELOAD magic or use a framework, or b) the kernel you're using supports some of the interfaces you're trying to use but not all. Not sure how you solve that one without using a framework.
But I agree. It would be cool if it was transparent, but this is actually what a bunch of io-uring runtimes do, using epoll as a fallback (eg in Rust monoio)
You can just ask io_uring what commands you have available to you. Though the way of the background thread should be readily available/usable by just indirectly calling the syscall (-helper) and replacing it with a futex-based handshake/wrapper. If you're not using the backend-ordering-imposing link bit, you could probably even use minor futex trickery to dispatch multiple background threads to snatch up from the submission queue in a "grab one at a time" fashion.
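For reference, liburing exposes that probing directly; something along these lines (sketch):

```
// Sketch: ask the kernel which io_uring opcodes it supports, so a fallback
// path can be chosen per operation rather than all-or-nothing.
#include <liburing.h>
#include <stdio.h>

int main(void) {
    struct io_uring_probe *probe = io_uring_get_probe();
    if (!probe) {
        puts("io_uring unavailable -> use the plain syscall path");
        return 0;
    }
    printf("OPENAT: %s\n", io_uring_opcode_supported(probe, IORING_OP_OPENAT) ? "yes" : "no");
    printf("STATX:  %s\n", io_uring_opcode_supported(probe, IORING_OP_STATX) ? "yes" : "no");
    printf("READ:   %s\n", io_uring_opcode_supported(probe, IORING_OP_READ) ? "yes" : "no");
    io_uring_free_probe(probe);
    return 0;
}
```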
io_uring is an asynchronous interface, and that requires an event-based architecture to use it effectively. But many command-line tools are still written in a straightforward sequential style. If C had async or a similar mechanism to pretend to do async programming sequentially, it would be easier to port. But without that, a very significant refactoring is necessary.
Besides, io_uring is not yet stable, and who knows, maybe in 10 years it will be replaced by yet another mechanism to take advantage of even newer hardware. So simply waiting for io_uring to prove it is here to stay is a very viable strategy. Besides, in 10 years we may have tools/AI that will do the rewrite automatically...
> If C had async or a similar mechanism to pretend to do async programming sequentially, it would be easier to port.
The *context() family of formerly-POSIX functions (clownishly deprecated as “use pthreads instead”) is essentially a full implementation of stackful coroutines. Even the arguable design botch of them preserving the signal mask (the reason why they aren’t the go-to option even on Linux) is theoretically fixable on the libc level without system calls, it’s just a lot of work and very few can be bothered to do signals well.
As far as stackless coroutines, there’s a wide variety of libraries used in embedded systems and such (see the recent discussion[1] for some links), which are by necessity awkward enough that I don’t see any of them becoming broadly accepted. There were also a number of language extensions, among which I’d single out AC[2] (from the Barrelfish project) and CPC[3]. I’d love for, say, CPC to catch on, but it’s been over a decade now.
[1] https://news.ycombinator.com/item?id=44546640
[2] https://users.soe.ucsc.edu/~abadi/Papers/acasync.pdf
[3] https://www.irif.fr/~jch/research/cpc-2012.pdf
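For anyone who hasn't seen them, the core of the *context() API fits in a few lines (a minimal sketch that ignores the signal-mask overhead mentioned above):

```
// Minimal stackful-coroutine sketch with getcontext/makecontext/swapcontext.
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];

static void coro(void) {
    puts("coro: step 1");
    swapcontext(&coro_ctx, &main_ctx);   // "yield" back to main
    puts("coro: step 2");
}                                        // returning resumes uc_link (main_ctx)

int main(void) {
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, coro, 0);

    swapcontext(&main_ctx, &coro_ctx);   // run until the coroutine yields
    puts("main: coro yielded");
    swapcontext(&main_ctx, &coro_ctx);   // run it to completion
    puts("main: done");
    return 0;
}
```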
iirc io_uring also had some pretty significant security issues early on (a couple of years ago). Those should be fixed by now, but that probably dampened adoption as well.
Last I checked it's blocked by most container runtimes exactly because of the security problems, and Google blocked io_uring across all their services. I've not checked recently if that's still the case, but https://security.googleblog.com/2023/06/learnings-from-kctf-... has some background.
Not just years ago. io_uring has been a continuous parade of security problems, including a high-severity one that wasn't fixed until a few months ago. Many large organizations have patched it out of their kernels on security grounds, which is one of the reasons it suffers from poor adoption.
> I'm trying to understand why all command line tools don't use io_uring.
Because it's fairly new. The coreutils package which contains the ls command (and the three earlier packages which were merged to create it) is decades old; io_uring appeared much later. It will take time for the "shared ring buffer" style of system call to win over traditional synchronous system calls.
io_uring is a security nightmare.
I updated the Wikipedia article on io_uring to dispute that.
How so?
This is a good read on the topic: https://chomp.ie/Blog+Posts/Put+an+io_uring+on+it+-+Exploiti...
https://security.googleblog.com/2023/06/learnings-from-kctf-... - Has some interesting information on that topic.
You give a process direct access to a piece of kernel memory. There's a reason why there is separation. That's all.
Most of the security concerns with io_uring that I've seen aren't related to the shared buffers at all but simply stem from the fact that io_uring is a mechanism to instruct the kernel to do stuff without making system calls, so security measures that focus on what system calls a process is allowed to do are ineffective.
This isn't the issue; it's relatively easy to safely share some ring buffers. The issue was/is that io_uring is rapidly growing the equivalent of ~all historical Linux syscall interfaces and sometimes comparable security measures were missed on the new interfaces. (Also, stuff like seccomp filters on syscalls are kind of meaningless for io_uring.)
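A tiny sketch of that last point, assuming libseccomp: the filter below denies openat(2), but an IORING_OP_OPENAT submitted via io_uring_enter(2) never passes through the per-syscall check at all, so the policy is effectively bypassed.
```
/* Sketch (assumes libseccomp): default-allow, but deny openat(2).
 * A plain open() fails with EPERM; the same open done as an
 * IORING_OP_OPENAT goes through io_uring_enter(2), which this
 * filter does not inspect. */
#include <seccomp.h>
#include <errno.h>

int deny_openat(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);   /* allow by default */
    if (!ctx)
        return -ENOMEM;
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(openat), 0);
    return seccomp_load(ctx);                              /* filter is now live */
}
```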
...don't you supply the memory in the submission queue? or do you mean the queues themselves?
io_uring is very recent
One reason is so that they work in all Linux environments rather than just bleeding-edge installs from the last couple of years.
That's a great speed boost. What tools are these?
Poe's law hits again.
> I have no idea what lsd is doing. I haven’t read the source code, but from viewing it’s strace, it is calling clock_gettime around 5 times per file. Why? I don’t know. Maybe it’s doing internal timing of steps along the way?
Maybe calculating "X minutes/hours/days/weeks ago" thing for each timestamp? (access, create, modify, ...). Could just be an old artifact of another library function...
This shouldn't be an actual syscall these days; it should be handled by the vDSO (`man 7 vdso`). Maybe Zig doesn't use that, though.
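For what it's worth, the difference in plain C (whether the vDSO path is taken depends on the clock and architecture, and whether Zig's standard library uses it is a separate question I haven't checked):
```
/* The libc wrapper normally resolves clock_gettime through the vDSO, so no
 * syscall shows up in strace; a raw syscall() always traps into the kernel. */
#define _GNU_SOURCE
#include <time.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);              /* usually vDSO: cheap */
    syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &ts); /* always a real syscall */
    return 0;
}
```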
I find it funny that there are icons for .mjs and .cjs file extensions but not .c, .h, .sh
io_uring doesn't support getdents though, so the primary benefit is bulk statting (ls -l). It'd be nice if we could have a getdents in flight while processing the results of the previous one.
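A rough sketch of that bulk-stat pattern with liburing (error handling omitted, one batch only): the directory listing itself is still a synchronous readdir/getdents, but the per-entry statx calls go through the ring in a single submit.
```
#define _GNU_SOURCE
#include <liburing.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

#define BATCH 64

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : ".";
    DIR *d = opendir(path);            /* readdir -> getdents64, still synchronous */
    struct io_uring ring;
    io_uring_queue_init(BATCH, &ring, 0);

    struct statx stx[BATCH];
    char names[BATCH][256];
    struct dirent *de;
    unsigned n = 0;

    /* Queue one IORING_OP_STATX per entry instead of one statx(2) syscall each. */
    while (n < BATCH && (de = readdir(d)) != NULL) {
        snprintf(names[n], sizeof names[n], "%s", de->d_name);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_statx(sqe, dirfd(d), names[n], 0, STATX_BASIC_STATS, &stx[n]);
        sqe->user_data = n;
        n++;
    }

    io_uring_submit_and_wait(&ring, n);  /* one syscall for the whole batch */

    for (unsigned i = 0; i < n; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res == 0)
            printf("%-24s %llu bytes\n", names[cqe->user_data],
                   (unsigned long long)stx[cqe->user_data].stx_size);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    closedir(d);
    return 0;
}
```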
POSIX adopting NFS's "readdirplus" operation (getdents + stat) could negate some of io_uring's advantage here, too.
but then someone wants statx...
The times seem sublinear, 10k files is less than 10x 1k files.
I remember getting in to a situation during the ext2 and spinning rust days where production directories had 500k files. ls processes were slow enough to overload everything. ls -F saved me there.
And filesystems got a lot better at lots of files. What filesystem was used here?
It's interesting how well busybox fares, it's written for size not speed iirc?
> The times seem sublinear, 10k files is less than 10x 1k files
Two points are not enough to say it's sublinear. It might very well be some constant factor that becomes less and less important the bigger the linear factor becomes.
Or in other words: 10000*n + C < 10*(1000*n + C)
The article has data points for n = 10, 100, 1,000, and 10,000. Taking (T(10,000) - T(10)) / (T(1,000) - T(10)) would eliminate the constant factor, and for a linear algorithm we'd expect that ratio to be (10,000 - 10) / (1,000 - 10), about 10.09.
But for lsr, it's 9.34. The other tools have factors close to 10.09 or higher. Since ls has to sort its output (unless -f is specified), I'd not be too surprised by a little superlinearity.
https://docs.google.com/spreadsheets/d/1EAYua3B3UeTGBtAejPw2...
Ext2 never got better with large directories even with SSDs (this includes up to ext4). The benchmarks don’t include the filesystem type, which is actually extremely important when it comes to the performance of reading directories.
Lovely, I might try doing this for some other "classic" utility!
A bit off-topic too, but I'm new to Zig and curious. This here:
```
const allocator = sfb.get();
```
means that all allocations need to be written with an allocator in mind? I.e. one has to pick an allocator for each memory allocation? Or is there a default one?
Allocator is an interface, so you write library code only once, and then the caller decides which concrete implementation to use.
There's cases where you do want to change your code based on the expectation that you will be provided a special kind of allocator (e.g. arenas), but that's a more niche thing and in any case it all comes together pretty well in practice.
Caveat emptor, I don't write Zig but followed its development closely for awhile. A core design element of zig is that you shouldn't be stuck with one particular memory model. Zig encourages passing an allocator context around, where those allocators conform to a standardized interface. That means you could pass in different allocators with different performance characteristics at runtime.
But yes, there is a default allocator, std.heap.page_allocator
> you shouldn't be stuck with one particular memory model
Nit: an allocator is not a "memory model", and I very much want the memory model to not change under my feet.
> Zig encourages passing an allocator context around, where those allocators conform to a standardized interface.
In libraries. If you're just writing a final product, it's totally fine to pick one and use it everywhere.
> std.heap.page_allocator
I'd strongly recommend against using this allocator as the "default"; it will take a trip to kernel-land on each allocation.
std.heap.smp_allocator
You should basically only use the page allocator if you're writing another allocator.
It's a shame to see uutils doing so poorly here. I feel like they're our best hope for an organization to drive this sort of core modernization forward, but 2x slower than GNU isn't a good start.
I am curious what would happen if ls and other commands were replaced using io_uring and kernel.io_uring_disabled was set to 1. Would it fall back to an older behavior or would the ability to disable it be removed?
I just realized that one could probably write a userspace io_uring emulator in a library that spawns a thread to read the ringbuffer and a worker pool of threads to do the blocking operations. You'd need to get the main software to make calls to your library instead of the io_uring syscalls, that's it; the app logic could remain the same.
Then all the software wanting to use io_uring wouldn't need to write their low-level things twice.
I'm about to start something like this targeting epoll, poll, dispatch_io, and maybe kqueue this weekend.
You would have to write your IO to have a fallback. The Ghostty project uses `io_uring`, but on kernels where it isn't available it falls back to an `epoll` model. That's all handled at the library level by libxev.
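A minimal sketch of that kind of probe, assuming liburing: io_uring_queue_init() returns -ENOSYS on kernels without io_uring and -EPERM when it is disabled (e.g. kernel.io_uring_disabled or a container/seccomp policy), so the application can decide at startup which backend to use.
```
#include <liburing.h>
#include <stdbool.h>

/* Returns true if io_uring can actually be set up in this environment;
 * otherwise the caller falls back to an epoll/blocking code path. */
static bool io_uring_usable(void)
{
    struct io_uring ring;
    int ret = io_uring_queue_init(8, &ring, 0);
    if (ret < 0)
        return false;          /* -ENOSYS, -EPERM, -ENOMEM, ... */
    io_uring_queue_exit(&ring);
    return true;
}
```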
Love the idea and execution, don't love the misplaced apo'strophe's.
Oh no - where at?
Technically, the first, third, and fifth occurrence of "it's" should be "its". The dog chased its tail.
I didn't notice when I read the article though. The original commenter is being pedantic.
I've been playing around with io_uring for a while.
Still, I have yet to come across tests that simulate a typical real-life application workload.
I've heard of fio but have yet to check how exactly it works and whether it can simulate a real-life application workload.
what a "real life application workload" looks like is entirely dependent on your use case, but fio is very widely used in the storage industry
it's a good first approximation to test the cartesian product of
- sequential/random
- reads/writes
- in arbitrary sizes
- with arbitrarily many workers
- with many different backends to perform such i/o including io_uring
and its reporting is solid and thorough
implementing the same for your specific workload is often not trivial at all
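For illustration, a tiny fio job file along those lines (the directory, sizes, and job counts here are placeholders, not recommendations):
```
; random 4k reads, 4 workers, io_uring backend, direct I/O
[global]
ioengine=io_uring
direct=1
runtime=30
time_based
group_reporting

[randread-4k]
directory=/mnt/test
rw=randread
bs=4k
iodepth=32
size=1g
numjobs=4
```
Run it with `fio job.fio`; swapping ioengine between io_uring, libaio, and psync is an easy way to compare backends on the same workload.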
There used to be lsring by Jens Axboe (author of io_uring), but it no longer exists. This is more extreme than abandoning the project. Perhaps there is some issue with using io_uring this way, perhaps vulnerabilities are exposed.
> Perhaps there is some issue with using io_uring this way, perhaps vulnerabilities are exposed.
... no. It's just not interesting or particularly valuable to optimize ls, and Jens probably just used it as a demo and didn't want to keep it around.
I'm sure there are uses in Bash scripts that could benefit from it, but most people would use it directly in a compiled program, I suppose, if the performance was a recurring need.
Explicit vulnerabilities (documented CVEs and exploits). These are actual discovered vulnerabilities, typically assigned CVEs and often exploited in sandbox escapes or privilege escalations:
1. CVE-2021-3491 (kernel 5.11+)
2. CVE-2022-29582
3. CVE-2023-2598
4. CVE-2022-2602, CVE-2022-1116, etc.
5. Exploit tooling
Implicit vulnerabilities (architectural and latent risks). These are not necessarily exploitable today, but reflect deeper systemic design risks or assumptions:
1. Shared memory abuse
The link isn't working for me. For those who were able to see it: does it improve anything by using that instead of what ls does now??
70% faster, but more importantly, 35x fewer syscalls.
Why do you say more importantly? The time is all that matters, I think.
70% faster = you wait less
35x fewer system calls = others wait less for the kernel to handle their system calls
> 35x fewer system calls = others wait less for the kernel to handle their system calls
That isn't how it works. There isn't a fixed syscall budget distributed among running programs. Internally, the kernel is taking many of the same locks and resources to satisfy io_uring requests as ordinary syscall requests.
More system calls mean more overall OS overhead, e.g. more context switches, or as you say more contention on internal locks etc.
Also, more fs-related system calls mean fewer kernel threads available to process them, e.g. XFS can parallelize mutations only up to its number of allocation groups (agcount)
> More system calls mean more overall OS overhead [than the equivalent operations performed with io_uring]
Again, this just isn't true. The same "stat" operations are being performed one way or another.
> Also, more fs-related system calls mean fewer kernel threads available to process them.
Generally speaking sync system calls are processed in the context of the calling (user) thread. They don't consume kernel threads generally. In fact the opposite is true here -- io_uring requests are serviced by an internal kernel thread pool, so to the extent this matters, io_uring requests consume more kernel threads.
> Again, this just isn't true.
Again, it just is true.
More fs-related operations mean fewer kthreads available for others. More syscalls mean more OS overhead. It's that simple.
Is there a noticeable benefit of this huge syscall reduction?
Yes, I just checked after installing strace.
strace -c ls gave me this
100.00 0.002709 13 198 5 total
strace -c eza gave me this
100.00 0.006125 12 476 48 total
strace -c lsr gave me this
100.00 0.001277 33 38 total
So, looking at the number of syscalls in the calls column:
198 : ls
476 : eza
38 : lsr
A meaningful difference indeed!
That's just observing there is a difference, not explaining why that's a good thing.
Syscalls are expensive, and their latency relative to the rest of the code only grows, especially in view of mitigations against cache-related and other hardware bugs.
It improves the latency of ls calls.
Hm interesting, it worked for me.
Link doesn’t work
Hm, well, I've replied to another comment too, but the link is working fine for me.
Currently downloading zig to build it.
Why isn’t it possible — or is it — to make libc just use uring instead of syscall?
Yes I know uring is an async interface, but it’s trivial to implement sync behavior on top of a single chain of async send-wait pairs, like doing a simple single threaded “conversational” implementation of a network protocol.
It wouldn’t make a difference in most individual cases but overall I wonder how big a global speed boost you’d get by removing a ton of syscalls?
Or am I failing to understand something about the performance nuances here?
In order to make this work, libc would have to:
- Start some sort of async executor thread to service the io_uring requests/responses
- Make it so every call to "normal" syscalls causes the calling thread to sleep until the result is available (that's 1 syscall)
- When the executor thread gets a result, have it wake up the original thread (that's another syscall)
So you're basically turning 1 syscall into 2 in order to emulate the legacy syscalls.
io_uring only makes sense if you're already async. Emulating sync on top of async is nearly always a terrible idea.
You don't need to start spawning new threads to use io_uring as a backend for synchronous IO APIs. You just need to set up the rings once, then when the program does an fwrite or whatever, that gets implemented as sending a submission queue entry followed by a single io_uring_enter syscall that informs the kernel there's something in the submission queue, and using the arguments indicating that the calling process wants to block until there's something in the completion queue.
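Roughly like this, assuming liburing and a ring that was initialized once at startup (a sketch, not a drop-in libc replacement):
```
#include <liburing.h>
#include <errno.h>
#include <unistd.h>

/* Blocking read() implemented over io_uring: queue one SQE, then make a
 * single io_uring_enter() (via submit_and_wait) that both submits it and
 * blocks until the completion is posted. */
static ssize_t uring_read(struct io_uring *ring, int fd, void *buf,
                          size_t len, off_t off)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (!sqe)
        return -EAGAIN;                 /* SQ full; real code would flush first */

    io_uring_prep_read(sqe, fd, buf, len, off);
    io_uring_submit_and_wait(ring, 1);  /* the one syscall: submit + wait */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);      /* completion is already there */
    ssize_t res = cqe->res;             /* bytes read, or -errno */
    io_uring_cqe_seen(ring, cqe);
    return res;
}
```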
> using the arguments indicating the calling process wants to block
Nice to know io_uring has facilities for backwards compatibility with blocking code here. But yeah, that's still a syscall, and given that the whole benefit of io_uring is in avoiding (or at least, coalescing) syscalls, I doubt having libc "just" use io_uring is going to give any tangible benefit.
Not speaking of ls which is more about metadata operations, but general file read/write workloads:
io_uring requires API changes because you don't call it like the old read(please_fill_this_buffer). You maintain a pool of buffers that belongs to the ring, and reads take buffers from the pool. You consume the data from a buffer and return it to the pool.
With the older style, you're required to maintain O(pending_reads) buffers. With the io_uring style, you have a pool of O(num_reads_completing_at_once) (I assume with backpressure, but I haven't actually checked).
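A condensed sketch of that provided-buffer flow with liburing (assumes kernel support for IORING_OP_PROVIDE_BUFFERS; newer kernels also offer ring-mapped buffer rings, not shown):
```
#include <liburing.h>
#include <stdlib.h>

#define NBUFS 16
#define BUFSZ 4096
#define BGID  0          /* buffer group id */

static char *pool;

/* Hand a pool of buffers to the kernel, associated with buffer group 0. */
static void setup_pool(struct io_uring *ring)
{
    pool = malloc(NBUFS * BUFSZ);
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_provide_buffers(sqe, pool, BUFSZ, NBUFS, BGID, 0);
    io_uring_submit_and_wait(ring, 1);
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    io_uring_cqe_seen(ring, cqe);
}

/* Reads are submitted with no buffer; the kernel picks one from the pool. */
static void queue_read(struct io_uring *ring, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, NULL, BUFSZ, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = BGID;
}

/* On completion, the chosen buffer id is (cqe->flags >> IORING_CQE_BUFFER_SHIFT)
 * when IORING_CQE_F_BUFFER is set; the data lives at pool + id * BUFSZ, and the
 * buffer should be re-provided once consumed. */
```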
In a single threaded flow your buffer pool is just the buffer you were given, and you don't return until the call completes. There are no actual concurrent calls in the ring. All you're doing is using io_uring to avoid syscall.
Other replies lead me to believe it's not worth doing though, that it would not actually save syscalls and might make things worse.
Can you use io_uring in a way that doesn't gain the benefits of using it? Yes. Does the traditional C/POSIX API force you into that pattern? Almost certainly.
In addition to sibling's concern about syscall amplification, the async just isn't useful to the application (from a latency perspective) if you just serialize a bunch of sync requests through it.
This was more interesting for the tangled.sh platform it's hosted on. Wasn't aware of that!
Same! Just signed up and will be following tangled and this repo. I like how tangled is built on atproto (bluesky).
Is there any actual focus on ATProto as a decentralized protocol? So far it seems like its only purpose is building Bluesky as a centralized service, which I have no interest in at all.
Doesn't the existence of tangled answer your question?
One past thread so far:
Show HN: Tangled – Git collaboration platform built on atproto - https://news.ycombinator.com/item?id=43234544 - March 2025 (15 comments)
Why does this require inventing lsr as an alternative to ls instead of making ls use io_uring? It seems pretty annoying to have to install replacements for the most basic command line tools. And especially in this case, where you do not even do it for additional features, just for getting the exact same thing done a bit faster.
You don't have to install it. You can modify ls yourself too.
The author answered on lobster thread [1]. This is more of an io_uring exercise than an attempt to replace ls.
[1] https://lobste.rs/s/mklbl9/lsr_ls_with_io_uring
`ls` is in C, `lsr` is in Zig. The `lsr` programmer probably doesn't want to make new code in C.
In addition, the author might not want to sign away their rights to the FSF.
What on earth are you talking about? Why would this be the case?
Depending on the implementation (and I don't know which `ls` is being referred to), modifying `ls` might mean modifying an FSF project, which requires copyright assignment as a condition of accepting patches.
That's only the case if the author would want to upstream their changes. If they wanted to only fork ls then they would only be required to comply with the license, without assigning copyright over.
That may be the case but then why bother modifying ls when you can just write your own exactly as you want it?
Are you unfamiliar with contributing to GNU projects (ls is part of GNU coreutils)?
https://www.gnu.org/prep/maintain/maintain.html#Copyright-Pa...
> Why does this require inventing lsr as an alternative to ls instead of making ls use io_uring?
Good luck getting that upstreamed and accepted. The more foundational the tools (and GNU coreutils definitely is foundational), the more difficult that process will be.
Releasing a standalone utility makes iteration much faster, partially because one is not bound to the release cycles of distributions.
In the history of Unix it's also a common way to propose tool replacements, for instance how `less` became `more` on most systems, or `vim` became the new `vi`, which in its day became the new `ed`.
Yes and no. We don't really have the equivalent of comp.sources.unix nowadays, which is where the early versions of those occurred, and comp.sources.unix did not take just anything. Rich Salz had rules.
Plus, since I actually took stevie and screen and others from comp.sources.unix and worked on them, and wasn't able to send my improvements to M. Salz or the original authors at all from my country, I can attest that contributing improvements had hurdles just as large back then as exist now. They're just different.
> instance how `less` became `more` on most systems
How `more` became `less`.
The name of 'more' was from paging - rather than having text scroll off the screen, it would show you one page, then ask if you wanted to see 'more' and scroll down.
'less' is a joke by the less authors. 'less is more' etc.
For a while there was a less competitor named most.
It hasn't gone away.
* https://freshports.org/sysutils/most/
* https://ftp.netbsd.org/pub/pkgsrc/current/pkgsrc/misc/most/i...
* https://packages.debian.org/sid/most
One can even get pg still, on Illumos-based systems, even though that was actually taken out of the SUS years ago. This goes to show that what's standard is not the same as what exists, of course.
* https://illumos.org/man/1/pg
* https://pubs.opengroup.org/onlinepubs/9699919799.2008edition...
> Releasing a standalone utility makes iteration much faster, partially because one is not bound to the release cycles of distributions.
which certainly is a valid way of prioritizing. similarly, distros/users may prioritize stability, which means the theoretical improvement would now be stuck in not-used-land. the value of software appears when it's run, not when it's written
> the value of software appears when it's run, not when it's written
Have you ever tried to contribute to open source projects?
The question was why someone writing software would take the route likely to end in rejection/failure. I don't know about you, but if I write software, I'm not going to write it for a project whose maintainers will make it difficult for my PR to be accepted, when it's 99% likely it never will be.
I will always contribute to the project likely to appreciate my work and incorporate it.
I'll share an anecdote: I got involved with a project, filed a couple PRs that were accepted (slowly), and then I talked about refactoring something so it could be tested better and wasn't so fragile and tightly coupled to IO. "Sounds great" was the response.
So I did the refactor. Filed a PR and asked for code review. The response was (after a long time waiting) "thanks but no, we don't want this." PR closed. No feedback, nothing.
I don't even use the software anymore. I certainly haven't tried to fix any bugs. I don't like being jerked around by management, especially when I'm doing it for free.
(For the record, I privately forked the code and run my own version that is better because by refactoring and then writing tests, I discovered a number of bugs I couldn't be arsed to file with the original project.)
> Have you ever tried to contribute to open source projects?
yes, and it was often painful enough to make me consider very carefully whether I want to bother contributing. I can only imagine how terrible the experience must be at a core utility such as ls.
> The question was why wouldn't someone writing software not take the route likely to end in rejection/failure
Obviously they wouldn't - in my comment I assumed that the lsr author aimed for providing a better ls for people and tried to offer a perspective with a different definition of what success is.
> I don't like being jerked around by management, especially when I'm doing it for free
I get that. The older OSS projects become, the more they fossilize too, and that makes it more annoying to contribute. But you can try to see it from the maintainers' perspective too: they have actual people relying on the program being stable, and they are often also not paid. No one is forcing you to contribute to their project, but if you don't want to deal with existing maintainers, you won't have their users enjoying your patchset. Know what you want to achieve and act accordingly, is all I'm trying to say.
> The older OSS projects become, the more they fossilize too - and that makes it more annoying to contribute.
Newer ones can be just as braindead, if they came out of some commercial entity. CLAs and such.