What's this all about?

You can do stuff in parallel from a POSIX shell, and it's super easy! I went to college for this stuff and nobody told me? I feel so cheated!

The alternative of doing multi-process or multi-thread stuff in C isn't great. First, you have to drop down into C, and you need to compile your code. Second, manually launching and waiting for child processes or importing and using pthreads is… verbose. Certainly more so than doing it from the shell.

Here, let me show you what I mean:

# do some work on each file, sequentially
for file in input*.txt; do
  do_some_work $file
done

# do some work on each file, in parallel
for file in input*.txt; do
  do_some_work $file &
done
wait

That's 5 characters' worth of difference¹! Nice!

If you have a hundred input files, this will launch a hundred background processes. At the same time. You don't have one hundred CPU cores (unless you do). If you want to be less wasteful, you use find and/or xargs.

# do some work on each file, in parallel, using 4 workers at a time
find . -name "input*.txt" | xargs -n1 -P4 do_some_work

xargs calls do_some_work <line> for each line coming in through stdin. It will launch a new process for each line, but it'll make sure you only have 4 running at a time.

find and xargs are really flexible, and you can get up to some real nonsense with them. I'm showing a very simple² example here.

The problem with all this is that you need every chunk of input to be stored in a different file. Take another look at the examples: each worker gets its data from a different file, which it gets all to itself. Maybe this is a common workload? Anyway, it seems to me that it's much more common to have a single gigantic file where all of your data comes from.

If your file is only large-ish, you can first split it into pieces, then do the for-loop or xargs thing. But splitting your file will create a second copy of it, doubling your storage requirements. If your file is really large, you may not be able to afford that! You may not even be able to afford keeping a single copy of it around!

You'd ideally want to process the data as a stream: only read data as you need it, evict it as soon as you're done with it, keep the result.

For single-threaded pipelines, this is trivial:

wget -O - https://my.example/input.txt | do_some_work

There, done. If your worker can read from stdin, this will process the entire input file as it downloads, without ever storing the whole input.

But we wanted parallelism, right?

What we want is some command (we'll call it parallel) that can split the file in flight for us, and send each piece to a different worker.

getdata | parallel -P4 do_some_work

Well, this command does exist! It's called GNU parallel, and it's an absolute beast.

parallel is a single-file Perl script that can split a workload between several processes.

…or it can send the job to be performed remotely on another computer, using nothing but an ssh connection.

…or it can send your job to several computers at once.

parallel knows how to serialize a shell function into a string and send it over the network to be remotely executed, and probably a dozen more things I haven't even thought of. parallel is written for speed and portability, and is shockingly well documented. Truly terrific software.

I could just learn and recomment GNU parallel (the free commandline cluster-computing script, holy shit I can't believe that exists) and call it a day, but there's something about it that irks me: it's not POSIX.

POSIX, or the Portable Operating System Interface… X, is a least common denominator for Unix-like systems. It's comprehensive enough that a human could conceivably do real work on a computer using only standard features, but the real goal is to make shell scripts and C programs interoperable. I don't like POSIX as a standard, and I doubt anybody does. As far as I can tell nobody bothers implementing all of it (who the hell implements sccs?), and it can be shitty in places (bash shittiness has a big overlap with sh shittiness, which is POSIX shittiness) but it's the best we've got.

I think all this parallelism magic should be possible from a POSIX standard shell. So I set out to write my own, stupid, lobotomized version of GNU parallel. For science! And stubbornness.

Turns out: there is enough stuff in the POSIX shell to get this done. It works! But it's dog slow. There's a trick to make the processing go a hell of a lot faster, but it relies on what may-or-may-not-be a Linux feature and it breaks in the dash shell (due to a bug; it works fine in bash). But it works! Let me show you how. But before I do: please don't try to use this for real; it's a research project, not production software.

Ok, what does the code do?

There are two scripts in the repo: rrobin and atomicpipes.

Both are meant to be used in basically the same way: you put them in the middle of a pipeline to speed up a slow, parallelizable step, then collect your answers on the other side.

# split data among 4 workers
data_source_cmd \
| rrobin 4 do_some_work \
| combine_answers

# ditto. -l and -m are explained later
data_source_cmd \
| atomicpipes -l10 -m12 4 do_some_work \
| combine_answers

Data coming in or out should be lines of text.

rrobin

rrobin does the obvious thing: you tell it to make four workers, it makes four pipes. Then, it reads your input and feeds it down each pipe. First line into first pipe, second line into second pipe, etc.

With this method, I managed to leverage all four of my laptop's cores to add up the numbers 1 to 10 million in twice the time as a single-threaded application.

time seq 10000000 | awk '{i+=$0} END {print i}'

# 50000005000000

# real    0m2,440s
# user    0m2,520s
# sys     0m0,095s

time seq 10000000 \
| ./rrobin 4 awk '{i+=$0} END {print i}' \
| awk '{i+=$0} END {print i}'

# 50000005000000

# real    0m5,264s
# user    0m9,625s
# sys     0m0,255s

This is technically a success! The best kind of success.

I don't know why it's so much slower.

I suspected that calling write once per input line was to blame, but when I tried batching outputs I didn't get any speed improvement.

I thought maybe GNU awk wasn't being smart about IO, so I tried switching to mawk. That did make everything faster, but single-threaded code still beat multi-threaded code.

Whatever the cause, if we were doing meatier work on each line then the overhead wouldn't matter, so this stuff is in theory useful. But I'd hardly be content with that; let's move on.

atomicpipes

GNU parallel is fast, so I decided to steal some ideas from it. In their design document (did you know they have a design document? their documentation is so good!) I found a link to the catern pipes post, which is where I got the core idea for atomicpipes.

Here's how it works: instead of doing line-by-line processing using awk, we process fixed-sized blocks using a tool written in C. Shocker.

We'll be using (maybe abusing) dd and (definitely abusing) pipe semantics.

Fixed-sized records

dd is a tool that reads some data from one file and writes it in another. That sounds terribly pointless (doesn't every cli tool do that?), but there's a twist:

dd can skip around in a file, reading and writing data just where needed
dd has support for binary files and (pertinent to our purposes) some other… oddball formats

dd gives you fine-grained control of the underlying read() and write() syscalls without dropping down to C. These are useful capabilities when doing full disk backups, fixing broken filesystems, rescuing deleted files, etc, so dd is a staple tool for power users and savvy system administrators.

So. About those oddball formats…

Apparently, back in the day, fixed-width record "databases" were a thing that people used, and dd has options to deal with them. These are now pretty much legacy code (at least, I hope they are), since an actual database will do all sorts of devious and subtle performance tricks on your behalf.

We'll write a function to pad inputs to a consistent width. This means each worker can read a fixed amount of bytes and know it's getting exactly one record, which we can also achieve with dd.

pad () {
  dd conv=block cbs=4096 obs=4096 2>/dev/null
}
unpad () {
  dd conv=unblock cbs=4096 ibs=4096 2>/dev/null
}

And now we can do stuff like this:

data_source_cmd | pad \
| {
  unpad | do_some_work | pad &
  unpad | do_some_work | pad &
  unpad | do_some_work | pad &
  unpad | do_some_work | pad &
} | unpad | combine_answers

We picked 4096 bytes as the record length because this (actually, PIPE_BUF) is the limit of atomic pipe IO.

Reading or writing pipe data is atomic if the size of data written is not greater than PIPE_BUF. This means that the data transfer seems to be an instantaneous unit, in that nothing else in the system can observe a state in which it is partially complete. Atomic I/O may not begin right away (it may need to wait for buffer space or for data), but once it does begin it finishes immediately.

Reading or writing a larger amount of data may not be atomic; for example, output data from other processes sharing the descriptor may be interspersed. Also, once PIPE_BUF characters have been written, further writes will block until some characters are read.

– https://www.gnu.org/software/libc/manual/html_node/Pipe-Atomicity.html

This is a simplification, of course. It'd be a huge waste to write packets of 4096 bytes just to send 10-byte lines. This is why atomicpipes lets you input maximum line lengths for both inputs and outputs (the -l and -m options we saw previously). If multiple lines fit in PIPE_BUF, they're packed together into a single write.

Contention

Notice how, instead of giving each worker their own input pipe, we're having them contend for access to pad's stdout. Under Linux, we seem to get sane behavior: atomic (less than PIPE_BUF) reads stay atomic; each process gets to read a full buffer, with no interleaving issues.

This is portability wrinkle number one.

POSIX says that "The behavior of multiple concurrent reads on the same pipe, FIFO, or terminal device is unspecified" (normal files have different rules).

And while in Linux we seem to get sane behavior, this doesn't seem to be specified anywhere! I tried Kerrisk's The Linux Programming Interface, no luck. I tried the Linux man pages, no luck. I tried the Linux kernel online documentation, and frankly it could be there, for how well I managed to navigate it.

Portability wrinkle number two is that redirecting output into a background process doesn't work across all shells.

echo "hello, world!" | { cat & } should send "hello, world!" to cat, which then prints it to stdout.

This behavior is mandated by POSIX, and bash handles it fine, but neither dash (Ubuntu's startup shell) nor busybox ash (Alpine Linux's default login shell, and a fairly common shell in cheap Linux-based chinese products of all stripes) do the right thing with it.

I've filed a bug on the dash mailing list, but that's probably not going anywhere.

I wrote a workaround for dash, but it looks like it runs into an unavoidable race condition, so it's commented out. The code runs fine in practice, but the main process sleeps so background jobs have a chance to start up. Ugly, and unreliable.

So atomicpipes depends on bash and Linux. It may work with other systems, or it may break on the next version of Linux.

But does it go fast?

It's a lot faster.

time seq 10000000 | awk '{i+=$0} END {print i}'

# 50000005000000

# real    0m2,440s
# user    0m2,520s
# sys     0m0,095s

time seq 10000000 \
| ./rrobin 4 awk '{i+=$0} END {print i}' \
| awk '{i+=$0} END {print i}'

# 50000005000000

# real    0m5,264s
# user    0m9,625s
# sys     0m0,255s

time seq 10000000 \
| ./atomicpipes -l8 -m14 4 awk '{i+=$0} END {print i}' \
| awk '{i+=$0} END {print i}'

# 50000005000000

# real    0m1,789s
# user    0m0,588s
# sys     0m0,491s

Finally! We're faster than single-threaded code!

And this is on a simple "add all these numbers" task, which means that on any realistic task we should see much better ratios.

There are a couple of serious limitations to this approach:

If you don't specify a maximum input length, each line is padded to PIPE_BUF bytes, which on Linux systems is 4096. Writing 4 KB of data per line of input understandably kills performance.
If input or output lines exceed PIPE_BUF bytes, writes stop being atomic and you get interleaving. You may end up with part of an input line in one process and part in another, or lines coming from different processes mixed together.
Lines that used to be adjacent in the input aren't adjacent when they get to a worker.

However, there are tasks where these kind of things aren't a big deal.

What the hell came over you?

Silly projects like this generally don't just occur to me. I stumble upon the core idea (or at least a fairly strong hint) somewhere else. Here are the texts that inspired me this time:

Drake, Adam. Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster. 2014-01-18. https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html.
Drake, Adam. Big Data, Small Machine. 2018-05-28. https://adamdrake.com/big-data-small-machine.html
https://catern.com/posts/pipes.html
https://www.gnu.org/software/parallel/parallel_design.html
https://www.gnu.org/software/parallel/parallel_alternatives.html#differences-between-xargs-and-gnu-parallel

Where did you learn this stuff?

I'm trying not to accidentally learn Linux- or bash-only idioms (when I do, it'll be on purpose, dammit!), so I mostly just read straight from the POSIX standard, which is now freely available online.

The standard is surprisingly easy to read, and often a lot nicer and more complete than man pages. It has basically no examples, though, and there are a lot of times when you want to do a really simple thing, but the obvious way to do it doesn't work, and you hate everything, and the standard is just no use at all. Greg's wiki, Unix StackExchange, and the archives for the GNU help-bash mailing list have the pragmatic details.

There are some other resources listed, but I didn't get that much use out of them. Next project, maybe.

POSIX standard at the Open Group (I tried to use only the stuff that's common to the 2024 and 2008 versions)
Greg's wiki
https://unix.stackexchange.com/
https://lists.gnu.org/archive/html/help-bash/
https://www.shellcheck.net/
Dougherty, Dale, and Arnold Robbins. Sed & Awk. 2nd ed. O'Reilly Media, 2010.
Janssens, Jeroen. Data Science at the Command Line. 2021

Ok, 7 if you count white-space. You pedant. ↩
And brittle! The examples I'm showing here break for some valid input filenames, since filenames can have all sorts of crap in them, which can cause the shell to treat them as multiple files, options, globbing patterns, new commands, etc. Malicious use of this technique is called a "shell injection attack", and preventing it is a real pain and shockingly hard to get right. ↩