Coding with Jesse

How to use Node.js Streams (And how not to!)

October 30th, 2019

When I first started to understand Node.js streams, I thought they were pretty amazing. I love JavaScript Promises, but they only resolve to one result. Streams, however, can provide a constant stream of data, as you might expect!
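
To make that contrast concrete, here's a tiny sketch using the built-in stream module (Readable.from needs a reasonably recent version of Node):

const { Readable } = require('stream');

// A Promise settles exactly once...
Promise.resolve(42).then(value => console.log(value)); // logs 42, and that's it

// ...but a stream can keep emitting data over time.
Readable.from([1, 2, 3]).on('data', chunk => console.log(chunk)); // logs 1, then 2, then 3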

Functional Reactive Programming is all the rage these days. Libraries like MobX, RxJS and Highland.js make it easy to structure your front-end application as data flowing in one direction down through a chain of pipes.

You can pipe a stream to another stream so that the output of the first becomes the input to the next. Sounds like a really neat way to structure an application, right?

I've already rewritten a lot of my JavaScript code to use Promises. Are streams the next step in the evolution? Is it time to rewrite all our applications to use Node streams? (Spoiler: NO!)

Unix pipes are the best

I love working with pipes in Linux (or Unix). It's really nice to be able to take a text file, pipe that into a command, pipe the output to another command, and pipe the output from that into a final text file.

Here's an example of using the power of pipes on the command line. It takes a text file with a list of words, sorts the list, counts how many times each word appears, then sorts the counts to show the top 5 words:

$ cat words.txt | sort | uniq -c | sort -nr | head -n5

It's not important for you to understand these commands; just understand that data comes into each command as "Standard Input" (or stdin), and the result comes out as "Standard Output" (or stdout). The output of each command becomes the input to the next command. It's a chain of pipes.

So can we use Node.js in the middle of this chain of pipes? Of course we can! And Node streams are the best way to do that.

Going down the pipe

Node.js streams are a great way to work with a massive set of data, more data than could possibly fit into memory. You can read a line of data from stdin, process that data, then write it to stdout.

For example, how would we make a Node CLI application that capitalizes text? Seems simple enough. Let's start with an application that just takes stdin and pipes directly to stdout. This code does almost nothing (similar to the cat unix command):

process.stdin.pipe(process.stdout);

Now we can start using our Node.js application in the middle of our pipeline:

$ cat words.txt | node capitalize.js | sort | uniq -c | sort -nr | head -n5

Pretty simple, right? Well, we're not doing anything useful yet. So how do we capitalize each line before we output it?

npm to the rescue

Creating our own Node streams is a bit of a pain, so there are some good libraries on npm to make this a lot easier. (I used to heavily use a package called event-stream, until a hacker snuck some code into it to steal bitcoins!)
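
To get a sense of what those libraries save us from, here's roughly what a hand-rolled transform stream looks like using the built-in stream module (just a sketch of the lower-level API, not something we'll need below):

const { Transform } = require('stream');

// The "raw" way: subclass Transform and implement _transform by hand
class Uppercase extends Transform {
    _transform(chunk, encoding, callback) {
        // naively uppercases each raw chunk; the next paragraph explains
        // why we'd rather work line by line instead
        callback(null, chunk.toString().toUpperCase());
    }
}

process.stdin.pipe(new Uppercase()).pipe(process.stdout);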

First, we'll use the split package, which gives us a stream that splits its input into lines, so that we can work with the data one line at a time. If we don't do this, we might receive chunks containing multiple lines, partial lines, or even partial Unicode characters! It's a lot safer to use split and be sure we're working with a single, complete line of text each time.

We can also use a package called through which lets us easily create a stream to process data. We can receive data from an input stream, manipulate the data, and pipe it to an output stream.

const split = require('split');
const through = require('through');

process.stdin
    .pipe(split())
    .pipe(
        through(function(line) {
            this.emit('data', line.toUpperCase());
        })
    )
    .pipe(process.stdout);

There is a bug in the code above, because the newline characters are stripped out by split, and we never add them back in. No problem, we can create as many reusable streams as we want, to split our code up.

const through = require('through');
const split = require('split');

function capitalize() {
    return through(function(data) {
        this.emit('data', data.toUpperCase());
    });
}

function join() {
    return through(function(data) {
        this.emit('data', data + '\n');
    });
}

process.stdin
    .pipe(split())
    .pipe(capitalize())
    .pipe(join())
    .pipe(process.stdout);

Isn't that lovely? Well, I used to think so. There's something satisfying about having the main flow of your application expressed through a list of chained pipes. You can pretty easily imagine your data coming in from stdin, being split into lines, capitalized, joined back into lines, and streamed to stdout.

Down the pipe, into the sewer

For a few years, I was really swept up in the idea of using streams to structure my code. Borrowing from some Functional Reactive Programming concepts, it can seem elegant to have data flowing through your application, from input to output. But does it really simplify your code? Or is it just an illusion? Do we really benefit from having all our business logic tied up in stream boilerplate?

It's worse than it looks too. What if we emit an error in the middle of our pipeline? Can we just catch the error by adding an error listener to the bottom of the pipeline?

process.stdin
    .pipe(split())
    .pipe(capitalize())
    .pipe(join())
    .pipe(process.stdout)
    .on('error', e => console.error(e)); // this only catches errors from process.stdout!

Nope! That only catches errors emitted by the very last stream in the chain, because errors don't propagate down the pipe. It's nothing like Promises, where you can chain .then calls and add a single .catch at the end to handle all the errors in between. No, you have to add an error handler after each .pipe to be sure:

process.stdin
    .pipe(split())
    .pipe(capitalize())
    .on('error', e => console.error(e))
    .pipe(join())
    .on('error', e => console.error(e))
    .pipe(process.stdout);

Yikes! If you forget to do this, you could end up with an "Unhandled stream error in pipe." error and no stack trace. Good luck trying to debug that in production!
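
For comparison, here's the Promise behaviour I was alluding to: a single .catch at the end of the chain handles an error thrown at any step along the way (just a sketch to show the contrast):

Promise.resolve('some text')
    .then(text => text.toUpperCase())
    .then(text => {
        throw new Error('something went wrong in the middle');
    })
    .then(text => text + '\n') // skipped once the error is thrown
    .catch(e => console.error(e)); // catches the error from any step above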

Conclusions and recommendations

I used to love streams but I've had a change of heart recently. Now, my advice is to use data and error listeners instead of through streams, and write to the output instead of piping. Try to keep the number of streams to a minimum, ideally just an input stream and an output stream.

Here's a different way we can write the same example from above, but without all the hassle:

const split = require('split');
const input = process.stdin.pipe(split());
const output = process.stdout;

function capitalize(line) {
    return line.toUpperCase();
}

input.on('data', line => {
    output.write(capitalize(line));
    output.write('\n');
});

input.on('error', e => console.error(e));

Notice I'm still piping stdin through split, because that part is straightforward. But after that, I'm listening to the input's data event to receive each line. Then I use write() to send the result to stdout.

And notice that my capitalize() function no longer has anything to do with streams. That means I can easily reuse it in other places where I don't want to use streams, and that's a really good thing!
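
For example, it can be called directly or unit tested with no stream plumbing at all (a quick sketch, assuming capitalize() is in scope):

const assert = require('assert');

// no streams involved
assert.strictEqual(capitalize('hello world'), 'HELLO WORLD');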

I still think Node streams are interesting but they are not the future of JavaScript. If used carefully, you can make pretty powerful command-line tools with Node.js. Just be careful not to overdo it!