How can I stream my data from Cloudant in Node.js
If you're considering hosting applications on IBM Cloud, you might be tempted to use Node.js and Cloudant in your application stack. Cloudant is one of the most commonly used NoSQL databases on IBM Cloud. It's based on Apache CouchDB and hosted in the cloud. If you're using Cloudant in your application, you should consider using streams.
Tip: for development purposes you can run a local instance of CouchDB and your app will be mostly compatible with it.
Handling big queries
When it comes to querying data from a database, the frequently spotted attitude of "let's just run this query and return the data in an API and it'll be fine" can lead to serious bottlenecks and cause trouble when the queried data turns out to be a lot bigger than anticipated. Querying hundreds of megabytes of data from a database is generally a bad idea and in most cases can and should be avoided. The memory burden and performance cost of each request can be quite high and can lead to unexpected consequences. In simple terms, every time you make a query your data needs to be stored somewhere to be processed.
What does this 'memory usage' mean anyway?
Take a typical REST API database query example from any of the tutorials available out on the inter-webs. For demonstration purposes we'll make one adjustment. Let's assume that your database has 30,000 medium-sized JSON objects and for some good reason you want to return them all at once without pagination (and not just because you made a terrible mistake when you designed the system in the first place, never thinking there would eventually be so many records in there...).
What happens in this instance is that you make a request to an API endpoint which queries a database for some records. The application waits for the database to go through its tables or collections (1st bottleneck), after which the data is sent through the network (2nd bottleneck) to the application and stored in a variable (3rd bottleneck). As this variable is stored somewhere in memory, there is a single object of 600MB of JSON data in your app, which will eventually be returned through your API (4th bottleneck). Only then can it be safely disposed of by the garbage collector. In a typical example, it's a very synchronous process. It takes some time to make the query, pass the requested records to the app, handle them in the app and eventually pass the results to an API endpoint. If you're lucky enough, this process can lead to a timeout of said API endpoint before the data starts to download through it. Before it gets to that though, your app will most likely blow up.
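To make the problem concrete, here is a minimal sketch of the buffered pattern described above. The `db` object with a promise-returning `find()` is a stand-in for whatever database client you use, and `getAllRecords` is a hypothetical name. The point is only that the whole result set sits in memory before a single byte reaches the client.

```javascript
// Buffered approach: the full result set is held in memory (first as
// objects, then again as one serialized JSON string) before it is sent.
// `db` is a stand-in for a database client whose find() returns a Promise
// resolving to an object with a `docs` array.
async function getAllRecords(db, res) {
  // 1. Wait for the database to collect every matching document.
  const result = await db.find({ selector: {} });

  // 2. Serialize everything at once; for large result sets this is
  //    where memory usage peaks.
  const body = JSON.stringify(result.docs);

  // 3. Only now does the client start receiving data.
  res.setHeader('Content-Type', 'application/json');
  res.end(body);

  // Report how many bytes had to be buffered for this one request.
  return Buffer.byteLength(body);
}
```

With 30,000 documents, steps 1 and 2 both have to finish for every request before step 3 can begin.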
Node.js streams to the rescue
According to the Node.js documentation:
A stream is an abstract interface for working with streaming data in Node.js. The stream module provides an API for implementing the stream interface. There are many stream objects provided by Node.js. For instance, a request to an HTTP server and process.stdout are both stream instances. Streams can be readable, writable, or both. All streams are instances of EventEmitter.
In other words, we can use a stream to prevent data clogging by making the data pass through the app bit by bit, instead of synchronously passing gigantic numbers of records at a time. Did I already mention it's a bad idea?
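As a rough illustration of "bit by bit", the sketch below uses only the built-in `stream` module. The names `recordSource` and `countingSink` are made up for this example: the source produces one small record at a time, and the sink just counts what passes through, so nothing ever holds the whole data set.

```javascript
const { Readable, Writable } = require('node:stream');

// A readable source that produces records one at a time instead of
// building one big array up front.
function recordSource(count) {
  function* generate() {
    for (let i = 0; i < count; i++) {
      yield JSON.stringify({ id: i }) + '\n';
    }
  }
  return Readable.from(generate());
}

// A writable sink that only counts what passes through it; a real app
// would use an HTTP response or a file here.
function countingSink() {
  const stats = { chunks: 0, bytes: 0 };
  const sink = new Writable({
    write(chunk, encoding, callback) {
      stats.chunks += 1;
      stats.bytes += chunk.length;
      callback(); // signal that we are ready for the next chunk
    },
  });
  sink.stats = stats;
  return sink;
}

// pipe() moves data chunk by chunk and pauses the source whenever the
// sink cannot keep up (backpressure).
const sink = countingSink();
recordSource(1000).pipe(sink);
sink.on('finish', () => {
  console.log(`streamed ${sink.stats.chunks} chunks, ${sink.stats.bytes} bytes`);
});
```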
Pipes
The official Cloudant Node.js library includes streaming as standard. Functions such as list() or find() have equivalents with an *AsStream suffix. Instead of a Promise they return a request object which may be piped as a stream. For example:
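As a minimal sketch, assuming `db` is a database handle obtained from the Cloudant library, listing every document and piping it to a destination might look like this. The helper name `streamAllDocs` and the `CLOUDANT_URL` environment variable are hypothetical.

```javascript
// Assumes `db` exposes listAsStream(), as a database handle from the
// Cloudant Node.js library does, e.g.:
//   const Cloudant = require('@cloudant/cloudant');
//   const cloudant = Cloudant({ url: process.env.CLOUDANT_URL }); // hypothetical env var
//   const db = cloudant.db.use('your-cloudant-db-name');
function streamAllDocs(db, destination) {
  return db
    .listAsStream({ include_docs: true })
    // No callback is available, so failures surface as 'error' events.
    .on('error', (err) => console.error('Cloudant stream failed:', err.message))
    // pipe() returns the destination, which lets callers chain further.
    .pipe(destination);
}

// Usage: streamAllDocs(db, process.stdout);
```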
Note that the stream functions take no callbacks, so event listeners must be used instead.
Below we'll build a fully fledged example to test the streams in action.
First, let's set up a simple Express app and pass along the Express response object to a newly created service. Let's call it dbService.js.
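A minimal sketch of that setup follows. It assumes Express is installed and that `dbService` is any object with a `getRecords(res)` method. The `/records` path, port 3000 and the `registerRoutes` helper are illustrative choices rather than anything the library prescribes.

```javascript
// Registers the route on any Express-like `app`. Keeping this in a
// function means the route can be exercised without starting a server.
function registerRoutes(app, dbService) {
  app.get('/records', (req, res) => {
    // Hand the response object (a writable stream) to the service,
    // which is expected to pipe the database results into it.
    dbService.getRecords(res);
  });
  return app;
}

// Wiring it up in app.js (assumes `npm install express`):
//   const express = require('express');
//   const dbService = require('./dbService'); // assumed to expose getRecords(res)
//   registerRoutes(express(), dbService).listen(3000);
```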
The application will import a dbService module and pass along the Express response object. The response object is a stream, which means you can pipe a stream of data into it. Here's the dbService in full:
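The service could be sketched roughly as follows. This is an approximation under a few assumptions: `db` is a Cloudant database handle exposing `findAsStream()`, the `createDbService` factory is a hypothetical wrapper that makes the handle injectable for testing, and the empty selector simply matches every document.

```javascript
// dbService.js
// `db` is assumed to be a Cloudant database handle, e.g.
//   cloudant.db.use('your-cloudant-db-name')
function createDbService(db) {
  function getRecords(res) {
    res.setHeader('Content-Type', 'application/json');

    db.findAsStream({ selector: {} }) // assumption: match all documents
      .on('error', (err) => {
        // A failed read may never emit 'end', so close the response
        // here to keep the API from hanging.
        console.error('Cloudant stream failed:', err.message);
        res.end();
      })
      // The read finished normally: close the response ourselves.
      .on('end', () => res.end())
      // Forward each chunk to the client as soon as it arrives.
      // `end: false` leaves closing the response to the handlers above.
      .pipe(res, { end: false });
  }

  return { getRecords };
}

// One way to export a ready-made instance from dbService.js:
//   module.exports = createDbService(cloudant.db.use('your-cloudant-db-name'));
```

Passing `{ end: false }` to `pipe()` is a design choice of this sketch: it keeps the decision about when to close the response in one place, for both the success and the failure path.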
The getRecords() function assumes there is a JSON document in your Cloudant collection* called 'your-cloudant-db-name' which looks something like this:
*Collections in Cloudant are called 'databases'.
The findAsStream() function returns a stream object which can be piped to the res object. We should also end the response once the read has finished (on the end event) as well as on an error, to prevent the API from hanging.
And that's it. It's fairly easy to take the burden off your strained application.
Leave a comment below if you like this example and let me know if there's anything you'd like me to cover next.
Cześć!