Streaming data from Cloudant

While considering hosting applications on IBM Cloud, you might be tempted to use Node.js and Cloudant in your application stack. Cloudant is one of the most commonly used NoSQL databases on IBM Cloud. It's based on Apache CouchDB and hosted in the cloud. If you're using Cloudant in your application, you should consider using streams.

Tip: for development purposes you can run a local instance of CouchDB and your app will be mostly compatible with it.

Handling big queries

When it comes to querying data from a database, the frequently spotted attitude of "let's just run this query and return the data in an API and it'll be fine" can lead to serious bottlenecks and cause trouble when the queried data turns out to be a lot bigger than anticipated. Querying hundreds of megabytes of data from a database is generally a bad idea and in most cases can and should be avoided. The memory burden and performance cost of each request can be quite high and can lead to unexpected consequences. In simple terms, every time you make a query your data needs to be stored somewhere to be processed.

What does this 'memory usage' mean anyway?

Take a typical REST API database query example from any of the tutorials out on the inter-webs. For demonstration purposes we'll make one adjustment. Let's assume that your database has 30,000 medium-sized JSON objects and for some good reason you want to return them all at once without pagination (and not just because you made a terrible mistake when you designed the system in the first place, never thinking there would eventually be so many records in there...).

What happens in this instance is that you make a request to an API endpoint which queries a database for some records. The application waits for the database to go through its tables or collections (1st bottleneck), after which the data is sent through the network (2nd bottleneck) to the application and stored in a variable (3rd bottleneck). As this variable is stored somewhere in memory, there is a single object of 600MB of JSON data (roughly 20KB per record) in your app which will eventually be returned through your API (4th bottleneck). Only then can it be safely disposed of by the garbage collector. In a typical example, it's a strictly sequential process. It takes some time to make the query, pass the requested records to the app, handle them in the app and eventually pass the results to an API endpoint. If you're lucky enough, this process can lead to a timeout of said API endpoint before the data starts to download through it. Before it gets to that though, your app will most likely blow up.
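To put a number on that, here's a rough back-of-the-envelope sketch. The 20 KB average document size is my assumption, picked so the result matches the 600MB figure above. The parsed objects in a real app will usually take more memory than the raw JSON, so treat this as a lower bound rather than a measurement.

```javascript
// Rough estimate of how much raw JSON a fully buffered query keeps in memory.
// avgDocBytes is an assumed average; measure your own documents to get a real value.
function estimateBufferedBytes(docCount, avgDocBytes) {
  return docCount * avgDocBytes;
}

const totalBytes = estimateBufferedBytes(30000, 20 * 1000);
const totalMegabytes = totalBytes / 1e6;

console.log(`${totalMegabytes} MB held in memory per request`);
```

Keep in mind this is per request. A handful of users hitting the endpoint at the same time multiplies the figure accordingly.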

Node.js streams to the rescue

According to the Node.js documentation:

A stream is an abstract interface for working with streaming data in Node.js. The stream module provides an API for implementing the stream interface. There are many stream objects provided by Node.js. For instance, a request to an HTTP server and process.stdout are both stream instances. Streams can be readable, writable, or both. All streams are instances of EventEmitter.

In other words, we can use a stream to prevent data clogging by making the data pass through the app bit by bit, instead of synchronously passing gigantic numbers of records at a time. Did I already mention it's a bad idea?
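To make "bit by bit" a little more concrete, here's a minimal sketch that uses only the built-in stream module, no Cloudant involved. The generateRecords() and countRecords() helpers are hypothetical names I made up for the demo. The point is that the consumer only ever holds the record it is currently working on.

```javascript
const { Readable, Writable, pipeline } = require('stream');

// Hypothetical record source: yields one small document at a time
// instead of building the whole result set up front.
function* generateRecords(count) {
  for (let i = 0; i < count; i += 1) {
    yield { id: i, something: 'here' };
  }
}

// Consume the records one by one and resolve with how many were seen.
function countRecords(count) {
  return new Promise((resolve, reject) => {
    let seen = 0;
    const sink = new Writable({
      objectMode: true,
      write(record, _encoding, callback) {
        seen += 1; // process the record here, then let it go
        callback();
      },
    });

    pipeline(Readable.from(generateRecords(count)), sink, (error) => {
      if (error) {
        reject(error);
      } else {
        resolve(seen);
      }
    });
  });
}

countRecords(30000).then((seen) => console.log(`Processed ${seen} records`));
```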

Pipes

The official Cloudant Node.js library includes streaming as standard. Functions such as list() or find() have equivalents with an *AsStream suffix, namely listAsStream() and findAsStream(). Instead of a Promise they return a request object which may be piped as a stream. For example:

cloudant.db.listAsStream()
  .on('error', function(error) {
    console.log('ERROR', error);
  })
  .on('end', function() {
    // The 'end' event carries no arguments
    console.log('DONE');
  })
  .pipe(process.stdout);

*AsStream functions example using the @cloudant/cloudant npm module

Note that there are no callbacks when using the streams functions and event listeners must be used instead.
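If you'd like to see those events fire without a database at hand, here's a small sketch using a plain Node.js readable stream as a stand-in. I'm assuming the object returned by the *AsStream functions emits 'data', 'error' and 'end' the way a standard readable does. The JSON fragments are made-up sample chunks.

```javascript
const { Readable } = require('stream');

// Stand-in for the stream returned by a *AsStream function.
// The three strings imitate a response body arriving in pieces.
const source = Readable.from(['{"docs":[', '{"something":"here"}', ']}']);

const chunks = [];

source
  .on('data', (chunk) => chunks.push(chunk)) // called once per chunk
  .on('error', (error) => console.error('ERROR', error)) // failures land here
  .on('end', () => {
    // 'end' takes no arguments; it only signals that no more data will come
    console.log('DONE', chunks.join(''));
  });
```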

Below we'll build a fully fledged example to test the streams in action.

First, let's set up a simple Express app and pass along the Express response object to a newly created service. Let's call it dbService.js.

const express = require('express');
const dbService = require('./dbService'); // We will create a service next

const app = express();

app.get('/', (req, res) => {
  dbService.getRecords(res); // Passing the express res object to dbService
});

app.listen(3000).on('listening', () => console.log('Listening on port 3000'));

Pretty self-explanatory: a simple Express server in your entry point, index.js

The application will import a dbService module and pass along the Express response object. The response object is a writable stream, which means you can pipe a stream of data into it. Here's the dbService in full:

const Cloudant = require('@cloudant/cloudant');
const cloudant = Cloudant('https://username-bluemix:password@username-bluemix.cloudantnosqldb.appdomain.cloud');

module.exports = {
  getRecords(res) {
    const db = cloudant.use('your-cloudant-db-name');

    const querySelector = {
      selector: {
        something: {
          $eq: 'here',
        }
      }
    };

    db.findAsStream(querySelector)
      .on('error', (error) => {
        console.error('ERROR', error);
        // Make sure you end your request on error
        // as otherwise the API will hang
        if (!res.headersSent) {
          res.status(500);
        }
        res.end('Error finding data');
      })
      .on('end', () => {
        // 'end' carries no arguments; failures arrive via 'error'
        console.log('DONE');
        res.end();
      })
      .pipe(res);
  },
};

dbService.js module handling the database connection through streams

The getRecords() function assumes there is a JSON document in your Cloudant collection* called 'your-cloudant-db-name' which looks something like this:

{
  "something": "here"
}

The selector queries for documents with an attribute 'something' equal to the word 'here'.

*Collections in Cloudant are called 'databases'.
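For reference, here's how the selector from getRecords() can be written in a few equivalent or narrower ways. CouchDB's query syntax treats a bare value as an implicit $eq, and the query object also accepts fields and limit. The field list and the limit of 100 below are hypothetical values chosen only for illustration.

```javascript
// Explicit form, as used in dbService.js
const explicitQuery = { selector: { something: { $eq: 'here' } } };

// Shorthand: a bare value is treated as an implicit $eq
const shorthandQuery = { selector: { something: 'here' } };

// Narrowed query: `fields` and `limit` trim what the database sends back
const narrowedQuery = {
  selector: { something: 'here' },
  fields: ['_id', 'something'],
  limit: 100,
};

console.log(JSON.stringify(narrowedQuery));
```

Any of these objects could be passed to findAsStream() in place of querySelector. Even with streaming in place, asking for fewer fields and fewer documents is still the cheapest optimisation.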

The findAsStream() function returns a stream object which can be piped to the res object. We should also end the response after the read is finished on the end event as well as on an error to prevent the API from hanging.
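As an alternative to wiring up the listeners by hand, Node's built-in stream.pipeline() reports success or failure through a single callback and cleans up both streams. This is a sketch, not what the Cloudant library prescribes. The streamToResponse() helper is a hypothetical name, and I'm assuming the source behaves like a standard Node.js readable stream. Plain streams stand in for the database stream and the Express response below.

```javascript
const { pipeline, Readable, Writable } = require('stream');

// Pipe `source` into `res` and call `done` exactly once with the outcome.
// `source` stands in for the stream from findAsStream(), `res` for the Express response.
function streamToResponse(source, res, done) {
  pipeline(source, res, (error) => {
    if (error) {
      console.error('ERROR', error.message);
    }
    done(error);
  });
}

// Stand-in for the response: collects whatever is written to it
const received = [];
const fakeResponse = new Writable({
  write(chunk, _encoding, callback) {
    received.push(chunk.toString());
    callback();
  },
});

streamToResponse(Readable.from(['{"docs":', '[]}']), fakeResponse, (error) => {
  console.log(error ? 'failed' : `sent ${received.join('')}`);
});
```

One caveat: on failure pipeline() destroys the destination, so with a real HTTP response the client sees a dropped connection rather than a tidy error message. Whether that's acceptable depends on your API.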

And that's it. It's fairly easy to take the burden off your strained application.

Leave a comment below if you like this example and let me know if there's anything you'd like me to cover next.

Cześć!

Has this been helpful to you?

You can support my work by sharing this article with others, or perhaps buy me a cup of coffee 😊

How can I stream my data from Cloudant in Node.js

Peter Poliwoda
