ETL: Transform Data with Node.js

The goal of the transform step in a Node ETL (Extract, Transform, Load) pipeline is to take the data from the data source and make it fit into the format we want it to be in for the destination. We can rename fields, add new fields, and filter out unnecessary data. Transforming the data in Node is simple once it’s been extracted.

In this example, we’ll be transforming planet records from the NASA Exoplanet API into a different, easier-to-use format. We can do this by renaming keys, removing data points we aren’t interested in, and computing new values from existing values.

In this tutorial we’ll:

Understand the benefits of transforming the service payload
List ways we can transform a payload in Node

By the end of the tutorial you’ll know how to transform data with Node.js as part of an ETL pipeline.

Goal

Understand the ways we can transform a payload

Prerequisites

Watch

Overview

Now that we are able to extract data from our data source in the extract step, we can transform the data into the format we need it to be in.

The goal of the transform step in an ETL pipeline is to take the data from the data source and make it fit into the format we want it to be in for the destination. In this example, we will be transforming planet records from the NASA Exoplanet API into a different, easier to use format. We can do this by renaming keys, removing data points we aren’t interested in, and computing new values from existing values.

To summarize, in the transform step you can clean data, standardize values and fields, and aggregate values.

Determine the new structure of the data

Below is the raw data we received from the data source representing a single planet. (Some data has been omitted for brevity):

{
  "pl_hostname": "GJ 3341",
  "pl_letter": "b",
  "pl_name": "GJ 3341 b",
  "pl_discmethod": "Radial Velocity",
  "pl_controvflag": 0,
  "pl_pnum": 1,
  "pl_orbper": 14.207,
  "pl_orbpererr1": 0.007,
  "pl_orbpererr2": -0.007,
  "pl_orbperlim": 0,
  "pl_orbpern": 1,
  "rowupdate": "2014-12-03",
  "pl_facility": "La Silla Observatory"
}

And the format that we want to transform a record into into looks like this:

{
  "name": "GJ 3341 b",
  "discoveryMethod": "Radial Velocity",
  "facility": "La Silla Observatory",
  "neighbors": 1,
  "orbitsInDays": 14.207,
  "orbitsInDaysError": "±0.007",
  "lastUpdate": "2014-12-03",
  "hostStar": "GJ 3341"
}

In this example, we want to remove fields that are unnecessary, rename keys, and compute new values.

Create a transform function

To do this, we will create a function that takes a planet record as an argument, and returns a new object containing the transformed data.

function transformOnePlanet(planet) {
  return {
    name: planet.pl_name,
    discoveryMethod: planet.pl_discmethod,
    facility: planet.pl_facility,
    neighbors: planet.pl_pnum,
    orbitsInDays: planet.pl_orbper,
    lastUpdate: planet.rowupdate,
    hostStar: planet.pl_hostname
  };
}

We’ve actually done two steps here at once; we removed unnecessary data, and renamed the keys for some fields. We’ve removed unnecessary data by creating a new object which is returned from the function. The resulting object only contains the data that we want to use and leaves the input intact. Creating a new object which only has the data we care about is an effective way to pick and choose what data to keep if your data source provides you with more data than you need.

To rename keys, we have created new keys on the object and assigned them values from the original planet record we received from the API. The original key pl_name becomes name, pl_discmethod becomes discoveryMethod, pl_num (the number of planets in a system) becomes neighbors, and so on.

Compute new values

To compute new values, we can create another function which takes the values to compute and returns a single value. For our demonstration purposes, this is a fairly contrived example.

Some of the measurements in our data have margin of error values (e.g. +0.07 and -0.07). For this example, we want to create a single value from the two error values which indicate the range of inaccuracy represented by the error values. If both error values have the same absolute value we can represent them as a single ± value. If their absolute values are different, we will show that the error margin can either be + the max value, or - the minimum value.

Creating a function to handle computing new values helps keep your code clean, and allows you to reuse the code on any similar values which are in your dataset. Creating small functions to transform individual values is useful when transforming your data. If you later want to change how values are being computed, you can just change the function instead of having to track down everywhere that it is being used.

function computeErrorThreshold(err1, err2) {
  if (err1 === null || err2 === null) {
    return null;
  } else if (Math.abs(err1) === Math.abs(err2)) {
    return `±${Math.abs(err1)}`;
  } else {
    const max = Math.max(err1, err2);
    const min = Math.min(err1, err2);
    return `+${max}/${min}`;
  }
}

With our data set, it’s not guaranteed that each pl_orbper value will have error margins. So if we are missing one or both error values we will return null for this value.

To use this function to compute a new value for the orbitsInDaysError, we can call the function with the two planet orbit error values from within the transformOnePlanet function, like this:

function computeErrorThreshold(err1, err2) {
  if (err1 === null || err2 === null) {
    return null;
  } else if (Math.abs(err1) === Math.abs(err2)) {
    return `±${err1}`;
  } else {
    const max = Math.max(err1, err2);
    const min = Math.min(err1, err2);
    return `+${max}/${min}`;
  }
}
function transformOnePlanet(planet) {
  return {
    name: planet.pl_name,
    discoveryMethod: planet.pl_discmethod,
    facility: planet.pl_facility,
    neighbors: planet.pl_pnum,
    orbitsInDays: planet.pl_orbper,
    orbitsInDaysError: computeErrorThreshold(
      planet.pl_orbpererr1,
      planet.pl_orbpererr2
    ),
    lastUpdate: planet.rowupdate,
    hostStar: planet.pl_hostname
  };
}

Iterate over the records and transform each one

Now that we have created our transform functions, we need to apply the transformation to each record we received from the data source.

We received an array of records from the data source, so we can iterate over each record using map and create a new array containing the transformed records by passing each planet record to the transformOnePlanet function.

We’ll do this in our startEtlPipeline function we created previously, after we’ve extracted the data from the source.

const startEtlPipeline = async () => {
  try {
    // extract the data
    const planets = await getAllPlanets();
    console.log(`Extracted ${planets.length} planets from the API`);

    // transform the data
    const planetsTransformed = planets.map(p => transformOnePlanet(p));
    // More to come in the Load step...
  } catch (err) {
    console.error(err);
  }
};

Once we have transformed the planet data, we are ready to load it to the destination.

Recap

In this tutorial, we covered various ways we can transform a payload by removing unused data, renaming keys, and computing new values. We decided the format we need our data to end up in, and created a function to transform each record.

Further your understanding

What are some various ways you can transform a payload?
Besides using map, how else could you iterate over all the records and apply the transformation function?

Additional resources

Making asynchronous programming easier with async and await (developer.mozilla.org)
MDN web docs: async function (developer.mozilla.org)
MDN web docs: await (developer.mozilla.org)