ETL: Transform Data with Node.js
The goal of the transform step in a Node ETL (Extract, Transform, Load) pipeline is to take the data from the data source and make it fit into the format we want it to be in for the destination. We can rename fields, add new fields, and filter out unnecessary data. Transforming the data in Node is simple once it’s been extracted.
In this example, we’ll be transforming planet records from the NASA Exoplanet API into a different, easier-to-use format. We can do this by renaming keys, removing data points we aren’t interested in, and computing new values from existing values.
In this tutorial we’ll:
- Understand the benefits of transforming the service payload
- List ways we can transform a payload in Node
By the end of the tutorial you’ll know how to transform data with Node.js as part of an ETL pipeline.
Goal
- Understand the ways we can transform a payload
Prerequisites
- What Is the Node.js ETL (Extract, Transform, Load) Pipeline?
- Understanding Promises in Node.js
- Use Promise.all to Wait for Multiple Promises
- Use JavaScript’s Async/Await with Promises
- ETL: Extract Data with Node.js
Watch
Overview
Now that we are able to extract data from our data source in the extract step, we can transform the data into the format we need it to be in.
The goal of the transform step in an ETL pipeline is to take the data from the data source and make it fit into the format we want it to be in for the destination. In this example, we will be transforming planet records from the NASA Exoplanet API into a different, easier to use format. We can do this by renaming keys, removing data points we aren’t interested in, and computing new values from existing values.
To summarize, in the transform step you can clean data, standardize values and fields, and aggregate values.
Determine the new structure of the data
Below is the raw data we received from the data source representing a single planet. (Some data has been omitted for brevity):
{ "pl_hostname": "GJ 3341", "pl_letter": "b", "pl_name": "GJ 3341 b", "pl_discmethod": "Radial Velocity", "pl_controvflag": 0, "pl_pnum": 1, "pl_orbper": 14.207, "pl_orbpererr1": 0.007, "pl_orbpererr2": -0.007, "pl_orbperlim": 0, "pl_orbpern": 1, "rowupdate": "2014-12-03", "pl_facility": "La Silla Observatory"}
And the format that we want to transform a record into into looks like this:
{ "name": "GJ 3341 b", "discoveryMethod": "Radial Velocity", "facility": "La Silla Observatory", "neighbors": 1, "orbitsInDays": 14.207, "orbitsInDaysError": "±0.007", "lastUpdate": "2014-12-03", "hostStar": "GJ 3341"}
In this example, we want to remove fields that are unnecessary, rename keys, and compute new values.
Create a transform function
To do this, we will create a function that takes a planet record as an argument, and returns a new object containing the transformed data.
function transformOnePlanet(planet) { return { name: planet.pl_name, discoveryMethod: planet.pl_discmethod, facility: planet.pl_facility, neighbors: planet.pl_pnum, orbitsInDays: planet.pl_orbper, lastUpdate: planet.rowupdate, hostStar: planet.pl_hostname };}
We’ve actually done two steps here at once; we removed unnecessary data, and renamed the keys for some fields. We’ve removed unnecessary data by creating a new object which is returned from the function. The resulting object only contains the data that we want to use and leaves the input intact. Creating a new object which only has the data we care about is an effective way to pick and choose what data to keep if your data source provides you with more data than you need.
To rename keys, we have created new keys on the object and assigned them values from the original planet record we received from the API. The original key pl_name
becomes name
, pl_discmethod
becomes discoveryMethod
, pl_num
(the number of planets in a system) becomes neighbors
, and so on.
Compute new values
To compute new values, we can create another function which takes the values to compute and returns a single value. For our demonstration purposes, this is a fairly contrived example.
Some of the measurements in our data have margin of error values (e.g. +0.07 and -0.07). For this example, we want to create a single value from the two error values which indicate the range of inaccuracy represented by the error values. If both error values have the same absolute value we can represent them as a single ± value. If their absolute values are different, we will show that the error margin can either be + the max value, or - the minimum value.
Creating a function to handle computing new values helps keep your code clean, and allows you to reuse the code on any similar values which are in your dataset. Creating small functions to transform individual values is useful when transforming your data. If you later want to change how values are being computed, you can just change the function instead of having to track down everywhere that it is being used.
function computeErrorThreshold(err1, err2) { if (err1 === null || err2 === null) { return null; } else if (Math.abs(err1) === Math.abs(err2)) { return `±${Math.abs(err1)}`; } else { const max = Math.max(err1, err2); const min = Math.min(err1, err2); return `+${max}/${min}`; }}
With our data set, it’s not guaranteed that each pl_orbper
value will have error margins. So if we are missing one or both error values we will return null
for this value.
To use this function to compute a new value for the orbitsInDaysError
, we can call the function with the two planet orbit error values from within the transformOnePlanet
function, like this:
function computeErrorThreshold(err1, err2) { if (err1 === null || err2 === null) { return null; } else if (Math.abs(err1) === Math.abs(err2)) { return `±${err1}`; } else { const max = Math.max(err1, err2); const min = Math.min(err1, err2); return `+${max}/${min}`; }}function transformOnePlanet(planet) { return { name: planet.pl_name, discoveryMethod: planet.pl_discmethod, facility: planet.pl_facility, neighbors: planet.pl_pnum, orbitsInDays: planet.pl_orbper, orbitsInDaysError: computeErrorThreshold( planet.pl_orbpererr1, planet.pl_orbpererr2 ), lastUpdate: planet.rowupdate, hostStar: planet.pl_hostname };}
Iterate over the records and transform each one
Now that we have created our transform functions, we need to apply the transformation to each record we received from the data source.
We received an array of records from the data source, so we can iterate over each record using map
and create a new array containing the transformed records by passing each planet record to the transformOnePlanet
function.
We’ll do this in our startEtlPipeline
function we created previously, after we’ve extracted the data from the source.
const startEtlPipeline = async () => { try { // extract the data const planets = await getAllPlanets(); console.log(`Extracted ${planets.length} planets from the API`);
// transform the data const planetsTransformed = planets.map(p => transformOnePlanet(p)); // More to come in the Load step... } catch (err) { console.error(err); }};
Once we have transformed the planet data, we are ready to load it to the destination.
Recap
In this tutorial, we covered various ways we can transform a payload by removing unused data, renaming keys, and computing new values. We decided the format we need our data to end up in, and created a function to transform each record.
Further your understanding
- What are some various ways you can transform a payload?
- Besides using
map
, how else could you iterate over all the records and apply the transformation function?
Additional resources
- Making asynchronous programming easier with async and await (developer.mozilla.org)
- MDN web docs: async function (developer.mozilla.org)
- MDN web docs: await (developer.mozilla.org)