What Is the Node.js ETL (Extract, Transform, Load) Pipeline?

Imagine you're a retail company with global stores working in local currencies. Using an ETL (Extract, Transform, Load) pipeline in Node.js would be a great tool to standardize and analyze revenue.

ETL is a process of extracting data from one location, transforming it in some way, and then loading it somewhere else. We can use this process to convert large amounts of data in Node.js from one format to another.

By the end of this tutorial, you should be able to:

  • Understand the different steps of an ETL pipeline, and why it’s useful to break the process up in this manner
  • Identify when we might need to use an ETL pipeline
  • List different examples of an ETL pipeline

Goal

  • Understand the concept of an ETL pipeline in Node.

Prerequisites

  • None

Watch: What Is an ETL Pipeline?

Overview

ETL is a process of extracting, transforming, and loading data from one or multiple sources into a destination:

  • Extract: Retrieve raw data from wherever it is, be that a database, an API, or any other source.
  • Transform: Alter the data in some way. This could be restructuring the data, renaming keys in the data, removing invalid or unnecessary data points, computing new values, or any other type of data processing which converts what you have extracted into the format you need it to be.
  • Load: Move the formatted data into its final destination (such as a database, flat file, or other structure) where it can be accessed by others.

ETL aims to solve the problem of having data in disparate places and formats by allowing you to pull data from various sources into a centralized location with a standardized format. It's called a "pipeline" because it's a set of processes that allow data to flow from one destination to another, while passing through the 3 different stages.

You might want to implement an ETL pipeline when you want to periodically move stored data to a new database or destination in a different format than it is stored in the data source.

Imagine you are a retail company with stores around the world that transact in different local currencies, and every store reports their revenue to the head office at the end of the month.

Because the stores report revenue in different currencies, it would be difficult to analyze all the stores in relation to each other.

To better analyze the revenue from each store you can use an ETL process to standardize each report so they all use the same currency.

You would extract the data from the reports sent by the stores, transform the currency amounts from their local currency to a single base currency, and then load the modified report data to a final reporting database or other location.

This is just one example of using ETL to transform data from a data source into a new structure, but ETL has many uses.

What all ETL processes have in common is that they take data from one or more locations, modify and standardize it, and then load the data to another destination to be used in its new form.

The Extract, Transform, Load process

An ETL process is broken into three distinct steps, and each step plays an important role in the overall process.

First you’ll retrieve the data from the data source. This is your opportunity to select the data you want to use as input for the pipeline. The raw data you start with affects the output you’ll produce. Each data source you extract from will require its own unique way to query or retrieve the data you are interested in. Interacting with the data source and getting the right data is the main focus of the extract step, and frees up the other parts of the pipeline from having to worry about how to retrieve the data.

The transform step can be as simple or complex as your needs require. Here we focus on the data flowing through the pipeline, converting the raw data from the extract step into the data structures or values that you desire. This is done with functions that operate on the data, taking the raw data as input and outputting new records in your desired format. Your transform step can be looked at as taking input and returning an appropriate output based on the data it received.

Finally, loading the data is similar to the extract step, but instead of retrieving data, we are sending it somewhere. Whether the destination is a database, a data warehouse, or a flat file, we are interacting with a destination to deliver the data into where it can be used.

Breaking an ETL pipeline up into these smaller parts helps us think through the process from start to finish.

Moving through each step with a clear goal in mind (extract the data, modify the data, and load the data) takes a complex process and allow us to look at each step individually.

You’ll also likely find parts of each step which are reusable, allowing you to find patterns within your process and take a more modular approach to moving your data around.

Uses for ETL

We looked at one example of ETL -- converting local currencies from different stores -- but an ETL process can be used to approach many problems.

ETL is very popular with Big Data use cases, where people need to analyze large amounts of data to uncover insights and relationships that can be useful. This is because Big Data often involves many different types of data, and some amount of cleaning and standardization is necessary before analysis is possible.

In Big Data applications, you’re also often working with multiple sources of data at the same time, which is a scenario that ETL is very good at handling. Normalizing and aggregating disparate data sets is a great use case for ETL.

But it’s not all just Big Data and analytics! Sometimes you need to migrate to a new type of database, and then the question becomes: how do you handle translating your data structure from one type of database to another? ETL can help us with that by modifying the data we have stored in one type of database into something appropriate for storage in another type of database.

ETL can be the pipes that connect data sources and help move data around for us.

Limitations

An ETL process can be computationally intensive, and sometimes requires access to data that may not be available in real-time. Therefore, ETL processes are typically executed on a regular schedule with a batch of data. However, real-time ETL solutions are becoming more common; investigate both approaches and evaluate the business and technical needs before deciding on an approach.

The result of an ETL process lags the actual state of your source data. This could be by seconds, minutes, hours, or days depending on your specific case. It takes time to extract, transform, and load all the necessary data. Depending on the nature and scale of the data you’re working with, your destination won’t be up to date until the process has completed.

These limitations aside, whenever you have data in one format but want to move it somewhere else in another format, ETL pipelines can be a powerful way to automate moving data between different parts of your architecture in bulk.

Recap

In this tutorial, we reviewed the concept of an ETL (Extract, Transform, Load) pipeline.

Further your understanding

  • Discuss with a co-worker the concept of ETL and talk about 2 or 3 good use cases for ETL.
  • Can you think of scenarios where the latency of a batch ETL pipeline would be a problem?
  • How does ETL differ from making queries against a dataset?

Additional resources

Sign in with your Osio Labs account
to gain instant access to our entire library.