ETL: Data Processing using Talend

Akshada Gaonkar
Published in Analytics Vidhya
7 min read · Oct 23, 2021

Photo by Markus Spiske on Unsplash

Processing gets tedious as the volume of data increases. ETL tools help transform huge volumes of data with ease. Once the data is processed, analyzing and interpreting it becomes easier and quicker.

Before implementing some processing in Talend, let’s first understand what ETL is.

What is ETL?
→ A 3-step data integration process: Extract, Transform, and Load. This process is required to move raw data from a data source to a database or data warehouse.

  • Extract
    In the first step of ETL, data in structured and unstructured form is extracted/gathered from multiple sources and is consolidated into a single centralized repository.
  • Transform
    Next, the entire dataset is transformed to ensure data quality. Various rules and functions are applied to transform the data into a format that matches the requirements. Some commonly used transformations include standardization, sorting, filtering, replacing null values, replacing inconsistent entries with a single value, and removing duplicates.
  • Load
    Finally, the extracted and transformed data is loaded into a target data repository or database. During this step, it should be ensured that the data is accurately loaded using limited resources and there is no loss in data consistency.
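As a mental model (not Talend code), the three steps above can be sketched in plain Java, with an in-memory list standing in for the target warehouse; the CSV data and field handling are purely illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MiniEtl {
    // Extract: gather raw CSV lines and split them into fields
    static List<String[]> extract(List<String> csvLines) {
        List<String[]> rows = new ArrayList<>();
        for (String line : csvLines) rows.add(line.split(","));
        return rows;
    }

    // Transform: apply a cleaning rule (here, trimming stray whitespace)
    static List<String[]> transform(List<String[]> rows) {
        for (String[] r : rows)
            for (int i = 0; i < r.length; i++) r[i] = r[i].trim();
        return rows;
    }

    // Load: append the cleaned rows to the target repository
    static final List<String[]> warehouse = new ArrayList<>();
    static void load(List<String[]> rows) { warehouse.addAll(rows); }

    public static void main(String[] args) {
        load(transform(extract(Arrays.asList("alice, 34", "bob , 41"))));
        System.out.println(warehouse.get(1)[0]); // prints "bob"
    }
}
```

A real ETL job would read from files or databases and write to a warehouse, but the extract → transform → load shape is the same.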

Now, let’s know a little something about Talend.

What is Talend?
→ Talend is one of the most widely used and powerful ETL tools available for data integration. It makes the ETL process easier and more cost-effective. Talend Open Studio for Data Integration allows you to build Java ETL jobs using predefined components. It provides a graphical workspace in which you can simply drag and drop components from the palette and connect them to map the data between the source and destination.

Finally, let’s play around with some Talend components.

We’ll look at 5 processing steps using the following Talend components:
tFileInputDelimited, tFileOutputDelimited, tLogRow, tMap, tFilterRow, tAggregateRow, tReplace, tReplaceList.

For illustration, we’ll be using the same Insurance_dataset from one of the previous blogs about data processing, but as a comma-delimited text file.

This is what the text file looks like.

Create a job in TOS; if you’ve never done this before, follow the instructions. Then, create metadata for the delimited file and use it in the tFileInputDelimited component of the input family so that it can be used further. If you don’t know how to do this, refer to the video on this link.

Setting up the metadata.
Settings for the tFileInputDelimited component.

To view the data, use the tLogRow component from the logs & errors family, which writes the data flowing through your job into the console. Select ‘Table’ mode in the component settings to view the data in a structured table format.

Settings for the tLogRow component.

The image below shows what our table looks like.

Output table

Replacing missing or null values

In the above image of the output, I’ve highlighted the missing or null values. Let’s deal with these values using the tMap component from the processing family. We’ll form expressions in the tMap component window using the available functions to transform the data as required. After this, we’ll save the processed data.

Job for replacing missing values.
tMap component to replace missing values.

Let’s understand the expressions formed.
Java’s conditional, or ternary, operator (?:) is used to write the expressions for replacing missing or null values.

Expression 1 ? Expression 2 : Expression 3 

Expression 1 is the condition to be checked. If it is true, Expression 2 is evaluated; otherwise, Expression 3 is evaluated.

row1.smoker.equals("NAN") ? "no" : row1.smoker
row1.region.isEmpty() ? "northeast" : row1.region
Relational.ISNULL(row1.bmi) ? 20.00 : row1.bmi

ISNULL() is a function provided by Talend’s Relational category which returns a boolean value indicating whether the given value is null.
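The three tMap expressions above translate directly to plain Java. Here is a minimal sketch, with each helper mirroring one expression (method names are mine, not Talend’s; the null check stands in for Relational.ISNULL):

```java
public class MissingValueDemo {
    // row1.smoker.equals("NAN") ? "no" : row1.smoker
    static String fixSmoker(String smoker) {
        return smoker.equals("NAN") ? "no" : smoker;
    }

    // row1.region.isEmpty() ? "northeast" : row1.region
    static String fixRegion(String region) {
        return region.isEmpty() ? "northeast" : region;
    }

    // Relational.ISNULL(row1.bmi) ? 20.00 : row1.bmi
    static double fixBmi(Double bmi) {
        return bmi == null ? 20.00 : bmi;
    }

    public static void main(String[] args) {
        System.out.println(fixSmoker("NAN"));  // no
        System.out.println(fixRegion(""));     // northeast
        System.out.println(fixBmi(null));      // 20.0
    }
}
```

In each case the ternary condition detects the missing value, and the replacement default is returned only when the condition is true; otherwise the original value passes through unchanged.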

The tFileOutputDelimited component of the output family helps us save the transformed data into another file with a suitable delimiter.

For further processing, add this new file to the metadata and then to the tFileInputDelimited component as before.
When we use tLogRow again to check the output table, you can notice the changes (compare with the previous output image).

Output after dealing with missing values.

Creating a column — Birth Year

Job for creating birth_year column.
tMap component to calculate and create birth_year column.

Understanding the expression:

TalendDate.addDate(TalendDate.getCurrentDate(),-row1.age, "yyyy")

addDate(Date date, int n, String dateType) is a function provided by the TalendDate category in the tMap expression builder. The first parameter is the date you want to add days, months, or years to; the second is the value to be added; and the third is the date pattern. This function returns a Date.
In our case, we need to subtract the age from the current date to get the birth year. That is why the first parameter passed to addDate is another built-in function from the TalendDate category, getCurrentDate(), which, as the name suggests, returns the current date. We need to subtract the age, so the second parameter is the negative of age, and the third is "yyyy" since only the year part is affected.
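Outside Talend, the same calculation can be sketched with java.util.Calendar: subtract age years from the current date and read off the year (the class and method names here are mine, not from the article):

```java
import java.util.Calendar;

public class BirthYearDemo {
    // Equivalent of TalendDate.addDate(TalendDate.getCurrentDate(), -age, "yyyy"):
    // shift the current date back by `age` years, then extract the year.
    static int birthYear(int age) {
        Calendar cal = Calendar.getInstance();
        cal.add(Calendar.YEAR, -age);
        return cal.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        System.out.println(birthYear(30)); // current year minus 30
    }
}
```

Note that this gives only an approximate birth year: someone aged 30 today could have been born in either of two calendar years, depending on whether their birthday has passed. The Talend job has the same limitation.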

Output table with birth_year column.

Categorical Encoding

Categorical encoding is a widely used technique in data analysis. Let’s see how to perform label encoding on the two categorical columns, sex and smoker. The same technique can also be used to replace inconsistent data entries. Say there’s a country column with US, U.S.A, USA, and America, which all represent the same value ‘USA’; replacing such inconsistent values can be done the same way. We’ll be using the tReplace component from the processing family, which searches and replaces within the defined input columns and helps clean the file before further processing.

Job for label encoding.
tReplace component for encoding the values for columns- sex & smoker.

The specified string is searched for in the selected input column and replaced with the replacement string provided.
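A sketch of what tReplace does here, as plain Java: an exact-match search and replace per column. The 0/1 assignments below are illustrative assumptions, not taken from the article; the actual codes are whatever you configure in the component.

```java
public class LabelEncodeDemo {
    // Illustrative label encoding for the sex column (codes are assumed)
    static String encodeSex(String sex) {
        if (sex.equals("female")) return "0";
        if (sex.equals("male")) return "1";
        return sex; // leave unexpected values untouched
    }

    // Illustrative label encoding for the smoker column (codes are assumed)
    static String encodeSmoker(String smoker) {
        if (smoker.equals("no")) return "0";
        if (smoker.equals("yes")) return "1";
        return smoker;
    }

    public static void main(String[] args) {
        System.out.println(encodeSex("female") + " " + encodeSmoker("yes")); // 0 1
    }
}
```

Because the match is exact, variants like "US" and "U.S.A" would each need their own search/replace pair, which is exactly how the inconsistent-country-name case would be handled in tReplace too.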

Output after label encoding.

Replacing region values with their codes

If you have a column with countries that you want to replace with country codes from another file, or languages with language codes, or localities with zip codes, you can use the tReplaceList component from the data quality family. In our example, we’ll replace the region values with their codes using this component. Below is an image of the delimited file with regions and their codes. Add this file to the metadata and then use it in a tFileInputDelimited component just like before.

File with regions and their respective codes.
Job for replacing region values with codes.

tReplaceList will take 2 input files. Here, tFileInputDelimited_2 is our insurance data file and tFileInputDelimited_3 is the region_code file.

tReplaceList for replacing region values with respective codes.

The lookup search column is the column to search for matches, i.e. the region column of tFileInputDelimited_3, and the lookup replacement column is the column with the new values, i.e. the code column of tFileInputDelimited_3. Then, select the column of the main file to be replaced, i.e. the region column of tFileInputDelimited_2.
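Conceptually, tReplaceList builds a lookup table from the second input and applies it to the main flow. A minimal Java sketch, assuming made-up region codes (the real ones come from the region_code file):

```java
import java.util.HashMap;
import java.util.Map;

public class RegionCodeDemo {
    // Stand-in for the region_code lookup file; these codes are illustrative
    static final Map<String, String> CODES = new HashMap<>();
    static {
        CODES.put("northeast", "NE");
        CODES.put("northwest", "NW");
        CODES.put("southeast", "SE");
        CODES.put("southwest", "SW");
    }

    // tReplaceList analogue: swap in the code when the lookup has a match,
    // otherwise pass the original value through unchanged
    static String replace(String region) {
        return CODES.getOrDefault(region, region);
    }

    public static void main(String[] args) {
        System.out.println(replace("northeast")); // NE
        System.out.println(replace("unknown"));   // unknown
    }
}
```

The lookup search column corresponds to the map keys, the lookup replacement column to the map values, and the main file’s region column is what gets rewritten.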

Output after replacing region values with codes.

Filtering and Aggregating Data

tFilterRow and tAggregateRow are two processing-family components that make your life easy when you want to filter data and group data to aggregate values, respectively.

Say, you want to compare the average insurance charges for smoker and non-smoker males and females with children.

Job to filter and aggregate data.
tFilterRow component to filter people with children.

First, to get the entries for people with children, apply a filter on the children column with the greater-than operator and the value zero, i.e. children > 0.

tAggregateRow component to aggregate charges.

The filtered output is then provided as an input to the tAggregateRow component. Now, group the data by sex and smoker and apply the average function to the charges column.
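The same filter-then-aggregate pipeline can be sketched with Java streams. The row type and sample numbers below are made up for illustration; the real job reads the insurance file:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FilterAggregateDemo {
    static class Row {
        final String sex; final String smoker; final int children; final double charges;
        Row(String sex, String smoker, int children, double charges) {
            this.sex = sex; this.smoker = smoker;
            this.children = children; this.charges = charges;
        }
    }

    // tFilterRow (children > 0) followed by tAggregateRow
    // (group by sex + smoker, average of charges)
    static Map<String, Double> avgCharges(List<Row> rows) {
        return rows.stream()
            .filter(r -> r.children > 0)
            .collect(Collectors.groupingBy(
                r -> r.sex + "/" + r.smoker,
                Collectors.averagingDouble(r -> r.charges)));
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("male", "yes", 2, 30000),
            new Row("male", "yes", 1, 20000),
            new Row("female", "no", 3, 8000),
            new Row("male", "no", 0, 5000)); // dropped by the filter
        System.out.println(avgCharges(rows));
    }
}
```

The filter step corresponds to tFilterRow’s children > 0 condition, and the groupingBy/averagingDouble pair corresponds to tAggregateRow’s group-by columns and avg(charges) operation.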

Output after filtering and aggregating data.

In the above output, we get 4 observations: non-smoker female, smoker male, non-smoker male, and smoker female. Comparing the 4 values, we can say that insurance charges for non-smokers with children are lower than those for smokers with children.

Akshada Gaonkar

Intern at SAS • MTech Student at NMIMS • Data Science Enthusiast!