Alexandre is a database and data migration specialist and a team leader with more than two decades of professional experience.
One of the most critical mistakes that prevent companies from providing vital decision-making information to business users is the absence of reliable data from one or more data sources, gathered in a single location, organized and prepared to be used.
Imagine this scenario: you are the IT manager of a cell phone case company with retail locations all over the United States. Your board of directors is struggling to make sales-based decisions because the information exists but is scattered across different locations and formats. One of the directors asks you to generate a dashboard with sales information from all the points of sale to present at the next board meeting.
You already know that it is almost impossible to consolidate the information due to different formats and structures. Some of the retail kiosks are still using a proprietary system in an Oracle database. Bigger stores are using a new Salesforce system. The newest kiosks that started operations during the system transition have shared spreadsheets used to compute sales.
How will you collate all the data from different locations, formats, and structures into a single database, ready to be accessed for dashboard generation?
ETL stands for Extract, Transform, and Load. ETL is a group of processes designed to turn this complex, scattered data into an organized, reliable, and replicable pipeline that helps your company generate more sales with the data you already have.
In our case, we’ll receive data from an Oracle database (most kiosks), from Salesforce (stores), and from spreadsheets (newer kiosks), extract the data, transform when necessary, and load into a single data warehouse database to be accessed by reporting tools and used to generate dashboards and scorecards.
Let’s dive into the three steps of ETL to elaborate on the process.
Extraction is the process of receiving data from single or multiple sources. Sources can have different formats and structures, such as documents, spreadsheets, CSV files, flat files, relational databases such as Oracle, MySQL, SQL Server, non-relational databases, and so on.
There are two main types of extraction: full and partial. A full extraction pulls the complete data set from the source on every run; a partial (incremental) extraction pulls only the records added or changed since the last run.
In addition to these aspects, choosing between full and partial extraction involves other considerations, and I want to describe one of them: data availability and integrity. This means that only completed transactions are considered for extraction, eliminating data that could cause an integrity fault. For example, imagine a 10-question online test used to assess an engineer's knowledge. If an engineer is in the middle of the test and has answered a few questions but hasn't finished yet, the extraction process must not read the answers from that unfinished test; doing so could cause an integrity fault.
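As a sketch of this idea, the snippet below uses an in-memory SQLite table to stand in for the source system; the table and column names are hypothetical, not from any real schema. The extraction query reads only completed transactions, skipping the unfinished test session:

```python
import sqlite3

# Hypothetical source table standing in for the operational system.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test_sessions (
    id INTEGER PRIMARY KEY,
    engineer TEXT,
    answered INTEGER,
    status TEXT)""")
conn.executemany(
    "INSERT INTO test_sessions (engineer, answered, status) VALUES (?, ?, ?)",
    [("alice", 10, "completed"),
     ("bob", 4, "in_progress"),   # unfinished test: must be skipped
     ("carol", 10, "completed")])

def extract_completed(conn):
    """Extract only finished transactions, so half-entered data
    cannot cause an integrity fault downstream."""
    cur = conn.execute(
        "SELECT id, engineer, answered FROM test_sessions WHERE status = ?",
        ("completed",))
    return cur.fetchall()

rows = extract_completed(conn)  # bob's unfinished session is excluded
```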
After extracting the data, we can begin the transformation process: clean, manipulate, and transform data according to business rules and technical criteria to maintain an acceptable level of data quality.
Depending on a number of factors, the use of a staging area may be necessary. A staging area is an intermediate storage space used to temporarily store data extracted from data sources to be transformed.
In some projects, normally those with a small amount of data, there is no need for a staging area, but most projects use one.
A range of tasks is performed during the transformation phase. Common transformation types include cleaning (fixing typos, nulls, and invalid values), deduplication, format standardization (dates, currencies, units, character sets), derivation of new values from existing ones, aggregation and summarization, and joining data from multiple sources.
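As a minimal sketch of such transformations, the Python example below normalizes sales rows from two differently structured sources into one common schema. The field names and date formats are illustrative assumptions, not taken from the article's actual systems:

```python
from datetime import datetime

# Illustrative rows from two hypothetical sources with different schemas.
oracle_row = {"SALE_DT": "18-JUN-2020", "AMOUNT": "199.90", "STORE": "kiosk-12"}
sheet_row  = {"date": "06/18/2020", "total": "149.5", "location": "kiosk-77"}

def normalize(row):
    """Map heterogeneous source rows onto one common schema:
    ISO dates, float amounts, a single 'store' field."""
    if "SALE_DT" in row:  # Oracle-style row
        date = datetime.strptime(row["SALE_DT"], "%d-%b-%Y").date()
        return {"date": date.isoformat(),
                "amount": float(row["AMOUNT"]),
                "store": row["STORE"]}
    # spreadsheet-style row
    date = datetime.strptime(row["date"], "%m/%d/%Y").date()
    return {"date": date.isoformat(),
            "amount": float(row["total"]),
            "store": row["location"]}

clean = [normalize(r) for r in (oracle_row, sheet_row)]
```

After this step, both rows share one structure and can be loaded into a single destination table.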
Last but not least, the final process in ETL is to load the data into the destination. Loading is the act of inserting transformed data (from a staging area or not) into the repository, normally a data warehouse database.
There are three main types of loading data: full (or initial), incremental, and refresh. A full load writes the entire data set into an empty destination; an incremental load applies only new or changed records at scheduled intervals; a refresh replaces existing data to bring the destination back in sync with its source.
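The difference between a full and an incremental load can be sketched with an in-memory SQLite database standing in for the data warehouse; the table and column names are illustrative assumptions:

```python
import sqlite3

# In-memory stand-in for the data warehouse destination table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, amount REAL)")

def full_load(conn, rows):
    """Full/initial load: wipe the destination and insert everything."""
    conn.execute("DELETE FROM sales")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

def incremental_load(conn, rows):
    """Incremental load: insert new rows and update changed ones,
    keyed on the primary key."""
    conn.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?)", rows)

full_load(dw, [(1, 10.0), (2, 20.0)])
incremental_load(dw, [(2, 25.0), (3, 30.0)])  # row 2 updated, row 3 new
```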
In summary, every company, independent of its size, can use ETL processes to integrate information that already exists and to generate an even bigger wealth of information for decision-making, turning data that previously could not be used into a new source of revenue.
Testing is one of the most important phases of ETL, yet also one of the most overlooked.
Transforming data from different sources and structures and loading it into a data warehouse is complex and error-prone; the most common errors arise during the transformation phase described above.
Data accuracy is the key to success, while inaccuracy is a recipe for disaster. Therefore, ETL professionals have a mission to guarantee data integrity during the whole process. After each phase, a test must be performed. Whether extracting data from a single source or from multiple sources, the data must be checked to establish there are no errors.
The same must be done after any transformation. For example, when summarizing data during the transformation phase, the data must be checked to guarantee that no data was lost and the sums are correct.
After loading the transformed data into the data warehouse, the testing process must be applied again. The loaded data needs to be compared with the transformed data and after that with the extracted data.
In our cell phone case company example, we are working with three different sources (proprietary Oracle database, Salesforce, and spreadsheets) and different formats. The testing phase can use sample data from the original sources and compare them with the data that is in the staging area to guarantee that the extraction occurred without errors.
The sample data—which in this case can be sales information from three different locations (stores, old kiosks, new kiosks)—is to be compared against the original source. The differences, if any, must be analyzed to see whether they are acceptable or if they are errors.
If errors are found, they must be fixed, and there are a few decisions to be made if you need to fix them: Should the original data be modified? Is it possible to do so? If the errors can’t be fixed in the original source, can they be fixed by some transformation?
In some cases, the data with errors must be eliminated and an alert triggered to inform those in charge.
Some testing examples: comparing record counts between source and destination, reconciling sums or checksums on numeric columns, checking for unexpected nulls and duplicates, and validating referential integrity between loaded tables.
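One such check can be sketched as a simple reconciliation of row counts and column sums between the extracted and loaded data; the function and field names below are illustrative, not from a real framework:

```python
def reconcile(source_rows, loaded_rows, amount_key="amount"):
    """Post-load checks: row counts and numeric sums must match
    between the source sample and what landed in the warehouse."""
    source_sum = sum(r[amount_key] for r in source_rows)
    loaded_sum = sum(r[amount_key] for r in loaded_rows)
    return {
        "row_count": len(source_rows) == len(loaded_rows),
        # Tolerance absorbs floating-point rounding, not real data loss.
        "amount_sum": abs(source_sum - loaded_sum) < 1e-9,
    }

src = [{"amount": 10.0}, {"amount": 20.0}]
dst = [{"amount": 10.0}, {"amount": 20.0}]
result = reconcile(src, dst)
```

If any check comes back false, the differences must be analyzed to see whether they are acceptable or are errors to be fixed.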
Logging ETL processes is key to building systems that are maintainable and easy to fix.
An ETL with the correct logging process is important to keep the entire ETL operation in a state of constant improvement, helping the team manage bugs and problems with data sources, data formats, transformations, destinations, etc.
A robust log process saves teams time by letting them identify and pinpoint problems more quickly. Sometimes errors occur in the middle of extracting tons of data, and without a log, identifying the problem is hard, at times almost impossible, and the whole process must be re-run. With logs, the team can rapidly identify the file and row that caused the problem and fix only that data.
The only case I can imagine where logs are not so important is with very small unautomated systems, where the process runs manually and there’s a small amount of data that can be monitored by hand.
Logs improve automation. ETL processes with a large amount of data that run automatically need log systems. If they’re well planned and executed, all the effort put into building a logging system will pay dividends in the form of faster error identification, more reliable data, and the improvement points found in log files.
There are three main steps in creating a log system: generate, archive, and analyze.
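To illustrate the "generate" step, here is a minimal Python logging sketch that writes one record per processed row. The file name, validation rule, and row structure are assumptions for illustration, and an in-memory buffer stands in for an archived per-run log file:

```python
import io
import logging

# Stand-in for a per-run log file (the "archive" step would rotate these).
log_buffer = io.StringIO()
logger = logging.getLogger("etl_demo")
logger.propagate = False
handler = logging.StreamHandler(log_buffer)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def load_row(source, row_no, row):
    """Generate one log record per row so failures are traceable
    to an exact file and line, instead of re-running the whole job."""
    try:
        if row["amount"] < 0:  # illustrative validation rule
            raise ValueError("negative amount")
        logger.info("loaded %s row %d", source, row_no)
    except ValueError as exc:
        logger.error("failed %s row %d: %s", source, row_no, exc)

load_row("kiosks.csv", 1, {"amount": 19.9})
load_row("kiosks.csv", 2, {"amount": -5.0})
```

The "analyze" step then scans these records for ERROR lines to decide what to fix and re-process.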
ETL processes can work with tons of data and may cost a lot—both in terms of time spent to set them up and the computational resources needed to process the data. When planning an integration, engineers must keep in mind the necessity of all the data being employed. The best thing to do is to work with the minimum amount of data to achieve the objectives, rather than spend time and money just migrating useless data. Also, bear in mind that data amounts tend to increase over time, so try to consider your future needs.
How much effort should go into designing a logging system?
It depends on a number of different factors like the amount and frequency of the data to be processed. Small systems with little data can be managed manually, without the need to invest in advanced logging systems.
Companies with a large amount of data, many different processes, different data sources and formats, and complex pipelines must invest in creating logging systems. As we already mentioned, a good log system can save a lot of time and money.
One point holds regardless of company size, amount of data, and frequency of integration: an ETL process must be profitable. An investment of time, money, and technical resources must generate returns; as the saying popularized by economist Milton Friedman goes, "There is no such thing as a free lunch."
In summary, an ETL process can help companies to increase profit with data they already have but that is not being used in the right way. ETL pipelines can integrate different systems, those spreadsheets with important information, and other portions of data that are spread out around different departments and branches, allowing organizations to make the most of their data.
San Diego, CA, United States
June 18, 2020