Volume 1 Issue 2 | 2024
Paper Id: IJMSM-V1I2P101
doi: 10.71141/30485037/V1I2P101
Data Transformation Techniques in ETL
Nithish, Ravi, David
Citation:
Nithish, Ravi, David, "Data Transformation Techniques in ETL," International Journal of Multidisciplinary on Science and Management, Vol. 1, No. 2, pp. 01-16, 2024.
Abstract:
Data transformation is a critical phase of the ETL (extract, transform, load) process, in which raw, unstructured, or semi-structured data is converted into a clean, structured format for analysis and reporting. The transformation phase comprises a suite of methods intended to improve data quality, consistency, and usability. Its key objectives are data compatibility, where disparate data formats are converted to a standardized structure; data aggregation, where data from diverse sources is combined and summarized into concise reports; and data integration, which brings data from different sources together into a unified dataset. Arguably the hardest part of data transformation is handling large volumes of complex, inconsistent, and sometimes conflicting data, which calls for robust techniques to cleanse, align, and normalize it. Advanced techniques such as flattening, aggregation, and filtering prepare data for in-depth analysis and complex reporting. Flattening is especially useful for hierarchical data structures, because it lets them be treated like ordinary, SQL-friendly tables, which simplifies integration with business intelligence tools. Transformation is also essential for scaling and maintaining ETL pipelines, whether they run in real time or in batch, depending on the use case. An ETL architecture combines data sources, transformation tools, and storage solutions in a single workflow that runs from data extraction to analysis. Appropriate data transformation techniques, together with monitoring through tools such as CloudWatch and Prometheus, help ensure the integrity, accuracy, and usability of the data so that organizations can make data-driven decisions and strengthen their business intelligence.
Keywords: ETL (Extract, Transform, Load), Data Transformation, Data Cleansing, Normalization, Aggregation, Data Integration, Data Quality, Data Consistency, Business Intelligence (BI).
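The flattening, filtering, and aggregation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `flatten` helper, the sample `raw_orders` records, and the field names are all hypothetical and chosen only for illustration. The sketch flattens nested records into single-level rows, filters out rows with a missing amount, and aggregates the remaining amounts per region.

```python
from collections import defaultdict


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into the nested structure and merge its flattened keys.
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


# Hypothetical semi-structured input, e.g. parsed from JSON.
raw_orders = [
    {"id": 1, "customer": {"name": "Ada", "region": "EU"}, "amount": 120.0},
    {"id": 2, "customer": {"name": "Bo", "region": "US"}, "amount": None},
    {"id": 3, "customer": {"name": "Cy", "region": "EU"}, "amount": 80.0},
]

# Flattening: each nested record becomes one table-like row.
flat_rows = [flatten(order) for order in raw_orders]

# Filtering: drop rows whose amount is missing.
clean_rows = [row for row in flat_rows if row["amount"] is not None]

# Aggregation: total amount per customer region.
totals = defaultdict(float)
for row in clean_rows:
    totals[row["customer_region"]] += row["amount"]
region_totals = dict(totals)

print(flat_rows[0])
print(region_totals)
```

After flattening, every row has the same single-level keys (`id`, `customer_name`, `customer_region`, `amount`), which is the shape a relational table or a business intelligence tool typically expects.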