Data Engineering Fundamentals
Types of data
Structured
Data is organized in a defined manner or schema, typically found in relational databases.
- easily queryable
- organized in rows and columns
- has a consistent structure
Unstructured
Data that doesn’t have a predefined structure or schema.
- not easily queryable without preprocessing
- may come in various formats
Semi-structured
Data that is not as organized as structured data but has some level of structure in the form of tags, hierarchies or other patterns.
- elements might be tagged or categorized in some way
- more flexible than structured data but not as chaotic as unstructured data
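To make the distinction concrete, here is a small Python sketch; the records and values are invented for illustration:

```python
import json

# Structured: fixed columns that match a predefined relational schema.
structured_row = ("user_123", "alice@example.com", 42)

# Semi-structured: self-describing JSON; fields and tags may vary per record.
semi_structured = json.loads('{"user": "user_123", "tags": ["beta", "premium"]}')

# Unstructured: free text (or images, audio); needs preprocessing before querying.
unstructured = "Customer called about a billing issue on their last invoice."

print(structured_row[1], semi_structured["tags"], len(unstructured.split()))
```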
Properties of data
Volume
Refers to the amount or size of data that organizations are dealing with at any given time.
Velocity
Refers to the speed at which new data is generated, collected and processed.
Variety
Refers to the different types, structures and sources of data.
Data Warehouses vs. Data Lakes
Data Warehouse
A centralized repository optimized for analysis where data from different sources is stored in a structured format.
Data Lake
A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.
Schema
Data Warehouse
Schema on write: the schema is defined before data is written.
- Extract - Transform - Load (ETL)
Data Lake
Schema on read: schema is defined at the time of reading data.
- Extract - Load - Transform (ELT)
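A minimal sketch of the difference in Python, with sqlite3 standing in for a warehouse and raw JSON lines standing in for a lake; the table and field names are hypothetical:

```python
import json
import sqlite3

# Schema on write: the table must be defined before any data is loaded (ETL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL)")
conn.execute("INSERT INTO orders VALUES (?, ?)", ("o-1", 19.99))

# Schema on read: store raw records as-is; impose structure only when reading (ELT).
raw_lines = ['{"order_id": "o-2", "amount": 5.0, "coupon": "WELCOME"}']
parsed = [json.loads(line) for line in raw_lines]  # schema decided at read time
print([rec["amount"] for rec in parsed])
```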
Data Lakehouse
A hybrid data architecture that combines the best features of data lakes and data warehouses, aiming to provide the performance, reliability and capabilities of a data warehouse while maintaining the flexibility, scale and low-cost storage of data lakes.
Data Mesh
A data mesh is more about governance and organization than about technology. Individual teams own “data products” within a given domain, and these data products serve various use cases around the organization.
- Domain-based data management
- Federated governance with central standards
- Self-service tooling and infrastructure
ETL Pipelines
ETL stands for Extract, Transform and Load. It’s a process used to move data from source systems into a data warehouse.
Extract
- Retrieve raw data from source systems.
- Ensure data integrity during the extraction phase.
- Can be done in real-time or in batches, depending on requirements.
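A sketch of a batch extract in Python, assuming the requests library and a hypothetical REST endpoint; a real source might just as well be a database query, raw logs or a stream consumer:

```python
import requests

def extract(batch_url: str) -> list[dict]:
    """Pull one batch of raw records from a source system."""
    response = requests.get(batch_url, timeout=30)
    response.raise_for_status()  # fail loudly rather than pass along partial data
    return response.json()["records"]  # payload shape is hypothetical
```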
Transform
- Convert the extracted data into a format suitable for the target data warehouse.
- Can involve various operations such as:
- data cleansing
- data enrichment
- format changes
- aggregations or computations
- encoding or decoding
- handling missing values
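A sketch of a transform step applying a few of the operations above; the field names and rules are invented for illustration:

```python
def transform(records: list[dict]) -> list[dict]:
    cleaned = []
    for rec in records:
        if rec.get("email") is None:  # handling missing values: drop incomplete rows
            continue
        cleaned.append({
            "email": rec["email"].strip().lower(),          # data cleansing
            "country": rec.get("country", "unknown"),       # enrichment with a default
            "amount_usd": round(float(rec.get("amount", 0)), 2),  # format change
        })
    return cleaned
```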
Load
- Move the transformed data into the target data warehouse or another data repository.
- Can be done in batches or in a streaming manner.
- Ensure that data maintains its integrity during the load phase.
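A sketch of a batch load, with sqlite3 standing in for the target warehouse; the table and columns match the hypothetical transform above:

```python
import sqlite3

def load(rows: list[dict], db_path: str = "warehouse.db") -> None:
    conn = sqlite3.connect(db_path)
    with conn:  # one transaction: all rows commit together, preserving integrity
        conn.execute(
            "CREATE TABLE IF NOT EXISTS purchases (email TEXT, country TEXT, amount_usd REAL)"
        )
        conn.executemany(
            "INSERT INTO purchases VALUES (:email, :country, :amount_usd)", rows
        )
    conn.close()
```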
Managing ETL Pipelines
- This process must be automated in a reliable way, with scheduling, retries and monitoring; orchestration tools are a common choice, as sketched below.
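One such orchestrator is Apache Airflow; a minimal sketch of a daily DAG in Airflow 2.x style, with stub callables standing in for the steps above:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_fn(): ...    # stand-ins for the extract/transform/load
def transform_fn(): ...  # functions sketched earlier
def load_fn(): ...

with DAG(dag_id="daily_etl", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    transform = PythonOperator(task_id="transform", python_callable=transform_fn)
    load = PythonOperator(task_id="load", python_callable=load_fn)

    extract >> transform >> load  # dependencies define the run order
```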
Data Sources
- JDBC: Java Database Connectivity
- ODBC: Open Database Connectivity
- Raw logs
- APIs
- Streams
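For the relational sources, a sketch of an ODBC pull using the pyodbc library; the DSN, credentials and query are hypothetical (JDBC is the Java-side equivalent):

```python
import pyodbc

# Connection details come from the source system's administrator.
conn = pyodbc.connect("DSN=sales_db;UID=etl_user;PWD=secret")
cursor = conn.cursor()
cursor.execute(
    "SELECT order_id, amount FROM orders WHERE updated_at >= ?", "2024-01-01"
)
rows = cursor.fetchall()  # raw records, ready for the transform step
conn.close()
```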
Data Formats
- CSV
- JSON
- Avro
- Parquet
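A quick comparison of writing the same records in three of these formats with pandas; Parquet support assumes pyarrow or fastparquet is installed:

```python
import pandas as pd

df = pd.DataFrame({"order_id": ["o-1", "o-2"], "amount": [19.99, 5.00]})

df.to_csv("orders.csv", index=False)         # row-oriented text; ubiquitous, untyped
df.to_json("orders.json", orient="records")  # semi-structured; nests naturally
df.to_parquet("orders.parquet")              # columnar, compressed; good for analytics
# Avro (row-oriented with an embedded schema) needs a library such as fastavro.
```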