Data Engineering Fundamentals

Types of data

  • Structured

    Data is organized in a defined manner or schema, typically found in relational databases.

    • easily queryable
    • organized in rows and columns
    • has a consistent structure
  • Unstructured

    Data that doesn’t have a predefined structure or schema.

    • not easily queryable without preprocessing
    • may come in various formats
  • Semi-structured

    Data that is not as organized as structured data but has some level of structure in the form of tags, hierarchies or other patterns.

    • elements might be tagged or categorized in some way
    • more flexible than structured data but not as chaotic as unstructured data
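The three types above can be sketched with the same piece of information in each shape (sample data assumed for illustration):

```python
import json

# Structured: fixed columns, consistent across every row (relational-table style).
structured_row = ("C001", "Ada Lovelace", "ada@example.com")

# Semi-structured: tagged fields with optional nesting; keys may vary per record.
semi_structured = json.loads('{"id": "C001", "name": "Ada Lovelace", "tags": ["vip"]}')

# Unstructured: free text with no predefined schema; needs preprocessing to query.
unstructured = "Ada emailed support on Tuesday about her billing question."

print(structured_row[1])        # access by column position
print(semi_structured["name"])  # access by tag/key
```

Note how the structured row is queryable by position or column name, the semi-structured record by its tags, while the free text would first need parsing or NLP to answer even a simple question.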

Properties of data

  • Volume

    Refers to the amount or size of data that organizations are dealing with at any given time.

  • Velocity

    Refers to the speed at which new data is generated, collected and processed.

  • Variety

    Refers to the different types, structures and sources of data.

Data Warehouses vs. Data Lakes

Data Warehouse

A centralized repository optimized for analysis where data from different sources is stored in a structured format.

Data Lake

A storage repository that holds vast amounts of raw data in its native format, including structured, semi-structured, and unstructured data.

Schema

  • Data Warehouse

    Schema on write: predefined schema before writing data.

    • Extract - Transform - Load (ETL)
  • Data Lake

    Schema on read: schema is defined at the time of reading data.

    • Extract - Load - Transform (ELT)
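The schema-on-write vs. schema-on-read distinction can be sketched in a few lines. This is a hypothetical illustration (the schema, helper names, and sample records are all assumptions), not a real warehouse or lake API:

```python
import json

# Assumed schema for illustration: field name -> Python type used to cast it.
SCHEMA = {"id": int, "amount": float}

def write_with_schema(store: list, record: dict) -> None:
    # Schema on write (warehouse / ETL): validate and cast BEFORE storing;
    # a bad record raises here and never lands in the store.
    typed = {k: t(record[k]) for k, t in SCHEMA.items()}
    store.append(typed)

def read_with_schema(raw_store: list) -> list:
    # Schema on read (lake / ELT): raw JSON lines land as-is; the schema is
    # applied only at query time, when each line is parsed and cast.
    return [{k: t(json.loads(line)[k]) for k, t in SCHEMA.items()}
            for line in raw_store]

warehouse: list = []
write_with_schema(warehouse, {"id": "1", "amount": "9.99"})

lake = ['{"id": "2", "amount": "5.00"}']  # stored raw, no upfront validation
rows = read_with_schema(lake)
```

The trade-off: schema on write guarantees clean data at rest but rejects anything unexpected; schema on read accepts everything cheaply and defers the cost (and failures) to query time.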

Data Lakehouse

A hybrid data architecture that combines the best features of data lakes and data warehouses, aiming to provide the performance, reliability and capabilities of a data warehouse while maintaining the flexibility, scale and low-cost storage of data lakes.

Data Mesh

A data mesh is more about governance and organization than about a specific technology. Individual teams own “data products” within a given domain, and these data products serve various use cases around the organization. Key principles: domain-based data management, federated governance with central standards, and self-service tooling and infrastructure.

ETL Pipelines

ETL stands for Extract, Transform, and Load. It’s a process used to move data from source systems into a data warehouse.

  • Extract

    • Retrieve raw data from source systems.
    • Ensure data integrity during the extraction phase.
    • Can be done in real-time or in batches, depending on requirements.
  • Transform

    • Convert the extracted data into a format suitable for the target data warehouse.
    • Can involve various operations such as:
      • data cleansing
      • data enrichment
      • format changes
      • aggregations or computations
      • encoding or decoding
      • handling missing values
  • Load

    • Move the transformed data into the target data warehouse or another data repository.
    • Can be done in batches or in a streaming manner.
    • Ensure that data maintains its integrity during the load phase.
  • Managing ETL Pipelines

    • The pipeline must be automated in a reliable way, e.g. run on a schedule or trigger and monitored for failures.
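The three phases above can be sketched as small functions wired together. This is a minimal illustration with assumed sample data and table names (using an in-memory SQLite database as a stand-in for the warehouse), not a production pipeline:

```python
import csv
import io
import sqlite3

def extract(raw_csv: str) -> list[dict]:
    """Extract: retrieve raw rows from a source system (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[tuple]:
    """Transform: cleanse values, cast types, and drop rows with missing data."""
    out = []
    for r in rows:
        if not r["amount"]:  # handle missing values
            continue
        out.append((r["id"].strip(), float(r["amount"])))
    return out

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: move transformed rows into the target store inside a transaction."""
    with conn:  # commit-or-rollback preserves integrity during the load phase
        conn.execute("CREATE TABLE IF NOT EXISTS orders (id TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

raw = "id,amount\n A1 ,9.99\nA2,\n"  # one dirty row, one row with a missing value
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
```

Real pipelines add scheduling, retries, and monitoring around exactly this shape; orchestration tools exist to automate that reliably.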

Data Sources

  • JDBC (Java Database Connectivity)
  • ODBC (Open Database Connectivity)
  • Raw logs
  • APIs
  • Streams
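Of the sources above, raw logs are the least structured; a regular expression is a common way to turn a log line into a queryable record. A hedged sketch, assuming an Apache-style access-log format and sample data:

```python
import re

# Assumed Apache-style access-log format; adjust the pattern for other log shapes.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line: str):
    """Turn one raw log line into a dict of named fields, or None if malformed."""
    m = LOG_PATTERN.match(line)
    return m.groupdict() if m else None

line = '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
rec = parse_line(line)
```

Parsing like this is exactly the kind of preprocessing that makes an unstructured source queryable downstream.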

Data Formats

  • CSV
  • JSON
  • Avro
  • Parquet
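CSV and JSON can be produced with the standard library alone; a small sketch with assumed sample records. Avro (row-oriented) and Parquet (columnar) are binary formats that require third-party libraries, so only the text formats are shown here:

```python
import csv
import io
import json

records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

# CSV: compact and row-based; the schema lives only in the header line,
# and every value is a string until cast.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "name"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: self-describing and supports nesting, but every record repeats
# its field names, which costs space at scale.
json_text = json.dumps(records)
round_tripped = json.loads(json_text)
```

The same trade-off drives the binary formats: Avro keeps a schema alongside row-oriented data (good for streaming writes), while Parquet stores columns together (good for analytical scans that touch few columns).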