- it’s a repository of data stored in raw format
- no schema on write requirements for the input sources to make easy the injection process
- to access data we follow a schema on read approach for more versatility
BENEFITS
- higher scalability (no need to scale large computing architectures)
- data are stored only in one place
- support for unstructured data
- support for machine learning workloads
USE CASES VS TECH SOLUTIONS
| USE CASES | TECH SOLUTIONS | FEATURES TRENDS |
|---|---|---|
| Mission critical, low latency, insight apps | Data Warehouse/Hot | - More expensive HW/SW - Use case-specific data - Less latency - More governance - Higher data quality - Used by end-users and data analysts |
| Agile insight apps | Data Hub/Warm | |
| Staging area, data mining, searching, profiling, cataloging | Data lake/Cold | - Less expensive HW/SW - All enterprise data - More latency - Less governance - Lower data quality - Used by data scientists |
INSIGHT DRIVEN DATA SYSTEMS VS TRADITIONAL DATA SYSTEM
| USE CASES | TRADITIONAL DATA SYSTEM | INSIGHT DRIVEN DATA SYSTEM |
|---|---|---|
| DATA SOURCES | Structured, relational data from transaction systems, relational and operational data stores | Traditional sources + semi and unstructured sources: logs, web sites, social media, alternative data providers |
| Data Movement (Ingestion) | limited Amount of data that can be moved | Unlimited volume of data that can be moved inside the system |
| Storage | Limited volume of data | Virtually unlimited volume of data |
| Data Structure | schema is designed upfront before data are inserted into the system | no schema is defined for data, data can be stored in various formats |
| Data Trasformation | data need to be formatted cleaned and adjusted to fit into the model schema | data transformation are added to match data analysis requirements |
| Analytics | sql queries BI tools full text | + self-service BI, big data analytics, realtime analytics, machine learning, data exploration/visualization. Allow users to securely explore and query raw data. Easily introduce new types of analytics |
| Price/Performance | Highest cost storage/fastest query results | low-cost storage + performance scale/speed/cost tradeoffs |
| Users | Business | data scientist data analysis |
| Data Quality | High | use case dependent must be clear from design |
| Data sharing and collaboration | Very limited | Rich. Raw and transformed data sets, analytical models, dashboards can be easily and securely shared |
- this new systems are cheaper to design, can handle multiple data inputs, enabling users to perform powerful operations in a large amount of data of various forms
DATA LAKES STRUCTURES

DATA LAKE ARCHITECTURES
lambda lake
- designed for multiple workloads
- data are inserted in 2 pipelines one for time consuming operation (cold path) and one real-time workflows where data need to be sent to the clients faster

kappa lake
- simplified version of lambda lake, it removes the cold path stage and replace it with a long term data storage

delta lake
advanced features like:
- ACID properties (it is able to deal with consistency requirements).
- Scalable metadata handling.
- Data versioning (also called time travel, it allows the analysis of different snapshots).
- Unified batch and streaming source and sink.
- Schema enforcement and evolution.
- DBMS-like operations: updates, deletes, inserts, upserts (insert, on conflict update).