no schema on write requirements for the input sources to make easy the injection process
to access data we follow a schema on read approach for more versatility
BENEFITS
higher scalability (no need to scale large computing architectures)
data are stored only in one place
support for unstructured data
support for machine learning workloads
USE CASES VS TECH SOLUTIONS
USE CASES
TECH SOLUTIONS
FEATURES TRENDS
Mission critical, low latency, insight apps
Data Warehouse/Hot
- More expensive HW/SW - Use case-specific data - Less latency - More governance - Higher data quality - Used by end-users and data analysts
Agile insight apps
Data Hub/Warm
Staging area, data mining, searching, profiling, cataloging
Data lake/Cold
- Less expensive HW/SW - All enterprise data - More latency - Less governance - Lower data quality - Used by data scientists
INSIGHT DRIVEN DATA SYSTEMS VS TRADITIONAL DATA SYSTEM
USE CASES
TRADITIONAL DATA SYSTEM
INSIGHT DRIVEN DATA SYSTEM
DATA SOURCES
Structured, relational data from transaction systems, relational and operational data stores
Traditional sources + semi and unstructured sources: logs, web sites, social media, alternative data providers
Data Movement (Ingestion)
limited Amount of data that can be moved
Unlimited volume of data that can be moved inside the system
Storage
Limited volume of data
Virtually unlimited volume of data
Data Structure
schema is designed upfront before data are inserted into the system
no schema is defined for data, data can be stored in various formats
Data Trasformation
data need to be formatted cleaned and adjusted to fit into the model schema
data transformation are added to match data analysis requirements
Analytics
sql queries BI tools full text
+ self-service BI, big data analytics, realtime analytics, machine learning, data exploration/visualization. Allow users to securely explore and query raw data. Easily introduce new types of analytics
Rich. Raw and transformed data sets, analytical models, dashboards can be easily and securely shared
this new systems are cheaper to design, can handle multiple data inputs, enabling users to perform powerful operations in a large amount of data of various forms
DATA LAKES STRUCTURES
DATA LAKE ARCHITECTURES
lambda lake
designed for multiple workloads
data are inserted in 2 pipelines one for time consuming operation (cold path) and one real-time workflows where data need to be sent to the clients faster
kappa lake
simplified version of lambda lake, it removes the cold path stage and replace it with a long term data storage
delta lake
advanced features like:
ACID properties (it is able to deal with consistency requirements).
Scalable metadata handling.
Data versioning (also called time travel, it allows the analysis of different snapshots).
Unified batch and streaming source and sink.
Schema enforcement and evolution.
DBMS-like operations: updates, deletes, inserts, upserts (insert, on conflict update).