• it’s a repository of data stored in raw format
  • no schema-on-write requirements are imposed on the input sources, which eases the ingestion process
  • data are accessed with a schema-on-read approach, which gives more versatility
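
The schema-on-read idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list `RAW_STORE`, the schema `SALES_SCHEMA` and the helper `read_with_schema` are hypothetical names invented for this example, not part of any data lake product.

```python
import json

# Raw records are stored exactly as received: no schema is enforced on write,
# so records may have missing or extra fields. (Hypothetical sample data.)
RAW_STORE = [
    '{"user": "ada", "amount": "12.50", "ts": "2024-01-03"}',
    '{"user": "bob", "amount": "7", "note": "extra field"}',
    '{"user": "eve"}',
]

# Hypothetical schema chosen by the reader: field -> (converter, default).
SALES_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the reader's schema."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Missing fields fall back to the default; others are converted.
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows


rows = read_with_schema(RAW_STORE, SALES_SCHEMA)
```

A different consumer could read the same raw lines with another schema (for example one that keeps `ts`), without touching the stored data.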

BENEFITS

  • higher scalability (no need to scale large computing architectures)
  • data are stored only in one place
  • support for unstructured data
  • support for machine learning workloads

USE CASES VS TECH SOLUTIONS

USE CASES | TECH SOLUTIONS | FEATURES TRENDS
Mission critical, low latency, insight apps | Data Warehouse/Hot | More expensive HW/SW; use case-specific data; less latency; more governance; higher data quality; used by end-users and data analysts
Agile insight apps | Data Hub/Warm |
Staging area, data mining, searching, profiling, cataloging | Data lake/Cold | Less expensive HW/SW; all enterprise data; more latency; less governance; lower data quality; used by data scientists

INSIGHT DRIVEN DATA SYSTEMS VS TRADITIONAL DATA SYSTEM

USE CASES | TRADITIONAL DATA SYSTEM | INSIGHT DRIVEN DATA SYSTEM
Data Sources | Structured, relational data from transaction systems, relational and operational data stores | Traditional sources + semi-structured and unstructured sources: logs, web sites, social media, alternative data providers
Data Movement (Ingestion) | Limited amount of data that can be moved | Unlimited volume of data that can be moved inside the system
Storage | Limited volume of data | Virtually unlimited volume of data
Data Structure | Schema is designed upfront, before data are inserted into the system | No schema is defined for data; data can be stored in various formats
Data Transformation | Data need to be formatted, cleaned and adjusted to fit the model schema | Data transformations are added to match data analysis requirements
Analytics | SQL queries, BI tools, full text | The same, plus self-service BI, big data analytics, real-time analytics, machine learning, data exploration/visualization. Allows users to securely explore and query raw data. New types of analytics are easy to introduce
Price/Performance | Highest-cost storage / fastest query results | Low-cost storage + performance scale/speed/cost trade-offs
Users | Business | Data scientists, data analysts
Data Quality | High | Use case dependent; must be clear from design
Data Sharing and Collaboration | Very limited | Rich. Raw and transformed data sets, analytical models and dashboards can be easily and securely shared
  • these new systems are cheaper to design and can handle multiple data inputs, enabling users to perform powerful operations on large amounts of data in various forms

DATA LAKES STRUCTURES

DATA LAKE ARCHITECTURES

lambda lake

  • designed for multiple workloads
  • data are inserted into 2 pipelines: one for time-consuming operations (cold path) and one for real-time workflows (hot path), where data need to be sent to the clients faster
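
The two pipelines described above can be sketched as a toy in-memory model. This is a minimal sketch under assumptions, not a reference implementation: the class `LambdaLake`, its method names and the `key`/`value` event shape are all hypothetical.

```python
from collections import defaultdict


class LambdaLake:
    """Toy model: every event feeds both a cold (batch) and a hot (real-time) path."""

    def __init__(self):
        self.raw_events = []                    # cold path input: append-only raw data
        self.batch_view = {}                    # rebuilt by the slow batch job
        self.realtime_view = defaultdict(int)   # hot path: totals since the last batch

    def ingest(self, event):
        # The same event is sent to both pipelines.
        self.raw_events.append(event)
        self.realtime_view[event["key"]] += event["value"]

    def run_batch(self):
        # Time-consuming recomputation over all raw data (cold path).
        totals = defaultdict(int)
        for event in self.raw_events:
            totals[event["key"]] += event["value"]
        self.batch_view = dict(totals)
        # The batch view now covers everything, so the hot view restarts.
        self.realtime_view.clear()

    def query(self, key):
        # Clients see the batch result merged with the fresh real-time delta.
        return self.batch_view.get(key, 0) + self.realtime_view.get(key, 0)


lake = LambdaLake()
lake.ingest({"key": "a", "value": 1})
lake.ingest({"key": "a", "value": 2})
before_batch = lake.query("a")   # served by the hot path only
lake.run_batch()
lake.ingest({"key": "a", "value": 4})
after_batch = lake.query("a")    # batch view plus hot-path delta
```

The point of the design is that queries stay fast and up to date through the hot path, while the cold path periodically recomputes a complete view from the raw data.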

kappa lake

  • simplified version of the lambda lake: it removes the cold path stage and replaces it with long-term data storage
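
As a rough illustration of the idea above, the sketch below keeps a single streaming path plus a long-term event log, and rebuilds a view by replaying that log instead of running a separate batch pipeline. The class `KappaLake`, its methods and the event shape are hypothetical names chosen for this example.

```python
class KappaLake:
    """Toy model: one streaming path, backed by a long-term append-only log."""

    def __init__(self):
        self.log = []    # long-term data storage (replaces the cold path)
        self.view = {}   # view maintained incrementally by the stream

    @staticmethod
    def _sum_values(view, event):
        # Default stream logic: running total per key.
        view[event["key"]] = view.get(event["key"], 0) + event["value"]

    def ingest(self, event):
        self.log.append(event)
        self._sum_values(self.view, event)

    def replay(self, apply=None):
        """Rebuild a view from the stored log, optionally with new logic."""
        apply = apply or self._sum_values
        view = {}
        for event in self.log:
            apply(view, event)
        return view


def count_events(view, event):
    # Alternative logic introduced later: count events per key.
    view[event["key"]] = view.get(event["key"], 0) + 1


kappa = KappaLake()
for value in (5, 10):
    kappa.ingest({"key": "a", "value": value})
kappa.ingest({"key": "b", "value": 1})

rebuilt = kappa.replay()             # same logic, recomputed from the log
counts = kappa.replay(count_events)  # new logic applied to historical data
```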

delta lake

it offers advanced features such as:

  • ACID properties (it is able to deal with consistency requirements).
  • Scalable metadata handling.
  • Data versioning (also called time travel, it allows the analysis of different snapshots).
  • Unified batch and streaming source and sink.
  • Schema enforcement and evolution.
  • DBMS-like operations: updates, deletes, inserts, upserts (insert, on conflict update).
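
Two of the features listed above, upserts and data versioning (time travel), can be illustrated with a toy in-memory table. This is deliberately not the Delta Lake API: `VersionedTable` and its methods are hypothetical, and the sketch copies the whole table on every write, which a real system would not do.

```python
import copy


class VersionedTable:
    """Toy table keeping one full snapshot per committed write."""

    def __init__(self):
        self._snapshots = [{}]   # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def upsert(self, rows, key="id"):
        # Insert new rows; on key conflict, update the existing row.
        new = copy.deepcopy(self._snapshots[-1])
        for row in rows:
            new[row[key]] = {**new.get(row[key], {}), **row}
        self._snapshots.append(new)   # the new snapshot becomes visible at once
        return self.version

    def delete(self, key_value):
        new = copy.deepcopy(self._snapshots[-1])
        new.pop(key_value, None)
        self._snapshots.append(new)
        return self.version

    def read(self, version=None):
        # Time travel: read any past snapshot, or the latest by default.
        v = self.version if version is None else version
        return copy.deepcopy(self._snapshots[v])


table = VersionedTable()
v1 = table.upsert([{"id": 1, "name": "ada"}])
v2 = table.upsert([{"id": 1, "name": "Ada"}, {"id": 2, "name": "bob"}])
v3 = table.delete(2)

old = table.read(version=1)   # snapshot before the update
latest = table.read()         # current state
```

Reading version 1 still returns the original row, even though later versions updated and deleted data.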
