Buoyancy in Data Lakes - Agile Metadata Management in Hadoop Data Warehouses at Otto Group

Keynote

Current metadata and information about data lineage are crucial for understanding and interpreting data in a Hadoop data warehouse. At the same time, Hadoop data warehouse projects sink or swim with the ability to continuously add new data sources and views as business requirements evolve.

Conventional ETL workflow schedulers and metadata management approaches, however, prove to be millstones around a project's neck:

with every rollout of new views, significant effort has to be invested in one-time migration scripts for schema and data;
manual data documentation is laborious and quickly gets out of sync; and
automatic metadata derivation approaches are invasive, resource-taxing, and usually yield a perspective too technical for business.

The talk proposes integrated specification of data structure, data dependencies, and computation logic as a way to keep ETL productive and metadata current.

Based on such specifications, a scheduler can automatically detect changes and perform

appropriate schema migrations and
data recomputations

as necessary, significantly reducing rollout and operations effort.
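
The change detection described above can be illustrated with a minimal sketch. It assumes that a view specification bundles columns, dependencies, and transformation logic, and that checksums of the specification are recorded at each rollout. All names here (ViewSpec, plan_actions, record_rollout, the action strings) are hypothetical and are not Schedoscope's actual API.

```python
import hashlib
from dataclasses import dataclass


def _digest(text: str) -> str:
    """Stable checksum of a piece of specification text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ViewSpec:
    """Hypothetical integrated specification of one view."""
    name: str
    columns: tuple        # (column name, type) pairs: the data structure
    depends_on: tuple     # names of upstream views: the data dependencies
    transformation: str   # computation logic, e.g. SQL text

    def schema_checksum(self) -> str:
        return _digest(repr(self.columns))

    def logic_checksum(self) -> str:
        return _digest(repr(self.depends_on) + self.transformation)


def record_rollout(spec: ViewSpec) -> dict:
    """Checksums a scheduler would store after deploying a view."""
    return {"schema": spec.schema_checksum(), "logic": spec.logic_checksum()}


def plan_actions(spec: ViewSpec, deployed) -> list:
    """Compare the current spec with the recorded rollout state.

    `deployed` is the dict from record_rollout, or None if the view
    has never been deployed.
    """
    if deployed is None:
        return ["create_table", "compute_data"]
    actions = []
    if deployed["schema"] != spec.schema_checksum():
        actions.append("migrate_schema")
    # In this sketch, a schema change also invalidates existing data.
    if actions or deployed["logic"] != spec.logic_checksum():
        actions.append("recompute_data")
    return actions
```

In this sketch an unchanged specification yields no actions, a change to the transformation alone triggers only a recomputation, and a change to the columns triggers both a migration and a recomputation.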

Also, the very same specifications explicitly "program" rich metadata, avoiding

additional manual documentation labor and
the overhead and low semantic level of automatic metadata derivation.

This greatly simplifies the implementation of metadata exploration tools.
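
As a rough illustration of specifications that "program" metadata, the sketch below assumes that descriptions are written next to the fields in the specification itself, so that a metadata record with documentation and lineage can be read straight off it. The classes and the metadata_record function are hypothetical and do not reflect how Metascope is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    dtype: str
    comment: str          # business-level description, written with the spec


@dataclass(frozen=True)
class DocumentedView:
    name: str
    comment: str
    fields: tuple         # Field instances
    depends_on: tuple     # names of upstream views


def metadata_record(view: DocumentedView) -> dict:
    """Build a metadata record directly from the specification.

    Nothing is derived from logs or query parsing: documentation and
    lineage are whatever the specification already states.
    """
    return {
        "view": view.name,
        "description": view.comment,
        "columns": [
            {"name": f.name, "type": f.dtype, "description": f.comment}
            for f in view.fields
        ],
        "lineage": list(view.depends_on),
    }
```

Because the record is produced from the same object the scheduler would consume, it cannot drift from what is actually deployed, which is the property the talk attributes to integrated specifications.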

The talk illustrates this approach with Schedoscope, a scheduler developed at Otto Group based on integrated view specification, and Metascope, a collaborative metadata exploration tool built on top of Schedoscope. Schedoscope and Metascope drive Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops with a yearly revenue north of 5bn euros. Schedoscope has enabled Otto Group BI's small team of data engineers to continuously release new data sources and views for more than two years now; with Metascope, Otto Group's analysts and data scientists have access to always up-to-date metadata and documentation.

Schedoscope and Metascope are available as open source at http://schedoscope.org

Speakers:

Utz Westermann

Senior Data Architect at Otto Group

Otto Group
http://www.ottogroup.com/en/die-otto-group/Group-Companies.php

Utz Westermann is Senior Data Architect at Otto Group BI, Hamburg, with 18 years of experience in large-volume data processing.

He is the tech lead for Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops.
