Buoyancy in Data Lakes - Agile Metadata Management in Hadoop Data Warehouses at Otto Group


Current metadata and information about data lineage are crucial for understanding and interpreting data in a Hadoop data warehouse. At the same time, Hadoop data warehouse projects sink or swim with the ability to continuously add new data sources and views as business requirements evolve.

Conventional ETL workflow schedulers and metadata management approaches, however, prove to be millstones around a project's neck:

  • with every rollout of new views, significant effort has to be invested in one-time migration scripts for schema and data;
  • manual data documentation is laborious and quickly gets out of sync; and
  • automatic metadata derivation approaches are invasive, resource-taxing, and usually yield a perspective too technical for business users.

The talk proposes integrated specification of data structure, data dependencies, and computation logic as a way to keep ETL productive and metadata current.

Based on such specifications, a scheduler can automatically detect changes and perform

  • appropriate schema migrations and
  • data recomputations

as necessary, significantly reducing rollout and operations effort.
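To make the idea concrete, the following is a minimal sketch of how a scheduler might detect changes from an integrated specification. It is not Schedoscope's actual API: the names ViewSpec, schema_checksum, logic_checksum, and plan_actions are hypothetical, and checksum comparison is assumed here as one plausible change detection mechanism.

```python
import hashlib
import json
from dataclasses import dataclass


def _digest(value) -> str:
    """Stable checksum of a JSON-serializable value."""
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()


@dataclass(frozen=True)
class ViewSpec:
    """Hypothetical integrated specification of a single view."""
    name: str
    columns: tuple            # (column name, type, comment) triples
    depends_on: tuple = ()    # names of upstream views
    transformation: str = ""  # computation logic, e.g. a SQL string

    def schema_checksum(self) -> str:
        # Comments are documentation only, so they do not affect the schema.
        return _digest([(name, type_) for name, type_, _ in self.columns])

    def logic_checksum(self) -> str:
        return _digest([self.transformation, list(self.depends_on)])


def plan_actions(spec, deployed):
    """Compare a spec with the checksums recorded at the last rollout.

    `deployed` is a dict with keys "schema" and "logic", or None if the
    view has never been rolled out.
    """
    if deployed is None:
        return ["create_table", "compute"]
    actions = []
    if deployed["schema"] != spec.schema_checksum():
        actions.append("migrate_schema")
    # A schema migration also invalidates the data that is already stored.
    if actions or deployed["logic"] != spec.logic_checksum():
        actions.append("recompute")
    return actions
```

In this sketch, adding a column to a spec yields both a schema migration and a recomputation, changing only the transformation yields a recomputation, and editing only a column comment yields no action at all.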

Also, the very same specifications explicitly "program" rich metadata, avoiding

  • additional manual documentation labor and
  • automatic metadata derivation overhead and its low semantic level.

This greatly simplifies the implementation of metadata exploration tools.
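As a rough illustration of metadata being "programmed" by the specification itself, the sketch below reads documentation and lineage directly from declared views instead of deriving them from logs or query plans. The classes Field and View and the functions lineage and describe are hypothetical and are not taken from Schedoscope or Metascope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    comment: str  # business-level description written with the spec


@dataclass(frozen=True)
class View:
    name: str
    comment: str
    fields: tuple
    depends_on: tuple = ()  # names of upstream views


def lineage(view_name, catalog):
    """Return all transitive upstream view names, nearest first."""
    seen = []
    queue = list(catalog[view_name].depends_on)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in catalog:
            queue.extend(catalog[current].depends_on)
    return seen


def describe(view_name, catalog):
    """Render documentation for a view purely from its specification."""
    view = catalog[view_name]
    lines = [f"{view.name}: {view.comment}"]
    lines += [f"  {f.name} ({f.type}): {f.comment}" for f in view.fields]
    upstream = ", ".join(lineage(view_name, catalog)) or "(source data)"
    lines.append(f"  derived from: {upstream}")
    return "\n".join(lines)
```

Because the comments and dependencies live in the same artifact as the computation logic, a change to a view changes its documentation and lineage in the same step, which is the property an exploration tool can rely on.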

The talk illustrates this approach with Schedoscope, a scheduler developed at Otto Group based on integrated view specification, and Metascope, a collaborative metadata exploration tool built on top of Schedoscope. Schedoscope and Metascope drive Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops with a yearly revenue north of 5bn Euros. Schedoscope has enabled Otto Group BI's small team of data engineers to continuously release new data sources and views for more than two years now; with Metascope, Otto Group's analysts and data scientists have access to always up-to-date metadata and documentation.

Schedoscope and Metascope are available as open source at http://schedoscope.org