SEMANTICS is really proud to welcome Otto Group Senior Data Architect Utz Westermann as Keynote Speaker this year. We had the chance to do a quick interview and ask him about his current projects, the challenges he faces, and the trends he foresees, as well as his expectations for SEMANTICS 2017.
Utz, can you tell us something about your work/research focus?
The focus of our work at Otto Group BI (GBI) is to develop and operate a large-scale data platform that brings together user data -- including click activity, transactions, and CRM data -- from more than 120 online shops with a yearly revenue north of 7bn Euros. The GBI Data Platform, a Hadoop data lake, provides a common basis for analytics and machine learning for cross-device user recognition, user profiling, and personalization.
Which trends and challenges do you see for linked data/semantic web?
The central challenge we are facing is agility -- along multiple dimensions:
On the one hand, business requires us to respond to new requirements quickly. This entails adding new data sources, implementing pipelines for data cleansing, new views, or aggregates, rolling these changes out to production, and operating them -- all of that quickly and with only limited personnel.
On the other hand, the platform's constantly evolving data pool needs to remain accessible to our analysts and data scientists: documentation of the data's meaning, form, and lineage needs to be kept current.
Up to now, we have been able to keep up with the challenge by following an unconventional approach to data processing that is based on the declarative specification of operational data semantics.
In contrast to traditional ETL workflows, we specify the data in our platform holistically: its structure, the data it depends on, and the computation logic by which it is derived from that data.
The benefits of such a semantic approach are at least twofold:
Firstly, we were able to develop a job scheduler that allows us to quickly deploy new data views or change the logic of existing views without having to implement and coordinate the rollout manually.
Secondly, rich metadata tooling comes cheap: metadata and data lineage do not have to be inferred because they are "programmed" into the specifications.
We have been very successful with this approach within GBI and have published our scheduler -- Schedoscope -- and our metadata management tool -- Metascope -- as open-source software (http://schedoscope.org).
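To make this more concrete, here is a minimal Scala sketch of what such a holistic, declarative view specification could look like. It is an illustration under assumptions only, not the actual Schedoscope DSL; all type names, field names, and the toy lineage helper are hypothetical.

```scala
// Illustrative sketch only -- not the actual Schedoscope API.
// The idea: a view declares its structure (fields), the views it
// depends on, and the logic that derives it, all in one place.

// A field of a view, with a name and a (simplified) type tag.
case class Field(name: String, hiveType: String)

// A view bundles structure, dependencies, and computation logic.
trait View {
  def name: String
  def fields: Seq[Field]     // structure
  def dependsOn: Seq[View]   // lineage / scheduling dependencies
  def transformation: String // e.g. a HiveQL statement deriving the data
}

// Raw click events landing in the data lake (a "leaf" view with no dependencies).
object Clicks extends View {
  val name = "clicks"
  val fields = Seq(Field("user_id", "string"), Field("url", "string"), Field("ts", "timestamp"))
  val dependsOn = Seq.empty
  val transformation = "" // loaded directly from the source system
}

// A derived view: daily click counts per user, computed from Clicks.
object DailyClicksPerUser extends View {
  val name = "daily_clicks_per_user"
  val fields = Seq(Field("user_id", "string"), Field("day", "date"), Field("clicks", "bigint"))
  val dependsOn = Seq(Clicks)
  val transformation =
    """INSERT OVERWRITE TABLE daily_clicks_per_user
      |SELECT user_id, to_date(ts) AS day, count(*) AS clicks
      |FROM clicks GROUP BY user_id, to_date(ts)""".stripMargin
}

// A scheduler or metadata tool can walk dependsOn to derive both an
// execution order and lineage documentation -- nothing has to be
// inferred or maintained separately.
object Lineage extends App {
  def lineage(v: View): Seq[String] =
    v.dependsOn.flatMap(lineage) :+ v.name
  println(lineage(DailyClicksPerUser).mkString(" -> ")) // clicks -> daily_clicks_per_user
}
```

The point is not the particular syntax but that, from definitions like these, a scheduler such as Schedoscope can derive the rollout and execution order of views, and a tool such as Metascope can derive lineage documentation, without either having to be maintained separately.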
However, there is a new challenge: even more agility!
As we integrate data sources continuously, we need to scale up cluster capacity quickly, which is difficult when following traditional, rather coarse-grained corporate hardware purchasing processes. Elastic scaling in the cloud is becoming more and more reasonable. Also, more and more data arrives in the form of streams instead of batches: information should become available in near real-time, not once a day in a batch.
The problem is that Schedoscope's declarative specification of operational data semantics does not translate naturally to a world of data streams. Certainly, information about data structure, dependencies, and logic remains relevant for documenting and understanding data stream processing pipelines as well. However, finite-time job scheduling is no longer an issue, because data stream processing pipelines run continuously.
Questions of interest are rather:
How to deploy and scale stream processing pipelines elastically on cloud infrastructure without much operations overhead?
How to modify already running pipelines?
How to monitor them?
Essentially: what would a semantic description of data streams look like that could serve as a basis for dealing with those questions?
What are your expectations about Semantics 2017 in Amsterdam?
I am looking forward to SEMANTICS 2017 for stimulating input on these questions. Beyond our use cases at GBI outlined above, I think it is interesting to review tools and techniques from knowledge representation and the Semantic Web with regard to how they apply as near real-time streaming data increasingly moves into territory that used to belong to relatively static databases and batch jobs.
You work a lot on open-source projects. Doesn't that interfere with Otto's ambition to be a unique company?
The simple answer is: no!
Schedoscope and Metascope are generic tools not tied to retail use cases. We are not competing with other companies in terms of tools but in terms of building great online retail experiences. Our ability to do this is mostly determined by the engineering talent in our company and not so much by the tools used.
As probably every larger company will tell you, it has become difficult to hire talent in sufficient numbers on the current job market. The traditional pitch for open-sourcing software has been that you might get engineering work done by people outside your organization. Much more importantly, however, we have found that a shop window full of interesting open-source software is a great recruiting tool: it puts Otto on the radar of scarce talent that would never before have considered Otto the great place to work that it really is.