Semantic PDF Segmentation for Legacy Documents in Technical Documentation

Research & Innovation

PDF is the most common format for storing and delivering technical documentation. However, because the format is unstructured, these documents are often excluded from granular semantic access. While more and more companies are implementing XML-based component content management systems that can deliver annotated, structured content, older legacy documents remain in their monolithic form.

We developed a new approach that segments PDF documents into semantically related sections using classification knowledge gained from structured training content. This machine-learning-based approach is independent of any formatting information or visual cues.

In this paper, we combine the results of several previous works into a holistic procedure model. We introduce a parameterizable range-finding algorithm to refine segment detection and provide an RDF-based format for exchanging the generated metadata, which can then be used to improve information retrieval for users.
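
To make the idea of range finding and RDF export more concrete, here is a minimal sketch in Python. It is not the algorithm or the exchange format from the paper, which the abstract does not specify. It assumes that a classifier has already assigned one label to each unit of the PDF (for example a page or a text block). The parameters `max_gap` and `min_length`, the `Segment` type, the `ex:` vocabulary and the example.org IRIs are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    label: str
    start: int  # index of the first unit in the segment (inclusive)
    end: int    # index of the last unit in the segment (inclusive)


def find_ranges(labels: List[str], max_gap: int = 1, min_length: int = 2) -> List[Segment]:
    """Group per-unit labels into segments (illustrative, not the paper's method).

    max_gap:    longest interruption (in units) that is absorbed when the
                same label continues on both sides of it.
    min_length: segments shorter than this many units are discarded.
    """
    # Step 1: collect maximal runs of identical labels.
    runs: List[Segment] = []
    for i, label in enumerate(labels):
        if runs and runs[-1].label == label:
            runs[-1].end = i
        else:
            runs.append(Segment(label, i, i))

    # Step 2: bridge short interruptions between two runs with the same label.
    merged: List[Segment] = []
    for run in runs:
        if (len(merged) >= 2
                and merged[-2].label == run.label
                and merged[-1].end - merged[-1].start + 1 <= max_gap):
            merged.pop()               # drop the interrupting run
            merged[-1].end = run.end   # extend the earlier run across it
        else:
            merged.append(run)

    # Step 3: discard segments that are too short to be meaningful.
    return [s for s in merged if s.end - s.start + 1 >= min_length]


def to_turtle(doc_iri: str, segments: List[Segment]) -> str:
    """Serialize segments as Turtle using a made-up example vocabulary."""
    lines = ["@prefix ex: <http://example.org/seg#> .", ""]
    for n, seg in enumerate(segments):
        # Escape backslashes and quotes so the label is a valid string literal.
        label = seg.label.replace("\\", "\\\\").replace('"', '\\"')
        lines += [
            f"<{doc_iri}#segment-{n}> a ex:Segment ;",
            f'    ex:label "{label}" ;',
            f"    ex:startIndex {seg.start} ;",
            f"    ex:endIndex {seg.end} .",
            "",
        ]
    return "\n".join(lines)


# Hypothetical classifier output for eight consecutive units of one document.
predicted = ["safety", "safety", "maintenance", "safety", "safety",
             "maintenance", "maintenance", "maintenance"]
segments = find_ranges(predicted, max_gap=1, min_length=2)
turtle = to_turtle("http://example.org/docs/manual.pdf", segments)
```

In this sketch, a larger `max_gap` makes the segmentation more tolerant of isolated misclassifications, while a larger `min_length` suppresses short, possibly spurious segments. Both values would need to be tuned for the document type at hand.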

Slide deck:

S4.1 - Oevermann_SemanticPDFSegementation_Presentation.pdf

Speakers:

Jan Oevermann

German Research Center for Artificial Intelligence
https://2018.semantics.cc/

Jan Oevermann is a PhD candidate at the University of Bremen and Karlsruhe University of Applied Sciences. His research focuses on improving semantic access to technical documentation. He works as a team leader at ICMS GmbH and is a visiting researcher at the German Research Center for Artificial Intelligence (DFKI).
