Using Semantic Technology to Solve the Sparse Training Material Problem in Machine Learning for Classification of Company Websites


Since 2015, ds9 has been developing a large SEARCHCORPUS of companies in the biosciences market for Boehringer Ingelheim. This Biotech Companies SEARCHCORPUS is optimized on an ongoing basis to allow data scientists to quickly find licensing opportunities, acquisition targets and new technological developments of competitors.
Because the corpus comprises more than 10 million pages from approx. 50,000 corporate websites, even highly specific expert searches return hundreds of potential targets that need to be verified manually.
This presentation describes an approach that restricts searches to the company types expected as targets among the search hits. It does so by automatically classifying companies, based on their corporate websites, into classes such as Client Research Organizations, Electronic Health or Big Pharma.

The main problem in automatically classifying companies based on website content is that training data must be manually selected and qualified. Where machine learning usually expects thousands of training samples, semantic technology allowed us to use classic machine learning algorithms to build classifiers that converge with fewer than 100 records of training data.
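
The abstract does not say how the semantic layer is built, so the following is only a minimal sketch of one way such an approach could work, not the method used for the SEARCHCORPUS. It assumes that website text is first mapped onto a small set of concepts, so that a classic algorithm (here a nearest-centroid classifier with cosine similarity) sees a few dense features instead of a huge, sparse vocabulary. The concept lexicon, the concept IDs, the sample texts and all function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical concept lexicon: many surface terms collapse onto few concepts.
# A real system would draw these mappings from a thesaurus or ontology.
CONCEPTS = {
    "clinical trial": "C_TRIAL",
    "clinical trials": "C_TRIAL",
    "phase ii study": "C_TRIAL",
    "outsourcing": "C_SERVICE",
    "contract services": "C_SERVICE",
    "telemedicine": "C_DIGITAL",
    "mobile app": "C_DIGITAL",
    "health app": "C_DIGITAL",
    "product portfolio": "C_PORTFOLIO",
    "global pipeline": "C_PORTFOLIO",
}


def concept_vector(text):
    """Count how often each concept is mentioned in the text."""
    text = text.lower()
    counts = Counter()
    for term, concept in CONCEPTS.items():
        hits = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
        if hits:
            counts[concept] += hits
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one centroid per class by summing the concept vectors."""
    centroids = defaultdict(Counter)
    for text, label in samples:
        centroids[label].update(concept_vector(text))
    return dict(centroids)


def classify(centroids, text):
    """Return the class whose centroid is most similar to the text."""
    vec = concept_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Invented training records, one per class, for illustration only.
TRAINING = [
    ("We run clinical trials and offer contract services to sponsors.",
     "Client Research Organizations"),
    ("Our telemedicine platform and mobile app connect patients.",
     "Electronic Health"),
    ("A global pipeline and a broad product portfolio.",
     "Big Pharma"),
]

centroids = train(TRAINING)
print(classify(centroids, "Outsourcing of your Phase II study"))
```

The unseen text in the last line shares no surface words with any training record, yet it maps to the same concepts as the first one. This kind of generalization is a plausible reason why far fewer labelled samples may be needed once text has been semantically annotated.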


Interested in this talk?

Register for SEMANTiCS conference