Conversion of Regulatory Documents from PDF to Plain Text Format (currently reserved)

Business process compliance aims at checking and ensuring that business processes obey the relevant constraints imposed on them. Sources of constraints are typically regulatory documents, i.e., unstructured textual data. Text mining and natural language processing (NLP) techniques make it possible to extract constraints from such documents. However, before these techniques can be applied, the regulatory documents need to be prepared. The challenge is that these documents are mostly available in PDF format and can contain figures, tables, footnotes, etc., which introduce considerable noise and hamper the direct application of text mining and NLP techniques without extensive preprocessing.

The purpose of this bachelor thesis is to develop means for converting regulatory documents from PDF format to plain text format while at the same time filtering out the parts of the document that could cause noise. A software prototype shall be developed.
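
To give a rough idea of what the noise filtering could look like, the following Python sketch removes a few typical noise lines from text that has already been extracted from a PDF. It is only an illustration under stated assumptions, not a prescribed approach: the function names `is_noise` and `filter_noise`, the regular expressions, and the 50% letter threshold are all hypothetical choices, and the PDF extraction step itself (which would need a PDF library) is not shown.

```python
import re

# Hypothetical patterns for illustration; real documents will need tuning.
PAGE_NUMBER = re.compile(r"^(page\s+)?\d+(\s*(/|of)\s*\d+)?$", re.IGNORECASE)
CAPTION = re.compile(r"^(figure|fig\.|table)\s+\d+", re.IGNORECASE)


def is_noise(line: str) -> bool:
    """Heuristically decide whether an extracted line is noise."""
    stripped = line.strip()
    if not stripped:
        return False  # keep blank lines as paragraph separators
    if PAGE_NUMBER.match(stripped) or CAPTION.match(stripped):
        return True
    # Assumption: lines made up mostly of non-letters are table rows.
    letters = sum(ch.isalpha() for ch in stripped)
    return letters / len(stripped) < 0.5


def filter_noise(text: str) -> str:
    """Return the text with all lines classified as noise removed."""
    return "\n".join(line for line in text.splitlines() if not is_noise(line))


sample = "\n".join([
    "Article 5",
    "The controller shall document each processing step.",
    "Figure 2: Process overview",
    "12  34.5  67%  8",
    "Page 3 of 10",
    "Records must be retained for five years.",
])
cleaned = filter_noise(sample)
```

In this sketch the caption, the numeric table row, and the page marker are dropped, while the heading and the two regulatory sentences are kept. Line-based heuristics like these are easy to inspect but will misclassify some content, so a prototype would likely have to combine them with layout information from the PDF.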

Contact: bachelor.i17 [at] in.tum.de