Inclusion of Reference Information in Regulatory Documents

An estimated 80% of all data is unstructured (e.g. text, images, video, audio), with text being the largest human-generated data source. Natural language processing (NLP) is a branch of AI that helps computers understand, interpret and manipulate human language. A major task in NLP is extracting relevant information into a representation that machine learning models can process.

One of the key challenges is the extraction of context-based meaning, including the integration of knowledge referred to in other paragraphs, articles or documents. For example, if a text states "the data protection needs to meet the requirements in article 5", then grasping the whole context requires making a connection to the information contained in "article 5". Such references could be resolved, for example, by enriching the text with the relevant information from "article 5", or by creating a knowledge graph that contains all references and their content.
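The two resolution strategies mentioned above can be illustrated with a minimal Python sketch. It is a toy example under strong assumptions, not a proposed solution: the `ARTICLES` corpus, the regular expression and all function names are invented for illustration, and real regulatory texts use far more varied reference forms than a literal "article N".

```python
import re

# Hypothetical toy corpus (invented for illustration): article number -> text.
ARTICLES = {
    "4": "Personal data must be encrypted in transit; see article 5 for storage rules.",
    "5": "Personal data must be stored in encrypted form.",
}

# Assumption: references appear literally as "article <number>".
REFERENCE_PATTERN = re.compile(r"\barticle\s+(\d+)\b", re.IGNORECASE)


def find_references(text):
    """Return the referenced article numbers in order of first appearance."""
    seen = []
    for number in REFERENCE_PATTERN.findall(text):
        if number not in seen:
            seen.append(number)
    return seen


def enrich(text, articles):
    """Strategy 1: insert the referenced article's text in brackets after each mention."""
    def expand(match):
        content = articles.get(match.group(1))
        if content is None:
            return match.group(0)  # unresolved reference: leave the mention untouched
        return f"{match.group(0)} [{content}]"
    return REFERENCE_PATTERN.sub(expand, text)


def build_reference_graph(articles):
    """Strategy 2: build an adjacency list mapping each article to the articles it cites."""
    return {number: find_references(body) for number, body in articles.items()}


sentence = "The data protection needs to meet the requirements in article 5."
enriched = enrich(sentence, ARTICLES)
graph = build_reference_graph(ARTICLES)
print(enriched)
print(graph)
```

In this sketch the enrichment is not recursive: a reference inside an inserted article text is left unresolved. The adjacency list is only a stand-in for a knowledge graph, which would typically also store the article content and the type of each reference.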

Based on a structured literature review on the topic, the aim of this thesis is to propose and implement a solution for resolving references in regulatory documents. The student is expected to conduct a scientific literature review of state-of-the-art approaches to the challenge stated above. The knowledge gained from this review should then be applied to implement, in Python, a prototype of the most promising approach (or the student's own improvement of it) for regulatory documents. The focus of this bachelor thesis is the implementation.

Prior knowledge in these fields is not required, but applicants should bring a strong interest in the topic and the motivation to build up knowledge and improve their programming skills.

Contact: bachelor.i17 [at] in.tum.de