Guided Research Maria Nakhla

Abstract

The extraction of domain-specific keywords from textual data, a critical application within Natural Language Processing (NLP), has gained substantial importance in the contemporary data-driven landscape. The research concern is that there is a high chance of extracting keywords that deviate from the core meaning of the domain. This is due to the possibility of nth-child keyword relations being introduced that do not directly relate to the main domain goal. Thus, further keyword filtering is a crucial step to ensure that all keywords actually belong to the target domain. The methodology consists of two main steps. The first is clustering: multiple clustering techniques are investigated, in particular a convex hull approach. The second step removes outliers; various techniques were tested, such as Isolation Forest and Local Outlier Factor. As a final step, text-embedding similarity measures that make use of WordNet and ConceptNet are applied. The techniques are evaluated using recall, precision, and F1-score, as well as with the help of domain experts. The results using the convex hull clustering approach are quite promising. The hybrid method, which combines clustering, outlier detection, and semantic similarity, proved able to remove irrelevant class-specific keywords.
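The outlier-removal step named in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn's LocalOutlierFactor and IsolationForest and uses made-up two-dimensional points in place of real keyword embeddings; the keyword names, the parameter values, and the helper functions are hypothetical and are not taken from the project itself.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy data: nine keywords on a tight 3x3 grid (one "class")
# plus one distant point standing in for an off-topic keyword.
keywords = [f"kw_{i}" for i in range(9)] + ["off_topic"]
grid = [(x, y) for x in range(3) for y in range(3)]
embeddings = np.array(grid + [(10, 10)], dtype=float)


def filter_with_lof(words, vectors, n_neighbors=3):
    """Keep the words that Local Outlier Factor labels as inliers (+1)."""
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(vectors)
    return [w for w, label in zip(words, labels) if label == 1]


def filter_with_isolation_forest(words, vectors, contamination=0.1, seed=0):
    """Keep the words that Isolation Forest labels as inliers (+1)."""
    model = IsolationForest(contamination=contamination, random_state=seed)
    labels = model.fit_predict(vectors)
    return [w for w, label in zip(words, labels) if label == 1]


kept_lof = filter_with_lof(keywords, embeddings)
kept_iforest = filter_with_isolation_forest(keywords, embeddings)
print("LOF keeps:", kept_lof)
print("Isolation Forest keeps:", kept_iforest)
```

In practice the vectors would come from a text-embedding model and have far more dimensions, and the neighbourhood size and contamination rate would need tuning per class; the values above are chosen only to suit the toy data.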

Research Questions

Which clustering approaches currently exist that can be utilized to cluster keywords based on relevance to a class?
What are possible outlier detection methods that could also help to achieve a more class-specific keyword set?
Could different methods be combined for better results?
In which ways can the resulting filtered keyword set be evaluated?
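For the last question, the abstract names precision, recall, and F1-score as evaluation measures. A minimal sketch of how they could be computed for a filtered keyword set is given here, assuming a gold set of relevant keywords (for example, one curated by domain experts) is available. The function name and the example keywords are hypothetical.

```python
def evaluate_keyword_set(filtered, gold):
    """Return precision, recall and F1 of a filtered keyword set against a gold set."""
    filtered, gold = set(filtered), set(gold)
    true_positives = len(filtered & gold)
    precision = true_positives / len(filtered) if filtered else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: one irrelevant keyword survived the filter,
# and one relevant keyword was filtered out.
filtered_keywords = {"neural network", "gradient descent", "backpropagation", "banana"}
gold_keywords = {"neural network", "gradient descent", "backpropagation", "overfitting"}

precision, recall, f1 = evaluate_keyword_set(filtered_keywords, gold_keywords)
print(precision, recall, f1)  # 0.75 0.75 0.75
```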

Attribute	Value
Title (de)	Clusterbasiertes korrigierendes Filtern von klassenspezifischen Schlüsselwortgruppen
Title (en)	Cluster-based Corrective Filtering of Class-specific Keyword Sets
Project	CreateData4AI (CD4AI)
Type	Guided Research
Status	completed
Student	Maria Nakhla
Advisor	Stephen Meisenbacher, Tim Schopf
Supervisor	Prof. Dr. Florian Matthes
Start Date	17.10.2023
Sebis Contributor Agreement signed on	19.10.2023
Checklist filled	Yes
Submission date	15.04.2024


Chair of Software Engineering for Business Information Systems

Prof. Dr. Florian Matthes