Guided Research Maria Nakhla
Abstract
The extraction of domain-specific keywords from textual data, a critical application within Natural Language Processing (NLP), has gained substantial importance in the contemporary data-driven landscape. The research concern is that there is a considerable risk of extracting keywords that deviate from the core meaning of the domain. This is due to the possibility that nth-level child keyword relations are introduced which do not directly relate to the main domain goal. Thus, further keyword filtering is a crucial step to ensure that all keywords actually belong to the target domain. The methodology consists of two main steps. The first is clustering: in this phase, multiple clustering techniques are investigated, with particular focus on a convex hull approach. The second step removes outliers, for which various techniques have been tested, such as Isolation Forest and Local Outlier Factor. As a final step, text-embedding similarity measures are applied with the help of WordNet and ConceptNet. The techniques are evaluated using recall, precision, and F1-score, and additionally with the help of domain experts. The results obtained with the convex hull clustering approach are promising. The hybrid method, which combines clustering, outlier detection, and semantic similarity, has proved able to remove irrelevant keywords from class-specific keyword sets.
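The outlier-removal step named in the abstract can be illustrated with a minimal sketch. It assumes that each keyword is already represented by an embedding vector and uses scikit-learn's `IsolationForest` and `LocalOutlierFactor`. The function name `filter_keywords`, the rule of keeping only keywords that both detectors accept, and the parameter values are illustrative assumptions, not the configuration used in this work.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor


def filter_keywords(keywords, embeddings, n_neighbors=5, random_state=0):
    """Return the keywords that both detectors label as inliers.

    Hypothetical helper: combining the two detectors with a logical AND
    is an assumption made for this sketch.
    """
    X = np.asarray(embeddings, dtype=float)

    # fit_predict returns +1 for inliers and -1 for outliers.
    iso_labels = IsolationForest(random_state=random_state).fit_predict(X)
    lof_labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X)

    keep = (iso_labels == 1) & (lof_labels == 1)
    return [kw for kw, flag in zip(keywords, keep) if flag]


# Toy data: 20 keywords in a tight cluster plus one far-away keyword.
rng = np.random.default_rng(0)
toy_embeddings = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)), [[100.0, 100.0]]])
toy_keywords = [f"kw{i}" for i in range(20)] + ["off_topic"]

kept = filter_keywords(toy_keywords, toy_embeddings)
```

On the toy data, the far-away point `off_topic` is dropped, while most of the clustered keywords are kept. Real keyword embeddings are high-dimensional, so parameters such as `n_neighbors` would need tuning per class.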
Research Questions
- Which clustering approaches currently exist that can be utilized to cluster keywords based on relevance to a class?
- What are possible outlier detection methods that could also help to achieve a more class-specific keyword set?
- Could different methods be combined for better results?
- In which ways can the resulting filtered keyword set be evaluated?
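For the last question, the abstract names precision, recall, and F1-score. A minimal sketch of how these could be computed for one class follows, assuming a gold-standard keyword set (for example, one curated by domain experts) is available. The function name and the example keywords are hypothetical.

```python
def evaluate_keyword_set(predicted, gold):
    """Compute precision, recall, and F1 of a filtered keyword set
    against a gold-standard set for the same class."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: 2 of 4 filtered keywords appear in a gold set of 3.
filtered = ["tokenization", "parsing", "stemming", "football"]
gold = ["tokenization", "parsing", "lemmatization"]
precision, recall, f1 = evaluate_keyword_set(filtered, gold)
```

Here precision is 2/4, recall is 2/3, and F1 is 4/7. Set-based matching treats keywords as exact strings, so near-duplicates or spelling variants would need normalization beforehand.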
| Attribute | Value |
|---|---|
| Title (de) | Clusterbasiertes korrigierendes Filtern von klassenspezifischen Schlüsselwortgruppen |
| Title (en) | Cluster-based Corrective Filtering of Class-specific Keyword Sets |
| Project | CreateData4AI (CD4AI) |
| Type | Guided Research |
| Status | completed |
| Student | Maria Nakhla |
| Advisor | Stephen Meisenbacher, Tim Schopf |
| Supervisor | Prof. Dr. Florian Matthes |
| Start Date | 17.10.2023 |
| Sebis Contributor Agreement signed on | 19.10.2023 |
| Checklist filled | Yes |
| Submission date | 15.04.2024 |