Bachelor's Thesis Weixin Yan
Leveraging Domain Knowledge for Class-Specific Keyword Extraction
Abstract
Around 80% of the data generated today are unannotated, unstructured text, which makes them difficult for AI applications to leverage effectively. Manual annotation by domain experts can provide high precision and incorporate domain-specific knowledge, but it is expensive, inefficient, and unscalable. This motivates a hybrid approach that combines Natural Language Processing techniques with domain expertise to annotate and classify text data more efficiently. The proposed approach is divided into a pipeline of multiple sub-tasks with the goal of creating meaningful datasets that are classified according to defined features.
The first step in this pipeline is to support the domain expert in defining the classes with the help of keyword extraction techniques, which is the focus of this thesis. In this context, the role of a domain expert involves conceptualizing the desired classes by assigning relevant tags or creating class descriptions. This domain-specific knowledge can then be injected into state-of-the-art keyword extraction methods, offering support for the domain expert to better identify related class-specific keywords and potentially refine the scope of the class. The objective is to create a more efficient and accurate approach to keyword extraction that is tailored to the specific needs of the domain expert.
The results of this study can provide a valuable contribution to the development of domain-specific datasets for AI applications, particularly for small and medium-sized companies with limited resources. The evaluation of the modified approach will involve domain experts and their assessment of the comprehensiveness of the resulting keyword sets.
Research Questions
How can domain experts be supported in the definition of classes for characterizing large text corpora, particularly in the creation of keywords and keyphrases?
- What approaches currently exist that can be utilized to extract keywords and keyphrases from large unstructured text corpora?
- How can short textual class descriptions and class-specific seed keywords from the WZ2008 classification, validated by domain experts, be leveraged to adapt the identified keyword extraction approaches for class-specific keyword extraction?
- Without the use of external knowledge bases, how can the extracted class-specific keywords be used as a basis for the generation of further class-specific keywords?
- In what way can the modified approach be evaluated by domain experts to validate the representativeness of the resulting keyword sets?
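
To make the idea of seed-guided, class-specific keyword extraction more concrete, the following is a minimal illustrative sketch and not the method developed in the thesis. It assumes a toy co-occurrence heuristic: a candidate term is scored by the share of its documents that also contain at least one expert-provided seed keyword. The function names, the stopword list, and the scoring rule are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "and", "our", "from", "every", "a", "an", "of", "in", "for"}


def tokenize(text):
    """Lowercase the text and return alphabetic tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def class_specific_keywords(docs, seeds, top_n=5):
    """Rank candidate terms by how exclusively they co-occur with seed keywords.

    Assumed scoring rule (not taken from the thesis):
    score(term) = (#docs containing term and a seed) / (#docs containing term)
    Ties are broken by the co-occurrence count, then alphabetically.
    """
    seed_set = {s.lower() for s in seeds}
    doc_freq = Counter()       # documents containing the term
    seed_doc_freq = Counter()  # documents containing the term and any seed

    for doc in docs:
        terms = set(tokenize(doc))
        has_seed = bool(terms & seed_set)
        for term in terms - seed_set:
            doc_freq[term] += 1
            if has_seed:
                seed_doc_freq[term] += 1

    # Only terms that co-occur with a seed at least once are candidates.
    ranked = sorted(
        seed_doc_freq,
        key=lambda t: (-seed_doc_freq[t] / doc_freq[t], -seed_doc_freq[t], t),
    )
    return ranked[:top_n]


if __name__ == "__main__":
    example_docs = [
        "The bakery produces bread and pastry every morning.",
        "Our bakery sells fresh bread and cakes.",
        "The software company develops cloud applications.",
        "Fresh pastry and cakes from the local bakery.",
        "The company sells cloud software.",
    ]
    print(class_specific_keywords(example_docs, seeds=["bakery"], top_n=4))
```

In such a setup, the domain expert would review the ranked terms, accept the relevant ones, and feed them back in as additional seeds, which loosely mirrors the iterative keyword generation asked about in the third research question without relying on an external knowledge base.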
| Attribute | Value |
|---|---|
| Title (de) | Nutzung von Domänenwissen für die Extraktion klassenspezifischer Schlüsselwörter |
| Title (en) | Leveraging Domain Knowledge for Class-Specific Keyword Extraction |
| Project | CreateData4AI (CD4AI) |
| Type | Bachelor's Thesis |
| Status | completed |
| Student | Weixin Yan |
| Advisor | Stephen Meisenbacher, Tim Schopf |
| Supervisor | Prof. Dr. Florian Matthes |
| Start Date | 15.04.2023 |
| Sebis Contributor Agreement signed on | 06.04.2023 |
| Checklist filled | Yes |
| Submission date | 15.09.2023 |