Gl21a

Data Scarcity: Methods to Improve the Quality of Text Classification

Legal document analysis is an important research area. The classification of clauses or sentences enables valuable insights such as the extraction of rights and obligations. However, datasets consisting of contracts or other legal documents are quite rare, particularly for the German language. The high cost of manually labeling data, especially for text classification, motivates many studies that suggest alternative methods to overcome the lack of labeled data.

This paper investigates the effects of text data augmentation on the quality of classification tasks. While a large number of techniques exist, this work examines a selected subset including semi-supervised learning methods and thesaurus-based data augmentation. We show not only that thesaurus-based data augmentation as well as text augmentation with synonyms and hypernyms can improve the classification results, but also that the effect of such methods depends on the underlying data structure.
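To illustrate what augmentation with synonyms and hypernyms can look like in practice, here is a minimal Python sketch. It is not the authors' implementation: the toy thesaurus `TOY_THESAURUS`, the functions `augment` and `augment_dataset`, and the `rate` parameter are hypothetical names introduced only for this example, and a real setup would draw on a proper lexical resource instead of a hand-written dictionary.

```python
import random

# Hypothetical toy thesaurus for illustration only; the lexical resource
# used in the paper is not specified on this page.
TOY_THESAURUS = {
    "tenant": {"synonyms": ["lessee"], "hypernyms": ["party"]},
    "contract": {"synonyms": ["agreement"], "hypernyms": ["document"]},
    "pay": {"synonyms": ["remit"], "hypernyms": ["transfer"]},
}


def augment(tokens, thesaurus, relation="synonyms", rate=0.5, rng=None):
    """Return a copy of `tokens` where known words are swapped for a
    related word (synonym or hypernym) with probability `rate`."""
    if rng is None:
        rng = random.Random(0)
    result = []
    for token in tokens:
        candidates = thesaurus.get(token.lower(), {}).get(relation, [])
        if candidates and rng.random() < rate:
            result.append(rng.choice(candidates))
        else:
            result.append(token)
    return result


def augment_dataset(samples, thesaurus, relation="synonyms", rate=0.5, seed=0):
    """Extend a list of (text, label) pairs with augmented variants.
    The label is kept, assuming the replacement preserves the class."""
    rng = random.Random(seed)
    augmented = list(samples)
    for text, label in samples:
        new_text = " ".join(augment(text.split(), thesaurus, relation, rate, rng))
        if new_text != text:  # skip variants identical to the original
            augmented.append((new_text, label))
    return augmented


if __name__ == "__main__":
    data = [("The tenant shall pay", "obligation")]
    for text, label in augment_dataset(data, TOY_THESAURUS, rate=1.0):
        print(label, "|", text)
```

The sketch keeps the original label for every augmented sentence, which only holds if the replacement does not change the class. Hypernym replacement generalizes a term and is therefore more likely to alter the meaning of a clause than synonym replacement.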

Attribute	Value
Address	Virtual
Authors	Ingo Glaser, Shabnam Sadegharmaki, Basil Komboz, Prof. Dr. Florian Matthes
Citation	Glaser, I.; Sadegharmaki, S.; Komboz, B.; Matthes, F.: Data Scarcity: Methods to Improve the Quality of Text Classification, ICPRAM: International Conference on Pattern Recognition Applications and Methods, Virtual, 2021
Key	Gl21a
Research project	Semantic Analysis of Court Rulings
Title	Data Scarcity: Methods to Improve the Quality of Text Classification
Type of publication	Conference
Year	2021

Chair of Software Engineering for Business Information Systems

Prof. Dr. Florian Matthes