Poster in Affinity Workshop: New In ML
SciAnnotate: A Tool for Combining Weak Labeling Sources for Sequence Labeling
Many natural language processing tasks rely on large amounts of labeled data to achieve high performance. However, annotating corpora often requires careful scrutiny of the text and domain-specific background knowledge, which is particularly challenging in highly specialized domains such as biology and medicine. In these domains, terms and phrases often have specific, dedicated meanings and rarely exhibit the polysemy that is prevalent in general language. This characteristic makes weak labeling techniques, such as regex pattern matching, particularly effective for annotating datasets in these domains. We therefore developed SciAnnotate, an online text annotation tool that supports the creation of weak labels alongside manual annotation. The tool allows users to generate weak labels using multiple text-matching and customized labeling functions, and it can be integrated with user-provided language models. Additionally, we explore a case study that uses the Bertifying Conditional Hidden Markov Model to refine the weak labels generated by our tool, further improving annotation quality in specialized domains.
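To make the idea of combining regex-based weak labeling sources concrete, the following is a minimal sketch in Python. It is not SciAnnotate's API: the function names, the example patterns, the BIO tagging scheme, and the majority-vote aggregation are all assumptions chosen for illustration. In particular, the majority vote is a simple stand-in for a learned label model such as the conditional hidden Markov model mentioned in the abstract.

```python
import re
from collections import Counter

def regex_labeling_function(pattern, label):
    """Build a hypothetical labeling function that marks every regex match with `label`."""
    compiled = re.compile(pattern)
    def apply(text):
        # Each weak label is a character span: (start, end, label).
        return [(m.start(), m.end(), label) for m in compiled.finditer(text)]
    return apply

def tokenize(text):
    """Split into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]

def spans_to_bio(tokens, spans):
    """Convert character spans into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token is tagged only if it lies fully inside the span.
            if tok_start >= start and tok_end <= end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

def majority_vote(tag_lists):
    """Combine per-token votes; "O" counts as an abstention (an assumption).

    Ties are broken in favor of the tag that was voted first.
    """
    combined = []
    for votes in zip(*tag_lists):
        non_o = [v for v in votes if v != "O"]
        combined.append(Counter(non_o).most_common(1)[0][0] if non_o else "O")
    return combined

# Illustrative patterns only; real domain patterns would be far more careful.
labeling_functions = [
    regex_labeling_function(r"\b[A-Z]{2,}[0-9]+\b", "GENE"),
    regex_labeling_function(r"\bbreast cancer\b", "DISEASE"),
    regex_labeling_function(r"\b\w+ cancer\b", "DISEASE"),
]

text = "BRCA1 mutations increase the risk of breast cancer."
tokens = tokenize(text)
weak_tags = [spans_to_bio(tokens, lf(text)) for lf in labeling_functions]
combined_tags = majority_vote(weak_tags)

for (token, _, _), tag in zip(tokens, combined_tags):
    print(f"{token}\t{tag}")
```

Treating "O" as an abstention lets a single confident pattern contribute a label even when the other sources are silent, which suits domains where terms are rarely ambiguous. A learned aggregation model would instead estimate how reliable each source is and weight its votes accordingly.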