Two new libraries have been published under the Orange-OpenSource organization on GitHub:
- Lexical-corrector – a C++ library and a Java package for fast lexicon lookup with spelling correction based on Levenshtein distance. The edit costs can be defined per type of typographical error (diacritics, case, adjacent keys).
- Text-tokenizer – a C++ library that segments raw UTF-8 text into typed tokens using a set of regular expressions. Tokenization is a basic building block of almost all Natural Language Processing (NLP) approaches. The library has a simple API and is initialized with a rule file that defines the token types and the regular expressions to match.
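The idea behind error-dependent Levenshtein costs can be illustrated with a minimal Python sketch. This is not the Lexical-corrector API: the function names, the cost values and the tiny keyboard-adjacency table are assumptions chosen for illustration only.

```python
import unicodedata

# Hypothetical excerpt of a QWERTY adjacency table (illustration only).
ADJACENT_KEYS = {frozenset(pair) for pair in ("qw", "we", "er", "rt", "as", "sd")}


def strip_diacritics(ch):
    """Return the character without combining marks, e.g. 'é' -> 'e'."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def substitution_cost(a, b):
    """Cost of replacing a by b; the values are assumed, not the library's."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.2  # case error
    if strip_diacritics(a) == strip_diacritics(b):
        return 0.3  # diacritic error
    if frozenset((a.lower(), b.lower())) in ADJACENT_KEYS:
        return 0.5  # adjacent keys on the keyboard
    return 1.0  # unrelated characters


def weighted_levenshtein(s, t):
    """Levenshtein distance with unit insert/delete and weighted substitution."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [float(i)]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1.0,                       # delete a
                cur[j - 1] + 1.0,                    # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute or match
            ))
        prev = cur
    return prev[-1]
```

With such costs, a lookup for `cafe` ranks `café` as a much closer candidate than a word that differs by an unrelated letter, which is the behavior a typo-aware lexicon correction aims for.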
This is a contribution to the NLP community and to Orange’s academic partners. Both libraries were written by Johannes Heinecke of Orange Labs Services and are available under the 3-Clause BSD License.
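For readers unfamiliar with rule-driven tokenization as described for Text-tokenizer, the following Python sketch shows the general principle: an ordered list of token types, each with a regular expression, applied to raw text. The rule names, the patterns and the `tokenize` function are hypothetical and do not reflect the library's rule-file format or API.

```python
import re

# Hypothetical rules: (token type, regular expression), tried in this order.
RULES = [
    ("NUMBER", r"\d+(?:[.,]\d+)?"),
    ("WORD", r"\w+"),
    ("PUNCT", r"[^\w\s]"),
]


def compile_rules(rules):
    """Combine all rules into one pattern with a named group per token type."""
    return re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in rules))


def tokenize(text, rules=RULES):
    """Return (token type, token text) pairs; whitespace is skipped."""
    matcher = compile_rules(rules)
    return [(m.lastgroup, m.group()) for m in matcher.finditer(text)]
```

Because the alternatives are tried in order, placing `NUMBER` before `WORD` ensures that `3.14` is returned as a single number token and not split at the decimal point.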