langlab.core.parsers
This module contains tools for parsing text into sentences and words.
en-split-sentences-bi
(en-split-sentences-bi s)
Convenience alias for lg-split-sentences-bi with the language set to English.
en-split-sentences-icu-bi
(en-split-sentences-icu-bi s)
Convenience alias for lg-split-sentences-icu-bi with the language set to English.
en-split-tokens-bi
(en-split-tokens-bi s)
Convenience alias for lg-split-tokens-bi with the language set to English.
en-split-tokens-icu-bi
(en-split-tokens-icu-bi s)
Convenience alias for lg-split-tokens-icu-bi with the language set to English.
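A brief usage sketch for the four English convenience aliases above, assuming langlab.core.parsers is required as parsers; the sample text is arbitrary and the results shown in the comments are illustrative, not verified output.

  (require '[langlab.core.parsers :as parsers])

  (parsers/en-split-sentences-bi "Dogs bark. Cats purr.")
  ;; => a seq of two sentences, e.g. ("Dogs bark." "Cats purr.")

  (parsers/en-split-tokens-bi "Dogs bark loudly.")
  ;; => a seq of word tokens, e.g. ("Dogs" "bark" "loudly")

The *-icu-bi aliases take the same single string argument and differ only in the underlying BreakIterator implementation.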
lg-split-sentences-bi
(lg-split-sentences-bi lang s)
Splits s into a seq of sentences using the standard BreakIterator class. The language is set to lang.
Note: it is not clear how the locale is actually used by BreakIterator.
lg-split-sentences-icu-bi
(lg-split-sentences-icu-bi lang s)
Splits s into a seq of sentences using the ICU BreakIterator class. The language is set to lang.
Note: it is not clear how the locale is actually used by BreakIterator.
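A sketch of the language-parametrised sentence splitters (namespace required as parsers, as above), assuming lang is a language code such as "de"; the sample sentences and the exact boundaries are illustrative.

  (parsers/lg-split-sentences-bi "de" "Der Hund bellt. Die Katze schläft.")
  ;; => e.g. ("Der Hund bellt." "Die Katze schläft.")

  (parsers/lg-split-sentences-icu-bi "de" "Der Hund bellt. Die Katze schläft.")
  ;; same interface, backed by the ICU BreakIterator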
lg-split-tokens-bi
(lg-split-tokens-bi lang s)
Splits s into a seq of words using the standard BreakIterator class. The language is set to lang.
Note: it is not clear how the locale is actually used by BreakIterator.
lg-split-tokens-icu-bi
(lg-split-tokens-icu-bi lang s)
Splits s into a seq of words using the ICU BreakIterator class. The language is set to lang.
Note: it is not clear how the locale is actually used by BreakIterator.
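A corresponding sketch for the token splitters; ICU's word BreakIterator uses dictionary-based segmentation for scripts written without spaces (e.g. Thai), which is where the ICU variant is most useful. The tokens shown are an assumption, not verified output.

  (parsers/lg-split-tokens-bi "fr" "Le chien aboie.")
  ;; => e.g. ("Le" "chien" "aboie")

  (parsers/lg-split-tokens-icu-bi "th" "สวัสดีครับ")
  ;; => word tokens found by ICU's dictionary-based word breaking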
make-split-sentences-onlp
(make-split-sentences-onlp model-fname)
Creates an OpenNLP sentence splitter using the model loaded from the file model-fname.
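A sketch of how this factory might be used, assuming it returns a splitter function of one string and that a pre-trained OpenNLP sentence model is available; the model path below is hypothetical.

  (def split-sentences-en
    (parsers/make-split-sentences-onlp "models/en-sent.bin"))   ; hypothetical model file

  (split-sentences-en "Dogs bark. Cats purr.")
  ;; => a seq of sentences, analogous to the BreakIterator-based splitters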
make-split-tokens-onlp
(make-split-tokens-onlp model-fname)
Creates an OpenNLP token splitter using the model loaded from the file model-fname.
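An analogous sketch for the token-splitter factory, under the same assumptions (the returned value is a function; the model path is hypothetical).

  (def split-tokens-en
    (parsers/make-split-tokens-onlp "models/en-token.bin"))   ; hypothetical model file

  (split-tokens-en "Dogs bark loudly.")
  ;; => a seq of word tokens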
split*
(split* s re)
Splits s on the regexp re but, unlike string/split, keeps the regexp matches in the resulting seq.
A similar effect can be achieved with look-arounds, but it is clumsy; see
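A sketch contrasting split* with clojure.string/split; the exact shape of the result (including how the separators are represented) is an assumption based on the description above.

  (require '[clojure.string :as string])

  (string/split "a, b; c" #"[,;] ")
  ;; => ["a" "b" "c"]                    separators are dropped

  (parsers/split* "a, b; c" #"[,;] ")
  ;; => e.g. ("a" ", " "b" "; " "c")     separators are kept in the seq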
split-tokens-simple-lucene
(split-tokens-simple-lucene s)
Splits s on whitespace and removes punctuation. The splitter is based on Lucene's Simple Analyzer.
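An illustrative call; whether the tokens are lower-cased depends on the underlying Lucene analyzer, so the output shown is only a guess.

  (parsers/split-tokens-simple-lucene "Dogs bark, loudly!")
  ;; => a seq of tokens without punctuation, e.g. ("dogs" "bark" "loudly")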
split-tokens-with-whitespace
(split-tokens-with-whitespace s)
Splits s into tokens on whitespace (using the regexp \s+). It is the inverse of merge-tokens-with-space.
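A final sketch for the whitespace splitter; since it splits on the regexp \s+, runs of spaces and tabs collapse into single token boundaries (the result is shown here as a seq).

  (parsers/split-tokens-with-whitespace "Dogs  bark\tloudly")
  ;; => ("Dogs" "bark" "loudly")

Joining the tokens back with single spaces (presumably what merge-tokens-with-space does, given that the two are described as inverses) recovers the text up to whitespace normalisation.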