langlab.core.parsers
This module contains tools for parsing text into sentences and words.
en-split-sentences-bi
(en-split-sentences-bi s)
Convenience alias for lg-split-sentences-bi for English.
en-split-sentences-icu-bi
(en-split-sentences-icu-bi s)
Convenience alias for lg-split-sentences-icu-bi for English.
en-split-tokens-bi
(en-split-tokens-bi s)
Convenience alias for lg-split-tokens-bi for English.
en-split-tokens-icu-bi
(en-split-tokens-icu-bi s)
Convenience alias for lg-split-tokens-icu-bi for English.
lg-split-sentences-bi
(lg-split-sentences-bi lang s)
Splits s into a seq of sentences using the standard java.text.BreakIterator class; the locale is set from lang.
Note: it is not clear how BreakIterator uses the locale.
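As a rough sketch of what this function presumably does under the hood (assuming it wraps java.text.BreakIterator's sentence instance, as the docstring suggests), the following plain Java illustrates the iteration pattern. The class and method names here are illustrative, not part of langlab.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplit {
    // Splits text into sentences using the platform BreakIterator
    // for the given language tag, e.g. "en".
    static List<String> splitSentences(String lang, String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(new Locale(lang));
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            String sent = text.substring(start, end).trim();
            if (!sent.isEmpty()) out.add(sent);
            start = end;
            end = it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints each detected sentence on its own line.
        for (String s : splitSentences("en", "Hello world. How are you? Fine!")) {
            System.out.println(s);
        }
    }
}
```

BreakIterator returns boundary offsets, so each sentence is the substring between consecutive boundaries; trailing whitespace is trimmed here.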
lg-split-sentences-icu-bi
(lg-split-sentences-icu-bi lang s)
Splits s into a seq of sentences using the ICU BreakIterator class; the locale is set from lang.
Note: it is not clear how BreakIterator uses the locale.
lg-split-tokens-bi
(lg-split-tokens-bi lang s)
Splits s into a seq of words using the standard java.text.BreakIterator class; the locale is set from lang.
Note: it is not clear how BreakIterator uses the locale.
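The word-level variant presumably works the same way with a word BreakIterator, except that boundary segments consisting only of whitespace or punctuation must be filtered out. A minimal sketch (again assuming java.text.BreakIterator; names are illustrative):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenSplit {
    // Splits text into word tokens for the given language tag,
    // dropping segments that contain no letter or digit
    // (the word BreakIterator also yields whitespace and punctuation runs).
    static List<String> splitTokens(String lang, String text) {
        BreakIterator it = BreakIterator.getWordInstance(new Locale(lang));
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            String tok = text.substring(start, end);
            if (tok.codePoints().anyMatch(Character::isLetterOrDigit)) out.add(tok);
            start = end;
            end = it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitTokens("en", "Hello, world! 42 tokens."));
    }
}
```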
lg-split-tokens-icu-bi
(lg-split-tokens-icu-bi lang s)
Splits s into a seq of words using the ICU BreakIterator class; the locale is set from lang.
Note: it is not clear how BreakIterator uses the locale.
make-split-sentences-onlp
(make-split-sentences-onlp model-fname)
Creates an OpenNLP sentence splitter using the model from file model-fname.
make-split-tokens-onlp
(make-split-tokens-onlp model-fname)
Creates an OpenNLP token splitter using the model from file model-fname.
split*
(split* s re)
Splits s on the regexp re but, unlike string/split, keeps the regexp matches in the resulting seq.
A similar effect can be achieved with look-arounds, but it is clumsy.
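The match-keeping split described above can be implemented directly by walking the matcher and emitting both the text between matches and the matches themselves. A self-contained Java sketch (the class and method names are illustrative, not langlab's):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitStar {
    // Splits s on re, interleaving the text between matches
    // with the matches themselves, so no characters are lost.
    static List<String> splitStar(String s, String re) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(re).matcher(s);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) out.add(s.substring(last, m.start()));
            out.add(m.group());
            last = m.end();
        }
        if (last < s.length()) out.add(s.substring(last));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitStar("a,b;c", "[,;]")); // [a, ,, b, ;, c]
    }
}
```

Because the separators are preserved, concatenating the result reproduces the original string, which an ordinary split cannot guarantee.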
split-tokens-simple-lucene
(split-tokens-simple-lucene s)
Splits s on whitespace and removes punctuation. The splitter is based on Lucene's SimpleAnalyzer.
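For readers without Lucene at hand, the effect of SimpleAnalyzer can be approximated in plain Java: it emits maximal runs of letters, lowercased, so punctuation and digits act as separators. This is a dependency-free mimic, not the actual Lucene code path:

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleTokens {
    // Approximates Lucene's SimpleAnalyzer: emit maximal runs
    // of letters, lowercased; everything else is a separator.
    static List<String> simpleTokens(String s) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                cur.append(Character.toLowerCase(c));
            } else if (cur.length() > 0) {
                out.add(cur.toString());
                cur.setLength(0);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(simpleTokens("Hello, World! 42"));
    }
}
```

Note that, unlike a pure whitespace split, this drops digit-only tokens such as "42", which matches SimpleAnalyzer's letter-tokenizer behaviour.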
split-tokens-with-whitespace
(split-tokens-with-whitespace s)
Splits s into tokens on whitespace (using the regexp \s+). The inverse of merge-tokens-with-space.
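The whitespace split and its inverse are a simple regexp split and join. A Java sketch of the pair (names mirror the docstrings but are illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class WhitespaceTokens {
    // Splits s into tokens on runs of whitespace, as with the regexp \s+.
    static List<String> splitTokensWithWhitespace(String s) {
        return Arrays.asList(s.trim().split("\\s+"));
    }

    // The inverse direction (merge-tokens-with-space in the docs):
    // join the tokens back with single spaces.
    static String mergeTokensWithSpace(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        List<String> toks = splitTokensWithWhitespace("foo  bar\tbaz");
        System.out.println(toks);
        System.out.println(mergeTokensWithSpace(toks));
    }
}
```

The round trip is only an exact inverse when the original whitespace consists of single spaces; runs of tabs or multiple spaces are collapsed by the split.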