langlab.core.parsers

This module contains tools for parsing text into sentences and words.

en-split-sentences-bi

(en-split-sentences-bi s)

Convenience alias for lg-split-sentences-bi with the language set to English.

en-split-sentences-icu-bi

(en-split-sentences-icu-bi s)

Convenience alias for lg-split-sentences-icu-bi with the language set to English.

en-split-tokens-bi

(en-split-tokens-bi s)

Convenience alias for lg-split-tokens-bi with the language set to English.

en-split-tokens-icu-bi

(en-split-tokens-icu-bi s)

Convenience alias for lg-split-tokens-icu-bi with the language set to English.

lg-split-sentences-bi

(lg-split-sentences-bi lang s)

Splits s into a seq of sentences using the standard java.text.BreakIterator class, with the locale set from lang.

Note: it is not clear to me exactly how BreakIterator uses the locale internally.
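The standard BreakIterator mechanism that lg-split-sentences-bi wraps can be illustrated directly in Java. This is a minimal sketch; the class and method names here are illustrative, and the exact trimming/filtering done by the Clojure wrapper is an assumption:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplit {
    // Split text into sentences with a locale-aware sentence BreakIterator.
    // Each boundary pair [start, end) delimits one sentence (including its
    // trailing whitespace, which we trim off here).
    static List<String> splitSentences(Locale locale, String s) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            out.add(s.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(
            splitSentences(Locale.ENGLISH, "Hello world. How are you? Fine!"));
    }
}
```

The locale is passed to getSentenceInstance, which selects the boundary rules; how much those rules actually differ between languages is exactly the open question from the note above.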

lg-split-sentences-icu-bi

(lg-split-sentences-icu-bi lang s)

Splits s into a seq of sentences using the ICU BreakIterator class, with the locale set from lang.

Note: it is not clear to me exactly how BreakIterator uses the locale internally.

lg-split-tokens-bi

(lg-split-tokens-bi lang s)

Splits s into a seq of words using the standard java.text.BreakIterator class, with the locale set from lang.

Note: it is not clear to me exactly how BreakIterator uses the locale internally.
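A word-level BreakIterator reports boundaries around every segment, including whitespace and punctuation runs, so a token splitter has to filter the segments it keeps. A minimal Java sketch (whether lg-split-tokens-bi keeps punctuation tokens or drops them is an assumption; here only pure-whitespace segments are dropped):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenSplit {
    // Split text into word-level tokens with a locale-aware BreakIterator,
    // discarding segments that consist only of whitespace.
    static List<String> splitTokens(Locale locale, String s) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(s);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
             start = end, end = it.next()) {
            String w = s.substring(start, end);
            if (!w.trim().isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitTokens(Locale.ENGLISH, "Hello, world!"));
    }
}
```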

lg-split-tokens-icu-bi

(lg-split-tokens-icu-bi lang s)

Splits s into a seq of words using the ICU BreakIterator class, with the locale set from lang.

Note: it is not clear to me exactly how BreakIterator uses the locale internally.

make-split-sentences-onlp

(make-split-sentences-onlp model-fname)

Creates an OpenNLP sentence splitter using the model read from file model-fname.
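The underlying OpenNLP workflow is to deserialize a SentenceModel from the model file and wrap it in a SentenceDetectorME. A sketch in Java (requires opennlp-tools on the classpath and a trained model file, so it is not runnable standalone; the helper name is illustrative):

```java
import java.io.FileInputStream;
import java.io.IOException;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class OnlpSentences {
    // Load a serialized sentence model and run the ME detector on s.
    // sentDetect returns one String per detected sentence.
    static String[] splitSentences(String modelFname, String s)
            throws IOException {
        try (FileInputStream in = new FileInputStream(modelFname)) {
            SentenceModel model = new SentenceModel(in);
            return new SentenceDetectorME(model).sentDetect(s);
        }
    }
}
```

make-split-tokens-onlp is analogous, with TokenizerModel and TokenizerME in place of the sentence classes.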

make-split-tokens-onlp

(make-split-tokens-onlp model-fname)

Creates an OpenNLP token splitter using the model read from file model-fname.

split*

(split* s re)

Splits s on regexp re but, unlike clojure.string/split, keeps the regexp matches in the resulting seq.

A similar effect can be achieved with look-arounds, but it's clumsy; see

http://stackoverflow.com/questions/19951850/split-string-with-regex-but-keep-delimeters-in-match-array
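The effect of split* can be reproduced at the Java level by walking the regex matches and emitting both the gaps between matches and the matches themselves (a minimal sketch; the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitKeep {
    // Split s on re, keeping the delimiter matches interleaved in the result.
    static List<String> splitKeep(String s, String re) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(re).matcher(s);
        int last = 0;
        while (m.find()) {
            if (m.start() > last) {
                out.add(s.substring(last, m.start())); // text before the match
            }
            out.add(m.group());                        // the delimiter itself
            last = m.end();
        }
        if (last < s.length()) {
            out.add(s.substring(last));                // trailing text
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitKeep("a1b22c", "\\d+"));
    }
}
```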

split-tokens-simple-lucene

(split-tokens-simple-lucene s)

Splits s on whitespace and removes punctuation. The splitter is based on Lucene's SimpleAnalyzer.

split-tokens-with-whitespace

(split-tokens-with-whitespace s)

Splits s into tokens on whitespace (using regexp \s+). Inverse of merge-tokens-with-space.