langlab.core.transformers

This module contains utilities for transforming sequences of tokens.

merge-tokens-with-space

(merge-tokens-with-space tokens)

Creates a string from the seq tokens by inserting a single space between consecutive tokens.
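A minimal sketch of the behavior described above, assuming it amounts to a plain join on a space. The name sketch-merge-tokens-with-space is hypothetical and is not part of langlab; the library's actual implementation may differ.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in, not the library's implementation:
;; join the tokens with a single space between neighbours.
(defn sketch-merge-tokens-with-space
  [tokens]
  (str/join " " tokens))

(sketch-merge-tokens-with-space ["The" "quick" "fox"])
;; => "The quick fox"
```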

trans-drop-punct

(trans-drop-punct tokens)

Drops all items from tokens that contain only punctuation characters.
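The following sketch illustrates one way such a filter could work. It assumes that "punctuation" means the ASCII punctuation class matched by the Java regex \p{Punct}; the library may use a different definition. The name sketch-drop-punct is hypothetical.

```clojure
;; Hypothetical stand-in, not the library's implementation.
;; Assumption: a punctuation token consists solely of \p{Punct} characters.
(defn sketch-drop-punct
  [tokens]
  (remove #(re-matches #"\p{Punct}+" %) tokens))

(sketch-drop-punct ["Wow" "!" "it's" "..."])
;; => ("Wow" "it's")
```

Under this assumption a token such as "it's" is kept, because it contains letters in addition to the apostrophe.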

trans-drop-punct-lower

(trans-drop-punct-lower tokens)

Drops all punctuation tokens and lowercases the remaining tokens.
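As a sketch, the combined transformation can be written as a two-step pipeline. This assumes ASCII punctuation (\p{Punct}) and clojure.string/lower-case; sketch-drop-punct-lower is a hypothetical name, not the library function.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in, not the library's implementation.
(defn sketch-drop-punct-lower
  [tokens]
  (->> tokens
       (remove #(re-matches #"\p{Punct}+" %)) ; drop punctuation-only tokens
       (map str/lower-case)))                 ; lowercase what is left

(sketch-drop-punct-lower ["Hello" "," "World" "!"])
;; => ("hello" "world")
```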

trans-drop-set

(trans-drop-set drop-set tokens)

Drops all elements of tokens that are included in drop-set. Matching is case-sensitive. To generate drop-set, one of the functions in the module core.stopwords that return stopwords or articles can be used.
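A minimal sketch of set-based dropping, assuming drop-set is an ordinary Clojure set of strings. The set below is written by hand for illustration rather than taken from core.stopwords, and sketch-drop-set is a hypothetical name.

```clojure
;; Hypothetical stand-in, not the library's implementation.
;; Assumption: drop-set is a Clojure set of strings; matching is exact.
(defn sketch-drop-set
  [drop-set tokens]
  (remove #(contains? drop-set %) tokens))

(sketch-drop-set #{"the" "a" "an"} ["the" "cat" "saw" "a" "bird" "The"])
;; => ("cat" "saw" "bird" "The")
```

Because the comparison is exact, the capitalized "The" survives in this sketch.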

trans-drop-set-all-case

(trans-drop-set-all-case drop-set tokens)

Drops all elements of tokens that are included in drop-set, ignoring case. To generate drop-set, one of the functions in the module core.stopwords that return stopwords or articles can be used.
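One way to sketch case-insensitive dropping is to lowercase both the set and each token before comparing. This is an assumption about the approach, not a description of the library's code; sketch-drop-set-all-case is a hypothetical name.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in, not the library's implementation.
(defn sketch-drop-set-all-case
  [drop-set tokens]
  (let [lowered (set (map str/lower-case drop-set))] ; normalize the set once
    (remove #(contains? lowered (str/lower-case %)) tokens)))

(sketch-drop-set-all-case #{"The" "a"} ["the" "cat" "A" "THE" "bird"])
;; => ("cat" "bird")
```

The surviving tokens keep their original case; only the comparison is normalized.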

trans-drop-whitespace

(trans-drop-whitespace tokens)

Removes from the seq tokens all entries that contain only whitespace.
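A sketch using clojure.string/blank?, which is true for whitespace-only strings. Note the assumption: blank? is also true for the empty string, so this sketch drops "" as well, which the library may or may not do. sketch-drop-whitespace is a hypothetical name.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in, not the library's implementation.
;; Assumption: empty strings are treated like whitespace-only tokens.
(defn sketch-drop-whitespace
  [tokens]
  (remove str/blank? tokens))

(sketch-drop-whitespace ["a" " " "\t\n" "b"])
;; => ("a" "b")
```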

trans-keep-letters-or-digits

(trans-keep-letters-or-digits tokens)

Drops all items from tokens that contain characters other than letters or digits.
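The sketch below keeps a token only if every character is a letter or a digit. It assumes Unicode letters (\p{L}) and decimal digits (\p{Nd}); the library's exact character classes are not stated in this document. sketch-keep-letters-or-digits is a hypothetical name.

```clojure
;; Hypothetical stand-in, not the library's implementation.
;; Assumption: "letters or digits" means Unicode \p{L} and \p{Nd}.
(defn sketch-keep-letters-or-digits
  [tokens]
  (filter #(re-matches #"[\p{L}\p{Nd}]+" %) tokens))

(sketch-keep-letters-or-digits ["abc" "a1" "it's" "42" "x-y" "!"])
;; => ("abc" "a1" "42")
```

Tokens with any other character, such as the apostrophe in "it's" or the hyphen in "x-y", are removed entirely rather than cleaned.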

trans-lower-case

(trans-lower-case tokens)

Lowercases all tokens.

trans-merge-punct

(trans-merge-punct tokens)

Merges each run of consecutive punctuation-only tokens in the seq tokens into a single token.

(trans-merge-punct [ "Wow" "!" "!" "!" ])

[ "Wow" "!!!" ]

Inverse of trans-split-punct.
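The merging step can be sketched with partition-by: group consecutive tokens by whether they are punctuation, then concatenate each punctuation group. This assumes ASCII punctuation (\p{Punct}); the names punct? and sketch-merge-punct are hypothetical and not part of the library.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: true when the token is made only of \p{Punct} chars.
(defn punct?
  [token]
  (boolean (re-matches #"\p{Punct}+" token)))

;; Hypothetical stand-in, not the library's implementation.
(defn sketch-merge-punct
  [tokens]
  (vec (mapcat (fn [group]
                 (if (punct? (first group))
                   [(str/join group)] ; collapse a punctuation run into one token
                   group))            ; leave other tokens untouched
               (partition-by punct? tokens))))

(sketch-merge-punct ["Wow" "!" "!" "!"])
;; => ["Wow" "!!!"]
```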

trans-split-punct

(trans-split-punct tokens)

Splits every punctuation token in tokens into separate single-character tokens.

(trans-split-punct [ "Wow" "!!!" ])

[ "Wow" "!" "!" "!" ]

Inverse of trans-merge-punct.
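A sketch of the splitting step, assuming a punctuation token is one made only of ASCII punctuation (\p{Punct}). The name sketch-split-punct is hypothetical; the library's implementation may differ.

```clojure
;; Hypothetical stand-in, not the library's implementation.
(defn sketch-split-punct
  [tokens]
  (vec (mapcat (fn [token]
                 (if (re-matches #"\p{Punct}+" token)
                   (map str token) ; one single-character token per char
                   [token]))       ; non-punctuation tokens pass through
               tokens)))

(sketch-split-punct ["Wow" "!!!"])
;; => ["Wow" "!" "!" "!"]
```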