langlab.core.characters

This module contains string utilities that operate on characters.

These include, e.g., diacritic removal, vowel group detection, character counting, and removal of non-BMP characters.

Some of this module's functionality can also be achieved with regular expression matching with Unicode support: http://www.regular-expressions.info/unicode.html
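As a rough illustration of the regular-expression alternative, the sketch below uses Java's Unicode property classes from Clojure. The function names are hypothetical and are not part of langlab; `\p{L}`, `\p{Nd}` and `\p{P}` are the standard general-category classes for letters, decimal digits and punctuation.

```clojure
;; Hypothetical helpers, not part of langlab.

(defn letters-only-re?
  "True if s is non-empty and consists solely of Unicode letters."
  [s]
  (boolean (re-matches #"\p{L}+" s)))

(defn has-digit-re?
  "True if s contains at least one Unicode decimal digit."
  [s]
  (boolean (re-find #"\p{Nd}" s)))

(defn has-punct-re?
  "True if s contains at least one Unicode punctuation character."
  [s]
  (boolean (re-find #"\p{P}" s)))

;; (letters-only-re? "żółw") ;=> true
;; (letters-only-re? "")     ;=> false, because + requires at least one letter
;; (has-punct-re? "a-b")     ;=> true,  '-' is dash punctuation
;; (has-punct-re? "a+b")     ;=> false, '+' is a math symbol
```

The `+` quantifier makes the "only" check fail on an empty string, which matches the convention stated for the contains-* functions.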

By convention, all contains-* functions return false for an empty string.
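The convention matters mostly for the "-only?" variants: a naive `every?` over zero code points is vacuously true. The following is a minimal sketch of how such a check could guard against that; `code-points` and `only-digits?` are hypothetical names, and this is not claimed to be how langlab implements it.

```clojure
;; Hypothetical helpers, not part of langlab.

(defn code-points
  "Returns a vector of the Unicode code points in s."
  [^String s]
  (loop [i 0, acc []]
    (if (< i (.length s))
      (let [cp (.codePointAt s i)]
        ;; advance by 1 or 2 UTF-16 units depending on the code point
        (recur (+ i (Character/charCount cp)) (conj acc cp)))
      acc)))

(defn only-digits?
  "True if s is non-empty and every code point satisfies Character.isDigit."
  [s]
  (let [cps (code-points s)]
    (boolean (and (seq cps)   ; explicit guard for the empty string
                  (every? #(Character/isDigit (int %)) cps)))))

;; (every? pos? [])     ;=> true, which is why the guard is needed
;; (only-digits? "")    ;=> false
;; (only-digits? "2024") ;=> true
;; (only-digits? "12a") ;=> false
```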

contains-digits-only?

(contains-digits-only? s)

Checks if s contains only digits according to Character.isDigit(cp).

contains-digits?

(contains-digits? s)

Checks if s contains any digits according to Character.isDigit(cp).

contains-letters-only?

(contains-letters-only? s)

Checks if s contains only letters according to Character.isLetter(cp).

contains-letters-or-digits-only?

(contains-letters-or-digits-only? s)

Checks if s contains only letters and digits according to Character.isLetterOrDigit(cp).

contains-letters-or-digits?

(contains-letters-or-digits? s)

Checks if s contains any letters or any digits according to Character.isLetterOrDigit(cp).

contains-letters?

(contains-letters? s)

Checks if s contains any letters according to Character.isLetter(cp).

contains-non-bmp?

(contains-non-bmp? s)

Checks if s contains non-BMP characters (code points outside the Basic Multilingual Plane), i.e., those for which Character.isBmpCodePoint(cp) is false.
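To make the BMP distinction concrete, here is a small sketch that walks a string code point by code point and stops at the first one outside the BMP. The name `has-non-bmp?` is hypothetical; only `Character/isBmpCodePoint` and `Character/charCount` are taken from the JDK.

```clojure
;; Hypothetical helper, not part of langlab.

(defn has-non-bmp?
  "True if s contains at least one code point above U+FFFF."
  [^String s]
  (loop [i 0]
    (if (< i (.length s))
      (let [cp (.codePointAt s i)]
        (if (Character/isBmpCodePoint cp)
          (recur (+ i (Character/charCount cp)))
          true))
      false)))

;; U+1F600 (grinning face) lies outside the BMP and occupies
;; two UTF-16 code units in a Java string.
(def grinning (String. (Character/toChars 0x1F600)))

;; (count grinning)                ;=> 2
;; (has-non-bmp? (str "hi " grinning)) ;=> true
;; (has-non-bmp? "hello")          ;=> false
```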

contains-punct-only?

(contains-punct-only? s)

Checks if s contains only punctuation, i.e., code points whose Character.getType(cp) is one of the *_PUNCTUATION categories.

contains-punct?

(contains-punct? s)

Checks if s contains any punctuation, i.e., code points whose Character.getType(cp) is one of the *_PUNCTUATION categories.
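The category test can be sketched as a set lookup on the value returned by `Character/getType`. This assumes that "*_PUNCTUATION" means all seven punctuation constants defined on `java.lang.Character`; the names `punctuation-types` and `punct-code-point?` are hypothetical.

```clojure
;; Hypothetical helpers, not part of langlab.
;; Assumption: all seven *_PUNCTUATION constants are meant.

(def punctuation-types
  ;; the constants are bytes, so normalise them to longs for the lookup
  (set (map long [Character/CONNECTOR_PUNCTUATION
                  Character/DASH_PUNCTUATION
                  Character/START_PUNCTUATION
                  Character/END_PUNCTUATION
                  Character/INITIAL_QUOTE_PUNCTUATION
                  Character/FINAL_QUOTE_PUNCTUATION
                  Character/OTHER_PUNCTUATION])))

(defn punct-code-point?
  "True if the code point cp belongs to a punctuation category."
  [cp]
  (contains? punctuation-types (long (Character/getType (int cp)))))

;; (punct-code-point? (int \!)) ;=> true,  other punctuation
;; (punct-code-point? (int \_)) ;=> true,  connector punctuation
;; (punct-code-point? (int \$)) ;=> false, currency symbol
;; (punct-code-point? (int \+)) ;=> false, math symbol
```

Note that symbols such as `$` and `+` fall into symbol categories rather than punctuation, so a category-based check treats them differently from, say, POSIX `[:punct:]`.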

contains-whitespace-only?

(contains-whitespace-only? s)

Checks if s contains only whitespace according to Character.isWhitespace(cp). Be warned that some Unicode characters that intuitively count as whitespace are excluded (e.g., non-breaking spaces). See tests.

contains-whitespace?

(contains-whitespace? s)

Checks if s contains any whitespace according to Character.isWhitespace(cp). Be warned that some Unicode characters that intuitively count as whitespace are excluded (e.g., non-breaking spaces). See tests.
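The exclusion of non-breaking spaces comes from the JDK itself and can be checked directly. The sketch below compares `Character/isWhitespace` with `Character/isSpaceChar` for a few code points; the wrapper names are hypothetical.

```clojure
;; Hypothetical wrappers around JDK methods, not part of langlab.

(defn java-whitespace?
  "Character.isWhitespace for a code point."
  [cp]
  (Character/isWhitespace (int cp)))

(defn unicode-space-char?
  "Character.isSpaceChar for a code point (Unicode separator categories)."
  [cp]
  (Character/isSpaceChar (int cp)))

;; code point                   isWhitespace  isSpaceChar
;; U+0020 space                 true          true
;; U+0009 tab                   true          false
;; U+00A0 no-break space        false         true
;; U+2007 figure space          false         true
;; U+202F narrow no-break space false         true

;; (java-whitespace? 0x00A0)    ;=> false
;; (unicode-space-char? 0x00A0) ;=> true
```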

count-latin-vowel-groups

(count-latin-vowel-groups s)

Counts groups of consecutive Latin vowels in the string s, e.g., for ‘employee’ it should return 2.

count-latin-vowel-groups-without-final

(count-latin-vowel-groups-without-final s)

Counts groups of consecutive Latin vowels in the string s, excluding a group that ends the word, e.g., for ‘employee’ it should return 1.
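The two ‘employee’ examples can be reproduced with a regular expression, under the assumption that the vowel set is a, e, i, o, u, y (with y excluded, ‘employee’ would split into three groups rather than two). The function names below are hypothetical, and the actual langlab vowel set, e.g. for accented letters, may differ.

```clojure
;; Hypothetical helpers, not part of langlab.
;; Assumption: vowels are a, e, i, o, u, y, matched case-insensitively.

(defn vowel-groups
  "Returns the maximal runs of consecutive vowels in s."
  [s]
  (re-seq #"(?i)[aeiouy]+" s))

(defn count-vowel-groups
  [s]
  (count (vowel-groups s)))

(defn count-vowel-groups-without-final
  "Like count-vowel-groups, but ignores a group that ends the string."
  [s]
  (let [n (count-vowel-groups s)]
    (if (re-find #"(?i)[aeiouy]$" s)
      (dec n)
      n)))

;; (vowel-groups "employee")                     ;=> ("e" "oyee")
;; (count-vowel-groups "employee")               ;=> 2
;; (count-vowel-groups-without-final "employee") ;=> 1
```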

en-count-chars-bi

(en-count-chars-bi s)

Counts the number of characters in s using a BreakIterator. Uses the English locale.
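Counting with a break iterator differs from `count` because it counts user-perceived characters rather than UTF-16 code units. The sketch below uses `java.text.BreakIterator` from the JDK; the function name `count-chars-bi` is hypothetical, and whether langlab uses exactly this class and loop is an assumption.

```clojure
(import '(java.text BreakIterator)
        '(java.util Locale))

;; Hypothetical helper, not part of langlab.
(defn count-chars-bi
  "Counts character boundaries in s using a character BreakIterator
   for the given locale."
  [^Locale locale ^String s]
  (let [^BreakIterator bi (BreakIterator/getCharacterInstance locale)]
    (.setText bi s)
    (loop [n 0]
      ;; next returns DONE once the end of the text has been passed
      (if (= BreakIterator/DONE (.next bi))
        n
        (recur (inc n))))))

;; "e" followed by U+0301 (combining acute accent) is one perceived
;; character but two UTF-16 code units.
;; (count "e\u0301")                         ;=> 2
;; (count-chars-bi Locale/ENGLISH "e\u0301") ;=> 1
;; (count-chars-bi Locale/ENGLISH "abc")     ;=> 3
```

A different locale can be passed in the same way, for instance `(Locale/forLanguageTag "pl")`.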

en-count-chars-icu-bi

(en-count-chars-icu-bi s)

Counts the number of characters in s using an ICU BreakIterator. Uses the English locale.

lg-count-chars-bi

(lg-count-chars-bi lang s)

Counts the number of characters in s using a BreakIterator. Uses the locale corresponding to lang.

lg-count-chars-icu-bi

(lg-count-chars-icu-bi lang s)

Counts the number of characters in s using an ICU BreakIterator. Uses the locale corresponding to lang.

remove-bmp

(remove-bmp s)

Removes all BMP code points from s.

remove-diacritics

(remove-diacritics s)

Removes diacritical marks from the string s, e.g., ‘żółw’ is transformed to ‘zolw’.
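A common way to approximate this is Unicode decomposition followed by removal of combining marks. The sketch below is one such approach and is not claimed to be langlab's implementation; `strip-diacritics` and `extra-mappings` are hypothetical names. Note that ‘ł’ has no canonical decomposition, so normalization alone would leave it in place; the small mapping table is an assumption added to reproduce the ‘zolw’ example and is not exhaustive.

```clojure
(import '(java.text Normalizer Normalizer$Form))
(require '[clojure.string :as str])

;; Hypothetical, non-exhaustive table for letters that NFD does not decompose.
(def extra-mappings
  {\u0142 \l    ; ł -> l
   \u0141 \L})  ; Ł -> L

;; Hypothetical helper, not part of langlab.
(defn strip-diacritics
  "Decomposes s (NFD), drops combining marks, then applies extra-mappings."
  [^String s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}+" "")
      (str/escape extra-mappings)))

;; (strip-diacritics "żółw") ;=> "zolw"
;; Without extra-mappings the result would be "zołw".
```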

remove-non-bmp

(remove-non-bmp s)

Removes all non-BMP code points from s.
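Both removal directions can be viewed as filtering a string by a code point predicate. The following sketch shows that idea; `keep-code-points`, `strip-non-bmp` and `strip-bmp` are hypothetical names and only stand in for the behaviour described above.

```clojure
;; Hypothetical helpers, not part of langlab.

(defn keep-code-points
  "Returns a string with only the code points of s for which pred is truthy."
  [pred ^String s]
  (let [sb (StringBuilder.)]
    (loop [i 0]
      (when (< i (.length s))
        (let [cp (.codePointAt s i)]
          (when (pred cp)
            (.appendCodePoint sb cp))
          (recur (+ i (Character/charCount cp))))))
    (str sb)))

(defn strip-non-bmp
  "Keeps only code points inside the BMP."
  [s]
  (keep-code-points #(Character/isBmpCodePoint (int %)) s))

(defn strip-bmp
  "Keeps only code points outside the BMP."
  [s]
  (keep-code-points #(not (Character/isBmpCodePoint (int %))) s))

;; U+1F600 (grinning face) is outside the BMP.
(def sample (str "ok" (String. (Character/toChars 0x1F600)) "!"))

;; (strip-non-bmp sample) ;=> "ok!"
;; (strip-bmp sample)     ;=> the single grinning-face character
```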