langlab.core.characters
This module contains string utilities that operate on characters.
These include, e.g., diacritics removal, vowel-group detection, character counting, and detection of non-BMP characters.
Some of this module's functionality can also be achieved with regular-expression matching with Unicode support: http://www.regular-expressions.info/unicode.html
By convention, all contains-* functions return false for the empty string.
contains-digits-only?
(contains-digits-only? s)
Checks if s contains only digits, according to Character.isDigit(cp).
contains-digits?
(contains-digits? s)
Checks if s contains any digits, according to Character.isDigit(cp).
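The "any" and "only" variants, together with the empty-string convention, can be illustrated directly against the underlying Java predicate. The following is a minimal sketch, not the library's implementation; the class and method names (DigitChecks, containsDigits, containsDigitsOnly) are hypothetical.

```java
final class DigitChecks {
    // True if at least one code point of s is a digit.
    // anyMatch on an empty stream is false, so "" yields false.
    static boolean containsDigits(String s) {
        return s.codePoints().anyMatch(Character::isDigit);
    }

    // True if s is non-empty and every code point is a digit.
    // The explicit isEmpty() guard is needed because allMatch
    // on an empty stream would otherwise return true.
    static boolean containsDigitsOnly(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(Character::isDigit);
    }
}
```

Because Character.isDigit follows the Unicode decimal-digit category, non-ASCII digits such as ARABIC-INDIC DIGIT THREE (U+0663) also count as digits in this sketch.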
contains-letters-only?
(contains-letters-only? s)
Checks if s contains only letters, according to Character.isLetter(cp).
contains-letters-or-digits-only?
(contains-letters-or-digits-only? s)
Checks if s contains only letters and digits, according to Character.isLetterOrDigit(cp).
contains-letters-or-digits?
(contains-letters-or-digits? s)
Checks if s contains any letters or digits, according to Character.isLetterOrDigit(cp).
contains-letters?
(contains-letters? s)
Checks if s contains any letters, according to Character.isLetter(cp).
contains-non-bmp?
(contains-non-bmp? s)
Checks if s contains non-BMP characters (code points outside the Basic Multilingual Plane), i.e., code points for which Character.isBmpCodePoint(cp) is false.
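As an illustration of the non-BMP check, the sketch below iterates over code points rather than chars, so that a surrogate pair is seen as one supplementary code point. It is an assumption about how such a check can be written, not the library's code; NonBmpCheck and containsNonBmp are hypothetical names.

```java
final class NonBmpCheck {
    // True if any code point of s lies outside the Basic Multilingual Plane
    // (i.e., above U+FFFF). Returns false for the empty string.
    static boolean containsNonBmp(String s) {
        return s.codePoints().anyMatch(cp -> !Character.isBmpCodePoint(cp));
    }
}
```

For example, a string containing U+1F600 (encoded in Java as the surrogate pair "\uD83D\uDE00") is reported as containing non-BMP characters, while ‘żółw’ is not.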
contains-punct-only?
(contains-punct-only? s)
Checks if s contains only punctuation, i.e., characters whose Character.getType(cp) is one of the *_PUNCTUATION categories.
contains-punct?
(contains-punct? s)
Checks if s contains any punctuation, i.e., characters whose Character.getType(cp) is one of the *_PUNCTUATION categories.
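The *_PUNCTUATION categories referred to above are the seven punctuation constants defined on java.lang.Character. A minimal sketch of the classification follows, with hypothetical names (PunctChecks, isPunctuation, containsPunct, containsPunctOnly); the library's actual implementation may differ.

```java
final class PunctChecks {
    // True if the code point's Unicode general category is one of the
    // seven punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    static boolean isPunctuation(int cp) {
        switch (Character.getType(cp)) {
            case Character.CONNECTOR_PUNCTUATION:
            case Character.DASH_PUNCTUATION:
            case Character.START_PUNCTUATION:
            case Character.END_PUNCTUATION:
            case Character.INITIAL_QUOTE_PUNCTUATION:
            case Character.FINAL_QUOTE_PUNCTUATION:
            case Character.OTHER_PUNCTUATION:
                return true;
            default:
                return false;
        }
    }

    // True if s contains at least one punctuation code point.
    static boolean containsPunct(String s) {
        return s.codePoints().anyMatch(PunctChecks::isPunctuation);
    }

    // True if s is non-empty and consists of punctuation only.
    static boolean containsPunctOnly(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(PunctChecks::isPunctuation);
    }
}
```

Note that symbols such as ‘$’ (currency symbol) and ‘+’ (math symbol) belong to symbol categories, not punctuation categories, so this sketch does not treat them as punctuation.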
contains-whitespace-only?
(contains-whitespace-only? s)
Checks if s contains only whitespace, according to Character.isWhitespace(cp). Note that some Unicode characters that intuitively count as whitespace are excluded (e.g., hard spaces). See the tests.
contains-whitespace?
(contains-whitespace? s)
Checks if s contains any whitespace, according to Character.isWhitespace(cp). Note that some Unicode characters that intuitively count as whitespace are excluded (e.g., hard spaces). See the tests.
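The caveat about hard spaces follows from the definition of Character.isWhitespace, which excludes the non-breaking space characters U+00A0, U+2007 and U+202F. The sketch below demonstrates this with hypothetical names (WhitespaceChecks, containsWhitespace, containsWhitespaceOnly); it is not taken from the library.

```java
final class WhitespaceChecks {
    // True if any code point of s is whitespace per Character.isWhitespace.
    static boolean containsWhitespace(String s) {
        return s.codePoints().anyMatch(Character::isWhitespace);
    }

    // True if s is non-empty and every code point is whitespace.
    static boolean containsWhitespaceOnly(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(Character::isWhitespace);
    }
}
```

With these definitions, "a b" contains whitespace, but "a\u00A0b" (with a NO-BREAK SPACE) does not.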
count-latin-vowel-groups
(count-latin-vowel-groups s)
Counts groups of consecutive Latin vowels in the string s, e.g., for ‘employee’ it should return 2.
count-latin-vowel-groups-without-final
(count-latin-vowel-groups-without-final s)
Counts groups of consecutive Latin vowels in the string s, excluding the group that ends the word, e.g., for ‘employee’ it should return 1.
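The ‘employee’ example gives 2 only if ‘y’ is treated as a vowel (the groups are then ‘e’ and ‘oyee’). The sketch below makes that assumption explicit, along with two others: matching is case-insensitive, and the "final" group is one that extends to the end of the string. The names VowelGroups, countGroups and countGroupsWithoutFinal are hypothetical, and the library's actual vowel set may differ.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class VowelGroups {
    // Assumed vowel set: a, e, i, o, u, y (case-insensitive).
    private static final Pattern GROUP =
            Pattern.compile("[aeiouy]+", Pattern.CASE_INSENSITIVE);

    // Counts maximal runs of consecutive vowels in s.
    static int countGroups(String s) {
        Matcher m = GROUP.matcher(s);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Same as countGroups, but skips a run that reaches the end of s.
    static int countGroupsWithoutFinal(String s) {
        Matcher m = GROUP.matcher(s);
        int n = 0;
        while (m.find()) {
            if (m.end() < s.length()) {
                n++;
            }
        }
        return n;
    }
}
```

Under these assumptions, ‘employee’ yields 2 and 1 respectively, matching the values stated above, while ‘strength’ yields 1 in both cases because its only vowel group does not end the word.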
en-count-chars-bi
(en-count-chars-bi s)
Counts the number of characters in s using a Break Iterator. Uses the English locale.
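Counting with a break iterator differs from String.length() in that it counts user-perceived characters: a base letter followed by a combining mark, or a surrogate pair, counts once. A minimal sketch using java.text.BreakIterator with the English locale follows; it assumes the JDK iterator is the one meant here, and the names CharCount and countChars are hypothetical.

```java
import java.text.BreakIterator;
import java.util.Locale;

final class CharCount {
    // Counts character boundaries in s using a character-instance
    // BreakIterator for the English locale.
    static int countChars(String s) {
        BreakIterator it = BreakIterator.getCharacterInstance(Locale.ENGLISH);
        it.setText(s);
        int n = 0;
        // Each successful call to next() advances past one character.
        while (it.next() != BreakIterator.DONE) {
            n++;
        }
        return n;
    }
}
```

For instance, "e\u0301" (‘e’ followed by COMBINING ACUTE ACCENT) has length 2 as a Java string but is counted as a single character by this sketch.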
en-count-chars-icu-bi
(en-count-chars-icu-bi s)
Counts the number of characters in s using the ICU Break Iterator. Uses the English locale.
lg-count-chars-icu-bi
(lg-count-chars-icu-bi lang s)
Counts the number of characters in s using the ICU Break Iterator. Uses the locale corresponding to lang.
remove-diacritics
(remove-diacritics s)
Removes diacritical marks from the string s, e.g., ‘żółw’ is transformed to ‘zolw’.
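A common way to strip diacritics on the JVM is to decompose the string (Unicode NFD) and then delete the combining marks. The sketch below shows that technique only; it is not the library's implementation, and the names Diacritics and removeDiacritics are hypothetical. One known limitation: ‘ł’ (U+0142) has no canonical decomposition, so this approach alone leaves it unchanged, and turning ‘żółw’ into ‘zolw’ as described above needs additional handling for such letters.

```java
import java.text.Normalizer;

final class Diacritics {
    // Decomposes s into base characters plus combining marks (NFD),
    // then removes every code point in the Unicode Mark category.
    static String removeDiacritics(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }
}
```

With this sketch, ‘café’ becomes ‘cafe’, whereas ‘żółw’ becomes ‘zołw’ because the stroke in ‘ł’ is not a separable combining mark.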