langlab.core.characters
This module contains string utilities that operate on characters.
Examples include diacritics removal, vowel group detection, character counting, and removal of non-BMP characters.
Part of this module's functionality can also be achieved with regular expression matching that supports Unicode: http://www.regular-expressions.info/unicode.html
By convention, all contains-* functions return false for an empty string.
contains-digits-only?
(contains-digits-only? s)
Checks if s contains only digits according to Character.isDigit(cp).
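The "only" checks and the empty-string convention can be sketched as follows. This is an illustration, not the library's source: the helper names code-points and digits-only? are hypothetical, and the sketch assumes the check runs over code points obtained from String.codePoints().

```clojure
;; Hypothetical helpers for illustration; not the library's source.
(defn code-points
  "Returns a seq of the Unicode code points in s (nil for an empty string)."
  [^String s]
  (iterator-seq (.iterator (.codePoints s))))

(defn digits-only?
  "True when s is non-empty and every code point satisfies Character.isDigit."
  [^String s]
  (let [cps (code-points s)]
    ;; (every? pred []) is true, so emptiness is tested explicitly
    ;; to follow the documented convention.
    (boolean (and (seq cps)
                  (every? #(Character/isDigit (int %)) cps)))))

(digits-only? "2024") ; true
(digits-only? "20a4") ; false
(digits-only? "")     ; false
```

The explicit emptiness test is the only part that differs from a plain every? call.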
contains-digits?
(contains-digits? s)
Checks if s contains any digits according to Character.isDigit(cp).
contains-letters-only?
(contains-letters-only? s)
Checks if s contains only letters according to Character.isLetter(cp).
contains-letters-or-digits-only?
(contains-letters-or-digits-only? s)
Checks if s contains only letters and digits according to Character.isLetterOrDigit(cp).
contains-letters-or-digits?
(contains-letters-or-digits? s)
Checks if s contains any letters or any digits according to Character.isLetterOrDigit(cp).
contains-letters?
(contains-letters? s)
Checks if s contains any letters according to Character.isLetter(cp).
contains-non-bmp?
(contains-non-bmp? s)
Checks if s contains non-BMP characters, i.e., code points for which Character.isBmpCodePoint(cp) is false.
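A minimal sketch of such a check, assuming iteration over code points rather than UTF-16 chars. The name has-non-bmp? is hypothetical and not part of the library.

```clojure
;; Hypothetical helper for illustration; not the library's source.
(defn has-non-bmp?
  "True when s has at least one code point outside the Basic Multilingual Plane."
  [^String s]
  (boolean (some #(not (Character/isBmpCodePoint (int %)))
                 (iterator-seq (.iterator (.codePoints s))))))

(has-non-bmp? "abc")           ; false
(has-non-bmp? "a\uD83D\uDE00") ; true: U+1F600 is stored as a surrogate pair
(has-non-bmp? "")              ; false
```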
contains-punct-only?
(contains-punct-only? s)
Checks if s contains only punctuation, i.e., code points whose Character.getType(cp) is one of the *_PUNCTUATION classes.
contains-punct?
(contains-punct? s)
Checks if s contains punctuation, i.e., code points whose Character.getType(cp) is one of the *_PUNCTUATION classes.
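The punctuation test can be sketched with the seven *_PUNCTUATION constants defined on java.lang.Character. All names below are hypothetical, and the sketch only assumes the library compares Character.getType(cp) against those constants.

```clojure
;; Hypothetical helpers for illustration; not the library's source.
(def punct-types
  "The seven *_PUNCTUATION general categories from java.lang.Character."
  (set (map long [Character/CONNECTOR_PUNCTUATION
                  Character/DASH_PUNCTUATION
                  Character/START_PUNCTUATION
                  Character/END_PUNCTUATION
                  Character/INITIAL_QUOTE_PUNCTUATION
                  Character/FINAL_QUOTE_PUNCTUATION
                  Character/OTHER_PUNCTUATION])))

(defn punct-cp?
  "True when the code point cp belongs to a punctuation category."
  [cp]
  (contains? punct-types (long (Character/getType (int cp)))))

(defn code-points [^String s]
  (iterator-seq (.iterator (.codePoints s))))

(defn has-punct? [s]
  (boolean (some punct-cp? (code-points s))))

(defn punct-only? [s]
  (let [cps (code-points s)]
    (boolean (and (seq cps) (every? punct-cp? cps)))))

(has-punct? "a-b")  ; true: hyphen-minus is DASH_PUNCTUATION
(punct-only? "?!")  ; true
(punct-only? "$")   ; false: the dollar sign is CURRENCY_SYMBOL
(punct-only? "")    ; false
```

Symbols such as $ and + fall into symbol categories, so a category-based check does not treat them as punctuation.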
contains-whitespace-only?
(contains-whitespace-only? s)
Checks if s contains only whitespace according to Character.isWhitespace(cp). Be warned that some Unicode characters that intuitively count as whitespace are excluded (e.g., hard spaces). See tests.
contains-whitespace?
(contains-whitespace? s)
Checks if s contains whitespace according to Character.isWhitespace(cp). Some Unicode characters that intuitively count as whitespace are excluded (e.g., hard spaces). See tests.
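The hard-space caveat comes from the JDK itself and can be checked directly. The snippet below calls only java.lang.Character and uses no library functions. Character.isSpaceChar is shown for contrast only and is not something this module is documented to use.

```clojure
;; Ordinary space and tab count as whitespace.
(def space-ws? (Character/isWhitespace (int \space)))  ; true
(def tab-ws?   (Character/isWhitespace (int \tab)))    ; true

;; U+00A0 NO-BREAK SPACE (the "hard space") does not.
(def nbsp-ws?  (Character/isWhitespace (int 0x00A0)))  ; false

;; Character.isSpaceChar follows the Unicode space separator
;; categories instead, so it accepts U+00A0.
(def nbsp-space-char? (Character/isSpaceChar (int 0x00A0))) ; true
```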
count-latin-vowel-groups
(count-latin-vowel-groups s)
Counts groups of Latin vowels in the string s, e.g., for ‘employee’ it should return 2.
count-latin-vowel-groups-without-final
(count-latin-vowel-groups-without-final s)
Counts groups of Latin vowels in the string s, excluding the group that ends the word, e.g., for ‘employee’ it should return 1.
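One way to reproduce the two ‘employee’ results is a regular expression over runs of vowels. This is a sketch under an assumption: the vowel set is taken to be a, e, i, o, u, y (case-insensitive), which is consistent with the documented examples but not confirmed by the source. The function names are hypothetical.

```clojure
;; Hypothetical helpers for illustration; not the library's source.
;; Assumption: vowels are a e i o u y, matched case-insensitively.
(defn count-groups
  "Counts maximal runs of consecutive vowels in s."
  [s]
  (count (re-seq #"(?i)[aeiouy]+" s)))

(defn count-groups-without-final
  "Like count-groups, but ignores a vowel run that ends the string."
  [s]
  (let [n (count-groups s)]
    (if (re-find #"(?i)[aeiouy]$" s)
      (dec n)
      n)))

(count-groups "employee")               ; 2 ("e" and "oyee")
(count-groups-without-final "employee") ; 1
(count-groups-without-final "test")     ; 1
```

Without y in the vowel set, ‘employee’ would split into three groups, which is why y is included here.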
en-count-chars-bi
(en-count-chars-bi s)
Counts the number of characters in s using Break Iterator. Uses the English locale.
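Counting with a break iterator differs from String.length when a user-perceived character spans several chars. The sketch below assumes the JDK's java.text.BreakIterator character instance. The name count-user-chars is hypothetical, and the library's actual implementation may differ.

```clojure
(import '(java.text BreakIterator)
        '(java.util Locale))

;; Hypothetical helper for illustration; not the library's source.
(defn count-user-chars
  "Counts character boundaries in s reported by a BreakIterator (English locale)."
  [^String s]
  (let [bi (BreakIterator/getCharacterInstance Locale/ENGLISH)]
    (.setText bi s)
    ;; Each call to .next advances to the following boundary and
    ;; returns BreakIterator/DONE once the end has been passed.
    (loop [n 0]
      (if (= BreakIterator/DONE (.next bi))
        n
        (recur (inc n))))))

(count-user-chars "abc")     ; 3
(.length "e\u0301")          ; 2 chars: e followed by U+0301 COMBINING ACUTE ACCENT
(count-user-chars "e\u0301") ; 1 user-perceived character
```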
en-count-chars-icu-bi
(en-count-chars-icu-bi s)
Counts the number of characters in s using ICU Break Iterator. Uses the English locale.
lg-count-chars-icu-bi
(lg-count-chars-icu-bi lang s)
Counts the number of characters in s using ICU Break Iterator. Uses the locale corresponding to lang.
remove-diacritics
(remove-diacritics s)
Removes diacritical marks from the string s, e.g., ‘żółw’ is transformed to ‘zolw’.
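A common approach to this kind of transformation is Unicode decomposition (NFD) followed by removal of combining marks. The sketch below is an illustration under that assumption, not the library's implementation, and the name strip-marks is hypothetical. Note that ł (U+0142) has no canonical decomposition, so the sketch maps it explicitly in order to reproduce the ‘żółw’ example.

```clojure
(import '(java.text Normalizer Normalizer$Form))
(require '[clojure.string :as str])

;; Hypothetical helper for illustration; not the library's source.
(defn strip-marks
  "Decomposes s to NFD, drops combining marks, and maps the Polish l with stroke."
  [^String s]
  (-> (Normalizer/normalize s Normalizer$Form/NFD)
      (str/replace #"\p{M}+" "")   ; remove combining marks left by NFD
      (str/replace "\u0142" "l")   ; ł does not decompose, map it by hand
      (str/replace "\u0141" "L"))) ; Ł likewise

(strip-marks "\u017C\u00F3\u0142w") ; "zolw" (input is żółw)
(strip-marks "plain")               ; "plain"
```

Other letters without a canonical decomposition (for example ø or đ) would need similar explicit mappings, which this sketch does not include.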