langlab.core.detectors

This module contains language and encoding detection utilities.

Languages are represented as two-letter strings containing ISO 639-1 codes; see http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes.

Unfortunately, there is no standard string representation for encodings.
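Because encoding identifiers are not standardized, it can help to normalize them before comparing the output of two detectors. The following is a minimal sketch, assuming the JVM recognizes the identifier as a charset name or alias; canonical-charset is a hypothetical helper and not part of this module.

```clojure
(import 'java.nio.charset.Charset)

(defn canonical-charset
  "Returns the JVM's canonical name for the encoding identifier s,
  or nil when the JVM does not recognize it.
  Hypothetical helper, not part of langlab."
  [s]
  (try
    (.name (Charset/forName s))
    ;; Both illegal and unsupported charset names are reported with
    ;; subclasses of IllegalArgumentException.
    (catch IllegalArgumentException _ nil)))

;; Two spellings of the same encoding compare equal after normalization.
(= (canonical-charset "utf8") (canonical-charset "UTF-8"))
```

Identifiers that the JVM does not know map to nil, so they should be compared as raw strings instead.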

detect-all-encod-prob-icu

(detect-all-encod-prob-icu fname)

Returns multiple string identifiers of encodings for file fname as detected by ICU4j. For each encoding, the confidence in the detection is provided (a double in the range [0.0, 1.0]). The result is a map

{ encod1 prob1, encod2 prob2, ... }.
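Since the result is an ordinary map, the candidates can be ordered by confidence with core functions. A minimal sketch: rank-by-confidence is a hypothetical helper, and the file name in the usage form is a placeholder.

```clojure
(defn rank-by-confidence
  "Turns a {name confidence} map into a vector of [name confidence]
  pairs, highest confidence first. Hypothetical helper."
  [m]
  (vec (sort-by val > m)))

(comment
  ;; Assumes langlab is on the classpath; "data/sample.txt" is a placeholder.
  (require '[langlab.core.detectors :as d])
  (rank-by-confidence (d/detect-all-encod-prob-icu "data/sample.txt")))
```

The first element of the returned vector is the most likely encoding together with its confidence.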

detect-all-lang-prob-cybozu

(detect-all-lang-prob-cybozu s env)
(detect-all-lang-prob-cybozu s)

Returns multiple language code strings for s and their probabilities according to the Cybozu Labs library. The result is a map

{ lang1 prob1, lang2 prob2, ... }.

The optional env parameter can contain the following optional keys:

  • :alpha - alpha smoothing parameter of the Cybozu algorithm (default 0.5),
  • :max-len - maximum length of s to be taken for lang detection.

Note. Cybozu is not well suited to very short texts; it needs at least 10-20 words. For very short texts of 1-10 words, it may return a wrong answer.
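As an illustration of the env map and of post-processing the result, here is a minimal sketch. The :max-len value of 1000 and the 0.2 threshold are arbitrary choices for the example, and langs-above is a hypothetical helper.

```clojure
(def cybozu-env
  ;; :alpha 0.5 is the documented default; :max-len 1000 is an arbitrary example value.
  {:alpha 0.5 :max-len 1000})

(defn langs-above
  "Keeps only the entries of a {lang prob} map whose probability is
  at least threshold. Hypothetical helper."
  [threshold m]
  (into {} (filter (fn [[_ prob]] (>= prob threshold))) m))

(comment
  ;; Assumes langlab is on the classpath and text is bound to a string.
  (require '[langlab.core.detectors :as d])
  (langs-above 0.2 (d/detect-all-lang-prob-cybozu text cybozu-env)))
```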

detect-encod-icu

(detect-encod-icu fname)

Returns a string identifier of encoding for file fname as detected by ICU4j.

detect-encod-prob-icu

(detect-encod-prob-icu fname)

Returns a string identifier of the encoding for file fname as detected by ICU4j, together with the confidence in the detection (a double in the range [0.0, 1.0]). The result is a map

{ encod prob }.
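The single-entry map can be destructured to decide whether the detection is trustworthy enough to use. A minimal sketch: encoding-or-default is a hypothetical helper, and the 0.5 cutoff, the "UTF-8" fallback and the file name are example values only.

```clojure
(defn encoding-or-default
  "Takes a single-entry {encod prob} map and returns encod when its
  confidence is at least min-conf; otherwise returns fallback.
  Hypothetical helper."
  [m min-conf fallback]
  (let [[encod prob] (first m)]
    (if (and encod (>= prob min-conf))
      encod
      fallback)))

(comment
  ;; Assumes langlab is on the classpath; "data/sample.txt" is a placeholder.
  (require '[langlab.core.detectors :as d])
  (let [fname "data/sample.txt"
        enc   (encoding-or-default (d/detect-encod-prob-icu fname) 0.5 "UTF-8")]
    (slurp fname :encoding enc)))
```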

detect-encod-unichardet

(detect-encod-unichardet fname)

Returns a string identifier of the encoding for file fname as detected by juniversalchardet.

detect-lang-cybozu

(detect-lang-cybozu s env)
(detect-lang-cybozu s)

Returns a language code string for s using the Cybozu Labs library. The optional env parameter can contain the following optional keys:

  • :alpha - alpha smoothing parameter of the Cybozu algorithm (default 0.5),
  • :max-len - maximum length of s to be taken for lang detection.

Note. Cybozu is not well suited to very short texts; it needs at least 10-20 words. For very short texts of 1-10 words, it may return a wrong answer.
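One way to act on this note is to count words before calling the detector. A minimal sketch, assuming whitespace-separated words; word-count and long-enough? are hypothetical helpers, and the default of 10 words is the lower bound mentioned in the note.

```clojure
(defn word-count
  "Counts whitespace-separated tokens in s. Hypothetical helper."
  [s]
  (count (re-seq #"\S+" s)))

(defn long-enough?
  "True when s has at least min-words words (10 by default)."
  ([s] (long-enough? s 10))
  ([s min-words] (>= (word-count s) min-words)))

(comment
  ;; Assumes langlab is on the classpath and text is bound to a string.
  (require '[langlab.core.detectors :as d])
  (when (long-enough? text)
    (d/detect-lang-cybozu text)))
```

Whitespace splitting does not fit languages written without spaces, so the guard is only a rough heuristic.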

detect-lang-icu

(detect-lang-icu s)

detect-lang-prob-tika

(detect-lang-prob-tika s)

Returns a map { lang prob } where lang is the language code string for s and prob represents confidence in the detection. Because the library offers only a boolean function isReasonablyCertain(), prob takes only the values 0.0 (not certain) and 1.0 (certain).

Note. The probability is very conservative. According to the API docs, short texts are always reported as uncertain. Even on long English texts I could not find any example for which it returns certain (Tika 1.4).
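Because prob is effectively a boolean flag, callers may prefer to unpack the result into an explicit form. A minimal sketch; tika-result is a hypothetical helper and not part of this module.

```clojure
(defn tika-result
  "Converts a single-entry {lang prob} map into {:lang lang :certain? bool},
  where :certain? is true only for a probability of 1.0.
  Hypothetical helper."
  [m]
  (let [[lang prob] (first m)]
    {:lang     lang
     :certain? (and (some? prob) (== 1.0 prob))}))

(comment
  ;; Assumes langlab is on the classpath and text is bound to a string.
  (require '[langlab.core.detectors :as d])
  (tika-result (d/detect-lang-prob-tika text)))
```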

detect-lang-tika

(detect-lang-tika s)

Returns a language code string for s obtained using Apache Tika.

get-encod-avail-icu

(get-encod-avail-icu)

Returns a set of string identifiers for encodings available in encoding detection tools of ICU4j.

get-encod-avail-unichardet

(get-encod-avail-unichardet)

Returns a set of string identifiers for encodings available in the encoding detection library juniversalchardet.

get-lang-avail-cybozu

(get-lang-avail-cybozu)

Returns a set of language code strings recognized by the Cybozu Labs library.

get-lang-avail-icu

(get-lang-avail-icu)

Returns a set of language code strings recognized by the ICU library. ICU4j provides no function that lists the languages actually available, so this set was obtained by running grep over the com.ibm.icu sources:

grep -h -o -e '"[a-z][a-z]"' Charset*.java | sort | uniq

get-lang-avail-tika

(get-lang-avail-tika)

Returns a set of language code strings recognized by Tika.
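The get-lang-avail-* functions return plain sets, so support for a given language can be checked with set membership. A minimal sketch; detectors-supporting is a hypothetical helper, and the keywords naming the detectors are arbitrary labels chosen for the example.

```clojure
(defn detectors-supporting
  "Given a map of detector label -> set of language codes, returns the
  set of labels whose language set contains lang. Hypothetical helper."
  [lang avail]
  (set (for [[detector langs] avail
             :when (contains? langs lang)]
         detector)))

(comment
  ;; Assumes langlab is on the classpath.
  (require '[langlab.core.detectors :as d])
  (detectors-supporting "pl" {:cybozu (d/get-lang-avail-cybozu)
                              :icu    (d/get-lang-avail-icu)
                              :tika   (d/get-lang-avail-tika)}))
```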