langlab.core.detectors documentation

This module contains utilities for language and encoding detection.

Languages are represented as two-letter strings containing ISO 639-1 codes; see http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes.

Unfortunately, there is no standard string representation for encodings.

detect-all-encod-prob-icu

(detect-all-encod-prob-icu fname)

Returns the string identifiers of all candidate encodings for file fname as detected by ICU4J. Each encoding is paired with the confidence in its detection (a double in the range [0.0, 1.0]). The result is a map

{ encod1 prob1, encod2 prob2, ... }.
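
Since the result is a plain map, choosing one candidate takes a few lines. The following is a minimal sketch, not part of Langlab: the helper best-encoding, its min-conf threshold, and the sample data are all hypothetical and only mimic the documented result shape.

```clojure
;; Illustrative data only: the encoding names and confidences below are
;; made up, shaped like the documented { encod1 prob1, ... } result.
(def sample-result
  {"UTF-8" 0.8, "ISO-8859-1" 0.35, "windows-1252" 0.2})

(defn best-encoding
  "Returns the encoding with the highest confidence in m, or nil when
  m is empty or no confidence reaches min-conf."
  [m min-conf]
  (when (seq m)
    (let [[encod prob] (apply max-key val m)]
      (when (>= prob min-conf)
        encod))))

(best-encoding sample-result 0.5)  ; => "UTF-8"
```

With Langlab on the classpath, sample-result would be replaced by the value of (detect-all-encod-prob-icu fname).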

detect-all-lang-prob-cybozu

(detect-all-lang-prob-cybozu s env)
(detect-all-lang-prob-cybozu s)

Returns multiple language code strings for s and their probabilities according to the Cybozu Labs library. The result is a map

{ lang1 prob1, lang2 prob2, ... }.

The optional env parameter can contain the following optional keys:

:alpha - alpha smoothing parameter of the Cybozu algorithm (default 0.5),
:max-len - maximum length of s to be taken for lang detection.

Note. Cybozu is not well suited to very short texts; it needs at least 10-20 words. For texts of only 1-10 words, it may return a wrong answer.
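
As an illustration of the env parameter and of working with the result map, here is a hedged sketch. The env values are arbitrary (the source does not state the unit of :max-len), the helper ranked-langs is hypothetical, and the sample result is made up rather than produced by Cybozu.

```clojure
;; Hypothetical env map using the two documented keys; the :max-len
;; value is arbitrary.
(def env {:alpha 0.5, :max-len 10000})

;; With Langlab on the classpath the call would look like:
;;   (detect-all-lang-prob-cybozu text env)

;; Made-up result in the documented { lang1 prob1, ... } shape.
(def sample-result {"en" 0.71, "nl" 0.21, "de" 0.08})

(defn ranked-langs
  "Returns [lang prob] pairs from m with prob >= min-prob, most
  probable first."
  [m min-prob]
  (->> m
       (filter (fn [[_ prob]] (>= prob min-prob)))
       (sort-by val >)
       vec))

(ranked-langs sample-result 0.1)  ; => [["en" 0.71] ["nl" 0.21]]
```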

detect-encod-icu

(detect-encod-icu fname)

Returns a string identifier of the encoding of file fname as detected by ICU4J.

detect-encod-prob-icu

(detect-encod-prob-icu fname)

Returns a string identifier of the encoding of file fname as detected by ICU4J, together with the confidence in the detection (a double in the range [0.0, 1.0]). The result is a map

{ encod prob }.
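
Because the map holds exactly one entry, it can be taken apart by destructuring its first entry. The sketch below assumes that shape; the helper encod-and-prob and the sample values are hypothetical.

```clojure
;; Made-up result in the documented { encod prob } shape.
(def sample-result {"UTF-8" 0.8})

(defn encod-and-prob
  "Splits a single-entry { encod prob } map into a map with named keys."
  [m]
  (let [[encod prob] (first m)]
    {:encod encod, :prob prob}))

(encod-and-prob sample-result)  ; => {:encod "UTF-8", :prob 0.8}
```

The extracted name could then be passed to clojure.core/slurp through its :encoding option, provided the JVM recognizes that charset name.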

detect-encod-unichardet

(detect-encod-unichardet fname)

Returns a string identifier of encoding for file fname as detected by juniversalchardet.

detect-lang-cybozu

(detect-lang-cybozu s env)
(detect-lang-cybozu s)

Returns the language code string for s using the Cybozu Labs library. The optional env parameter can contain the following optional keys:

:alpha - alpha smoothing parameter of the Cybozu algorithm (default 0.5),
:max-len - maximum length of s to be taken for lang detection.

Note. Cybozu is not well suited to very short texts; it needs at least 10-20 words. For texts of only 1-10 words, it may return a wrong answer.
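
One way to act on this note is to skip detection for inputs that are too short. The following is only a sketch: word-count, detect-if-long-enough, and fake-detector are hypothetical names, and counting whitespace-separated tokens is an assumption that does not suit languages written without spaces.

```clojure
(require '[clojure.string :as str])

(defn word-count
  "Counts whitespace-separated tokens in s."
  [s]
  (let [trimmed (str/trim s)]
    (if (str/blank? trimmed)
      0
      (count (str/split trimmed #"\s+")))))

(defn detect-if-long-enough
  "Calls detect-fn on s only when s has at least min-words words;
  otherwise returns nil."
  [detect-fn s min-words]
  (when (>= (word-count s) min-words)
    (detect-fn s)))

;; Stand-in detector for demonstration; with Langlab loaded one would
;; pass detect-lang-cybozu instead.
(defn fake-detector [_] "en")

(detect-if-long-enough fake-detector "too short" 10)  ; => nil
```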

detect-lang-icu

(detect-lang-icu s)

detect-lang-prob-tika

(detect-lang-prob-tika s)

Returns a map { lang prob }, where lang is the language code string for s and prob represents the confidence in the detection. Because Tika offers only a boolean function, isReasonablyCertain(), prob takes only two values: 0.0 (not certain) and 1.0 (certain).

Note. The probability is very conservative. According to the API docs, the result is always uncertain for short texts. Even for long English texts, I could not find an example for which it returns certain (Tika 1.4).
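
Since prob is effectively a boolean flag, a caller may prefer to treat it as one. A small sketch under that reading follows; lang-if-certain is a hypothetical helper and the inputs are made up.

```clojure
(defn lang-if-certain
  "Returns lang from a single-entry { lang prob } map when prob is 1.0,
  otherwise nil."
  [m]
  (let [[lang prob] (first m)]
    (when (= 1.0 prob)
      lang)))

;; Made-up results in the documented shape.
(lang-if-certain {"en" 1.0})  ; => "en"
(lang-if-certain {"en" 0.0})  ; => nil
```

Given the note above, a caller relying on this check alone should expect nil for most inputs.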

detect-lang-tika

(detect-lang-tika s)

Returns the language code string for s obtained using Apache Tika.

get-encod-avail-icu

(get-encod-avail-icu)

Returns a set of string identifiers for the encodings available in the encoding detection tools of ICU4J.

get-encod-avail-unichardet

(get-encod-avail-unichardet)

Returns a set of string identifiers for the encodings available in the juniversalchardet encoding detection library.

get-lang-avail-cybozu

(get-lang-avail-cybozu)

Returns a set of language code strings recognized by the Cybozu Labs library.

get-lang-avail-icu

(get-lang-avail-icu)

Returns a set of language code strings recognized by the ICU library. ICU4J provides no function that returns the list of available languages, so the list was obtained by running grep over the com.ibm.icu sources:

grep -h -o -e '"[a-z][a-z]"' Charset*.java | sort | uniq
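
For readers who prefer to stay in Clojure, the same extraction can be sketched with re-seq. This is an approximation of the grep pipeline above, not the code used to build the list: two-letter-codes is a hypothetical helper, the sample text is made up rather than taken from the ICU4J sources, and unlike grep -o the result drops the surrounding quotes.

```clojure
(defn two-letter-codes
  "Returns a sorted set of the two-letter lowercase strings that appear
  in double quotes in source-text."
  [source-text]
  (->> (re-seq #"\"([a-z][a-z])\"" source-text)
       (map second)
       (into (sorted-set))))

;; Made-up Java-like snippet, not actual ICU4J source.
(def sample-source
  "return \"en\"; String lang = \"de\"; name = \"UTF-8\"; x = \"en\";")

(two-letter-codes sample-source)  ; => #{"de" "en"}
```

Like the grep command, this matches any quoted pair of lowercase letters, so the output still needs a manual check for strings that are not language codes.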

get-lang-avail-tika

(get-lang-avail-tika)

Returns a set of language code strings recognized by Tika.

Generated by Codox

Langlab 1.3.0 API documentation
