langlab.algs.tagging
This module contains functionality for tagging text with dictionary tags. The implemented functionality falls into three main areas:
- generating dictionaries and frequency dictionaries,
- tagging itself,
- tag algebra (union, intersection, difference, etc.).
calc-tags-difference
(calc-tags-difference tags1 tags2 env)
Returns all the { tag freq } pairs from the tags1 map whose normalized tag is not among the normalized tags from tags2. The normalization function splits tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.
The following keywords mapping to functions can be included in env:
- :split-tokens-f - tokenizer function (mandatory),
- :stem-f - stemming function (mandatory),
- :trans-tokens-f - transforming function (default: identity),
- :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
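The normalize-then-compare behavior described above can be illustrated with a minimal sketch. It is not langlab's implementation: the env values, the toy stemmer, the helper names make-normalize-f and tags-difference-sketch, the assumption that :stem-f is applied to one token at a time, and the assumption that the default merge joins tokens with a single space are all hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-ins for the functions expected in env.
(def env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")   ; toy stemmer: drops a trailing "s"
   :trans-tokens-f #(map str/lower-case %)})

(defn make-normalize-f
  "Builds a normalization function from env: split, transform, stem, merge.
   Assumes stem-f takes a single token and the default merge joins with a space."
  [{:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f]
    :or   {trans-tokens-f identity
           merge-tokens-f #(str/join " " %)}}]
  (fn [s]
    (->> (split-tokens-f s)
         trans-tokens-f
         (map stem-f)
         merge-tokens-f)))

(defn tags-difference-sketch
  "Keeps the pairs of tags1 whose normalized tag is absent from tags2."
  [tags1 tags2 env]
  (let [normalize (make-normalize-f env)
        seen      (set (map normalize (keys tags2)))]
    (into {} (remove (fn [[tag _]] (seen (normalize tag))) tags1))))

(tags-difference-sketch {"Red Apples" 3, "pear" 1} {"red apple" 5} env)
;; => {"pear" 1}
```

"Red Apples" and "red apple" both normalize to "red apple" here, so only the "pear" pair survives. In the same sketch, replacing remove with filter would keep the matching pairs instead, which corresponds to the intersection case.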
calc-tags-intersection
(calc-tags-intersection tags1 tags2 env)
Returns all the { tag freq } pairs from the tags1 map whose normalized tag is among the normalized tags from tags2. The normalization function splits tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.
The following keywords mapping to functions can be included in env:
- :split-tokens-f - tokenizer function (mandatory),
- :stem-f - stemming function (mandatory),
- :trans-tokens-f - transforming function (default: identity),
- :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
calc-tags-union
(calc-tags-union tags1 tags2 env)
Returns all the { tag freq } pairs from tags1, plus those pairs from tags2 whose normalized tag is not already included from tags1. The normalization function splits tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.
The following keywords mapping to functions can be included in env:
- :split-tokens-f - tokenizer function (mandatory),
- :stem-f - stemming function (mandatory),
- :trans-tokens-f - transforming function (default: identity),
- :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
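As a rough illustration of the union rule, the sketch below hard-codes a hypothetical normalization instead of building it from env. The names normalize and tags-union-sketch and the toy stemmer are assumptions made for this example only, not part of langlab.

```clojure
(require '[clojure.string :as str])

;; Hypothetical normalization: lower-case, split on whitespace,
;; strip a trailing "s" from each token (toy stemmer), join with a space.
(defn normalize [s]
  (->> (str/split (str/lower-case s) #"\s+")
       (map #(str/replace % #"s$" ""))
       (str/join " ")))

(defn tags-union-sketch
  "All pairs of tags1, plus the pairs of tags2 whose normalized tag is new."
  [tags1 tags2]
  (let [seen (set (map normalize (keys tags1)))]
    (into tags1 (remove (fn [[tag _]] (seen (normalize tag))) tags2))))

(tags-union-sketch {"Red Apples" 3} {"red apple" 5, "pear" 1})
;; => {"Red Apples" 3, "pear" 1}
```

The pair for "red apple" is dropped because its normalized form is already covered by "Red Apples" from tags1, and the frequency from tags1 is kept unchanged.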
conv-fdict-to-dict
(conv-fdict-to-dict fdict)
Converts a frequency dictionary fdict of the form
{ normalized-entry { entry1 freq1 entry2 freq2 }, ... }
to an ordinary dictionary by selecting the most frequent of entry1, entry2, … Among entries with the same frequency, the shortest is selected.
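The selection rule (highest frequency first, shortest entry on ties) can be sketched as follows. The function name fdict->dict-sketch is hypothetical, and the behavior for entries that tie on both frequency and length is not specified in the text, so this sketch leaves it arbitrary.

```clojure
(defn fdict->dict-sketch
  "For each normalized entry, picks the most frequent variant;
   ties on frequency go to the shortest variant."
  [fdict]
  (into {}
        (map (fn [[norm variants]]
               [norm (->> variants
                          ;; sort by descending frequency, then ascending length
                          (sort-by (fn [[entry freq]] [(- freq) (count entry)]))
                          ffirst)]))
        fdict))

(fdict->dict-sketch {"run" {"running" 2, "runs" 2, "run" 1}})
;; => {"run" "runs"}
```

Here "running" and "runs" share the top frequency of 2, so the shorter "runs" is chosen over "running", while "run" loses on frequency despite being the shortest.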
gen-dict-from-reader
(gen-dict-from-reader r env)
Generates a dictionary from a given reader r, passing env to gen-dict-from-seq.
gen-dict-from-seq
(gen-dict-from-seq seq env)
Generates a dictionary from a given seq. The result is a map of the form { normalized-entry normalized-entry-without-stemming }. The normalization function splits tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.
The following keywords mapping to functions can be included in env:
- :split-tokens-f - tokenizer function (mandatory),
- :stem-f - stemming function (mandatory),
- :trans-tokens-f - transforming function (default: identity),
- :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
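To make the shape of the resulting map concrete, here is a minimal sketch under stated assumptions: the env values, the toy stemmer, and the helpers normalize-with and gen-dict-sketch are hypothetical, :stem-f is assumed to act on single tokens, and the default merge is assumed to join tokens with a space. When two inputs share a normalized entry, this sketch simply keeps the later one; the text above does not say how such collisions are resolved.

```clojure
(require '[clojure.string :as str])

;; Hypothetical env; the toy stemmer only strips a trailing "s".
(def env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")
   :trans-tokens-f #(map str/lower-case %)})

(defn normalize-with
  "Splits, transforms, applies stem-f to each token, then merges.
   Passing identity as stem-f gives the normalization without stemming."
  [{:keys [split-tokens-f trans-tokens-f merge-tokens-f]
    :or   {trans-tokens-f identity
           merge-tokens-f #(str/join " " %)}}
   stem-f s]
  (->> (split-tokens-f s) trans-tokens-f (map stem-f) merge-tokens-f))

(defn gen-dict-sketch
  "Maps each stemmed normalization to its unstemmed normalization."
  [entries env]
  (into {}
        (map (fn [s] [(normalize-with env (:stem-f env) s)
                      (normalize-with env identity s)]))
        entries))

(gen-dict-sketch ["Red Apples" "Pears"] env)
;; => {"red apple" "red apples", "pear" "pears"}
```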
gen-fdict-from-reader
(gen-fdict-from-reader r env)
Generates a frequency dictionary from a given reader r, passing env to gen-fdict-from-seq.
gen-fdict-from-seq
(gen-fdict-from-seq seq env)
Generates a frequency dictionary (fdict) from seq. The fdict is useful when many tokens in a given seq stem to the same entry. The resulting fdict has the form:
{ normalized-entry { entry1 freq1 entry2 freq2 }, ... }
All of entry1, entry2, … are normalized without stemming; when stemming is applied, they normalize to the same normalized-entry. The normalization function is constructed according to the keys in env.
The following keywords mapping to functions can be included in env:
- :split-tokens-f - tokenizer function (mandatory),
- :stem-f - stemming function (mandatory),
- :trans-tokens-f - transforming function (default: identity),
- :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
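The nested structure of an fdict can be illustrated with a small sketch. The normalization is hard-coded here rather than built from env, and the names unstemmed, stemmed, and gen-fdict-sketch, as well as the toy stemmer, are assumptions for illustration rather than langlab code.

```clojure
(require '[clojure.string :as str])

;; Hypothetical normalization without stemming: lower-case and re-join tokens.
(defn unstemmed [s]
  (str/join " " (str/split (str/lower-case s) #"\s+")))

;; Hypothetical normalization with stemming: also strips a trailing "s" per token.
(defn stemmed [s]
  (->> (str/split (str/lower-case s) #"\s+")
       (map #(str/replace % #"s$" ""))
       (str/join " ")))

(defn gen-fdict-sketch
  "Counts each unstemmed variant under its stemmed normalized entry."
  [entries]
  (reduce (fn [fdict s]
            (update-in fdict [(stemmed s) (unstemmed s)] (fnil inc 0)))
          {}
          entries))

(gen-fdict-sketch ["Apples" "apples" "apple" "Pear"])
;; => {"apple" {"apples" 2, "apple" 1}, "pear" {"pear" 1}}
```

The variants "apples" and "apple" stay distinct inside the inner map because they are only normalized without stemming, yet both are grouped under the stemmed key "apple".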
make-tag-f
(make-tag-f dict env)
Creates a function that tags a string with the dictionary dict, based on the functions contained in env. The tagging result is a map { tag freq }.
The following keywords mapping to functions can be included in env:
- :split-tokens-f - tokenizer function (mandatory),
- :stem-f - stemming function (mandatory),
- :trans-tokens-f - transforming function (default: identity),
- :merge-tokens-f - merging tokens (default: merge-tokens-with-space),
- :split-sentences-f - parses a string into sentences (optional).
If :split-sentences-f is given, sentences are parsed separately and the tag maps from all sentences are merged.
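The sketch below shows one way such a tagging function could be assembled. It is a simplification under several assumptions, not langlab's implementation: it matches only single-token tags, it assumes dict maps stemmed entries to unstemmed ones (the shape described for gen-dict-from-seq), it reports the dictionary values as tags, and it assumes per-sentence tag maps are merged by adding frequencies. The env values and the name make-tag-f-sketch are hypothetical.

```clojure
(require '[clojure.string :as str])

;; Hypothetical env entries, including a naive sentence splitter.
(def env
  {:split-tokens-f    #(str/split % #"\s+")
   :stem-f            #(str/replace % #"s$" "")   ; toy stemmer
   :trans-tokens-f    #(map str/lower-case %)
   :split-sentences-f #(str/split % #"\.\s*")})

(defn make-tag-f-sketch
  "Returns a function from a string to a { tag freq } map."
  [dict {:keys [split-tokens-f stem-f trans-tokens-f split-sentences-f]
         :or   {trans-tokens-f identity}}]
  (let [tag-one (fn [s]
                  (->> (split-tokens-f s)
                       trans-tokens-f
                       (map stem-f)
                       (keep dict)        ; look up each stemmed token in dict
                       frequencies))]
    (if split-sentences-f
      ;; tag each sentence separately, then merge by summing frequencies
      (fn [s] (apply merge-with + (map tag-one (split-sentences-f s))))
      tag-one)))

(def tag (make-tag-f-sketch {"apple" "apples", "pear" "pears"} env))

(tag "Apples are sweet. Pears and apples ripen.")
;; => {"apples" 2, "pears" 1}
```

Dropping :split-sentences-f from env in this sketch tags the whole string in one pass, which yields the same counts for single-token tags.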