langlab.algs.tagging

This module contains functionality related to tagging with dictionary tags. The implemented functionality falls into three main areas:

  • generating dictionaries and frequency dictionaries,
  • tagging itself,
  • tag algebra (union, intersection, difference).

calc-tags-difference

(calc-tags-difference tags1 tags2 env)

Returns all { tag freq } pairs from the tags1 map whose normalized tag is not among the normalized tags of tags2. The normalization function splits a tag into tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.

The following keywords mapping to functions can be included in env:

  • :split-tokens-f - tokenizer function (mandatory),
  • :stem-f - stemming function (mandatory),
  • :trans-tokens-f - transforming function (default: identity),
  • :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
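The behaviour described above can be sketched in plain Clojure. This is an illustration under stated assumptions, not langlab's implementation: make-normalize-f and tags-difference-sketch are hypothetical names, :trans-tokens-f is assumed to receive the whole token sequence, :stem-f is assumed to be applied to each token, and the toy tokenizer and stemmer only stand in for real ones.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: builds the normalization function from env.
;; Assumes trans-tokens-f takes the whole token seq and stem-f one token.
(defn make-normalize-f
  [{:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f]
    :or   {trans-tokens-f identity
           merge-tokens-f #(str/join " " %)}}]
  (fn [s]
    (->> (split-tokens-f s)   ; split
         trans-tokens-f       ; transform
         (map stem-f)         ; stem
         merge-tokens-f)))    ; merge

;; Keeps the pairs of tags1 whose normalized tag does not occur in tags2.
(defn tags-difference-sketch [tags1 tags2 env]
  (let [norm (make-normalize-f env)
        seen (set (map norm (keys tags2)))]
    (into {} (remove (fn [[tag _]] (contains? seen (norm tag))) tags1))))

;; Toy env: whitespace tokenizer, lower-casing, "strip a trailing s" stemmer.
(def demo-env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")
   :trans-tokens-f #(map str/lower-case %)})

(tags-difference-sketch {"Red Cars" 3, "blue boats" 1} {"red car" 5} demo-env)
;; => {"blue boats" 1}
```

"Red Cars" and "red car" both normalize to "red car" in this toy setup, so only "blue boats" survives.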

calc-tags-intersection

(calc-tags-intersection tags1 tags2 env)

Returns all { tag freq } pairs from the tags1 map whose normalized tag is among the normalized tags of tags2. The normalization function splits a tag into tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.

The following keywords mapping to functions can be included in env:

  • :split-tokens-f - tokenizer function (mandatory),
  • :stem-f - stemming function (mandatory),
  • :trans-tokens-f - transforming function (default: identity),
  • :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
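A minimal sketch of the intersection semantics, written without langlab itself. The names normalize-with and tags-intersection-sketch are hypothetical, and the sketch assumes that :trans-tokens-f works on the whole token sequence and :stem-f on a single token. Note that the frequencies in the result come from tags1; tags2 only decides which tags are kept.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: split, transform, stem, merge (per the env keys).
(defn normalize-with
  [{:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f]
    :or   {trans-tokens-f identity
           merge-tokens-f #(str/join " " %)}}
   s]
  (->> (split-tokens-f s) trans-tokens-f (map stem-f) merge-tokens-f))

;; Keeps the pairs of tags1 whose normalized tag also occurs in tags2.
(defn tags-intersection-sketch [tags1 tags2 env]
  (let [seen (set (map #(normalize-with env %) (keys tags2)))]
    (into {}
          (filter (fn [[tag _]] (contains? seen (normalize-with env tag))))
          tags1)))

;; Toy env; the stemmer merely strips a trailing "s".
(def toy-env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")
   :trans-tokens-f #(map str/lower-case %)})

(tags-intersection-sketch {"Red Cars" 3, "blue boats" 1} {"red car" 5} toy-env)
;; => {"Red Cars" 3}
```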

calc-tags-union

(calc-tags-union tags1 tags2 env)

Returns all { tag freq } pairs from tags1, plus those pairs from tags2 whose normalized tag is not already covered by tags1. The normalization function splits a tag into tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.

The following keywords mapping to functions can be included in env:

  • :split-tokens-f - tokenizer function (mandatory),
  • :stem-f - stemming function (mandatory),
  • :trans-tokens-f - transforming function (default: identity),
  • :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
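The union is asymmetric: when both maps contain tags with the same normalized form, the pair from tags1 wins. The following sketch illustrates this under assumptions; norm-tag and tags-union-sketch are hypothetical names, and the env functions are toy stand-ins (with :trans-tokens-f assumed to take the token sequence and :stem-f a single token).

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: split, transform, stem, merge (per the env keys).
(defn norm-tag
  [{:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f]
    :or   {trans-tokens-f identity
           merge-tokens-f #(str/join " " %)}}
   s]
  (->> (split-tokens-f s) trans-tokens-f (map stem-f) merge-tokens-f))

;; All pairs of tags1, plus the pairs of tags2 not already covered by tags1.
(defn tags-union-sketch [tags1 tags2 env]
  (let [covered (set (map #(norm-tag env %) (keys tags1)))]
    (into tags1
          (remove (fn [[tag _]] (contains? covered (norm-tag env tag))))
          tags2)))

;; Toy env; the stemmer merely strips a trailing "s".
(def union-env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")
   :trans-tokens-f #(map str/lower-case %)})

(tags-union-sketch {"Red Cars" 3} {"red car" 5, "green bikes" 2} union-env)
;; => {"Red Cars" 3, "green bikes" 2}
```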

conv-fdict-to-dict

(conv-fdict-to-dict fdict)

Converts frequency dictionary fdict of the form

{ normalized-entry { entry1 freq1 entry2 freq2 }, ... },

to an ordinary dictionary by selecting, for each normalized-entry, the most frequent of entry1, entry2, … Among entries with the same frequency, the shortest one is selected.
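The selection rule can be sketched as a sort on (descending frequency, ascending length). The names pick-entry and conv-fdict-to-dict-sketch are hypothetical, and this is only an illustration of the rule, not the library's code. How a tie in both frequency and length is resolved is not stated above, so the sketch leaves it unspecified.

```clojure
;; Picks the most frequent entry; among equally frequent ones, the shortest.
(defn pick-entry [entry-freqs]
  (->> entry-freqs
       (sort-by (fn [[entry freq]] [(- freq) (count entry)]))
       ffirst))

;; Maps every normalized entry to its selected surface form.
(defn conv-fdict-to-dict-sketch [fdict]
  (into {}
        (map (fn [[normalized entry-freqs]]
               [normalized (pick-entry entry-freqs)]))
        fdict))

(conv-fdict-to-dict-sketch
 {"red car" {"red cars" 2, "red car" 2, "red  car" 1}
  "run"     {"running" 4, "runs" 1}})
;; => {"red car" "red car", "run" "running"}
```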

gen-dict-from-reader

(gen-dict-from-reader r env)

Generates a dictionary from the given reader r, passing env to gen-dict-from-seq.

gen-dict-from-seq

(gen-dict-from-seq seq env)

Generates a dictionary from the given seq. The result is a map of the form { normalized-entry normalized-entry-without-stemming }. The normalization function splits an entry into tokens, transforms them, stems them, and finally merges them. It is constructed according to the keys in env.

The following keywords mapping to functions can be included in env:

  • :split-tokens-f - tokenizer function (mandatory),
  • :stem-f - stemming function (mandatory),
  • :trans-tokens-f - transforming function (default: identity),
  • :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
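To make the shape of the result concrete, here is a small sketch that builds such a map from a sequence of strings. It is written under assumptions: gen-dict-sketch is a hypothetical name, :trans-tokens-f is assumed to take the token sequence, :stem-f a single token, and when two entries share a normalized key the later one simply overwrites the earlier one, which may differ from the library's behaviour.

```clojure
(require '[clojure.string :as str])

(defn gen-dict-sketch
  [entries {:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f]
            :or   {trans-tokens-f identity
                   merge-tokens-f #(str/join " " %)}}]
  (let [;; value: normalized without the stemming step
        no-stem (fn [s] (-> s split-tokens-f trans-tokens-f merge-tokens-f))
        ;; key: fully normalized, including stemming
        stemmed (fn [s] (->> s split-tokens-f trans-tokens-f
                             (map stem-f) merge-tokens-f))]
    (into {} (map (juxt stemmed no-stem)) entries)))

;; Toy env; the stemmer merely strips a trailing "s".
(def dict-env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")
   :trans-tokens-f #(map str/lower-case %)})

(gen-dict-sketch ["Red Cars" "Blue Boats"] dict-env)
;; => {"red car" "red cars", "blue boat" "blue boats"}
```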

gen-fdict-from-reader

(gen-fdict-from-reader r env)

Generates a frequency dictionary from the given reader r, passing env to gen-fdict-from-seq.

gen-fdict-from-seq

(gen-fdict-from-seq seq env)

Generates a frequency dictionary (fdict) from seq. An fdict is useful when many entries in the given seq stem to the same normalized entry. The resulting fdict has the form:

{ normalized-entry { entry1 freq1 entry2 freq2 }, ...}.

All of entry1, entry2, … are normalized without stemming; once stemming is applied, they all normalize to the same normalized-entry. The normalization function is constructed according to the keys in env.

The following keywords mapping to functions can be included in env:

  • :split-tokens-f - tokenizer function (mandatory),
  • :stem-f - stemming function (mandatory),
  • :trans-tokens-f - transforming function (default: identity),
  • :merge-tokens-f - merging tokens (default: merge-tokens-with-space).
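The nested structure of an fdict can be illustrated with a short sketch that counts unstemmed forms under their stemmed key. The name gen-fdict-sketch is hypothetical, and the sketch assumes that :trans-tokens-f takes the token sequence and :stem-f a single token; it is not the library's implementation.

```clojure
(require '[clojure.string :as str])

(defn gen-fdict-sketch
  [entries {:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f]
            :or   {trans-tokens-f identity
                   merge-tokens-f #(str/join " " %)}}]
  (let [no-stem (fn [s] (-> s split-tokens-f trans-tokens-f merge-tokens-f))
        stemmed (fn [s] (->> s split-tokens-f trans-tokens-f
                             (map stem-f) merge-tokens-f))]
    ;; For each entry, bump the counter at [stemmed-form unstemmed-form].
    (reduce (fn [fdict s]
              (update-in fdict [(stemmed s) (no-stem s)] (fnil inc 0)))
            {}
            entries)))

;; Toy env; the stemmer merely strips a trailing "s".
(def fdict-env
  {:split-tokens-f #(str/split % #"\s+")
   :stem-f         #(str/replace % #"s$" "")
   :trans-tokens-f #(map str/lower-case %)})

(gen-fdict-sketch ["Red Cars" "red car" "red cars"] fdict-env)
;; => {"red car" {"red cars" 2, "red car" 1}}
```

In this toy run, "Red Cars" and "red cars" share the unstemmed form "red cars", so that form is counted twice under the key "red car".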

make-tag-f

(make-tag-f dict env)

Creates a function that tags a string with the dictionary dict, based on the functions contained in env. The created function returns a map { tag freq }.

The following keywords mapping to functions can be included in env:

  • :split-tokens-f - tokenizer function (mandatory),
  • :stem-f - stemming function (mandatory),
  • :trans-tokens-f - transforming function (default: identity),
  • :merge-tokens-f - merging tokens (default: merge-tokens-with-space),
  • :split-sentences-f - parses string into sentences (optional).

If :split-sentences-f is given, sentences are parsed separately and tag maps from all sentences are merged.
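The description above does not say how tags are matched inside a sentence, so the sketch below makes that part up for illustration: it looks up every run of one to three consecutive stemmed tokens in dict and counts the dictionary values found. The name make-tag-f-sketch, the n-gram matching, and the cap max-n are all assumptions; only the per-sentence tagging and the merging of the per-sentence tag maps follow the text.

```clojure
(require '[clojure.string :as str])

(defn make-tag-f-sketch
  [dict {:keys [split-tokens-f stem-f trans-tokens-f merge-tokens-f
                split-sentences-f]
         :or   {trans-tokens-f identity
                merge-tokens-f #(str/join " " %)}}]
  (let [max-n 3 ; hypothetical cap on tag length, in tokens
        tag-sentence
        (fn [s]
          (let [stems (->> s split-tokens-f trans-tokens-f (map stem-f))]
            (->> (range 1 (inc max-n))
                 (mapcat #(partition % 1 stems)) ; all 1..max-n token runs
                 (map merge-tokens-f)            ; normalized candidates
                 (keep dict)                     ; keep those in the dictionary
                 frequencies)))]                 ; => { tag freq }
    (fn [s]
      (if split-sentences-f
        ;; Tag each sentence on its own, then sum the frequencies.
        (apply merge-with + {} (map tag-sentence (split-sentences-f s)))
        (tag-sentence s)))))

;; Toy dictionary and env; the stemmer merely strips a trailing "s".
(def tag-dict {"red car" "red cars", "boat" "boats"})
(def tag-env
  {:split-tokens-f    #(str/split % #"\s+")
   :stem-f            #(str/replace % #"s$" "")
   :trans-tokens-f    #(map str/lower-case %)
   :split-sentences-f #(str/split % #"\.\s*")})

((make-tag-f-sketch tag-dict tag-env) "Red cars pass boats. A boat sinks.")
;; => {"red cars" 1, "boats" 2}
```

Splitting into sentences first prevents a tag from being matched across a sentence boundary, which is one plausible reason for offering :split-sentences-f.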