Settings for the pattern extraction
The main configuration for mining constructions with cxnMiner needs to be given either directly as json or as a file containing json. The following options need to be set:
- word_level
The name of the column from the conllu data that is used for words in patterns.
- levels
Additional levels to extract for hybrid n-grams. This can be any of the following:
- form
the plain form of the token
- lemma
the lemma
- upos
the universal part-of-speech tag
- xpos
the language specific part-of-speech tag
- np_function
the dependency relation for tokens with given upos tags (see phrase_tags below).
Do not include the word_level in this list. It is added automatically.
- phrase_tags
Here a list of POS tags can be given. Whenever a token with one of these tags in the column upos is seen, the corresponding subtree is also collapsed into a single node during the pattern extraction. This allows to approximate phrases in the patterns.
- unknown
Represent unknown elements with the given string. Unknown elements appear when the vocabulary is extracted from a corpus that is different from the corpus used to extract pattern or because the items have been removed when filtering the vocabulary (Filter dictionary).
- extractor
A json object containing the settings for the extractor that is used to extract the patterns.
Pattern extractor
A pattern extractor is used to to extract patterns from a given sentence. For now the only available extractor is an extractor for syntactic (hybrid) n-grams. For a description of syntactic n-grams see Sidorov (2019).
An example json object used to represent a syntactic n-gram extractor in the config is given below:
{
"type": "sngram",
"options": {
"max_size": 8,
"left_bracket": "[",
"right_bracket": "]",
"comma": ","
}
}
The extractor takes the following options:
- max_size
Upper limit for the size of the patterns.
- left_bracket, right_bracket, comma
These options allow to adapt some elements of the meta language as described by Sidorov (2019) for the representation of syntactic n-grams.