Pattern extraction

From data that has been prepared as described in Data preparation patterns can be extracted and filtered to find potential constructions.

Extract patterns

Get lists of patterns and corresponding base patterns from the corpus.

cxnminer extract-patterns infile outfile_patterns outfile_base encoded_dictionary config
cxnminer extract-patterns example_data/example_data_encoded.conllu example_data/example_data_patterns.tsv example_data/example_data_base_patterns.tsv example_data/example_data_dict_filtered_encoded.json example_data/example_config.json

Options

infile

The name of the file that contains the encoded text in the conllu format. If the filename ends with “.gz” it is assumed to be a compressed file, otherwise it is assumed to be plain conllu.

outfile_patterns

The name of the file that is created by the script (if it exists, it will be overwritten). It will contain a pair of pattern and basepattern in each line. If the filename ends with “.gz” the file will be compressed.

outfile_base

The name of the file that is created by the script (if it exists, it will be overwritten). It will contain a pair of basepattern and the id of the sentence in which it appeared in each line. If the filename ends with “.gz” the file will be compressed.

encoded_dictionaries

Encoded dictionaries as described in Encode dictionary.

config

The configuration for construction mining as described in Settings for the pattern extraction.

--keep_only_word

Keep only patterns with the given word.

--keep_only_dict_words

Removes all patterns that contain an element which is not contained in the dictionary. This does not remove patterns which contain the special element “__unknown__” which might has been introduced during the preparation of the data due to filtering the dicitionary.

--skip_unknown

Removes all patterns that contain the element “__unknown__”.

Afterwards the lists of patterns and base patterns can be converted to pattern sets for further processing.

For this the lists needs to be sorted. This can be done, e.g., with the following commands. LC_ALL=c is added in order to assure that punctation symbols are not ignored:

LC_ALL=c sort example_data/example_data_patterns.tsv > example_data/example_data_patterns_sorted.tsv
LC_ALL=c sort example_data/example_data_base_patterns.tsv > example_data/example_data_base_patterns_sorted.tsv

The sorted lists are converted using the following command:

cxnminer utils convert-pattern-list infile outfile
cxnminer utils convert-pattern-list example_data/example_data_base_patterns_sorted.tsv example_data/example_data_base_pattern_set.jsonl
cxnminer utils convert-pattern-list example_data/example_data_patterns_sorted.tsv example_data/example_data_pattern_set.jsonl

Options

infile

The name of the file that contains the list of patterns or base patterns. If the filename ends with “.gz” it is assumed to be a compressed file, otherwise it is assumed to be plain conllu.

outfile

The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the a (base) pattern in each line. The patterns are represented using json (see JSON lines). If the filename ends with “.gz” the file will be compressed.

--remove_hapax

This removes patterns that only appear with one base pattern. Please note, that this is a different criterion than filtering using the frequency, since the frequency is based on sentences and not base patterns.

Get statistics about patterns

For finding relevant patterns, the pattern set can be enriched with statistics about the patterns. Applying the command add-pattern-stats to an encoded set of patterns and corresponding basepatterns collects simple statistics like frequency.

cxnminer utils add-pattern-stats example_data/example_data_pattern_set.jsonl example_data/example_data_patterns_simple_stats.json --base_patterns example_data/example_data_base_pattern_set.jsonl

These statistics can then be used to filter the patterns, e.g. by removing patterns that appear only once:

cxnminer utils filter-patterns example_data/example_data_pattern_set.jsonl example_data/example_data_patterns_simple_stats.json frequency 2 example_data/example_data_pattern_set_frequent.jsonl

The relation between the given statistics and the threshold can be defined by adding the option –operator which defaults to >=.

In order to collect statistics that need access to the individual elements of the patterns, e.g., the schematicity, the pattern set has to be decoded:

cxnminer utils decode-patterns example_data/example_data_pattern_set_frequent.jsonl example_data/example_data_encoder example_data/example_data_pattern_set_frequent_decoded --processes 4

After having decoded the pattern set, further statistics can be collected:

cxnminer utils get-vocabulary-probs example_data/example_data_dict.json example_data/example_data_dictionary_probs.json
cxnminer utils get-pattern-type-freq example_data/example_data_pattern_set_frequent_decoded example_data/example_data_patterns_simple_stats.json example_data/example_data_pattern_set_frequent_type_frequencies.json
cxnminer utils add-pattern-stats example_data/example_data_pattern_set_frequent.jsonl example_data/example_data_patterns_stats.json --decoded_patterns example_data/example_data_pattern_set_frequent_decoded --config example_data/example_config.json --vocabulary_probs example_data/example_data_dictionary_probs.json --known_stats example_data/example_data_patterns_simple_stats.json --pattern_profile_frequency example_data/example_data_pattern_set_frequent_type_frequencies.json

Get best patterns

After having collected patterns and statistics about them, these statistics can be used for further filtering the pattern set and extract patterns that are likely constructions. E.g., UIF-PMI, the measure proposed by Forsberg et al. (2014), can be used with the following command to get a decoded pattern set containing the 2 patterns with the highest UIF-PMI value with 1 corresponding instantiation (base pattern) that appears most frequently in the data:

cxnminer utils get-top-n example_data/example_data_pattern_set_frequent.jsonl example_data/example_data_patterns_stats.json uif-pmi 2 example_data/example_data_pattern_set_top_2_uifpmi.jsonl
cxnminer utils get-top-n-base-patterns example_data/example_data_pattern_set_top_2_uifpmi.jsonl example_data/example_data_base_pattern_set.jsonl 1 example_data/example_data_pattern_set_top_2_uifpmi_basesel_1.jsonl --example_ids example_data/example_data_pattern_set_top_2_uifpmi_basesel_1_exampleids.json
cxnminer utils decode-pattern-collection example_data/example_data_pattern_set_top_2_uifpmi_basesel_1.jsonl example_data/example_data_encoder example_data/example_data_pattern_set_top_2_uifpmi_basesel_1_decoded.jsonl --string
cxnminer corpus2sentences example_data/example_data.conllu example_data/sentences --example_ids example_data/example_data_pattern_set_top_2_uifpmi_basesel_1_exampleids.json