Data preparation

cxnMiner takes an annotated corpus in CoNLL-U format as input.

Download and annotate Wikipedia dumps

cxnMiner uses an annotated corpus to automatically identify constructions. One example of a text collection that can be used is Wikipedia. The following commands retrieve the latest German Wikipedia dump and then extract the texts using segment_wiki:

wget -P data/ https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
python -m gensim.scripts.segment_wiki -f data/dewiki-latest-pages-articles.xml.bz2 -o data/dewiki-latest.json.gz
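
The file written by segment_wiki is gzip-compressed and contains one JSON object per line; each object holds the article title and two parallel lists, section_titles and section_texts. A minimal Python sketch for inspecting such a file and dropping unwanted sections might look as follows. The helper names iter_articles and drop_sections are made up for this illustration and are not part of cxnminer.

```python
import gzip
import json


def iter_articles(path):
    """Yield one dict per article from a segment_wiki output file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)


def drop_sections(article, excluded):
    """Return (title, text) pairs for all sections whose title is not excluded."""
    return [
        (title, text)
        for title, text in zip(article["section_titles"], article["section_texts"])
        if title not in excluded
    ]


# Small in-memory article with the same keys that segment_wiki writes.
article = {
    "title": "Berlin",
    "section_titles": ["Introduction", "Geschichte", "Weblinks"],
    "section_texts": ["Intro text", "History text", "Link list"],
}
kept = drop_sections(article, {"Literatur", "Weblinks", "Einzelnachweise"})
print([title for title, _ in kept])
```

To look at the real dump, iterate over iter_articles("data/dewiki-latest.json.gz") instead of the in-memory article.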

The cxnminer package comes with a tool that annotates text in the format produced by segment_wiki. The tool writes its output in CoNLL-U format, as expected for construction mining.

Call this script using:

bin/process_wiki_data infile outfile '{"annotator": "spacy", "annotator_options": {"model_name": "de_core_news_sm"}, "exclude_sections": ["Literatur", "Weblinks", "Einzelnachweise"], "max_sent_len": 70}' --logging_config='{"handlers": { "h":{ "level": "DEBUG", "class": "logging.FileHandler", "filename": "logfile.txt", "mode": "w", "formatter": "f"}}}'

Options

infile

The name of the file that contains the text in a json format. If the filename ends with “.gz” it is assumed to be a compressed file, otherwise it is assumed to be plain text json.

outfile

The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the annotated corpus in CoNLL-U format. If the filename ends with “.gz” the file will be compressed.

config

Config expects a json-object with the following entries:

annotator
the name of an annotator (see below)

annotator_options
a json-object with options for the selected annotator

exclude_sections
a list with section names that should be removed (e.g. References)

max_sent_len
optional, default: 70; sentences longer than this are skipped

--logging_config: Optionally, the logging configuration can be set. It expects a json-object that represents a dict as used for logger configuration.
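
Writing the two json arguments by hand is error-prone because of the nested quoting. One possible way around this is to build them in Python and let json.dumps and shlex.join take care of serialization and quoting. The sketch below only prints the command; the config values are copied from the example call, the input file is the one written by segment_wiki, and the output file name is an assumption.

```python
import json
import shlex

# Values copied from the example call; adjust them to your setup.
config = {
    "annotator": "spacy",
    "annotator_options": {"model_name": "de_core_news_sm"},
    "exclude_sections": ["Literatur", "Weblinks", "Einzelnachweise"],
    "max_sent_len": 70,
}
logging_config = {
    "handlers": {
        "h": {
            "level": "DEBUG",
            "class": "logging.FileHandler",
            "filename": "logfile.txt",
            "mode": "w",
            "formatter": "f",
        }
    }
}

infile = "data/dewiki-latest.json.gz"
outfile = "data/dewiki-latest.conllu.gz"  # assumed name, not prescribed by the tool

command = [
    "bin/process_wiki_data",
    infile,
    outfile,
    json.dumps(config),
    "--logging_config=" + json.dumps(logging_config),
]

# shlex.join quotes every argument so the line can be pasted into a shell.
print(shlex.join(command))
```

The same list can be passed to subprocess.run(command, check=True) to start the script directly, in which case no shell quoting is needed at all.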

Annotators

spacy

Uses spaCy. It expects the following options:

model_name
the name of the model to use. It has to be available in the current environment (see the spaCy documentation for more information on installing models).
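
Since spaCy models installed through pip are ordinary Python packages, a quick check before starting a long annotation run could look like the sketch below. The function name model_is_installed is hypothetical, and the check only tells you that a package of that name can be found, not that it is compatible with the installed spaCy version.

```python
import importlib.util


def model_is_installed(model_name):
    """Return True if a top-level package with this name can be found."""
    return importlib.util.find_spec(model_name) is not None


if not model_is_installed("de_core_news_sm"):
    print("Model de_core_news_sm not found; install it before running the annotator.")
```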

Encode data

To extract constructions efficiently, the data is encoded using Huffman coding. This is done in several steps.
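
As a reminder of the idea behind Huffman coding: frequent items receive short bit strings and rare items long ones, and no code is a prefix of another. The following is a generic textbook construction using heapq, shown only to illustrate why the frequencies are needed; it is not cxnminer's own encoder, and the example frequencies are invented.

```python
import heapq


def huffman_codes(frequencies):
    """Map each symbol to a bit string; frequent symbols get shorter codes."""
    # Heap entries: (total frequency, unique tie-breaker, {symbol: code so far}).
    heap = [
        (freq, index, {symbol: ""})
        for index, (symbol, freq) in enumerate(sorted(frequencies.items()))
    ]
    if not heap:
        return {}
    if len(heap) == 1:
        # A single symbol still needs a non-empty code.
        return {symbol: "0" for symbol in heap[0][2]}
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {symbol: "0" + code for symbol, code in codes_a.items()}
        merged.update({symbol: "1" + code for symbol, code in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, counter, merged))
        counter += 1
    return heap[0][2]


codes = huffman_codes({"NOUN": 50, "VERB": 30, "ADJ": 15, "X": 5})
for symbol, code in sorted(codes.items(), key=lambda item: len(item[1])):
    print(symbol, code)
```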

Extract dictionary

Extract a dictionary of possible pattern elements to create an encoder.

cxnminer utils extract-vocabulary infile outfile config
cxnminer utils extract-vocabulary example_data/example_data.conllu example_data/example_data_dict.json example_data/example_config.json

Options

infile: The name of the file that contains the annotated corpus in CoNLL-U format. If the filename ends with “.gz” it is assumed to be a compressed file.
outfile: The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the vocabulary for the given levels in json format. If the filename ends with “.gz” the file will be compressed.
config: The configuration for construction mining as described in Settings for the pattern extraction.

--drop_frequencies: By default the dictionary contains the frequencies (needed to create a Huffman encoder); add this flag to drop them.
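
Conceptually, this step counts how often each value occurs on each annotation level of the CoNLL-U file. The sketch below illustrates the idea for three columns (FORM, LEMMA, UPOS). The level names, the function count_vocabulary and the resulting layout are assumptions made for this example; which levels cxnminer actually extracts is determined by the config file.

```python
from collections import Counter

# Column positions as defined by the CoNLL-U format (0-based).
COLUMNS = {"form": 1, "lemma": 2, "upos": 3}


def count_vocabulary(lines, levels=("form", "lemma", "upos")):
    """Count the values of the given columns over all regular token lines."""
    counts = {level: Counter() for level in levels}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # sentence boundary or comment
        fields = line.split("\t")
        if not fields[0].isdigit():
            continue  # multiword token range (1-2) or empty node (1.1)
        for level in levels:
            counts[level][fields[COLUMNS[level]]] += 1
    return counts


sample = [
    "# text = Der Hund bellt .",
    "1\tDer\tder\tDET\t_\t_\t2\tdet\t_\t_",
    "2\tHund\tHund\tNOUN\t_\t_\t3\tnsubj\t_\t_",
    "3\tbellt\tbellen\tVERB\t_\t_\t0\troot\t_\t_",
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
]
counts = count_vocabulary(sample)
print(dict(counts["upos"]))
```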

Filter dictionary

Remove items with a frequency below a given threshold from an extracted dictionary.

bin/filter_vocabulary dictionaries outfile min_frequency
bin/filter_vocabulary example_data/example_data_dict.json example_data/example_data_dict_filtered.json 2

Options

dictionaries: The name of the file that contains the dictionary (including frequencies) extracted from the corpus. If the filename ends with “.gz” it is assumed to be a compressed file.
outfile: The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the filtered vocabulary in json format. If the filename ends with “.gz” the file will be compressed.
min_frequency: Items with a lower frequency will be dropped from the dictionary.
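
The effect of the threshold can be sketched in a few lines. The layout assumed here, a mapping from level name to a mapping from item to frequency, is an assumption for illustration and may differ from the json file that cxnminer writes.

```python
def filter_vocabulary(dictionaries, min_frequency):
    """Keep only items whose frequency is at least min_frequency."""
    return {
        level: {item: freq for item, freq in items.items() if freq >= min_frequency}
        for level, items in dictionaries.items()
    }


# Invented example counts.
dictionaries = {
    "lemma": {"der": 12, "Hund": 2, "bellen": 1},
    "upos": {"DET": 12, "NOUN": 9, "X": 1},
}
filtered = filter_vocabulary(dictionaries, 2)
print(filtered)
```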

Prepare encoder

Create and pickle an encoder based on an extracted dictionary.

cxnminer utils create-encoder dictionaries outfile config
cxnminer utils create-encoder example_data/example_data_dict_filtered.json example_data/example_data_encoder example_data/example_config.json

Options

dictionaries: The name of the file that contains the dictionary extracted from the corpus. If the filename ends with “.gz” it is assumed to be a compressed file.
outfile: The pickled encoder. If the filename ends with “.gz” the file will be compressed.
config: The configuration for construction mining as described in Settings for the pattern extraction.
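
The “.gz” convention used throughout these tools can be mimicked with a small helper when you want to read or write such files yourself. The sketch below pickles a stand-in object (a plain mapping, not a real cxnminer encoder) to a compressed file and loads it again; all function names are hypothetical.

```python
import gzip
import os
import pickle
import tempfile


def open_maybe_gz(filename, mode):
    """Open the file with gzip if its name ends with '.gz', else as a regular file."""
    if filename.endswith(".gz"):
        return gzip.open(filename, mode)
    return open(filename, mode)


def save_pickle(obj, filename):
    with open_maybe_gz(filename, "wb") as handle:
        pickle.dump(obj, handle)


def load_pickle(filename):
    with open_maybe_gz(filename, "rb") as handle:
        return pickle.load(handle)


# Stand-in for an encoder object: a plain symbol-to-code mapping.
encoder = {"NOUN": "0", "VERB": "11", "ADJ": "101"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_encoder.gz")
    save_pickle(encoder, path)
    restored = load_pickle(path)

print(restored == encoder)
```

Unpickling executes code stored in the file, so only load pickled encoders that you created yourself or that come from a trusted source.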

Encode dictionary

Encodes the complete dictionary, creating a lookup table that can be used to encode the corpus.

cxnminer utils encode-vocabulary vocabulary outfile encoder config
cxnminer utils encode-vocabulary example_data/example_data_dict_filtered.json example_data/example_data_dict_filtered_encoded.json example_data/example_data_encoder example_data/example_config.json

Options

vocabulary: Either the name of the file that contains the dictionary extracted from the corpus (if the filename ends with “.gz” it is assumed to be a compressed file), or a json-string containing the vocabulary directly.
outfile: The name of the file that is created by the script (if it exists, it will be overwritten). It will contain a lookup table for the vocabulary and the encoded versions in json format. If the filename ends with “.gz” the file will be compressed.
encoder: The pickled encoder.
config: The configuration for construction mining as described in Settings for the pattern extraction.

--no_frequencies: Add this flag if the dictionary does not contain frequencies.
--logging_config: See above.

Encode corpus

To make the pattern extraction more efficient, the corpus can be pre-encoded. This step uses an encoded dictionary to encode the corpus efficiently.

cxnminer utils encode-corpus infile outfile dictionary config
cxnminer utils encode-corpus example_data/example_data.conllu example_data/example_data_encoded.conllu example_data/example_data_dict_filtered_encoded.json example_data/example_config.json

Options

infile: The name of the file that contains the annotated corpus in CoNLL-U format. If the filename ends with “.gz” it is assumed to be a compressed file.
outfile: The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the annotated corpus in CoNLL-U format with encoded levels. If the filename ends with “.gz” the file will be compressed.
dictionary: The encoded dictionary.
config: The configuration for construction mining as described in Settings for the pattern extraction.

--processes: Controls the number of processes to be used.
--logging_config: See above.
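
The five encoding steps depend on each other's output, so it can be convenient to run them from one small driver script. The sketch below strings together the example commands from this section with the example_data file names; the function run_pipeline and its dry_run switch are hypothetical. By default it only prints the commands.

```python
import subprocess

DATA = "example_data/example_data"
CONFIG = "example_data/example_config.json"

# The example commands from this section, in the order they have to run.
COMMANDS = [
    ["cxnminer", "utils", "extract-vocabulary",
     f"{DATA}.conllu", f"{DATA}_dict.json", CONFIG],
    ["bin/filter_vocabulary",
     f"{DATA}_dict.json", f"{DATA}_dict_filtered.json", "2"],
    ["cxnminer", "utils", "create-encoder",
     f"{DATA}_dict_filtered.json", f"{DATA}_encoder", CONFIG],
    ["cxnminer", "utils", "encode-vocabulary",
     f"{DATA}_dict_filtered.json", f"{DATA}_dict_filtered_encoded.json",
     f"{DATA}_encoder", CONFIG],
    ["cxnminer", "utils", "encode-corpus",
     f"{DATA}.conllu", f"{DATA}_encoded.conllu",
     f"{DATA}_dict_filtered_encoded.json", CONFIG],
]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it as well unless dry_run is set."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline as soon as one step fails.
            subprocess.run(command, check=True)


run_pipeline(COMMANDS)
```

Call run_pipeline(COMMANDS, dry_run=False) from the repository root to actually execute the steps.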