Data preparation
cxnMiner takes an annotated corpus in CoNNL-U format as input.
Download and annotate Wikipedia dumps
cxnMiner uses an annotated corpus to automatically identify constructions. One example for a text collection that can be used is Wikipedia. The following commands retrieve the latest German Wikipedia dump and then extract the texts using segment_wiki
wget -P data/ https://dumps.wikimedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
python -m gensim.scripts.segment_wiki -f data/dewiki-latest-pages-articles.xml.bz2 -o data/dewiki-latest.json.gz
The package cxnminer comes with a tool to annotate the text in the format that is output by segement_wiki.
The output of the tool is in CoNNL-U format as expected for construction mining.
Call this script using:
bin/process_wiki_data infile outfile '{"annotator": "spacy", "annotator_options": {"model_name": "de_core_news_sm"}, "exclude_sections": ["Literatur", "Weblinks", "Einzelnachweise"], "max_sent_len": 70}' --logging_config='{"handlers": { "h":{ "level": "DEBUG", "class": "logging.FileHandler", "filename": "logfile.txt", "mode": "w", "formatter": "f"}}}'"
Options
- infile
The name of the file that contains the text in a json format. If the filename ends with “.gz” it is assumed to be a compressed file, otherwise it is assumed to be plain text json.
- outfile
The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the annotated corpus in CoNNL-U format. If the filename ends with “.gz” the file will be compressed.
- config
Config expects a json-object with the following entries:
- annotator
the name of an annotator (see below)
- annotator_options
a json-object with options for the selected annotator
- exclude_sections
a list with section names that should be removed (e.g. References)
- max_sent_len
optional, default: 70; sentences longer than this are skipped
- --loging_config
Optionally the logging configuration can be set. logging_config expects a json object that represents a dict as used for logger configuration.
Annotators
- spacy
Uses spacy. It expects the following options:
- model_name
the name of a model that is used. It has to be available in the current environment (See the spacy documentaiton for more information on installing models).
Encode data
For the extraction of constructions, the constructions need to be encoded efficiently using Huffman coding. This is done in several steps.
Extract dictionary
Extract a dictionary of possible pattern elements to create an encoder.
cxnminer utils extract-vocabulary infile outfile config
cxnminer utils extract-vocabulary example_data/example_data.conllu example_data/example_data_dict.json example_data/example_config.json
Options
- infile
The name of the file that contains the annotated corpus in CoNLL-U format. If the filename ends with “.gz” it is assumed to be a compressed file.
- outfile
The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the vocabulary for the given levels in json format. If the filename ends with “.gz” the file will be compressed.
- config
The configuration for construction mining as described in Settings for the pattern extraction.
- --drop_frequencies
The list can contain the frequencies (needed to create a Huffman encoder) or they can optionally be dropped.
Filter dictionary
Remove items with a frequency below a given threshold from an extracted dictionary.
bin/filter_vocabulary dictionaries outfile min_frequency
bin/filter_vocabulary example_data/example_data_dict.json example_data/example_data_dict_filtered.json 2
Options
- dictionaries
The name of the file that contains the dictionary (including frequencies) extracted from the corpus. If the filename ends with “.gz” it is assumed to be a compressed file.
- outfile
The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the filtered vocabulary json format. If the filename ends with “.gz” the file will be compressed.
- min_frequency
Items with a lower frequency will be dropped from the dictionary.
Prepare encoder
Create and pickle an encoder based on an extracted dictionary.
cxnminer utils create-encoder dictionaries outfile config
cxnminer utils create-encoder example_data/example_data_dict_filtered.json example_data/example_data_encoder example_data/example_config.json
Options
- dictionaries
The name of the file that contains the dictionary extracted from the corpus. If the filename ends with “.gz” it is assumed to be a compressed file.
- outfile
The pickled encoder. If the filename ends with “.gz” the file will be compressed.
- config
The configuration for construction mining as described in Settings for the pattern extraction.
Encode dictionary
Encodes the complete dictionary - creating a dictionary that can be used to encode the corpus using lookup.
cxnminer utils encode-vocabulary vocabulary outfile encoder config
cxnminer utils encode-vocabulary example_data/example_data_dict_filtered.json example_data/example_data_dict_filtered_encoded.json example_data/example_data_encoder example_data/example_config.json
Options
- vocabulary
Either the name of the file that contains the dictionary extracted from the corpus. If the filename ends with “.gz” it is assumed to be a compressed file. Or a json-String containing the vocabulary directly.
- outfile
The name of the file that is created by the script (if it exists, it will be overwritten). It will contain a lookup table for the vocabulary and the encoded versions in json format. If the filename ends with “.gz” the file will be compressed.
- encoder
The pickled encoder.
- config
The configuration for construction mining as described in Settings for the pattern extraction.
- --no_frequencies
Add this flag if the dictionary does not contain frequencies.
- --loging_config
See above.
Encode corpus
To make the pattern extraction more efficient, the corpus can be pre-encoded. Uses an encoded dicitionary to efficiently encode the corpus.
cxnminer utils encode-corpus infile outfile dictionary config
cxnminer utils encode-corpus example_data/example_data.conllu example_data/example_data_encoded.conllu example_data/example_data_dict_filtered_encoded.json example_data/example_config.json
Options
- infile
The name of the file that contains the annotated corpus in CoNLL-U format. If the filename ends with “.gz” it is assumed to be a compressed file.
- outfile
The name of the file that is created by the script (if it exists, it will be overwritten). It will contain the annotated corpus in CoNNL-U format with encoded levels. If the filename ends with “.gz” the file will be compressed.
- dictionary
The encoded dictionary.
- config
The configuration for construction mining as described in Settings for the pattern extraction.
- --processes
Controls the number of processes to be used.
- --loging_config
See above.