srtk.preprocessing package

srtk.preprocessing.load_dataset module

1. Load the dataset

This script provides an example of how to prepare the dataset. It filters the grounded samples, removing those without any answer or question entity.

The example input contains the following fields:

- id: the sample id
- sent: the question
- qc: the question entities
- ac: the answer entities

Example usage:

For the Mintaka dataset, samples without any answer entity are removed.

$ python preprocess/load_dataset.py --dataset mintaka --ground-path data/preprocess/mintaka-ground-raw.jsonl --output-path data/preprocess/mintaka-ground.jsonl
Train + Validation + Test: Processed 12188 samples, skipped 7812 samples, total 20000 samples

For the MKQA dataset, samples without any question entity are removed.

$ python preprocess/load_dataset.py --dataset mkqa --ground-path data/preprocess/mkqa-ground-raw.jsonl --output-path data/preprocess/mkqa-ground.jsonl
Processed 2112 samples, skipped 7888 samples, total 10000 samples
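The filtering step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `filter_grounded`, its two flags, and the sample records (including the placeholder entity ids) are assumptions made for the example. Only the field names id, sent, qc and ac come from the description above.

```python
def filter_grounded(samples, require_question=True, require_answer=True):
    """Keep samples that have the required entity lists; count the rest as skipped.

    Hypothetical helper: the real script may apply different rules per dataset.
    """
    kept, skipped = [], 0
    for sample in samples:
        missing_question = require_question and not sample.get("qc")
        missing_answer = require_answer and not sample.get("ac")
        if missing_question or missing_answer:
            skipped += 1
            continue
        kept.append(sample)
    return kept, skipped


# Made-up records with placeholder entity ids.
samples = [
    {"id": "a", "sent": "question one", "qc": ["Q1"], "ac": ["Q2"]},
    {"id": "b", "sent": "question two", "qc": ["Q3"], "ac": []},
    {"id": "c", "sent": "question three", "qc": [], "ac": ["Q4"]},
]

kept, skipped = filter_grounded(samples)
print(f"Processed {len(kept)} samples, skipped {skipped} samples, total {len(samples)} samples")
```

Passing `require_question=False` would mimic the Mintaka behaviour described above, where only samples lacking answer entities are dropped.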

srtk.preprocessing.load_dataset.main(args)

srtk.preprocessing.merge_ground module

This script merges the grounded data files into a single training file.

e.g. python preprocess/merge_ground.py --output-path data/preprocess/merged-ground.jsonl --ground-files data/preprocess/mintaka-ground.jsonl data/preprocess/mkqa-ground.jsonl
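Conceptually, merging grounded files amounts to concatenating JSONL records. The sketch below assumes a plain line-by-line concatenation; `merge_jsonl` is a hypothetical helper, and the real script may additionally deduplicate or rewrite ids.

```python
import json
import tempfile
from pathlib import Path


def merge_jsonl(ground_files, output_path):
    """Concatenate JSONL files in order; return the number of records written."""
    written = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for ground_file in ground_files:
            with open(ground_file, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue  # ignore blank lines
                    record = json.loads(line)  # fail early on malformed JSON
                    out.write(json.dumps(record) + "\n")
                    written += 1
    return written


# Demonstration on two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "first.jsonl"
    second = Path(tmp) / "second.jsonl"
    merged_path = Path(tmp) / "merged.jsonl"
    first.write_text('{"id": "a"}\n{"id": "b"}\n', encoding="utf-8")
    second.write_text('{"id": "c"}\n', encoding="utf-8")
    total = merge_jsonl([first, second], merged_path)
    merged_ids = [
        json.loads(line)["id"]
        for line in merged_path.read_text(encoding="utf-8").splitlines()
    ]
```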

srtk.preprocessing.merge_ground.main(args)

srtk.preprocessing.search_path module

Search Path

This corresponds to search_to_get_path.py in RUC’s code. It enumerates all paths from the question entities to the answer entities.

python preprocess/search_path.py --ground-path data/preprocess/merged-ground.jsonl --output-path data/preprocess/paths.jsonl --remove-sample-without-path

srtk.preprocessing.search_path.generate_paths(src_entities, dst_entities, kg: KnowledgeGraphBase, max_path=50)

Generate paths from question entities to answer entities.
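To illustrate what enumerating relation paths between entities means, here is a minimal breadth-first sketch over a toy in-memory graph. It does not use KnowledgeGraphBase. The dictionary layout, the `max_hops` limit, and the function name `enumerate_paths` are assumptions for the example; only `max_path` mirrors a parameter from the signature above.

```python
from collections import deque

# Toy graph: entity -> list of (relation, neighbour). All names are made up.
TOY_KG = {
    "e_q": [("r_a", "e_mid"), ("r_b", "e_ans")],
    "e_mid": [("r_c", "e_ans")],
    "e_ans": [],
}


def enumerate_paths(src_entities, dst_entities, kg, max_hops=2, max_path=50):
    """Collect relation paths that lead from any source to any destination."""
    dst = set(dst_entities)
    paths = []
    for src in src_entities:
        queue = deque([(src, [])])
        while queue and len(paths) < max_path:
            entity, relations = queue.popleft()
            if relations and entity in dst:
                paths.append(relations)  # reached an answer entity
                continue
            if len(relations) < max_hops:
                for relation, neighbour in kg.get(entity, []):
                    queue.append((neighbour, relations + [relation]))
    return paths


paths = enumerate_paths(["e_q"], ["e_ans"], TOY_KG)
print(paths)  # [['r_b'], ['r_a', 'r_c']]
```

The hop limit keeps the search finite on graphs with cycles; the actual depth bound used by the module is not stated here.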

srtk.preprocessing.search_path.has_type_relation(path)

A utility function to check whether the path contains certain relations.

srtk.preprocessing.search_path.main(args)

srtk.preprocessing.score_path module

Score path

The score of a relation path is defined as the HIT rate of the prediction with the ground truth entities. The prediction refers to the search results from the question entities following the relation path.

Personal notes: Why is this necessary? Isn’t the relation path already a path from the question entities to the ground truth entities? In my understanding, this is similar to TF-IDF: a path is more precise if its results form a smaller set of entities that has a larger intersection with the ground truth entities.

e.g. python preprocess/score_path.py --paths-file data/preprocess/paths.jsonl --output-path data/preprocess/paths_scored.jsonl

srtk.preprocessing.score_path.main(args)

srtk.preprocessing.score_path.score_path(kg: KnowledgeGraphBase, src, path, answers, metric='jaccard')

Calculate the HIT score of a given path.

Parameters:

kg (KnowledgeGraphBase) – knowledge graph instance
src (str) – the source entity
path (list) – the path
answers (list) – the ground truth entities
metric (str) – how the paths are scored: ‘jaccard’ or ‘recall’. Defaults to ‘jaccard’, per the original implementation.
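The two metrics can be made concrete with a small sketch. It follows a relation path from the source entity over a toy graph and compares the reached entities with the answers. The graph layout and the names `follow_path` and `score` are assumptions for illustration; the formulas are the standard definitions of Jaccard similarity and recall, which may differ in detail from the module's implementation.

```python
# Toy graph: (entity, relation) -> neighbours. All names are made up.
TOY_KG = {
    ("q", "r_broad"): ["a1", "a2", "x1", "x2"],
    ("q", "r_tight"): ["a1", "a2"],
}


def follow_path(kg, src, path):
    """Return the set of entities reached from src by following the relations in order."""
    frontier = {src}
    for relation in path:
        frontier = {n for e in frontier for n in kg.get((e, relation), [])}
    return frontier


def score(kg, src, path, answers, metric="jaccard"):
    """Score a path by the overlap between its predictions and the answers."""
    predictions = follow_path(kg, src, path)
    answers = set(answers)
    hits = len(predictions & answers)
    if metric == "jaccard":
        union = len(predictions | answers)
        return hits / union if union else 0.0
    if metric == "recall":
        return hits / len(answers) if answers else 0.0
    raise ValueError(f"unknown metric: {metric}")


answers = ["a1", "a2"]
print(score(TOY_KG, "q", ["r_broad"], answers))            # 0.5
print(score(TOY_KG, "q", ["r_tight"], answers))            # 1.0
print(score(TOY_KG, "q", ["r_broad"], answers, "recall"))  # 1.0
```

Both paths reach every answer, so recall cannot tell them apart, while Jaccard penalises the broad path for the two extra entities it returns.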

srtk.preprocessing.negative_sampling module

Negative Sampling

Regarding the negative sampling method, the author states in the paper:

> We replace the observed relation at each time step with other sampled relations as the negative instances to optimize the probability of the observed ones.

e.g. python preprocess/negative_sampling.py --scored-path-file data/preprocess/paths_scored.jsonl --output-file data/preprocess/train.jsonl --positive-threshold 0.3

srtk.preprocessing.negative_sampling.convert_records_relation_id_to_lable(records, kg)

Convert relation ids to relation labels in each record.

srtk.preprocessing.negative_sampling.create_jsonl_dataset(records)

It combines the question and prev_path into the query. Each train sample is a dict with the following fields:

- query (str): question + prev_path
- positive (str): the next relation of the prev_path is regarded as the positive relation
- negatives (list): a list of negative relations

Parameters:

records (list[dict]) – list of records

Returns:

list of train samples

Return type:

list[dict]
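A minimal sketch of this conversion is shown below. The separator string, the function name `to_train_samples`, and the example record are assumptions; how the module actually joins the question with the previous relations is not specified here. The input field names follow the record format documented for sample_records_from_path.

```python
import json


def to_train_samples(records, sep=" [SEP] "):
    """Turn sampled records into query/positive/negatives dicts.

    The separator is a placeholder; the real joining format may differ.
    """
    samples = []
    for record in records:
        query = sep.join([record["question"], *record["prev_path"]])
        samples.append(
            {
                "query": query,
                "positive": record["positive_relation"],
                "negatives": list(record["negative_relations"]),
            }
        )
    return samples


# Made-up records for illustration.
records = [
    {
        "question": "example question",
        "prev_path": [],
        "positive_relation": "r_a",
        "negative_relations": ["r_x", "r_y"],
    },
    {
        "question": "example question",
        "prev_path": ["r_a"],
        "positive_relation": "r_c",
        "negative_relations": ["r_z"],
    },
]

train_samples = to_train_samples(records)
for sample in train_samples:
    print(json.dumps(sample))  # one JSON object per line, as in a .jsonl file
```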

srtk.preprocessing.negative_sampling.get_positive_connections_along_paths(paths)

Collect positive connections along paths. A positive connection is defined as {prev_relations: next_relation}. END_REL is added to the end of each path.

Returns:

a dictionary of positive connections

Return type:

dict
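The idea of positive connections can be sketched as follows. Because several paths may share a prefix, this example maps each prefix (as a tuple) to a set of next relations; that container choice, the function name `positive_connections`, and the literal value of the end marker are assumptions rather than the module's actual representation.

```python
from collections import defaultdict

END_REL = "END_REL"  # placeholder value for the end-of-path marker


def positive_connections(paths):
    """Map each relation prefix to the relations that follow it in some path."""
    connections = defaultdict(set)
    for path in paths:
        full_path = list(path) + [END_REL]
        for i, next_relation in enumerate(full_path):
            connections[tuple(full_path[:i])].add(next_relation)
    return dict(connections)


connections = positive_connections([["r_a", "r_c"], ["r_b"]])
for prefix, next_relations in connections.items():
    print(prefix, sorted(next_relations))
```

With the two paths above, the empty prefix maps to both r_a and r_b, and each complete path maps to the end marker, which lets a model learn when to stop expanding.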

srtk.preprocessing.negative_sampling.is_candidate_space_too_large(path, question_entities, kg: KnowledgeGraphBase, candidate_depth_multiplier=5)

Check whether the number of candidate entities along the path is too large.

Parameters:

path (list[str]) – path from source entity to destination entity
question_entities (list[str]) – list of question entities
kg (KnowledgeGraphBase) – a knowledge graph instance
candidate_depth_multiplier (int, optional) – a multiplier to control the number of candidate entities at each depth. Defaults to 5.

srtk.preprocessing.negative_sampling.main(args)

srtk.preprocessing.negative_sampling.sample_negative_relations(soruce_entities, prev_path, positive_connections, num_negative, kg: KnowledgeGraphBase)

A helper function to sample negative relations.

Parameters:

soruce_entities (list[str]) – list of source entities
prev_path (list[str]) – previous path / relations
positive_connections (dict) – a dictionary of positive connections
num_negative (int) – number of negative relations to sample
kg (KnowledgeGraphBase) – a knowledge graph instance

Returns:

list of negative relations

Return type:

list[str]
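One plausible way to sample negatives is sketched below: follow prev_path from the source entities, gather the relations leaving the reached entities, remove the known positives for that prefix, and draw a random subset. This is an assumption about the approach, not the module's code. The toy graph layout, the function name `sample_negatives`, and the `rng` argument are hypothetical.

```python
import random

# Toy graph: entity -> list of (relation, neighbour). All names are made up.
TOY_KG = {
    "e_q": [("r_pos", "e1"), ("r_neg1", "e2"), ("r_neg2", "e3")],
}


def sample_negatives(source_entities, prev_path, positive_connections,
                     num_negative, kg, rng=None):
    """Sample up to num_negative relations that are not positive for this prefix."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the example reproducible
    # Walk the previous relations to find the entities reached so far.
    frontier = set(source_entities)
    for relation in prev_path:
        frontier = {n for e in frontier for r, n in kg.get(e, []) if r == relation}
    # Every outgoing relation is a candidate, except the known positives.
    candidates = {r for e in frontier for r, _ in kg.get(e, [])}
    candidates -= set(positive_connections.get(tuple(prev_path), ()))
    candidates = sorted(candidates)
    return rng.sample(candidates, min(num_negative, len(candidates)))


positives = {(): {"r_pos"}}
negatives = sample_negatives(["e_q"], [], positives, 5, TOY_KG)
print(sorted(negatives))  # ['r_neg1', 'r_neg2']
```

Fewer negatives than requested are returned when the neighbourhood does not offer enough candidate relations.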

srtk.preprocessing.negative_sampling.sample_records_from_path(path, question, question_entities, positive_connections, kg: KnowledgeGraphBase, num_negative)

Sample training records from a path.

Returns:

list of training records, each record has the following fields:

question (str): the question
prev_path (list): previous relations up to a relation (positive_relation) in the path
positive_relation (str): the next relation of the prev_path is regarded as the positive relation
negative_relations (list): a list of negative relations, the number is specified by num_negative

Return type:

list[dict]
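The record layout above can be illustrated with a small sketch that emits one record per step of a path. To stay self-contained it draws negatives from a fixed pool of relations instead of a knowledge graph. The function name `records_from_path`, the `relation_pool` argument, the end marker value, and the choice to emit a final record for the end marker are all assumptions.

```python
import random

END_REL = "END_REL"  # placeholder value for the end-of-path marker


def records_from_path(path, question, relation_pool, num_negative, rng=None):
    """Emit one training record per step, including a final end-of-path step."""
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the example reproducible
    full_path = list(path) + [END_REL]
    records = []
    for i, positive in enumerate(full_path):
        pool = sorted(set(relation_pool) - {positive})
        records.append(
            {
                "question": question,
                "prev_path": full_path[:i],
                "positive_relation": positive,
                "negative_relations": rng.sample(pool, min(num_negative, len(pool))),
            }
        )
    return records


records = records_from_path(
    ["r_a", "r_c"], "example question", ["r_a", "r_b", "r_c", "r_d"], num_negative=2
)
for record in records:
    print(record["prev_path"], "->", record["positive_relation"])
```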