srtk.preprocessing package

srtk.preprocessing.load_dataset module

  1. Load the dataset

This script provides an example of how to prepare the dataset. It filters the grounded samples, removing those that lack any answer entity or any question entity.

The example input contains the following fields:

  • id: the sample id

  • sent: the question

  • qc: the question entities

  • ac: the answer entities

Example usage:

  • For the Mintaka dataset, samples without any answer entity are removed.

$ python preprocess/load_dataset.py --dataset mintaka --ground-path data/preprocess/mintaka-ground-raw.jsonl --output-path data/preprocess/mintaka-ground.jsonl
Train + Validation + Test: Processed 12188 samples, skipped 7812 samples, total 20000 samples

  • For the MKQA dataset, samples without any question entity are removed.

$ python preprocess/load_dataset.py --dataset mkqa --ground-path data/preprocess/mkqa-ground-raw.jsonl --output-path data/preprocess/mkqa-ground.jsonl
Processed 2112 samples, skipped 7888 samples, total 10000 samples
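The filtering rule described above can be sketched as follows. This is only an illustration, not the script's actual code: the helper name filter_grounded, its two flags, and the placeholder entity ids (E1, E2, ...) are assumptions made for the example; only the field names id, sent, qc and ac come from the description.

```python
import json

def filter_grounded(samples, require_question=True, require_answer=True):
    """Keep samples whose required entity lists are non-empty (hypothetical helper)."""
    kept = []
    for sample in samples:
        if require_question and not sample.get("qc"):
            continue  # no question entity was grounded
        if require_answer and not sample.get("ac"):
            continue  # no answer entity was grounded
        kept.append(sample)
    return kept

# Three made-up JSONL lines; the entity ids are placeholders, not real KG ids.
raw_lines = [
    '{"id": "a", "sent": "who wrote the book", "qc": ["E1"], "ac": ["E2"]}',
    '{"id": "b", "sent": "who wrote it", "qc": [], "ac": ["E2"]}',
    '{"id": "c", "sent": "where is the city", "qc": ["E3"], "ac": []}',
]
samples = [json.loads(line) for line in raw_lines]
kept = filter_grounded(samples)
print(f"Processed {len(kept)} samples, skipped {len(samples) - len(kept)} samples, "
      f"total {len(samples)} samples")
```

Passing require_question=False or require_answer=False relaxes one of the two checks, which is one way the per-dataset behaviour listed above could be expressed.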

srtk.preprocessing.load_dataset.main(args)

srtk.preprocessing.merge_ground module

  1. This script merges the grounded data files into a single training file.

e.g. python preprocess/merge_ground.py --output-path data/preprocess/merged-ground.jsonl --ground-files data/preprocess/mintaka-ground.jsonl data/preprocess/mkqa-ground.jsonl
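A minimal sketch of such a merge, assuming it amounts to concatenating JSONL files line by line. The function merge_jsonl and the temporary file names are hypothetical, and the real script may do more (for example deduplication).

```python
import json
import tempfile
from pathlib import Path

def merge_jsonl(ground_files, output_path):
    """Concatenate JSONL files into one file, skipping blank lines (hypothetical helper)."""
    count = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in ground_files:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:
                        out.write(line + "\n")
                        count += 1
    return count

# Demonstrate on two small temporary files.
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp = Path(tmp_dir)
    first, second, merged = tmp / "first.jsonl", tmp / "second.jsonl", tmp / "merged.jsonl"
    first.write_text('{"id": "a"}\n{"id": "b"}\n', encoding="utf-8")
    second.write_text('{"id": "c"}\n', encoding="utf-8")
    total = merge_jsonl([first, second], merged)
    merged_ids = [json.loads(line)["id"]
                  for line in merged.read_text(encoding="utf-8").splitlines()]
```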

srtk.preprocessing.merge_ground.main(args)

srtk.preprocessing.search_path module

  1. Search Path

This corresponds to search_to_get_path.py in RUC’s code. It enumerates all paths from the question entities to the answer entities.

python preprocess/search_path.py --ground-path data/preprocess/merged-ground.jsonl --output-path data/preprocess/paths.jsonl --remove-sample-without-path

srtk.preprocessing.search_path.generate_paths(src_entities, dst_entities, kg: KnowledgeGraphBase, max_path=50)

Generate paths from question entities to answer entities.
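To illustrate what enumerating relation paths means, here is a sketch on a toy in-memory graph. It does not use KnowledgeGraphBase; the dictionary TOY_KG, the function enumerate_paths, the max_hops limit, and the reading of max_path as a cap on the number of returned paths are all assumptions made for this example.

```python
from collections import deque

# Toy graph: entity -> list of (relation, neighbour). All names are made up.
TOY_KG = {
    "E1": [("r_author", "E2"), ("r_genre", "E3")],
    "E2": [("r_birthplace", "E4")],
    "E3": [("r_origin", "E4")],
}

def enumerate_paths(src_entities, dst_entities, kg, max_hops=2, max_path=50):
    """Breadth-first search for relation paths from any source to any destination."""
    dst = set(dst_entities)
    found = []
    queue = deque((src, ()) for src in src_entities)
    while queue and len(found) < max_path:
        entity, relations = queue.popleft()
        if relations and entity in dst:
            found.append(list(relations))  # reached an answer entity
            continue
        if len(relations) >= max_hops:
            continue  # do not expand beyond the hop limit
        for relation, neighbour in kg.get(entity, []):
            queue.append((neighbour, relations + (relation,)))
    return found

toy_paths = enumerate_paths(["E1"], ["E4"], TOY_KG)
print(toy_paths)
```

Only the relation sequence is kept for each path, since the later steps score and sample relations rather than intermediate entities.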

srtk.preprocessing.search_path.has_type_relation(path)

A utility function to check whether the path contains certain relations.

srtk.preprocessing.search_path.main(args)

srtk.preprocessing.score_path module

  1. Score path

The score of a relation path is defined as the HIT rate of the prediction against the ground truth entities. The prediction refers to the entities retrieved by following the relation path from the question entities.

Personal notes: Why is this necessary? Isn’t the relation path already a path from the question entities to the ground truth entities? In my understanding, this is similar to TF-IDF: a path is more precise if following it yields a smaller set of entities that has a larger intersection with the ground truth entities.

e.g. python preprocess/score_path.py --paths-file data/preprocess/paths.jsonl --output-path data/preprocess/paths_scored.jsonl

srtk.preprocessing.score_path.main(args)
srtk.preprocessing.score_path.score_path(kg: KnowledgeGraphBase, src, path, answers, metric='jaccard')

Calculate the HIT score of a given path.

Parameters:
  • kg (KnowledgeGraphBase) – knowledge graph instance

  • src (str) – the source entity

  • path (list) – the path

  • answers (list) – the ground truth entities

  • metric (str) – how the paths are scored, either ‘jaccard’ or ‘recall’. Default: ‘jaccard’, per the original implementation.
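The difference between the two metrics can be shown with plain sets. This sketch assumes the standard definitions of Jaccard similarity and recall; the function hit_score and the entity ids are illustrative, and the step that retrieves the predicted entities from the knowledge graph is left out.

```python
def hit_score(predictions, answers, metric="jaccard"):
    """Score a set of predicted entities against the ground truth entities."""
    predictions, answers = set(predictions), set(answers)
    hits = len(predictions & answers)
    if metric == "jaccard":
        union = len(predictions | answers)
        return hits / union if union else 0.0
    if metric == "recall":
        return hits / len(answers) if answers else 0.0
    raise ValueError(f"unknown metric: {metric!r}")

answers = ["E4"]
narrow = ["E4", "E5"]              # a path that reaches few entities
broad = ["E4", "E5", "E6", "E7"]   # a path that reaches many entities

print(hit_score(narrow, answers), hit_score(broad, answers))
print(hit_score(narrow, answers, "recall"), hit_score(broad, answers, "recall"))
```

Under Jaccard, the narrow path scores 0.5 and the broad one 0.25, so paths that reach fewer spurious entities are preferred; under recall both score 1.0 because both reach the answer.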

srtk.preprocessing.negative_sampling module

  1. Negative Sampling

Regarding the negative sampling method, the author states in the paper:

> We replace the observed relation at each time step with other sampled relations as the negative instances to optimize the probability of the observed ones.

e.g. python preprocess/negative_sampling.py --scored-path-file data/preprocess/paths_scored.jsonl --output-file data/preprocess/train.jsonl --positive-threshold 0.3

srtk.preprocessing.negative_sampling.convert_records_relation_id_to_lable(records, kg)

Convert relation ids to relation labels in each record.

srtk.preprocessing.negative_sampling.create_jsonl_dataset(records)

It combines the question and prev_path into a query. Each train sample is a dict with the following fields:

  • query (str): question + prev_path

  • positive (str): the next relation of the prev_path is regarded as the positive relation

  • negatives (list): a list of negative relations

Parameters:

records (list[dict]) – list of records

Returns:

list of train samples

Return type:

list[dict]
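A rough sketch of this conversion, assuming each input record carries the fields question, prev_path, positive_relation and negative_relations. The function to_train_samples and the " [SEP] " separator are assumptions; the separator the library actually uses may differ.

```python
def to_train_samples(records, sep=" [SEP] "):
    """Turn sampled records into query / positive / negatives dicts (hypothetical helper)."""
    train_samples = []
    for record in records:
        # The query is the question followed by the relations walked so far.
        query = sep.join([record["question"], *record["prev_path"]])
        train_samples.append({
            "query": query,
            "positive": record["positive_relation"],
            "negatives": list(record["negative_relations"]),
        })
    return train_samples

example_records = [{
    "question": "who wrote the book",
    "prev_path": ["r_author"],
    "positive_relation": "r_birthplace",
    "negative_relations": ["r_genre", "r_publisher"],
}]
train_samples = to_train_samples(example_records)
```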

srtk.preprocessing.negative_sampling.get_positive_connections_along_paths(paths)

Collect positive connections along paths. A positive connection is defined as {prev_relations: next_relation}. END_REL is added to the end of each path.

Returns:

a dictionary of positive connections

Return type:

dict
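One way to picture the returned dictionary is sketched below. It assumes keys are tuples of the previous relations and values are sets of relations observed next; the function name positive_connections and the placeholder string used for END_REL are assumptions, not the library's actual values.

```python
from collections import defaultdict

END_REL = "END_REL"  # placeholder; the library defines its own end-of-path marker

def positive_connections(relation_paths):
    """Map each path prefix to the relations that follow it in some path."""
    connections = defaultdict(set)
    for path in relation_paths:
        extended = list(path) + [END_REL]  # mark where the path stops
        for i, next_relation in enumerate(extended):
            connections[tuple(extended[:i])].add(next_relation)
    return dict(connections)

example_paths = [["r_author", "r_birthplace"], ["r_genre"]]
conns = positive_connections(example_paths)
```

Tuples are used as keys because lists are not hashable, and sets are used as values because several paths can share a prefix and continue with different relations.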

srtk.preprocessing.negative_sampling.is_candidate_space_too_large(path, question_entities, kg: KnowledgeGraphBase, candidate_depth_multiplier=5)

Check whether the number of the candidate entities along the path is too large.

Parameters:
  • path (list[str]) – path from source entity to destination entity

  • question_entities (list[str]) – list of question entities

  • kg (KnowledgeGraphBase) – a knowledge graph instance

  • candidate_depth_multiplier (int, optional) – a multiplier to control the number of candidate entities at each depth. Defaults to 5.

srtk.preprocessing.negative_sampling.main(args)
srtk.preprocessing.negative_sampling.sample_negative_relations(soruce_entities, prev_path, positive_connections, num_negative, kg: KnowledgeGraphBase)

A helper function to sample negative relations.

Parameters:
  • soruce_entities (list[str]) – list of source entities

  • prev_path (list[str]) – previous path / relations

  • positive_connections (dict) – a dictionary of positive connections

  • num_negative (int) – number of negative relations to sample

  • kg (KnowledgeGraphBase) – a knowledge graph instance

Returns:

list of negative relations

Return type:

list[str]
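The core idea can be sketched without a knowledge graph. The sketch assumes the candidate relations reachable after prev_path have already been collected and that negatives are drawn uniformly from the candidates that are not positive; the function sample_negatives and its arguments are hypothetical and may not match how the library chooses negatives.

```python
import random

def sample_negatives(candidate_relations, positives, num_negative, rng):
    """Sample up to num_negative relations that are not positive continuations."""
    pool = sorted(set(candidate_relations) - set(positives))
    if len(pool) <= num_negative:
        return pool  # not enough candidates, return all of them
    return rng.sample(pool, num_negative)

rng = random.Random(0)  # seeded so that runs are repeatable
candidates = ["r_author", "r_genre", "r_publisher", "r_language"]
negatives = sample_negatives(candidates, positives={"r_author"}, num_negative=2, rng=rng)
```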

srtk.preprocessing.negative_sampling.sample_records_from_path(path, question, question_entities, positive_connections, kg: KnowledgeGraphBase, num_negative)

Sample training records from a path.

Returns:

list of training records, each record has the following fields:
  • question (str): the question

  • prev_path (list): previous relations up to a relation (positive_relation) in the path

  • positive_relation (str): the next relation of the prev_path is regarded as the positive relation

  • negative_relations (list): a list of negative relations, the number is specified by num_negative

Return type:

list[dict]