KitanaQA Trainer¶

Adversarial Squad Processor Module¶

class kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor¶

Bases: transformers.data.processors.utils.DataProcessor

Processor for the SQuAD dataset. Overridden by AlumSquadV1Processor and AlumSquadV2Processor, which are used for versions 1.1 and 2.0 of SQuAD, respectively.

alum_get_dev_examples(data_dir, filename=None)¶

Returns the training examples from the data directory.

Parameters

data_dir – Directory containing the data files used for training and evaluating.
filename – None by default. Specify this if the training file has a different name than the original one, which is train-v1.1.json for SQuAD version 1.1 and train-v2.0.json for version 2.0.

dev_file = None¶

train_file = None¶

class kitanaqa.trainer.alum_squad_processor.AlumSquadV1Processor¶

Bases: kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor

dev_file = 'dev-v1.1.json'¶

train_file = 'train-v1.1.json'¶

class kitanaqa.trainer.alum_squad_processor.AlumSquadV2Processor¶

Bases: kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor

dev_file = 'dev-v2.0.json'¶

train_file = 'train-v2.0.json'¶
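The versioned processors differ from the base class only in the dev_file and train_file attributes they override. The pattern can be sketched with stand-in classes. BaseProcessor, V1Processor, V2Processor, and resolve_dev_path are hypothetical names used for illustration and are not part of kitanaqa. The sketch also assumes that an explicit filename takes precedence over the class default.

```python
import os


class BaseProcessor:
    """Stand-in base class; a real processor would also parse the JSON file."""

    train_file = None
    dev_file = None

    def resolve_dev_path(self, data_dir, filename=None):
        # Assumption: an explicit filename wins over the class-level default.
        name = filename if filename is not None else self.dev_file
        if name is None:
            raise ValueError("Use a versioned subclass or pass a filename.")
        return os.path.join(data_dir or "", name)


class V1Processor(BaseProcessor):
    train_file = "train-v1.1.json"
    dev_file = "dev-v1.1.json"


class V2Processor(BaseProcessor):
    train_file = "train-v2.0.json"
    dev_file = "dev-v2.0.json"


v2_path = V2Processor().resolve_dev_path("data")
custom_path = V1Processor().resolve_dev_path("data", filename="my-dev.json")
```

Keeping the file names as class attributes means a new dataset version only needs a two-line subclass.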

kitanaqa.trainer.alum_squad_processor.alum_squad_convert_examples_to_features(examples, tokenizer, max_seq_length, doc_stride, max_query_length, padding_strategy='max_length', return_dataset=False, threads=1, tqdm_enabled=True)¶

Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependent and takes advantage of many of the tokenizer’s features to create the model’s inputs.

Parameters

examples – list of SquadExample
tokenizer – an instance of a child of PreTrainedTokenizer
max_seq_length – The maximum sequence length of the inputs.
doc_stride – The stride used when the context is too large and is split across several features.
max_query_length – The maximum length of the query.
padding_strategy – The padding strategy to use. Defaults to “max_length”.
return_dataset – Defaults to False. If set to ‘pt’, returns a torch.utils.data.TensorDataset.
threads – Number of threads to use for processing.

Returns

list of SquadFeatures

Example:

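# data_dir, tokenizer, and args are assumed to be defined by the caller.
from kitanaqa.trainer.alum_squad_processor import (
    AlumSquadV2Processor,
    alum_squad_convert_examples_to_features,
)

processor = AlumSquadV2Processor()
examples = processor.alum_get_dev_examples(data_dir)

features = alum_squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=args.max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=args.max_query_length,
)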
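The role of doc_stride can be illustrated with a simplified sketch of how a long context is split across several features. This is not the library's implementation: split_with_stride is a hypothetical helper, and it ignores the query tokens, special tokens, and padding that the real converter handles through the tokenizer. It assumes that each window starts doc_stride tokens after the previous one.

```python
def split_with_stride(tokens, max_len, doc_stride):
    """Split tokens into windows of at most max_len, advancing by doc_stride."""
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the context
        start += min(max_len, doc_stride)
    return windows


context = list(range(10))  # ten token ids standing in for a tokenized context
windows = split_with_stride(context, max_len=6, doc_stride=4)
# windows == [[0, 1, 2, 3, 4, 5], [4, 5, 6, 7, 8, 9]]
```

Consecutive windows overlap whenever doc_stride is smaller than the window length, so an answer span near a window boundary still appears whole in at least one feature.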

Arguments Module¶

class kitanaqa.trainer.arguments.ModelArguments(model_name_or_path: str, train_file_path: str, predict_file_path: Dict[str, str], model_type: str, aug_file_path: Optional[str] = None, tokenizer_name_or_path: Optional[str] = None, cache_dir: Optional[str] = None, do_lower_case: Optional[str] = True, version_2_with_negative: bool = False, null_score_diff_threshold: float = 0.0, n_best_size: int = 20, verbose_logging: bool = False, overwrite_cache: bool = False, max_seq_length: Optional[int] = 512, max_query_length: Optional[int] = 64, max_answer_length: Optional[int] = 30, doc_stride: Optional[int] = 128, freeze_embeds: bool = False, do_aug: bool = True, do_adv_eval: bool = False, do_alum: bool = True, eta: Optional[float] = 0.001, alpha: Optional[float] = 1, alpha_final: Optional[float] = None, alpha_schedule: Optional[str] = None, eps: Optional[float] = 1e-05, sigma: Optional[float] = 1e-05, K: Optional[float] = 1, data_dir: Optional[str] = None, eval_all_checkpoints: Optional[str] = False)¶

Bases: object

Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.

K: Optional[float] = 1¶

alpha: Optional[float] = 1¶

alpha_final: Optional[float] = None¶

alpha_schedule: Optional[str] = None¶

aug_file_path: str = None¶

cache_dir: Optional[str] = None¶

data_dir: Optional[str] = None¶

do_adv_eval: bool = False¶

do_alum: bool = True¶

do_aug: bool = True¶

do_lower_case: Optional[str] = True¶

doc_stride: Optional[int] = 128¶

eps: Optional[float] = 1e-05¶

eta: Optional[float] = 0.001¶

eval_all_checkpoints: Optional[str] = False¶

freeze_embeds: bool = False¶

max_answer_length: Optional[int] = 30¶

max_query_length: Optional[int] = 64¶

max_seq_length: Optional[int] = 512¶

model_name_or_path: str¶

model_type: str¶

n_best_size: int = 20¶

null_score_diff_threshold: float = 0.0¶

overwrite_cache: bool = False¶

predict_file_path: Dict[str, str]¶

sigma: Optional[float] = 1e-05¶

tokenizer_name_or_path: Optional[str] = None¶

train_file_path: str¶

verbose_logging: bool = False¶

version_2_with_negative: bool = False¶

kitanaqa.trainer.arguments.default_logdir() → str¶: Same default as PyTorch

Custom Schedulers Module¶

kitanaqa.trainer.custom_schedulers.custom_scheduler(max_steps: int, update_fn: Callable[[int], float]) → float¶: Create a custom generator for an input param

kitanaqa.trainer.custom_schedulers.get_custom_exp(max_steps: int, start_val: float, end_val: float) → Iterable¶: Create a custom exponential scheduler

kitanaqa.trainer.custom_schedulers.get_custom_linear(max_steps: int, start_val: float, end_val: float) → Iterable¶: Create a custom linear scheduler
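The scheduler signatures above suggest a generator driven by a per-step update function. The following is a minimal sketch of that idea, not the library's code: schedule, linear_schedule, and exp_schedule are hypothetical names, and the interpolation formulas (endpoints included, geometric interpolation for the exponential case) are assumptions, since the source does not show them.

```python
from typing import Callable, Iterator


def schedule(max_steps: int, update_fn: Callable[[int], float]) -> Iterator[float]:
    """Yield update_fn(step) for step = 0 .. max_steps - 1."""
    for step in range(max_steps):
        yield update_fn(step)


def linear_schedule(max_steps: int, start_val: float, end_val: float) -> Iterator[float]:
    """Interpolate linearly from start_val to end_val over max_steps values."""
    span = max(max_steps - 1, 1)  # avoid dividing by zero for a single step
    return schedule(max_steps, lambda s: start_val + (end_val - start_val) * s / span)


def exp_schedule(max_steps: int, start_val: float, end_val: float) -> Iterator[float]:
    """Interpolate geometrically; both endpoints must be positive."""
    if start_val <= 0 or end_val <= 0:
        raise ValueError("exponential interpolation needs positive endpoints")
    span = max(max_steps - 1, 1)
    ratio = end_val / start_val
    return schedule(max_steps, lambda s: start_val * ratio ** (s / span))


linear_values = list(linear_schedule(5, 0.0, 1.0))
exp_values = list(exp_schedule(3, 1.0, 100.0))
```

A schedule of this kind could, for example, move a weight such as alpha toward alpha_final during training, although how the Trainer consumes the schedulers is not stated here.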

Train Module¶

class kitanaqa.trainer.train.Trainer(model_args=None, **kwargs)¶

Bases: transformers.trainer.Trainer

A class to provide the adversarial and augmented training and evaluation …

model_args¶

The arguments to tweak for training. Will default to a basic instance in arguments.py if not provided. For a list of the model_args, refer to arguments.py.

Type: dataclass, optional

log(logs, iterator=None)¶: Modified from the HF Trainer base class. Logs values on the various objects watching training.

training_step(model, batch)¶: Performs one step of training (might be adversarial and/or augmented) and returns the loss.

evaluate(prefix, args, tokenizer, dataset, examples, features)¶: Performs the evaluation on the dataset and returns the evaluation metrics (Exact Match (EM) and F1-score).

adv_evaluate(prefix: str, args, tokenizer, dataset, examples, features) → torch.Tensor¶

Performs PGD attack on each example in the evaluation dataset, recording aggregate metrics

Parameters

prefix (str) – The model to be used for training
args –
tokenizer – The tokenizer used to preprocess the data.
dataset (List(torch.utils.data.TensorDataset)) – The evaluation dataset
examples (List(torch.utils.data.TensorDataset)) – The examples in the evaluation dataset
features (List(torch.utils.data.TensorDataset)) – SQuAD-like features corresponding to the evaluation dataset

Returns

The evaluation metrics (Exact Match (EM) and F1-score)

Return type

torch.Tensor
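As an illustration of the attack named above, here is a toy sketch of projected gradient descent (PGD) on a linear model with a squared-error loss. It is not the Trainer's implementation, which would perturb the inputs of a Transformer and use automatic differentiation. All function names are hypothetical, and treating eps as the perturbation bound, eta as the step size, and steps as the iteration count is an assumption about how such hyperparameters are typically used.

```python
def sign(v):
    return (v > 0) - (v < 0)


def loss(w, x, y):
    """Squared error of a linear model; stands in for the QA loss."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2


def loss_grad_wrt_input(w, x, y):
    """Analytic gradient of the squared error with respect to the input x."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [2.0 * (pred - y) * wi for wi in w]


def pgd_attack(w, x, y, eps, eta, steps):
    """Search for a perturbation with max |delta_i| <= eps that raises the loss."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = loss_grad_wrt_input(w, x_adv, y)
        # Gradient-sign ascent step, then projection back onto the eps-ball.
        delta = [di + eta * sign(gi) for di, gi in zip(delta, grad)]
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


w, x, y = [1.0, -2.0], [0.5, 0.25], 1.0
delta = pgd_attack(w, x, y, eps=0.1, eta=0.05, steps=3)
clean_loss = loss(w, x, y)
adv_loss = loss(w, [xi + di for xi, di in zip(x, delta)], y)
```

Comparing clean_loss with adv_loss per example, and then aggregating, mirrors the idea of recording aggregate metrics under attack.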

evaluate(prefix: str, args, tokenizer, dataset, examples, features) → torch.Tensor¶

Performs evaluation on the dataset

Parameters

prefix (str) – The model to be used for training
args –
tokenizer – The tokenizer used to preprocess the data.
dataset (List(torch.utils.data.TensorDataset)) – The evaluation dataset
examples (List(torch.utils.data.TensorDataset)) – The examples in the evaluation dataset
features (List(torch.utils.data.TensorDataset)) – SQuAD-like features corresponding to the evaluation dataset

Returns

The evaluation metrics (Exact Match (EM) and F1-score)

Return type

torch.Tensor
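For readers unfamiliar with the two metrics, the sketch below computes Exact Match and token-level F1 for a single prediction, following the normalization conventions commonly used for SQuAD (lowercasing, dropping punctuation and English articles). Whether the Trainer computes them in exactly this way is an assumption, and the function names are illustrative.

```python
import collections
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, squeeze whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    if not pred_tokens or not truth_tokens:
        # Two empty answers count as a match (the no-answer case in SQuAD v2.0).
        return float(pred_tokens == truth_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Eiffel Tower", "eiffel tower")
f1 = f1_score("in Paris, France", "Paris")
```

Dataset-level scores are then typically the averages of these per-example values.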

log(logs: Dict[str, float], iterator: Optional[tqdm.std.tqdm] = None) → None¶

Modified from HF Trainer base class

Log logs on the various objects watching training.

Subclass and override this method to inject custom behavior.

Parameters

logs (Dict[str, float]) – The values to log.
iterator (tqdm, optional) – A potential tqdm progress bar to write the logs on.

setup_comet()¶

training_step(model: torch.nn.modules.module.Module, batch: List) → torch.Tensor¶

Performs one step of training (might be adversarial and/or augmented)

Parameters

model (nn.Module) – The model to be used for training
batch (List) – The batch used for one step of training. Includes the input_ids, attention_masks, token_type_ids, start_positions, and end_positions

Returns

The training loss after one step of training

Return type

torch.Tensor

kitanaqa.trainer.train.tensor_to_list(tensor)¶: Convert a Tensor to List

Utils module¶

kitanaqa.trainer.utils.build_flow(args, label: str = 'default', model=None, tokenizer=None, train_dataset=None) → prefect.core.flow.Flow¶

Constructs a Prefect flow composed of sequential modeling steps

Parameters

label (Optional(str)) – The unique tag used to identify the Flow instance. The default value is ‘default’
model (Optional(transformers.PreTrainedModel)) – The pre-trained Transformer model. This parameter is required for training. The default value is None.
tokenizer (Optional(transformers.PreTrainedTokenizer)) – The tokenizer used to preprocess the data for the model.
train_dataset (torch.utils.data.TensorDataset) – The training dataset. This parameter is required for training. The default value is None.

Returns

A Prefect Flow object containing the specified steps and parameters

Return type

Flow object

kitanaqa.trainer.utils.load_and_cache_examples(args, tokenizer, evaluate=False, use_aug_path=False, output_examples=False) → torch.utils.data.dataset.TensorDataset¶

Loads SQuAD-like data features from dataset file (or cache)

Parameters

args (kitanaqa.trainer.arguments.ModelArguments) – A set of arguments related to the model. Specifically, the following arguments are used in this function:
- args.train_file_path : str
  Path to the training data file
- args.do_aug : bool
  Flag to specify whether to use the augmented training set. If True, will be merged with the original training set specified in train_file_path. The default value is True.
- args.aug_file_path : str
  Path for augmented train dataset
- args.data_dir : str
  Path for data files
- args.model_name_or_path : str
  Path to pretrained model or model identifier from huggingface.co/models
- args.max_seq_length : Optional[int]
  Max length for the input tokens, specified to the Transformer model defined in model_name_or_path
- args.overwrite_cache : bool
  Overwrite cached data on load
- args.predict_file_path : Dict[str, str]
  Paths for eval datasets, where the key is the data file tag, and the value is the data file path. Multiple file paths may be given for evaluation, and each will be cached and loaded separately.
- args.version_2_with_negative : bool
  Flag that specifies to use the SQuAD v2.0 preprocessors. The default value is False.
- args.doc_stride : Optional[int]
  Corresponds to the doc_stride input param for some Huggingface Transformer models.
- args.max_query_length : Optional[int]
  Max length for the query segment in the Transformer model input.
tokenizer – The Transformer model tokenizer used to preprocess the data.
evaluate (Optional(Bool)) – A flag to set the trainer task to either train or evaluate. The default value is False.
use_aug_path (Optional(Bool)) – A flag to define whether to use the aug_file_path or the train_file_path. If True, the augmented data path is used when loading and caching the data.
output_examples (Optional(Bool)) – A flag to define whether the examples and features should be returned by the data preprocessor. If False, the preprocessor only returns the dataset. This is necessary if the Trainer is used for evaluation or in a pipeline where training is followed by evaluation.

Returns

The dataset containing the data to be used for training or evaluation.

Important notes:
- If output_examples is True, the examples and features are also returned.
- If evaluate is True, the output is a dictionary whose keys are the names of the datasets used for evaluation and whose values are the datasets (and, optionally, the examples and features).

Return type

torch.utils.data.TensorDataset
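The load-or-cache behavior and the dictionary returned for evaluation can be sketched as follows. This is an illustration under assumptions rather than the function's actual code: load_or_build, load_eval_sets, fake_build, the cache file naming, and the file name adv-dev.json are all hypothetical, and pickle stands in for whatever serialization the library uses.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build_fn, overwrite_cache=False):
    """Return cached features if present, otherwise build and cache them."""
    if os.path.exists(cache_path) and not overwrite_cache:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    features = build_fn()
    with open(cache_path, "wb") as handle:
        pickle.dump(features, handle)
    return features


def load_eval_sets(predict_file_path, cache_dir, build_fn):
    """Build one entry per evaluation file, keyed by its tag."""
    datasets = {}
    for tag, path in predict_file_path.items():
        cache_path = os.path.join(cache_dir, "cached_" + tag + ".pkl")
        datasets[tag] = load_or_build(cache_path, lambda p=path: build_fn(p))
    return datasets


calls = []


def fake_build(path):
    # Stand-in for tokenizing a SQuAD-like file; records each invocation.
    calls.append(path)
    return {"source": path, "features": [1, 2, 3]}


with tempfile.TemporaryDirectory() as cache_dir:
    files = {"dev": "dev-v1.1.json", "adv": "adv-dev.json"}
    first = load_eval_sets(files, cache_dir, fake_build)
    second = load_eval_sets(files, cache_dir, fake_build)  # served from cache
```

Each evaluation file is built once and cached under its own tag, which matches the note that multiple evaluation files are cached and loaded separately.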

kitanaqa.trainer.utils.post_to_slack(obj, old_state, new_state)¶: Post a msg to Slack url if configured, else simply return new_state object
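The (obj, old_state, new_state) signature matches the shape of a Prefect state handler, which is expected to return a state. A minimal sketch of such a handler is shown below. It is not the library's implementation: notify_on_state_change and the SLACK_WEBHOOK_URL environment variable are hypothetical, and plain strings stand in for Prefect state objects.

```python
import json
import os
import urllib.request


def notify_on_state_change(obj, old_state, new_state):
    """State-handler sketch: optionally post a message, always return new_state."""
    url = os.environ.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if not url:
        return new_state  # nothing configured, so pass the state through
    payload = json.dumps({"text": f"{obj}: {old_state} -> {new_state}"}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(request, timeout=5).close()
    except OSError:
        pass  # a failed notification should not change the flow's state
    return new_state


# Make sure no webhook is configured for this demonstration.
os.environ.pop("SLACK_WEBHOOK_URL", None)
result = notify_on_state_change("train-task", "Running", "Success")
```

Returning new_state unchanged in every branch keeps the notification a side effect that cannot alter the outcome of the flow.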
