KitanaQA Trainer

Adversarial Squad Processor Module

class kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor

Bases: transformers.data.processors.utils.DataProcessor

Processor for the SQuAD dataset. Overridden by AlumSquadV1Processor and AlumSquadV2Processor, which are used for versions 1.1 and 2.0 of SQuAD, respectively.

alum_get_dev_examples(data_dir, filename=None)

Returns the training examples from the data directory.

Parameters
  • data_dir – Directory containing the data files used for training and evaluating.

  • filename – None by default. Specify this if the training file has a different name than the original one, which is train-v1.1.json for SQuAD version 1.1 and train-v2.0.json for SQuAD version 2.0.

dev_file = None
train_file = None

class kitanaqa.trainer.alum_squad_processor.AlumSquadV1Processor

Bases: kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor

dev_file = 'dev-v1.1.json'
train_file = 'train-v1.1.json'

class kitanaqa.trainer.alum_squad_processor.AlumSquadV2Processor

Bases: kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor

dev_file = 'dev-v2.0.json'
train_file = 'train-v2.0.json'

kitanaqa.trainer.alum_squad_processor.alum_squad_convert_examples_to_features(examples, tokenizer, max_seq_length, doc_stride, max_query_length, padding_strategy='max_length', return_dataset=False, threads=1, tqdm_enabled=True)

Converts a list of examples into a list of features that can be directly given as input to a model. It is model-dependent and takes advantage of many of the tokenizer’s features to create the model’s inputs.

Parameters
  • examples – list of SquadExample

  • tokenizer – an instance of a child of PreTrainedTokenizer

  • max_seq_length – The maximum sequence length of the inputs.

  • doc_stride – The stride used when the context is too large and is split across several features.

  • max_query_length – The maximum length of the query.

  • padding_strategy – Which padding strategy to use. Defaults to “max_length”.

  • return_dataset – Defaults to False. If set to ‘pt’, returns a torch.utils.data.TensorDataset.

  • threads – The number of processing threads to use. Defaults to 1.

  • tqdm_enabled – Whether to show a tqdm progress bar. Defaults to True.

Returns

list of SquadFeatures

Example:

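from kitanaqa.trainer.alum_squad_processor import (
    AlumSquadV2Processor,
    alum_squad_convert_examples_to_features,
)

# data_dir, tokenizer, and args are assumed to be defined by the caller.
processor = AlumSquadV2Processor()
examples = processor.alum_get_dev_examples(data_dir)

features = alum_squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=args.max_seq_length,
    doc_stride=args.doc_stride,
    max_query_length=args.max_query_length,
)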
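
The doc_stride parameter controls how a context that is too long for one input is split across several features. The following is a simplified sketch of that idea, not the library's implementation: split_into_windows is a hypothetical helper, it works on plain token lists, and it assumes each window starts doc_stride tokens after the previous one. The real conversion also accounts for the query and special tokens.

```python
from typing import List


def split_into_windows(tokens: List[str], max_len: int, doc_stride: int) -> List[List[str]]:
    """Split tokens into overlapping windows of at most max_len tokens.

    Hypothetical helper: each window starts doc_stride tokens after the
    previous one, so neighbors overlap by max_len - doc_stride tokens.
    """
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the current window already reaches the end of the context
        start += doc_stride
    return windows


context = "the quick brown fox jumps over the lazy dog".split()
windows = split_into_windows(context, max_len=4, doc_stride=2)
for window in windows:
    print(window)
```

With a doc_stride smaller than the window length, every token appears in at least one window, and an answer span near a window boundary is likely to fall entirely inside a neighboring window.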

Arguments Module

class kitanaqa.trainer.arguments.ModelArguments(model_name_or_path: str, train_file_path: str, predict_file_path: Dict[str, str], model_type: str, aug_file_path: Optional[str] = None, tokenizer_name_or_path: Optional[str] = None, cache_dir: Optional[str] = None, do_lower_case: Optional[str] = True, version_2_with_negative: bool = False, null_score_diff_threshold: float = 0.0, n_best_size: int = 20, verbose_logging: bool = False, overwrite_cache: bool = False, max_seq_length: Optional[int] = 512, max_query_length: Optional[int] = 64, max_answer_length: Optional[int] = 30, doc_stride: Optional[int] = 128, freeze_embeds: bool = False, do_aug: bool = True, do_adv_eval: bool = False, do_alum: bool = True, eta: Optional[float] = 0.001, alpha: Optional[float] = 1, alpha_final: Optional[float] = None, alpha_schedule: Optional[str] = None, eps: Optional[float] = 1e-05, sigma: Optional[float] = 1e-05, K: Optional[float] = 1, data_dir: Optional[str] = None, eval_all_checkpoints: Optional[str] = False)

Bases: object

Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.

K: Optional[float] = 1
alpha: Optional[float] = 1
alpha_final: Optional[float] = None
alpha_schedule: Optional[str] = None
aug_file_path: str = None
cache_dir: Optional[str] = None
data_dir: Optional[str] = None
do_adv_eval: bool = False
do_alum: bool = True
do_aug: bool = True
do_lower_case: Optional[str] = True
doc_stride: Optional[int] = 128
eps: Optional[float] = 1e-05
eta: Optional[float] = 0.001
eval_all_checkpoints: Optional[str] = False
freeze_embeds: bool = False
max_answer_length: Optional[int] = 30
max_query_length: Optional[int] = 64
max_seq_length: Optional[int] = 512
model_name_or_path: str
model_type: str
n_best_size: int = 20
null_score_diff_threshold: float = 0.0
overwrite_cache: bool = False
predict_file_path: Dict[str, str]
sigma: Optional[float] = 1e-05
tokenizer_name_or_path: Optional[str] = None
train_file_path: str
verbose_logging: bool = False
version_2_with_negative: bool = False

kitanaqa.trainer.arguments.default_logdir() → str

Returns the default logging directory, following the same default as PyTorch.

Custom Schedulers Module

kitanaqa.trainer.custom_schedulers.custom_scheduler(max_steps: int, update_fn: Callable[[int], float]) → float

Create a custom generator for an input param

kitanaqa.trainer.custom_schedulers.get_custom_exp(max_steps: int, start_val: float, end_val: float) → Iterable

Create a custom exponential scheduler

kitanaqa.trainer.custom_schedulers.get_custom_linear(max_steps: int, start_val: float, end_val: float) → Iterable

Create a custom linear scheduler
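
As an illustration of how such schedulers might behave, here is a minimal sketch built on plain Python generators. The names schedule, linear_schedule, and exp_schedule are hypothetical stand-ins, and the interpolation formulas are assumptions: the library's own functions may compute their values differently.

```python
from typing import Callable, Iterator


def schedule(max_steps: int, update_fn: Callable[[int], float]) -> Iterator[float]:
    """Yield update_fn(step) for step = 0 .. max_steps - 1 (assumed semantics)."""
    for step in range(max_steps):
        yield update_fn(step)


def linear_schedule(max_steps: int, start_val: float, end_val: float) -> Iterator[float]:
    """Move from start_val to end_val in equal increments."""
    span = max(max_steps - 1, 1)  # avoid division by zero for a one-step schedule
    return schedule(max_steps, lambda t: start_val + (end_val - start_val) * t / span)


def exp_schedule(max_steps: int, start_val: float, end_val: float) -> Iterator[float]:
    """Move from start_val to end_val by a constant ratio (both values must be > 0)."""
    span = max(max_steps - 1, 1)
    ratio = end_val / start_val
    return schedule(max_steps, lambda t: start_val * ratio ** (t / span))


linear_values = list(linear_schedule(5, 1.0, 0.0))
exp_values = list(exp_schedule(3, 1.0, 0.01))
print(linear_values)
print(exp_values)
```

A schedule of this kind could be consumed once per training step with next(), for example to move a weight such as alpha toward alpha_final over the course of training.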

Train Module

class kitanaqa.trainer.train.Trainer(model_args=None, **kwargs)

Bases: transformers.trainer.Trainer

A class to provide the adversarial and augmented training and evaluation …

model_args

The arguments to tweak for training. Will default to a basic instance in arguments.py if not provided. For a list of the model_args, refer to arguments.py.

Type

dataclass, optional

log(logs, iterator=None)

Modified from the HF Trainer base class. Logs on the various objects watching training.

training_step(model, batch)

Performs one step of training (might be adversarial and/or augmented) and returns the loss.

evaluate(prefix, args, tokenizer, dataset, examples, features)

Performs the evaluation on the dataset and returns the evaluation metrics (Exact Match (EM) and F1-score).

adv_evaluate(prefix: str, args, tokenizer, dataset, examples, features) → torch.Tensor

Performs a PGD (projected gradient descent) attack on each example in the evaluation dataset and records aggregate metrics.

Parameters
  • prefix (str) – The model to be used for training

  • args

  • tokenizer – The tokenizer used to preprocess the data.

  • dataset (List(torch.utils.data.TensorDataset)) – The evaluation dataset

  • examples (List(torch.utils.data.TensorDataset)) – The examples in the evaluation dataset

  • features (List(torch.utils.data.TensorDataset)) – SQuAD-like features corresponding to the evaluation dataset

Returns

The evaluation metrics (Exact Match (EM) and F1-score)

Return type

torch.Tensor
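
To make the idea of a PGD attack concrete, the following toy sketch perturbs the input of a small linear classifier. It is an illustration under stated assumptions, not the attack implemented by adv_evaluate: the model, the logistic loss, the L-infinity constraint, and all function names here are hypothetical, and the library presumably perturbs the Transformer's inputs rather than a two-dimensional vector.

```python
import math
from typing import List


def margin(w: List[float], x: List[float], y: int) -> float:
    """Signed margin y * (w . x) of a linear classifier, with y in {-1, +1}."""
    return y * sum(wi * xi for wi, xi in zip(w, x))


def logistic_loss(w: List[float], x: List[float], y: int) -> float:
    return math.log1p(math.exp(-margin(w, x, y)))


def input_gradient(w: List[float], x: List[float], y: int) -> List[float]:
    """Gradient of the logistic loss with respect to the input x."""
    scale = -y / (1.0 + math.exp(margin(w, x, y)))
    return [scale * wi for wi in w]


def sign(value: float) -> int:
    return (value > 0) - (value < 0)


def pgd_perturbation(w, x, y, eps: float, step_size: float, num_steps: int) -> List[float]:
    """Search for a perturbation with max |delta_i| <= eps that increases the loss."""
    delta = [0.0] * len(x)
    for _ in range(num_steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        grad = input_gradient(w, x_adv, y)
        # Ascent step along the sign of the gradient.
        delta = [di + step_size * sign(gi) for di, gi in zip(delta, grad)]
        # Projection back onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di)) for di in delta]
    return delta


w, x, y = [2.0, -1.0], [1.0, 1.0], 1
delta = pgd_perturbation(w, x, y, eps=0.1, step_size=0.05, num_steps=5)
clean_loss = logistic_loss(w, x, y)
adv_loss = logistic_loss(w, [xi + di for xi, di in zip(x, delta)], y)
print(f"clean loss: {clean_loss:.4f}, adversarial loss: {adv_loss:.4f}")
```

Comparing a metric on clean inputs with the same metric on perturbed inputs, as the two loss values do here, is one way to summarize how robust a model is to small input changes.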

evaluate(prefix: str, args, tokenizer, dataset, examples, features) → torch.Tensor

Performs evaluation on the dataset

Parameters
  • prefix (str) – The model to be used for training

  • args

  • tokenizer – The tokenizer used to preprocess the data.

  • dataset (List(torch.utils.data.TensorDataset)) – The evaluation dataset

  • examples (List(torch.utils.data.TensorDataset)) – The examples in the evaluation dataset

  • features (List(torch.utils.data.TensorDataset)) – SQuAD-like features corresponding to the evaluation dataset

Returns

The evaluation metrics (Exact Match (EM) and F1-score)

Return type

torch.Tensor

log(logs: Dict[str, float], iterator: Optional[tqdm.std.tqdm] = None) → None

Modified from HF Trainer base class

Log logs on the various objects watching training.

Subclass and override this method to inject custom behavior.

Parameters
  • logs (Dict[str, float]) – The values to log.

  • iterator (tqdm, optional) – A potential tqdm progress bar to write the logs on.

setup_comet()

training_step(model: torch.nn.modules.module.Module, batch: List) → torch.Tensor

Performs one step of training (might be adversarial and/or augmented)

Parameters
  • model (nn.Module) – The model to be used for training

  • batch (List) – The batch used for one step of training. Includes the input_ids, attention_masks, token_type_ids, start_positions, and end_positions

Returns

The training loss after one step of training

Return type

torch.Tensor

kitanaqa.trainer.train.tensor_to_list(tensor)

Converts a tensor to a list.

Utils Module

kitanaqa.trainer.utils.build_flow(args, label: str = 'default', model=None, tokenizer=None, train_dataset=None) → prefect.core.flow.Flow

Constructs a Prefect flow composed of sequential modeling steps

Parameters
  • label (Optional(str)) – The unique tag used to identify the Flow instance. The default value is ‘default’

  • model (Optional(transformers.PreTrainedModel)) – The pre-trained Transformer model. This parameter is required for training. The default value is None.

  • tokenizer (Optional(transformers.PreTrainedTokenizer)) – The tokenizer used to preprocess the data for the model.

  • train_dataset (torch.utils.data.TensorDataset) – The training dataset. This parameter is required for training. The default value is None.

Returns

A Prefect Flow object containing the specified steps and parameters

Return type

Flow object

kitanaqa.trainer.utils.load_and_cache_examples(args, tokenizer, evaluate=False, use_aug_path=False, output_examples=False) → torch.utils.data.dataset.TensorDataset

Loads SQuAD-like data features from dataset file (or cache)

Parameters
  • args (kitanaqa.trainer.arguments.ModelArguments) –

    A set of arguments related to the model. Specifically, the following arguments are used in this function:

    • args.train_file_path (str) – Path to the training data file.

    • args.do_aug (bool) – Flag to specify whether to use the augmented training set. If True, the augmented set is merged with the original training set specified in train_file_path. The default value is True.

    • args.aug_file_path (str) – Path to the augmented training dataset.

    • args.data_dir (str) – Path to the data files.

    • args.model_name_or_path (str) – Path to a pretrained model or a model identifier from huggingface.co/models.

    • args.max_seq_length (Optional[int]) – Maximum length of the input tokens passed to the Transformer model defined in model_name_or_path.

    • args.overwrite_cache (bool) – Whether to overwrite cached data on load.

    • args.predict_file_path (Dict[str, str]) – Paths to the evaluation datasets, where each key is a data file tag and each value is the data file path. Multiple file paths may be given for evaluation, and each will be cached and loaded separately.

    • args.version_2_with_negative (bool) – Flag that specifies whether to use the SQuAD v2.0 preprocessors. The default value is False.

    • args.doc_stride (Optional[int]) – Corresponds to the doc_stride input parameter for some Hugging Face Transformer models.

    • args.max_query_length (Optional[int]) – Maximum length of the query segment in the Transformer model input.

  • tokenizer – The Transformer model tokenizer used to preprocess the data.

  • evaluate (Optional(Bool)) – A flag to set the trainer task to either train or evaluate. The default value is False.

  • use_aug_path (Optional(Bool)) – A flag to define whether to use the aug_file_path or the train_file_path. If True, the augmented data path is used when loading and caching the data.

  • output_examples (Optional(Bool)) – A flag to define whether the examples and features should be returned by the data preprocessor. If False, the preprocessor only returns the dataset. Set this to True if the Trainer is used for evaluation or in a pipeline where training is followed by evaluation.

Returns

The dataset containing the data to be used for training or evaluation. Important notes:

  • If output_examples is True, the examples and features are also returned.

  • If evaluate is True, the output is a dictionary whose keys are the names of the datasets used for evaluation and whose values are the datasets (and optionally the examples and features).

Return type

torch.utils.data.TensorDataset

kitanaqa.trainer.utils.post_to_slack(obj, old_state, new_state)

Posts a message to the Slack URL if configured; otherwise simply returns the new_state object.
