KitanaQA Trainer¶
Adversarial Squad Processor Module¶
-
class
kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor¶ Bases:
transformers.data.processors.utils.DataProcessor
Processor for the SQuAD data set. Overridden by AlumSquadV1Processor and AlumSquadV2Processor, used for versions 1.1 and 2.0 of SQuAD, respectively.
-
alum_get_dev_examples(data_dir, filename=None)¶ Returns the training examples from the data directory.
- Parameters
data_dir – Directory containing the data files used for training and evaluating.
filename – None by default. Specify this if the training file has a different name from the original one, which is train-v1.1.json for SQuAD version 1.1 and train-v2.0.json for version 2.0.
-
dev_file= None¶
-
train_file= None¶
-
-
class
kitanaqa.trainer.alum_squad_processor.AlumSquadV1Processor¶ Bases:
kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor
-
dev_file= 'dev-v1.1.json'¶
-
train_file= 'train-v1.1.json'¶
-
-
class
kitanaqa.trainer.alum_squad_processor.AlumSquadV2Processor¶ Bases:
kitanaqa.trainer.alum_squad_processor.AlumSquadProcessor
-
dev_file= 'dev-v2.0.json'¶
-
train_file= 'train-v2.0.json'¶
-
-
kitanaqa.trainer.alum_squad_processor.alum_squad_convert_examples_to_features(examples, tokenizer, max_seq_length, doc_stride, max_query_length, padding_strategy='max_length', return_dataset=False, threads=1, tqdm_enabled=True)¶ Converts a list of examples into a list of features that can be given directly as input to a model. It is model-dependent and takes advantage of many of the tokenizer’s features to create the model’s inputs.
- Parameters
examples – list of SquadExample
tokenizer – an instance of a child of PreTrainedTokenizer
max_seq_length – The maximum sequence length of the inputs.
doc_stride – The stride used when the context is too large and is split across several features.
max_query_length – The maximum length of the query.
padding_strategy – Which padding strategy to use. Defaults to “max_length”.
return_dataset – Defaults to False. If ‘pt’, returns a torch.utils.data.TensorDataset.
threads – The number of processing threads to use.
- Returns
list of
SquadFeatures
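Example:
    from kitanaqa.trainer.alum_squad_processor import (
        AlumSquadV2Processor,
        alum_squad_convert_examples_to_features,
    )

    # data_dir, tokenizer and args are assumed to be defined by the caller
    processor = AlumSquadV2Processor()
    examples = processor.alum_get_dev_examples(data_dir)
    features = alum_squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        doc_stride=args.doc_stride,
        max_query_length=args.max_query_length,
    )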
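The interaction between max_seq_length and doc_stride can be illustrated with a simplified sliding window over context tokens. This is a minimal sketch and not the library's implementation: the helper name sliding_windows is hypothetical, the query and special tokens are ignored, and it assumes that each new window starts doc_stride tokens after the previous one.

```python
from typing import List, Sequence, Tuple


def sliding_windows(
    tokens: Sequence[str], max_len: int, doc_stride: int
) -> List[Tuple[int, List[str]]]:
    """Split tokens into overlapping windows of at most max_len tokens.

    Hypothetical helper: returns (start_offset, window_tokens) pairs,
    advancing the start of each window by doc_stride tokens.
    """
    if max_len <= 0 or doc_stride <= 0:
        raise ValueError("max_len and doc_stride must be positive")

    windows = []
    start = 0
    while True:
        end = min(start + max_len, len(tokens))
        windows.append((start, list(tokens[start:end])))
        if end == len(tokens):
            # The last window reaches the end of the context.
            break
        start += doc_stride
    return windows


context = "the quick brown fox jumps over the lazy dog again".split()
for offset, window in sliding_windows(context, max_len=4, doc_stride=2):
    print(offset, window)
```

With a stride smaller than the window length, consecutive windows overlap, so an answer span that straddles one window boundary can still appear whole in a neighbouring window.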
Arguments Module¶
-
class
kitanaqa.trainer.arguments.ModelArguments(model_name_or_path: str, train_file_path: str, predict_file_path: Dict[str, str], model_type: str, aug_file_path: Optional[str] = None, tokenizer_name_or_path: Optional[str] = None, cache_dir: Optional[str] = None, do_lower_case: Optional[str] = True, version_2_with_negative: bool = False, null_score_diff_threshold: float = 0.0, n_best_size: int = 20, verbose_logging: bool = False, overwrite_cache: bool = False, max_seq_length: Optional[int] = 512, max_query_length: Optional[int] = 64, max_answer_length: Optional[int] = 30, doc_stride: Optional[int] = 128, freeze_embeds: bool = False, do_aug: bool = True, do_adv_eval: bool = False, do_alum: bool = True, eta: Optional[float] = 0.001, alpha: Optional[float] = 1, alpha_final: Optional[float] = None, alpha_schedule: Optional[str] = None, eps: Optional[float] = 1e-05, sigma: Optional[float] = 1e-05, K: Optional[float] = 1, data_dir: Optional[str] = None, eval_all_checkpoints: Optional[str] = False)¶ Bases:
object
Arguments pertaining to which model/config/tokenizer we are going to fine-tune from.
-
K: Optional[float] = 1¶
-
alpha: Optional[float] = 1¶
-
alpha_final: Optional[float] = None¶
-
alpha_schedule: Optional[str] = None¶
-
aug_file_path: str = None¶
-
cache_dir: Optional[str] = None¶
-
data_dir: Optional[str] = None¶
-
do_adv_eval: bool = False¶
-
do_alum: bool = True¶
-
do_aug: bool = True¶
-
do_lower_case: Optional[str] = True¶
-
doc_stride: Optional[int] = 128¶
-
eps: Optional[float] = 1e-05¶
-
eta: Optional[float] = 0.001¶
-
eval_all_checkpoints: Optional[str] = False¶
-
freeze_embeds: bool = False¶
-
max_answer_length: Optional[int] = 30¶
-
max_query_length: Optional[int] = 64¶
-
max_seq_length: Optional[int] = 512¶
-
model_name_or_path: str¶
-
model_type: str¶
-
n_best_size: int = 20¶
-
null_score_diff_threshold: float = 0.0¶
-
overwrite_cache: bool = False¶
-
predict_file_path: Dict[str, str]¶
-
sigma: Optional[float] = 1e-05¶
-
tokenizer_name_or_path: Optional[str] = None¶
-
train_file_path: str¶
-
verbose_logging: bool = False¶
-
version_2_with_negative: bool = False¶
-
-
kitanaqa.trainer.arguments.default_logdir() → str¶ Same default as PyTorch
Custom Schedulers Module¶
-
kitanaqa.trainer.custom_schedulers.custom_scheduler(max_steps: int, update_fn: Callable[[int], float]) → float¶ Create a custom generator for an input param
-
kitanaqa.trainer.custom_schedulers.get_custom_exp(max_steps: int, start_val: float, end_val: float) → Iterable¶ Create a custom exponential scheduler
-
kitanaqa.trainer.custom_schedulers.get_custom_linear(max_steps: int, start_val: float, end_val: float) → Iterable¶ Create a custom linear scheduler
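The three scheduler helpers above can be pictured as generators that produce one value per training step. The following is a minimal sketch under stated assumptions, not the KitanaQA implementation: the names step_scheduler, linear_schedule and exp_schedule are hypothetical, and the interpolation formulas are one common choice for moving a parameter from start_val to end_val over max_steps steps.

```python
from typing import Callable, Iterator


def step_scheduler(max_steps: int, update_fn: Callable[[int], float]) -> Iterator[float]:
    """Yield update_fn(step) for every step in [0, max_steps)."""
    for step in range(max_steps):
        yield update_fn(step)


def linear_schedule(max_steps: int, start_val: float, end_val: float) -> Iterator[float]:
    """Interpolate linearly from start_val (first step) to end_val (last step)."""
    span = max(max_steps - 1, 1)  # avoid division by zero when max_steps == 1
    return step_scheduler(
        max_steps, lambda step: start_val + (end_val - start_val) * step / span
    )


def exp_schedule(max_steps: int, start_val: float, end_val: float) -> Iterator[float]:
    """Interpolate geometrically from start_val to end_val.

    Assumes start_val and end_val are nonzero and share the same sign.
    """
    span = max(max_steps - 1, 1)
    ratio = end_val / start_val
    return step_scheduler(max_steps, lambda step: start_val * ratio ** (step / span))


print(list(linear_schedule(5, 1.0, 0.0)))
print(list(exp_schedule(3, 1.0, 0.01)))
```

Because each schedule is a generator, a training loop can call next() once per step to obtain the current value of the scheduled parameter.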
Train Module¶
-
class
kitanaqa.trainer.train.Trainer(model_args=None, **kwargs)¶ Bases:
transformers.trainer.Trainer
A class that provides adversarial and augmented training and evaluation …
-
model_args¶ The arguments to tweak for training. Will default to a basic instance in arguments.py if not provided. For a list of the model_args, refer to arguments.py.
- Type
dataclass, optional
-
log(logs, iterator=None)¶ Modified from the HF Trainer base class. Logs the values in logs on the various objects watching training.
-
training_step(model, batch)¶ Performs one step of training (might be adversarial and/or augmented) and returns the loss.
-
evaluate(prefix, args, tokenizer, dataset, examples, features)¶ Performs the evaluation on the dataset and returns the evaluation metrics (Exact Match (EM) and F1-score).
-
adv_evaluate(prefix: str, args, tokenizer, dataset, examples, features) → torch.Tensor¶ Performs PGD attack on each example in the evaluation dataset, recording aggregate metrics
- Parameters
prefix (str) – The model to be used for training
args –
tokenizer – The tokenizer used to preprocess the data.
dataset (List(torch.utils.data.TensorDataset)) – The evaluation dataset
examples (List(torch.utils.data.TensorDataset)) – The examples in the evaluation dataset
features (List(torch.utils.data.TensorDataset)) – SQuAD-like features corresponding to the evaluation dataset
- Returns
The evaluation metrics (Exact Match (EM) and F1-score)
- Return type
torch.Tensor
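For readers unfamiliar with projected gradient descent (PGD), the core loop can be sketched on a plain list of floats. This is an illustrative sketch only: the names pgd_attack, grad_fn, eps, step_size and num_steps are hypothetical, the perturbation is bounded in the L-infinity norm, and it does not show how adv_evaluate applies the attack to model inputs.

```python
from typing import Callable, List


def sign(value: float) -> float:
    """Return -1.0, 0.0 or 1.0 depending on the sign of value."""
    return float((value > 0) - (value < 0))


def pgd_attack(
    grad_fn: Callable[[List[float]], List[float]],
    x: List[float],
    eps: float,
    step_size: float,
    num_steps: int,
) -> List[float]:
    """Search for a perturbation of x that increases the loss.

    grad_fn returns the gradient of the loss with respect to its input.
    Each step moves along the gradient sign (ascent), then projects the
    perturbation back into the box [-eps, eps].
    """
    delta = [0.0] * len(x)
    for _ in range(num_steps):
        grad = grad_fn([xi + di for xi, di in zip(x, delta)])
        # Ascent step on the loss.
        delta = [di + step_size * sign(gi) for di, gi in zip(delta, grad)]
        # Projection onto the L-infinity ball of radius eps.
        delta = [max(-eps, min(eps, di)) for di in delta]
    return [xi + di for xi, di in zip(x, delta)]


# Toy loss: sum of squares, whose gradient is 2 * x.
x_adv = pgd_attack(lambda v: [2 * vi for vi in v], [1.0, -1.0],
                   eps=0.3, step_size=0.1, num_steps=5)
print(x_adv)
```

The projection step is what keeps the adversarial input within a fixed distance of the original, which is the property that distinguishes PGD from unconstrained gradient ascent.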
-
evaluate(prefix: str, args, tokenizer, dataset, examples, features) → torch.Tensor¶ Performs evaluation on the dataset
- Parameters
prefix (str) – The model to be used for training
args –
tokenizer – The tokenizer used to preprocess the data.
dataset (List(torch.utils.data.TensorDataset)) – The evaluation dataset
examples (List(torch.utils.data.TensorDataset)) – The examples in the evaluation dataset
features (List(torch.utils.data.TensorDataset)) – SQuAD-like features corresponding to the evaluation dataset
- Returns
The evaluation metrics (Exact Match (EM) and F1-score)
- Return type
torch.Tensor
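The two metrics named above, Exact Match and F1, can be sketched for a single prediction as follows. This is a simplified illustration in the style of the usual SQuAD scoring, not the code used by evaluate: the function names are hypothetical, and the normalization (lowercasing, removing punctuation and English articles) is an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between the normalized prediction and ground truth."""
    pred_tokens = normalize(prediction).split()
    true_tokens = normalize(truth).split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == true_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Cat.", "cat"))
print(f1_score("the cat sat", "cat sat down"))
```

Dataset-level scores are typically obtained by averaging these per-example values, taking the maximum over all reference answers when an example has more than one.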
-
log(logs: Dict[str, float], iterator: Optional[tqdm.std.tqdm] = None) → None¶ Modified from the HF Trainer base class.
Logs the values in logs on the various objects watching training.
Subclass and override this method to inject custom behavior.
- Parameters
logs (Dict[str, float]) – The values to log.
iterator (tqdm, optional) – A potential tqdm progress bar to write the logs on.
-
setup_comet()¶
-
training_step(model: torch.nn.modules.module.Module, batch: List) → torch.Tensor¶ Performs one step of training (might be adversarial and/or augmented)
- Parameters
model (nn.Module) – The model to be used for training
batch (List) – The batch used for one step of training. Includes the input_ids, attention_masks, token_type_ids, start_positions, and end_positions
- Returns
The training loss after one step of training
- Return type
torch.Tensor
-
-
kitanaqa.trainer.train.tensor_to_list(tensor)¶ Convert a Tensor to List
Utils module¶
-
kitanaqa.trainer.utils.build_flow(args, label: str = 'default', model=None, tokenizer=None, train_dataset=None) → prefect.core.flow.Flow¶ Constructs a Prefect flow composed of sequential modeling steps
- Parameters
label (Optional(str)) – The unique tag used to identify the Flow instance. The default value is ‘default’
model (Optional(transformers.PreTrainedModel)) – The pre-trained Transformer model. This parameter is required for training. The default value is None.
tokenizer (Optional(transformers.PreTrainedTokenizer)) – The tokenizer used to preprocess the data for the model.
train_dataset (torch.utils.data.TensorDataset) – The training dataset. This parameter is required for training. The default value is None.
- Returns
A Prefect Flow object containing the specified steps and parameters
- Return type
Flow object
-
kitanaqa.trainer.utils.load_and_cache_examples(args, tokenizer, evaluate=False, use_aug_path=False, output_examples=False) → torch.utils.data.dataset.TensorDataset¶ Loads SQuAD-like data features from dataset file (or cache)
- Parameters
args (kitanaqa.trainer.arguments.ModelArguments) –
A set of arguments related to the model. Specifically, the following arguments are used in this function:
- args.train_file_path : str
Path to the training data file
- args.do_aug : bool
Flag to specify whether to use the augmented training set. If True, will be merged with the original training set specified in train_file_path. The default value is True.
- args.aug_file_path : str
Path for augmented train dataset
- args.data_dir : str
Path for data files
- args.model_name_or_path : str
Path to pretrained model or model identifier from huggingface.co/models
- args.max_seq_length : Optional[int]
Max length for the input tokens, specified to the Transformer model defined in model_name_or_path
- args.overwrite_cache : bool
Overwrite cached data on load
- args.predict_file_path : Dict[str, str]
Paths for eval datasets, where the key is the data file tag, and the value is the data file path. Multiple file paths may be given for evaluation, and each will be cached and loaded separately.
- args.version_2_with_negative : bool
Flag that specifies to use the SQuAD v2.0 preprocessors. The default value is False.
- args.doc_stride : Optional[int]
Corresponds to the doc_stride input param for some Huggingface Transformer models.
- args.max_query_length : Optional[int]
Max length for the query segment in the Transformer model input.
tokenizer – The Transformer model tokenizer used to preprocess the data.
evaluate (Optional(Bool)) – A flag to set the trainer task to either train or evaluate. The default value is False.
use_aug_path (Optional(Bool)) – A flag to define whether to use the aug_file_path or the train_file_path. If True, the augmented data path is used when loading and caching the data.
output_examples (Optional(Bool)) – A flag to define whether the examples and features should be returned by the data preprocessor. If False, the preprocessor only returns the dataset. This is necessary if the Trainer is used for evaluation or in a pipeline where training is followed by evaluation.
- Returns
The dataset containing the data to be used for training or evaluation. Important notes:
- If output_examples is True, the examples and features are also returned.
- If evaluate is True, the output is a dictionary whose keys are the names of the datasets used for evaluation and whose values are the datasets (and, optionally, the examples and features).
- Return type
torch.utils.data.TensorDataset
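The "load from file or cache" behaviour, together with the overwrite_cache flag, follows a common pattern that can be sketched as below. This is a generic illustration under stated assumptions, not the code of load_and_cache_examples: the helper name load_or_build, the use of pickle and the cache path are all hypothetical.

```python
import os
import pickle
from typing import Any, Callable


def load_or_build(
    cache_path: str, build_fn: Callable[[], Any], overwrite_cache: bool = False
) -> Any:
    """Return cached data if present, otherwise build it and cache it.

    build_fn stands in for the expensive preprocessing step (for example,
    converting examples to features). Setting overwrite_cache forces a rebuild.
    """
    if os.path.exists(cache_path) and not overwrite_cache:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)

    data = build_fn()
    with open(cache_path, "wb") as handle:
        pickle.dump(data, handle)
    return data
```

A caller would pass a cache path derived from the data file and model settings, so that a change in, say, max_seq_length does not silently reuse features built for a different configuration.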
-
kitanaqa.trainer.utils.post_to_slack(obj, old_state, new_state)¶ Posts a message to the Slack URL if one is configured; otherwise simply returns the new_state object