KitanaQA Data Augmentation
Squad Augmentation Module
-
class kitanaqa.augment.augment_squad.SQuADDataset(raw_examples: Dict, custom_importance_scores: Optional[Dict] = None, is_training: bool = False, sample_ratio: float = 4.0, num_replacements: int = 2, sampling_k: int = 3, sampling_strategy: str = 'topK', p_replace: float = 0.1, p_dropword: float = 0.1, p_misspelling: float = 0.1, save_freq: int = 100, from_checkpoint: bool = False, out_prefix: Optional[str] = None, verbose: bool = False)
Bases: Generic[torch.utils.data.dataset.T_co]
generate()
Generate perturbations for the raw SQuAD-like examples. Note that the signature above takes no arguments; the parameter descriptions below refer to a synonym lookup and may not apply to this method.
- Parameters
term (str) – The input term for which we are looking for synonyms.
num_syns (Optional(int)) – The number of synonyms for the input term. The number of synonyms should be greater than 1. The default value is 10.
similarity_thre (Optional(float)) – The similarity threshold. The function returns the synonyms with higher similarity than the threshold.
- Returns
- Return type
None
Example
>>> import json
>>> from augment_squad import SQuADDataset
>>> with open('support/squad-dev-v1.1.json', 'r') as f:
...     squad_dev_examples = json.load(f)
>>> ds = SQuADDataset(squad_dev_examples, sample_ratio=0.0001)
>>> ds.generate()
>>> ds()
-
-
kitanaqa.augment.augment_squad.format_squad(examples: Dict, title_map: Dict, context_map: Dict, version: str = '1.1') → Dict
Convert a flat list of dicts to nested SQuAD format.
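The flat-to-nested conversion can be illustrated with a minimal sketch. This is not the implementation of format_squad: the helper name nest_squad_examples and the flat record fields (id, title, context, question, answers) are assumptions made for illustration, and the sketch does not use title_map or context_map, whose structure is not described here. Only the nested output layout follows the public SQuAD v1.1 JSON format.

```python
from collections import defaultdict
from typing import Dict, List


def nest_squad_examples(flat_examples: List[Dict], version: str = "1.1") -> Dict:
    """Group flat QA records into the nested SQuAD layout (hypothetical helper)."""
    # title -> context -> list of question/answer entries
    grouped = defaultdict(lambda: defaultdict(list))
    for ex in flat_examples:
        grouped[ex["title"]][ex["context"]].append(
            {"id": ex["id"], "question": ex["question"], "answers": ex["answers"]}
        )

    data = []
    for title, contexts in grouped.items():
        paragraphs = [
            {"context": context, "qas": qas} for context, qas in contexts.items()
        ]
        data.append({"title": title, "paragraphs": paragraphs})
    return {"version": version, "data": data}


# Two flat records that share one title and one context (made-up data).
flat = [
    {
        "id": "q1",
        "title": "Towns",
        "context": "I was born in a small town.",
        "question": "Where was I born?",
        "answers": [{"text": "a small town", "answer_start": 14}],
    },
    {
        "id": "q2",
        "title": "Towns",
        "context": "I was born in a small town.",
        "question": "How big was the town?",
        "answers": [{"text": "small", "answer_start": 16}],
    },
]
nested = nest_squad_examples(flat)
```

Both records end up under a single article with a single paragraph, which is the grouping that the nested format expects.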
Generators Module
-
class kitanaqa.augment.generators.BaseGenerator
Bases: object
A base class for generating token-level perturbations
_check_sent(sent)
Validate and sanitize an input sentence
-
_cosine_similarity(v1, v2)
Calculate the cosine similarity between two vectors
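For reference, cosine similarity is the dot product of two vectors divided by the product of their norms. The following standalone sketch shows the calculation in plain Python; it is not the library's _cosine_similarity, and the handling of zero vectors is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return the cosine of the angle between v1 and v2."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm1 * norm2)
```

Parallel vectors score 1, orthogonal vectors score 0, and the result does not depend on vector length.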
-
-
class kitanaqa.augment.generators.MLMSynonymReplace
Bases: kitanaqa.augment.generators.BaseGenerator
A class to replace synonyms using a masked language model (MLM)
generate(term, num_target, toks, token_idx)
Generate synonyms for an input term
-
generate(term: str, num_target: int, **kwargs) → List
Generate a certain number of synonyms using an MLM.
- Parameters
term (str) – The input term for which we are looking for synonyms.
num_target (int) – The target number of synonyms to generate for the input term. The number should be greater than 0.
kwargs (Dict) – A set of generator-specific arguments:
toks (List) – The tokenized source string containing the target term.
token_idx (int) – The index of the target term in the tokenized source string.
- Returns
Returns a list of synonyms if any. Otherwise an empty list
- Return type
[str]
Example
>>> from generators import MLMSynonymReplace
>>> mr = MLMSynonymReplace()
>>> term = "small"
>>> toks = ['I', 'was', 'born', 'in', 'a', 'small', 'town']
>>> token_idx = 5
>>> num_target = 2
>>> mr.generate(term=term, num_target=num_target, toks=toks, token_idx=token_idx)
['little', 'mining']
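The toks and token_idx arguments suggest that the target term is masked in its sentence before the language model proposes replacements. The sketch below shows only that masking step. The helper name mask_target and the BERT-style "[MASK]" token are assumptions, and the model call itself is omitted because the model used by MLMSynonymReplace is not specified here.

```python
from typing import List


def mask_target(toks: List[str], token_idx: int, mask_token: str = "[MASK]") -> str:
    """Return the sentence with the token at token_idx replaced by mask_token."""
    if not 0 <= token_idx < len(toks):
        raise IndexError("token_idx is outside the tokenized sentence")
    masked = list(toks)  # copy so the caller's list is not modified
    masked[token_idx] = mask_token
    return " ".join(masked)


toks = ['I', 'was', 'born', 'in', 'a', 'small', 'town']
masked_sentence = mask_target(toks, token_idx=5)
# masked_sentence == "I was born in a [MASK] town"
```

A masked language model would then rank candidate tokens for the masked position, which explains why context-dependent outputs such as 'mining' can appear alongside true synonyms.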
-
-
class kitanaqa.augment.generators.MisspReplace
Bases: kitanaqa.augment.generators.BaseGenerator
A class to replace commonly misspelled terms
generate(term, num_target)
Generate misspellings for an input term
-
generate(term: str, num_target: int, **kwargs) → List
Generate a certain number of misspellings for the input term.
- Parameters
term (str) – The input term for which we are looking for misspellings.
num_target (int) – The target number of misspellings to generate for the input term. The number of misspellings should be greater than 0.
kwargs (Dict) – A set of generator-specific arguments.
- Returns
Returns a list of misspellings if any. Otherwise an empty list
- Return type
[str]
Example
>>> from generators import MisspReplace
>>> mr = MisspReplace()
>>> term = "worried"
>>> num_target = 5
>>> mr.generate(term=term, num_target=num_target)
['woried', 'worred']
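One plausible way to produce "commonly misspelled" variants is a lookup table keyed by the correct spelling. The sketch below illustrates that idea with a tiny made-up table; the table contents, the helper name lookup_misspellings, and the lowercase normalization are all assumptions and do not describe the resource that MisspReplace actually uses.

```python
from typing import Dict, List

# Toy lookup table (hypothetical data for illustration only).
COMMON_MISSPELLINGS: Dict[str, List[str]] = {
    "worried": ["woried", "worred"],
    "small": ["smal"],
}


def lookup_misspellings(
    term: str, num_target: int, table: Dict[str, List[str]] = COMMON_MISSPELLINGS
) -> List[str]:
    """Return up to num_target known misspellings, or an empty list if none."""
    if num_target <= 0:
        raise ValueError("num_target should be greater than 0")
    return table.get(term.lower(), [])[:num_target]
```

As in the example above, asking for five misspellings of a term with only two known variants returns just those two.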
-
-
class kitanaqa.augment.generators.SynonymReplace
Bases: kitanaqa.augment.generators.BaseGenerator
A class to generate synonyms for an input term using word2vec
generate(term, num_target, similarity_thre)
Generate synonyms for an input term
-
generate(term: str, num_target: int, **kwargs) → List
Generate a certain number of synonyms using a word2vec model.
- Parameters
term (str) – The input term for which we are looking for synonyms.
num_target (int) – The target number of synonyms to generate for the input term. The number should be greater than 0.
kwargs (Optional(Dict)) – A set of generator-specific arguments:
similarity_thre (Optional(float)) – Threshold of cosine similarity values in generated terms. The default value is 0.7.
- Returns
Returns a list of synonyms if any. Otherwise an empty list
- Return type
[str]
Example
>>> from generators import SynonymReplace
>>> sr = SynonymReplace()
>>> term = "worried"
>>> num_target = 3
>>> similarity_thre = 0.7
>>> sr.generate(term, num_target, similarity_thre=similarity_thre)
['apprehensive', 'preoccupied', 'worry']
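The role of similarity_thre can be sketched with toy vectors: candidates are ranked by cosine similarity to the input term, and only those above the threshold are kept. The two-dimensional vectors and the helper name nearest_terms below are invented for illustration; a real word2vec model has hundreds of dimensions and its own nearest-neighbour lookup.

```python
import math
from typing import Dict, List, Sequence


def _cosine(v1: Sequence[float], v2: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norms if norms else 0.0


def nearest_terms(
    term: str,
    vectors: Dict[str, Sequence[float]],
    num_target: int,
    similarity_thre: float = 0.7,
) -> List[str]:
    """Return up to num_target terms whose similarity to term exceeds the threshold."""
    query = vectors[term]
    scored = [(w, _cosine(query, v)) for w, v in vectors.items() if w != term]
    kept = [(w, s) for w, s in scored if s > similarity_thre]
    kept.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
    return [w for w, _ in kept[:num_target]]


# Made-up embeddings, not taken from any trained model.
toy_vectors = {
    "worried": [1.0, 0.0],
    "apprehensive": [0.9, 0.1],
    "worry": [0.8, 0.6],
    "banana": [0.0, 1.0],
}
synonyms = nearest_terms("worried", toy_vectors, num_target=3)
```

Raising the threshold trades recall for precision: with these toy vectors a threshold of 0.9 keeps only "apprehensive".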
-
Term Replacement Module
-
class kitanaqa.augment.term_replacement.DropTerms(use_stop: bool = True)
Bases: object
A class to generate sentence perturbations by dropping target terms
drop_terms(sentence, num_terms, num_output_sents)
Generate perturbed sentences by dropping terms from an input sentence
-
drop_terms(sentence: str, num_terms: int = 1, num_output_sents: int = 1) → List
Generate a certain number of sentence perturbations by dropping select terms.
- Parameters
sentence (str) – The input sentence to be perturbed.
num_terms (Optional(int)) – The number of terms to target in the original sentence, with this perturbation. The number should be greater than 0. The default value is 1.
num_output_sents (Optional(int)) – The number of perturbed sentences to produce. The number should be greater than 0. The default is 1.
- Returns
Returns a list of perturbed sentences for the input sentence.
- Return type
[str]
Example
>>> from term_replacement import DropTerms
>>> p = DropTerms()
>>> sent = "I was born in a small town"
>>> num_terms = 3
>>> num_output_sents = 1
>>> p.drop_terms(sent, num_terms, num_output_sents)
['I born small town']
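The drop perturbation can be approximated with a short standalone sketch that samples positions at random and removes them. It assumes whitespace tokenization and uniform sampling, and it ignores the use_stop option, whose exact effect is not described here; the helper name and the seed argument are hypothetical additions for reproducibility.

```python
import random
from typing import List, Optional


def drop_random_terms(
    sentence: str,
    num_terms: int = 1,
    num_output_sents: int = 1,
    seed: Optional[int] = None,
) -> List[str]:
    """Return num_output_sents copies of sentence, each missing num_terms tokens."""
    rng = random.Random(seed)
    tokens = sentence.split()  # assumption: whitespace tokenization
    if not 0 < num_terms < len(tokens):
        raise ValueError("num_terms must be positive and smaller than the token count")
    outputs = []
    for _ in range(num_output_sents):
        drop = set(rng.sample(range(len(tokens)), num_terms))
        outputs.append(" ".join(t for i, t in enumerate(tokens) if i not in drop))
    return outputs
```

Each output keeps the original word order, so a seven-token sentence with num_terms=3 always yields four tokens.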
-
-
class kitanaqa.augment.term_replacement.RepeatTerms(use_stop: bool = True)
Bases: object
A class to generate sentence perturbations by repeating target terms
repeat_terms(sentence, num_terms, num_output_sents)
Generate perturbed sentences by repeating terms in an input sentence
-
repeat_terms(sentence: str, num_terms: int = 1, num_output_sents: int = 1) → List
Generate a certain number of sentence perturbations by repeating select terms.
- Parameters
sentence (str) – The input sentence to be perturbed.
num_terms (Optional(int)) – The number of terms to target in the original sentence, with this perturbation. The number should be greater than 0. The default value is 1.
num_output_sents (Optional(int)) – The number of perturbed sentences to produce. The number should be greater than 0. The default is 1.
- Returns
Returns a list of perturbed sentences for the input sentence.
- Return type
[str]
Example
>>> from term_replacement import RepeatTerms
>>> p = RepeatTerms()
>>> sent = "I was born in a small town"
>>> num_terms = 1
>>> num_output_sents = 1
>>> p.repeat_terms(sent, num_terms, num_output_sents)
['I was was born in a small town']
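A comparable sketch for the repeat perturbation duplicates randomly chosen tokens in place. As before, whitespace tokenization and uniform sampling are assumptions, use_stop is not modelled, and the helper name and seed argument are hypothetical.

```python
import random
from typing import List, Optional


def repeat_random_terms(
    sentence: str,
    num_terms: int = 1,
    num_output_sents: int = 1,
    seed: Optional[int] = None,
) -> List[str]:
    """Return num_output_sents copies of sentence with num_terms tokens doubled."""
    rng = random.Random(seed)
    tokens = sentence.split()  # assumption: whitespace tokenization
    if not 0 < num_terms <= len(tokens):
        raise ValueError("num_terms must be between 1 and the token count")
    outputs = []
    for _ in range(num_output_sents):
        repeat = set(rng.sample(range(len(tokens)), num_terms))
        perturbed = []
        for i, tok in enumerate(tokens):
            perturbed.append(tok)
            if i in repeat:
                perturbed.append(tok)  # emit the chosen token a second time
        outputs.append(" ".join(perturbed))
    return outputs
```

Each repeated term adds exactly one token, so the output length is the original token count plus num_terms.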
-
-
class kitanaqa.augment.term_replacement.ReplaceTerms(rep_type: str = 'synonym', use_ner: bool = True)
Bases: object
A class to generate sentence perturbations by replacement
replace_terms(sentence, importance_scores, num_replacements, num_output_sents, sampling_strategy, sampling_k)
Generate perturbed sentences by replacing terms with synonyms or misspellings
-
replace_terms(sentence: str, importance_scores: Optional[List] = None, num_replacements: int = 1, num_output_sents: int = 1, sampling_strategy: str = 'random', sampling_k: Optional[int] = None) → List
Generate a certain number of sentence perturbations by replacement, using either misspellings or synonyms.
- Parameters
sentence (str) – The input sentence to be perturbed.
importance_scores (Optional(List)) – List of tuples defining a weight for each term in the tokenized sentence. These weights are used during sampling to influence perturbation probabilities. If None, uniform sampling is used by default.
num_replacements (Optional(int)) – Target number of terms to replace in the original sentence. The number is chosen randomly, using the target as an upper bound and 1 as a lower bound. The default is 1.
num_output_sents (Optional(int)) – Target number of perturbed sentences to generate based on the original sentence. The default is 1.
sampling_strategy (Optional(str)) – Strategy used to sample terms to perturb in the original sentence. The default is random. If importance_scores is given, then sampling_strategy may be topK or bottomK, in which case the importance_scores (or inverted scores) vector is used for weighted sampling.
sampling_k (Optional(int)) – The number of terms in the importance score vector to include in topK or bottomK sampling. This parameter is not used by the default sampling_strategy, random sampling.
- Returns
Returns a list of perturbed sentences for the input sentence.
- Return type
[str]
Example
>>> from term_replacement import ReplaceTerms
>>> p = ReplaceTerms(rep_type="synonym")
>>> sent = "I was born in a small town"
>>> num_replacements = 1
>>> num_output_sents = 1
>>> p.replace_terms(sent, num_replacements=num_replacements, num_output_sents=num_output_sents)
['I born in a small village']

>>> from term_replacement import ReplaceTerms
>>> p = ReplaceTerms(rep_type="misspelling")
>>> sent = "I was born in a small town"
>>> num_replacements = 1
>>> num_output_sents = 1
>>> p.replace_terms(sent, num_replacements=num_replacements, num_output_sents=num_output_sents)
['I born in a smal town']
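The interaction between importance_scores, sampling_strategy and sampling_k can be sketched as weighted sampling without replacement over a restricted candidate set. This is an illustration under several assumptions, not the library's sampling code: scores are taken to be non-negative (token, weight) tuples, "inverted scores" is implemented as the distance from the largest candidate score, and the helper name and seed argument are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def sample_term_indices(
    scores: List[Tuple[str, float]],
    num_replacements: int = 1,
    sampling_strategy: str = "random",
    sampling_k: Optional[int] = None,
    seed: Optional[int] = None,
) -> List[int]:
    """Pick token positions to perturb according to the sampling strategy."""
    rng = random.Random(seed)
    eps = 1e-9  # keeps every weight strictly positive
    indices = list(range(len(scores)))

    if sampling_strategy == "random":
        candidates = indices
        weights = [1.0] * len(candidates)  # uniform sampling
    elif sampling_strategy in ("topK", "bottomK"):
        if sampling_k is None or sampling_k < 1:
            raise ValueError("sampling_k is required for topK and bottomK")
        top = sampling_strategy == "topK"
        ranked = sorted(indices, key=lambda i: scores[i][1], reverse=top)
        candidates = ranked[:sampling_k]
        raw = [scores[i][1] for i in candidates]
        if top:
            weights = [w + eps for w in raw]
        else:
            # Assumed inversion: lower scores receive higher weights.
            weights = [max(raw) - w + eps for w in raw]
    else:
        raise ValueError(f"unknown sampling_strategy: {sampling_strategy}")

    # Weighted sampling without replacement.
    pool = list(zip(candidates, weights))
    chosen = []
    for _ in range(min(num_replacements, len(pool))):
        pick = rng.choices(range(len(pool)), weights=[w for _, w in pool], k=1)[0]
        chosen.append(pool.pop(pick)[0])
    return chosen


# Made-up importance scores for the example sentence.
scores = [("I", 0.1), ("was", 0.05), ("born", 0.9), ("in", 0.02),
          ("a", 0.01), ("small", 0.6), ("town", 0.8)]
top_picks = sample_term_indices(scores, 2, "topK", sampling_k=3, seed=0)
```

With sampling_k=3, topK can only select among "born", "town" and "small", while bottomK would be limited to "a", "in" and "was".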
-
-
kitanaqa.augment.term_replacement.get_scores(tokens: List[str], mode: str = 'random', mode_k: Optional[int] = None, scores: Optional[List[Tuple]] = None, remove_stop: bool = True) → List[Tuple]
Initialize and sanitize importance scores.
-
kitanaqa.augment.term_replacement.validate_inputs(num_replacements: int, num_output_sents: int, sampling_strategy: Optional[str] = None) → List
Ensure that the input parameters are valid.