KitanaQA Data Augmentation

Squad Augmentation Module

class kitanaqa.augment.augment_squad.SQuADDataset(raw_examples: Dict, custom_importance_scores: Optional[Dict] = None, is_training: bool = False, sample_ratio: float = 4.0, num_replacements: int = 2, sampling_k: int = 3, sampling_strategy: str = 'topK', p_replace: float = 0.1, p_dropword: float = 0.1, p_misspelling: float = 0.1, save_freq: int = 100, from_checkpoint: bool = False, out_prefix: Optional[str] = None, verbose: bool = False)

Bases: Generic[torch.utils.data.dataset.T_co]

generate()

Generate perturbations for the raw SQuAD-like examples.

Parameters
  • term (str) – The input term for which we are looking for synonyms.

  • num_syns (Optional(int)) – The number of synonyms for the input term. The number of synonyms should be greater than 1. The default value is 10.

  • similarity_thre (Optional(float)) – The similarity threshold. The function returns the synonyms with higher similarity than the threshold.

Returns

Return type

None

Example

>>> import json
>>> from augment_squad import SQuADDataset
>>> with open('support/squad-dev-v1.1.json', 'r') as f:
...     squad_dev_examples = json.load(f)
>>> ds = SQuADDataset(squad_dev_examples, sample_ratio=0.0001)
>>> ds.generate()
>>> ds()

kitanaqa.augment.augment_squad.format_squad(examples: Dict, title_map: Dict, context_map: Dict, version: str = '1.1') → Dict

Convert a flat list of dicts to nested SQuAD format
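
As a rough illustration of what "nested SQuAD format" means, the sketch below groups flat question records by title and context into the data → paragraphs → qas hierarchy used by SQuAD v1.1 files. The flat record fields (title, context, id, question, answers) and the helper name nest_squad_records are assumptions made for this example; it does not reproduce the title_map and context_map arguments of format_squad.

```python
from typing import Dict, List


def nest_squad_records(records: List[Dict], version: str = "1.1") -> Dict:
    """Group flat QA records into the nested SQuAD layout (hypothetical helper)."""
    # title -> context -> list of question entries, kept in first-seen order
    grouped: Dict[str, Dict[str, List[Dict]]] = {}
    for rec in records:
        contexts = grouped.setdefault(rec["title"], {})
        qas = contexts.setdefault(rec["context"], [])
        qas.append({
            "id": rec["id"],
            "question": rec["question"],
            "answers": rec["answers"],
        })

    data = []
    for title, contexts in grouped.items():
        paragraphs = [{"context": ctx, "qas": qas} for ctx, qas in contexts.items()]
        data.append({"title": title, "paragraphs": paragraphs})
    return {"version": version, "data": data}


# Two flat records that share one title and one context (made-up data).
flat = [
    {"title": "Town", "context": "I was born in a small town.", "id": "q1",
     "question": "Where was I born?",
     "answers": [{"text": "a small town", "answer_start": 14}]},
    {"title": "Town", "context": "I was born in a small town.", "id": "q2",
     "question": "How big was the town?",
     "answers": [{"text": "small", "answer_start": 16}]},
]
nested = nest_squad_records(flat)
```

Both questions end up under a single paragraph entry because they share the same title and context.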

Generators Module

class kitanaqa.augment.generators.BaseGenerator

Bases: object

A base class for generating token-level perturbations.

_check_sent(sent)

Validate and sanitize an input sentence

_cosine_similarity(v1, v2)

Calculate the cosine similarity between two vectors
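
For reference, cosine similarity is the dot product of two vectors divided by the product of their Euclidean norms. The following is a minimal pure-Python sketch of that formula, not the body of _cosine_similarity; the function name and the handling of zero vectors are choices made for this example.

```python
import math
from typing import Sequence


def cosine_similarity(v1: Sequence[float], v2: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0.0 or norm2 == 0.0:
        # The value is undefined for a zero vector; returning 0.0 is an
        # assumption of this sketch.
        return 0.0
    return dot / (norm1 * norm2)


parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # perpendicular
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
```

Vectors pointing the same way score close to 1, perpendicular vectors score 0, and opposite vectors score close to -1.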

class kitanaqa.augment.generators.MLMSynonymReplace

Bases: kitanaqa.augment.generators.BaseGenerator

A class to replace terms with synonyms using a masked language model (MLM).

generate(term, num_target, toks, token_idx)

Generate synonyms for an input term

generate(term: str, num_target: int, **kwargs) → List

Generate a certain number of synonyms using an MLM.

Parameters
  • term (str) – The input term for which we are looking for synonyms.

  • num_target (int) – The target number of synonyms to generate for the input term. The number should be greater than 0

  • kwargs (Dict) – A set of generator-specific arguments:

    • toks (List) – The tokenized source string containing the target term.

    • token_idx (int) – The index of the target string in the tokenized source string.

Returns

Returns a list of synonyms if any. Otherwise an empty list

Return type

[str]

Example

>>> from generators import MLMSynonymReplace
>>> mr = MLMSynonymReplace()
>>> term = "small"
>>> toks = ['I', 'was', 'born', 'in', 'a', 'small', 'town']
>>> token_idx = 5
>>> num_target = 2
>>> mr.generate(term=term, num_target=num_target, toks=toks, token_idx=token_idx)
['little', 'mining']
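
The role of toks and token_idx can be sketched without a model: the target position is replaced by a mask token, an MLM proposes fillers for that position, and the proposals are trimmed to num_target candidates. The "[MASK]" string (BERT-style), both helper names, and the hard-coded prediction list are assumptions for illustration; the actual model call inside MLMSynonymReplace is not described in this reference.

```python
from typing import List


def build_masked_tokens(toks: List[str], token_idx: int,
                        mask_token: str = "[MASK]") -> List[str]:
    """Return a copy of toks with the target position replaced by the mask."""
    if not 0 <= token_idx < len(toks):
        raise IndexError("token_idx is outside the tokenized sentence")
    masked = list(toks)
    masked[token_idx] = mask_token
    return masked


def select_candidates(predictions: List[str], term: str,
                      num_target: int) -> List[str]:
    """Keep the first num_target distinct predictions that differ from term."""
    kept: List[str] = []
    for cand in predictions:
        if cand.lower() != term.lower() and cand not in kept:
            kept.append(cand)
        if len(kept) == num_target:
            break
    return kept


toks = ['I', 'was', 'born', 'in', 'a', 'small', 'town']
masked = build_masked_tokens(toks, token_idx=5)
# Stand-in for ranked MLM output; a real model would score fillers
# for the masked slot.
fake_predictions = ['small', 'little', 'mining', 'little']
candidates = select_candidates(fake_predictions, term='small', num_target=2)
```

Filtering out the original term matters because a masked language model will often rank the word it was asked to replace among its top predictions.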
class kitanaqa.augment.generators.MisspReplace

Bases: kitanaqa.augment.generators.BaseGenerator

A class to replace commonly misspelled terms.

generate(term, num_target)

Generate misspellings for an input term

generate(term: str, num_target: int, **kwargs) → List

Generate a certain number of misspellings for the input term.

Parameters
  • term (str) – The input term for which we are looking for misspellings.

  • num_target (int) – The target number of misspellings to generate for the input term. The number of misspellings should be greater than 0.

  • kwargs (Dict) – A set of generator-specific arguments.

Returns

Returns a list of misspellings if any. Otherwise an empty list

Return type

[str]

Example

>>> from generators import MisspReplace
>>> mr = MisspReplace()
>>> term = "worried"
>>> num_target = 5
>>> mr.generate(term=term, num_target=num_target)
['woried', 'worred']
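
One simple way to produce such output is a lookup in a table of known misspellings, which also explains why fewer than num_target results can come back. The sketch below assumes that approach; the table contents and the lookup_misspellings name are made up for this example and are not taken from MisspReplace.

```python
from typing import Dict, List

# Hypothetical table; a real generator would load a much larger resource.
COMMON_MISSPELLINGS: Dict[str, List[str]] = {
    "worried": ["woried", "worred"],
    "town": ["twon"],
}


def lookup_misspellings(term: str, num_target: int) -> List[str]:
    """Return at most num_target known misspellings, or an empty list."""
    if num_target < 1:
        raise ValueError("num_target must be greater than 0")
    return COMMON_MISSPELLINGS.get(term.lower(), [])[:num_target]


found = lookup_misspellings("worried", num_target=5)    # only two are known
capped = lookup_misspellings("worried", num_target=1)   # truncated to one
missing = lookup_misspellings("xylophone", num_target=5)
```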
class kitanaqa.augment.generators.SynonymReplace

Bases: kitanaqa.augment.generators.BaseGenerator

A class to generate synonyms for an input term using word2vec.

generate(term, num_target, similarity_thre=0.5)

Generate synonyms for an input term

generate(term: str, num_target: int, **kwargs) → List

Generate a certain number of synonyms using a word2vec model

Parameters
  • term (str) – The input term for which we are looking for synonyms.

  • num_target (int) – The target number of synonyms to generate for the input term. The number should be greater than 0

  • kwargs (Optional(Dict)) – A set of generator-specific arguments:

    • similarity_thre (Optional(float)) – Threshold of cosine similarity values in generated terms. The default value is 0.7.

Returns

Returns a list of synonyms if any. Otherwise an empty list

Return type

[str]

Example

>>> from generators import SynonymReplace
>>> sr = SynonymReplace()
>>> term = "worried"
>>> num_target = 3
>>> similarity_thre = 0.7
>>> sr.generate(term, num_target, similarity_thre=similarity_thre)
['apprehensive', 'preoccupied', 'worry']
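
The effect of similarity_thre can be sketched as a filter over scored nearest neighbors: rank candidates by similarity, keep those above the threshold, and return at most num_target of them. The scores below are invented stand-ins for word2vec output, and filter_by_similarity is a hypothetical helper rather than part of SynonymReplace.

```python
from typing import List, Tuple


def filter_by_similarity(neighbors: List[Tuple[str, float]], num_target: int,
                         similarity_thre: float = 0.7) -> List[str]:
    """Return up to num_target neighbors scoring above the threshold."""
    ranked = sorted(neighbors, key=lambda pair: pair[1], reverse=True)
    return [word for word, score in ranked if score > similarity_thre][:num_target]


# Made-up similarity scores for neighbors of "worried".
neighbors = [("apprehensive", 0.81), ("happy", 0.42),
             ("preoccupied", 0.76), ("worry", 0.74)]
synonyms = filter_by_similarity(neighbors, num_target=3)
strict = filter_by_similarity(neighbors, num_target=3, similarity_thre=0.8)
```

Raising the threshold trades coverage for precision: with 0.8 only one candidate survives, so the result is shorter than num_target.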

Term Replacement Module

class kitanaqa.augment.term_replacement.DropTerms(use_stop: bool = True)

Bases: object

A class to generate sentence perturbations by dropping target terms.

drop_terms(sentence, num_terms, num_output_sents)

Generate sentence perturbations by dropping terms

drop_terms(sentence: str, num_terms: int = 1, num_output_sents: int = 1) → List

Generate a certain number of sentence perturbations by dropping select terms

Parameters
  • sentence (str) – The input sentence to be perturbed.

  • num_terms (Optional(int)) – The number of terms to target in the original sentence, with this perturbation. The number should be greater than 0. The default value is 1.

  • num_output_sents (Optional(int)) – The number of perturbed sentences to produce. The number should be greater than 0. The default is 1.

Returns

Returns a list of perturbed sentences for the input sentence.

Return type

[str]

Example

>>> from term_replacement import DropTerms
>>> p = DropTerms()
>>> sent = "I was born in a small town"
>>> num_terms = 3
>>> num_output_sents = 1
>>> p.drop_terms(sent, num_terms, num_output_sents)
['I born small town']
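
The perturbation itself is simple to sketch: pick num_terms token positions and delete them, once per output sentence. The version below is a simplified illustration that assumes whitespace tokenization and uniform random selection; it ignores the use_stop option, and the drop_random_terms name and seed parameter are introduced only for this example.

```python
import random
from typing import List, Optional


def drop_random_terms(sentence: str, num_terms: int = 1,
                      num_output_sents: int = 1,
                      seed: Optional[int] = None) -> List[str]:
    """Produce perturbed sentences by deleting randomly chosen tokens."""
    if num_terms < 1 or num_output_sents < 1:
        raise ValueError("num_terms and num_output_sents must be greater than 0")
    rng = random.Random(seed)
    toks = sentence.split()  # whitespace tokenization, a simplification
    # Never drop every token: keep at least one word in the output.
    num_terms = max(0, min(num_terms, len(toks) - 1))
    outputs = []
    for _ in range(num_output_sents):
        dropped = set(rng.sample(range(len(toks)), num_terms))
        outputs.append(" ".join(t for i, t in enumerate(toks) if i not in dropped))
    return outputs


perturbed = drop_random_terms("I was born in a small town", num_terms=3, seed=0)
```

Which three words disappear depends on the random seed, but the surviving words always keep their original order.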
class kitanaqa.augment.term_replacement.RepeatTerms(use_stop: bool = True)

Bases: object

A class to generate sentence perturbations by repeating target terms.

repeat_terms(sentence, num_terms, num_output_sents)

Generate sentence perturbations by repeating terms

repeat_terms(sentence: str, num_terms: int = 1, num_output_sents: int = 1) → List

Generate a certain number of sentence perturbations by repeating select terms

Parameters
  • sentence (str) – The input sentence to be perturbed.

  • num_terms (Optional(int)) – The number of terms to target in the original sentence, with this perturbation. The number should be greater than 0. The default value is 1.

  • num_output_sents (Optional(int)) – The number of perturbed sentences to produce. The number should be greater than 0. The default is 1.

Returns

Returns a list of perturbed sentences for the input sentence.

Return type

[str]

Example

>>> from term_replacement import RepeatTerms
>>> p = RepeatTerms()
>>> sent = "I was born in a small town"
>>> num_terms = 1
>>> num_output_sents = 1
>>> p.repeat_terms(sent, num_terms, num_output_sents)
['I was was born in a small town']
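
Term repetition can be sketched as follows: choose num_terms positions and emit each chosen token twice in a row. This is an illustrative simplification that assumes whitespace tokenization and uniform random selection and ignores use_stop; repeat_random_terms and its seed parameter are hypothetical names.

```python
import random
from typing import List, Optional


def repeat_random_terms(sentence: str, num_terms: int = 1,
                        num_output_sents: int = 1,
                        seed: Optional[int] = None) -> List[str]:
    """Produce perturbed sentences by duplicating randomly chosen tokens."""
    if num_terms < 1 or num_output_sents < 1:
        raise ValueError("num_terms and num_output_sents must be greater than 0")
    rng = random.Random(seed)
    toks = sentence.split()  # whitespace tokenization, a simplification
    num_terms = min(num_terms, len(toks))
    outputs = []
    for _ in range(num_output_sents):
        repeated = set(rng.sample(range(len(toks)), num_terms))
        new_toks = []
        for i, tok in enumerate(toks):
            new_toks.append(tok)
            if i in repeated:
                new_toks.append(tok)  # duplicate right after the original
        outputs.append(" ".join(new_toks))
    return outputs


repeated_sents = repeat_random_terms("I was born in a small town", num_terms=1, seed=0)
```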
class kitanaqa.augment.term_replacement.ReplaceTerms(rep_type: str = 'synonym', use_ner: bool = True)

Bases: object

A class to generate sentence perturbations by replacement.

replace_terms(sentence, importance_scores, num_replacements, num_output_sents, sampling_strategy, sampling_k)

Generate sentence perturbations by replacing terms with synonyms or misspellings

replace_terms(sentence: str, importance_scores: Optional[List] = None, num_replacements: int = 1, num_output_sents: int = 1, sampling_strategy: str = 'random', sampling_k: Optional[int] = None) → List

Generate a certain number of sentence perturbations by replacement using either misspelling or synonyms

Parameters
  • sentence (str) – The input sentence to be perturbed.

  • importance_scores (Optional(List)) – List of tuples defining a weight for each term in the tokenized sentence. These weights are used during sampling to influence perturbation probabilities. If None, uniform sampling is used by default.

  • num_replacements (Optional(int)) – Target number of terms to replace in the original sentence. The number is chosen randomly using the target as an upper bound, and lower bound of 1. The default is 1.

  • num_output_sents (Optional(int)) – Target number of perturbed sentences to generate based on the original sentence. The default is 1.

  • sampling_strategy (Optional(str)) – Strategy used to sample terms to perturb in the original sentence. The default is random. If importance_scores is given, then sampling_strategy may be topK or bottomK, in which case the importance_scores (or inverted scores) vector is used for weighted sampling.

  • sampling_k (Optional(int)) – The number of terms in the importance score vector to include in topK or bottomK sampling. This parameter is not used by the default sampling_strategy, random sampling.

Returns

Returns a list of perturbed sentences for the input sentence.

Return type

[str]

Example

>>> from term_replacement import ReplaceTerms
>>> p = ReplaceTerms(rep_type="synonym")
>>> sent = "I was born in a small town"
>>> num_replacements = 1
>>> num_output_sents = 1
>>> p.replace_terms(sent, num_replacements=num_replacements, num_output_sents=num_output_sents)
['I born in a small village']
>>> from term_replacement import ReplaceTerms
>>> p = ReplaceTerms(rep_type="misspelling")
>>> sent = "I was born in a small town"
>>> num_replacements = 1
>>> num_output_sents = 1
>>> p.replace_terms(sent, num_replacements=num_replacements, num_output_sents=num_output_sents)
['I born in a smal town']
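
The interaction between importance_scores, sampling_strategy, and sampling_k can be sketched as a position sampler: random ignores the scores, while topK and bottomK restrict the pool to the sampling_k highest or lowest scoring terms and then sample by weight. This is one plausible reading of the parameter descriptions above, not the library's code. It assumes each tuple is (token, score) with non-negative scores, and sample_term_indices is a hypothetical name.

```python
import random
from typing import List, Optional, Tuple


def sample_term_indices(importance_scores: List[Tuple[str, float]],
                        num_replacements: int = 1,
                        sampling_strategy: str = "random",
                        sampling_k: Optional[int] = None,
                        seed: Optional[int] = None) -> List[int]:
    """Pick token positions to perturb (illustrative sketch)."""
    rng = random.Random(seed)
    indices = list(range(len(importance_scores)))
    if sampling_strategy == "random":
        return sorted(rng.sample(indices, min(num_replacements, len(indices))))
    if sampling_strategy not in ("topK", "bottomK"):
        raise ValueError(f"unknown sampling_strategy: {sampling_strategy}")

    # Rank positions: highest score first for topK, lowest first for bottomK.
    ranked = sorted(indices, key=lambda i: importance_scores[i][1],
                    reverse=(sampling_strategy == "topK"))
    pool = ranked[:sampling_k] if sampling_k else ranked

    # Weight by score for topK and by inverted score for bottomK; the small
    # constant keeps the total weight positive.
    max_score = max(score for _, score in importance_scores)
    eps = 1e-6
    if sampling_strategy == "topK":
        weights = {i: importance_scores[i][1] + eps for i in pool}
    else:
        weights = {i: max_score - importance_scores[i][1] + eps for i in pool}

    chosen: List[int] = []
    while pool and len(chosen) < num_replacements:
        pick = rng.choices(pool, weights=[weights[i] for i in pool], k=1)[0]
        chosen.append(pick)
        pool = [i for i in pool if i != pick]  # sample without replacement
    return sorted(chosen)


# Made-up importance scores for the example sentence.
scores = [("I", 0.05), ("was", 0.05), ("born", 0.6), ("in", 0.05),
          ("a", 0.05), ("small", 0.8), ("town", 0.9)]
top = sample_term_indices(scores, num_replacements=2,
                          sampling_strategy="topK", sampling_k=3, seed=1)
bottom = sample_term_indices(scores, num_replacements=2,
                             sampling_strategy="bottomK", sampling_k=3, seed=1)
```

With sampling_k=3, topK can only select among "born", "small", and "town", whereas bottomK draws from the low-scoring function words.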
kitanaqa.augment.term_replacement.get_scores(tokens: List[str], mode: str = 'random', mode_k: Optional[int] = None, scores: Optional[List[Tuple]] = None, remove_stop: bool = True) → List[Tuple]

Initialize and sanitize importance scores

kitanaqa.augment.term_replacement.validate_inputs(num_replacements: int, num_output_sents: int, sampling_strategy: Optional[str] = None) → List

Ensures valid input parameters
