What Is the Generative Insertion Transformer?

2022-05-20 06:59:31 | By: Ms. Candy Zhu

Continuous annotation of user data is a challenge when deploying NLU techniques at scale in commercial applications. Models must be re-trained and updated to keep performance at an optimal level. However, the process is expensive, labour-intensive, and time-consuming. Furthermore, with rising concerns around privacy, the manual review of user data needed for annotation is not ideal.

Researchers at Amazon and the University of Massachusetts Lowell have proposed a generative model to produce labelled synthetic data. The idea is to improve model robustness and performance by generating synthetic utterances and augmenting the original training data.

The Generative Insertion Transformer (GIT) is based on a non-autoregressive insertion transformer model and extends that idea to solve the inverse NLU problem: given an annotation template, it produces a valid labelled utterance that matches the annotation.

In this generative model, the decoder generates a sequence by inserting tokens between previously generated tokens. Carrier tokens are inserted between the labels in the template iteratively. The insertion process at each position in the utterance is independent of every other position and stops when the EOS token is generated at all positions, resulting in a fully annotated synthetic utterance that can directly augment the real data used for model building.
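The iterative insertion loop described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the `model.predict_insertions` call is a hypothetical stand-in for the real GIT decoder, which would score the full vocabulary at every slot.

```python
# Toy sketch of non-autoregressive insertion decoding.
# `model.predict_insertions(tokens)` is a hypothetical interface that
# returns one predicted token per insertion slot (before, between, and
# after the existing tokens: len(tokens) + 1 slots in total).
EOS = "<eos>"

def insertion_decode(model, template, max_rounds=10):
    """Iteratively insert carrier tokens between the tokens of `template`
    until the model emits EOS at every insertion slot."""
    tokens = list(template)
    for _ in range(max_rounds):
        predictions = model.predict_insertions(tokens)
        if all(tok == EOS for tok in predictions):
            break  # nothing left to insert: the utterance is complete
        new_tokens = []
        for slot, tok in enumerate(predictions):
            if tok != EOS:
                new_tokens.append(tok)  # insert a carrier token here
            if slot < len(tokens):
                new_tokens.append(tokens[slot])  # keep the existing token
        tokens = new_tokens
    return tokens
```

Because every slot is predicted independently, each round can insert many tokens at once, which is what makes the decoder non-autoregressive.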

The process can be divided into three stages:

Pretraining: GIT is pre-trained using the BERT encoder and the KERMIT objective on an unsupervised LM task: given a sentence with masked tokens, GIT is trained to insert the masked tokens. Two pre-training configurations of this model were evaluated.
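One way to picture the unsupervised pre-training task is as building (input, target) pairs where random tokens are dropped from a sentence and the model must insert them back. This is a hedged sketch of such a pair-construction step; the drop probability and exact sampling scheme are assumptions for illustration only.

```python
import random

# Sketch of building one unsupervised pre-training example: randomly
# drop tokens from a sentence; the insertion model is trained to put
# them back. The 0.3 drop rate is an illustrative assumption.
def make_pretraining_pair(sentence_tokens, drop_prob=0.3, rng=None):
    """Return (corrupted input, full target) for insertion training."""
    rng = rng or random.Random(0)
    kept = [t for t in sentence_tokens if rng.random() > drop_prob]
    # Guard against dropping everything: keep at least one token.
    return (kept or sentence_tokens[:1]), sentence_tokens
```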

Fine-tuning: The pre-trained GIT model is then fine-tuned for each domain using annotated real data. For each utterance, a template is provided as model input and the complete utterance as output. During training, each insertion slot has multiple candidate tokens from the ground truth, unlike autoregressive generation, which entails a single token per generation step. The ground-truth distribution sets the probability of non-candidate tokens to 0 and weights all candidate tokens uniformly.
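The ground-truth distribution at a single insertion slot can be written out directly: zero mass on non-candidate tokens and uniform mass over the candidates. A minimal sketch, with an illustrative vocabulary size and candidate set:

```python
import numpy as np

# Training target over the vocabulary for one insertion slot:
# non-candidate tokens get probability 0; candidates are weighted
# uniformly. Vocabulary indices here are illustrative.
def slot_target_distribution(vocab_size, candidate_ids):
    """Return the ground-truth distribution for one insertion slot."""
    target = np.zeros(vocab_size)
    target[list(candidate_ids)] = 1.0 / len(candidate_ids)
    return target
```

Training then minimises the divergence between the model's per-slot output distribution and this target, one target per slot rather than one token per step.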

Generation: To generate synthetic data for NLU, a template is constructed that contains the desired intent, slot types, and slot values for the synthetic example. This priming sequence is provided as input to the decoder, which iteratively inserts carrier tokens to form a coherent utterance. The generation process addresses both the label-projection and entity-control challenges. Templates used in inference are constructed from the reduced real data.
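A priming template of this kind might be serialised as an intent token followed by tagged slot values. The exact serialisation the authors use is not shown in the article, so the tag format below is an assumption made for illustration:

```python
# Hypothetical template serialisation: intent token, then each slot's
# opening tag, value tokens, and closing tag. The real GIT template
# format may differ; this only illustrates the idea.
def build_template(intent, slots):
    """Build a priming token sequence from an intent and slot values."""
    tokens = [f"<{intent}>"]
    for slot_type, value in slots.items():
        tokens += [f"<{slot_type}>", *value.split(), f"</{slot_type}>"]
    return tokens

template = build_template("PlayMusic", {"SongName": "shape of you"})
# e.g. ['<PlayMusic>', '<SongName>', 'shape', 'of', 'you', '</SongName>']
```

The decoder would then insert carrier tokens around these labels (for instance, "play" before the song name) to yield a fluent, fully annotated utterance.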

To study the effectiveness of synthetically generated data, NLU model performance was evaluated in a reduced-data regime. For each domain, multiple intent classification and named entity recognition (IC-NER) models were built using all real data, a reduced set of real data, and a combination of real and synthetic data. All models within a domain share the same training hyper-parameters, including architecture and encoder; they differ only in training data composition.

The researchers demonstrated data augmentation (DA) using GIT as a feasible data generation technique to mitigate reduced annotation volumes for IC and NER tasks. NLU models trained on 33% real data plus synthetic data performed on par with models trained on full real data. Further, on domains with the highest semantic error rate (SemER) regressions, the quality of synthetic data was improved by filtering it with model confidence scores. Among domains that benefit from synthetic data, appropriate carrier-token insertion enhanced the utterances' semantics and their value as training samples. Future work involves data generation with entities replaced through knowledge-base sampling. Such finer control over entities supports new feature expansion and enhances customer privacy.
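The confidence-based filtering step mentioned above amounts to discarding synthetic utterances the NLU model scores poorly. A minimal sketch, assuming per-example confidence scores are available; the threshold value is an illustrative assumption, not from the paper:

```python
# Sketch of confidence-based filtering of synthetic utterances.
# `scores` are assumed to be per-example NLU model confidences in
# [0, 1]; the 0.9 threshold is an illustrative choice.
def filter_by_confidence(synthetic, scores, threshold=0.9):
    """Keep only synthetic examples the model scores confidently."""
    return [ex for ex, s in zip(synthetic, scores) if s >= threshold]
```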

© Analytics India Magazine Pvt Ltd 2022