Motivated by the success of masked language modeling (MLM) in pre-training natural language processing models, the authors propose w2v-BERT, which explores MLM for self-supervised speech representation learning.
w2v-BERT is a framework that combines contrastive learning and MLM: the former trains the model to discretize continuous input speech signals into a finite set of discriminative speech tokens, and the latter trains the model to learn contextualized speech representations by solving a masked prediction task that consumes the discretized tokens.
Unlike existing MLM-based speech pre-training frameworks such as HuBERT, which relies on an iterative re-clustering and re-training process, or vq-wav2vec, which concatenates two separately trained modules, w2v-BERT can be optimized end to end by solving the two self-supervised tasks (the contrastive task and MLM) simultaneously.
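To make the two-task setup concrete, here is a minimal PyTorch sketch of the combined objective. It is a simplified illustration, not the paper's implementation: the class name W2vBertSketch, the linear feature encoder, the plain Transformer stacks standing in for Conformer blocks, and the softmax codebook assignment (in place of the Gumbel-softmax product quantizer) are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class W2vBertSketch(nn.Module):
    """Toy two-task objective: contrastive learning + masked prediction."""

    def __init__(self, input_dim=80, dim=256, num_tokens=1024, num_layers=2):
        super().__init__()
        # Stand-in for the convolutional subsampling feature encoder.
        self.encoder = nn.Sequential(nn.Linear(input_dim, dim), nn.GELU())
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        # Lower stack feeds the contrastive task, upper stack feeds the MLM task
        # (the paper uses Conformer blocks; plain Transformer layers here).
        self.contrastive_net = nn.TransformerEncoder(layer, num_layers)
        self.mlm_net = nn.TransformerEncoder(layer, num_layers)
        self.codebook = nn.Embedding(num_tokens, dim)   # finite set of speech tokens
        self.mlm_head = nn.Linear(dim, num_tokens)      # predicts a token id per masked frame

    def quantize(self, z):
        # Soft codebook assignment as a stand-in for the Gumbel-softmax product
        # quantizer; the hard argmax ids serve as the masked-prediction targets.
        sims = z @ self.codebook.weight.t()             # (B, T, K)
        q = F.softmax(sims, dim=-1) @ self.codebook.weight
        return q, sims.argmax(dim=-1)

    def forward(self, feats, mask):
        # feats: (B, T, input_dim) speech features; mask: (B, T) bool, True at masked frames.
        z = self.encoder(feats)
        q, token_ids = self.quantize(z)                 # targets come from the unmasked features
        z_masked = z.masked_fill(mask.unsqueeze(-1), 0.0)

        # Contrastive task: the context vector of a masked frame should match its own
        # quantized target, with the other masked frames in the batch as negatives.
        c = self.contrastive_net(z_masked)
        c_m = F.normalize(c[mask], dim=-1)
        q_m = F.normalize(q[mask], dim=-1)
        logits = c_m @ q_m.t() / 0.1                    # (N, N) similarity matrix
        labels = torch.arange(c_m.size(0), device=c_m.device)
        contrastive_loss = F.cross_entropy(logits, labels)

        # MLM task: predict the discretized token id at each masked frame.
        mlm_logits = self.mlm_head(self.mlm_net(c))
        mlm_loss = F.cross_entropy(mlm_logits[mask], token_ids[mask])

        # The two self-supervised losses are summed and optimized end to end.
        return contrastive_loss + mlm_loss


# Toy usage: random "speech features" with roughly 40% of frames masked.
model = W2vBertSketch()
feats = torch.randn(2, 50, 80)
mask = torch.rand(2, 50) < 0.4
loss = model(feats, mask)
loss.backward()
```

The point the sketch tries to capture is that both losses are computed in a single forward pass and summed, so gradients from the contrastive task and the masked prediction task update the shared encoder jointly rather than through separate training stages.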
Experiments show that w2v-BERT achieves competitive results compared to current state-of-the-art pre-trained models on the LibriSpeech benchmarks when using the Libri-Light ~60k corpus as the unsupervised data.
In particular, compared to published models such as Conformer-based wav2vec 2.0 and HuBERT, the proposed model shows 5% to 10% relative WER reduction on the test-clean and test-other subsets. When applied to Google's Voice Search traffic dataset, w2v-BERT outperforms the authors' internal wav2vec 2.0 by more than 30% relative.
You can view the full paper here.
There is also a tutorial video on YouTube.