World Scientific

EfficientWord-Net: An Open Source Hotword Detection Engine Based on Few-Shot Learning

    https://doi.org/10.1142/S0219649222500599

    Voice assistants such as Siri, Google Assistant and Alexa are widely used across the globe for home automation. They are woken up by unique phrases, also known as hotwords, such as “Hey Alexa!”, “Ok, Google!” and “Hey, Siri!”, after which they perform an action. Hotword detectors are lightweight real-time engines whose purpose is to detect the hotwords uttered by the user. However, existing engines either require thousands of training samples or are closed source and charge a fee. This paper addresses this problem by presenting the design and implementation of a lightweight, easy-to-implement hotword detection engine based on few-shot learning. The engine detects a hotword uttered by the user in real time with just a few training samples of that hotword. This approach is efficient compared with existing implementations, where adding a new hotword requires enormous numbers of positive and negative training samples and the model must be retrained for every hotword, making those implementations inefficient in terms of computation and cost. The architecture proposed in this paper achieves an accuracy of 95.40%.
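    The few-shot approach described above can be sketched as follows: each enrolment recording of the hotword is mapped to a fixed-length embedding, and at detection time an incoming audio window is accepted if its embedding is sufficiently similar to any enrolled one. The sketch below is illustrative only and is not the paper's actual architecture: `embed` is a toy stand-in (a normalised magnitude spectrum) for the trained neural encoder the engine would use, and the class name and threshold value are hypothetical.

```python
# Minimal few-shot hotword matching via embedding similarity.
# embed() is a toy stand-in for a trained neural audio encoder.
import numpy as np


def embed(audio: np.ndarray) -> np.ndarray:
    """Toy embedding: L2-normalised magnitude spectrum of the clip."""
    spec = np.abs(np.fft.rfft(audio, n=1024))
    return spec / (np.linalg.norm(spec) + 1e-9)


class FewShotHotword:
    def __init__(self, reference_clips, threshold=0.9):
        # Enrol the hotword from just a few reference recordings --
        # no retraining, only embedding and storing them.
        self.refs = [embed(c) for c in reference_clips]
        self.threshold = threshold

    def score(self, clip: np.ndarray) -> float:
        # Best cosine similarity against any enrolled sample
        # (embeddings are unit-norm, so a dot product suffices).
        e = embed(clip)
        return max(float(np.dot(e, r)) for r in self.refs)

    def detect(self, clip: np.ndarray) -> bool:
        return self.score(clip) >= self.threshold
```

    Because enrolment is just "embed and store", adding a new hotword costs a few forward passes instead of a full retraining run, which is the efficiency argument the abstract makes.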
