World Scientific

EfficientWord-Net: An Open Source Hotword Detection Engine Based on Few-Shot Learning

    https://doi.org/10.1142/S0219649222500599

    Voice assistants such as Siri, Google Assistant and Alexa are widely used across the globe for home automation. They are woken up by unique phrases, also known as hotwords, such as “Hey Alexa!”, “Ok, Google!” and “Hey, Siri!”, after which they perform an action. Hotword detectors are lightweight real-time engines whose purpose is to detect the hotwords uttered by the user. However, existing engines either require thousands of training samples or are closed source and charge a fee. This paper addresses this problem by presenting the design and implementation of a lightweight, easy-to-implement hotword detection engine based on few-shot learning. The engine detects a hotword uttered by the user in real time with just a few training samples of that hotword. This approach is efficient compared with existing implementations, where adding a new hotword requires enormous numbers of positive and negative training samples and the model must be retrained for every hotword, making those implementations inefficient in terms of computation and cost. The architecture proposed in this paper achieves an accuracy of 95.40%.
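    The few-shot approach described above can be sketched as follows: each enrolment recording of the hotword is mapped to a fixed-length embedding, and at detection time an incoming audio window is accepted if its embedding is sufficiently similar to any enrolled one. The sketch below is illustrative only and is not the paper's actual architecture: `embed` is a toy stand-in (a normalised magnitude spectrum) for the trained neural encoder the engine would use, and the class name and threshold value are hypothetical.

```python
# Minimal few-shot hotword matching via embedding similarity.
# embed() is a toy stand-in for a trained neural audio encoder.
import numpy as np


def embed(audio: np.ndarray) -> np.ndarray:
    """Toy embedding: L2-normalised magnitude spectrum of the clip."""
    spec = np.abs(np.fft.rfft(audio, n=1024))
    return spec / (np.linalg.norm(spec) + 1e-9)


class FewShotHotword:
    def __init__(self, reference_clips, threshold=0.9):
        # Enrol the hotword from just a few reference recordings --
        # no retraining, only embedding and storing them.
        self.refs = [embed(c) for c in reference_clips]
        self.threshold = threshold

    def score(self, clip: np.ndarray) -> float:
        # Best cosine similarity against any enrolled sample
        # (embeddings are unit-norm, so a dot product suffices).
        e = embed(clip)
        return max(float(np.dot(e, r)) for r in self.refs)

    def detect(self, clip: np.ndarray) -> bool:
        return self.score(clip) >= self.threshold
```

    Because enrolment is just "embed and store", adding a new hotword costs a few forward passes instead of a full retraining run, which is the efficiency argument the abstract makes.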
