Positive Transfer of the Whisper Speech Transformer to Human and Animal Voice Activity Detection
Abstract
This paper introduces WhisperSeg, which uses the Whisper Transformer, pre-trained for Automatic Speech Recognition (ASR), for human and animal Voice Activity Detection (VAD). Unlike traditional methods that detect human voice or animal vocalizations from a short audio frame and rely on careful threshold selection, WhisperSeg processes entire spectrograms of long audio and generates plain text representations of the onset, offset, and type of voice activity. Processing a longer audio context with a larger network greatly improves detection accuracy from few labeled examples. We further demonstrate a positive transfer of detection performance to new animal species, making our approach viable in the data-scarce multi-species setting.
Publication status
published
Journal / series
bioRxiv
Publisher
Cold Spring Harbor Laboratory
Subject
Voice Activity Detection; Audio segmentation; Transformer; Whisper
Organisational unit
03774 - Hahnloser, Richard H.R. / Hahnloser, Richard H.R.
ETH Bibliography
yes