ODESSA

Online Diarization Enhanced by recent Speaker identification and Sequential learning Approaches

Project start date: 01 Mar 2016

Funder: French National Research Agency (ANR)
Project code: ANR-15-CE39-0010

Funder Contribution: 308,406 EUR

Description

Speaker diarization is an unsupervised process that aims to identify each speaker within an audio stream and determine when each speaker is active. It assumes that the number of speakers, their identities and their speech turns are all unknown. Speaker diarization has become a key technology in many domains, such as content-based information retrieval, voice biometrics, forensics and social-behavioural analysis. Applications include speech and speaker indexing, speaker recognition in the presence of multiple speakers, speaker role detection, speech-to-text transcription, speech-to-speech translation and document content structuring.

Although speaker diarization has been studied for almost two decades, current state-of-the-art systems suffer from many limitations. They are extremely domain-dependent: for instance, a speaker diarization system trained on radio/TV broadcast news suffers drastically degraded performance when tested on a different type of recording, such as radio/TV debates, meetings, lectures, conversational telephone speech or conversational voice-over-IP speech. Overlapping speech, spontaneous speaking style, background noise, music and other non-speech sources (laughter, applause, etc.) are all nuisance factors that badly degrade diarization quality.

Furthermore, most existing work addresses offline speaker diarization: the system has access to the entire audio recording beforehand and no real-time processing is required. Multiple passes over the same data are therefore feasible, and a wide range of sophisticated machine learning tools can be used. These assumptions do not hold in real-time applications, however, especially those related to public security and the fight against terrorism and cybercrime.
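To make the offline/online distinction concrete, the following is a minimal sketch of online speaker assignment: each speech segment is attributed to a speaker as soon as it arrives, and earlier decisions are never revisited. It assumes that every segment has already been converted into a fixed-length, non-zero speaker embedding, and it uses cosine similarity against running speaker centroids with a hypothetical threshold. The function name `online_diarize` and the default threshold are illustrative assumptions, not the ODESSA system.

```python
import math
from typing import List, Sequence


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def online_diarize(embeddings: Sequence[Sequence[float]],
                   threshold: float = 0.5) -> List[int]:
    """Hypothetical single-pass diarization over segment embeddings.

    Each segment is compared with the centroid of every speaker seen so far.
    If the best cosine similarity reaches `threshold`, the segment joins that
    speaker; otherwise a new speaker is created. Labels are emitted in order
    and never changed afterwards (no second pass over the data).
    """
    centroid_sums: List[List[float]] = []  # sum of unit embeddings per speaker
    labels: List[int] = []
    for emb in embeddings:
        x = _normalize(emb)
        best, best_sim = -1, -math.inf
        for k, s in enumerate(centroid_sums):
            sim = _dot(x, _normalize(s))  # cosine similarity to centroid direction
            if sim > best_sim:
                best, best_sim = k, sim
        if best >= 0 and best_sim >= threshold:
            # Update the matched speaker's running centroid.
            centroid_sums[best] = [a + b for a, b in zip(centroid_sums[best], x)]
            labels.append(best)
        else:
            # No sufficiently similar speaker yet: open a new one.
            centroid_sums.append(list(x))
            labels.append(len(centroid_sums) - 1)
    return labels


print(online_diarize([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.95], [1, 0.05]]))
```

Because every decision is final, an early mistake (for example, splitting one speaker into two) cannot be corrected later; offline multi-pass systems avoid exactly this constraint.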
Moreover, after an initial step of segmentation into speech turns, most approaches treat speaker diarization as a bag-of-speech-turns clustering problem and ignore the inherent temporal structure of interactions between speakers. One goal of the project is to integrate this information and rely on structured prediction techniques to improve over standard hierarchical clustering methods. Since our main application is related to public security and the fight against cybercrime, an online speaker diarization system is necessary. This industrial research focus will be complemented by more fundamental research on structured prediction and methods such as conditional random fields.

Speaker diarization is inherently related to speaker recognition. In recent years, state-of-the-art speaker recognition systems have improved considerably, thanks to new recognition paradigms such as i-vectors and deep learning, new session compensation techniques such as probabilistic linear discriminant analysis (PLDA), and new score normalization techniques such as adaptive symmetric score normalization (AS-norm). However, existing speaker diarization systems have not taken full advantage of these techniques. Another goal of the project is therefore to adapt them to speaker diarization and thus fill this gap in the current literature.

To evaluate the proposed algorithms and ensure that they generalize, several existing databases will be used: NIST SRE 2008 summed-channel telephone data, NIST RT 2003-2004 conversational telephone data, REPERE TV broadcast data and the AMI meeting corpus. We also aim to collect a medium-size database suited to our main application, the fight against cybercrime.
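As an illustration of the score normalization mentioned above, adaptive symmetric score normalization is commonly described in the speaker recognition literature as follows: a raw comparison score s is normalized using the mean and standard deviation of the top-K highest scores obtained by scoring each side of the trial against a cohort of impostor speakers, giving s_norm = 0.5 * ((s - mu_e) / sigma_e + (s - mu_t) / sigma_t). The sketch below assumes those cohort scores are already computed; the function name `as_norm`, the `top_k` default and the use of the population standard deviation are illustrative assumptions, and how ODESSA adapts this technique to diarization is not specified here.

```python
import statistics
from typing import Sequence, Tuple


def _top_k_stats(cohort_scores: Sequence[float], top_k: int) -> Tuple[float, float]:
    """Mean and (population) standard deviation of the top_k highest cohort scores."""
    top = sorted(cohort_scores, reverse=True)[:top_k]
    mu, sigma = statistics.mean(top), statistics.pstdev(top)
    if sigma == 0:
        raise ValueError("cohort scores have zero spread; cannot normalize")
    return mu, sigma


def as_norm(raw_score: float,
            enroll_cohort_scores: Sequence[float],
            test_cohort_scores: Sequence[float],
            top_k: int = 200) -> float:
    """Hypothetical AS-norm: average of the two one-sided, top-K z-normalizations.

    enroll_cohort_scores: scores of the enrollment side against each cohort speaker.
    test_cohort_scores:   scores of the test side against each cohort speaker.
    Only the top_k most similar cohort scores on each side are used ("adaptive").
    """
    mu_e, sigma_e = _top_k_stats(enroll_cohort_scores, top_k)
    mu_t, sigma_t = _top_k_stats(test_cohort_scores, top_k)
    return 0.5 * ((raw_score - mu_e) / sigma_e + (raw_score - mu_t) / sigma_t)


# Toy example: only the two highest cohort scores on each side are kept.
print(as_norm(5.0, [0.0, 2.0, -10.0], [2.0, 4.0], top_k=2))
```

Restricting the statistics to the most similar cohort speakers is what makes the normalization adaptive: it models the impostors that are hardest to distinguish from the speaker under test.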

Partners

EURECOM, LIMSI, Institut de recherche Idiap, Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur
