J 2008

Extraction of audio features specific to speech production for multimodal speaker detection

BESSON, Patricia; Vlad POPOVICI; Jean-Marc VESIN; Jean-Philippe THIRAN; Murat KUNT

Basic information

Original name

Extraction of audio features specific to speech production for multimodal speaker detection

Authors

BESSON, Patricia; Vlad POPOVICI; Jean-Marc VESIN; Jean-Philippe THIRAN and Murat KUNT

Edition

IEEE Transactions on Multimedia, Piscataway, 2008, ISSN 1520-9210

Further information

Language

English

Type of result

Article in a scholarly journal

Confidentiality

not subject to state or trade secrecy

Impact factor

Impact factor: 2.288

Marked for export to RIV

No

Keywords in English

audio features; differential evolution; multimodal; mutual information; speaker detection; speech

Changed: 4 Mar 2013, 15:32, doc. Ing. Vlad Calin Popovici, PhD

Abstract

In the original

A method that exploits an information theoretic framework to extract optimized audio features using video information is presented. A simple measure of mutual information (MI) between the resulting audio and video features allows the detection of the active speaker among different candidates. This method involves the optimization of an MI-based objective function. No approximation is needed to solve this optimization problem, either for the estimation of the probability density functions (pdfs) of the features or for the cost function itself. The pdfs are estimated from the samples using a nonparametric approach. The challenging optimization problem is solved using a global method: the differential evolution algorithm. Two information theoretic optimization criteria are compared, and their ability to extract audio features specific to speech production is discussed. Using these specific audio features, candidate video features are then classified as members of the "speaker" or "non-speaker" class, resulting in a speaker detection scheme. As a result, our method achieves a speaker detection rate of 100% on in-house test sequences, and of 85% on the most commonly used sequences.
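The pipeline the abstract describes, a nonparametric (histogram-based) MI estimate maximized by differential evolution, can be illustrated with a minimal sketch. This is not the paper's implementation: the synthetic audio/video features, the 1-D linear projection being optimized, and all parameter values (population size, mutation factor, crossover rate, bin count) are simplifying assumptions made for the example.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    # Nonparametric MI estimate from a joint 2-D histogram of the samples.
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def differential_evolution(cost, dim, pop_size=30, n_gen=60, f=0.7, cr=0.9, seed=None):
    # Minimal DE/rand/1/bin: a global, derivative-free optimizer.
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1.0, 1.0, (pop_size, dim))
    costs = np.array([cost(p) for p in pop])
    for _ in range(n_gen):
        for i in range(pop_size):
            others = [j for j in range(pop_size) if j != i]
            a, b, c = pop[rng.choice(others, 3, replace=False)]
            mutant = a + f * (b - c)                 # differential mutation
            cross = rng.random(dim) < cr
            cross[rng.integers(dim)] = True          # keep at least one mutant gene
            trial = np.where(cross, mutant, pop[i])
            t_cost = cost(trial)
            if t_cost <= costs[i]:                   # greedy selection
                pop[i], costs[i] = trial, t_cost
    best = int(np.argmin(costs))
    return pop[best], costs[best]

# Synthetic stand-ins: one audio component carries the hidden "speech" source
# that also drives the video feature; the other component is pure noise.
rng = np.random.default_rng(0)
speech = rng.standard_normal(2000)
audio = np.stack([speech + 0.3 * rng.standard_normal(2000),
                  rng.standard_normal(2000)])
video = speech + 0.3 * rng.standard_normal(2000)

# Maximize MI between a linear projection of the audio features and the video
# feature by minimizing its negative with differential evolution.
w, _ = differential_evolution(lambda v: -mutual_information(v @ audio, video),
                              dim=2, seed=1)
print(mutual_information(w @ audio, video) > mutual_information(audio[1], video))
```

The extracted projection concentrates on the speech-related audio component, so its MI with the video feature exceeds that of the noise component, which is the basis for ranking candidates as "speaker" vs. "non-speaker" in a detection scheme of this kind.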