Results

HAL publications of the European project Speech Unit(e)s

2018

Journal articles

titre
Does the Visual Channel Improve the Perception of Consonants Produced by Speakers of French With Down Syndrome?
auteur
Alexandre Hennequin, Amélie Rochet-Capellan, Silvain Gerber, Marion Dohen
article
Journal of Speech, Language, and Hearing Research, American Speech-Language-Hearing Association, 2018, 61 (4), 〈10.1044/2017_JSLHR-H-17-0112〉
resume
Purpose: This work evaluates whether seeing the speaker's face could improve the speech intelligibility of adults with Down syndrome (DS). This is not straightforward because DS induces a number of anatomical and motor anomalies affecting the orofacial zone. Method: A speech-in-noise perception test was used to evaluate the intelligibility of 16 consonants (Cs) produced in a vowel–consonant–vowel context (Vo = /a/) by 4 speakers with DS and 4 control speakers. Forty-eight naïve participants were asked to identify the stimuli in 3 modalities: auditory (A), visual (V), and auditory–visual (AV). The probability of correct responses was analyzed, as well as AV gain, confusions, and transmitted information as a function of modality and phonetic features. Results: The probability of correct response follows the trend AV > A > V, with smaller values for the DS than the control speakers in A and AV but not in V. This trend depended on the C: the V information particularly improved the transmission of place of articulation and to a lesser extent of manner, whereas voicing remained specifically altered in DS. Conclusions: The results suggest that the V information is intact in the speech of people with DS and improves the perception of some phonetic features in Cs in a similar way as for control speakers. This result has implications for further studies, rehabilitation protocols, and specific training of caregivers.
typdoc
Journal articles
DOI
DOI : 10.1044/2017_JSLHR-H-17-0112
Accès au bibtex
BibTex
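The analysis reported in the entry above quantifies AV gain and transmitted information computed from consonant confusion matrices, in the tradition of Miller and Nicely. The authors' exact computation is not given here; the short Python sketch below only illustrates, under that assumption, how transmitted (mutual) information can be estimated from a stimulus-by-response confusion matrix, with made-up counts for three consonants.

    import numpy as np

    def transmitted_information(confusion):
        """Estimate transmitted information (bits) from a stimulus x response
        confusion matrix of counts (Miller & Nicely-style analysis)."""
        p = confusion / confusion.sum()          # joint probabilities p(s, r)
        ps = p.sum(axis=1, keepdims=True)        # stimulus marginals p(s)
        pr = p.sum(axis=0, keepdims=True)        # response marginals p(r)
        nz = p > 0                               # skip empty cells to avoid log(0)
        return float((p[nz] * np.log2(p[nz] / (ps @ pr)[nz])).sum())

    # Hypothetical confusion matrices for 3 consonants in the A and AV modalities.
    conf_A  = np.array([[20., 5., 5.], [6., 18., 6.], [7., 7., 16.]])
    conf_AV = np.array([[27., 2., 1.], [2., 26., 2.], [1., 3., 26.]])

    print("TI (A) :", round(transmitted_information(conf_A), 3), "bits")
    print("TI (AV):", round(transmitted_information(conf_AV), 3), "bits")

Comparing such values per phonetic feature (place, manner, voicing) is the kind of contrast the study uses to locate where the visual channel helps.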
titre
What drives the perceptual change resulting from speech motor adaptation? Evaluation of hypotheses in a Bayesian modeling framework
auteur
Jean-François Patri, Pascal Perrier, Jean-Luc Schwartz, Julien Diard
article
PLoS Computational Biology, Public Library of Science, 2018, 14 (1), 〈10.1371/journal.pcbi.1005942〉
resume
Shifts in perceptual boundaries resulting from speech motor learning induced by perturbations of the auditory feedback were taken as evidence for the involvement of motor functions in auditory speech perception. Beyond this general statement, the precise mechanisms underlying this involvement are not yet fully understood. In this paper we propose a quantitative evaluation of some hypotheses concerning the motor and auditory updates that could result from motor learning, in the context of various assumptions about the roles of the auditory and somatosensory pathways in speech perception. This analysis was made possible thanks to the use of a Bayesian model that implements these hypotheses by expressing the relationships between speech production and speech perception in a joint probability distribution. The evaluation focuses on how the hypotheses can (1) predict the location of perceptual boundary shifts once the perturbation has been removed, (2) account for the magnitude of the compensation in presence of the perturbation, and (3) describe the correlation between these two behavioral characteristics. Experimental findings about changes in speech perception following adaptation to auditory feedback perturbations serve as reference. Simulations suggest that they are compatible with a framework in which motor adaptation updates both the auditory-motor internal model and the auditory characterization of the perturbed phoneme, and where perception involves both auditory and somatosensory pathways. Author summary: Experimental evidence suggests that motor learning influences categories in speech perception. These observations are consistent with studies of arm motor control showing that motor learning alters the perception of arm location in space, and that these perceptual changes are associated with increased connectivity between regions of the motor cortex. Still, the interpretation of experimental findings is severely handicapped by a lack of precise hypotheses about underlying mechanisms.
typdoc
Journal articles
DOI
DOI : 10.1371/journal.pcbi.1005942
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01701562/file/Patri%20et%20al.%20-%202018%20-%20What%20drives%20the%20perceptual%20change%20resulting%20from%20speech%20motor%20adaptation%20Evaluation%20of%20hypotheses%20in%20a%20Bayesian.pdf BibTex
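The Bayesian model discussed in the entry above expresses the relationships between speech production and perception in a joint probability distribution over a phonological unit, motor variables and sensory variables. The paper's exact decomposition is not reproduced here; as an illustrative sketch only, a COSMO-style notation with a unit O, a motor command M and an auditory input A could distinguish an auditory characterization of phonemes P(A | O) from an auditory-motor internal model P(A | M), giving two inference routes:

    \text{auditory route:}\quad P(O \mid a) \;\propto\; P(O)\, P(a \mid O)
    \text{motor route:}\quad    P(O \mid a) \;\propto\; P(O) \sum_{m} P(m \mid O)\, P(a \mid m)

In such a sketch, the hypotheses compared in the paper amount to asking which terms are updated by adaptation, the internal model P(a | m), the auditory characterization P(a | O), or both, and whether the perceptual decision additionally involves a somatosensory branch.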

Theses

titre
Modélisation bayésienne du développement conjoint de la perception, l'action et la phonologie
auteur
Marie-Lou Barnaud
article
Cognitive Science. Grenoble 1 UGA - Université Grenoble Alpes, 2018. French
resume
Phonetic units can be associated with auditory and motor representations. In this thesis, we study the development of this dual representation and its consequences, mainly during perception. We address these questions with computer simulations based on a Bayesian model of communication called COSMO ("Communicating Objects using Sensory-Motor Operations"). Within this model we analyse the auditory and motor learning processes. In light of the simulations, auditory representations appear to be acquired quickly, to rely on exogenous processes and to characterise vowels better. By contrast, motor representations are acquired more slowly, rely on endogenous processes and characterise consonants better. We observe three consequences of these learning differences. First, they point to the possible existence of two complementary pathways during perception: auditory representations would be optimally tuned to recognise standard stimuli, while motor representations would better handle unusual stimuli, in "adverse" communication conditions. We call this the "auditory-narrowband vs. motor-wideband" property. Second, these differences help to explain how inter-speaker variability, known as idiosyncrasies, arises. The simulations suggest that motor representations are acquired through a communicative process rather than a purely imitative one. Finally, these learning differences are used to study more specifically the development of phonetic units. We show that communication can take place even when the two interlocutors have different internal representations, and we propose a version of the model, called COSMO SylPhon, that relates the development of syllables to the development of phonemes. Across these three lines of work, we implemented several versions of our COSMO model on the basis of data from the literature, which we discuss in return in light of the simulations.
typdoc
Theses
Accès au texte intégral et bibtex
https://tel.archives-ouvertes.fr/tel-01706721/file/phdthesis.pdf BibTex

2017

Journal articles

titre
Reanalyzing neurocognitive data on the role of the motor system in speech perception within COSMO, a Bayesian perceptuo-motor model of speech communication
auteur
Marie-Lou Barnaud, Pierre Bessière, Julien Diard, Jean-Luc Schwartz
article
Brain and Language, Elsevier, 2017, 〈10.1016/j.bandl.2017.12.003〉
resume
While neurocognitive data provide clear evidence for the involvement of the motor system in speech perception, its precise role and the way motor information is involved in perceptual decision remain unclear. In this paper, we discuss some recent experimental results in light of COSMO, a Bayesian perceptuo-motor model of speech communication. COSMO enables us to model both speech perception and speech production with probability distributions relating phonological units with sensory and motor variables. Speech perception is conceived as a sensory-motor architecture combining an auditory and a motor decoder thanks to a Bayesian fusion process. We propose the sketch of a neuroanatomical architecture for COSMO, and we capitalize on properties of the auditory vs. motor decoders to address three neurocognitive studies of the literature. Altogether, this computational study reinforces functional arguments supporting the role of a motor decoding branch in the speech perception process.
typdoc
Journal articles
DOI
DOI : 10.1016/j.bandl.2017.12.003
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01669961/file/1-s2.0-S0093934X17300974-main.pdf BibTex
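COSMO's perception model, as summarized in the entry above, combines an auditory decoder and a motor decoder through a Bayesian fusion process. The Python fragment below is only a toy rendering of that kind of fusion over a small discrete set of phonological units; the categories, probability values and conditional-independence assumption are invented for the example and are not taken from the paper.

    import numpy as np

    units = ["/b/", "/d/", "/g/"]
    prior = np.array([1/3, 1/3, 1/3])            # P(O)

    # Hypothetical likelihoods of one observed stimulus under each unit,
    # as delivered by an auditory decoder and a motor decoder.
    lik_auditory = np.array([0.60, 0.30, 0.10])  # P_A(s | O)
    lik_motor    = np.array([0.40, 0.45, 0.15])  # P_M(s | O)

    def fuse(prior, *likelihoods):
        """Naive Bayesian fusion assuming the decoders are conditionally independent."""
        posterior = prior * np.prod(likelihoods, axis=0)
        return posterior / posterior.sum()

    print(dict(zip(units, np.round(fuse(prior, lik_auditory, lik_motor), 3))))

The paper's focus is on the complementary properties of these two decoders rather than on the fusion rule itself, which is why such a sketch is only a starting point.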
titre
Effects of linear and nonlinear speech rate changes on speech intelligibility in stationary and fluctuating maskers
auteur
Martin Cooke, Vincent Aubanel
article
Journal of the Acoustical Society of America, Acoustical Society of America, 2017, 141 (6), pp.4126-4135. 〈10.1121/1.4983826〉
resume
Algorithmic modifications to the durational structure of speech designed to avoid intervals of intense masking lead to increases in intelligibility, but the basis for such gains is not clear. The current study addressed the possibility that the reduced information load produced by speech rate slowing might explain some or all of the benefits of durational modifications. The study also investigated the influence of masker stationarity on the effectiveness of durational changes. Listeners identified keywords in sentences that had undergone linear and nonlinear speech rate changes resulting in overall temporal lengthening in the presence of stationary and fluctuating maskers. Relative to unmodified speech, a slower speech rate produced no intelligibility gains for the stationary masker, suggesting that a reduction in information rate does not underlie intelligibility benefits of durationally modified speech. However, both linear and nonlinear modifications led to substantial intelligibility increases in fluctuating noise. One possibility is that overall increases in speech duration provide no new phonetic information in stationary masking conditions, but that temporal fluctuations in the background increase the likelihood of glimpsing additional salient speech cues. Alternatively, listeners may have benefitted from an increase in the difference in speech rates between the target and background.
typdoc
Journal articles
DOI
DOI : 10.1121/1.4983826
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01615914/file/retimeJASA%20sumitted%2002%20may.pdf BibTex
titre
The complementary roles of auditory and motor information evaluated in a Bayesian perceptuo-motor model of speech perception
auteur
Raphaël Laurent, Marie-Lou Barnaud, Jean-Luc Schwartz, Pierre Bessière, Julien Diard
article
Psychological Review, American Psychological Association, 2017, 〈10.1037/rev0000069〉
resume
There is a consensus concerning the view that both auditory and motor representations intervene in the perceptual processing of speech units. However, the question of the functional role of each of these systems remains seldom addressed and poorly understood. We capitalized on the formal framework of Bayesian Programming to develop COSMO (Communicating Objects using Sensory-Motor Operations), an integrative model that allows principled comparisons of purely motor or purely auditory implementations of a speech perception task and tests the gain of efficiency provided by their Bayesian fusion. Here, we show three main results. (i) In a set of precisely defined “perfect conditions”, auditory and motor theories of speech perception are indistinguishable. (ii) When a learning process that mimics speech development is introduced into COSMO, it departs from these perfect conditions. Then auditory recognition becomes more efficient than motor recognition in dealing with learned stimuli, while motor recognition is more efficient in adverse conditions. We interpret this result as a general “auditory-narrowband vs. motor-wideband” property. (iii) Simulations of plosive-vowel syllable recognition reveal possible cues from motor recognition for the invariant specification of the place of plosive articulation in context, that are lacking in the auditory pathway. This provides COSMO with a second property, where auditory cues would be more efficient for vowel decoding and motor cues for plosive articulation decoding. These simulations provide several predictions, which are in good agreement with experimental data and suggest that there is natural complementarity between auditory and motor processing within a perceptuo-motor theory of speech perception.
typdoc
Journal articles
DOI
DOI : 10.1037/rev0000069
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01484383/file/Laurent_PsychRev_Revised.pdf BibTex
titre
Inside Speech: Multisensory and Modality-specific Processing of Tongue and Lip Speech Actions
auteur
Avril Treille, Coriandre Vilain, Thomas Hueber, Laurent Lamalle, Marc Sato
article
Journal of Cognitive Neuroscience, Massachusetts Institute of Technology Press (MIT Press), 2017, 29, pp.448 - 466. 〈10.1162/jocn_a_01057〉
resume
Action recognition has been found to rely not only on sensory brain areas but also partly on the observer's motor system. However, whether distinct auditory and visual experiences of an action modulate sensorimotor activity remains largely unknown. In the present sparse sampling fMRI study, we determined to which extent sensory and motor representations interact during the perception of tongue and lip speech actions. Tongue and lip speech actions were selected because tongue movements of our interlocutor are accessible via their impact on speech acoustics but not visible because of the tongue's position inside the vocal tract, whereas lip movements are both "audible" and visible. Participants were presented with auditory, visual, and audiovisual speech actions, with the visual inputs related to either a sagittal view of the tongue movements or a facial view of the lip movements of a speaker, previously recorded by an ultrasound imaging system and a video camera. Although the neural networks involved in visuo-lingual and visuo-facial perception largely overlapped, stronger motor and somatosensory activations were observed during visuo-lingual perception. In contrast, stronger activity was found in auditory and visual cortices during visuo-facial perception. Complementing these findings, activity in the left premotor cortex and in visual brain areas was found to correlate with visual recognition scores observed for visuo-lingual and visuo-facial speech stimuli, respectively, whereas visual activity correlated with RTs for both stimuli. These results suggest that unimodal and multi-modal processing of lip and tongue speech actions rely on common sensorimotor brain areas. They also suggest that visual processing of audible but not visible movements induces motor and visual mental simulation of the perceived actions to facilitate recognition and/or to learn the association between auditory and visual signals.
typdoc
Journal articles
DOI
DOI : 10.1162/jocn_a_01057
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01484978/file/JOCN_a_01057-Treille_Proof1_corrected.pdf BibTex
titre
The syllable in the light of motor skills and neural oscillations
auteur
Antje Strauß, Jean-Luc Schwartz
article
Language, Cognition and Neuroscience, Taylor and Francis, 2017, 〈10.1080/23273798.2016.1253852〉
resume
Recent advances in neuroscience have brought a great focus on how the auditory cortex tracks speech at certain time scales corresponding to pre-lexical speech units in order to achieve comprehension. In particular, it has been claimed that it is the syllabic rhythm to which slow neural oscillations in the auditory cortex entrain in order to chunk the speech stream into smaller informational units. However, the terms “syllable” and “rhythm” have been treated quite loosely in the current literature. We revisit classic approaches to show that both concepts do not necessarily have an acoustic or phonetic counterpart, which could be directly extracted by neural processes. We would like to suggest that the syllabic rhythm could emerge at the intersection of acoustic–phonetic and motor knowledge of speech. We furthermore propose that nesting of cortical oscillations might be the key mechanism to understand the timing constraints that lead to the emergence of the syllable.
typdoc
Journal articles
DOI
DOI : 10.1080/23273798.2016.1253852
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01424458/file/Manuscript_syllable_HAL.pdf BibTex
titre
Audiovisual Binding for Speech Perception in Noise and in Aging
auteur
Attigodu Chandrashekara Ganesh, Frédéric Berthommier, Jean-Luc Schwartz
article
Language Learning, Wiley, 2017, 〈10.1111/lang.12271〉
resume
Speech perception involves the fusion of multiple sensory inputs, and this fusion is not automatic: it depends on numerous external/internal factors (e.g., attention, noise or age). In this paper, we exploit a specific paradigm in which a short audiovisual context made of coherent or incoherent speech material is displayed before an incongruent audiovisual target likely to provide fusion (McGurk effect, McGurk & MacDonald, 1976). We confirm that an incoherent context leads to unbinding, that is, a reduction in the amount of fusion. Importantly, adding acoustic noise in the context, though not in the target, increases fusion. This suggests that listeners systematically evaluate the reliability of their sensory channels and weight them accordingly in the fusion process. We also show that older subjects display more unbinding, and discuss the potential consequences concerning their ability to understand speech in adverse conditions. We relate all these data to a “Binding-and-Fusion” model of audiovisual speech perception.
typdoc
Journal articles
DOI
DOI : 10.1111/lang.12271
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01615573/file/Main%20Document_Final_Final.pdf BibTex
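The interpretation proposed in the abstract above, that listeners weight the auditory and visual channels according to their estimated reliability before fusing them, is often formalized as a weighted product of the unimodal evidence. As a generic sketch rather than the paper's specific binding-and-fusion model, with the binding state summarized by a visual weight w_V between 0 and 1:

    P(c \mid a, v) \;\propto\; P(c)\, p_A(a \mid c)\, \big[ p_V(v \mid c) \big]^{\,w_V}

Unbinding by an incoherent context then corresponds to lowering w_V, hence less McGurk fusion, while acoustic noise in the context, by signalling a less reliable auditory channel, would raise the relative weight of the visual input and restore fusion.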
titre
Electrophysiological evidence for a self-processing advantage during audiovisual speech integration
auteur
Avril Treille, Coriandre Vilain, Sonia Kandel, Marc Sato
article
Experimental Brain Research, Springer Verlag, 2017, 235 (9), pp.2867-2876. 〈10.1007/s00221-017-5018-0〉
resume
Previous electrophysiological studies have provided strong evidence for early multisensory integrative mechanisms during audiovisual speech perception. From these studies, one unanswered issue is whether hearing our own voice and seeing our own articulatory gestures facilitate speech perception, possibly through a better processing and integration of sensory inputs with our own sensory-motor knowledge. The present EEG study examined the impact of self-knowledge during the perception of auditory (A), visual (V) and audiovisual (AV) speech stimuli that were previously recorded from the participant or from a speaker he/she had never met. Audiovisual interactions were estimated by comparing N1 and P2 auditory evoked potentials during the bimodal condition (AV) with the sum of those observed in the unimodal conditions (A + V). In line with previous EEG studies, our results revealed an amplitude decrease of P2 auditory evoked potentials in AV compared to A + V conditions. Crucially, a temporal facilitation of N1 responses was observed during the visual perception of self speech movements compared to those of another speaker. This facilitation was negatively correlated with the saliency of visual stimuli. These results provide evidence for a temporal facilitation of the integration of auditory and visual speech signals when the visual situation involves our own speech gestures.
typdoc
Journal articles
DOI
DOI : 10.1007/s00221-017-5018-0
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01616078/file/EXBR-D-17-00084_R2.pdf BibTex
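The EEG analysis summarized above estimates audiovisual interactions by comparing the bimodal response (AV) with the sum of the unimodal responses (A + V) on the N1 and P2 auditory evoked potentials. The sketch below only illustrates that additive-model comparison on synthetic ERP traces; the sampling rate, waveform shape and analysis window are invented, and real analyses involve channel selection, baseline correction and statistics.

    import numpy as np

    fs = 500                                    # sampling rate in Hz (hypothetical)
    t = np.arange(-0.1, 0.5, 1 / fs)            # epoch from -100 ms to +500 ms

    def toy_erp(amplitude, latency):
        """A single N1-like negative deflection centred on `latency` (seconds)."""
        return -amplitude * np.exp(-((t - latency) ** 2) / (2 * 0.015 ** 2))

    erp_A  = toy_erp(4.0, 0.110)                # auditory-only response
    erp_V  = toy_erp(1.0, 0.130)                # visual-only response
    erp_AV = toy_erp(4.5, 0.100)                # bimodal response (earlier N1)

    additive = erp_A + erp_V                    # additive prediction (A + V)
    win = (t > 0.08) & (t < 0.16)               # N1 search window

    latency = lambda x: float(t[win][np.argmin(x[win])])
    print("N1 latency, AV vs A+V:", latency(erp_AV), "s vs", latency(additive), "s")
    print("N1 amplitude, AV vs A+V:", float(erp_AV[win].min()), "vs", float(additive[win].min()))

A shorter AV latency than the additive prediction is the kind of temporal facilitation the study reports when the visual input shows the participant's own speech gestures.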
titre
Auditory and Audiovisual Close-shadowing in Post-Lingually Deaf Cochlear-Implanted Patients and Normal-Hearing Elderly Adults
auteur
Lucie Scarbel, Denis Beautemps, Jean-Luc Schwartz, Marc Sato
article
Ear and Hearing, Lippincott, Williams & Wilkins, 2017, 〈10.1097/AUD.0000000000000474〉
resume
Objectives: The goal of this study was to determine the impact of auditory deprivation and age-related speech decline on perceptuo-motor abilities during speech processing in post-lingually deaf cochlear-implanted participants and in normal-hearing elderly participants.Design: A close-shadowing experiment was carried out on ten cochlear-implanted patients and ten normal-hearing elderly participants, with two groups of normal-hearing young participants as controls. To this end, participants had to categorize auditory and audiovisual syllables as quickly as possible, either manually or orally. Reaction times and percentages of correct responses were compared depending on response modes, stimulus modalities and syllables. Results: Responses of cochlear-implanted subjects were globally slower and less accurate than those of both young and elderly normal-hearing people. Adding the visual modality was found to enhance performance for cochlear-implanted patients, whereas no significant effect was obtained for the normal-hearing elderly group. Critically, oral responses were faster than manual ones for all groups. In addition, for normal-hearing elderly participants, manual responses were more accurate than oral responses, as was the case for normal-hearing young participants when presented with noisy speech stimuli. Conclusions: Faster reaction times were observed for oral than for manual responses in all groups, suggesting that perceptuo-motor relationships were somewhat successfully functional after cochlear implantation, and remain efficient in the normal-hearing elderly group. These results are in agreement with recent perceptuo-motor theories of speech perception. They are also supported by the theoretical assumption that implicit motor knowledge and motor representations partly constrain auditory speech processing. In this framework, oral responses would have been generated at an earlier stage of a sensorimotor loop, whereas manual responses would appear late, leading to slower but more accurate responses. The difference between oral and manual responses suggests that the perceptuo-motor loop is still effective for normal-hearing elderly subjects, and also for cochlear-implanted participants despite degraded global performance.
typdoc
Journal articles
DOI
DOI : 10.1097/AUD.0000000000000474
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01546756/file/E%26H_Scarbel_etal_2017.pdf BibTex

Conference papers

titre
Assessing phonological learning in COSMO, a Bayesian model of speech communication
auteur
Marie-Lou Barnaud, Jean-Luc Schwartz, Julien Diard, Pierre Bessìère
article
EPIROB-ICDL, Oct 2017, Lisbon, Portugal. 2017
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01614145/file/EPIROB_abstract.pdf BibTex
titre
Perceptuo-motor speech units in the brain with COSMO, a Bayesian model of communication
auteur
Marie-Lou Barnaud, Julien Diard, Pierre Bessière, Jean-Luc Schwartz
article
The 11th International Seminar on Speech Production ISSP 2017, Oct 2017, Tianjin, China
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01614179/file/ISSP_abstract.pdf BibTex
titre
Modeling sensory preference in speech motor planning
auteur
Jean-François Patri, Pascal Perrier, Julien Diard
article
11th International Seminar on Speech Production ISSP 2017, Oct 2017, Tianjin, China. 2017
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01614760/file/AbstractISSP2017.pdf BibTex
titre
Contribution of visual rhythmic information to speech perception in noise
auteur
Vincent Aubanel, Cassandra Masters, Jeesun Kim, Chris Davis
article
AVSP 2017 (The 14th International Conference on Auditory-Visual Speech Processing), Aug 2017, Stockholm, Sweden
resume
Visual speech information helps listeners perceive speech in noise. The cues underpinning this visual advantage appear to be global and distributed, and previous research hasn't succeeded in pinning down simple dimensions to explain the effect. In this study we focus on the temporal aspects of visual speech cues. In comparison to a baseline of auditory only sentences mixed with noise, we tested the effect of making available a visual speech signal that carries the rhythm of the spoken sentence, through a temporal visual mask function linked to the times of the auditory p-centers, as quantified by stressed syllable onsets. We systematically varied the relative alignment of the peaks of the maximum exposure of visual speech cues with the presumed anchors of sentence rhythm and contrasted these speech cues against an abstract visual condition, whereby the visual signal consisted of a stylised moving curve with its dynamics determined by the mask function. We found that both visual signal types provided a significant benefit to speech recognition in noise, with the speech cues providing the largest benefit. The benefit was largely independent of the amount of delay in relation to the auditory p-centers. Taken together, the results call for further inquiry into temporal dynamics of visual and auditory speech.
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01615908/file/AVrhythm.pdf BibTex
titre
Perception audio-visuelle de séquences VCV produites par des personnes porteuses de Trisomie 21
auteur
Alexandre Hennequin, Amélie Rochet-Capellan, Marion Dohen
article
Journées Phonétique Clinique, Jun 2017, Paris, France
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01614522/file/Hennequin_JPC2017.pdf BibTex

Poster communications

titre
Perceptual learning of speech produced by a speaker with Down Syndrome
auteur
Alexandre Hennequin, Amélie Rochet-Capellan, Jean-Luc Schwartz, Marion Dohen
article
7th International Conference on Speech Motor Control, Jul 2017, Groningen, Netherlands
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01614528/file/Poster_SMC2017_HennequinEtAl.pdf BibTex
titre
Modeling sensory preference in speech motor planning
auteur
Jean-François Patri, Julien Diard, Pascal Perrier
article
Neural Control of Movement, May 2017, Dublin, Ireland. 2017
resume
Speech is a stream of specific sounds performed by gestures of articulators of the vocal tract. The sensory correlates of speech production are therefore both auditory (concerning sounds) and somatosensory (concerning the position and configuration of articulators of the vocal tract). Since sounds are a consequence of speech gestures, these two sensory correlates appear to be redundant in unperturbed conditions. This raises questions about their functional involvement in the monitoring of speech production: is only one useful, and if so, which one? Are they instead both useful, and if so, are they equivalent or complementary? Experimental studies of compensations for auditory and somatosensory perturbations indicate that both types of sensory information are taken into account during speech production. In addition, individual sensory preferences in speech production have been observed: subjects who compensate less for somatosensory perturbations compensate more for auditory perturbations, and vice versa. Our goal is to understand how sensory preferences can operate during speech production and influence it, by using our recently designed Bayesian model of speech motor planning. To our knowledge, models of speech motor control have generally not addressed this issue since they did not systematically evaluate the consequences of variations in the weight of each modality in the specification of the motor goals. In this work, we present extensions of our original Bayesian model of speech motor planning in which speech units are characterized both in auditory and somatosensory terms. We show that sensory preferences can be modeled in two ways. In the first variant, sensory preferences are attributed to the relative precision of sensory regions characterizing speech motor goals. This is inspired from classical models of multisensory fusion for perception. Under this approach, precisions of sensory regions correspond to their tolerance to perturbations: the smaller the region, the higher the precision and the lower the tolerance to perturbations. In other words, subjects who compensate more to auditory than somatosensory perturbations would have auditory target regions smaller than their somatosensory target regions. However, since auditory and somatosensory consequences of speech gestures are highly correlated, why would these motor goal regions differ so considerably? In the second variant of our model, sensory preferences are the consequence of the precision by which the predicted sensory consequences of motor commands are compared to the sensory characterizations of motor goals. We demonstrate that under specific assumptions, our two implementations of sensory preferences are formally equivalent. This reconciles these two approaches and suggests an alternative and original interpretation of sensory preferences.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01614559/file/2017.04%5BPoster_NCM%5D.pdf BibTex
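The first variant described in the abstract above attributes sensory preferences to the relative precision of the auditory and somatosensory characterizations of the motor goals. Under standard Gaussian assumptions, which are an illustration and not necessarily the authors' exact formulation, if the two target regions have precisions \lambda_A = 1/\sigma_A^2 and \lambda_S = 1/\sigma_S^2, a planning process that trades off the two regions compensates for perturbations with weights proportional to the relative precisions:

    w_A = \frac{\lambda_A}{\lambda_A + \lambda_S}, \qquad w_S = \frac{\lambda_S}{\lambda_A + \lambda_S}

A speaker with a sharper auditory target (larger \lambda_A) would therefore compensate more for auditory than for somatosensory perturbations, matching the observed trade-off; the second variant moves the same precision parameters to the comparison between predicted and target sensory consequences, which is why the two implementations can turn out to be formally equivalent under specific assumptions.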

2016

Journal articles

titre
Phonology in the mirror
auteur
Jean-Luc Schwartz, Marie-Lou Barnaud, Pierre Bessière, Julien Diard, Clément Moulin-Frier
article
Physics of Life Reviews, Elsevier, 2016, 16, pp.93-95. 〈10.1016/j.plrev.2016.01.007〉
resume
The contribution by M.A. Arbib over the years and as it appears summarized and conceptualized in this paper is admirable, extremely impressive, and very convincing in many aspects. A key value of this work is that it systematically attempts to introduce formal conceptualization and modeling in the reasoning about facts and interpretations.
typdoc
Journal articles
DOI
DOI : 10.1016/j.plrev.2016.01.007
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01262293/file/schwartz16.pdf BibTex

Conference papers

titre
Sensorimotor learning in a Bayesian computational model of speech communication
auteur
Marie-Lou Barnaud, Jean-Luc Schwartz, Julien Diard, Pierre Bessière
article
The Sixth Joint IEEE International Conference Developmental Learning and Epigenetic Robotics (ICDL-EPIROB 2016), Sep 2016, Cergy-Pontoise, France. 〈http://www.icdl-epirob.org/〉
resume
Although sensorimotor exploration is a basic process within child development, clear views on the underlying computational processes remain elusive. We propose to compare eight algorithms for sensorimotor exploration, based on three components: "accommodation", which performs a compromise between goal babbling and social guidance by a master; "local extrapolation", which simulates local exploration of the sensorimotor space to achieve motor generalizations; and "idiosyncratic babbling", which favors already explored motor commands when they are efficient. We will show that a mix of these three components offers a good compromise enabling efficient learning while reducing exploration as much as possible.
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01371719/file/epirob-icdl_2016.pdf BibTex
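Among the three components compared in the abstract above, "idiosyncratic babbling" favors already-explored motor commands when they have proved efficient. The Python sketch below is only a schematic rendering of that idea; the selection rule, threshold and data structures are invented for the illustration and are not taken from the paper.

    import random

    class IdiosyncraticBabbler:
        """Toy exploration policy: reuse a motor command that already reached the
        goal well enough, otherwise draw a new random command to explore."""

        def __init__(self, success_threshold=0.8):
            self.memory = {}                      # (goal, command) -> best observed score
            self.success_threshold = success_threshold

        def choose(self, goal):
            # Reuse an efficient, already-explored command for this goal if one exists.
            known = [(m, s) for (g, m), s in self.memory.items()
                     if g == goal and s >= self.success_threshold]
            if known:
                return max(known, key=lambda ms: ms[1])[0]
            return random.uniform(-1.0, 1.0)      # otherwise explore a new command

        def record(self, goal, command, score):
            key = (goal, command)
            self.memory[key] = max(score, self.memory.get(key, 0.0))

In the paper this bias is mixed with "accommodation" (a compromise between goal babbling and guidance by a master) and "local extrapolation" (generalizing around explored commands), and it is the mix of the three that yields efficient learning with limited exploration.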
titre
Bayesian Modeling in Speech Motor Control: A Principled Structure for the Integration of Various Constraints
auteur
Jean-François Patri, Pascal Perrier, Julien Diard
article
17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Sep 2016, San Francisco, United States. 2016, pp.3588-3592, 2016, Proc. Interspeech 2016. 〈10.21437/Interspeech.2016-441〉
resume
Speaking involves sequences of linguistic units that can be produced under different sets of control strategies. For instance, a given phoneme can be achieved with different acoustic properties, and a sequence of phonemes can be performed at different speech rates and with different prosodies. How does the Central Nervous System select a specific control strategy among all the available ones? In a previously published article we proposed a Bayesian model that addressed this question with respect to the multiplicity of acoustic realizations of a sequence of phonemes. One of the strengths of Bayesian modeling is that it is well adapted to the combination of multiple constraints. In the present paper we illustrate this feature by defining an extension of our previous model that includes force constraints related to the level of effort for the production of phoneme sequences, as could be the case in clear versus casual speech. The integration of this additional constraint is used to model the control of articulation clarity. The pertinence of the results is illustrated by controlling a biomechanical model of the vocal tract for speech production.
typdoc
Conference papers
DOI
DOI : 10.21437/Interspeech.2016-441
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01378928/file/patri16.pdf BibTex
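The abstract above emphasizes that Bayesian modeling is well suited to combining constraints: adding an effort-related constraint amounts to multiplying one more probability term into the planning distribution. As a schematic sketch, with variable names chosen for the illustration and not necessarily matching the paper, for motor commands M, an auditory realization A of the phoneme sequence \Phi and an effort variable E:

    P(M \mid \Phi, E) \;\propto\; P(M)\, P(A \in R_\Phi \mid M)\, P(E \mid M)

Tightening or relaxing the effort term P(E \mid M) then shifts the selected motor commands between clear and casual articulation, which is the behavior the paper illustrates by driving a biomechanical model of the vocal tract.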
titre
Audiovisual Speech Scene Analysis in the Context of Competing Sources
auteur
Attigodu Ganesh, Frédéric Berthommier, Jean-Luc Schwartz
article
17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Sep 2016, San Francisco, United States. Proceedings Interspeech 2016, 2016, pp.47 - 51, 2016, Proceedings Interspeech 2016. 〈10.21437/Interspeech.2016-62〉
resume
Audiovisual fusion in speech perception is generally conceived as a process independent from scene analysis, which is supposed to occur separately in the auditory and visual domain. On the contrary, we have been proposing in the last years that scene analysis such as what takes place in the cocktail party effect was an audiovisual process. We review here a series of experiments illustrating how audiovisual speech scene analysis occurs in the context of competing sources. Indeed, we show that a short contextual audiovisual stimulus made of competing auditory and visual sources modifies the perception of a following McGurk target. We interpret this in terms of binding, unbinding and rebinding processes, and we show how these processes depend on audiovisual correlations in time, attentional processes and differences between junior and senior participants.
typdoc
Conference papers
DOI
DOI : 10.21437/Interspeech.2016-62
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01405509/file/0062anav.pdf BibTex
titre
Assessing Idiosyncrasies in a Bayesian Model of Speech Communication
auteur
Marie-Lou Barnaud, Julien Diard, Pierre Bessière, Jean-Luc Schwartz
article
17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Sep 2016, San Francisco, United States. Interspeech 2016, 2016, 〈10.21437/Interspeech.2016-396〉
resume
Although speakers of one specific language share the same phoneme representations, their productions can differ. We propose to investigate the development of these differences in production, called idiosyncrasies, by using a Bayesian model of communication. Supposing that idiosyncrasies appear during the development of the motor system, we present two versions of the motor learning phase, both based on the guidance of a master agent: "a repetition model", where agents try to imitate the sounds produced by the master, and "a communication model", where agents try to replicate the phonemes produced by the master. Our experimental results show that only the "communication model" provides production idiosyncrasies, suggesting that idiosyncrasies are a natural output of a motor learning process based on a communicative goal.
typdoc
Conference papers
DOI
DOI : 10.21437/Interspeech.2016-396
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01371722/file/396_file_Paper.pdf BibTex
titre
Does Auditory-Motor Learning of Speech Transfer from the CV Syllable to the CVCV Word?
auteur
Tiphaine Caudrelier, Pascal Perrier, Jean-Luc Schwartz, Amélie Rochet-Capellan
article
17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Sep 2016, San Francisco, United States. Interspeech 2016 proceedings, 2016, pp.2095 - 2099, 2016, 〈10.21437/Interspeech.2016-262〉
resume
Speech is often described as a sequence of units associating linguistic, sensory and motor representations. Is the connection between these representations preferentially maintained at a specific level in terms of a linguistic unit? In the present study, we contrasted the possibility of a link at the level of the syllable (CV) and the word (CVCV). We modified the production of the syllable /be/ in French speakers using an auditory-motor adaptation paradigm that consists of altering the speakers' auditory feedback. After stopping the perturbation, we studied to what extent this modification would transfer to the production of the disyllabic word /bebe/ and compared it to the after-effect on /be/. The results show that changes in /be/ transfer partially to /bebe/. The partial influence of the somatosensory and motor representations associated with the syllable on the production of the disyllabic word suggests that both units may contribute to the specification of the motor goals in speech sequences. In addition, the transfer occurs to a larger extent in the first syllable of /bebe/ than in the second one. It raises new questions about a possible interaction between the transfer of auditory-motor learning and serial control processes.
typdoc
Conference papers
DOI
DOI : 10.21437/Interspeech.2016-262
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01391406/file/Interspeech2016_Tiphaine_v12.pdf BibTex
titre
Auditory-Visual Perception of VCVs Produced by People with Down Syndrome: Preliminary Results
auteur
Alexandre Hennequin, Amélie Rochet-Capellan, Marion Dohen
article
17th Annual Conference of the International Speech Communication Association (Interspeech 2016), Sep 2016, San Francisco, United States. Interspeech 2016 proceedings, 2016, 〈10.21437/Interspeech.2016-1198〉
resume
Down Syndrome (DS) is a genetic disease involving a number of anatomical, physiological and cognitive impairments. More particularly it affects speech production abilities. This results in reduced intelligibility which has however only been evaluated auditorily. Yet, many studies have demonstrated that adding vision to audition helps perception of speech produced by people without impairments especially when it is degraded as is the case in noise. The present study aims at examining whether the visual information improves intelligibility of people with DS. 24 participants without DS were presented with VCV sequences (vowel-consonant-vowel) produced by four adults (2 with DS and 2 without DS). These stimuli were presented in noise in three modalities: auditory, auditory-visual and visual. The results confirm a reduced auditory intelligibility of speakers with DS. They also show that, for the speakers involved in this study, visual intelligibility is equivalent to that of speakers without DS and compensates for the auditory intelligibility loss. An analysis of the perceptual errors shows that most of them involve confusions between consonants. These results put forward the crucial role of multimodality in the improvement of the intelligibility of people with DS.
typdoc
Conference papers
DOI
DOI : 10.21437/Interspeech.2016-1198
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01368410/file/HennequinEtAl2016.pdf BibTex
titre
Modélisation bayésienne de la planification motrice des gestes de parole : Évaluation du rôle des différentes modalités sensorielles
auteur
Jean-François Patri, Julien Diard, Pascal Perrier
article
31e Journées d'Études sur la Parole (JEP-TALN-RECITAL 2016), Jul 2016, Paris, France. Actes de la conférence JEP-TALN 2016, pp.419-427
resume
A growing number of experimental results show that both auditory and proprioceptive information are taken into account in speech motor control. However, production models most often impose one or the other modality, or offer no formal framework for evaluating their respective contributions. We propose to explore the role of these sensory modalities in the planning of speech gestures using a Bayesian model representing the structure of the knowledge involved in this task. The model allows three planning mechanisms to be considered, relying on the auditory modality, the proprioceptive modality, or both jointly. We compare simulations obtained with the first two planning mechanisms. The results indicate different articulatory realizations that nevertheless yield qualitatively similar auditory realizations in terms of variability.
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01345199/file/JEP2016_vf.pdf BibTex
titre
De bé à bébé : le transfert d'apprentissage auditori-moteur pour interroger l'unité de production de la parole
auteur
Tiphaine Caudrelier, Pascal Perrier, Jean-Luc Schwartz, Christophe Savariaux, Amélie Rochet-Capellan
article
31e Journées d'Études sur la Parole (JEP-TALN-RECITAL 2016), Jul 2016, Paris, France
resume
Speech is often described as a sequencing of units associating linguistic, sensory and motor representations. Is the link between these representations preferentially established at the level of a specific unit? For instance, is it the syllable or the word? In this study, we aim to contrast these two hypotheses. To do so, we modified the production of the syllable "bé" in French speakers using an auditory-motor adaptation paradigm, which consists of perturbing the auditory feedback. We then studied how this modification transfers to the production of the word "bébé". The results suggest a link between linguistic and motor representations at several levels, both the word and the syllable. They also show an influence of the position of the syllable within the word on the transfer, which raises new questions about the serial control of speech.
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01343390/file/jeptaln2016_TiphaineCaudrelier_reviewed.pdf BibTex
titre
Perception audio-visuelle de séquences VCV produites par des personnes porteuses de Trisomie 21 : une étude préliminaire
auteur
Alexandre Hennequin, Amélie Rochet-Capellan, Marion Dohen
article
31e Journées d'Études sur la Parole (JEP-TALN-RECITAL 2016), Jul 2016, Paris, France. Actes de la conférence conjointe JEP-TALN-RECITAL 2016, 1
resume
The speech of people with Down syndrome (DS) shows a systematic reduction in intelligibility, which has so far only been quantified auditorily. Yet the visual modality could improve intelligibility, as it does for people without DS. This study compares how 24 participants without DS perceive VCV (vowel-consonant-vowel) sequences produced by four adults (2 with DS and 2 without DS) and presented in noise in the auditory, visual and audiovisual modalities. The results confirm the loss of auditory intelligibility for the speakers with DS. For the two speakers involved, visual intelligibility is nevertheless equivalent to that of the two speakers without DS and compensates for the auditory intelligibility deficit. These results suggest that the visual modality contributes to better intelligibility of people with DS.
typdoc
Conference papers
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01348019/file/JEP_2016_HennequinEtAl.pdf BibTex

Poster communications

titre
Phoneme categorization depends on production abilities during the first year of life
auteur
M Dole, Hélène Loevenbruck, O Pascalis, Jean-Luc Schwartz, Anne Vilain
article
International Conference on Infant Studies (ICIS 2016), May 2016, New Orleans, LA, United States
resume
A long-standing debate, still not resolved in the field of speech communication, concerns the nature of the representations underlying speech perception. On the one hand, auditory theories (e.g., Diehl et al., 2004) claim that the basic units in speech perception are purely auditory, whereas the Motor Theory (Liberman et al., 1962; Galantucci et al., 2006) proposes that speech perception involves motor representations. Recently Schwartz et al. (2012), in a perceptuo-motor theory, claimed that perceptual and motor representations both play a role in the processing of speech units. To better understand the development of perceptuo-motor interactions during the first year of life, we examined the influence of speech production abilities on phonemic categorization in infants. We used an intersensory matching procedure in order to evaluate infants’ ability to bind auditory and visual information about a consonant category into a single representation. 6- to 12-month-old French infants were familiarized with auditory syllables with different vowel contexts (e.g., /be/-/bi/-/bu/). In the test phase, two side-by-side silent videos of faces repeatedly pronouncing consonants in a new vowel context (/ba/ on one side and /da/ on the other side) were presented and looking times (LTs) to each video were compared. In this protocol, infants who are able to extract the common (e.g., labial) gesture in the audio syllables should be able to relate it to the same gesture in the visual stimuli and should show different LTs for the two test stimuli (/ba/ vs. /da/). Speech production abilities of each of the 6- to 12-month-old infants were assessed using a parental questionnaire. Infants were assigned to one of three production groups, Non Babbling, i.e. infants who did not produce the /b/-/d/ consonants, Canonical Babbling, i.e. infants who produced the consonants with only one vowel (e.g. ‘bababa’ or ‘dadada’), or Variegated Babbling, i.e. infants who produced the consonants with different vowels (e.g. ‘babibu’). We expected better categorization and better auditory-visual association in infants with greater production experience (i.e., infants in the Babbling phase), than in infants with fewer productions (Non Babbling infants). Results showed no main effect of age; however, 9-month-old infants showed a significant categorization effect (one-sample t-test p<0.05) whereas 6- and 12-month-olds did not. When taking production abilities into account, infants in the Variegated Babbling phase exhibited better categorization abilities than infants in the Canonical Babbling phase or Non Babbling infants. This suggests that greater production abilities are linked to better perception abilities; however, this result could be linked to general language abilities. To eliminate this possibility and validate our hypothesis, we plan to test 6- to 12-month-old infants using the same procedure with a /v/ vs. /z/ contrast, involving consonants that most French infants should not be able to produce yet. We expect an absence of audio-visual association with this contrast in all infants. The absence of audio-visual association with unproduced consonants together with the occurrence of audio-visual association with frequently produced consonants, would be a strong argument in favor of the development of a perceptuo-motor link during the first year of life. Taken together, these studies should allow us to better assess the role of motor knowledge in the development of speech perception.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01489951/file/poster_ICIS.pdf BibTex
titre
Tes mots me touchent : étude des apports de la modalité tactile dans la perception de la parole
auteur
Avril Treille, Coriandre Vilain, Marc Sato
article
INSHEA - Colloque international « Toucher pour apprendre, toucher pour communiquer », Mar 2016, Paris, France
resume
Speech results from the action of specific, more or less visible articulators. It is through this complex process that the sounds forming the linguistic message emerge. Speech is often said to be audio-visual, but we forget that it is also made of movements that can be touched. It is notably thanks to this property that people who are both deaf and blind are able to communicate. They use the Tadoma method, which consists of placing a hand on the speaker's face, the thumb vertically over the lips and the other fingers along the jaw, in order to feel the movements of the sounds being produced. While the mechanisms of fusion of the auditory and visual modalities have been widely studied in normally hearing and sighted subjects, no study had addressed the fusion of information from hearing and touch, one of the senses most used in everyday life but rarely used to perceive speech. Are we able to decode a linguistic message from new, previously unknown tactile information? Are the mechanisms used to integrate these two modalities similar to those used in audio-visual speech perception? To answer these questions, we drew on the Tadoma method to carry out two electroencephalography experiments on the audio-tactile perception of the syllables /pa/ and /ta/ (experiment 1, Treille et al., 2014a) and /pa/, /ta/ and /ka/ (experiment 2, Treille et al., 2014b) in a population of healthy, normally hearing subjects. Our results show, first, that naive subjects are able to identify tactually the syllables pronounced by the experimenter, suggesting that they use their motor knowledge of speech production to facilitate the decoding of the tactile information of the perceived movements. Second, we also showed the existence of integration mechanisms similar to those used to fuse auditory and visual information, through electrophysiological markers specific to integration processes, in particular a temporal facilitation of auditory processing when the visual or tactile modalities are added, as well as a reduction of the neural response for bimodal stimuli compared with the auditory-only condition. Taken together, these results highlight the remarkable ability of our brain to call on our sensory and motor knowledge to best process unknown information, such as that coming from touch, in order to achieve a form of communication. A new study in preparation should make it possible to identify, using a virtual lesion of motor regions, whether our motor system is indeed involved in the mechanisms of audio-tactile integration of speech.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01298235/file/INSHEA_AvrilTreille_ERC.pdf BibTex

2015

Journal articles

titre
Somatosensory Event-related Potentials from Orofacial Skin Stretch Stimulation
auteur
Takayuki Ito, David J Ostry, Vincent Gracco
article
Journal of visualized experiments, 2015, 106, pp.e53621. 〈http://www.jove.com/video/53621〉. 〈10.3791/53621〉
resume
Cortical processing associated with orofacial somatosensory function in speech has received limited experimental attention due to the difficulty of providing precise and controlled stimulation. This article introduces a technique for recording somatosensory event-related potentials (ERP) that uses a novel mechanical stimulation method involving skin deformation using a robotic device. Controlled deformation of the facial skin is used to modulate kinesthetic inputs through excitation of cutaneous mechanoreceptors. By combining somatosensory stimulation with electroencephalographic recording, somatosensory evoked responses can be successfully measured at the level of the cortex. Somatosensory stimulation can be combined with the stimulation of other sensory modalities to assess multisensory interactions. For speech, orofacial stimulation is combined with speech sound stimulation to assess the contribution of multi-sensory processing including the effects of timing differences. The ability to precisely control orofacial somatosensory stimulation during speech perception and speech production with ERP recording is an important tool that provides new insight into the neural organization and neural representations for speech. Video Link The video component of this article can be found at http://www.jove.com/video/53621
typdoc
Journal articles
DOI
DOI : 10.3791/53621
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01247041/file/Ito_text_finalBK.pdf BibTex
titre
Audio-visual speech scene analysis: Characterization of the dynamics of unbinding and rebinding the McGurk effect
auteur
Olha Nahorna, Frédéric Berthommier, Jean-Luc Schwartz
article
Journal of the Acoustical Society of America, Acoustical Society of America, 2015, 137 (1), pp.362-377. 〈10.1121/1.4904536〉
resume
While audiovisual interactions in speech perception have long been considered as automatic, recent data suggest that this is not the case. In a previous study, Nahorna et al. [(2012). J. Acoust. Soc. Am. 132, 1061–1077] showed that the McGurk effect is reduced by a previous incoherent audiovisual context. This was interpreted as showing the existence of an audiovisual binding stage controlling the fusion process. Incoherence would produce unbinding and decrease the weight of the visual input in fusion. The present paper explores the audiovisual binding system to characterize its dynamics. A first experiment assesses the dynamics of unbinding, and shows that it is rapid: An incoherent context less than 0.5 s long (typically one syllable) suffices to produce a maximal reduction in the McGurk effect. A second experiment tests the rebinding process, by presenting a short period of either coherent material or silence after the incoherent unbinding context. Coherence provides rebinding, with a recovery of the McGurk effect, while silence provides no rebinding and hence freezes the unbinding process. These experiments are interpreted in the framework of an audiovisual speech scene analysis process assessing the perceptual organization of an audiovisual speech input before decision takes place at a higher processing stage.
typdoc
Journal articles
DOI
DOI : 10.1121/1.4904536
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01213897/file/Nahorna_JASA_Binding2_second_rev_V2_proofs.pdf BibTex
titre
COSMO (“Communicating about Objects using Sensory–Motor Operations”): A Bayesian modeling framework for studying speech communication and the emergence of phonological systems
auteur
Clément Moulin-Frier, Julien Diard, Jean-Luc Schwartz, Pierre Bessière
article
Journal of Phonetics, Elsevier, 2015, 53, pp.5-41. 〈http://www.sciencedirect.com/science/article/pii/S0095447015000352〉. 〈10.1016/j.wocn.2015.06.001〉
resume
While the origin of language remains a somewhat mysterious process, understanding how human language takes specific forms appears to be accessible by the experimental method. Languages, despite their wide variety, display obvious regularities. In this paper, we attempt to derive some properties of phonological systems (the sound systems for human languages) from speech communication principles. We introduce a model of the cognitive architecture of a communicating agent, called COSMO (for “Communicating about Objects using Sensory–Motor Operations”) that allows a probabilistic expression of the main theoretical trends found in the speech production and perception literature. This enables a computational comparison of these theoretical trends, which helps us to identify the conditions that favor the emergence of linguistic codes. We present realistic simulations of phonological system emergence showing that COSMO is able to predict the main regularities in vowel, stop consonant and syllable systems in human languages.
typdoc
Journal articles
DOI
DOI : 10.1016/j.wocn.2015.06.001
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01230175/file/moulinfrier2015cosmo.pdf BibTex
titre
Multisensory and sensorimotor interactions in speech perception
auteur
Kaisa Tiippana, Riikka Möttönen, Jean-Luc Schwartz
article
Frontiers in Psychology, Frontiers, 2015, 〈10.3389/fpsyg.2015.00458〉
resume
This research topic presents speech as a natural, well-learned, multisensory communication signal, processed by multiple mechanisms. Reflecting the general status of the field, most articles focus on audiovisual speech perception and many utilize the McGurk effect, which arises when discrepant visual and auditory speech stimuli are presented (McGurk and MacDonald, 1976). Tiippana (2014) argues that the McGurk effect can be used as a proxy for multisensory integration provided it is not interpreted too narrowly. Several articles shed new light on audiovisual speech perception in special populations. It is known that individuals with autism spectrum disorder (ASD, e.g., Saalasti et al., 2012) or language impairment (e.g., Meronen et al., 2013) are generally less influenced by the talking face than peers with typical development. Here Stevenson et al. (2014) propose that a deficit in multisensory integration could be a marker of ASD, and a component of the associated deficit in communication. However, three studies suggest that integration is not deficient in some communication disorders. Irwin and Brancazio (2014) show that children with ASD looked less at the mouth region, resulting in poorer visual speech perception and consequently weaker visual influence. Leybaert et al. (2014) report that children with specific language impairment recognized visual and auditory speech less accurately than their controls, affecting audiovisual speech perception, while audiovisual integration per se seemed unimpaired. In a similar vein, adult patients with aphasia showed unisensory deficits but still integrated audiovisual speech information (Andersen and Starrfelt, 2015). Multisensory information can influence response accuracy and processing speed (e.g., Molholm et al., 2002; Klucharev et al., 2003). Scarbel et al. (2014) show that oral responses to speech in noise were faster but less accurate than manual responses, suggesting that oral responses are planned at an earlier stage than manual responses. Sekiyama et al. (2014) show that older adults were more influenced by visual speech than younger adults and correlated this fact to their slower reaction times to auditory stimuli. Altieri and Hudock (2014) report variation in reaction time and accuracy benefits for audiovisual speech in hearing-impaired observers, emphasizing the importance of individual differences in integration. Finally, Heald and Nusbaum (2014) show that when there were two possible talkers instead of just one, audiovisual information appeared to distract the observer from the task of word recognition and slowed down their performance. This finding demonstrates that multisensory stimulation does not always facilitate performance. While multisensory stimulation is thought to be beneficial for learning (Shams and Seitz, 2008), evidence for this is still scarce. In the current research topic, the overall utility of multisensory learning is called into question. In a paradigm training participants to associate novel words and pictures, Bernstein et al. (2014) show no benefit of audiovisual presentation compared with auditory presentation for normal hearing individuals, and even a degradation for adults with hearing impairment. In a study of cued speech, i.e., specific hand-signs for different speech sounds, Bayard et al. (2014) demonstrate that individuals with hearing impairment used the visual cues differently from their controls, even though both groups were experts in cued speech. Kelly et al. (2014)
typdoc
Journal articles
DOI
DOI : 10.3389/fpsyg.2015.00458
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01214067/file/fpsyg-06-00458_Tiippana_Mottonen_Schwartz.pdf BibTex
titre
On the cognitive nature of speech sound systems
auteur
Jean-Luc Schwartz, Clément Moulin-Frier, Pierre-Yves Oudeyer
article
Journal of Phonetics, Elsevier, 2015, On the cognitive nature of speech sound systems, 53, pp.1-4. 〈10.1016/j.wocn.2015.09.008〉
resume
During the last 50 years, the question of the cognitive nature of phonological units has followed the rhythm of the persistent debate between auditory and motor theories of speech communication. Though recent advances in cognitive neuroscience and cognitive psychology have largely renewed this debate, a consensus is still out of reach, and the true nature of speech units in the human brain remains elusive. A dimension of importance in this debate is a systemic one: speech units are not isolated, they are part of a phonological system, and they obey structural principles regarding well-investigated properties as distinctiveness, compositionality, contextual dependencies or systemic regularities. The phonological system itself is also part of a complex network of interaction with low-level biomechanical and sensory-motor systems, with higher-level brain structures regulating cognition, emotion and motivation, and finally with the social structures in which all these systems are embedded. Connecting assumptions or theories about the nature of speech units with a structuralist view about the relationship between phonetic properties and phonological systems has given rise to a number of major breakthroughs in speech science, for instance Lindblom’s bridges between the Variable Adaptive Theory (or its Hyper-Hypo variant) of speech communication (Lindblom, 1990) and the Dispersion Theory of vowel systems (Lindblom, 1986); or Stevens’ Quantal Theory (Stevens, 1972, 1989) addressing both the invariance issue and the search for the origins of distinctiveness and phonetic features; or the tandem between the Motor Theory of Speech Perception (Liberman & Mattingly, 1985) and Articulatory Phonology (Browman & Goldstein, 1992) in the Haskins Labs. This Special Issue is centered around a target paper by Moulin-Frier et al. that aims at relating the question of the auditory vs. motor vs. perceptuo-motor nature of speech units with simulations of vowel, plosive and syllable systems of human languages emerging from agent interactions, in a computational Bayesian framework. In this context, the papers in the special issue explore further the systemic perspective, studying how various dimensions of physical, cognitive, motivational and interactional systems can inform our understanding of the origins of speech forms.
typdoc
Journal articles
DOI
DOI : 10.1016/j.wocn.2015.09.008
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01222752/file/Introduction_SI_final.pdf BibTex
titre
Optimal speech motor control and token-to-token variability: a Bayesian modeling approach
auteur
Jean-François Patri, Julien Diard, Pascal Perrier
article
Biological Cybernetics (Modeling), Springer Verlag, 2015, 109 (6), pp.611-626. 〈http://link.springer.com/article/10.1007/s00422-015-0664-4〉. 〈10.1007/s00422-015-0664-4〉
resume
The remarkable capacity of the speech motor system to adapt to various speech conditions is due to an excess of degrees of freedom, which enables producing similar acoustical properties with different sets of control strategies. To explain how the Central Nervous System selects one of the possible strategies, a common approach, in line with optimal motor control theories, is to model speech motor planning as the solution of an optimality problem based on cost functions. Despite the success of this approach, one of its drawbacks is the intrinsic contradiction between the concept of optimality and the observed experimental intra-speaker token-to-token variability. The present paper proposes an alternative approach by formulating feedforward optimal control in a probabilistic Bayesian modeling framework. This is illustrated by controlling a biomechanical model of the vocal tract for speech production and by comparing it with an existing optimal control model (GEPPETO). The essential elements of this optimal control model are presented first. From them the Bayesian model is constructed in a progressive way. Performance of the Bayesian model is evaluated based on computer simulations and compared to the optimal control model. This approach is shown to be appropriate for solving the speech planning problem while accounting for variability in a principled way.
typdoc
Journal articles
DOI
DOI : 10.1007/s00422-015-0664-4
Accès au texte intégral et bibtex
http://hal.univ-grenoble-alpes.fr/hal-01221738/file/Patri_BiolCyb_Final.pdf BibTex
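
The entry above contrasts deterministic cost-based planning with a Bayesian formulation in which token-to-token variability follows naturally from posterior uncertainty. The sketch below is only a one-dimensional caricature of that contrast, with an invented quadratic cost, an arbitrary "temperature" and made-up values; it is not the GEPPETO model or the authors' implementation.

# Deterministic optimal control picks argmin(cost) and always produces the same token,
# while a probabilistic planner samples motor commands m with probability proportional
# to exp(-cost(m) / T), yielding variable productions around the same optimum.
import numpy as np

rng = np.random.default_rng(1)
m_grid = np.linspace(-2.0, 2.0, 401)           # hypothetical 1-D motor command space
target = 0.7                                   # hypothetical sensory/acoustic goal

def cost(m):
    return (m - target) ** 2                   # invented quadratic cost

# Deterministic optimal control: a single, repeatable token.
m_opt = m_grid[np.argmin(cost(m_grid))]

# Probabilistic planning: posterior over m, then sample productions from it.
T = 0.05                                       # controls the amount of variability
post = np.exp(-cost(m_grid) / T)
post /= post.sum()
tokens = rng.choice(m_grid, size=10, p=post)   # ten variable productions

print("argmin token:", round(m_opt, 3))
print("sampled tokens:", np.round(tokens, 3))
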

Conference papers

titre
Integration of auditory, labial and manual signals in cued speech perception by deaf adults: an adaptation of the McGurk paradigm
auteur
Clémence Bayard, Jacqueline Leybaert, Cécile Colin
article
1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing (FAAVSP 2015), Sep 2015, Vienna, Austria
resume
Among deaf individuals fitted with a cochlear implant, some use Cued Speech (CS; a system in which each syllable is uttered with a complementary manual gesture) and therefore have to combine auditory, labial and manual information to perceive speech. We examined how audio-visual (AV) speech integration is affected by the presence of manual cues, and which form of information (auditory, labial or manual) CS perceivers primarily rely on depending on labial ambiguity. To address this issue, deaf CS users (N=36) and deaf CS-naïve (N=35) participants completed an identification task with two AV McGurk stimuli (either with a plosive or with a fricative consonant). Manual cues were congruent with either auditory information, lip information or the expected fusion. Results revealed that deaf individuals can merge audio and labial information into a single unified percept. Without manual cues, participants gave a high proportion of fusion responses (particularly with ambiguous plosive McGurk stimuli). Results also suggested that manual cues can modify AV integration and that their impact differs between plosive and fricative McGurk stimuli.
typdoc
Conference papers
Accès au bibtex
BibTex
titre
Audiovisual binding in speech perception
auteur
Jean-Luc Schwartz
article
1st Joint Conference on Facial Analysis, Animation and Auditory-Visual Speech Processing (FAAVSP 2015), Sep 2015, Vienna, Austria. Audiovisual Speech Processing (FAAVSP 2015), 2015
resume
Over the last few years in Grenoble we have developed a series of experimental studies attempting to show that audiovisual speech perception comprises an “audiovisual binding” stage before fusion and decision. This stage would be in charge of extracting and associating the auditory and visual cues corresponding to a given speech source, before further categorisation processes take place at a higher stage. We developed paradigms to characterize audiovisual binding in terms of both “streaming” and “chunking” adequate pieces of information. This leads to elements of a possible computational model, in relation to a larger theoretical perceptuo-motor framework for speech perception, the “Perception-for-Action-Control” Theory.
typdoc
Conference papers
Accès au bibtex
BibTex
titre
Le liage audiovisuel en perception de la parole, données et questions pour la modélisation neuronale
auteur
Jean-Luc Schwartz
article
Neurostic, Jul 2015, Paris, France. 2015
resume
Sensory systems were long considered autonomous cortical processing modules, with multisensory interaction and fusion processes taking place only later, in the associative areas and beyond. We now know that multisensory interactions are early and massive, and involve neural binding mechanisms probably based on principles of synchronization and multiplexed coding. I will introduce the issues at stake in the domain of speech processing. I will describe the behavioral results we have recently obtained, which provide evidence for audiovisual binding processes occurring before fusion. I will describe the two-stage “binding and fusion” architecture we have proposed, and relate it to recent results on the neural mechanisms underlying the processing of audiovisual speech signals.
typdoc
Conference papers
Accès au bibtex
BibTex
titre
Modeling concurrent development of speech perception and production in a Bayesian framework
auteur
Marie-Lou Barnaud, Raphaël Laurent, Pierre Bessière, Julien Diard, Jean-Luc Schwartz
article
Workshop on Infant Language Development (WILD), Jun 2015, Stockholm, Sweden. 2015
resume
It is widely accepted that motor and auditory processes interact in speech perception, but little is known about the functional role motor processes play in the development of speech perception. To address this question we consider a Bayesian model of speech perception development based on three sets of variables: motor representations M, sensory representations S and objects O (e.g. phonological units such as phonemes). The model comprises two internal branches. Firstly, an auditory identification sub-system connects S and O. Secondly, a motor sub-system connecting M and O and a sensori-motor model connecting M and S can be combined to provide “motor identification” of sounds S, from S to M and from M to O, in an analysis-by-synthesis process. Development is modeled as a learning process in which a master iteratively produces a sensory percept S associated with an object O. The learning agent updates its auditory sub-system by observing S and O. Update of the two other branches is more complex and based on an imitation phase. The learning agent estimates a likely motor action M from input S, produces this M and observes the resulting sound S’. M, S’ and O are used to update both the motor sub-system (M, O) and the sensori-motor model (S, M). We show that the auditory identification sub-system learns rapidly, and becomes efficient for stimuli close to those provided by the master, although it generalizes poorly. By contrast, the two other sub-systems evolve more slowly, and in consequence the motor identification system performs less accurately. However, motor identification happens to have captured more variable situations during learning, and generalizes better (e.g. in noise). This is in line with a developmental schedule in which auditory processing is mature before motor knowledge (Kuhl et al, 2008) and is exploited by infants after 11 months of age for analysis-by-synthesis of unusual speech stimuli (Kuhl et al., 2014).
typdoc
Conference papers
Accès au bibtex
BibTex
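
The abstract above outlines a learning schedule in which the auditory branch is updated directly from master-provided (S, O) pairs, while the motor and sensorimotor branches are updated through an imitation step. The loop below is a schematic sketch under strong simplifying assumptions (discrete variables, count-based estimators, an invented "world" mapping from gestures to sounds); it only illustrates the two update paths, not the authors' model.

# Schematic learning loop (illustrative only): a master emits (s, o) pairs; the agent
# updates P(O|S) directly, and updates P(M|O) and P(S|M) by imitation: it infers a motor
# gesture m for s, "produces" it, observes s', and learns from the (m, s', o) triplet.
import numpy as np

rng = np.random.default_rng(2)
n_O, n_M, n_S = 3, 3, 4

# Hypothetical "world": the master's articulation and a fixed motor-to-sound mapping.
master_M_given_O = np.eye(n_O, n_M) * 0.8 + 0.1              # invented
world_S_given_M = rng.dirichlet(np.ones(n_S) * 2, size=n_M)  # invented

# Learner: Laplace-smoothed count tables for the three branches.
counts_SO = np.ones((n_S, n_O))     # auditory branch   -> P(O|S)
counts_MO = np.ones((n_M, n_O))     # motor branch      -> P(M|O)
counts_SM = np.ones((n_S, n_M))     # sensorimotor map  -> P(S|M)

for _ in range(2000):
    o = rng.integers(n_O)
    m_master = rng.choice(n_M, p=master_M_given_O[o] / master_M_given_O[o].sum())
    s = rng.choice(n_S, p=world_S_given_M[m_master])

    counts_SO[s, o] += 1                                     # direct auditory update

    # Imitation: pick the motor gesture currently believed most likely to yield s.
    P_S_given_M = counts_SM / counts_SM.sum(axis=0, keepdims=True)
    m = int(np.argmax(P_S_given_M[s]))
    s_prime = rng.choice(n_S, p=world_S_given_M[m])          # agent's own production
    counts_MO[m, o] += 1                                     # motor branch update
    counts_SM[s_prime, m] += 1                               # sensorimotor update

print("auditory decoder P(O|S):\n", (counts_SO / counts_SO.sum(axis=1, keepdims=True)).round(2))
print("motor branch P(M|O):\n", (counts_MO / counts_MO.sum(axis=0, keepdims=True)).round(2))
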

Directions of work or proceedings

titre
On the cognitive nature of speech sound systems
auteur
Jean-Luc Schwartz, Clément Moulin-Frier, Pierre-Yves Oudeyer
article
Journal of Phonetics - Special issue: "On the cognitive nature of speech sound systems", 53, Elsevier, pp.1-175, 2015, ISSN 0095-4470
resume
During the last 50 years, the question of the cognitive nature of phonological units has followed the rhythm of the persistent debate between auditory and motor theories of speech communication. Though recent advances in cognitive neuroscience and cognitive psychology have largely renewed this debate, a consensus is still out of reach, and the true nature of speech units in the human brain remains elusive. A dimension of importance in this debate is a systemic one: speech units are not isolated, they are part of a phonological system, and they obey structural principles regarding well-investigated properties as distinctiveness, compositionality, contextual dependencies or systemic regularities. The phonological system itself is also part of a complex network of interaction with low-level biomechanical and sensory-motor systems, with higher-level brain structures regulating cognition, emotion and motivation, and finally with the social structures in which all these systems are embedded. Connecting assumptions or theories about the nature of speech units with a structuralist view about the relationship between phonetic properties and phonological systems has given rise to a number of major breakthroughs in speech science, for instance Lindblom’s bridges between the Variable Adaptive Theory (or its Hyper-Hypo variant) of speech communication (Lindblom, 1990) and the Dispersion Theory of vowel systems (Lindblom, 1986); or Stevens’ Quantal Theory (Stevens, 1972, 1989) addressing both the invariance issue and the search for the origins of distinctiveness and phonetic features; or the tandem between the Motor Theory of Speech Perception (Liberman & Mattingly, 1985) and Articulatory Phonology (Browman & Goldstein, 1992) in the Haskins Labs. This Special Issue is centered around a target paper by Moulin-Frier et al. that aims at relating the question of the auditory vs. motor vs. perceptuo-motor nature of speech units with simulations of vowel, plosive and syllable systems of human languages emerging from agent interactions, in a computational Bayesian framework. In this context, the papers in the special issue explore further the systemic perspective, studying how various dimensions of physical, cognitive, motivational and interactional systems can inform our understanding of the origins of speech forms.
typdoc
Directions of work or proceedings
Accès au bibtex
BibTex

Poster communications

titre
Speech in the mirror? Neurobiological correlates of self speech perception
auteur
Avril Treille, Coriandre Vilain, Sonia Kandel, Jean-Luc Schwartz, Marc Sato
article
Seventh Annual Meeting of the Society for the Neurobiology of Language, Oct 2015, Chicago, United States. 〈http://www.neurolang.org/future-meetings/〉
resume
Self-awareness and self-recognition during action observation may partly result from a functional matching between action and perception systems. This perception-action interaction enhances the integration between sensory inputs and our own sensory-motor knowledge. We present combined EEG and fMRI studies examining the impact of self-knowledge on multisensory integration mechanisms. More precisely, we investigated this impact during auditory, visual and audio-visual speech perception. Our hypothesis was that hearing and/or viewing oneself talk would facilitate the bimodal integration process and activate sensory-motor maps to a greater extent than observing others. In both studies, half of the stimuli presented the participants’ own productions (self condition) and the other half presented an unknown speaker (other condition). For the “self” condition, we recorded videos of each participant producing /pa/, /ta/ and /ka/ syllables. In the “other” condition, we recorded videos of a speaker the participants had never met producing the same syllables. These recordings were then presented in different modalities: auditory only (A), visual only (V), audio-visual (AV) and incongruent audiovisual (AVi – incongruency referred to different speakers for the audio and video components). In the EEG experiment, 18 participants had to categorize the syllables. In the fMRI experiment, 12 participants had to listen to and/or passively view the syllables. In the EEG session, audiovisual interactions were estimated by comparing auditory N1/P2 ERPs during bimodal responses (AV) with the sum of the responses in A and V only conditions (A+V). The amplitude of P2 ERPs was lower for AV than A+V. Importantly, latencies for N1 ERPs were shorter for the “Visual-self” condition than the “Visual-other”, regardless of signal type. In the fMRI session, the presentation modality had an impact on brain activation: activation was stronger for audio or audiovisual stimuli in the superior temporal auditory regions (A = AV = AVi > V), and for video or audiovisual stimuli in MT/V5 and in the premotor cortices (V = AV = AVi > A). In addition, brain activity was stronger in the “self” than the “other” condition both in the left posterior inferior frontal gyrus and the cerebellum (lobules I-IV). In line with previous studies on multimodal speech perception, our results point to the existence of integration mechanisms of auditory and visual speech signals. Critically, they further demonstrate a processing advantage when the perceptual situation involves our own speech production. In addition, hearing and/or viewing oneself talk increased activation in the left posterior IFG and cerebellum. These regions are generally responsible for predicting sensory outcomes of action generation. Altogether, these results suggest that viewing our own utterances leads to a temporal facilitation of auditory and visual speech integration. Moreover, processing afferent and efferent signals in sensory-motor areas leads to self-awareness during speech perception.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01297700/file/NLC_self_EEG%26IRMf_poster_FINAL.pdf BibTex
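
Several of the EEG entries in this list, including the one above, estimate audiovisual interaction by comparing the bimodal response (AV) with the sum of the unimodal responses (A+V). The snippet below illustrates that additive-model comparison on synthetic ERP-like data; the epoch layout, measurement window and amplitudes are illustrative assumptions, not values from the study.

# Additive-model test sketch: compare the bimodal response AV with the sum of unimodal
# responses A + V. A reliable difference (e.g., a smaller P2 for AV than for A+V) is
# taken as evidence of audiovisual interaction. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n_trials, n_samples = 100, 600                  # assumed epochs: -100..499 ms at 1 kHz
times_ms = np.arange(n_samples) - 100

def synthetic_erp(amp_p2):
    """Invented ERP: a P2-like positive peak near 200 ms plus trial noise."""
    p2 = amp_p2 * np.exp(-((times_ms - 200) ** 2) / (2 * 40 ** 2))
    return (p2 + rng.normal(0, 0.5, size=(n_trials, n_samples))).mean(axis=0)

erp_A = synthetic_erp(2.0)
erp_V = synthetic_erp(1.0)
erp_AV = synthetic_erp(2.3)                     # sub-additive: 2.3 < 2.0 + 1.0

p2_win = (times_ms >= 150) & (times_ms <= 250)

def p2_amp(erp):
    return erp[p2_win].max()

print("P2 amplitude  AV:", round(p2_amp(erp_AV), 2), " A+V:", round(p2_amp(erp_A + erp_V), 2))
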
titre
From Sensorimotor Experience To Speech Unit - Adaptation to altered auditory feedback in speech to assess transfer of learning in complex serial movements
auteur
Tiphaine Caudrelier, Jean-Luc Schwartz, Pascal Perrier, Christophe Savariaux, Amélie Rochet-Capellan
article
SFN - Neuroscience, Oct 2015, Chicago, United States
resume
Using bird song as a model to understand generalization in motor learning, Hoffmann and Sober recently found that adaptation to pitch-shift of birds’ vocal output transferred to the production of the same sounds embedded in a different serial context (J. Neurosci. 2014). In humans, speech learning has been found to transfer as a function of the acoustical similarity between the training and the testing utterances (Cai et al. 2010, Rochet-Capellan et al. 2012) but it is unclear if transfer of learning is sensitive to serial order. We investigate the effects of serial order on transfer of speech motor learning using non-word sequences of CV syllables. Three groups of native speakers of French were trained to produce the syllable /be/ repetitively while their auditory feedback was altered in real time toward /ba/. They were then tested for transfer toward /be/ (control), /bepe/ or /pebe/ under normal feedback conditions. The training utterance was then produced again to test for after-effects. The auditory shift was achieved in real time using Audapter software (Cai et al. 2008). Adaptation and transfer effects were quantified in terms of changes in formant frequencies of the vowel /e/, as a function of its position and the preceding consonant in the utterance. Changes in formant frequencies in a direction opposite to the shift were significant for ~80% of the participants. Adaptation was still significant for the three groups in the after-effect block. Transfer effects in the /bepe/ and /pebe/ groups were globally smaller than those of the control group, particularly when the vowel /e/ came after /p/ and/or was in second position in the utterance. Taken together, the results suggest that transfer of speech motor learning is not homogeneous and, as observed by Hoffmann and Sober, depends on the serial context of a sound within the utterance. References: Cai S, Boucek M, Ghosh SS, Guenther FH, Perkell JS. (2008). A system for online dynamic perturbation of formant frequencies and results from perturbation of the Mandarin triphthong /iau/. In Proceedings of the 8th Intl. Seminar on Speech Production, Strasbourg, France, Dec. 8-12, 2008. pp. 65. Cai, S., Ghosh, S. S., Guenther, F. H., & Perkell, J. S. (2010). Adaptive auditory feedback control of the production of formant trajectories in the Mandarin triphthong /iau/ and its pattern of generalization. The Journal of the Acoustical Society of America, 128(4), 2033-2048. Hoffmann, L. A., & Sober, S. J. (2014). Vocal generalization depends on gesture identity and sequence. The Journal of Neuroscience, 34(16), 5564-5574. Rochet-Capellan, A., Richer, L., & Ostry, D. J. (2012). Nonhomogeneous transfer reveals specificity in speech motor learning. Journal of Neurophysiology, 107(6), 1711-1717.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01221491/file/poster%20T.Caudrelier%20SFN.pdf BibTex
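
In altered-feedback paradigms such as the one above, adaptation and transfer are typically quantified as changes in produced formant frequencies relative to a baseline, signed against the direction of the applied shift. The sketch below shows one such computation on made-up F1 values; the normalization choice and numbers are illustrative assumptions, not the authors' analysis pipeline.

# Sketch of quantifying adaptation to a formant shift: compare vowel formants produced at
# baseline, at the end of the shifted-feedback phase, and in a transfer block under normal
# feedback. Compensation is a change opposite to the applied shift. All values are invented.
import numpy as np

shift_direction = +1          # feedback F1 pushed upward (e.g., /be/ heard closer to /ba/)
baseline_F1 = np.array([420.0, 418.0, 425.0, 422.0])          # Hz, made-up tokens
hold_F1     = np.array([395.0, 392.0, 398.0, 390.0])          # end of perturbation phase
transfer_F1 = np.array([405.0, 408.0, 402.0, 407.0])          # new utterance, normal feedback

def compensation(block_F1):
    """Mean F1 change relative to baseline, signed so that opposing the shift is positive."""
    return -shift_direction * (block_F1.mean() - baseline_F1.mean())

adapt = compensation(hold_F1)
transfer = compensation(transfer_F1)
print(f"adaptation: {adapt:.1f} Hz, transfer: {transfer:.1f} Hz "
      f"({100 * transfer / adapt:.0f}% of adaptation)")
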
titre
Visual lip information supports auditory word segmentation
auteur
Antje Strauß, Christophe Savariaux, Sonia Kandel, Jean-Luc Schwartz
article
FAAVSP 2015, Sep 2015, Vienna, Austria
resume
Word segmentation is one of the initial processes that needs to be solved when acquiring the first or learning a second language. Acoustic cues like the fundamental frequency and segment durations have been shown to facilitate the detection of word boundaries. The role of visual speech and in particular of lip movements in word segmentation is still rather unknown. In French, liaisons, e.g. between determiner and noun, often pose a problem of several segmentation possibilities (e.g., the sequence /lafiS/ with liaison ("l’affiche") means the poster whereas without liaison ("la fiche") it means the file.). Here, we use 17 ambiguous French sequences with and without liaison. They were presented in carrier sentences either with clear acoustic cues for the first or the second segmentation possibility or with ambiguous acoustic cues. The three audio conditions were combined with lip movements hyper-articulating either the first or the second segmentation possibility in order to observe the influence of visual information on segmentation. The participants had to indicate as quickly as possible which of the two versions they understood (e.g., "l’affiche" or "la fiche"?). Results show that lip information indeed biases the word segmentation decision. These data provide important implications for audiovisual integration processes.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://halshs.archives-ouvertes.fr/halshs-01298351/file/Strauss_AVSP2015.pdf BibTex
titre
Auditory-visual Perception of VCVs Produced by People with Down Syndrome: a Preliminary Study
auteur
Alexandre Hennequin, Amélie Rochet-Capellan, Marion Dohen
article
FAAVSP 2015, Sep 2015, Vienna, Austria. 〈http://faavsp2015.ftw.at〉
resume
Down syndrome (DS) is the most frequent genetic disorder in humans and is present throughout society. When questioned about their child’s speech, all parents of a child with DS report speech intelligibility issues [Kumin, 2006]. People with DS actually have better receptive than expressive speech abilities [Kumin, 2006]. Improving speech production of people with DS is an important aspect of their quality of life. Understanding how perception of speech produced by people with DS could be improved could also have positive effects on their social integration. Speech difficulties in people with DS originate from anatomical and physiological specificities as well as motor impairments and appear in early childhood. For example, people with DS have a smaller vocal tract and their tongue is bigger relatively to the size of their oral cavity. Other anatomical and perceptual specificities affect their ability to produce speech (see [Kent and Vorperian, 2013] for a review). All these specificities must not only have acoustical consequences but also visual ones. To our knowledge no study has explored auditory-visual perception of speech produced by people with DS whereas it is well known that speech perception benefits from the addition of vision especially in disturbed conditions (for example in noise: [Sumby and Pollack, 1954]). This study aims at exploring if and how vision can improve the perception, by “ordinary” people, of speech produced by people with DS.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01238878/file/posterFAAVSP2015_HennequinEtAl.pdf BibTex
titre
A Bayesian framework for speech motor control
auteur
Jean-François Patri, Julien Diard, Pascal Perrier, Jean-Luc Schwartz
article
Workshop: Probabilistic Inference and the Brain, Sep 2015, Paris, France. 〈http://pibrainconf.sciencesconf.org/〉
resume
The remarkable capacity of the speech motor system to adapt to various speech conditions is due to an excess of degrees of freedom, which enables producing similar acoustical properties with different sets of control strategies. To explain how the Central Nervous System selects one of the possible strategies, a common approach, in line with optimal motor control theories, is to model speech motor planning as the solution of an optimality problem based on cost functions. Despite the success of this approach, one of its drawbacks is the intrinsic contradiction between the concept of optimality and the observed experimental intra-speaker token-to-token variability. The present paper proposes an alternative approach by formulating feedforward optimal control in a probabilistic Bayesian modeling framework. This is illustrated by controlling a biomechanical model of the vocal tract for speech production and by comparing it with an existing optimal control model (GEPPETO). The essential elements of this optimal control model are presented first. From them the Bayesian model is constructed in a progressive way. Performance of the Bayesian model is evaluated based on computer simulations and compared to the optimal control model. This approach is shown to be appropriate for solving the speech planning problem while accounting for variability in a principled way.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01297922/file/2015.05.22%5BPoster_MCX%5D.pdf BibTex
titre
Modeling the concurrent development of speech perception and production in a Bayesian framework
auteur
Marie-Lou Barnaud, Julien Diard, Pierre Bessière, Jean-Luc Schwartz
article
Workshop: Probabilistic Inference and the Brain, Sep 2015, Paris, France. 2015
resume
It is widely accepted that both motor and auditory processes interact in the brain during speech perception, but little is known about the functional role played by motor processes. To address this question we consider a Bayesian model of speech communication based on three sets of variables: motor representations M, sensory representations S and objects O (e.g. phonological units such as phonemes). The model comprises two internal branches. Firstly, an auditory identification sub-system connects S and O. Secondly, a motor production sub-system connecting M and O and a sensory-motor sub-system connecting M and S can be combined to provide “motor identification” of sounds S, from S to M and from M to O, in an analysis-by-synthesis process. The auditory identification sub-system, the motor production sub-system and the sensory-motor sub-system are learned in a supervised learning scenario, in which a master agent provides sensory signals s and their respective object o. Learning the auditory identification system is straightforward using experimental <s, o> pairs. On the other hand, learning the motor sub-system is more complicated. The learning agent infers motor gestures in an “accommodation process”: the learning agent tries to reproduce the input sensory signal s by selecting a motor gesture m. Performing m yields s’, the resulting sensory output. Triplets <m, s’, o> are used to update the parameters of the motor identification system. We show that the direct inference process involved in auditory identification provides rapid and efficient learning but generalizes poorly. By contrast, the more complex inference process required in motor identification learns more slowly and performs less accurately. However, this system happens to have captured more variable situations during learning, and generalizes better (e.g. in noise). This could provide the basis for a complementarity between auditory and motor identification systems in the human brain.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01202420/file/Poster.pdf BibTex
titre
Modeling the concurrent development of speech perception and production in a Bayesian framework
auteur
Marie-Lou Barnaud, Julien Diard, Pierre Bessière, Jean-Luc Schwartz
article
EPIROB-ICDL, Aug 2015, Providence, United States. 2015, 〈http://www.icdl-epirob.org/〉
resume
It is now widely accepted that there is a functional relationship between the speech perception and production systems in the human brain. However, the precise mechanisms and role of this relationship still remain debated. The questions of invariance and robustness in categorization are at the center of the debate: how is stable information extracted from the variable sensory input in order to achieve speech comprehension? In this context, auditory (resp. motor, perceptuo-motor) theories propose that speech is categorized thanks to auditory (resp. motor, perceptuo-motor) processes. However, experimental evidence is still scarce and does not allow us to clearly distinguish between the current theories or to determine whether invariance in speech perception is of an auditory or motor type. This is why we developed COSMO, a Bayesian model comparing sensory and motor processes in the form of probability distributions which enable both theoretical developments and quantitative simulations. A first significant result in COSMO is an indistinguishability theorem: it is only by simulations of adverse conditions or partial learning that the specificity of sensory vs. motor processing can emerge and provide a basis for evaluation of the specific role of each sub-system. We present the COSMO model and how its sensory and motor sub-systems are learned, and then describe simulations exploring the way these sub-systems differ during speech categorization. We discuss the experimental results in the light of a “narrowband vs. wideband” interpretation: the sensory sub-system is more precisely tuned to the frequently learned sensory input and hence more efficient in recognizing these inputs, providing a “narrowband” system. Conversely, the motor sub-system is less accurate at recognizing learned sensory inputs, but it has better generalization properties, making it more robust to unexpected variability, which provides it with “wideband” characteristics.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01202418/file/Poster.pdf BibTex
titre
Neural correlates of auditory-somatosensory interaction in speech perception
auteur
Takayuki Ito, Vincent Gracco, David J Ostry
article
16th International Multisensory Research Forum, Jun 2015, Pisa, Italy. 2015
resume
Speech perception is known to rely on both auditory and visual information. However, sound-specific somatosensory input has also been shown to influence speech perceptual processing (Ito et al., 2009). In the present study we further examined the relationship between somatosensory information and speech perceptual processing by testing the hypothesis that the temporal relationship between orofacial movement and sound processing contributes to somatosensory-auditory interaction in speech perception. We examined the changes in event-related potentials in response to multisensory synchronous (simultaneous) and asynchronous (90 ms lag and lead) somatosensory and auditory stimulation compared to individual unisensory auditory and somatosensory stimulation alone. We used a robotic device to apply facial skin somatosensory deformations that were similar in timing and duration to those experienced in speech production. Following synchronous multisensory stimulation the amplitude of the event-related potential was reliably different from the two unisensory potentials. More importantly, the magnitude of the event-related potential difference varied as a function of the relative timing of the somatosensory-auditory stimulation. Event-related activity change due to stimulus timing was seen between 160-220 ms following somatosensory onset, mostly around the parietal area. The results demonstrate a dynamic modulation of somatosensory-auditory convergence and suggest that the contribution of somatosensory information to speech processing depends on the specific temporal order of sensory inputs in speech production.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01214668/file/Ito_IMRF2015A4.pdf BibTex
titre
Electrophysiological evidence for audio-visuo-lingual speech integration
auteur
Coriandre Vilain, Avril Treille, Marc Sato
article
IMRF 2015 - 16th international multisensory research forum, Jun 2015, Pisa, Italy. 〈http://www.pisavisionlab.org/imrf2015/〉
resume
Audio-visual speech perception is a special case of multisensory processing that interfaces with the linguistic system. One important issue is whether cross-modal interactions only depend on well-known auditory and visuo-facial modalities or, rather, might also be triggered by other sensory sources less common in speech communication. The present EEG study aimed at investigating cross-modal interactions not only between auditory, visuo-facial and audio-visuo-facial syllables but also between auditory, visuo-lingual and audio-visuo-lingual syllables. Eighteen adults participated in the study, none of them being experienced with visuo-lingual stimuli. The stimuli were acquired by means of a camera and an ultrasound system, synchronized with the acoustic signal. At the behavioral level, visuo-lingual syllables were recognized far above chance, although to a lower degree than visuo-labial syllables. At the brain level, audiovisual interactions were estimated by comparing the EEG responses to the multisensory stimuli (AV) to the combination of responses to the stimuli presented in isolation (A+V). For both visuo-labial and visuo-lingual syllables, a reduced latency and a lower amplitude of P2 auditory evoked potentials were observed for AV compared to A+V. Apart from this sub-additive effect, a reduced amplitude of N1 and a higher amplitude of P2 were also observed for lingual compared to labial movements. Although participants were not experienced with visuo-lingual stimuli, our results demonstrate that they were able to recognize them and provide the first evidence for audio-visuo-lingual speech interactions. These results further emphasize the multimodal nature of speech perception and likely reflect the impact of listener's knowledge of speech production.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01297678/file/IMRF-EEG-tongue-Final.pdf BibTex
titre
Seeing our own voice: an electrophysiological study of audiovisual speech integration during self perception
auteur
Avril Treille, Coriandre Vilain, Sonia Kandel, Marc Sato
article
IMRF 2015 - 16th international multisensory research forum, Jun 2015, Pisa, Italy. 〈http://www.pisavisionlab.org/imrf2015/〉
resume
Recent studies suggest that better recognition of one's actions may result from the integration of sensory inputs with our own sensory-motor knowledge. However, whether hearing our voice and seeing our articulatory gestures facilitate audiovisual speech integration is still debated. The present EEG study examined the impact of self-knowledge during the perception of auditory, visual and audiovisual syllables that were previously recorded by a participant or a speaker he/she had never met. Audiovisual interactions were estimated on eighteen participants by comparing the EEG responses to the multisensory stimuli (AV) to the combination of responses to the stimuli presented in isolation (A+V). An amplitude decrease of early P2 auditory evoked potentials was observed during AV compared to A+V. Moreover, shorter latencies of N1 auditory evoked potentials were also observed for self-related visual stimuli compared to those of an unknown speaker. In line with previous EEG studies on multimodal speech perception, our results point to the existence of early integration mechanisms of auditory and visual speech information. Crucially, they also provide evidence for a processing advantage when the perceptual situation involves our own speech productions. Viewing our own utterances leads to a temporal facilitation of the integration of auditory and visual speech signals.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01297672/file/IMRF-EEG-self-Final.pdf BibTex
titre
Perceptual abilities in relation with motor development in the first year of life
auteur
Marjorie Dole, Hélène Loevenbruck, Olivier Pascalis, Jean-Luc Schwartz, Anne Vilain
article
Workshop on Infant Language Development (WILD) 2015, Jun 2015, Stockholm, Sweden. Book of abstracts of the Workshop on Infant Language Development (WILD) 2015, 〈http://www.stockholmbabylab.se/WILD2015/〉
typdoc
Poster communications
Accès au bibtex
BibTex
titre
Perceptual abilities in relation with motor development during the first year of life
auteur
Marjorie Dole, Hélène Loevenbruck, Olivier Pascalis, Jean-Luc Schwartz, Anne Vilain
article
WILD 2015 - Second Workshop on Infant Language Development, Jun 2015, Stockholm, Sweden
resume
To better understand the development of perceptuo-motor interactions during the first year of life we designed two studies evaluating the influence of speech production abilities on phonemic categorization. In a first study we use a visual fixation paradigm to evaluate infants’ consonant categorization in different vowel contexts. Auditory stimuli are presented via a loudspeaker located behind a screen. A /d/-/g/ contrast is employed; infants are habituated with one member of the pair associated with different vowels (/do/-/di/-/du/). When reaching the criterion of 60% of the mean looking time (LT) for the first three trials, they are presented with consonants in a new context (/da/ and /ga/). We compare LTs between familiar and novel consonants. Infants who are able to extract the common consonant (here /d/) in the different vocalic contexts should show different LTs for the two test stimuli. In a second study infants’ ability to link auditory and visual information on a consonant category into a single representation will be tested using an intersensory matching procedure. Infants will be familiarized with auditory syllables with different vowel contexts (/bo/-/bi/-/bu/). In the test phase, two side-by-side silent videos of faces repeatedly pronouncing consonants in a new vowel context (/ba/ on one side and /da/ on the other) will be presented and LTs to each video will be compared. Infants who are able to extract the common gesture in the audio syllables should be able to relate it to the same gesture in the visual stimuli and show different LTs for the two test stimuli (/ba/ vs /da/). For both studies the speech production abilities of each of the 6- to 12-month-old infants are assessed using a parental questionnaire. We expect better categorization and better auditory-visual association in infants who can produce the target consonants than in those who cannot. These studies will allow us to assess the role of motor knowledge in the development of speech perception.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01301013/file/PosterWILD_final.pdf BibTex

2014

Poster communications

titre
To integrate the unknown: touching your lips, hearing your tongue, seeing my voice
auteur
Avril Treille, Coriandre Vilain, Jean-Luc Schwartz, Marc Sato
article
International conference on Auditory Cortex, Sep 2014, Magdeburg, Germany
resume
Seeing the articulatory gestures of the speaker significantly enhances auditory speech perception. A key issue is whether cross-modal speech interactions only depend on well-known auditory and visual modalities or, rather, might also be triggered by other sensory sources less common in speech communication. The present electro-encephalographic (EEG) and functional magnetic resonance imaging (fMRI) studies aimed at investigating cross-modal interactions between auditory, haptic, visuo-facial and visuo-lingual speech signals during the perception of others’ and our own productions. In a first EEG study (n=16), auditory evoked potentials were compared during auditory, audio-visual and audio-haptic speech perception through natural dyadic interactions between a listener and a speaker. Shortened latencies and reduced amplitude of early auditory evoked potentials were observed during both audio-visual and audio-haptic speech perception compared to auditory speech perception, providing evidence for early integrative mechanisms between auditory, visual and haptic information. In a second fMRI study (n=12), the neural substrates of cross-modal binding during auditory, visual and audio-visual speech perception in relation to either facial or tongue movements of a speaker (recorded by a camera and an ultrasound system, respectively) were determined. In line with a sensorimotor nature of speech perception, common overlapping activity was observed for both facial and tongue-related speech stimuli in the posterior part of the superior temporal gyrus/sulcus as well as in the premotor cortex and in the inferior frontal gyrus. In a third EEG study (n=17), auditory evoked potentials were compared during the perception of auditory, visual and audio-visual stimuli related to our own speech gestures or those of a stranger. Apart from a reduced amplitude of early auditory evoked potentials during audio-visual compared to auditory and visual speech perception, a self-advantage was also observed with shortened latencies of early auditory evoked potentials for self-related speech stimuli. Altogether our results provide evidence for bimodal interactions between auditory, haptic, visuo-facial and visuo-lingual speech signals. They further emphasize the multimodal nature of speech perception and demonstrate that multisensory speech perception is partly driven by sensory predictability and by the listener’s knowledge of speech production.
typdoc
Poster communications
Accès au texte intégral et bibtex
https://hal.archives-ouvertes.fr/hal-01298240/file/Auditory_cortex_poster.pdf BibTex