Mask estimation for missing data speech recognition based on statistics of binaural interaction

Sue Harding*, Member, IEEE, Jon Barker, and Guy J. Brown

Abstract—This paper describes a perceptually motivated computational auditory scene analysis (CASA) system that combines sound separation according to spatial location with the 'missing data' approach for robust speech recognition in noise. Missing data time-frequency masks are created using probability distributions based on estimates of interaural time and level differences (ITD and ILD) for mixed utterances in reverberated conditions; these masks indicate which regions of the spectrum constitute reliable evidence of the target speech signal. A number of experiments compare the relative efficacy of the binaural cues when used individually and in combination. We also investigate the ability of the system to generalise to acoustic conditions not encountered during training. Performance on a continuous digit recognition task using this method is found to be good, even in a particularly challenging environment with three concurrent male talkers.

Index Terms—binaural, CASA, missing data, automatic speech recognition, ILD, ITD, reverberation.

I. INTRODUCTION

DESPITE much progress in recent years, robust automatic speech recognition (ASR) in noisy and reverberant environments remains a challenging problem. This is most apparent when one considers the relative performance of ASR systems and human listeners on the same speech recognition task. Word error rates for ASR systems can be an order of magnitude greater than those for human listeners, and the differences are largest when the speech is contaminated by background noise or room reverberation [1]. What aspects of the auditory system give rise to this advantage, and can the underlying mechanisms be incorporated into our computational systems in order to improve their robustness?

One obvious characteristic of human listeners is that they have two ears, whereas ASR systems typically take their input from a single audio channel. Binaural hearing underlies a number of important auditory functions (see [2] for a review). Firstly, human listeners are able to localise sounds in space, principally by measuring differences between the time of arrival and sound level at the two ears. These cues are referred to as interaural time differences (ITDs) and interaural level differences (ILDs). Secondly, binaural mechanisms suppress echoes and therefore counteract the effects of reverberation [3]. Finally, binaural hearing contributes to the ability of listeners to attend to a target sound source in the presence of other interfering sources. Evidence for this has come from psychophysical studies, which show that the intelligibility of two overlapping speech signals increases as the spatial separation between them increases [4]. A number of processes may be involved in this respect. For example, listeners may simply attend to the ear in which the signal-to-noise ratio (SNR) is most favourable. In addition, more complex mechanisms may be involved, in which binaural comparisons are used to cancel interfering sound sources [5] or actively group acoustic energy which originates from the same location in space [6].

*This work was funded by EPSRC grant GR/R47400/01. The authors are with the Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, U.K. Tel.: +44 114 222 1800; Fax: +44 114 222 1810. Email: {s.harding, j.barker, g.brown}@dcs.shef.ac.uk. EDICS 1-RECO - Speech Recognition.

The notion of auditory grouping is a key component of Bregman's account of auditory scene analysis (ASA), an influential theory of the processes by which listeners segregate a target sound source from an acoustic mixture [7]. Bregman's work has stimulated interest in the development of computational auditory scene analysis (CASA) systems, which attempt to mimic the sound separation abilities of human listeners. A number of these systems have exploited binaural cues in order to separate a desired talker from spatially separated interferers [8], [9], [10], [11], [12].

In this paper, we describe a CASA system which exploits spatial location cues in order to improve the robustness of ASR in multisource, reverberant environments. Our approach is implemented within the 'missing data' framework for ASR [13], and consists of two processing stages. In the first stage, acoustic features (spectral energies) and binaural cues (ITD and ILD) are derived from an auditory model. The binaural cues are used to estimate a time-frequency mask, in which each element indicates whether the corresponding acoustic feature constitutes reliable evidence of the target speech signal or not. In the second stage, the acoustic features and the time-frequency mask are passed to a missing data ASR system, which treats reliable and unreliable features differently during decoding.

The approach described here extends our previous work in a number of respects. Firstly, our previous systems [12], [14] used heuristic rules to estimate time-frequency masks. Here, we adopt a more principled approach in which masks are estimated from probability density functions for binaural cues, which are obtained from a corpus of sound mixtures. In this respect, our current work is representative of the current trend in CASA for applying statistical rather than heuristic methods (for example, see [15], [16], [11]). Secondly, we consider the problem of a multisource environment with realistic room reverberation. This represents a much greater challenge than the anechoic or very mildly reverberant conditions used in some of our previous studies [11], [12]. Finally, whereas our previous approach used ITD only, here we investigate the relative effectiveness of ITD, ILD and a joint ITD-ILD space.

The latter approach is related to the binaural speech segregation system of Roman et al. [11]. Their system determines the relative strength of a target and interferer (and hence estimates a binary mask) by measuring the deviation of observed ITD and ILD in each frequency band of a binaural auditory model. Specifically, a supervised learning method is used for different spatial configurations and frequency bands within an ITD-ILD feature space. Given an observation x in the ITD-ILD feature space for a frequency band, two hypotheses are tested: whether the target is dominant (H1) and whether the interferer is dominant (H2). Using estimates of the bivariate densities p(x|H1) and p(x|H2), classification is then performed using a maximum a posteriori (MAP) decision rule.

Although our approach is similar to that of Roman et al., there are important differences. Here, we estimate probability distributions for ITD, ILD and ITD-ILD directly from training data, rather than using a parametric method. We also assume that the target is located at a known azimuth, which simplifies subsequent processing. Most importantly, we have evaluated our system in reverberant conditions. Reverberation remains a substantial problem for both ASR and CASA systems. In the case of ASR, it is well known that recognition accuracy falls as the T60 reverberation time increases and the ratio of direct sound to reverberation decreases [17]. Similarly, the performance of CASA systems is degraded by reverberation (e.g., [12]), to the extent that many CASA systems are only evaluated on anechoic mixtures. Here, we evaluate our combined CASA and ASR system using the same speech recognition task as Roman et al., and obtain accuracies in reverberant conditions which are similar to those obtained by Roman et al. for anechoic mixtures.

Our approach also differs from speech recognition systems that use multiple microphones (for example, see [18]). Such systems typically perform spatial filtering using adaptive beamforming, in order to preserve the signal emanating from a target direction whilst suppressing noise and reverberation that originate from other directions. Here, we do not use spatial information to derive filtered acoustic features; rather, spatial cues are used to select acoustic features that are representative of the target speech signal. Similarly, we use spectral acoustic features rather than those that are intended to confer robustness against noise and reverberation, such as relative spectral perceptual linear prediction (RASTA-PLP) [19] or mean-normalized mel-frequency cepstral coefficients (MFCC) [20]. Our previous work suggests that such 'noise robust' features do not perform well in conditions where both interfering sounds and reverberation are present [12]. We also note that our approach is a purely data-driven one, as opposed to model-based approaches such as parallel model combination [21] and hidden Markov model decomposition [22].

The remainder of the paper is organised as follows. Section II explains the methods used (signal spatialisation and reverberation, binaural cues, missing data ASR, and mask estimation). Section III describes a number of experiments, which examine the effects of cue selection and other training parameters on ASR performance. Section IV concludes the paper with a general discussion.

II. GENERAL METHODS

A. Signal spatialisation and reverberation

Input data for the CASA system consisted of utterances from the TIdigits corpus [23], to which reverberation and spatialisation were applied. It was assumed that the source of interest was at azimuth 0 degrees, although another azimuth could have been used. Signals consisting of two concurrent utterances by male speakers were used to test the system; similar acoustic mixtures were used as training data when generating the probability distributions used to create the missing data masks. Each utterance consisted of a speaker saying from one to seven digits, with each digit selected from the list 'one two three four five six seven eight nine zero oh'. The mixed utterances consisted of a target utterance at azimuth 0 mixed with another masking utterance at a different azimuth. In order to add reverberation and spatial location to the original monaural utterances, impulse responses were created using the Roomsim simulator with a simulated room of size 6 m x 4 m x 3 m. The receiver was a KEMAR head in the centre of the room, 2 m above the ground, and the source was at an azimuth of 0, 5, 7.5, 10, 15, 20, 30 or 40 degrees at a radial distance of 1.5 m from the receiver. All surfaces of the room were assumed to have identical reverberation characteristics. Two reverberation surfaces were used, 'acoustic plaster' and 'platform floor wooden', with mean estimated T60 times of 0.34 and 0.51 seconds respectively (reverberation times at standard frequencies are shown in Table I): note that the latter surface was used only for generating training data, not test data. The room impulse responses were convolved with the monaural signals to produce binaural reverberated data.
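The spatialisation and mixing procedure lends itself to a short illustration. The sketch below is a minimal, assumed implementation rather than the authors' code: it presumes that left/right binaural room impulse responses for each azimuth have already been generated (e.g. with Roomsim), and the helper names, zero-padding and SNR scaling are illustrative choices.

    import numpy as np
    from scipy.signal import fftconvolve

    def spatialise(mono, brir_left, brir_right):
        """Convolve a monaural utterance with a left/right room impulse response pair."""
        return np.stack([fftconvolve(mono, brir_left),
                         fftconvolve(mono, brir_right)])

    def mix_at_snr(target_lr, masker_lr, snr_db):
        """Scale the masker so the target-to-masker energy ratio is snr_db, then add."""
        n = max(target_lr.shape[1], masker_lr.shape[1])
        t = np.pad(target_lr, ((0, 0), (0, n - target_lr.shape[1])))
        m = np.pad(masker_lr, ((0, 0), (0, n - masker_lr.shape[1])))
        scale = np.sqrt(np.sum(t ** 2) / np.sum(m ** 2)) * 10.0 ** (-snr_db / 20.0)
        return t + scale * m

    # Example (signals from TIdigits, impulse responses from Roomsim):
    # target = spatialise(target_mono, brir_0deg_left, brir_0deg_right)
    # masker = spatialise(masker_mono, brir_40deg_left, brir_40deg_right)
    # mixture = mix_at_snr(target, masker, snr_db=0)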

TABLE I
REVERBERATION TIMES (SECONDS) FOR TWO SURFACES.

Surface                  125 Hz   250 Hz   500 Hz   1000 Hz   2000 Hz   4000 Hz   Mean
Platform floor wooden    0.22     0.31     0.49     0.48      0.65      0.89      0.51


Fig. 1. Segments of auditory spectrograms for the utterance 'one two eight oh' at azimuth 0 mixed at SNR 0 dB with utterance 'five three' at azimuth 40, both by male speakers: left, anechoic; middle, reverberated, surface 'acoustic plaster'; right, reverberated, surface 'platform floor wooden'.

A cross-correlogram was created by passing each of the binaural inputs through the auditory filterbank described above and computing the cross-correlation for each frequency channel in each time frame. The biggest peak in each channel was selected and the estimate of ITD was improved by fitting a quadratic curve to the peak. The ILD was calculated by summing the energy over each frequency channel and converting the ratio of the energy for the right and left ears to dB. Further details can be found in [12].
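As a rough illustration of the cue extraction described above, the following sketch estimates the ITD for a single frequency channel and time frame from the largest interaural cross-correlation peak (refined by a quadratic fit), and the ILD as a right/left energy ratio in dB. It is a simplified stand-in for the auditory-model front end of [12]: the gammatone filterbank and frame windowing are omitted, a circular cross-correlation is used for brevity, and the maximum lag is an assumed parameter.

    import numpy as np

    def estimate_itd(left, right, fs, max_lag_ms=1.0):
        """ITD in ms from the largest cross-correlation peak, with parabolic refinement."""
        max_lag = int(round(max_lag_ms * fs / 1000.0))
        lags = np.arange(-max_lag, max_lag + 1)
        # Circular cross-correlation: peak where left(t) best matches right(t - lag).
        xcorr = np.array([np.sum(left * np.roll(right, lag)) for lag in lags])
        k = int(np.argmax(xcorr))
        delta = 0.0
        if 0 < k < len(lags) - 1:
            y0, y1, y2 = xcorr[k - 1], xcorr[k], xcorr[k + 1]
            denom = y0 - 2.0 * y1 + y2
            if denom != 0.0:
                delta = 0.5 * (y0 - y2) / denom   # quadratic (parabolic) peak interpolation
        return (lags[k] + delta) * 1000.0 / fs

    def estimate_ild(left, right, eps=1e-12):
        """ILD in dB: ratio of right-ear to left-ear energy in the channel/frame."""
        return 10.0 * np.log10((np.sum(right ** 2) + eps) / (np.sum(left ** 2) + eps))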

C. The 'missing data' speech recogniser

The missing data recogniser uses hidden Markov models (HMMs) trained on spectro-temporal acoustic features. During testing, the recogniser is supplied with features for an acoustic mixture and a time-frequency mask that indicates which parts of the input constitute reliable evidence for the target source. The missing data mask may be discrete, i.e. each time-frequency element is either 0 (meaning the target is masked) or 1 (meaning the target is dominant); alternatively, 'soft' masks may be used [25], in which each element takes a real value between 0 and 1 indicating the probability that the element is dominated by the target.

The HMM emission distributions are based on Gaussian mixture models (GMMs), with each component having a diagonal covariance matrix. During recognition, the soft missing data technique adapts the usual state-likelihood computation to compute state scores, p̂(x|q), which take account of the unreliable data as follows:

    p̂(x|q) = Σ_λ w_λ p̂(x|q, λ)                                        (1)

where w_λ is the mixture weight for GMM component λ, and p̂(x|q, λ) is a score for each mixture component, given by forming a product over the spectral features of the form

    p̂(x|q, λ) = Π_i [ m_i p(x_i|q, λ) + ((1 − m_i)/x_i) ∫₀^{x_i} p(x'|q, λ) dx' ]    (2)

where m_i is the mask value for spectral feature i. The second term bounds the contribution of an unreliable feature by integrating the component density over the range [0, x_i] of values that the clean feature could have taken, following the soft missing data approach of [25]. Each component density is a univariate Gaussian with mean µ_λi and variance σ²_λi:

    p(x_i|q, λ) = (1/√(2πσ²_λi)) exp( −(x_i − µ_λi)² / (2σ²_λi) )      (3)
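The state-score computation in (1) and (2) can be sketched as follows. This is a schematic re-implementation under the assumptions stated above (diagonal-covariance Gaussian components, unreliable features handled by bounded integration of the component density between zero and the observed value, following [25]); it is not the authors' decoder, and the log-domain arithmetic a real recogniser would use is omitted for clarity.

    import numpy as np
    from scipy.stats import norm

    def soft_md_state_score(x, mask, weights, means, variances):
        """Soft missing-data score p_hat(x|q) for one HMM state.

        x         : (D,) observed spectral features for one frame
        mask      : (D,) soft mask values m_i in [0, 1]
        weights   : (K,) GMM component weights w_lambda
        means     : (K, D) component means mu_lambda_i
        variances : (K, D) component variances sigma_lambda_i ** 2
        """
        score = 0.0
        for w, mu, var in zip(weights, means, variances):
            sd = np.sqrt(var)
            present = norm.pdf(x, loc=mu, scale=sd)          # p(x_i | q, lambda)
            # Counter-evidence term: clean value assumed to lie in [0, x_i].
            bounded = norm.cdf(x, loc=mu, scale=sd) - norm.cdf(0.0, loc=mu, scale=sd)
            bounded = bounded / np.maximum(x, 1e-10)
            per_feature = mask * present + (1.0 - mask) * np.maximum(bounded, 1e-300)
            score += w * np.prod(per_feature)
        return score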

In order to create the probability distributions, ILD and ITD were identified for each time-frequency element of a set of mixed training utterances (described in section II-A above). Probability distributions were produced for a range of training data, which was always selected from a set of 120 pairs of utterances, matched for length, for reverberation surface 'acoustic plaster' or 'platform floor wooden'. One utterance was at azimuth 0 degrees, and the other at 5, 10, 20 or 40 degrees, or at -5, -10, -20 or -40 degrees, and the utterances were mixed at SNR 0, 10 and 20 dB. The choice of reverberation surface and SNR used in each training set depended on the experiment, but all of the 8 azimuth separations listed above were always included.

After each ILD and ITD was determined for each time-frequency element of a training utterance, it was assigned to a bin. The bin sizes were selected according to the range of values observed for ILD and ITD in the training data and were set to 0.1 dB for ILD and 0.01 ms for ITD. These values were found during preliminary investigations to provide sufficient resolution for the probability distributions.

Two histograms were produced from the binned ILD and ITD values. The first histogram, H_a, counted the number of observations of each ILD/ITD combination in all of the training data (i.e. including observations produced by both the target source and the masking source); the second, H_t, counted the number of observations of each ILD/ITD combination produced by the target source alone. Observations likely to have been produced by the target source were identified using an a priori mask (section II-C) created for each of the mixed training utterances: only those time-frequency elements of the a priori mask that were identified as belonging to the target source were included in histogram H_t.
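A compact way to picture the construction of these two histograms is given below. This is an illustrative sketch rather than the published implementation: it assumes that the binned ILD/ITD observations and the corresponding a priori mask flags are already available for each frequency channel, and uses the bin widths quoted above; the array and function names are assumptions.

    import numpy as np
    from collections import defaultdict

    ILD_BIN, ITD_BIN = 0.1, 0.01   # assumed bin widths: dB for ILD, ms for ITD

    def bin_index(ild, itd):
        """Quantise an (ILD, ITD) observation to a histogram bin."""
        return (int(np.floor(ild / ILD_BIN)), int(np.floor(itd / ITD_BIN)))

    def accumulate_histograms(observations, apriori_flags, H_all, H_target):
        """Update the all-data histogram H_a and the target-only histogram H_t
        for one frequency channel of one training mixture.

        observations  : iterable of (ild, itd) pairs, one per time frame
        apriori_flags : iterable of 0/1 values, 1 where the a priori mask
                        assigns the element to the target source
        """
        for (ild, itd), is_target in zip(observations, apriori_flags):
            b = bin_index(ild, itd)
            H_all[b] += 1
            if is_target:
                H_target[b] += 1

    # One histogram pair per frequency channel, e.g. for a 64-channel filterbank:
    # H_all = [defaultdict(int) for _ in range(64)]
    # H_target = [defaultdict(int) for _ in range(64)]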

The probability distribution was modelled using a Bayesian approach. Given an observation o = (ILD, ITD) for a time-frequency element of a mixed utterance, the probability that the observation was produced by a target source at azimuth zero is given by:

    p(Target|o) = p(o|Target) p(Target) / p(o)                          (4)

where:

    p(o|Target) = H_t(o) / Σ_o H_t(o)                                   (5)

    p(Target) = Σ_o H_t(o) / Σ_o H_a(o)                                 (6)

    p(o) = H_a(o) / Σ_o H_a(o)                                          (7)

so that

    p(Target|o) = H_t(o) / H_a(o)                                       (8)

which can be found from the histograms obtained from the training data as described above.

Observations were counted separately for each frequency channel, due to the large variation in the probability distributions for different channels.

A further step was performed to reduce the effect of insufficient training data for certain observations. Any time-frequency elements for which the number of observations H_a(o) was less than a threshold value were treated as if no data were present. Using a threshold reduced the chance of an unrealistically high probability occurring when only a few elements were present but were all allocated to the target; for example, a single target element would result in a probability of 1. When the denominator H_a(o) was zero, the numerator H_t(o) was also zero, and the corresponding probability was set to zero. This smoothed the progression from regions of low probability (where fewer observations occurred) to regions of high probability (where larger numbers of observations occurred) (Fig. 3). The choice of threshold was determined heuristically, as described below, in order to find a balance between excluding excessively high probabilities due to lack of training data, and including probabilities resulting from small amounts of training data that would nevertheless have been valid.

A missing data mask was created for each mixed test signal by identifying the ILD and ITD for each time-frequency element, and using the probability distribution as a look-up table for these two cues to determine the probability that each element was dominated by the target at azimuth 0.

Fig. 3 shows examples of the probability distributions obtained in this way for four of the 64 frequency channels, for training data with a combination of SNRs and histogram threshold 10. The examples are for a probability distribution based on both ILD and ITD cues, but probability distributions were also produced for each cue independently.
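Putting these pieces together, the sketch below turns the per-channel histograms into the look-up table used to build a soft missing data mask, including the observation-count threshold just described. It is an outline of the procedure as described in the text, not the published code; the helper names and default threshold are illustrative.

    def target_probability(H_all, H_target, obs_bin, threshold=10):
        """p(Target|o) = H_t(o) / H_a(o), set to zero when too few observations."""
        n_all = H_all.get(obs_bin, 0)
        if n_all < threshold:          # insufficient training data for this bin
            return 0.0
        return H_target.get(obs_bin, 0) / n_all

    def make_soft_mask(cue_bins, H_all, H_target, threshold=10):
        """cue_bins[c][t] is the binned (ILD, ITD) observation for channel c, frame t;
        returns a mask of the same shape with values in [0, 1]."""
        return [[target_probability(H_all[c], H_target[c], cue_bins[c][t], threshold)
                 for t in range(len(cue_bins[c]))]
                for c in range(len(cue_bins))]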

Fig. 3. Examples of ILD/ITD probability distributions for a source at azimuth 0 degrees, for channels with centre frequency 99 Hz (top left), 545 Hz (top right), 1599 Hz (bottom left) and 4090 Hz (bottom right). Lighter areas have lower probability; darker areas higher probability.

Examples of missing data masks created from probability distributions determined from ILD alone, ITD alone and ILD and ITD combined are shown in Fig. 2.

E. Evaluation

The accuracy of the masks and the localisation process was evaluated by measuring the recognition accuracy for a set of mixed test utterances for which missing data masks had been produced using each probability distribution. Only one reverberation surface, 'acoustic plaster', was used for the test set. The test utterances consisted of 240 target utterances, different from those in the training set, with reverberation added, spatialised at azimuth 0 degrees and mixed at one of a set of signal-to-noise ratios (SNR) of 0, 5, 10, 15 or 20 dB with 240 masking male utterances, matched in length to the original 240 utterances. The masking speech was reverberated and spatialised at azimuth 5, 7.5, 10, 15, 20, 30 or 40 degrees. The resulting mixture was processed to form an auditory spectrogram as for the training data. The SNR was calculated from data spatialised at azimuth 0 degrees; the mixed speech signal entering the ear furthest from the masking speech was used for recognition.

A number of experiments were performed to investigate the importance of each of the two localisation cues (ILD and ITD) together with the effect of different properties of the training data (such as reverberation surface and SNR) used to create the probability distribution. A further experiment investigated whether the method was successful in more difficult test conditions, using two masking utterances. These experiments are summarised in Table II.

III. EXPERIMENTS

A. Experiment 1 - effect of cue selection

Experiment 1 investigated the effects of selecting a single localisation cue (ITD or ILD) and of combining both cues together. Probability distributions were created for each of these three conditions, using training data for reverberation surface 'acoustic plaster' and for all three SNRs (0, 10 and 20 dB) combined together. When creating the probability distributions, the histogram threshold value was set to 10 (see also experiment 2).

Fig. 4 shows the recognition accuracy for the 240 mixed test utterances for each of the three conditions, together with baseline results for a priori masks. A recogniser was also trained and tested using Mel Frequency Cepstral Coefficients (MFCCs) derived from the same training and testing data. For this approach, the data was processed using an analysis window of 25 ms and frame shift of 10 ms, with 23 frequency channels, 12 cepstral coefficients plus an energy term, delta and acceleration coefficients, and energy and cepstral mean normalisation. The recogniser did not perform well under noisy conditions, as can be seen from the MFCC results included in Fig. 4.
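For reference, the MFCC baseline configuration quoted above (25 ms window, 10 ms shift, 23 mel channels, 12 cepstra plus an energy term, deltas and accelerations, and mean normalisation) could be approximated with a standard feature library as sketched below. This librosa-based snippet is an assumed stand-in, not the toolkit actually used for the experiments, and details such as the exact energy term may differ from the original setup.

    import numpy as np
    import librosa

    def mfcc_baseline(y, sr):
        """13 cepstra (including an energy-like c0), deltas and accelerations,
        with per-utterance cepstral mean normalisation: a 39-dimensional feature."""
        win = int(0.025 * sr)          # 25 ms analysis window
        hop = int(0.010 * sr)          # 10 ms frame shift
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=23,
                                    n_fft=win, hop_length=hop, win_length=win)
        mfcc -= mfcc.mean(axis=1, keepdims=True)              # mean normalisation
        delta = librosa.feature.delta(mfcc)
        accel = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, delta, accel])                # shape (39, n_frames)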

A separate plot is shown for each of the 5 SNRs used in the test data. The different effect of the cues was most pronounced for lower test SNR but, overall, using both ILD and ITD produced better results than when only a single cue was used. Using ILD alone produced the worst performance in all cases, showing that this cue was not sufficient to define the probability distribution, especially for smaller azimuth separation of the two sources. When both cues were used, ...

Fig. 4. Results of experiment 1, showing the effect of varying localisation cues (ILD and ITD). Training data: surface 'acoustic plaster', SNR 0, 10 and 20 dB (combined together). Histogram threshold: 10. Test data: surface 'acoustic plaster', SNR 0, 5, 10, 15 or 20 dB. One plot is shown per test data SNR. (Note that the ordinate scale varies.)

TABLE II
SUMMARY OF EXPERIMENTAL CONDITIONS.

Experiment   Localisation cues              Training data surface                           Training data SNR (dB)       Histogram threshold
1            ILD, ITD or combined ILD/ITD   'acoustic plaster'                              0, 10 and 20 (combined)      10
2            Combined ILD/ITD               'acoustic plaster'                              0, 10 and 20 (combined)      varied
3            Combined ILD/ITD               'acoustic plaster' or 'platform floor wooden'   0, 10 and 20 (combined)      10
4            Combined ILD/ITD               'acoustic plaster'                              0, 10 or 20 (individually)   3
5            Combined ILD/ITD               'acoustic plaster'                              0, 10 and 20 (combined)      10

Fig. 6. Results of experiment 3, showing the effect of varying the training data reverberation surface for combined ILD/ITD cues. Training data: surface 'acoustic plaster' or 'platform floor wooden', SNR 0, 10 and 20 dB (combined together for each surface). Histogram threshold: 10. Test data: surface 'acoustic plaster', SNR 0, 5, 10, 15 or 20 dB.

Separate probability distributions were created from training data at SNR 0, 10 or 20 dB for each SNR condition, and the performance compared for each training SNR. The threshold was reduced from 10 to 3 when creating these three distributions to allow for the reduced quantity of training data compared with the distribution for the combined SNRs.

Fig. 7 shows the results plotted separately for each of the 5 test SNRs. For the lower test SNRs (0, 5 and 10 dB), there was an increase in performance of up to 3% for the smaller azimuth separations (5 and 7.5 degrees) when the training SNR was also low (0 dB). However, this was offset by a small decrease in performance of around 0.5% for the higher azimuth separations. The inverse occurred when a higher training SNR (10 or 20 dB) was used: performance decreased by up to 3% or 5% respectively for smaller azimuth separations, but there was a corresponding small increase in performance for higher azimuth separations. For the higher test SNRs (15 and 20 dB) the training data SNR had little effect. Overall, there was a small mean improvement when using a training SNR of 0 dB, suggesting that it is more important to use training data for the more difficult conditions.

E. Experiment 5 - using multiple masking sources

The method used in these experiments can be applied to data that has more than one masking source. In experiment 5, additional test data was produced for which two masking sources (both male speakers) were present, to check that good performance could be obtained in more difficult test conditions.

The first masker was at 5, 7.5, 10, 15, 20, 30 or 40 degrees azimuth; the second was at -10 or +10 degrees. The two maskers were mixed at SNR 0 dB and then this mixture was combined with the target at SNR 0 dB. Missing data masks were created using the ILD/ITD probability distribution for training data with surface 'acoustic plaster' and a combination of SNR values (0, 10 and 20 dB). The signal entering the ear furthest from the first masker was used for recognition.

Fig. 8 shows the results of this experiment, for the one masker and two masker conditions, together with the results for the a priori masks.


Fig. 7. Results of experiment 4, showing the effect of varying the signal-to-noise ratio of the training data for combined ILD/ITD cues. Training data: surface 'acoustic plaster', SNR 0, 10 or 20 dB (each used separately). Histogram threshold: 3. Test data: surface 'acoustic plaster', SNR 0, 5, 10, 15 or 20 dB. One plot is shown per test data SNR.

Performance was still good considering the difficulty of the task, although reduced by 3-5% (compared with the single masker case) when both maskers were on the same side of the head, and by 4-7% when maskers were on opposite sides of the head. Performance using the a priori masks was also reduced for the two maskers compared with a single masker.

Fig. 8. Results of experiment 5, using two masking sources, for combined ILD/ITD cues. Training data: surface 'acoustic plaster', SNR 0, 10 and 20 dB (combined together). Histogram threshold: 10. Test data: surface 'acoustic plaster', SNR 0 dB.

IV. DISCUSSION

We have shown, via the experiments summarised in Table II, that a target speech signal can be recognised in the presence of spatially separated maskers in a reverberant environment, using the statistics of binaural cues to estimate missing data masks. The system generalised well to acoustic conditions that were not encountered during training.

Experiment 1 showed that the binaural source separation problem was most effectively addressed within a joint ITD-ILD space, rather than by using ITD or ILD alone. This finding is compatible with the theoretical analysis and simulations given by Roman et al. [11]. Additionally, we found that ILD alone was a far less effective cue than ITD alone in reverberant conditions. This result is consistent with predictions from psychophysics. For example, Ihlefeld and Shinn-Cunningham [26] show that reverberation decreases the mean magnitude of the ILD, making it an unreliable indicator of source azimuth. However, they suggest that the variation of the short-term ILD over time may still provide some indication of the lateral position in reverberant conditions. Information about the temporal variation of binaural cues is not explicitly used in our current system, and will be investigated in future work.

As might be expected, in all experiments recognition performance was lower when the test data were at smaller azimuth separation or had lower SNR. Both these conditions reduced the accuracy of the missing data masks, but for different reasons. When the azimuth separation between the sources is small, it is harder to obtain a reliable estimate of the ILD and ITD for the target and masker. At the smallest angular separation used (5 degrees), the ILD is small and the ITD (measured as 0.045 msec from the KEMAR head-related impulse response) lies close to the smallest time lag detectable by our cross-correlation algorithm (which is limited by the sample period of 0.05 msec). When the SNR is lower, more elements will be dominated by the masker, and therefore the regions of the mask assigned to the target will be smaller.
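As a rough consistency check (not taken from the paper), the quoted 0.045 ms figure at 5 degrees agrees with the spherical-head (Woodworth) approximation, and a 0.05 ms lag quantum corresponds to a 20 kHz sampling rate:

    % Assumed head radius a = 0.0875 m and speed of sound c = 343 m/s; theta = 5 degrees = 0.0873 rad.
    \mathrm{ITD}(\theta) \approx \frac{a}{c}\,(\theta + \sin\theta)
        = \frac{0.0875}{343}\,(0.0873 + 0.0872)\ \mathrm{s} \approx 0.045\ \mathrm{ms},
    \qquad \frac{1}{0.05\ \mathrm{ms}} = 20\ \mathrm{kHz}.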

The masks are affected in a similar way by the accuracy of the probability distributions, which is influenced by the choice of training data used. Although experiment 4 showed that the reverberation surface used for training has little effect, the selection of training data for lower SNR is important, as discussed in section III-D. Performance is also rather sensitive to the histogram threshold, as illustrated in experiment 2. As the threshold increases, performance tends to increase for lower azimuth separation, but there is a corresponding decrease for higher azimuth separation. When the threshold is low, some of the probabilities in the distribution (and hence in the masks) may be excessively high; when the threshold is high, more of the probabilities will be zero. A low threshold results in more unreliable data in the mask, especially for the more difficult test conditions at smaller azimuth separations; the easier conditions are less affected since the more reliable (higher probability) data also gets through. In contrast, a high threshold prevents some of the reliable data being included in the probability distribution and masks, but also reduces the quantity of unreliable data, which improves the performance for smaller azimuth separations but reduces performance for larger azimuth separations (Fig. 9). Using more training data, especially for the more difficult test conditions, would be expected to reduce the sensitivity of the system to the histogram threshold.

Fig. 9. Missing data masks for the mixed utterances in Fig. 1, reverberation surface 'acoustic plaster', showing the effect of histogram threshold for two azimuth separations.

The method worked well in more challenging conditions, when multiple maskers were present (experiment 5). The reduced performance for the a priori masks was probably due to the reduced variance in the combined maskers compared with a single masker, and a similar effect would be expected in the masks produced using localisation cues.

In the case of maskers on opposite sides of the head, two additional factors are likely to have been responsible for the reduced performance: first, the signal used for recognition was close to the second of the two maskers, so the ear advantage was reduced or removed; second, interactions between the two maskers would be expected to complicate the pattern of ILDs and ITDs and produce more confusions between the target and maskers.

Further work is required to separate the effects mentioned above, and to investigate whether extending the training data (for example, using more training utterances, additional reverberation surfaces and multiple maskers) improves performance under all test conditions. We will also use other reverberation surfaces for testing, and train and test the system with targets at other azimuths. It would also be interesting to compare the performance of this system with those of humans under similar conditions.

REFERENCES

[1] R. P. Lippmann, "Speech recognition by machines and humans," Speech Communication, vol. 22, pp. 1–15, 1997.
[2] B. C. J. Moore, An Introduction to the Psychology of Hearing, 5th ed. London: Academic Press, 2003.
[3] P. M. Zurek, "The precedence effect," in Directional Hearing, W. A. Yost and G. Gourevitch, Eds. New York: Springer-Verlag, 1987, pp. 85–105.
[4] W. Spieth, J. F. Curtis, and J. C. Webster, "Responding to one of two simultaneous messages," Journal of the Acoustical Society of America, vol. 26, pp. 391–396, 1954.
[5] N. I. Durlach, "Equalization and cancellation theory of binaural masking level differences," Journal of the Acoustical Society of America, vol. 35, no. 8, pp. 1206–1218, 1963.
[6] C. J. Darwin and R. W. Hukin, "Auditory objects of attention: the role of interaural time differences," Journal of Experimental Psychology: Human Perception and Performance, vol. 25, no. 3, pp. 616–629, 1999.
[7] A. S. Bregman, Auditory Scene Analysis. Cambridge, MA: MIT Press, 1990.
[8] R. F. Lyon, "A computational model of binaural localization and separation," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 1983, pp. 1148–1151.
[9] M. Bodden, "Modelling human sound-source localization and the cocktail party effect," Acta Acustica, vol. 1, pp. 43–55, 1993.
[10] A. Shamsoddini and P. N. Denbigh, "A sound segregation algorithm for reverberant conditions," Speech Communication, vol. 33, no. 3, pp. 179–196, 2001.
[11] N. Roman, D. L. Wang, and G. J. Brown, "Speech segregation based on sound localization," Journal of the Acoustical Society of America, vol. 114, no. 4, pp. 2236–2252, 2003.
[12] K. J. Palomäki, G. J. Brown, and D. L. Wang, "A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation," Speech Communication, vol. 43, no. 4, pp. 361–378, 2004.
[13] M. P. Cooke, P. Green, L. Josifovski, and A. Vizinho, "Robust automatic speech recognition with missing and unreliable acoustic data," Speech Communication, vol. 34, pp. 267–285, 2001.
[14] S. Harding, J. Barker, and G. J. Brown, "Mask estimation based on sound localization for missing data speech recognition," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Philadelphia, March 2005.
[15] S. T. Roweis, "One microphone source separation," in Advances in Neural Information Processing Systems 13. Cambridge, MA: MIT Press, 2000, pp. 793–799.
[16] J. Nix, M. Kleinschmidt, and V. Hohmann, "Computational auditory scene analysis by using statistics of high-dimensional speech dynamics and sound source direction," in Proceedings of Eurospeech, Geneva, September 2003, pp. 1441–1444.
[17] B. W. Gillespie and L. E. Atlas, "Acoustic diversity for improved speech recognition in reverberant environments," in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Orlando, May 2002, pp. 557–560.
[18] M. Omologo, P. Svaizer, and M. Matassoni, "Environmental conditions and acoustic transduction in hands-free speech recognition," Speech Communication, vol. 25, pp. 75–95, 1998.
[19] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Transactions on Speech and Audio Processing, vol. 2, pp. 578–589, 1994.
[20] F. H. Liu, R. M. Stern, X. Huang, and A. Acero, "Efficient cepstral normalization for robust speech recognition," in Proceedings of the Sixth ARPA Workshop on Human Language Technology. Princeton, NJ: Morgan Kaufmann, March 1993.
[21] M. J. F. Gales and S. J. Young, "Robust continuous speech recognition using parallel model combination," IEEE Transactions on Speech and Audio Processing, vol. 4, no. 5, pp. 352–359, 1996.
[22] A. P. Varga and R. K. Moore, "Hidden Markov model decomposition of speech and noise," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Albuquerque, New Mexico, April 1990, pp. 845–848.
[23] R. G. Leonard, "A database for speaker-independent digit recognition," in Proc. ICASSP, vol. 3, 1984, pp. 111–114.
[24] B. R. Glasberg and B. C. J. Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1–2, pp. 103–138, 1990.
[25] J. P. Barker, L. Josifovski, M. P. Cooke, and P. D. Green, "Soft decisions in missing data techniques for robust automatic speech recognition," in Proc. ICSLP, 2000, pp. 373–376.
[26] A. Ihlefeld and B. G. Shinn-Cunningham, "Effect of source location and listener location on ILD cues in a reverberant room," Journal of the Acoustical Society of America, vol. 115, p. 2598, 2004.

oftheSuemathematicsHardingsityandSuecomputerHardingreceivedtheB.Sc.inPLACEastronomyofYork,fromU.K.,Universityin1980,sciencethefromtheUniver-ofM.Sc.degreeinradioPHOTO1982HEREneuroscienceandtheFromfromPh.D.KeeledegreeUniversity,inManchester,communicationU.K.,in

U.K.,in2003.andanalyst1982sheworkedasaprogrammerandfromof1988atSimonEngineering,Stockport,U.K.,systems

andDepartmentsheComputerashasworkedScienceaComputingOfficerintheDepartmentasataKeeleUniversity.Since2003interestsincludeofComputerauditorySciencesceneanalysisattheUniverstiyResearchandmodelsofofSheffield.Associatespeechperception.Herresearchinthe

Jon Barker received the B.A. degree in electrical and information science from Cambridge University, U.K., in 1991 and the Ph.D. degree in computer science from the University of Sheffield, U.K., in 1998. He has worked as a researcher at ICP (Grenoble, France) and IDIAP (Martigny, Switzerland), and has been a visiting scientist at ICSI (Berkeley, US). Since 2002 he has been a lecturer in computer science with the University of Sheffield. His research interests include speech and audio-visual speech perception, robust automatic speech recognition, and audio and audio-visual speech processing.

Guy J. Brown received the B.Sc. degree in applied science from Sheffield Hallam University, U.K., in 1988, and the Ph.D. degree in computer science and the M.Ed. degree from the University of Sheffield, Sheffield, U.K., in 1992 and 1997, respectively. He has been a visiting research scientist at LIMSI-CNRS (Paris), ATR (Kyoto), The Ohio State University and Helsinki University of Technology. He is currently a Senior Lecturer in computer science with the University of Sheffield. He has a long-established research interest in computational auditory modelling, and also has interests in automatic speech recognition and music technology. He has authored and coauthored more than 80 papers in books, journals and conference proceedings.
