您的当前位置:首页正文

Chapter 4 An Introduction to Hidden Markov Models for

2021-11-05 来源:好走旅游网
Chapter4

AnIntroductiontoHiddenMarkovModelsforBiologicalSequences

byAndersKrogh

CenterforBiologicalSequenceAnalysisTechnicalUniversityofDenmarkBuilding206,2800Lyngby,DenmarkPhone:+4545252471Fax:+4545934808

E-mail:krogh@cbs.dtu.dk

InComputationalMethodsinMolecularBiology,editedbyS.L.Salzberg,D.B.SearlsandS.Kasif,pages45-63.Elsevier,1998.

1

Contents

4AnIntroductiontoHiddenMarkovModelsforBiologicalSequences4.1Introduction.............................4.2FromregularexpressionstoHMMs................4.3ProfileHMMs............................

4.3.1Pseudocounts........................4.3.2Searchingadatabase....................4.3.3Modelestimation......................4.4HMMsforgenefinding.......................

4.4.1Signalsensors.......................4.4.2Codingregions.......................4.4.3Combiningthemodels...................4.5Furtherreading...........................

2

13481011121314151619

4.1Introduction

Veryefficientprogramsforsearchingatextforacombinationofwordsareavail-ableonmanycomputers.Thesamemethodscanbeusedforsearchingforpatternsinbiologicalsequences,butoftentheyfail.Thisisbecausebiological‘spelling’ismuchmoresloppythanEnglishspelling:proteinswiththesamefunctionfromtwodifferentorganismsarealmostcertainlyspelleddifferently,thatis,thetwoaminoacidsequencesdiffer.Itisnotrarethattwosuchhomologoussequenceshavelessthan30%identicalaminoacids.SimilarlyinDNAmanyinterestingsig-nalsvarygreatlyevenwithinthesamegenome.Somewell-knownexamplesareribosomebindingsitesandsplicesites,butthelistislong.Fortunatelythereareusuallystillsomesubtlesimilaritiesbetweentwosuchsequences,andtheques-tionishowtodetectthesesimilarities.

Thevariationinafamilyofsequencescanbedescribedstatistically,andthisisthebasisformostmethodsusedinbiologicalsequenceanalysis,see[1]forapresentationofsomeofthesestatisticalapproaches.Forpairwisealignments,forinstance,theprobabilitythatacertainresiduemutatestoanotherresidueisusedinasubstitutionmatrix,suchasoneofthePAMmatrices.ForfindingpatternsinDNA,e.g.splicesites,somesortofweightmatrixisveryoftenused,whichissimplyapositionspecificscorecalculatedfromthefrequenciesofthefournucleotidesatallthepositionsinsomeknownexamples.Similarly,methodsforfindinggenesuse,almostwithoutexception,thestatisticsofcodonsordicodonsinsomeformorother.

AhiddenMarkovmodel(HMM)isastatisticalmodel,whichisverywellsuitedformanytasksinmolecularbiology,althoughtheyhavebeenmostlyde-velopedforspeechrecognitionsincetheearly1970s,see[2]forhistoricaldetails.ThemostpopularuseoftheHMMinmolecularbiologyisasa‘probabilisticpro-file’ofaproteinfamily,whichiscalledaprofileHMM.Fromafamilyofproteins(orDNA)aprofileHMMcanbemadeforsearchingadatabaseforothermem-bersofthefamily.TheseprofileHMMsresembletheprofile[3]andweightmatrixmethods[4,5],andprobablythemaincontributionisthattheprofileHMMtreatsgapsinasystematicway.

TheHMMcanbeappliedtoothertypesofproblems.Itisparticularlywellsuitedforproblemswithasimple‘grammaticalstructure,’suchasgenefinding.Ingenefindingseveralsignalsmustberecognizedandcombinedintoapredictionofexonsandintrons,andthepredictionmustconformtovariousrulestomakeitareasonablegeneprediction.AnHMMcancombinerecognitionofthesignals,anditcanbemadesuchthatthepredictionsalwaysfollowtherulesofagene.SincemuchoftheliteratureonHMMsisalittlehardtoreadformanybiol-ogists,Iwillattemptinthischaptertogiveanon-mathematicalintroductiontoHMMs.Whereasthelittlebiologicalbackgroundneededistakenforgranted,I

3

havetriedtoexplainHMMsatalevelthatalmostanyonecanfollow.FirstHMMsareintroducedbyanexampleandthenprofileHMMsaredescribed.ThenanHMMforfindingeukaryoticgenesissketched,andfinallypointerstothelitera-turearegiven.

4.2FromregularexpressionstoHMMs

Mostreadershavenodoubtcomeacrossregularexpressionsatsomepoint,andmanyprobablyusethemquitealot.Regularexpressionsareusedinmanypro-grams,inparticularonUnixcomputers.Inprogramslikeawk,grep,sed,andperl,regularexpressionscanbeusedforsearchingtextfilesforapattern.Withgrepforinstance,youcansearchafileforalllinescontaining‘C.elegans’or‘Caenorhab-ditiselegans’withtheregularexpression‘’.ThiswillmatchanylinecontainingaCfollowedbyanynumberoflower-caselettersor‘.’,thenaspaceandthenelegans.Regularexpressionscanalsobeusedtocharacterizeproteinfamilies,whichisthebasisforthePROSITEdatabase[6].

Usingregularexpressionsisaveryelegantandefficientwaytosearchforsomeproteinfamilies,butdifficultforother.Asalreadymentionedinthein-troduction,thedifficultiesarisebecauseproteinspellingismuchmorefreethanEnglishspelling.Thereforetheregularexpressionssometimesneedtobeverybroadandcomplex.ImagineaDNAmotiflikethis:

(IuseDNAonlybecauseofthesmallernumberoflettersthanforaminoacids).Aregularexpressionforthisis

[AT][CG][AC][ACGT]*A[TG][GC],

meaningthatthefirstpositionisAorT,thesecondCorG,andsoforth.Theterm‘[ACGT]*’meansthatanyofthefourletterscanoccuranynumberoftimes.Theproblemwiththeaboveregularexpressionisthatitdoesnotinanywaydistinguishbetweenthehighlyimplausiblesequence

whichhastheexceptionalcharacterineachposition,andtheconsensussequence

4

Figure4.1:AhiddenMarkovmodelderivedfromthealignmentdiscussedinthetext.Thetransitionsareshownwitharrowswhosethicknessindicatetheirproba-bility.Ineachstatethehistogramshowstheprobabilitiesofthefournucleotides.

withthemostplausiblecharacterineachposition(thedashesarejustforaligningthesesequenceswiththepreviousones).Whatismeantbya‘plausible’sequencecanofcoursebedebated,althoughmostwouldprobablyagreethatthefirstse-quenceisnotlikelytobethesamemotifasthe5sequencesabove.Itispossibletomaketheregularexpressionmorediscriminativebysplittingitintoseveraldif-ferentones,butiteasilybecomesmessy.Thealternativeistoscoresequencesbyhowwelltheyfitthealignment.

Toscoreasequence,wesaythatthereisaprobabilityof4508foranAinthefirstpositionand1502foraT,becauseweobservethatoutof5letters4areAsandoneisaT.SimilarlyinthesecondpositiontheprobabilityofCis45andofG15,andsoforth.Afterthethirdpositioninthealignment,3outof5sequenceshave‘insertions’ofvaryinglengths,sowesaytheprobabilityofmakinganinsertionis35andthus25fornotmakingone.TokeeptrackofthesenumbersadiagramcanbedrawnwithprobabilitiesasinFig.4.1.

ThisisahiddenMarkovmodel.Aboxinthedrawingiscalledastate,andthereisastateforeachtermintheregularexpression.Alltheprobabilitiesarefoundsimplybycountinginthemultiplealignmenthowmanytimeseacheventoccur,justasdescribedabove.Theonlypartthatmightseemtrickyisthe‘in-sertion,’whichisrepresentedbythestateabovetheotherstates.Theprobabilityofeachletterisfoundbycountingalloccurrencesofthefournucleotidesinthisregionofthealignment.ThetotalcountsareoneA,twoCs,oneG,andoneT,yieldingprobabilities15,25,15,and15respectively.Aftersequences2,3and5havemadeoneinsertioneach,therearetwomoreinsertions(fromsequence2)andthetotalnumberoftransitionsbacktothemainlineofstatesis3(allthreesequenceswithinsertionshavetofinish).Thereforethereare5transitionsintotalfromtheinsertstate,andtheprobabilityofmakingatransitiontoitselfis25andtheprobabilityofmakingonetothenextstateis35.

5

Sequence

Originalsequences

Probability

100Logodds

4.93.05.34.94.6

3.30.0075

1.23.30.59

Table4.1:Probabilitiesandlog-oddsscoresforthe5sequencesinthealignmentandfortheconsensussequenceandthe‘exceptional’sequence.

Itisnoweasytoscoretheconsensussequence.TheprobabilityofthefirstAis45.Thisismultipliedbytheprobabilityofthetransitionfromthefirststatetothesecond,whichis1.Continuingthis,thetotalprobabilityoftheconsensusis

P

080447

108061102

11

0808

06108

Makingthesamecalculationfortheexceptionalsequenceyieldsonly00023102,whichisroughly2000timessmallerthanfortheconsensus.Thiswayweachievedthegoalofgettingascoreforeachsequence,ameasureofhowwellasequencefitsthemotif.

Thesameprobabilitycanbecalculatedforthefouroriginalsequencesinthealignmentinexactlythesameway,andtheresultisshowninTable4.1.Theprobabilitydependsverystronglyonthelengthofthesequence.Thereforetheprobabilityitselfisnotthemostconvenientnumbertouseasascore,andthelog-oddsscoreshowninthelastcolumnofthetableisusuallybetter.Itisthelogarithmoftheprobabilityofthesequencedividedbytheprobabilityaccordingtoanullmodel.Thenullmodelisonethattreatsthesequencesasrandomstringsofnucleotides,sotheprobabilityofasequenceoflengthLis025L.Thenthelog-oddsscoreis

log-oddsforsequenceS

log

PS

Figure4.2:TheprobabilitiesofthemodelinFig.4.1havebeenturnedintolog-oddsbytakingthelogarithmofeachnucleotideprobabilityandsubtractinglog025.Thetransitionprobabilitieshavebeenconvertedtosimplelogs.

Whenasequencefitsthemotifverywellthelog-oddsishigh.Whenitfitsthenullmodelbetter,thelog-oddsscoreisnegative.Althoughtherawprobabilityofthesecondsequence(theonewiththreeinserts)isalmostaslowasthatoftheexceptionalsequence,noticethatthelog-oddsscoreismuchhigherthanfortheexceptionalsequence,andthediscriminationisverygood.Unfortunately,onecannotalwaysassumethatanythingwithapositivelog-oddsscoreis‘ahit,’becausetherearerandomhitsifoneissearchingalargedatabase.SeeSection4.5forreferences.

Insteadofworkingwithprobabilitiesonemightconverteverythingtolog-odds.Ifeachnucleotideprobabilityisdividedbytheprobabilityaccordingtothenullmodel(0.25inthiscase)andthelogarithmisapplied,wewouldgetthenumbersshowninFig.4.2.Thetransitionprobabilitiesarealsoturnedintologarithms.Nowthelog-oddsscorecanbecalculateddirectlybyaddingupthesenumbersinsteadofmultiplyingtheprobabilities.Forinstance,thecalculationofthelog-oddsoftheconsensussequenceis

log-odds

116047664

0116011605105113901160116

(ThefiniteprecisioncausesthelittledifferencebetweenthisnumberandtheoneinTable4.1.)

Ifthealignmenthadnogapsorinsertionswewouldgetridoftheinsertstate,andthenalltheprobabilitiesassociatedwiththearrows(thetransitionprobabili-ties)wouldbe1andmightaswellbeignoredcompletely.ThentheHMMworksexactlyasaweightmatrixoflog-oddsscores,whichiscommonlyused.

7

Figure4.3:ThestructureoftheprofileHMM.

4.3ProfileHMMs

AprofileHMMisacertaintypeofHMMwithastructurethatinanaturalwayallowspositiondependentgappenalties.AprofileHMMcanbeobtainedfromamultiplealignmentandcanbeusedforsearchingadatabaseforothermembersofthefamilyinthealignmentverymuchlikestandardprofiles[3].ThestructureofthemodelisshowninFig.4.3.Thebottomlineofstatesarecalledthemainstates,becausetheymodelthecolumnsofthealignment.InthesestatestheprobabilitydistributionisjustthefrequencyoftheaminoacidsornucleotidesasintheabovemodeloftheDNAmotif.Thesecondrowofdiamondshapedstatesarecalledinsertstatesandareusedtomodelhighlyvariableregionsinthealignment.TheyfunctionexactlylikethetopstateinFig.4.1,althoughonemightchoosetouseafixeddistributionofresidues,e.g.theoveralldistributionofaminoacids,insteadofcalculatingthedistributionasintheexampleabove.Thetoplineofcircularstatesarecalleddeletestates.Theseareadifferenttypeofstate,calledasilentornullstate.Theydonotmatchanyresidues,andtheyaretheremerelytomakeitpossibletojumpoveroneormorecolumnsinthealignment,i.e.,tomodelthesituationwhenjustafewofthesequenceshavea‘–’inthemultiplealignmentataposition.Letusturntoanexample.

SupposeyouhaveamultiplealignmentastheoneshowninFig.4.4.Aregionofthisalignmenthasbeenchosentobean‘insertion,’becauseanalignmentofthisregionishighlyuncertain.Therestofthealignment(shadedinthefigure)arethecolumnsthatwillcorrespondtomainstatesinthemodel.Foreachnon-insertcolumnwemakeamainstateandsettheprobabilitiesequaltotheaminoacidfrequencies.Toestimatethetransitionprobabilitieswecounthowmanysequencesusethevarioustransitions,justlikethetransitionprobabilitieswerecalculatedinthefirstexample.ThemodelisshowninFig.4.5.Therearetwotransitionsfromamainstatetoadeletestateshownwithdashedlinesinthefigure,thatfromtothefirstdeletestateandfrommainstate12todeletestate

8

GIPDGGG-GGGGGSLREIKVPEEGDQGGGNGGNEEDDDDEDEEDPDHHDGDNEGGGRGDDWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWLWWWWWWWWWWCWWWWWWMWWLWWWWTWWRNEQKLDEYKLKERREKRKPENEKRRKQETGGGAAAAAAAAAAVAFVVVGGGGGGGAGAGdyqrqrerrrrkqvrrkqeleeedsererrynlrsslssssstndsddvnlicyyirlst.e.....lllll.l.k...e........n.gt.....siavsktktar.r...g..ak..gtndtskstttsntnvlnntngktnynstnkgneggggnrgkgrgyggdrgngrggggggkerqqqrhskrrqqqtnhrqqrkiqreqekqrrietrreeee.eepveqrrkvqvvtkneLGGGGGGGGGGGGGGGGGGGGGGQGGGGGGWDIIFYKYYYYFWLYYYYFDVIIYWWIWYIFFFVIIVVIIVIVIIYIVVFFFFFFFIAIFPPPPPPPPPPPPPPPEPPPPPPPPPPPPPPSGSSFSSSSSSSSLSSSSAGAAKSSASTSANTNKNNNNTNNNNNNGNSATSTVNNNNNNNYYY-FYYYYYFYYFYYYYYYYYFYYYYYYYVVV-VVLVVVVVIVIVVLVVVVVVVVVLVVFigure4.4:Analignmentof30shortaminoacidsequenceschoppedoutofaalignmentoftheSH3domain.TheshadedareasarethemostconservedandwerechosentoberepresentedbythemainstatesintheHMM.Theunshadedareawithlower-caseletterswaschosentoberepresentedbyaninsertstate.

123456789101112131485TSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCENDBEGINFigure4.5:AprofileHMMmadefromthealignmentshowninFig.4.4.Transi-tionlineswithnoarrowheadaretransitionsfromlefttoright.Transitionswithprobabilityzeroarenotshown,andthosewithverysmallprobabilityareshownasdashedlines.Transitionsfromaninsertstatetoitselfisnotshown;insteadtheprobabilitytimes100isshowninthediamond.Thenumbersinthecirculardeletestatesarejustpositionnumbers.(ThisfigureandFig.4.6weregeneratedbyaprogramintheSAMpackageofprograms.)

9

123456789101112131433TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC85TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC50ENDBEGINFigure4.6:ModelobtainedinthesamewayasFig.4.5,butusingapseudocountofone.

13.Bothofthesecorrespondtodashesinthealignment.Inbothcasesonlyonesequencehasgaps,sotheprobabilityofthesedeletetransitionsis1/30.Thefourthsequencecontinuesdeletiontotheend,sotheprobabilityofgoingfromdelete13to14is1andfromdelete14totheendisalso1.

4.3.1Pseudocounts

Itisdangeroustoestimateaprobabilitydistributionfromjustafewobservedaminoacids.Ifforinstanceyouhavejusttwosequenceswithleucineatacertainposition,theprobabilityforleucinewouldbe1andtheprobabilitywouldbezeroforallotheraminoacidsatthisposition,althoughitiswellknownthatoneoftenseesforexamplevalinesubstitutedforleucine.Insuchacasetheprobabilityofawholesequencemayeasilybecomezeroifasingleleucineissubstitutedbyavaline,orequivalently,thelog-oddsisminusinfinity.

Thereforeitisimportanttohavesomewayofavoidingthissortofover-fitting,wherestrongconclusionsaredrawnfromverylittleevidence.Themostcom-monmethodistousepseudocounts,whichmeansthatonepretendstohavemorecountsofaminoacidsthanthosefromthedata.Thesimplestistoadd1toallthecounts.Withtheleucineexampleitwouldmeanthattheprobabilityofleucinewouldbeestimatedas323andforthe19otheraminoacidsitwouldbecome123.InFig.4.6amodelisshown,whichwasobtainedfromthealignmentinFig.4.6usingapseudocountof1.

Addingonetoallthecountscanbeinterpretedasassumingapriorithatalltheaminoacidsareequallylikely.However,therearesignificantdifferencesintheoccurrenceofthe20aminoacidsinknownproteinsequences.Therefore,thenextstepistousepseudocountsproportionaltotheobservedfrequenciesoftheaminoacidsinstead.ThisistheminimumlevelofpseudocountstobeusedinanyrealapplicationofHMMs.

Becauseacolumninthealignmentmaycontaininformationaboutthepre-10

ferredtypeofaminoacids,itisalsopossibletousemoresophisticatedpseudo-countstrategies.Ifacolumnconsistspredominantlyofleucine(asabove),onewouldexpectsubstitutionstootherhydrophobicaminoacidstobemoreprobablethansubstitutionstohydrophilicaminoacids.Onecane.g.derivepseudocountsforagivencolumnfromsubstitutionmatrices.SeeSection4.5forreferences.

4.3.2Searchingadatabase

Abovewesawhowtocalculatetheprobabilityofasequenceinthealignmentbymultiplyingalltheprobabilities(oraddingthelog-oddsscores)inthemodelalongthepathfollowedbythatparticularsequence.However,thispathisusuallynotknownforothersequenceswhicharenotpartoftheoriginalalignment,andthenextproblemishowtoscoresuchasequence.Obviously,ifwecanfindapaththroughthemodelwherethenewsequencefitswellinsomesense,thenwecanscorethesequenceasbefore.Weneedto‘align’thesequencetothemodel.Itresemblesverymuchthepairwisealignmentproblem,wheretwosequencesarealignedsothattheyaremostsimilar,andindeedthesametypeofdynamicprogrammingalgorithmcanbeused.

Foraparticularsequence,analignmenttothemodel(orapath)isanassign-mentofstatestoeachresidueinthesequence.Therearemanysuchalignmentsforagivensequence.Forinstanceanalignmentmightbeasfollows.LetuslabeltheaminoacidsinaproteinasA1,A2,A3,etc.SimilarlywecanlabeltheHMMstatesasM1,M2,M3,etc.formatchstates,I1,I2,I3forinsertstates,andsoon.ThenanalignmentcouldhaveA1matchstateM1,A2andA3matchI1,A4matchM2,A5matchM6(afterpassingthroughthreedeletestates),andsoon.Foreachsuchpathwecancalculatetheprobabilityofthesequenceorthelog-oddsscore,andthuswecanfindthebestalignment,i.e.,theonewiththelargestprobability.Althoughthereareanenormousnumberofpossiblealignmentsitcanbedoneeffi-cientlybytheabovementioneddynamicprogrammingalgorithm,whichiscalledtheViterbialgorithm.Thealgorithmalsogivestheprobabilityofthesequenceforthatalignment,andthusascoreisobtained.

Thelog-oddsscorefoundinthismannercanbeusedtosearchdatabasesformembersofthesamefamily.AtypicaldistributionofscoresfromsuchasearchisshowninFig.4.7.Asisalsothecasewithothertypesofsearches,thereisnoclear-cutseparationoftrueandfalsepositives,andoneneedstoinvestigatesomeofthesequencesaroundalog-oddsofzero,andpossiblyincludesomeoftheminthealignmentandtrysearchingagain.

Analternativewayofscoringsequencesistosumtheprobabilitiesofallpos-siblealignmentsofthesequencetothemodel.Thisprobabilitycanbefoundbyasimilaralgorithmcalledtheforwardalgorithm.Thistypeofscoringisnotverycommoninbiologicalsequencecomparison,butitismorenaturalfroma

11

5040number of sequences3020100-40-200204060log-odds80100120140Figure4.7:Thedistributionoflog-oddsscoresfromasearchofSwissprotwithaprofileHMMoftheSH3domain.ThedarkareaofthehistogramrepresentsthesequenceswithanannotatedSH3domain,andthelightthosethatarenotannotatedashavingone.Thisisforillustrativepurposesonly,andthesequenceswithlog-oddsaroundzerowerenotinvestigatedfurther.

probabilisticpointofview.However,itusuallygivesverysimilarresults.

4.3.3Modelestimation

Aspresentedsofar,onemayviewtheprofileHMMsasageneralizationofweightmatricestoincorporateinsertionsanddeletionsinanaturalway.Thereishow-everoneinterestingfeatureofHMMs,whichhasnotbeenaddressedyet.Itispossibletoestimatethemodel,i.e.determinealltheprobabilityparametersofit,fromunalignedsequences.Furthermore,amultiplealignmentofthesequencesisproducedintheprocess.Likemanyothermultiplealignmentmethodsthisisdoneinaniterativemanner.Onestartsoutwithamodelwithmoreorlessrandomprobabilities,orifareasonablealignmentofsomeofthesequencesareavailable,amodelisconstructedfromthisalignment.Then,whenallthesequencesarealignedtothemodel,wecanusethealignmenttoimprovetheprobabilitiesinthemodel.Thesenewprobabilitiesmaythenleadtoaslightlydifferentalignment.Iftheydo,wethenrepeattheprocessandimprovetheprobabilitiesagain.Theprocessisrepeateduntilthealignmentdoesnotchange.Thealignmentofthe

12

sequencestothefinalmodelyieldsamultiplealignment.1

Althoughthisestimationprocesssoundseasy,therearemanyproblemstoconsidertoactuallymakeitworkwell.Oneproblemischoosingtheappropri-atemodellength,whichdeterminesthenumberofinsertsinthefinalalignment.Anothersevereproblemisthattheiterativeprocedurecanconvergetosuboptimalsolutions.Itisnotguaranteedthatitfindstheoptimalmultiplealignment,i.e.themostprobableone.MethodsfordealingwiththeseissuesaredescribedintheliteraturepointedtoinSection4.5.

4.4HMMsforgenefinding

OneabilityofHMMs,whichisnotreallyutilizedinprofileHMMs,istheabilitytomodelgrammar.Manyproblemsinbiologicalsequenceanalysishaveagram-maticalstructure,andeukaryoticgenestructure,whichIwilluseasanexample,isoneofthem.Ifyouconsiderexonsandintronsasthe‘words’inalanguage,thesentencesareoftheformexon-intron-exon-intron...intron-exon.The‘sentences’canneverendwithanintron,atleastifthegenesarecomplete,andanexoncanneverfollowanexonwithoutanintroninbetween.Obviouslythisgrammarisgreatlysimplified,becausethereareseveralotherconstraintsongenestructure,suchastheconstraintthattheexonshavetofittogethertogiveavalidcodingregionaftersplicing.InFig.4.8thestructureofageneisshownwithsomeoftheknownsignalsmarked.

Formallanguagetheoryappliedtobiologicalproblemsisnotanewinvention.InparticularDavidSearls[7]haspromotedthisideaanduseditforgenefinding[8],butmanyothergenefindersuseitimplicitly.FormallytheHMMcanonlyrepresentthesimplestofgrammars,whichiscalledaregulargrammar[7,1],butthatturnsouttobegoodenoughforthegenefindingproblem,andmanyotherproblems.OneoftheproblemsthathasamorecomplicatedgrammarthantheHMMcanhandleistheRNAfoldingproblem,whichisonestepuptheladderofgrammars,becausebasepairingintroducescorrelationsbetweenbasesfarfromeachotherintheRNAsequence.

Iwillherebrieflyoutlinemyownapproachtogenefindingwiththeweightontheprinciplesratherthanonthedetails.

Figure4.8:Thestructureofagenewithsomeoftheimportantsignalsshown.

4.4.1Signalsensors

OnemayapplyanHMMsimilartotheonesalreadydescribeddirectlytomanyofthesignalsinagenestructure.InFig.4.9analignmentisshownofsomesequencesaroundacceptorsitesfromhumanDNA.Ithas19columnsandanHMMwith19states(noinsertordeletestates)canbemadedirectlyfromit.Sincethealignmentisgap-less,theHMMisequivalenttoaweightmatrix.

Thereisoneproblem:inDNAtherearefairlystrongdinuclotidepreferences.Amodelliketheonedescribedtreatsthenucleotidesasindependent,sodinu-cleotidepreferencescannotbecaptured.Thisiseasilyfixedbyhaving16prob-abilityparametersineachstateinsteadof4.IncolumntwowefirstcountalloccurrencesofthefournucleotidesgiventhatthereisanAinthefirstcolumnandnormalizethesefourcounts,sotheybecomeprobabilities.Thisisthecondi-tionalprobabilitythatacertainnucleotideappearsinpositiontwo,giventhatthepreviousonewasA.ThesameisdoneforalltheinstancesofCincolumn1andsimilarlyforGandT.Thisgivesatotalof16probabilitiestobeusedinstatetwooftheHMM.Similarlyforalltheotherstates.Tocalculatetheprobabilityofasequence,sayACTGTC,wejustmultiplytheconditionalprobabilities

PACTGTC

p1A

p2CA

p3TC

p4GT

p5TG

p6CT

14

CTGTTTTCCTATTTAGTATGCTTACATTCTCTCTTCCCGCCTGTGCGGCTCCTTGTATTTTCACCTGTCTTCATGTGTTTTGTAGCCCGTATCACTGTTTATTGTTATGGTCTTTCTCTCGTTTTTTTGTGTGCCTATCTCCTCTCTTTGTCCACTTCTCCTTTTATCCACCCTCCCTTTGTTCTCCTCGCTAACCAGGAGACACAACCCTACCCCCTCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGCGGGACGGGACGGACGAGGGAGTGAACTATTCCTTCCTACGTCGGT

Figure4.9:Examplesofhumanacceptorsites(thesplicesite5’totheexon).Ex-ceptinrarecases,theintronendswithAG,whichhasbeenhighlighted.Includedinthesesequencesare16basesupstreamofthesplicesiteand3basesdownstreamintotheexon.

Herep1istheprobabilityofthefournucleotidesinstate1,p2xyisthecondi-tionalprobabilityinstate2ofnucleotidexgiventhatthepreviousnucleotidewasy,andsoforth.

Astatewithconditionalprobabilitiesiscalledafirstorderstate,becauseitcapturesthefirstordercorrelationsbetweenneighboringnucleotides.Itiseasytoexpandtohigherorder.Asecondorderstatehasprobabilitiesconditionedonthetwopreviousnucleotidesinthesequence,i.e.,probabilitiesoftheformpxyz.Wewillreturntosuchhigherorderstatesbelow.

SmallHMMslikethisareconstructedinexactlythesamewayforothersig-nals:donorsplicesites,theregionsaroundthestartcodons,andtheregionsaroundthestopcodons.

4.4.2Codingregions

Thecodonstructureisthemostimportantfeatureofcodingregions.BasesintripletscanbemodeledwiththreestatesasshowninFig.4.10.Thefigurealsoshowshowthismodelofcodingregionscanbeusedinasimplemodelofanunsplicedgenethatstartswithastartcodon(ATG),thenconsistsofsomenumberofcodons,andendswithastopcodon.

Sinceacodonisthreebaseslong,thelaststateofthecodonmodelmustbeatleastofordertwotocorrectlycapturethecodonstatistics.The64probabilitiesinsuchastateareestimatedbycountingthenumberofeachcodoninasetofknowncodingregions.Thesenumbersarethennormalizedproperly.ForexampletheprobabilitiesderivedfromthecountsofCAA,CAC,CAG,andCATare

pACA

cCAA

cCAA

cCAC

cCAG

cCAT

15

Figure4.10:Top:Amodelofcodingregions,wherestateone,twoandthree

matchthefirst,secondandthirdcodonpositionsrespectively.Acodingregionofanylengthcanmatchthismodel,becauseofthetransitionfromstatethreebacktostateone.Bottom:asimplemodelforunsplicedgeneswiththefirstthreestatesmatchingastartcodon,thenextthreeoftheformshowntotheleft,andthelastthreestatesmatchingastopcodon(onlyoneofthethreepossiblestopcodonsareshown).

pCCApGCApTCA

cCACcCAAcCAGcCAAcCATcCAA

cCACcCACcCAC

cCAGcCAGcCAG

cCATcCATcCAT

wherecxyzisthecountofcodonxyz.

Oneofthecharacteristicsofcodingregionsisthelackofstopcodons.Thatisautomaticallytakencareof,becausepATA,pGTAandpATG,corre-spondingtothethreestopcodonsTAA,TAGandTGA,willautomaticallybecomezero.

Formodelingcodonstatisticsitisnaturaltouseanordinary(zerothorder)stateasthefirststateofthecodonmodelandafirstorderstateforthesecond.However,thereareactuallyalsodependenciesbetweenneighboringcodons,andthereforeonemaywantevenhigherorderstates.Inmyowngenefinder,Iusethreefourthorderstates,whichisinspiredbyGeneMark[9],inwhichsuchmod-elswerefirstintroduced.Technicallyspeaking,suchamodeliscalledaninho-mogeneousMarkovchain,whichcanbeviewedasasub-classofHMMs.

4.4.3Combiningthemodels

Tobeabletodiscovergenes,weneedtocombinethemodelsinawaythatsatis-fiesthegrammarofgenes.Irestrictmyselftocodingregions,i.e.the5’and3’untranslatedregionsofthegenesarenotmodeledandalsopromotersaredisre-garded.

16

Figure4.11:AhiddenMarkovmodelforunsplicedgenes.Inthisdrawingan‘x’

meansastatefornon-codingDNA,anda‘c’astateforcodingDNA.Onlyoneofthethreepossiblestopcodonsareshowninthemodeloftheregionaroundthestopcodon.

Figure4.12:Toallowforsplicinginthreedifferentframesthreeintronmodelsareneeded.Togettheframecorrect‘spacerstates’areaddedbeforeandaftertheintronmodels.

17

First,letusseehowtodoitforunsplicedgenes.Ifweignoregenesthatareverycloselyspacedoroverlaps,amodelcouldlooklikeFig.4.11.Itconsistsofastateforintergenicregions(oforderatleast1),amodelfortheregionaroundthestartcodon,themodelforthecodingregion,andamodelfortheregionaroundthestopcodon.Themodelforthestartcodonregionismadejustliketheacceptormodeldescribedearlier.Itmodelseightbasesupstreamofthestartcodon,2theATGstartcodonitself,andthefirstcodonafterthestart.Similarlyforthestopcodonregion.ThewholemodelisonebigHMM,althoughitwasputtogetherfromsmallindependentHMMs.

Havingsuchamodel,howcanwepredictgenesinasequenceofanonymousDNA?Thatisquitesimple:usetheViterbialgorithmtofindthemostprobablepaththroughthemodel.WhenthispathgoesthroughtheATGstates,astartcodonispredicted,whenitgoesthroughthecodonstatesacodonispredicted,andsoon.

Thismodelmightnotalwayspredictcorrectgenes,butatleastitwillonlypredictsensiblegenesthatobeythegrammaticalrules.Agenewillalwaysstartwithastartcodonandendwithastopcodon,thelengthwillalwaysbedivisibleby3,anditwillnevercontainstopcodonsinthereadingframe,whicharetheminimumrequirementsforunsplicedgenecandidates.

Makingamodelthatconformstotherulesofsplicingisabitmoredifficultthanitmightseematfirst.Thatisbecausesplicingcanhappeninthreedifferentreadingframes,andthereadingframeinoneexonhastofittheoneinthenext.Itturnsoutthatbyusingthreedifferentmodelsofintrons,oneforeachframe,thisispossible.InFig.4.12itisshownhowthesemodelsareaddedtothemodelofcodingregions.

Thetoplineinthemodelisforintronsappearingbetweentwocodons.Ithasthreestates(labeledccc)beforetheintronstartstomatchthelastcodonoftheexon.ThefirsttwostatesoftheintronmodelmatchGT,whichistheconsensussequenceatdonorsites(itisoccasionallyanothersequence,butsuchcasesareignoredhere).ThenextsixstatesmatchesthesixbasesimmediatelyafterGT.Thestatesjustdescribedmodelthedonorsite,andtheprobabilitiesarefoundasitwasdescribedearlierforacceptorsites.Thenfollowsasinglestatetomodeltheinterioroftheintron.Iactuallyusethesameprobabilityparametersinthisstateasinthestatemodelingintergenicregions.Nowfollowstheacceptormodel,whichincludesthreestatestomatchthefirstcodonofthenextexon.

Thenextlineinthemodelisforintronsappearingafterthefirstbaseinacodon.Thedifferencefromthefirstisthatthereisonemorestateforacoding

basebeforetheintronandtwomorestatesaftertheintron.Thisensuresthattheframesfitintwoneighboringexons.Similarlyinthethirdlinefromthetoptherearetwoextracodingstatesbeforetheintronandoneafter,sothatitcanmatchintronsappearingafterthesecondbaseinacodon.

Thereareobviouslymanypossiblevariationsofthemodel.Onecanaddmorestatestothesignalsensors,includemodelsofpromoterelementsanduntranslatedregionsofthegene,andsoforth.

4.5Furtherreading

Ageneralintroductioncanbefoundin[2],andoneaimedmoreatbiologicalreadersin[1].Thefirstapplicationsforsequenceanalysisisprobablyformodel-ingcompositionaldifferencesbetweenvariousDNAtypes[10]andforstudyingthecompositionalstructureofgenomes[11].TheinitialworkonusinghiddenMarkovmodels(HMMs)as‘probabilisticprofiles’ofproteinfamilieswaspre-sentedataconferenceinthespringof1992andinatechnicalreportfromthesameyear,anditwaspublishedin[12,13].Theideawasquicklytakenupbyothers[14,15].Independently,someverysimilarideaswerealsodevelopedin[16,17].Alsothegeneralizedprofiles[18]areverysimilar.

Estimationandmultiplealignmentisdescribedin[13]indetail,andin[19]someofthepracticalmethodsarefurtherdiscussed.Alternativemethodsformodelestimationarepresentedin[14,20].MethodsforscoringsequencesagainstaprofileHMMweregivenin[13],buttheseissueshavemorerecentlybeenad-dressedin[21].Thebasicpseudocountmethodisalsoexplainedin[13],andmoreadvancedmethodsarediscussedin[22,23,24,25,26].

AreviewofprofileHMMscanbefoundin[27],andin[1]profileHMMsarediscussedingreatdetail.Also[28]willundoubtedlycontaingoodmaterialonprofileHMMs.

SomeoftherecentapplicationsofprofileHMMstoproteinsare:detectionoffibronectintypeIIIdomainsinyeast[29],adatabaseofproteindomainfamilies[30],proteintopologyrecognitionfromsecondarystructure[31],andmodelingofaproteinsplicingdomain[32].

Therearetwoprogrampackagesavailablefreeofchargetotheacademiccom-munity.One,developedbySeanEddy,iscalledhmmer(pronounced‘hammer’),andcanbeobtainedfromhisweb-site(http://genome.wustl.edu/eddy/hmm.html).Theotherone,calledSAM(http://www.cse.ucsc.edu/research/compbio/sam.html),wasdevelopedbymyselfandthegroupatUCSantaCruz,anditisnowbeingmaintainedandfurtherdevelopedunderthecommandofRichardHughey.

ThegenefindersketchedaboveiscalledHMMgene.Therearemanydetailsomitted,suchasspecialmethodsforestimationandpredictiondescribedin[33].

19

Itisstillunderdevelopment,anditispossibletofollowthedevelopmentandtestthecurrentversionatthewebsitehttp://www.cbs.dtu.dk/services/HMMgene/.Methodsforautomatedgenefindinggobackalongtime,see[34]forareview.ThefirstHMMbasedgenefinderisprobablyEcoParsedevelopedforE.coli[35].VEIL[36]isarecentHMMbasedgenefinderforforhumangenes.ThemaindifferencefromHMMgeneisthatitdoesnotusehighorderstates(neitherdoesEcoParse),whichmakesgoodmodelingofcodingregionsharder.

Tworecentmethodsuseso-calledgeneralizedHMMs.Genie[37,38,39]combinesneuralnetworksintoanHMM-likemodel,whereasGENSCAN[40]ismoresimilartoHMMgene,butusesadifferentmodeltypeforsplicesite.Also,thegeneralizedHMMcanexplicitlyuseexonlengthdistributions,whichisnotpossibleinastandardHMM.Webpointerstogenefindingcanbefoundathttp://www.cbs.dtu.dk/krogh/genefinding.html.

OtherapplicationsofHMMsrelatedtogenefindingare:detectionofshortproteincodingregionsandanalysisoftranslationinitiationsitesinCyanobac-terium[41,42],characterizationofprokaryoticandeukaryoticpromoters[43],andrecognitionofbranchpoints[44].

Apartfromtheareasmentionedhere,HMMshavebeenusedforpredictionofproteinsecondarystructure[45],modelinganoscillatorypatterninnucleosomes[46],modelingsitedependenceofevolutionaryrates[47],andforincludingevo-lutionaryinformationinproteinsecondarystructureprediction[48].

Acknowledgements

HenrikNielsen,StevenSalzberg,andSteenKnudsenaregratefullyacknowledgedforcommentsonthemanuscript.ThisworkwassupportedbytheDanishNationalResearchFoundation.

References

[1]Durbin,R.M.,Eddy,S.R.,Krogh,A.,andMitchison,G.(1998)Biological

SequenceAnalysisCambridgeUniversityPressToappear.[2]Rabiner,L.R.(1989)Proc.IEEE77,257–286.

[3]Gribskov,M.,McLachlan,A.D.,andEisenberg,D.(1987)Proc.oftheNat.

Acad.ofSciencesoftheU.S.A.84,4355–4358.[4]Staden,R.(1984)NucleicAcidsResearch12,505–519.

[5]Staden,R.(1988)ComputerApplicationsintheBiosciences4,53–60.

20

[6]Bairoch,A.,Bucher,P.,andHofmann,K.(1997)NucleicAcidsResearch

25,217–221.[7]Searls,D.B.(1992)AmericanScientist80,579–591.[8]Dong,S.andSearls,D.B.(1994)Genomics23,540–551.

[9]Borodovsky,M.andMcIninch,J.(1993)ComputersandChemistry17,

123–133.[10]Churchill,G.A.(1989)BullMathBiol51,79–94.

[11]Churchill,G.A.(1992)ComputersandChemistry16,107–115.

[12]Haussler,D.,Krogh,A.,Mian,I.S.,andSj¨olander,K.(1993)Proteinmodel-ingusinghiddenMarkovmodels:AnalysisofglobinsInMudge,T.N.,Mi-lutinovic,V.,andHunter,L.(Eds.),ProceedingsoftheTwenty-SixthAnnual

HawaiiInternationalConferenceonSystemSciencesvolume1pp.792–802LosAlamitos,California.IEEEComputerSocietyPress.[13]Krogh,A.,Brown,M.,Mian,I.S.,Sj¨olander,K.,andHaussler,D.(1994)

JournalofMolecularBiology235,1501–1531.[14]Baldi,P.,Chauvin,Y.,Hunkapiller,T.,andMcClure,M.A.(1994)Proc.of

theNat.Acad.ofSciencesoftheU.S.A.91,1059–1063.[15]Eddy,S.R.,Mitchison,G.,andDurbin,R.(1995)JComputBiol2,9–23.[16]Stultz,C.M.,White,J.V.,andSmith,T.F.(1993)ProteinScience2,305–

314.[17]White,J.V.,Stultz,C.M.,andSmith,T.F.(1994)MathematicalBiosciences

119,35–75.[18]Bucher,P.,Karplus,K.,Moeri,N.,andHofmann,K.(1996)Computersand

Chemistry20,3–24.[19]Hughey,R.andKrogh,A.(1996)CABIOS12,95–107.

[20]Eddy,S.R.(1995)MultiplealignmentusinghiddenMarkovmodels.In

Rawlings,C.,Clark,D.,Altman,R.,Hunter,L.,Lengauer,T.,andWodak,S.(Eds.),Proc.ofThirdInt.Conf.onIntelligentSystemsforMolecularBiologyvolume3pp.114–120MenloPark,CA.AAAIPress.[21]Barrett,C.,Hughey,R.,andKarplus,K.(1997)ComputerApplicationsin

theBiosciences13,191–199.

21

[22]Brown,M.,Hughey,R.,Krogh,A.,Mian,I.S.,Sj¨olander,K.,andHaussler,

D.(1993)UsingDirichletmixturepriorstoderivehiddenMarkovmodelsforproteinfamiliesInHunter,L.,Searls,D.,andShavlik,J.(Eds.),Proc.ofFirstInt.Conf.onIntelligentSystemsforMolecularBiologypp.47–55MenloPark,CA.AAAI/MITPress.[23]Tatusov,R.L.,Altschul,S.F.,andKoonin,E.V.(1994)Proc.oftheNat.

Acad.ofSciencesoftheU.S.A.91,12091–12095.[24]Karplus,K.(1995)Evaluatingregularizersforestimatingdistributionsof

aminoacids.InRawlings,C.,Clark,D.,Altman,R.,Hunter,L.,Lengauer,T.,andWodak,S.(Eds.),Proc.ofThirdInt.Conf.onIntelligentSystemsforMolecularBiologyvolume3pp.188–196MenloPark,CA.AAAIPress.[25]Henikoff,J.G.andHenikoff,S.(1996)ComputerApplicationsintheBio-sciences12,135–143.[26]Sj¨olander,K.,Karplus,K.,Brown,M.,Hughey,R.,Krogh,A.,Mian,I.S.,

andHaussler,D.(1996)CABIOS12,327–345.[27]Eddy,S.R.(1996)CurrentOpinioninStructuralBiology6,361–365.[28]Baldi,P.andBrunak,S.(1998)Bioinformatics-TheMachineLearning

ApproachMITPressCambridgeMAToappear.[29]Bateman,A.andChothia,C.(1997)Curr.Biol.6,1544–1547.

[30]Sonnhammer,E.L.L.,Eddy,S.R.,andDurbin,R.(1997)Proteins28,

405–420.[31]DiFrancesco,V.,Garnier,J.,andMunson,P.J.(1997)JournalofMolecular

Biology267,446–463.[32]Dalgaard,J.Z.,Moser,M.J.,Hughey,R.,andMian,I.S.(1997)Journalof

ComputationalBiology4,193–214.[33]Krogh,A.(1997)TwomethodsforimprovingperformanceofaHMMand

theirapplicationforgenefindingInGaasterland,T.,Karp,P.,Karplus,K.,Ouzounis,C.,Sander,C.,andValencia,A.(Eds.),Proc.ofFifthInt.Conf.onIntelligentSystemsforMolecularBiologypp.179–186MenloPark,CA.AAAIPress.[34]Fickett,J.W.(1996)TrendsGenet.12,316–320.

22

[35]Krogh,A.,Mian,I.S.,andHaussler,D.(1994)NucleicAcidsResearch22,

4768–4778.[36]Henderson,J.,Salzberg,S.,andFasman,K.H.(1997)Findinggenesin

DNAwithahiddenMarkovmodelJournalofComputationalBiology.[37]Kulp,D.,Haussler,D.,Reese,M.G.,andEeckman,F.H.(1996)Agen-eralizedhiddenMarkovmodelfortherecognitionofhumangenesinDNAInStates,D.,Agarwal,P.,Gaasterland,T.,Hunter,L.,andSmith,R.(Eds.),Proc.Conf.onIntelligentSystemsinMolecularBiologypp.134–142MenloPark,CA.AAAIPress.[38]Reese,M.G.,Eeckman,F.H.,Kulp,D.,andHaussler,D.(1997)Improved

splicesitedetectioninGenieInWaterman,M.(Ed.),ProceedingsoftheFirstAnnualInternationalConferenceonComputationalMolecularBiology(RECOMB)NewYork.ACMPress.[39]Kulp,D.,Haussler,D.,Reese,M.G.,andEeckman,F.H.(1997)Integrating

databasehomologyinaprobabilisticgenestructuremodelInAltman,R.B.,Dunker,A.K.,Hunter,L.,andKlein,T.E.(Eds.),ProceedingsofthePacificSymposiumonBiocomputingNewYork.WorldScientific.[40]Burge,C.andKarlin,S.(1997)JournalofMolecularBiology268,78–94.[41]Yada,T.andHirosawa,M.(1996)DNARes.3,355–361.[42]Yada,T.,Sazuka,T.,andHirosawa,M.(1997)DNARes.4,1–7.

[43]Pedersen,A.G.,Baldi,P.,Brunak,S.,andChauvin,Y.(1996)Characteriza-tionofprokaryoticandeukaryoticpromotersusinghiddenMarkovmodelsInProc.ofFourthInt.Conf.onIntelligentSystemsforMolecularBiologypp.182–191MenloPark,CA.AAAIPress.[44]Tolstrup,N.,Rouz´e,P.,andBrunak,S.(1997)NucleicAcidsResearch25,

3159–3164.[45]Asai,K.,Hayamizu,S.,andHanda,K.(1993)ComputerApplicationsinthe

Biosciences9,141–146.[46]Baldi,P.,Brunak,S.,Chauvin,Y.,andKrogh,A.(1996)JournalofMolec-ularBiology263,503–510.[47]Felsenstein,J.andChurchill,G.A.(1996)MolecularBiologicalEvolution

13.

23

[48]Goldman,N.,Thorne,J.L.,andJones,D.T.(1996)JournalofMolecular

Biology263,196–208.

24

因篇幅问题不能全部显示,请点此查看更多更全内容