AnIntroductiontoHiddenMarkovModelsforBiologicalSequences
byAndersKrogh
CenterforBiologicalSequenceAnalysisTechnicalUniversityofDenmarkBuilding206,2800Lyngby,DenmarkPhone:+4545252471Fax:+4545934808
E-mail:krogh@cbs.dtu.dk
InComputationalMethodsinMolecularBiology,editedbyS.L.Salzberg,D.B.SearlsandS.Kasif,pages45-63.Elsevier,1998.
1
Contents
4AnIntroductiontoHiddenMarkovModelsforBiologicalSequences4.1Introduction.............................4.2FromregularexpressionstoHMMs................4.3ProfileHMMs............................
4.3.1Pseudocounts........................4.3.2Searchingadatabase....................4.3.3Modelestimation......................4.4HMMsforgenefinding.......................
4.4.1Signalsensors.......................4.4.2Codingregions.......................4.4.3Combiningthemodels...................4.5Furtherreading...........................
2
13481011121314151619
4.1Introduction
Veryefficientprogramsforsearchingatextforacombinationofwordsareavail-ableonmanycomputers.Thesamemethodscanbeusedforsearchingforpatternsinbiologicalsequences,butoftentheyfail.Thisisbecausebiological‘spelling’ismuchmoresloppythanEnglishspelling:proteinswiththesamefunctionfromtwodifferentorganismsarealmostcertainlyspelleddifferently,thatis,thetwoaminoacidsequencesdiffer.Itisnotrarethattwosuchhomologoussequenceshavelessthan30%identicalaminoacids.SimilarlyinDNAmanyinterestingsig-nalsvarygreatlyevenwithinthesamegenome.Somewell-knownexamplesareribosomebindingsitesandsplicesites,butthelistislong.Fortunatelythereareusuallystillsomesubtlesimilaritiesbetweentwosuchsequences,andtheques-tionishowtodetectthesesimilarities.
Thevariationinafamilyofsequencescanbedescribedstatistically,andthisisthebasisformostmethodsusedinbiologicalsequenceanalysis,see[1]forapresentationofsomeofthesestatisticalapproaches.Forpairwisealignments,forinstance,theprobabilitythatacertainresiduemutatestoanotherresidueisusedinasubstitutionmatrix,suchasoneofthePAMmatrices.ForfindingpatternsinDNA,e.g.splicesites,somesortofweightmatrixisveryoftenused,whichissimplyapositionspecificscorecalculatedfromthefrequenciesofthefournucleotidesatallthepositionsinsomeknownexamples.Similarly,methodsforfindinggenesuse,almostwithoutexception,thestatisticsofcodonsordicodonsinsomeformorother.
AhiddenMarkovmodel(HMM)isastatisticalmodel,whichisverywellsuitedformanytasksinmolecularbiology,althoughtheyhavebeenmostlyde-velopedforspeechrecognitionsincetheearly1970s,see[2]forhistoricaldetails.ThemostpopularuseoftheHMMinmolecularbiologyisasa‘probabilisticpro-file’ofaproteinfamily,whichiscalledaprofileHMM.Fromafamilyofproteins(orDNA)aprofileHMMcanbemadeforsearchingadatabaseforothermem-bersofthefamily.TheseprofileHMMsresembletheprofile[3]andweightmatrixmethods[4,5],andprobablythemaincontributionisthattheprofileHMMtreatsgapsinasystematicway.
TheHMMcanbeappliedtoothertypesofproblems.Itisparticularlywellsuitedforproblemswithasimple‘grammaticalstructure,’suchasgenefinding.Ingenefindingseveralsignalsmustberecognizedandcombinedintoapredictionofexonsandintrons,andthepredictionmustconformtovariousrulestomakeitareasonablegeneprediction.AnHMMcancombinerecognitionofthesignals,anditcanbemadesuchthatthepredictionsalwaysfollowtherulesofagene.SincemuchoftheliteratureonHMMsisalittlehardtoreadformanybiol-ogists,Iwillattemptinthischaptertogiveanon-mathematicalintroductiontoHMMs.Whereasthelittlebiologicalbackgroundneededistakenforgranted,I
3
havetriedtoexplainHMMsatalevelthatalmostanyonecanfollow.FirstHMMsareintroducedbyanexampleandthenprofileHMMsaredescribed.ThenanHMMforfindingeukaryoticgenesissketched,andfinallypointerstothelitera-turearegiven.
4.2FromregularexpressionstoHMMs
Mostreadershavenodoubtcomeacrossregularexpressionsatsomepoint,andmanyprobablyusethemquitealot.Regularexpressionsareusedinmanypro-grams,inparticularonUnixcomputers.Inprogramslikeawk,grep,sed,andperl,regularexpressionscanbeusedforsearchingtextfilesforapattern.Withgrepforinstance,youcansearchafileforalllinescontaining‘C.elegans’or‘Caenorhab-ditiselegans’withtheregularexpression‘’.ThiswillmatchanylinecontainingaCfollowedbyanynumberoflower-caselettersor‘.’,thenaspaceandthenelegans.Regularexpressionscanalsobeusedtocharacterizeproteinfamilies,whichisthebasisforthePROSITEdatabase[6].
Usingregularexpressionsisaveryelegantandefficientwaytosearchforsomeproteinfamilies,butdifficultforother.Asalreadymentionedinthein-troduction,thedifficultiesarisebecauseproteinspellingismuchmorefreethanEnglishspelling.Thereforetheregularexpressionssometimesneedtobeverybroadandcomplex.ImagineaDNAmotiflikethis:
(IuseDNAonlybecauseofthesmallernumberoflettersthanforaminoacids).Aregularexpressionforthisis
[AT][CG][AC][ACGT]*A[TG][GC],
meaningthatthefirstpositionisAorT,thesecondCorG,andsoforth.Theterm‘[ACGT]*’meansthatanyofthefourletterscanoccuranynumberoftimes.Theproblemwiththeaboveregularexpressionisthatitdoesnotinanywaydistinguishbetweenthehighlyimplausiblesequence
whichhastheexceptionalcharacterineachposition,andtheconsensussequence
4
Figure4.1:AhiddenMarkovmodelderivedfromthealignmentdiscussedinthetext.Thetransitionsareshownwitharrowswhosethicknessindicatetheirproba-bility.Ineachstatethehistogramshowstheprobabilitiesofthefournucleotides.
withthemostplausiblecharacterineachposition(thedashesarejustforaligningthesesequenceswiththepreviousones).Whatismeantbya‘plausible’sequencecanofcoursebedebated,althoughmostwouldprobablyagreethatthefirstse-quenceisnotlikelytobethesamemotifasthe5sequencesabove.Itispossibletomaketheregularexpressionmorediscriminativebysplittingitintoseveraldif-ferentones,butiteasilybecomesmessy.Thealternativeistoscoresequencesbyhowwelltheyfitthealignment.
Toscoreasequence,wesaythatthereisaprobabilityof4508foranAinthefirstpositionand1502foraT,becauseweobservethatoutof5letters4areAsandoneisaT.SimilarlyinthesecondpositiontheprobabilityofCis45andofG15,andsoforth.Afterthethirdpositioninthealignment,3outof5sequenceshave‘insertions’ofvaryinglengths,sowesaytheprobabilityofmakinganinsertionis35andthus25fornotmakingone.TokeeptrackofthesenumbersadiagramcanbedrawnwithprobabilitiesasinFig.4.1.
ThisisahiddenMarkovmodel.Aboxinthedrawingiscalledastate,andthereisastateforeachtermintheregularexpression.Alltheprobabilitiesarefoundsimplybycountinginthemultiplealignmenthowmanytimeseacheventoccur,justasdescribedabove.Theonlypartthatmightseemtrickyisthe‘in-sertion,’whichisrepresentedbythestateabovetheotherstates.Theprobabilityofeachletterisfoundbycountingalloccurrencesofthefournucleotidesinthisregionofthealignment.ThetotalcountsareoneA,twoCs,oneG,andoneT,yieldingprobabilities15,25,15,and15respectively.Aftersequences2,3and5havemadeoneinsertioneach,therearetwomoreinsertions(fromsequence2)andthetotalnumberoftransitionsbacktothemainlineofstatesis3(allthreesequenceswithinsertionshavetofinish).Thereforethereare5transitionsintotalfromtheinsertstate,andtheprobabilityofmakingatransitiontoitselfis25andtheprobabilityofmakingonetothenextstateis35.
5
Sequence
Originalsequences
Probability
100Logodds
4.93.05.34.94.6
3.30.0075
1.23.30.59
Table4.1:Probabilitiesandlog-oddsscoresforthe5sequencesinthealignmentandfortheconsensussequenceandthe‘exceptional’sequence.
Itisnoweasytoscoretheconsensussequence.TheprobabilityofthefirstAis45.Thisismultipliedbytheprobabilityofthetransitionfromthefirststatetothesecond,whichis1.Continuingthis,thetotalprobabilityoftheconsensusis
P
080447
108061102
11
0808
06108
Makingthesamecalculationfortheexceptionalsequenceyieldsonly00023102,whichisroughly2000timessmallerthanfortheconsensus.Thiswayweachievedthegoalofgettingascoreforeachsequence,ameasureofhowwellasequencefitsthemotif.
Thesameprobabilitycanbecalculatedforthefouroriginalsequencesinthealignmentinexactlythesameway,andtheresultisshowninTable4.1.Theprobabilitydependsverystronglyonthelengthofthesequence.Thereforetheprobabilityitselfisnotthemostconvenientnumbertouseasascore,andthelog-oddsscoreshowninthelastcolumnofthetableisusuallybetter.Itisthelogarithmoftheprobabilityofthesequencedividedbytheprobabilityaccordingtoanullmodel.Thenullmodelisonethattreatsthesequencesasrandomstringsofnucleotides,sotheprobabilityofasequenceoflengthLis025L.Thenthelog-oddsscoreis
log-oddsforsequenceS
log
PS
Figure4.2:TheprobabilitiesofthemodelinFig.4.1havebeenturnedintolog-oddsbytakingthelogarithmofeachnucleotideprobabilityandsubtractinglog025.Thetransitionprobabilitieshavebeenconvertedtosimplelogs.
Whenasequencefitsthemotifverywellthelog-oddsishigh.Whenitfitsthenullmodelbetter,thelog-oddsscoreisnegative.Althoughtherawprobabilityofthesecondsequence(theonewiththreeinserts)isalmostaslowasthatoftheexceptionalsequence,noticethatthelog-oddsscoreismuchhigherthanfortheexceptionalsequence,andthediscriminationisverygood.Unfortunately,onecannotalwaysassumethatanythingwithapositivelog-oddsscoreis‘ahit,’becausetherearerandomhitsifoneissearchingalargedatabase.SeeSection4.5forreferences.
Insteadofworkingwithprobabilitiesonemightconverteverythingtolog-odds.Ifeachnucleotideprobabilityisdividedbytheprobabilityaccordingtothenullmodel(0.25inthiscase)andthelogarithmisapplied,wewouldgetthenumbersshowninFig.4.2.Thetransitionprobabilitiesarealsoturnedintologarithms.Nowthelog-oddsscorecanbecalculateddirectlybyaddingupthesenumbersinsteadofmultiplyingtheprobabilities.Forinstance,thecalculationofthelog-oddsoftheconsensussequenceis
log-odds
116047664
0116011605105113901160116
(ThefiniteprecisioncausesthelittledifferencebetweenthisnumberandtheoneinTable4.1.)
Ifthealignmenthadnogapsorinsertionswewouldgetridoftheinsertstate,andthenalltheprobabilitiesassociatedwiththearrows(thetransitionprobabili-ties)wouldbe1andmightaswellbeignoredcompletely.ThentheHMMworksexactlyasaweightmatrixoflog-oddsscores,whichiscommonlyused.
7
Figure4.3:ThestructureoftheprofileHMM.
4.3ProfileHMMs
AprofileHMMisacertaintypeofHMMwithastructurethatinanaturalwayallowspositiondependentgappenalties.AprofileHMMcanbeobtainedfromamultiplealignmentandcanbeusedforsearchingadatabaseforothermembersofthefamilyinthealignmentverymuchlikestandardprofiles[3].ThestructureofthemodelisshowninFig.4.3.Thebottomlineofstatesarecalledthemainstates,becausetheymodelthecolumnsofthealignment.InthesestatestheprobabilitydistributionisjustthefrequencyoftheaminoacidsornucleotidesasintheabovemodeloftheDNAmotif.Thesecondrowofdiamondshapedstatesarecalledinsertstatesandareusedtomodelhighlyvariableregionsinthealignment.TheyfunctionexactlylikethetopstateinFig.4.1,althoughonemightchoosetouseafixeddistributionofresidues,e.g.theoveralldistributionofaminoacids,insteadofcalculatingthedistributionasintheexampleabove.Thetoplineofcircularstatesarecalleddeletestates.Theseareadifferenttypeofstate,calledasilentornullstate.Theydonotmatchanyresidues,andtheyaretheremerelytomakeitpossibletojumpoveroneormorecolumnsinthealignment,i.e.,tomodelthesituationwhenjustafewofthesequenceshavea‘–’inthemultiplealignmentataposition.Letusturntoanexample.
SupposeyouhaveamultiplealignmentastheoneshowninFig.4.4.Aregionofthisalignmenthasbeenchosentobean‘insertion,’becauseanalignmentofthisregionishighlyuncertain.Therestofthealignment(shadedinthefigure)arethecolumnsthatwillcorrespondtomainstatesinthemodel.Foreachnon-insertcolumnwemakeamainstateandsettheprobabilitiesequaltotheaminoacidfrequencies.Toestimatethetransitionprobabilitieswecounthowmanysequencesusethevarioustransitions,justlikethetransitionprobabilitieswerecalculatedinthefirstexample.ThemodelisshowninFig.4.5.Therearetwotransitionsfromamainstatetoadeletestateshownwithdashedlinesinthefigure,thatfromtothefirstdeletestateandfrommainstate12todeletestate
8
GIPDGGG-GGGGGSLREIKVPEEGDQGGGNGGNEEDDDDEDEEDPDHHDGDNEGGGRGDDWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWLWWWWWWWWWWCWWWWWWMWWLWWWWTWWRNEQKLDEYKLKERREKRKPENEKRRKQETGGGAAAAAAAAAAVAFVVVGGGGGGGAGAGdyqrqrerrrrkqvrrkqeleeedsererrynlrsslssssstndsddvnlicyyirlst.e.....lllll.l.k...e........n.gt.....siavsktktar.r...g..ak..gtndtskstttsntnvlnntngktnynstnkgneggggnrgkgrgyggdrgngrggggggkerqqqrhskrrqqqtnhrqqrkiqreqekqrrietrreeee.eepveqrrkvqvvtkneLGGGGGGGGGGGGGGGGGGGGGGQGGGGGGWDIIFYKYYYYFWLYYYYFDVIIYWWIWYIFFFVIIVVIIVIVIIYIVVFFFFFFFIAIFPPPPPPPPPPPPPPPEPPPPPPPPPPPPPPSGSSFSSSSSSSSLSSSSAGAAKSSASTSANTNKNNNNTNNNNNNGNSATSTVNNNNNNNYYY-FYYYYYFYYFYYYYYYYYFYYYYYYYVVV-VVLVVVVVIVIVVLVVVVVVVVVLVVFigure4.4:Analignmentof30shortaminoacidsequenceschoppedoutofaalignmentoftheSH3domain.TheshadedareasarethemostconservedandwerechosentoberepresentedbythemainstatesintheHMM.Theunshadedareawithlower-caseletterswaschosentoberepresentedbyaninsertstate.
123456789101112131485TSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCTSAGPEDNQKRHWYFLMIVCENDBEGINFigure4.5:AprofileHMMmadefromthealignmentshowninFig.4.4.Transi-tionlineswithnoarrowheadaretransitionsfromlefttoright.Transitionswithprobabilityzeroarenotshown,andthosewithverysmallprobabilityareshownasdashedlines.Transitionsfromaninsertstatetoitselfisnotshown;insteadtheprobabilitytimes100isshowninthediamond.Thenumbersinthecirculardeletestatesarejustpositionnumbers.(ThisfigureandFig.4.6weregeneratedbyaprogramintheSAMpackageofprograms.)
9
123456789101112131433TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC85TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC33TSAGPEDNQKRHWYFLMIVC50ENDBEGINFigure4.6:ModelobtainedinthesamewayasFig.4.5,butusingapseudocountofone.
13.Bothofthesecorrespondtodashesinthealignment.Inbothcasesonlyonesequencehasgaps,sotheprobabilityofthesedeletetransitionsis1/30.Thefourthsequencecontinuesdeletiontotheend,sotheprobabilityofgoingfromdelete13to14is1andfromdelete14totheendisalso1.
4.3.1Pseudocounts
Itisdangeroustoestimateaprobabilitydistributionfromjustafewobservedaminoacids.Ifforinstanceyouhavejusttwosequenceswithleucineatacertainposition,theprobabilityforleucinewouldbe1andtheprobabilitywouldbezeroforallotheraminoacidsatthisposition,althoughitiswellknownthatoneoftenseesforexamplevalinesubstitutedforleucine.Insuchacasetheprobabilityofawholesequencemayeasilybecomezeroifasingleleucineissubstitutedbyavaline,orequivalently,thelog-oddsisminusinfinity.
Thereforeitisimportanttohavesomewayofavoidingthissortofover-fitting,wherestrongconclusionsaredrawnfromverylittleevidence.Themostcom-monmethodistousepseudocounts,whichmeansthatonepretendstohavemorecountsofaminoacidsthanthosefromthedata.Thesimplestistoadd1toallthecounts.Withtheleucineexampleitwouldmeanthattheprobabilityofleucinewouldbeestimatedas323andforthe19otheraminoacidsitwouldbecome123.InFig.4.6amodelisshown,whichwasobtainedfromthealignmentinFig.4.6usingapseudocountof1.
Addingonetoallthecountscanbeinterpretedasassumingapriorithatalltheaminoacidsareequallylikely.However,therearesignificantdifferencesintheoccurrenceofthe20aminoacidsinknownproteinsequences.Therefore,thenextstepistousepseudocountsproportionaltotheobservedfrequenciesoftheaminoacidsinstead.ThisistheminimumlevelofpseudocountstobeusedinanyrealapplicationofHMMs.
Becauseacolumninthealignmentmaycontaininformationaboutthepre-10
ferredtypeofaminoacids,itisalsopossibletousemoresophisticatedpseudo-countstrategies.Ifacolumnconsistspredominantlyofleucine(asabove),onewouldexpectsubstitutionstootherhydrophobicaminoacidstobemoreprobablethansubstitutionstohydrophilicaminoacids.Onecane.g.derivepseudocountsforagivencolumnfromsubstitutionmatrices.SeeSection4.5forreferences.
4.3.2Searchingadatabase
Abovewesawhowtocalculatetheprobabilityofasequenceinthealignmentbymultiplyingalltheprobabilities(oraddingthelog-oddsscores)inthemodelalongthepathfollowedbythatparticularsequence.However,thispathisusuallynotknownforothersequenceswhicharenotpartoftheoriginalalignment,andthenextproblemishowtoscoresuchasequence.Obviously,ifwecanfindapaththroughthemodelwherethenewsequencefitswellinsomesense,thenwecanscorethesequenceasbefore.Weneedto‘align’thesequencetothemodel.Itresemblesverymuchthepairwisealignmentproblem,wheretwosequencesarealignedsothattheyaremostsimilar,andindeedthesametypeofdynamicprogrammingalgorithmcanbeused.
Foraparticularsequence,analignmenttothemodel(orapath)isanassign-mentofstatestoeachresidueinthesequence.Therearemanysuchalignmentsforagivensequence.Forinstanceanalignmentmightbeasfollows.LetuslabeltheaminoacidsinaproteinasA1,A2,A3,etc.SimilarlywecanlabeltheHMMstatesasM1,M2,M3,etc.formatchstates,I1,I2,I3forinsertstates,andsoon.ThenanalignmentcouldhaveA1matchstateM1,A2andA3matchI1,A4matchM2,A5matchM6(afterpassingthroughthreedeletestates),andsoon.Foreachsuchpathwecancalculatetheprobabilityofthesequenceorthelog-oddsscore,andthuswecanfindthebestalignment,i.e.,theonewiththelargestprobability.Althoughthereareanenormousnumberofpossiblealignmentsitcanbedoneeffi-cientlybytheabovementioneddynamicprogrammingalgorithm,whichiscalledtheViterbialgorithm.Thealgorithmalsogivestheprobabilityofthesequenceforthatalignment,andthusascoreisobtained.
Thelog-oddsscorefoundinthismannercanbeusedtosearchdatabasesformembersofthesamefamily.AtypicaldistributionofscoresfromsuchasearchisshowninFig.4.7.Asisalsothecasewithothertypesofsearches,thereisnoclear-cutseparationoftrueandfalsepositives,andoneneedstoinvestigatesomeofthesequencesaroundalog-oddsofzero,andpossiblyincludesomeoftheminthealignmentandtrysearchingagain.
Analternativewayofscoringsequencesistosumtheprobabilitiesofallpos-siblealignmentsofthesequencetothemodel.Thisprobabilitycanbefoundbyasimilaralgorithmcalledtheforwardalgorithm.Thistypeofscoringisnotverycommoninbiologicalsequencecomparison,butitismorenaturalfroma
11
5040number of sequences3020100-40-200204060log-odds80100120140Figure4.7:Thedistributionoflog-oddsscoresfromasearchofSwissprotwithaprofileHMMoftheSH3domain.ThedarkareaofthehistogramrepresentsthesequenceswithanannotatedSH3domain,andthelightthosethatarenotannotatedashavingone.Thisisforillustrativepurposesonly,andthesequenceswithlog-oddsaroundzerowerenotinvestigatedfurther.
probabilisticpointofview.However,itusuallygivesverysimilarresults.
4.3.3Modelestimation
Aspresentedsofar,onemayviewtheprofileHMMsasageneralizationofweightmatricestoincorporateinsertionsanddeletionsinanaturalway.Thereishow-everoneinterestingfeatureofHMMs,whichhasnotbeenaddressedyet.Itispossibletoestimatethemodel,i.e.determinealltheprobabilityparametersofit,fromunalignedsequences.Furthermore,amultiplealignmentofthesequencesisproducedintheprocess.Likemanyothermultiplealignmentmethodsthisisdoneinaniterativemanner.Onestartsoutwithamodelwithmoreorlessrandomprobabilities,orifareasonablealignmentofsomeofthesequencesareavailable,amodelisconstructedfromthisalignment.Then,whenallthesequencesarealignedtothemodel,wecanusethealignmenttoimprovetheprobabilitiesinthemodel.Thesenewprobabilitiesmaythenleadtoaslightlydifferentalignment.Iftheydo,wethenrepeattheprocessandimprovetheprobabilitiesagain.Theprocessisrepeateduntilthealignmentdoesnotchange.Thealignmentofthe
12
sequencestothefinalmodelyieldsamultiplealignment.1
Althoughthisestimationprocesssoundseasy,therearemanyproblemstoconsidertoactuallymakeitworkwell.Oneproblemischoosingtheappropri-atemodellength,whichdeterminesthenumberofinsertsinthefinalalignment.Anothersevereproblemisthattheiterativeprocedurecanconvergetosuboptimalsolutions.Itisnotguaranteedthatitfindstheoptimalmultiplealignment,i.e.themostprobableone.MethodsfordealingwiththeseissuesaredescribedintheliteraturepointedtoinSection4.5.
4.4HMMsforgenefinding
OneabilityofHMMs,whichisnotreallyutilizedinprofileHMMs,istheabilitytomodelgrammar.Manyproblemsinbiologicalsequenceanalysishaveagram-maticalstructure,andeukaryoticgenestructure,whichIwilluseasanexample,isoneofthem.Ifyouconsiderexonsandintronsasthe‘words’inalanguage,thesentencesareoftheformexon-intron-exon-intron...intron-exon.The‘sentences’canneverendwithanintron,atleastifthegenesarecomplete,andanexoncanneverfollowanexonwithoutanintroninbetween.Obviouslythisgrammarisgreatlysimplified,becausethereareseveralotherconstraintsongenestructure,suchastheconstraintthattheexonshavetofittogethertogiveavalidcodingregionaftersplicing.InFig.4.8thestructureofageneisshownwithsomeoftheknownsignalsmarked.
Formallanguagetheoryappliedtobiologicalproblemsisnotanewinvention.InparticularDavidSearls[7]haspromotedthisideaanduseditforgenefinding[8],butmanyothergenefindersuseitimplicitly.FormallytheHMMcanonlyrepresentthesimplestofgrammars,whichiscalledaregulargrammar[7,1],butthatturnsouttobegoodenoughforthegenefindingproblem,andmanyotherproblems.OneoftheproblemsthathasamorecomplicatedgrammarthantheHMMcanhandleistheRNAfoldingproblem,whichisonestepuptheladderofgrammars,becausebasepairingintroducescorrelationsbetweenbasesfarfromeachotherintheRNAsequence.
Iwillherebrieflyoutlinemyownapproachtogenefindingwiththeweightontheprinciplesratherthanonthedetails.
Figure4.8:Thestructureofagenewithsomeoftheimportantsignalsshown.
4.4.1Signalsensors
OnemayapplyanHMMsimilartotheonesalreadydescribeddirectlytomanyofthesignalsinagenestructure.InFig.4.9analignmentisshownofsomesequencesaroundacceptorsitesfromhumanDNA.Ithas19columnsandanHMMwith19states(noinsertordeletestates)canbemadedirectlyfromit.Sincethealignmentisgap-less,theHMMisequivalenttoaweightmatrix.
Thereisoneproblem:inDNAtherearefairlystrongdinuclotidepreferences.Amodelliketheonedescribedtreatsthenucleotidesasindependent,sodinu-cleotidepreferencescannotbecaptured.Thisiseasilyfixedbyhaving16prob-abilityparametersineachstateinsteadof4.IncolumntwowefirstcountalloccurrencesofthefournucleotidesgiventhatthereisanAinthefirstcolumnandnormalizethesefourcounts,sotheybecomeprobabilities.Thisisthecondi-tionalprobabilitythatacertainnucleotideappearsinpositiontwo,giventhatthepreviousonewasA.ThesameisdoneforalltheinstancesofCincolumn1andsimilarlyforGandT.Thisgivesatotalof16probabilitiestobeusedinstatetwooftheHMM.Similarlyforalltheotherstates.Tocalculatetheprobabilityofasequence,sayACTGTC,wejustmultiplytheconditionalprobabilities
PACTGTC
p1A
p2CA
p3TC
p4GT
p5TG
p6CT
14
CTGTTTTCCTATTTAGTATGCTTACATTCTCTCTTCCCGCCTGTGCGGCTCCTTGTATTTTCACCTGTCTTCATGTGTTTTGTAGCCCGTATCACTGTTTATTGTTATGGTCTTTCTCTCGTTTTTTTGTGTGCCTATCTCCTCTCTTTGTCCACTTCTCCTTTTATCCACCCTCCCTTTGTTCTCCTCGCTAACCAGGAGACACAACCCTACCCCCTCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGGGGCGGGACGGGACGGACGAGGGAGTGAACTATTCCTTCCTACGTCGGT
Figure4.9:Examplesofhumanacceptorsites(thesplicesite5’totheexon).Ex-ceptinrarecases,theintronendswithAG,whichhasbeenhighlighted.Includedinthesesequencesare16basesupstreamofthesplicesiteand3basesdownstreamintotheexon.
Herep1istheprobabilityofthefournucleotidesinstate1,p2xyisthecondi-tionalprobabilityinstate2ofnucleotidexgiventhatthepreviousnucleotidewasy,andsoforth.
Astatewithconditionalprobabilitiesiscalledafirstorderstate,becauseitcapturesthefirstordercorrelationsbetweenneighboringnucleotides.Itiseasytoexpandtohigherorder.Asecondorderstatehasprobabilitiesconditionedonthetwopreviousnucleotidesinthesequence,i.e.,probabilitiesoftheformpxyz.Wewillreturntosuchhigherorderstatesbelow.
SmallHMMslikethisareconstructedinexactlythesamewayforothersig-nals:donorsplicesites,theregionsaroundthestartcodons,andtheregionsaroundthestopcodons.
4.4.2Codingregions
Thecodonstructureisthemostimportantfeatureofcodingregions.BasesintripletscanbemodeledwiththreestatesasshowninFig.4.10.Thefigurealsoshowshowthismodelofcodingregionscanbeusedinasimplemodelofanunsplicedgenethatstartswithastartcodon(ATG),thenconsistsofsomenumberofcodons,andendswithastopcodon.
Sinceacodonisthreebaseslong,thelaststateofthecodonmodelmustbeatleastofordertwotocorrectlycapturethecodonstatistics.The64probabilitiesinsuchastateareestimatedbycountingthenumberofeachcodoninasetofknowncodingregions.Thesenumbersarethennormalizedproperly.ForexampletheprobabilitiesderivedfromthecountsofCAA,CAC,CAG,andCATare
pACA
cCAA
cCAA
cCAC
cCAG
cCAT
15
Figure4.10:Top:Amodelofcodingregions,wherestateone,twoandthree
matchthefirst,secondandthirdcodonpositionsrespectively.Acodingregionofanylengthcanmatchthismodel,becauseofthetransitionfromstatethreebacktostateone.Bottom:asimplemodelforunsplicedgeneswiththefirstthreestatesmatchingastartcodon,thenextthreeoftheformshowntotheleft,andthelastthreestatesmatchingastopcodon(onlyoneofthethreepossiblestopcodonsareshown).
pCCApGCApTCA
cCACcCAAcCAGcCAAcCATcCAA
cCACcCACcCAC
cCAGcCAGcCAG
cCATcCATcCAT
wherecxyzisthecountofcodonxyz.
Oneofthecharacteristicsofcodingregionsisthelackofstopcodons.Thatisautomaticallytakencareof,becausepATA,pGTAandpATG,corre-spondingtothethreestopcodonsTAA,TAGandTGA,willautomaticallybecomezero.
Formodelingcodonstatisticsitisnaturaltouseanordinary(zerothorder)stateasthefirststateofthecodonmodelandafirstorderstateforthesecond.However,thereareactuallyalsodependenciesbetweenneighboringcodons,andthereforeonemaywantevenhigherorderstates.Inmyowngenefinder,Iusethreefourthorderstates,whichisinspiredbyGeneMark[9],inwhichsuchmod-elswerefirstintroduced.Technicallyspeaking,suchamodeliscalledaninho-mogeneousMarkovchain,whichcanbeviewedasasub-classofHMMs.
4.4.3Combiningthemodels
Tobeabletodiscovergenes,weneedtocombinethemodelsinawaythatsatis-fiesthegrammarofgenes.Irestrictmyselftocodingregions,i.e.the5’and3’untranslatedregionsofthegenesarenotmodeledandalsopromotersaredisre-garded.
16
Figure4.11:AhiddenMarkovmodelforunsplicedgenes.Inthisdrawingan‘x’
meansastatefornon-codingDNA,anda‘c’astateforcodingDNA.Onlyoneofthethreepossiblestopcodonsareshowninthemodeloftheregionaroundthestopcodon.
Figure4.12:Toallowforsplicinginthreedifferentframesthreeintronmodelsareneeded.Togettheframecorrect‘spacerstates’areaddedbeforeandaftertheintronmodels.
17
First,letusseehowtodoitforunsplicedgenes.Ifweignoregenesthatareverycloselyspacedoroverlaps,amodelcouldlooklikeFig.4.11.Itconsistsofastateforintergenicregions(oforderatleast1),amodelfortheregionaroundthestartcodon,themodelforthecodingregion,andamodelfortheregionaroundthestopcodon.Themodelforthestartcodonregionismadejustliketheacceptormodeldescribedearlier.Itmodelseightbasesupstreamofthestartcodon,2theATGstartcodonitself,andthefirstcodonafterthestart.Similarlyforthestopcodonregion.ThewholemodelisonebigHMM,althoughitwasputtogetherfromsmallindependentHMMs.
Havingsuchamodel,howcanwepredictgenesinasequenceofanonymousDNA?Thatisquitesimple:usetheViterbialgorithmtofindthemostprobablepaththroughthemodel.WhenthispathgoesthroughtheATGstates,astartcodonispredicted,whenitgoesthroughthecodonstatesacodonispredicted,andsoon.
Thismodelmightnotalwayspredictcorrectgenes,butatleastitwillonlypredictsensiblegenesthatobeythegrammaticalrules.Agenewillalwaysstartwithastartcodonandendwithastopcodon,thelengthwillalwaysbedivisibleby3,anditwillnevercontainstopcodonsinthereadingframe,whicharetheminimumrequirementsforunsplicedgenecandidates.
Makingamodelthatconformstotherulesofsplicingisabitmoredifficultthanitmightseematfirst.Thatisbecausesplicingcanhappeninthreedifferentreadingframes,andthereadingframeinoneexonhastofittheoneinthenext.Itturnsoutthatbyusingthreedifferentmodelsofintrons,oneforeachframe,thisispossible.InFig.4.12itisshownhowthesemodelsareaddedtothemodelofcodingregions.
Thetoplineinthemodelisforintronsappearingbetweentwocodons.Ithasthreestates(labeledccc)beforetheintronstartstomatchthelastcodonoftheexon.ThefirsttwostatesoftheintronmodelmatchGT,whichistheconsensussequenceatdonorsites(itisoccasionallyanothersequence,butsuchcasesareignoredhere).ThenextsixstatesmatchesthesixbasesimmediatelyafterGT.Thestatesjustdescribedmodelthedonorsite,andtheprobabilitiesarefoundasitwasdescribedearlierforacceptorsites.Thenfollowsasinglestatetomodeltheinterioroftheintron.Iactuallyusethesameprobabilityparametersinthisstateasinthestatemodelingintergenicregions.Nowfollowstheacceptormodel,whichincludesthreestatestomatchthefirstcodonofthenextexon.
Thenextlineinthemodelisforintronsappearingafterthefirstbaseinacodon.Thedifferencefromthefirstisthatthereisonemorestateforacoding
basebeforetheintronandtwomorestatesaftertheintron.Thisensuresthattheframesfitintwoneighboringexons.Similarlyinthethirdlinefromthetoptherearetwoextracodingstatesbeforetheintronandoneafter,sothatitcanmatchintronsappearingafterthesecondbaseinacodon.
Thereareobviouslymanypossiblevariationsofthemodel.Onecanaddmorestatestothesignalsensors,includemodelsofpromoterelementsanduntranslatedregionsofthegene,andsoforth.
4.5Furtherreading
Ageneralintroductioncanbefoundin[2],andoneaimedmoreatbiologicalreadersin[1].Thefirstapplicationsforsequenceanalysisisprobablyformodel-ingcompositionaldifferencesbetweenvariousDNAtypes[10]andforstudyingthecompositionalstructureofgenomes[11].TheinitialworkonusinghiddenMarkovmodels(HMMs)as‘probabilisticprofiles’ofproteinfamilieswaspre-sentedataconferenceinthespringof1992andinatechnicalreportfromthesameyear,anditwaspublishedin[12,13].Theideawasquicklytakenupbyothers[14,15].Independently,someverysimilarideaswerealsodevelopedin[16,17].Alsothegeneralizedprofiles[18]areverysimilar.
Estimationandmultiplealignmentisdescribedin[13]indetail,andin[19]someofthepracticalmethodsarefurtherdiscussed.Alternativemethodsformodelestimationarepresentedin[14,20].MethodsforscoringsequencesagainstaprofileHMMweregivenin[13],buttheseissueshavemorerecentlybeenad-dressedin[21].Thebasicpseudocountmethodisalsoexplainedin[13],andmoreadvancedmethodsarediscussedin[22,23,24,25,26].
AreviewofprofileHMMscanbefoundin[27],andin[1]profileHMMsarediscussedingreatdetail.Also[28]willundoubtedlycontaingoodmaterialonprofileHMMs.
SomeoftherecentapplicationsofprofileHMMstoproteinsare:detectionoffibronectintypeIIIdomainsinyeast[29],adatabaseofproteindomainfamilies[30],proteintopologyrecognitionfromsecondarystructure[31],andmodelingofaproteinsplicingdomain[32].
Therearetwoprogrampackagesavailablefreeofchargetotheacademiccom-munity.One,developedbySeanEddy,iscalledhmmer(pronounced‘hammer’),andcanbeobtainedfromhisweb-site(http://genome.wustl.edu/eddy/hmm.html).Theotherone,calledSAM(http://www.cse.ucsc.edu/research/compbio/sam.html),wasdevelopedbymyselfandthegroupatUCSantaCruz,anditisnowbeingmaintainedandfurtherdevelopedunderthecommandofRichardHughey.
ThegenefindersketchedaboveiscalledHMMgene.Therearemanydetailsomitted,suchasspecialmethodsforestimationandpredictiondescribedin[33].
19
Itisstillunderdevelopment,anditispossibletofollowthedevelopmentandtestthecurrentversionatthewebsitehttp://www.cbs.dtu.dk/services/HMMgene/.Methodsforautomatedgenefindinggobackalongtime,see[34]forareview.ThefirstHMMbasedgenefinderisprobablyEcoParsedevelopedforE.coli[35].VEIL[36]isarecentHMMbasedgenefinderforforhumangenes.ThemaindifferencefromHMMgeneisthatitdoesnotusehighorderstates(neitherdoesEcoParse),whichmakesgoodmodelingofcodingregionsharder.
Tworecentmethodsuseso-calledgeneralizedHMMs.Genie[37,38,39]combinesneuralnetworksintoanHMM-likemodel,whereasGENSCAN[40]ismoresimilartoHMMgene,butusesadifferentmodeltypeforsplicesite.Also,thegeneralizedHMMcanexplicitlyuseexonlengthdistributions,whichisnotpossibleinastandardHMM.Webpointerstogenefindingcanbefoundathttp://www.cbs.dtu.dk/krogh/genefinding.html.
OtherapplicationsofHMMsrelatedtogenefindingare:detectionofshortproteincodingregionsandanalysisoftranslationinitiationsitesinCyanobac-terium[41,42],characterizationofprokaryoticandeukaryoticpromoters[43],andrecognitionofbranchpoints[44].
Apartfromtheareasmentionedhere,HMMshavebeenusedforpredictionofproteinsecondarystructure[45],modelinganoscillatorypatterninnucleosomes[46],modelingsitedependenceofevolutionaryrates[47],andforincludingevo-lutionaryinformationinproteinsecondarystructureprediction[48].
Acknowledgements
HenrikNielsen,StevenSalzberg,andSteenKnudsenaregratefullyacknowledgedforcommentsonthemanuscript.ThisworkwassupportedbytheDanishNationalResearchFoundation.
References
[1]Durbin,R.M.,Eddy,S.R.,Krogh,A.,andMitchison,G.(1998)Biological
SequenceAnalysisCambridgeUniversityPressToappear.[2]Rabiner,L.R.(1989)Proc.IEEE77,257–286.
[3]Gribskov,M.,McLachlan,A.D.,andEisenberg,D.(1987)Proc.oftheNat.
Acad.ofSciencesoftheU.S.A.84,4355–4358.[4]Staden,R.(1984)NucleicAcidsResearch12,505–519.
[5]Staden,R.(1988)ComputerApplicationsintheBiosciences4,53–60.
20
[6]Bairoch,A.,Bucher,P.,andHofmann,K.(1997)NucleicAcidsResearch
25,217–221.[7]Searls,D.B.(1992)AmericanScientist80,579–591.[8]Dong,S.andSearls,D.B.(1994)Genomics23,540–551.
[9]Borodovsky,M.andMcIninch,J.(1993)ComputersandChemistry17,
123–133.[10]Churchill,G.A.(1989)BullMathBiol51,79–94.
[11]Churchill,G.A.(1992)ComputersandChemistry16,107–115.
[12]Haussler,D.,Krogh,A.,Mian,I.S.,andSj¨olander,K.(1993)Proteinmodel-ingusinghiddenMarkovmodels:AnalysisofglobinsInMudge,T.N.,Mi-lutinovic,V.,andHunter,L.(Eds.),ProceedingsoftheTwenty-SixthAnnual
HawaiiInternationalConferenceonSystemSciencesvolume1pp.792–802LosAlamitos,California.IEEEComputerSocietyPress.[13]Krogh,A.,Brown,M.,Mian,I.S.,Sj¨olander,K.,andHaussler,D.(1994)
JournalofMolecularBiology235,1501–1531.[14]Baldi,P.,Chauvin,Y.,Hunkapiller,T.,andMcClure,M.A.(1994)Proc.of
theNat.Acad.ofSciencesoftheU.S.A.91,1059–1063.[15]Eddy,S.R.,Mitchison,G.,andDurbin,R.(1995)JComputBiol2,9–23.[16]Stultz,C.M.,White,J.V.,andSmith,T.F.(1993)ProteinScience2,305–
314.[17]White,J.V.,Stultz,C.M.,andSmith,T.F.(1994)MathematicalBiosciences
119,35–75.[18]Bucher,P.,Karplus,K.,Moeri,N.,andHofmann,K.(1996)Computersand
Chemistry20,3–24.[19]Hughey,R.andKrogh,A.(1996)CABIOS12,95–107.
[20]Eddy,S.R.(1995)MultiplealignmentusinghiddenMarkovmodels.In
Rawlings,C.,Clark,D.,Altman,R.,Hunter,L.,Lengauer,T.,andWodak,S.(Eds.),Proc.ofThirdInt.Conf.onIntelligentSystemsforMolecularBiologyvolume3pp.114–120MenloPark,CA.AAAIPress.[21]Barrett,C.,Hughey,R.,andKarplus,K.(1997)ComputerApplicationsin
theBiosciences13,191–199.
21
[22]Brown,M.,Hughey,R.,Krogh,A.,Mian,I.S.,Sj¨olander,K.,andHaussler,
D.(1993)UsingDirichletmixturepriorstoderivehiddenMarkovmodelsforproteinfamiliesInHunter,L.,Searls,D.,andShavlik,J.(Eds.),Proc.ofFirstInt.Conf.onIntelligentSystemsforMolecularBiologypp.47–55MenloPark,CA.AAAI/MITPress.[23]Tatusov,R.L.,Altschul,S.F.,andKoonin,E.V.(1994)Proc.oftheNat.
Acad.ofSciencesoftheU.S.A.91,12091–12095.[24]Karplus,K.(1995)Evaluatingregularizersforestimatingdistributionsof
aminoacids.InRawlings,C.,Clark,D.,Altman,R.,Hunter,L.,Lengauer,T.,andWodak,S.(Eds.),Proc.ofThirdInt.Conf.onIntelligentSystemsforMolecularBiologyvolume3pp.188–196MenloPark,CA.AAAIPress.[25]Henikoff,J.G.andHenikoff,S.(1996)ComputerApplicationsintheBio-sciences12,135–143.[26]Sj¨olander,K.,Karplus,K.,Brown,M.,Hughey,R.,Krogh,A.,Mian,I.S.,
andHaussler,D.(1996)CABIOS12,327–345.[27]Eddy,S.R.(1996)CurrentOpinioninStructuralBiology6,361–365.[28]Baldi,P.andBrunak,S.(1998)Bioinformatics-TheMachineLearning
ApproachMITPressCambridgeMAToappear.[29]Bateman,A.andChothia,C.(1997)Curr.Biol.6,1544–1547.
[30]Sonnhammer,E.L.L.,Eddy,S.R.,andDurbin,R.(1997)Proteins28,
405–420.[31]DiFrancesco,V.,Garnier,J.,andMunson,P.J.(1997)JournalofMolecular
Biology267,446–463.[32]Dalgaard,J.Z.,Moser,M.J.,Hughey,R.,andMian,I.S.(1997)Journalof
ComputationalBiology4,193–214.[33]Krogh,A.(1997)TwomethodsforimprovingperformanceofaHMMand
theirapplicationforgenefindingInGaasterland,T.,Karp,P.,Karplus,K.,Ouzounis,C.,Sander,C.,andValencia,A.(Eds.),Proc.ofFifthInt.Conf.onIntelligentSystemsforMolecularBiologypp.179–186MenloPark,CA.AAAIPress.[34]Fickett,J.W.(1996)TrendsGenet.12,316–320.
22
[35]Krogh,A.,Mian,I.S.,andHaussler,D.(1994)NucleicAcidsResearch22,
4768–4778.[36]Henderson,J.,Salzberg,S.,andFasman,K.H.(1997)Findinggenesin
DNAwithahiddenMarkovmodelJournalofComputationalBiology.[37]Kulp,D.,Haussler,D.,Reese,M.G.,andEeckman,F.H.(1996)Agen-eralizedhiddenMarkovmodelfortherecognitionofhumangenesinDNAInStates,D.,Agarwal,P.,Gaasterland,T.,Hunter,L.,andSmith,R.(Eds.),Proc.Conf.onIntelligentSystemsinMolecularBiologypp.134–142MenloPark,CA.AAAIPress.[38]Reese,M.G.,Eeckman,F.H.,Kulp,D.,andHaussler,D.(1997)Improved
splicesitedetectioninGenieInWaterman,M.(Ed.),ProceedingsoftheFirstAnnualInternationalConferenceonComputationalMolecularBiology(RECOMB)NewYork.ACMPress.[39]Kulp,D.,Haussler,D.,Reese,M.G.,andEeckman,F.H.(1997)Integrating
databasehomologyinaprobabilisticgenestructuremodelInAltman,R.B.,Dunker,A.K.,Hunter,L.,andKlein,T.E.(Eds.),ProceedingsofthePacificSymposiumonBiocomputingNewYork.WorldScientific.[40]Burge,C.andKarlin,S.(1997)JournalofMolecularBiology268,78–94.[41]Yada,T.andHirosawa,M.(1996)DNARes.3,355–361.[42]Yada,T.,Sazuka,T.,andHirosawa,M.(1997)DNARes.4,1–7.
[43]Pedersen,A.G.,Baldi,P.,Brunak,S.,andChauvin,Y.(1996)Characteriza-tionofprokaryoticandeukaryoticpromotersusinghiddenMarkovmodelsInProc.ofFourthInt.Conf.onIntelligentSystemsforMolecularBiologypp.182–191MenloPark,CA.AAAIPress.[44]Tolstrup,N.,Rouz´e,P.,andBrunak,S.(1997)NucleicAcidsResearch25,
3159–3164.[45]Asai,K.,Hayamizu,S.,andHanda,K.(1993)ComputerApplicationsinthe
Biosciences9,141–146.[46]Baldi,P.,Brunak,S.,Chauvin,Y.,andKrogh,A.(1996)JournalofMolec-ularBiology263,503–510.[47]Felsenstein,J.andChurchill,G.A.(1996)MolecularBiologicalEvolution
13.
23
[48]Goldman,N.,Thorne,J.L.,andJones,D.T.(1996)JournalofMolecular
Biology263,196–208.
24
因篇幅问题不能全部显示,请点此查看更多更全内容