KenDaigoroYokoyama1.,YangZhang2.,JianMa1,2*
1InstituteforGenomicBiology,UniversityofIllinoisatUrbana-Champaign,Urbana,Illinois,UnitedStatesofAmerica,2DepartmentofBioengineering,UniversityofIllinoisatUrbana-Champaign,Urbana,Illinois,UnitedStatesofAmerica
Abstract
Changesincis-regulatoryelementcompositionthatresultinnovelpatternsofgeneexpressionarethoughttobeamajorcontributortotheevolutionoflineage-specifictraits.Althoughtranscriptionfactorbindingeventsshowsubstantialvariationacrossspecies,mostcomputationalapproachestostudyregulatoryelementsfocusprimarilyuponhighlyconservedsites,andrelyheavilyuponmultiplesequencealignments.However,sequenceconservationbasedapproacheshavelimitedabilitytodetectlineage-specificelementsthatcouldcontributetospecies-specifictraits.Inthispaper,wedescribeanovelframeworkthatutilizesabirth-deathmodeltotracetheevolutionoflineage-specificbindingsiteswithoutrelyingondetailedbase-by-basecross-speciesalignments.OurmodelwasappliedtoanalyzetheevolutionofbindingsitesbasedontheChIP-seqdataforsixtranscriptionfactors(GATA1,SOX2,CTCF,MYC,MAX,ETS1)alongthelineagetowardhumanafterhuman-mousecommonancestor.Weestimatethatasubstantialfractionofbindingsites(,58–79%foreachfactor)inhumanshaveoriginssincethedivergencewithmouse.Over15%ofallbindingsitesareuniquetohominids.Suchelementsareoftenenrichedneargenesassociatedwithspecificpathways,andharbormorecommonSNPsthanolderbindingsitesinthehumangenome.Theseresultssupporttheabilityofourmethodtoidentifylineage-specificregulatoryelementsandhelpunderstandtheirrolesinshapingvariationingeneregulationacrossspecies.
Citation:YokoyamaKD,ZhangY,MaJ(2014)TracingtheEvolutionofLineage-SpecificTranscriptionFactorBindingSitesinaBirth-DeathFramework.PLoSComputBiol10(8):e1003771.doi:10.1371/journal.pcbi.1003771
Editor:ShengZhong,UniversityofCalifornia,SanDiego,UnitedStatesofAmeircaReceivedApril23,2014;AcceptedJune27,2014;PublishedAugust21,2014
Copyright:ß2014Yokoyamaetal.Thisisanopen-accessarticledistributedunderthetermsoftheCreativeCommonsAttributionLicense,whichpermitsunrestricteduse,distribution,andreproductioninanymedium,providedtheoriginalauthorandsourcearecredited.
DataAvailability:Theauthorsconfirmthatalldataunderlyingthefindingsarefullyavailablewithoutrestriction.AllrelevantdataandsourcecodecanbeobtainedfromtheSupportingInformationfilesandtheSupplementaryWebsite(http://bioen-compbio.bioen.illinois.edu/TFBSEvo/).
Funding:ThisworkwassupportedinpartbyNIHgrantHG006464(toJM),NSFgrant1054309(toJM),andNSFgrant1262575(toJM).Thefundershadnoroleinstudydesign,datacollectionandanalysis,decisiontopublish,orpreparationofthemanuscript.CompetingInterests:Theauthorshavedeclaredthatnocompetinginterestsexist.*Email:jianma@illinois.edu
.Theseauthorscontributedequallytothiswork.
Introduction
Changesingeneregulationplayakeyroleintheevolutionofmorphologicaltraits[1–3].Attheleveloftranscription,geneexpressioniscontrolledviatranscriptionfactor(TF)proteinsthatselectivelybindtocis-regulatoryelementsinasequence-specificmanner[2,4].UtilizingchromatinimmunoprecipitationofspecificTFsfollowedbyhigh-throughputsequencing(ChIP-seq),recentstudiesshowedthattheevolutionofthesetranscriptionfactorbindingsites(TFBS)ishighlydynamic,withsitesdifferingagreatdealevenwithinmammals[5–9].
Despitesubstantialexperimentalevidenceforrapiddivergenceofregulatoryprotein-bindingeventsacrossspecies,computationalmodelsdesignedtoanalyzeregulatoryelementsusingcross-speciescomparisonshavefocusedprimarilyupon‘phylogeneticfootprint-ing’approaches,inwhichputativelyfunctionalregulatoryelementsareidentifiedaccordingtosequenceconservation[10–15].Previouscomputationalstudieshaveinferredtheevolutionofregulatoryelementsusing,forexample,theemergenceofnewconservedelementsspecifictoaparticularcladeinthephylogeny[16]orlineage-specificalterationsleadingtoaloss-of-functionphenotype[17,18].Althoughsuchapproacheshavebeenhelpfulinunderstandinglineage-specificregulatoryelementevolution,allinherentlyrelyuponfixedcross-speciesalignments,whicharefrequentlyoflowqualitywithinnon-codingregionsinthegenome[19–21].Previousstudieshaveestimatedthatmorethan15%ofalignedbaseswithinhuman-mousewhole-genomealignmentsareincorrect[22]andtheerrorrateincreaseswhenmorespeciesareinvolved[19].Ancestralreconstruction,whichissensitivetodetailsofthemultiplealignment,isaparticularlychallengingproblemfornon-codingregions[23,24].Asaconsequence,cross-speciescomparisonsofnon-codingsequencesarelimitedintheirabilitytostudyregulatorysequenceevolution,particularlyincaseswheretheelementsareselectedfornoveltyornewly-derived.Suchnewly-derivedregulatoryelementsarenotrare;indeed,analysesusinghumanpopulationvariationdatafromthe1000GenomesProject[25]haveshownthathumangenomiclocationsunderselectionundergoconsiderableturnoverandfrequentlylieoutsidemammalian-conservedregions[26].Yet,systematicidentificationofbindingsitesforspecificTFsandassessmentoftheirconservationandprevalenceusingcross-speciescomparisonsremainsachallengingproblem.
Inthiswork,weintroduceanovelevolutionaryframeworkthroughwhichlineage-specificTFBSscanbeinferredonagenome-widescale.Incontrasttoconservation-basedapproaches[13,16,27],weutilizeabirth-deathmodeltoinferancestralstates
PLOSComputationalBiology|www.ploscompbiol.org1August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
AuthorSummary
Recentexperimentalstudiesshowedthattheevolutionoftranscriptionfactorbindingsites(TFBS)ishighlydynamic,withsitesdifferingagreatdealevenbetweencloselyrelatedmammalianspecies.Despitethesubstantialexperimentalevidenceforrapiddivergenceofregulatoryprotein-bindingeventsacrossspecies,computationalmethodsdesignedtoanalyzeregulatoryelementsevolu-tionhavefocusedprimarilyonphylogeneticfootprintingapproaches,inwhichputativefunctionalregulatoryelementsareidentifiedaccordingtostrongsequenceconservation.Cross-speciescomparisonsofnon-codingsequencesarelimitedintheirabilitytofullyunderstandtheevolutionofregulatorysequences,particularlyincaseswheretheelementsareselectedfornoveltyorspecies-specific.Wehavedevelopedanovelframeworktoreconstructthehistoryoflineage-specificTFBSandshowedthatlargeamountofTFBSinhumanwerebornafterhuman-mousedivergence.Theseelementsalsohavedistinctbiologicalimplicationsascomparedtomoreancientones.Thismethodcanhelpunderstandtherolesoflineage-specificTFBSinshapinggeneregulationacrossdifferentspecies.
ofagivenmotifwithouttheuseofthebase-by-basealignmentdetailsintheunderlyingcross-speciessequencealignment.GainsandlossesofTFBShavebeenexplicitlyusedbothtoimprovecross-speciessequencecomparisonsandtodetectcis-regulatorymodules,althoughsuchmodelsareusuallyframedwithinthecontextofanalignment[21,28].Amoresimilaralignment-freemodelwaspreviouslyusedtomeasuretheoverallrateofTFBScreationalongdifferentlineages[29].Inthiswork,weinsteadappliedourframeworktoinferlineage-specificTFBS,estimatingthebranchoforiginofeachindividualTFBSforsixTFs.WethenstudiedpatternsforTFBSwithdifferentbranchesoforigin,includingtargetgenesofthenewly-derivedsites,relationshipwithwithin-humanvariation,overlapwithtransposableelements,andcell-typespecificityofTF-binding.Ourresultsprovidestrongsupportthatthisnovelmethodcanhelpidentifylineage-specificregulatoryelements,afirststeptowardsunderstandingtheroleofregulatoryelementevolutioninshapingthevariationofgeneregulationacrossspecies.
Results
Overviewoftheprobabilisticframeworkforthebirth-deathevolutionarymodelofTFBS
Ourgoalistodetectlineage-specificratesofTFBSevolutionandthebranchoforiginforindividualTFBS.Here,lineagemeansanyancestralbranchinthephylogenyorabranchleadingtowardanymodernspecies.OurapproachistomodelTFBSevolutionusingabirth-deathframework,inwhichindividualTFBSscanbegained,lost,orconservedwithinagivenlineageduringevolution.TherateofTFBScreation(birthrate)andloss(deathrate)arefirstestimatedfromasetoforthologoussequences,andaresubsequentlyusedtotracetheevolutionaryoriginofindividualTFBSsatthesequencelevel.Thebirthrate(a)foragivenmotifrepresentstheprobabilityatwhichaTFBSappearsatasingleunoccupiedsiteinagivenyearofevolutionarytime.Similarly,thedeathrate(b)representstherateatwhichanexistingTFBSislostperyear.ThemethodconsidersonlyTFmotifcountswithinorthologoussequencesacrossspecies,andthereforedoesnotrequireanaccuratebase-to-basemultiplesequencealignment.
PLOSComputationalBiology|www.ploscompbiol.org
2
ThisframeworkallowsustoreconstructtheancestralstatesforeachTFBSthroughoutthegenome,providingadistributionforthebranchoforiginofthebindingsitesgenome-wide.
Foranysetoforthologoussequencesacrossspeciesandaknownphylogeny,wefirstestimatebirthanddeathratesaccordingtotheobservednumbersofTFmotifoccurrenceswithineachspecies.Suchorthologoussequencescan,forinstance,beobtainedusingagenome-widemultiplespeciesalignment.However,theunderlyingbase-levelalignmentisignoredoncetheorthologoussequencesareobtained,andsubsequentlythemodelconsidersonlythenumberofTFmotifswithineachsequence.Thus,themethodoperatesindependentlyofanydetailswithinthealignmentoncethesequencecorrespondencebetweenspecies(i.e.,orthologs)isobtained.Everynodeinthephylogenyisthenassociatedwitha(random)variableQx,whichrepresentsthenumberofoccurrencesoftheTFBSsatthatnodex.ThevalueofQxisknownforeachleafnodeinthetreeforanygivenorthologset.Birthanddeathratesofagivenmotifcanthenbeestimatedbymaximizingthelikelihoodacrosstheentiredataset,takingintoaccountbothbranchlengthsaswellasthesizeofthesequenceregion(seeMethods).Evolutionaryrateswereestimatedusinganiterativeapproach,butwerefoundtobeextremelyrobustaccordingtotheinitialparametersettings.Oncethebirthanddeathratesareestimatedusingthefulldataset,wecanusetheseratestotracethebranchoforiginofindividualTFBSs.Thiscanbedonebyreconstructingthemostlikelyancestralstateateachnodeofthephylogeny;i.e.,thevalueofQxthatmaximizesthelikelihoodofthedataforeachindividualChIP-seqpeakregion.ThisprovidesthemostlikelynumberofTFmotifoccurrencesateachnode,andallowsustotracethemostlikelybranchoforiginforindividualsite.Theoverallprocedureofourmethodworksasfollows.(1)WeidentifymotifoccurrenceswithinChIP-seqpeakregionsinhumanforagivenTF.(2)Weestimatethelikelihoodforeachancestralnodeinthephylogenygiventhemotifoccurrencesinthedescendantspecies.(3)WedeterminethebranchoforiginfortheTF-boundmotifwithinChIP-seqpeakregions.SeeMethodsandSupplementaryMethodsinTextS1fordetails.
ThemodelframeworkanditsmotivationsareillustratedinFigure1.Figure1Ashowsonescenariowherethebindingsitewasintroducedtothegenomethroughtransposableelements(TEs)insertionfollowedbypointmutation,whichismostlikelybranchoforiginofthissiteunderourmodel.Figure1BshowsanexamplethatourmethodisabletoidentifycasesofTFBSturnoverwithinstationarymodulesthatmightnototherwisebedetectedusinghuman-mouseChIP-seqdatadirectcomparisons.Inthisgenomicregion,thereisahumanGATA1bindingsiteoriginatingontheancestralprimatelineageandaGATA1bindingsitespecifictomouseandrat.AlthoughtheChIP-seqpeaksappearinthesamelocationbetweenhumanandmouse,ourmodelcanpredictsuchlineage-specificevents(whichisalsoreflectedinthecross-speciesalignment).Again,ouralgorithmpredictedthesebranchesoforiginaccuratelywithoutdetailedalignment.
Wenotethatinadditiontotherobustnessoftheparameterestimatesofaandb,thebranchoforiginestimateswerequiterobust,sincetheywerefarmoredependentonthenumberofsitesineachspeciesineachregion(usuallyeither0or1)thantheexactvaluesofaandb.AlthoughthelossofpositionalinformationwouldappeartomaketheapproachinsensitivetosomecasesofTFBSturnover,inwhichthecreationofanewbindingsitecoincideswiththelossofanoldbindingsite,inpracticesuchcasesareonlyproblematicwhenapplyingthemethodtolongbranchesinthephylogeny.Fordenselysampledphylogeniescontainingrelativelyshortbranchlengths,suchturnovereventscanbeinferredaslongasthegainandlossoccurondifferentbranches.
August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
Figure1.Modelframeworkforlineage-specificGATA1bindingsites.MultiplealignmentsareshownfortwoGATA1-boundregionsinhumans.RedandblueboxesinthealignmentcorrespondtoGATA1bindingsites.Phylogeniesillustratethebirth-deathmodelframework,wherethemostlikelynumberofbindingsitesisassignedtoeachancestralnode(denotedhereaseither‘present’or‘absent’,atvalues1or0inthisexample).Highlightedbranchesdenotethebranchoforigin.Evolutionarycomparisonswereconductedacrosstenprimatespecies,aswellas36non-primatevertebrates(notallareshown).(A)AbindingsiteoriginatingwithinanLTRinsertion.(B)AgenomicregioncontainingahumanGATA1bindingsiteoriginatingalongtheancestralprimatelineageandaGATA1bindingsitespecifictomouseandrat.DespitenearlyidenticallocationsoftheChIP-seqpeaksacrosshumanandmouse(inanalogousErythroblastcelllines),theabilityofthemethodtoidentifyspecificbranchesoforiginallowsustoidentifycasesofTFBSturnoverincloseproximity.doi:10.1371/journal.pcbi.1003771.g001
PLOSComputationalBiology|www.ploscompbiol.org3August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
Thisisusuallythecase,sinceoldbindingsitesareseldomlostthroughselectionandaregenerallylostslowly,followinganearly-neutralrateofdecay[29,30].Manyneutrally(ornear-neutrally)evolvingsitesarestillpresentafterrelativelylongperiodsofmammalianevolution[29].
aswellastheresultsfromtheMEMEsuite[38](SupplementaryMethodsinTextS1andTableS1).
SubstantialnumberofhumanTFBSshaverecentoriginsafterthehuman-mousedivergence
Usingourapproach,wesoughttodeterminethebranchoforiginforeachhumanbindingsiteforthesixTFs.Eachbindingsitewasthuseitherinferredtobepresentinthecommonhuman-mouseancestor,oramorerecentlineageleadingtohumanusingthephylogenyshowninFigure1.ThedistributionofthebranchoforiginforeachTFBSisshowninFigure2.Notably,between,58–79%ofallhumanTFBSshadinferredoriginsafterthehuman-mousesplit.
Inadditiontoestimatingthefractionofancestralbindingsitespresentinthehuman-mousecommonancestor,ourapproachallowsustoestimatetheagedistributionacrossallbindingsiteoccurrences.AsshowninFigure2(leftpanel),asizeablefractionofbindingsitesinhumanswereestimatedtohaverecentoriginsinprimates.Forinstance,thefractionofhumanbindingsitesthatareuniquetohominidsrangedfrom16.1%forETS1to31.1%forMYC.Additionally,thenumberofsitesestimatedtobehuman-specific,i.e.,withrecentoriginsafterthehuman-chimpdiver-gence,rangedfrom3.5%to10%.
Sincethetotalnumberofprotein-boundsitesforeachfactorisquitelarge,thesefractionsrepresentaconsiderablenumberofsitesuniquetohumansorclosely-relatedprimatelineages.Thenumberofhumanbindingsitesoriginatingsincethecommonhominidancestorrangedfrom1,084to3,931foreachfactor,including440to1,409human-specificbindingsites.TherateofappearancefornewsitesalongdifferentancestrallineagesshowedarelativelystablerateofcreationfornewbindingsitesformostTFsupthroughthecommonSimian-primateancestor(Figure2,rightpanel),withslightincreaseinmorerecentbranches.Thebirthrateofnewsitesrangesfrom50–300permillionyearsformostTFs.Recentworkshaveemphasizedtheemergenceofnewregulatoryelementsviatransposableelements(TEs)[7,39–41].Weassessedtheamountofoverlapbetweennewly-derivedTFBSsandannotatedTEsdeterminedbyRepeatMasker[42].Consistentwithpreviousreports[7,43],asubstantialfractionofbindingsiteswithrecentoriginwerelocatedwithinTEs.ThenumberofTE-derivedsitesvariedbetweenfactors,rangingfromapproximately35–40%forrecently-derivedSOX2andGATA1bindingsitestoapproximately15–20%fornewly-derivedETS1andMYCbindingsites.TE-derivedTFBSsofspecificbranchoforiginswereassociatedwithdifferentTEfamilieswithdifferenttimesoforigin(SupplementaryResultsinTextS1andFigureS1).
LargenumberofTFBSembeddedinChIP-seqpeaksofaparticularTFexhibitincreasedevolutionaryratesalonglineagesleadingtohuman
Inthiswork,weappliedourmethodtoChIP-seqdata,whichisnowcommonlyusedtomapinvivoTFoccupancygenome-wide[31].WeappliedourmethodtoChIP-seqdatasetsforsixTFs,namelyGATA1,SOX2,MYC,MAX,ETS1,andCTCF[32–35].TheseTFswerechosen,inpart,fortheirdiversefunctionalattributes,theirwell-documentedbindingmotifs,andtheavail-abilityofChIP-seqdatainanalogouscelltypesinhumanandmouse.Usingourmethod,wecandeterminecasesinwhichtherearelineage-specificdifferencesinevolutionaryratesofagivenmotifalongaparticularbranchinthephylogeny.SincepreviouscomparisonsofChIP-seqdatafromhumanandmousehavereportedsubstantialdivergenceinprotein-bindinglocationsacrossthetwospecies[5,6],ChIP-seqpeaksinhumanarelikelytocontainahighenrichmentofTFBSscomparedtotheorthologousregionsinmoredistantly-relatedspecies.WethushypothesizedthatfunctionalmotifspresentamongChIP-seqpeakregionsmightbedetectablebytestingforanincreasedbirthrateaalonglineagesancestraltohumansrelativetootherlineages,sinceanyrecently-acquiredTFBSsinhumanswouldnaturallyincreasethebirthratealongtheselineages.
Todeterminedifferencesintherateofmotifevolutionalongspecificlineages,wefirstassumeasimple(null)modelinwhichthebirthanddeathrates(aandb)remainconstantacrosstheentirephylogeny.Wecanthencomparethishypothesistoamodelinwhichbirthanddeathratesdifferalongasinglebranchofthephylogenyrelativetotheotherbranches.Thestatisticalsignifi-canceoflineage-specificevolutionaryratescanthenbeassessedusingalikelihood-ratiotest[36],producingaP-valuereflectingthesignificanceoflineage-specificdifferencesinevolutionaryratesalongthatbranch(SupplementaryMethodsinTextS1).
WeappliedthisapproachtohumanChIP-seqdatageneratedforthesixTFs,testingforincreasedbirthrateswithinthe(2100,+100)regionrelativetothesummitofthepeaks.Orthologousregionswerethendeterminedusing46-waymultizalignmentsfromtheUCSCGenomeBrowser[37],andanalyseswereconductedusingdatafromall46vertebratelineagesaccordingtotheirknownphylogeny.ForeveryTF,withtheexceptionofMYC,theknownbindingmotifofTFwaspredictedwithasubstantiallyincreasedbirthratealongbranchesancestraltohumans(P,1e-15).Wenotethatincontrasttomotifpredictionusingconservation-basedapproaches,ourmethodgeneratesmotifpredictionsspecificallyusinglineage-specificbindingsites(orrather,theirincreasedrateofcreationalongaspecificlineage).Forfiveofthesixfactors(GATA1,SOX2,MAX,CTCF,andETS1),thedocumentedbindingmotifoftheTFproducedthemoststatisticallysignificantmotifpredictionusingourmethod.TheMYCbindingmotif,whichhaspreviouslybeennotedforitsstrongpatternsofconservation[27],wastheonlyfactorwhosebindingmotifwasnotthetop-rankedprediction,althoughitwasstillpredictedundertheP,1e-15threshold.Foreachfactor,weusedaniterativemethodtogenerateaPositionWeightMatrix(PWM)accordingtothenucleotidecompositionateachsiteofthemotifwithinthe(2100,+100)windowofpeaksinhumans.ThesepredictedPWMsverywellmatchedwiththeknownbindingmotifs
PLOSComputationalBiology|www.ploscompbiol.org
4
TFBSoriginestimatescorrespondtohuman-mouseChIP-seqdataandcross-speciesalignments
Toassesstheaccuracyoftheageestimates,wefirstcomparedourresultstoChIP-seqdatafromhumanandmouse.Usinganalogouscelltypesacrossspecies,wedeterminedtheamountofoverlapbetweenhumanChIP-seqpeaksandChIP-seqpeaksintheorthologousregionsinmouse.AhumanChIP-seqpeakwasconsideredtobe‘shared’withmouseifitssummitwaswithin200bpofamouseChIP-seqpeaksummitintheorthologousregion(notethatthemouseChIP-seqdatawerenottheinputouralgorithm).Theamountofoverlapwasassessedseparatelyforregionscontainingahumanbindingsitepresentinthecommonhuman-mouseancestorandforregionsthatarenotancestral.Weemphasizethat,asillustratedpreviouslyinFigure1B,ChIP-seqpeakssharedacrosshumanandmousecanoftencontainTFBSsthataregenuinelylineage-specific,sinceChIP-seqpeaks
August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
Figure2.TimeoforiginsforbindingsitesofsixTFsinhumans.BindingmotifsweredeterminedusinghumanChIP-seqdataforGATA1,SOX2,MYC,CTCF,ETS1,andMAX.Thebranchoforiginwasdeterminedforeachbindingsitewithinthe(2100,+100)regionrelativetoahumanChIP-seqpeaksummit.(Left)Distributionofthebranchoforiginforeachbindingsite.BranchlabelscorrespondtothoseinFigure1B.‘Ancestral’bindingsiteshaveoriginspriortohuman-mousedivergence.(Right)Therateofbindingsitecreationalongbranchesancestraltohumans.Rateswereestimatedbydividingthenumberofsitesoriginatingalongeachbranchbyevolutionarytime,includingonlybindingsitescurrentlyexistinginhumans.
doi:10.1371/journal.pcbi.1003771.g002
spanarelativelybroadregionandcancontaininstancesofTFBSturnoverwithinstaticmodules.Inaddition,human-specificChIP-seqpeakscanalsocontainancestralbindingsites,sincesuchsitescaneitherbelost(non-conserved)alongthemouselineageormaynotbeboundbytheTFalongthatlineage.
Table1showstheamountofoverlapinChIP-seqpeaksbetweenhumanandmouseaccordingtotheestimatedbranchoforiginoftheTFBSs.HumanpeakscontainingpredictedancestralTFBSswerefarmorelikelytooverlapwithboundregionsinmousethanpeakscontainingonlypredictedlineage-specificsites.Between24–41%ofhumanpeaksthatoverlappedwithapeakforthesameTFinmousecontainedonlypredictedlineage-specificTFBSs,while59–76%ofsharedpeakscontainedapredictedancestralTFBS.Thus,therewasaclearenrichmentforTFBSspredictedtobeancestralamongtheChIP-seqpeakssharedbetweenhumanandmouse.Amonghuman-specificChIP-seqpeaks,asubstantiallygreaternumbercontainedonlylineage-specificTFBSsthansitespredictedtobeancestraltohumanandmouse.
AlthougharelativelysizeableportionofsharedChIP-seqpeakscontainedonlyTFBSspredictedtobelineage-specific,inthe
PLOSComputationalBiology|www.ploscompbiol.org
5
majorityofcases(.90%)themouseTFBSdidnotoccurwithininsequenceregionorthologoustothehumanpeakregionused,butwasinsteadoffsettoanon-overlappingregionwithinamousepeak.VeryfewoftheseTFBSsactuallyalignedacrossthetwospecies,comparedwiththosewithpredictedancestralorigin.WecomparedthesequencelevelconservationofpredictedTFBSsaccordingtotheirbranchoforigins.WeusedthePhyloPmammalianconservationscores[44]availableattheUCSCGenomeBrowsertodeterminetheconservationlevelforTFBSinhuman.ForaspecificTF,wefirstcomputedtheaveragePhyloPscore(X)ineachChIP-seqpeakandthencalculatedtheaveragescore(M)aswellasstandarddeviation(SD)acrossallpeaksinthegenome.Wethengroupedthebindingsitesaccordingtotheirbranchoforigin(infourgroups:Hominid-specific,Simian-specific,Primate-specific,andEutherian-specific)andcalculatedtheaveragePhyloPscore(X).Finally,wecalculatedtheZ-score,i.e.(X-M)/SD).Asexpected,olderbindingregionsshowhighersequencelevelconservationthanyoungerones(FigureS2).Theseresultssuggestthatourmethodcanidentifymorerecent,less-conservedTFBS,withoutrelyingonsequence-levelconservation.
August2014|Volume10|Issue8|e1003771
Table1.Human-mouseChIP-seqfactor-boundregionoverlap.Sharedpeaks(human-mouse)bAncestralSitesd4338834340123854061116442013479644172812008766256(8.7%)73(25.9%)277(9.4%)(2.5%)(68.0%)947(32.0%)(9.3%)9(1.0%)(19.6%)35(4.0%)(73.5%)232(26.5%)(13.1%)19(3.2%)(22.3%)49(8.1%)20912538857052833829770231(69.8%)182(30.2%)1167(9,2%)16(2.3%)55(16.0%)39(5.6%)109(58.5%)288(41.5%)704(18.9%)7(1.6%)747(3.8%)(26.4%)(4.1%)(2.1%)(21.2%)(3.8%)(2.3%)(48.1%)(8.7%)(3.5%)(41.2%)(8.3%)(2.5%)(27.4%)26(5.8%)2242(11.2%)(75.7%)109(24.3%)9451(47.4%)(5.8%)11(1.9%)358(1.8%)(14.9%)24(4.1%)985(5.0%)6239310487100913819669338435026966418928358545832342(73.3%)158(26.7%)6921(34.9%)12914Lineage-specificSiteseAncestralSitesdLineage-specificpeakscLineage-specificSitese(65.1%)(3.1%)(0.5%)(52.6%)(5.1%)(0.7%)(73.6%)(3.5%)(1.4%)(78.9%)(4.9%)(1.2%))(51.9%)(3.5%)(0.7%)(58.8%)(3.5%)(0.5%)Factor/MotifCategoryaGATA1Human(Total)AGATAAGWithMouseTFBS-pairalignedSOX2Human(Total)PLOSComputationalBiology|www.ploscompbiol.org
WTAACAAWithMouseTFBS-pairalignedMYCHuman(Total)KCACGTGWithMouseTFBS-pairalignedMAXHuman(Total)6
KCACGTGWithMouseTFBS-pairalignedETS1Human(Total)MGGAAGTWithMouseTFBS-pairalignedCTCFHuman(Total)GGGGCKCWithMouseTFBS-pairalignedAugust2014|Volume10|Issue8|e1003771
ThefirstrowineachsectiongivesthetotalnumberofChIP-seqpeakswithbindingsitesinhumanswithinthe(2100,+100)windowseparatedintocategories.ThesecondrowshowsthenumberofthesepeaksalsocontainingaTFBSintheorthologousregionsinmouse,whilethethirdrowgivesthenumberofalignedbindingsitesacrossthetwospecies.Percentagesaregivenwithrespecttothetotalnumberofsharedandlineage-specificChIP-seqpeaksforeachfactor.bSharedpeaksarehumanChIP-seqpeakswithin200bpofaChIP-seqpeaksummitintheorthologousregioninmouse.Analogouscelltypeswereusedacrossspecies(GATA1:Erythroblasts,SOX2:Embryonicstemcells,MYC,MAX,ETS1,CTCF:B-lymphocytes).cLineage-specificpeaksinhumanarenotwithin200bpofamouseChIP-seqpeakintheanalogouscelltype.Onlyhumanpeakswithidentifiableorthologousregionsinmousewereincluded.dThenumbers(andfractions)ofhumanChIP-seqpeaksineachcategorycontainingbindingmotifoccurrencesestimatedtobepresentinthehuman-mouseancestor(‘Ancestralsites’).eThenumbers(andfractions)ofhumanChIP-seqpeaksineachcategorycontainingonlybindingmotifoccurrencesoriginatingafterhuman-mousedivergence(‘Lineage-specificsites’).doi:10.1371/journal.pcbi.1003771.t001aLineage-SpecificBindingSitesEvolution
Lineage-SpecificBindingSitesEvolution
Additionally,tofurtherdemonstratetheeffectivenessofourmethodinidentifyingconservedTFBS,wedirectlycomparedwithmethodsthatusephylogeneticfootprintingapproaches.Wecomparedwithphylogeneticfootprintingmethodsatbothelementlevel(usingMotifMap[45]whichisbasedonthemethodusedin[46–48])andmodulelevel(usingPReMod[12]).Overall,ourmethodoutperformedbothMotifMapandPReMod(seeSupple-mentaryResultsinTextS1,TableS3,andFigureS5).
receptorclusteringandbinding.Hominid-specificbindingsitesforGATA1,mostcommonlyknownforitsroleinerythroiddifferentiation[51],werealsofoundenrichedneargenesinvolvedinaxonextensionofneuralcells.ForETS1,hominid-specificbindingsiteswereneargenesinvolvedinspinalcordneurondifferentiation,ventralspinalcorddevelopment,andbehavioralfearresponse.Wealsofoundthatthehominid-specificsitesareneargenesindifferentpathwaysascomparedtoSimian-specificsitesandmoreancestralsites(TableS4andTableS5).
Within-speciesvariationishigheramongTFBSsofmorerecentorigin
Recentworkhasreportedasubstantialdifferencebetweengenomiclocationsthatareconservedacrossspeciesversusthoseconservedwithinthehumanpopulation[26].Thus,wecomparedbothhumanvariationdataaswellassequenceconservationacrossprimatestotherelativeageoftheTFBSs.ComparingtheoverallfrequencyofcommonSNPsinhumansamongTFBSsoriginatingatdifferenttimesofevolutionshowedthatasubstantialfractionofhuman-specificTFBSscontainedcommonSNPs,comprisingover6%ofallhuman-specificTFBSs(Figure3).ThisismuchhigherthanthetotalfractionofTFBSsoverlappingwithacommonSNP,atamedianoflessthan3%acrossallsixfactors.
SincesubstantialvariationexistsinTF-bindingeventsbetweenhumanindividuals[9],thishighamountofvariationamonghuman-specificbindingsitesmaypartiallyreflectthefactthatsomeTFBSsinferredtobehuman-specificmaynotbesharedbytheentirehumanpopulation.However,recently-derivedTFBSsinhominidswerealsosubstantiallyenrichedforcommonSNPs,evenwhenexcludinghuman-specificTFBSs.Forinstance,amonghominid-specificbindingsitesthatarenothuman-specific,withamedianofalmost4%ofallsites.Asthesesitesaresharedacrossspecies,theycannotbefullyexplainedbyvariationwithinthepopulation.Incontrast,commonSNPswereconsistentlylowamongTFBSswithoriginspriortohominids(Figure3).NotethatthisobservationwasnotbiasedbytheSNPdensitysurroundingthebindingsites(FigureS3).
BranchesoforiginofCTCFbindingsitesdifferbetweencelltypespecificandubiquitouslyboundsites
ChIP-seqdataofCTCFisavailableforseveraldifferentcelltypes,andthuswesoughttodeterminewhetherCTCF-boundsiteshavedistinctagedistributionsaccordingtocelltype.WethusexpandedouranalysestoincludeChIP-seqdataforCTCFcollectedfromfourdifferentcelltypesinhumans:B-lymphocytes(GM12878),embryonicstemcells(H1hESC),cerebellum(HAC),andkidneycells(HRE).Interestingly,weobservedsubstantialdifferencesintheagedistributionofcell-typespecificandubiquitously-boundsites.Figure4showstheagedistributionforCTCF-boundsites,separatedaccordingtothenumberofcelltypesinwhicheachsitewasbound.Only43%ofallsitesboundbyCTCFinallfourcelltypeswereprimate-specificand10%werespecifictohominids.Forcell-specificsites,i.e.,thoseboundbyCTCFinonlyonecell-type,thesefractionsincreasedto63%and24%,respectively(Figure4).
Theseresultsareconsistentwithpreviousobservationsthatcell-typespecificCTCFbindingsitesarelessconservedacrossmammalianspeciesthanubiquitously-boundsites[39,52].How-ever,therecentoriginofthecell-typespecificTFBSssuggeststhatthiscannotsimplybeexplainedbyarelativelackofselectivepressure,sincecell-specificCTCFbindingsiteswerefrequentlyfoundtobeabsentinolderlineages.Instead,thereexiststhepossibilitythatcelltypespecificTFBSsmightcontributetolineage-specificregulatoryfunction.
Hominid-specificbindingsitestargetspecificbiologicalprocesses
Todeterminepotentialfunctionsforthenewlyderivedbindingsites,wetestedwhethergenespredictedtobetargetedbybindingsiteswithrecentoriginsinhominidswereinvolvedinspecificbiologicalprocessesorpathways.Suchenrichmentwasdeter-minedforgenesnearhominid-specificbindingsitescomparedtothetotallistofprotein-boundsitesforeachfactor,whereeachTFBSwasmappedtothenearestTSS,uptoadistanceof100kb.Thisallowedustoassesspotentiallineage-specificfunctionsofthesesitesrelativetositesofmoreancientorigin.
Geneslocatednearesttohominid-specificbindingsitesweremorefrequentlyenrichedforneuralandsensory-relatedfunctions,andwereinmanycasesinvolvedinneurologicalpathways(TableS2).CTCF,MYC,andSOX2targetgenesetswereallenrichedforGOcategoriesinvolvedinsensoryperception,whileGATA1,MYC,ETS1,andMAXwereenrichedforneuraldevelopmentanddifferentiationcategories.Amongthesixfactors,neural-relatedfunctionsareonlywell-documentedforSOX2,whichisinvolvedinneuronal-cellmaintenance[49,50]andwhosehominid-specifictargetsitesareenrichedgenesinvolvedinsensoryperception.Similarly,genesinproximitytohominid-specificbindingsitesforCTCFandMYCwereenrichedforsensoryperceptionprocessesandpathways,particularlythoserelatedtoolfaction,andinthecaseforMYC,hominid-specifictargetgeneswerealsoenrichedforgenesinvolvedinsynapseassemblyand
PLOSComputationalBiology|www.ploscompbiol.org
7
OurmethodidentifiedaTFBSturnovereventwithinafunctionallyconservedenhancer
Usingourframework,wethenutilizedgenome-widechromatindatatosearchforpotentialfunctionalconsequencesdrivenbybirthordeathofspecificlineage-specificTFBS.Weintersectedthelineage-specificTFBSswithpredictedhumanenhancerregionsmarkedbyChromHMMmodel[53]aswellasinvivoverifiedenhancerslistedintheVISTAEnhancerBrowser[54].Figure5showsapotentialfunctionaltake-overthroughTFBSturnoverinsideanenhancerafterhuman-mousedivergence.Atthesequencelevel,twoMAXbindingsiteswereidentifiedbyourmethodwithanancestraloneandaprimate-specificbindingsiteemergingafterhuman-bushbabysplit(Figure5).HerethesetwoMAXbindingsitesarealsoMYCbindingsitessinceMAXandMYChaveverysimilarmotif(theirChIP-seqpeaksoverlapinFigure5).Theorthologousregionofpredictedprimate-specificMAX/MYCbindingsitehasnoMAXorMYCChIP-seqsignalatallinmouse,whichisconsistentwithourlineage-specificprediction.SincetheyoungMAX/MYCbindingsiteonlylocates1,700bpupstreamoftheancestraloneandChIP-seqintensityofancestralbindingsiteismuchweakerinhumancomparedtomouse,thisislikelytobeaturnoverofMAX/MYCbindingsitewithintheenhancer.Thenweaskedwhetherthefunctionofpredictedenhancerwasconservedbetweenhumanandmouse.Interestingly,despitethepotentialturnoverofMAX/MYCbindingsiteinthesequencelevel,themouseorthologousregion
August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
Figure3.Within-speciesvariationofbindingsitesaccordingtotimeoforigin.BoxplotsshowthefractionofTFBSscontainingcommonSNPsinhumans[72],whereplotsshowthemedian(centerline),upper-andlower-quartile(boxes),andrange(whiskerextremes)ofpercentagesacrosstheTFBSsofsixTFs.TFBSsarecategorizedashuman-specific,hominid-specific(notincludinghuman-specificsites),Simianprimate-specific(notincludinghominid-specificsites),andancestral(presentinthehuman-mousecommonancestor).Overallfractions(includingallsites)areshownintheleft-mostboxplot.Notethesubstantialriseintheamountofhumanvariationwithinmorerecentlyderivedbindingsitescomparedtooldersites.doi:10.1371/journal.pcbi.1003771.g003
PLOSComputationalBiology|www.ploscompbiol.org8August2014|Volume10|Issue8|e1003771
PLOSComputationalBiology|www.ploscompbiol.org9Lineage-SpecificBindingSitesEvolution
August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
Figure4.TimeoforiginforhumanCTCFbindingsitesaccordingtocell-specificity.CTCFbindingsitesinhumanswereseparatedaccordingtocell-specificity,consideringfourdistinctcelllines(GM12878,H1hESC,HAC,andHRE).Coloredbarscorrespondtovaryingamountsofcell-specificity,denotingsitesboundinone,two,three,orinallfourcelltypes(redtodarkbluebars,respectively).Notethetendencyforcell-specificbindingsitestohavemorerecentevolutionaryoriginsthansitesboundubiquitouslyinallcelltypes.doi:10.1371/journal.pcbi.1003771.g004
ofpredictedenhancerwasfoundtodrivereproducibleLacZexpressioninE11.5mousebloodcellasdemonstratedbyinvivotransgenicmouseembryosessaybasedonVISTAEnhancerBrowser,whichconfirmsthatthepredictedenhanceralsofunctionsasanenhancerinmouse.ItcertainlyremainstobesolvedwhetherthisenhancerregulatesthesamegenesinhumanandmouseandwhytheChIP-seqsignalonancientMAX/MYCbindingsiteismuchweakerthantheyoungeroneinhuman.Nevertheless,thisexampledemonstratestheabilityofourmethodtocomparefunctionalleveldynamicswithsequenceleveldifferenceinanevolutionaryframework.
Discussion
Theapproachpresentedhererepresentsaninitialsteptowardsunderstandingregulatoryelementevolutionusinginsilicomethodswithoutrelyingonaccuratecross-speciesalignments.StudiesregardingtheevolutionofTFBSsarelargelyseparableintothosethatemphasizecross-speciesconservationofregulatoryelements[13,27,55–57]andstudieshighlightingthesubstantialdivergenceoftranscriptionfactor-bindingeventsacrossspecies[5–7,58–60].Tosomeextent,thisdichotomymaylargelyreflectdifferencesbetweenanalysesconductedinsilicoandexperiment-basedstudies.Althoughcomputationalapproacheshavehadsomesuccessinidentifyingcis-regulatoryalterationsresponsibleforchangesinphenotype[16,17],thestudyofregulatorysequenceevolutionislimitedbyrelianceuponmultiplesequencealignments[20,21].Reconstructingtheancestralstatesofregulatorysequenc-esisaparticularlychallengingproblem;comparativestudiesofregulatoryelementsgenerallycategorizesitesas‘conserved’and‘non-conserved’intermsoftheirpresenceacrossspecies[34,39,61].‘Non-conserved’sitesarethusoftenassumedtobeunderaweakeramountofpurifyingselection,indicatingarelativelackoffunction[62].However,anyinterpretationoftheresultsisobscuredbythefactthatnodistinctioncanbemadebetweenancestralsitesthathavelaterbeenlost,versussitesofrecentorigins.Newly-derivedfunctionalelements,whicharealso‘non-conserved’bythecommondefinition,mayinduceagain-of-functiontrait,harboringthepotentialforlineage-specificadapta-tionorpositiveselection.Ithaslongbeenarguedthatalterationsinregulatoryfunctionareresponsibleformany,ifnotmost,species-specifictraits[1–3],anditisindeedtheseelementsthatarelikelytocontributetophenotypicadaptationandthevariationseenacrossspecies.
Inthiscontext,thehighfractionofTFBSsthathaverecentoriginsafterhuman-mousedivergenceisparticularlynotable.Forallsixfactorsanalyzed,themajorityofhumanTFBSsboundinvivowereoriginallyabsentinhuman-mousecommonancestor,whichisconsistentwithpreviouscross-speciescomparisonsnotingsubstantialdivergenceinChIP-seqprotein-bindingeventsacrossthetwospecies[5,6]andsimilarcomparisonspresentedhere(Table1),andisalsocomparabletodetailedanalysesconductedinDrosophilausingalternativeapproaches[57].ThisdoesnotappeartosimplybeaconsequenceofthespecificselectionofTFs,sincebindingmotifsforseveralfactorsanalyzed(CTCF,ETS1,MYC,MAX)havebeenpreviouslydocumentedas‘highlyconserved’comparedtomotifsofotherfactors[27,39].Inparticular,acomprehensivescanforconservedmotifsidentified
PLOSComputationalBiology|www.ploscompbiol.org
10
thebindingmotifsofMYCandETS1asthesecondandthird-highestrankingmotifsacrossallmotifsintermsofconservationscoreacrosshuman,mouse,rat,anddog(wheretheETS1bindingmotifwasdenotedastheELK1motif)[27].Itisimportanttonotethat‘phylogeneticfootprinting’approachesusuallymeasuremotifconservationrelativetoneutrallyevolvingelementsinnon-codingregions.Thefactthatsuchmotifsaresubstantiallymoreconservedcomparedtoaneutralproxydoesnotmeanthatthemajorityofbindingsitesareconserved,nordoesitimplythatprotein-boundsitesconsideredcollectivelyacrossthegenomearemoreconservedthancodingsequencesunderrelativelyweakselection.Sincesomeofthemost‘highlyconserved’regulatorymotifsarelargelycomprisedbysiteswithrecentorigins,itisunlikelythatthisbirthofnewbindingsitesissimplyaspecialpropertyofahandfulofTFs,butinsteadislikelytoapplyacrossmanyotherfactors.Nevertheless,someanalysischallengesremain.Forexample,itisacknowledgedthatnotallcomputationallypredictedTFBSswithinChIP-seqpeakswereactivelyboundbytheTFandmanyChIP-seqpeaksmaycorrespondtoexperimentalnoise.WhileresearchershavecontinuouslyimprovedTFBSidentification,high-throughputexperimentalapproacheswouldbenecessarytosystematicallyvalidatebindingsitepredictions,especiallyforlineage-specificones.
AmongTFBSsinhumans,aconsiderableamountofthemareuniquetohominidsandareevenhuman-specific.Sincethereareanestimated,1700–1900TFsinthehumangenome[63],ifevenasmallfractionofthesesitesharborimportantregulatorypotential,thetotalnumberofhuman-specificfunctionalbindingsitesgenome-wideisquitelarge.AlthoughnotallTFBSswithrecentoriginswillberesponsibleforlineage-specifictraits,theseresultsnonethelessofferthepotentialtounderstandadaptiveevolutionofgeneregulationviathecreationofnewTFBSs,notonlyalone,butalsoincombination.Inaddition,anumberofrecentstudieshavehighlighteddifferencesintranscriptionfactorbindingwithinhumans[9,64],andourresultssuggestthatsequence-levelvariationofTFBSswithinthepopulationmaybemorecommonamongmorerecently-derivedbindingsites.
Thefunctionalcategoriesofgenesclosetorecentlyderivedbindingsitesmayservetoshednewlightuponadaptivetraitsobtainedalongthelineageleadingtohuman.Itisinterestingthat,despitesubstantialdifferencesinthebiologicalfunctionsgenerallyassociatedwiththesixTFsanalyzed,genesnearbindingsiteswithrecentoriginsareenrichedforseveralsensoryandneuralrelatedpathwaysandprocesses.WenotethatsincetheChIP-seqdataweusedinthisstudywerenotderivedfromneuronalcells,furtherstudyisneededtomorecomprehensivelyunderstandtherolesofhominid-specificsitesinspecificcelltypes.Nevertheless,identify-ingsuchlineage-specificregulatoryelementsnotonlyprovidespotentialinsightonhumanbiology,butmayalsoprovidenewknowledgeonthemolecularmechanismsofhumandiseases.AnaturalfuturedirectionforthisworkwouldbetodeterminethespecificregulatoryeffectsoftherecentlyderivedTFBSsidentifiedusingthismethod.Forinstance,enrichmentforwithin-speciesvariationamongrecentlyderivedbindingsitesraisestheintriguingpossibilitythatrecentlyderivedTFBSsmostresponsibleforphenotypicdifferencesacrossspeciesarealsotheelementsresponsibleforwithin-speciesvariation.Futureworkwillbenecessarytodemonstratewhetherthisisthecaseandthe
August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
Figure5.ATFBSturnovereventwithinafunctionallyconservedenhancer.ATFBSturnovereventshowstheimpactoflineage-specificTFBSwithinanenhancer.TheGenomeBrowserviewshowstheupstreamofhumangeneEPB41.VISTAEnhancertrackandChromHMMtrack(orangemeansstrongenhancer,yellowmeansweakenhancer)indicateaputativehumanenhancer.ChIP-seqsignalsofthreeTFsusedinthisstudynearpredictedenhancerregionareconsistentwithpredictedlineage-specificbindingsiterepresentedby46-waymultiplesequencealignment(onlyasubsetofspeciesareshown).NotethatherethetwoMAXbindingsitesarealsoMYCbindingsitessinceMAXandMYChaveverysimilarmotif.ApotentialTFBSturnoverisobservedbetweentwopredictedMAX/MYCbindingsites(1700bpapart).DifferentTFBSsarehighlightedindifferentcolorswithMAXinblueandGATA1inred.Thepredictedenhancermayfunctionasbloodcellspecificenhancerinmouse,demonstratedbyimagesofLacZpositiveE11.5mousetransgenicembryoontheVISTAEnhancerBrowser[54](ID:mm80;http://enhancer.lbl.gov/cgi-bin/imagedb3.pl?form=presentation&show=1&experiment_id=80&organism_id=2).doi:10.1371/journal.pcbi.1003771.g005
PLOSComputationalBiology|www.ploscompbiol.org11August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
differencesintraitsbroughtaboutbythisvariation.Also,ourcurrentmodelneedstobeintegratedwithgeneexpressiondatatounderstandtheinterplaybetweencis-regulatoryelementevolution(e.g.,bindingsiteturnoverandlineage-specificsites)andgeneexpressiondifferencesacrossdifferentspecies[65–68].TheextenttowhichnewlyderivedTFBSsoperateprimarilyascell-typespecificelementswithcell-specificregulatoryfunctionalsoremainsanopenquestion.Themechanismsthatcontributetocell-typespecificTFbinding,whetherthroughthepresenceorabsenceofotherproteinfactors,accessibilityofDNAwithinthechromatinstructure,orbyothermeans,arealsopossiblefuturedirectionsthatcanbemorefullyunderstoodusingacombinationofdifferenttypesofdatathatarebecomingavailable.
VN{i,b(t)thatboftheseunoccupiedsitesbecomeoccupiedaftertimetalsofollowsthebinomialdistribution:
VN{i,b(t)~
(N{i)!
v(t)b(1{v(t))N{i{b
(N{i{b)!b!
ð5Þ
Thetransitionprobabilitypij(t)thatthegivenregioncontainsjsitesaftertimetisthengivenby
min(i,j)Xk~0
pij(t)~
Ui,k(t):VN{i,j{k(t)
ð6Þ
MethodsThemodel
Ourapproachisdesignedtoestimategenome-wideratesof
evolutionforagivenmotifaccordingtoabirth-deathframework(formally,aquasibirth-deathprocess[69]),similartothatusedtomeasurethetimingofacceleratedmotifevolutionasin[29].Thebirthrate(a)representstherateatwhichanewmotifoccurrenceappearsatanyunoccupiedsiteperyear,whilethedeathrate(b)representstherateatwhichanexistingsiteislostperyear.Givenasetoforthologoussequencesandaknownphylogeny,weestimatebirthanddeathratesforthemotifacrossthephylogenetictreeusingamaximumlikelihoodapproach.
Letw(t)denotetheprobabilitythatagivenTFBSmotifoccupiesthatnucleotidepositionattimet.Theprobabilitythatthemotifwillexistattimetz1isthen
w(tz1)~a(1{w(t))z(1{b)w(t)
ð1Þ
Here,thesumisoverallpossiblevaluesk,wherekrepresentsthenumberofmotifoccurrencesattimetamongthesitesthatwereoriginallyoccupiedattimet~0.
Calculatingthelikelihoodofthedata
Giventhebirthanddeathrates(aandb)acrossthetree(whichareestimatedusingthemethoddescribedbelow),wecancalculatethelikelihoodofthedatausingFelsenstein’spruningalgorithm[70].Letusfirstconsiderdatafromasinglesequence.Weleth~½a,brepresenttheparametervectorcomprisingthebirthanddeathrates,andletDYrepresentthedatadownstreamofanodeYinthephylogeny.LetZ1,Z2,:::,ZmbethedaughternodesofY,occurringattimestZ1,tZ2,:::,tZmrelativetoparentnodeY,respectively.
IfrandomvariableQYrepresentsthenumberofmotifoccurrencesatnodeY,thelikelihoodxY(i;h)~Pr(DYDQY~i;h)ofthedatadownstreamofY,assumingimotifoccurrencesexistatnodeY,canbeobtainedrecursively.Thislikelihoodisgivenby
xY(i;h)~P
m
Settingw0(t)~w(tz1){w(t)tobetherateofchangeofwwithrespecttot,Eq(1)givesthedifferentialequation
w0(t)~a{(azb)w(t)
ð2Þ
X
j
k~1
pij(tZk):xZk(j;h)
ð7Þ
WedenotethesolutionstoEq(2)byu(t)andv(t),whereu(t)assumesthatthemotifwaspresentatthissiteattimet~0,whilev(t)assumesthatthemotifdidnotexistattimet~0(i.e.,initialconditionsw(0)~1andw(0)~0,respectively).Asu(t)andv(t)aresolutionsforw(t)inEq(2),bothrepresenttheprobabilitythatthemotifexistsataspecificnucleotidepositionaftertimet,differingonlyintheinitialconditions.SolvingEq(2)gives
u(t)~
Ã1Â
azbe{(azb)t,azb
v(t)~
ÃaÂ
1{e{(azb)tð3Þazb
wheretheinnersumisacrossallpossiblevaluesforj,correspondingtothenumberofmotifoccurrencesatdaughternodeZk.IfnodeZisamodernlineage,theprobabilityxZ(j;h)isequalto1ifweactuallyobservejmotifoccurrenceswithinthesequence,whilethelikelihoodiszerootherwise.
Thelikelihoodofthedatacanthereforebeobtainedrecursivelybydeterminingthevaluesx(i;h)progressivelyforeachnodefartherupthetree.Thelog-likelihoodL(D(k);h)forasinglesequence(thekthsequence)isthengivenby
L(D;h)~log
(k)
hX
P(j):xR(j;h)j
i
ð8Þ
Thetransitionprobabilitypij(t)istheconditionalprobability
thatagivenregionwillcontainjoccurrencesofthemotifaftertimet,assumingiinitialoccurrencesofthemotifwithintheregion.Namely,ifasequenceinitiallycontainsimotifoccurrenc-es,theprobabilityUi,k(t)thatkoftheseoccurrencesremainaftertimetisgivenbythebinomialdistribution:
Ui,k(t)~
i!
u(t)k(1{u(t))i{k
(i{k)!k!ð4Þ
whereRistherootnodeandP(j)isthepriorprobabilitythatjbindingsitesexistinasinglesequence.Forourimplementation,priorprobabilitiesP(j)weresettothePoissondistribution:P(j)~lje{l=j!wherelisthemeannumberofmotifoccurrencespersequence.Thetotallog-likelihoodL(D;h)isthen
Pn
(k)
thesumL(D;h)~;h)acrosseachofthenk~1L(Dregions.
Determiningtheoptimalancestralstates
Wecandeterminethemostlikelyancestralstatesusingthecomputedvaluesforx(j;h)ateachnodeinthephylogeny.AttherootnodeR,themostlikelyancestralstateistheonethatproducesthehighestlikelihood;thatis,thevalueofjthat
12
August2014|Volume10|Issue8|e1003771
Similarly,ifthewidthofourregionisNnucleotidesites,thereareinitiallyN{iunoccupiedsites.Thustheprobability
PLOSComputationalBiology|www.ploscompbiol.org
Lineage-SpecificBindingSitesEvolution
maximizestheexpressionqR~argmaxjP(j):xR(j;h).Progres-sivelymovingdownthetree,ifthemostlikelynumberofmotifoccurrencesatparentnodeYisqY,theoptimalnumberofmotifoccurrencesqZatadaughternodeZisgivenby
qZ~argmaxjxZ(j;h):pqYj(tZ)
wheretZisthebranchlengthfromnodeYtonodeZ.
ð9Þ
BackgroundSNPdensityforTFBSswith
differentbranchoforigin.TFBSsweregroupedintodifferentbranchesoforigin(X-axis).TocalculatethebackgroundSNPdensitysurroundingtheseTFBSs,weextended1kb,500bp,or125bptobothdirections(i.e.,2k,1k,or250bpwindow)andcountedthenumberofcommonSNPsinthis2kbwindow.ThefigureshowsthattherearenosignificantdifferencesofSNPdensitysurroundingtheTFBSswithdifferentbranchesoforigin.(TIF)
FigureS3
FigureS4PositiveenhancerratebasedontheVISTA
Birth-deathrateestimation
Birthanddeathratescanbeestimatedusingamaximum-likelihoodapproach.Namely,weuseanEM-basedapproach[71]toiterativelyoptimizethelikelihoodofthedataDgiventheparametersh~½a,b.Webeginwithaninitialestimateh(0)forthebirth-deathrates,generatedbydeterminingempiricalbirth-deathratesafterconductingancestralreconstructionusingparsimony(inouranalysis,wefoundthattheoptimizedparameterswerenotsensitivetotheinitialestimates).Wedeterminethemostlikelyancestralstateateachnodegiventheinitialparametervalues.Wethendeterminetheobservednumberofbirthsanddeathsaccordingtotheseoptimalancestralstates,providingnewestimatesforthebirthanddeathratesh(1)~½a(1),b(1).Wethencontinuetheprocess,usingthepreviousparameterestimatesh(i)ateachiterationtoestimatetheoptimalancestralstatesandobtainmoreoptimalestimatesofthebirthanddeathratesh(iz1)untilconvergence(i.e.,where(iz1)2h{h(i)fallsbelowacertainthreshold).
enhancerdatabase.Thepositiverateisthepercentageof
humanenhancersthatalsoshowenhanceractivityinmouse.TheexpectedrateisbasedonalltheenhancersoverlappingwithTFBSusedinoursixTFdatasets.‘Moreancestralenhancer’hashigherpositiveenhancerratecomparedwith‘morelineage-specificenhancers’or‘neutralenhancer’,whichisgenerallyconsistentwithourcomputationalpredictionthattheancestralTFBSaremorefunctionallyconservedthanlineage-specificTFBS,eventhoughthiscomparisondatasetisnotideal.SeeSupplementaryResultsinTextS1fordetails.(TIF)
FigureS5ComparisonbetweenourmethodandMotif-
SupportingInformation
Fractionofbindingsitesoverlappingtrans-posableelements.PlotsshowthepercentageofhumanTFBSsoverlappingtransposableelements(TEs)(y-axis),wherebindingsitesareseparatedaccordingtothebranchoforigin(x-axis).(A)Eachcoloredplotcorrespondstoasingleconsensusmotif;thefractionofbindingsitesoverlappingTEsdocumentedbyRepeatMasker[42]aregivenaspercentagesalongthey-axis.(B)Genome-wideprevalenceofsevencommonTEfamiliesinhumans.FractionsdenotethetotalnumberofsitesderivedfromeachfamilyacrossallTEsdocumentedbyRepeatMasker.(C)TEfamilycompositionforTE-derivedTFBSaccordingtotheageoforiginofSOX2,GATA1,andCTCFbindingsites.ThetotalheightofeachplotshowsthetotalfractionofTFBSoverlappingknownTEs.ColoredregionscorrespondingtothecoloredregionsofPanelBdenotethefractionofTFBSderivedfromeachTEfamilyaccordingtotheestimatedageofthebindingsites(x-axis).(TIF)
FigureS1
FigureS2PhyloPconservationvs.TFBSwithdifferent
Map.Areceiveroperatingcharacteristic(ROC)curveshowsthepredictionpowerbetweenourmethodandMotifMap.ROCcurvesforMotifMapweregeneratedusingdifferentBBLSthresholds(rangingfromzerotothemaximumpossibleBBLSscorehere,4.73)tocallaTFBSasaconservedone.Inourmethod,wetestedtwoshiftsizes,+/215bp(lightblue)and+/230bp(darkblue).TheresultsfromMotifMapwerebasedon+/215bpshiftsize(magenta).SeeSupplementaryResultsinTextS1fordetailedexplanationofthecomparisonmethodandhowthebenchmarkdatasetwasconstructed.(TIF)
FigureS6Sensitivityofourmethodwhenweaddnoisy
sitesinleafnodes.Sensitivityofourmethodontheuncertaintyofnumberofbindingsitesinleafnodeswasdeterminedbyrandomlydeleting/adding5%sitesin+/2100bpofpeaksummitinallthespeciesexcepthuman.Predictionswerecomparedwithoriginalresultsandthechangesofbranchoforiginforeachbindingsitewerecounted.Y-axisshowsthepercentageofbindingsiteswithvariousbranchchange.0meansthepredictionofbranchoforiginsremainsamebeforeandafterwerandomlydeleteorinsertbindingsites.(TIF)
Sensitivityofourmethodwhenwechangethe
branchlengths.Sensitivityofourmethodonthebranchlengthofphylogenetictreewascharacterizedbyrandomlychangingthebranchlengthstoacertainextent.Thelengthofeachbranchinphylogenetictreecanbevariedincertainrangerelativetoitsoriginallength(shownonX-axis).Y-axisshowsthepercentageofbindingsitesthathavedifferentbranchoforiginbeforeandafterwerandomlychangebranchlengthsinthephylogenetictree.(TIF)
FigureS7
TableS1Comparisonofconsensusmotifs.
branchoforigins.WeusedthePhyloPmammalianconserva-tionscoresavailableattheUCSCGenomeBrowsertodeterminethesequenceconservationlevelforTFBSwithdifferentbranchoforiginsinhuman.X-axisshowsTFBSwithdifferentbranchoforiginsforfourdifferentwindowsizessurroundingthepeaksummit.Y-axisshowstheZ-scoredistributionforeachgroup.ForaspecificTF,wefirstcomputedtheaveragePhyloPscore(X)ineachChIP-seqpeakandthencalculatedtheaveragescore(M)aswellasstandarddeviation(SD)acrossallpeaksinthegenome.Wethengroupedthebindingsitesaccordingtotheirbranchoforigin(infourgroups:Hominid-specific,Simian-specific,Primate-specific,andEutherian-specific)andcalculatedtheaveragePhyloPscore(X).Finally,wecalculatedtheZ-score,i.e.(X-M)/SD.t-statisticfromt-testbetweentheyoungestandtheoldestforeachgroupisalsoshown.(TIF)
PLOSComputationalBiology|www.ploscompbiol.org
13
(PDF)
TableS2Genefunctionsandpathwaysassociatedwith
hominid-specificTFBS.(PDF)
TableS3PerformancecomparisonwithMotifMapand
PReMod.(PDF)
August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
TableS4Genefunctionsandpathwaysassociatedwith
simian-specificTFBS.(PDF)
TableS5Genefunctionsandpathwaysassociatedwith
Acknowledgments
TheauthorswouldliketothankLisaStubbsandSaurabhSinhaforhelpfuldiscussionsandcommentsthatimprovedthemanuscript.
ancestralTFBS.(PDF)
TextS1Supplementarymethodsandsupplementary
AuthorContributions
Conceivedanddesignedtheexperiments:KDYYZJM.Performedtheexperiments:KDYYZ.Analyzedthedata:KDYYZ.Contributedreagents/materials/analysistools:KDYYZ.Contributedtothewritingofthemanuscript:KDYYZJM.
results.(PDF)
References
1.WrayGA(2007)Theevolutionarysignificanceofcis-regulatorymutations.NatRevGenet8:206–216.
2.DavidsonEH(2001)Genomicregulatorysystems:developmentandevolution.SanDiego:AcademicPress.xii,261p.p.
3.KingMC,WilsonAC(1975)Evolutionattwolevelsinhumansandchimpanzees.Science188:107–116.
4.WrayGA,HahnMW,AbouheifE,BalhoffJP,PizerM,etal.(2003)Theevolutionoftranscriptionalregulationineukaryotes.MolBiolEvol20:1377–1419.
5.OdomDT,DowellRD,JacobsenES,GordonW,DanfordTW,etal.(2007)Tissue-specifictranscriptionalregulationhasdivergedsignificantlybetweenhumanandmouse.NatGenet39:730–732.
6.SchmidtD,WilsonMD,BallesterB,SchwaliePC,BrownGD,etal.(2010)Five-vertebrateChIP-seqrevealstheevolutionarydynamicsoftranscriptionfactorbinding.Science328:1036–1040.
7.BourqueG,LeongB,VegaVB,ChenX,LeeYL,etal.(2008)Evolutionofthemammaliantranscriptionfactorbindingrepertoireviatransposableelements.GenomeRes18:1752–1762.
8.ScallyA,DutheilJY,HillierLW,JordanGE,GoodheadI,etal.(2012)Insightsintohominidevolutionfromthegorillagenomesequence.Nature483:169–175.9.KasowskiM,GrubertF,HeffelfingerC,HariharanM,AsabereA,etal.(2010)Variationintranscriptionfactorbindingamonghumans.Science328:232–235.10.SiepelA,BejeranoG,PedersenJS,HinrichsAS,HouM,etal.(2005)
Evolutionarilyconservedelementsinvertebrate,insect,worm,andyeastgenomes.GenomeResearch15:1034–1050.
11.HardisonRC,OeltjenJ,MillerW(1997)Longhuman-mousesequence
alignmentsrevealnovelregulatoryelements:areasontosequencethemousegenome.Genomeresearch7:959–966.
12.BlanchetteM,BatailleAR,ChenX,PoitrasC,LaganiereJ,etal.(2006)
Genome-widecomputationalpredictionoftranscriptionalregulatorymodulesrevealsnewinsightsintohumangeneexpression.GenomeRes16:656–668.13.KellisM,PattersonN,EndrizziM,BirrenB,LanderES(2003)Sequencingand
comparisonofyeastspeciestoidentifygenesandregulatoryelements.Nature423:241–254.
14.MarguliesEH,BlanchetteM,ProgramNCS,HausslerD,GreenED(2003)
Identificationandcharacterizationofmulti-speciesconservedsequences.GenomeRes13:2507–2518.
15.Lindblad-TohK,GarberM,ZukO,LinMF,ParkerBJ,etal.(2011)Ahigh-resolutionmapofhumanevolutionaryconstraintusing29mammals.Nature478:476–482.
16.LoweCB,KellisM,SiepelA,RaneyBJ,ClampM,etal.(2011)Threeperiodsof
regulatoryinnovationduringvertebrateevolution.Science333:1019–1024.17.HillerM,SchaarBT,IndjeianVB,KingsleyDM,HageyLR,etal.(2012)A
‘‘ForwardGenomics’’ApproachLinksGenotypetoPhenotypeusingIndepen-dentPhenotypicLossesamongRelatedSpecies.CellRep2:817–823.
18.McLeanCY,RenoPL,PollenAA,BassanAI,CapelliniTD,etal.(2011)
Human-specificlossofregulatoryDNAandtheevolutionofhuman-specifictraits.Nature471:216–219.
19.ChenX,TompaM(2010)Comparativeassessmentofmethodsforaligning
multiplegenomesequences.NatBiotechnol28:567–572.
20.KimJ,MaJ(2011)PSAR:measuringmultiplesequencealignmentreliabilityby
probabilisticsampling.NucleicAcidsRes39:6359–6368.
21.MajorosWH,OhlerU(2010)Modelingtheevolutionofregulatoryelementsby
simultaneousdetectionandalignmentwithphylogeneticpairHMMs.PLoSComputBiol6:e1001037.
22.LunterG,RoccoA,MimouniN,HegerA,CaldeiraA,etal.(2008)Uncertainty
inhomologyinferences:assessingandimprovinggenomicsequencealignment.GenomeRes18:298–309.
23.MarguliesEH,BlanchetteM,HausslerD,GreenED(2003)Identificationand
characterizationofmulti-speciesconservedsequences.GenomeRes13:2507–2518.
24.BlanchetteM,KentWJ,RiemerC,ElnitskiL,SmitAF,etal.(2004)Aligning
multiplegenomicsequenceswiththethreadedblocksetaligner.GenomeRes14:708–715.
25.(2010)Amapofhumangenomevariationfrompopulation-scalesequencing.
Nature467:1061–1073.
26.WardLD,KellisM(2012)Evidenceofabundantpurifyingselectioninhumans
forrecentlyacquiredregulatoryfunctions.Science337:1675–1678.
27.XieX,LuJ,KulbokasEJ,GolubTR,MoothaV,etal.(2005)Systematic
discoveryofregulatorymotifsinhumanpromotersand39UTRsbycomparisonofseveralmammals.Nature434:338–345.
28.HeX,LingX,SinhaS(2009)Alignmentandpredictionofcis-regulatory
modulesbasedonaprobabilisticmodelofevolution.PLoSComputBiol5:e1000299.
29.YokoyamaKD,PollockDD(2012)SPtranscriptionfactorparalogsandDNA-bindingsitescoevolveandadaptivelyconvergeinmammalsandbirds.GenomeBiolEvol4:1102–1117.
30.KimJ,HeX,SinhaS(2009)Evolutionofregulatorysequencesin12Drosophila
species.PLoSGenet5:e1000330.
31.JohnsonDS,MortazaviA,MyersRM,WoldB(2007)Genome-widemappingof
invivoprotein-DNAinteractions.Science316:1497–1502.
32.ChenX,XuH,YuanP,FangF,HussM,etal.(2008)Integrationofexternal
signalingpathwayswiththecoretranscriptionalnetworkinembryonicstemcells.Cell133:1106–1117.
33.ListerR,PelizzolaM,DowenRH,HawkinsRD,HonG,etal.(2009)Human
DNAmethylomesatbaseresolutionshowwidespreadepigenomicdifferences.Nature462:315–322.
34.MyersRM,StamatoyannopoulosJ,SnyderM,DunhamI,HardisonRC,etal.
(2011)Auser’sguidetotheencyclopediaofDNAelements(ENCODE).PLoSbiology9:e1001046.
35.MouseEC,StamatoyannopoulosJA,SnyderM,HardisonR,RenB,etal.
(2012)AnencyclopediaofmouseDNAelements(MouseENCODE).GenomeBiol13:418.
36.DavisonAC(2003)StatisticalModels.NewYork:CambridgeUniversityPress.37.MillerW,RosenbloomK,HardisonRC,HouM,TaylorJ,etal.(2007)28-way
vertebratealignmentandconservationtrackintheUCSCGenomeBrowser.GenomeRes17:1797–1808.
38.BaileyTL,BodenM,BuskeFA,FrithM,GrantCE,etal.(2009)MEME
SUITE:toolsformotifdiscoveryandsearching.NucleicAcidsRes37:W202–208.
39.SchmidtD,SchwaliePC,WilsonMD,BallesterB,GoncalvesA,etal.(2012)
WavesofretrotransposonexpansionremodelgenomeorganizationandCTCFbindinginmultiplemammalianlineages.Cell148:335–348.
40.WardMC,WilsonMD,Barbosa-MoraisNL,SchmidtD,StarkR,etal.(2012)
LatentRegulatoryPotentialofHuman-SpecificRepetitiveElements.MolCell.41.WangT,ZengJ,LoweCB,SellersRG,SalamaSR,etal.(2007)Species-specific
endogenousretrovirusesshapethetranscriptionalnetworkofthehumantumorsuppressorproteinp53.ProcNatlAcadSciUSA104:18613–18618.42.SmitAF,HubleyR,GreenP(1996–2010)RepeatMaskerOpen-3.0.
43.KunarsoG,ChiaNY,JeyakaniJ,HwangC,LuX,etal.(2010)Transposable
elementshaverewiredthecoreregulatorynetworkofhumanembryonicstemcells.NatGenet42:631–634.
44.PollardKS,HubiszMJ,RosenbloomKR,SiepelA(2010)Detectionof
nonneutralsubstitutionratesonmammalianphylogenies.GenomeRes20:110–121.
45.XieX,RigorP,BaldiP(2009)MotifMap:ahumangenome-widemapof
candidateregulatorymotifsites.Bioinformatics25:167–174.
46.StarkA,LinMF,KheradpourP,PedersenJS,PartsL,etal.(2007)Discoveryof
functionalelementsin12Drosophilagenomesusingevolutionarysignatures.Nature450:219–232.
47.XieX,MikkelsenTS,GnirkeA,Lindblad-TohK,KellisM,etal.(2007)
Systematicdiscoveryofregulatorymotifsinconservedregionsofthehumangenome,includingthousandsofCTCFinsulatorsites.ProcNatlAcadSciUSA104:7145–7150.
48.KheradpourP,StarkA,RoyS,KellisM(2007)Reliablepredictionofregulator
targetsusing12Drosophilagenomes.GenomeRes17:1919–1931.
49.GiorgettiA,MarchettoMC,LiM,YuD,FazzinaR,etal.(2012)Cordblood-derivedneuronalcellsbyectopicexpressionofSox2andc-Myc.ProcNatlAcadSciUSA109:12556–12561.
50.CavallaroM,MarianiJ,LanciniC,LatorreE,CacciaR,etal.(2008)Impaired
generationofmatureneuronsbyneuralstemcellsfromhypomorphicSox2mutants.Development135:541–557.
51.PevnyL,LinCS,D’AgatiV,SimonMC,OrkinSH,etal.(1995)Development
ofhematopoieticcellslackingtranscriptionfactorGATA-1.Development121:163–172.
PLOSComputationalBiology|www.ploscompbiol.org14August2014|Volume10|Issue8|e1003771
Lineage-SpecificBindingSitesEvolution
52.ChenH,TianY,ShuW,BoX,WangS(2012)Comprehensiveidentification
andannotationofcelltype-specificandubiquitousCTCF-bindingsitesinthehumangenome.PLoSONE7:e41374.
53.ErnstJ,KellisM(2012)ChromHMM:automatingchromatin-statediscovery
andcharacterization.NatMethods9:215–216.
54.ViselA,MinovitskyS,DubchakI,PennacchioLA(2007)VISTAEnhancer
Browser–adatabaseoftissue-specifichumanenhancers.NucleicAcidsRes35:D88–92.
55.Lindblad-TohK,GarberM,ZukO,LinMF,ParkerBJ,etal.(2011)Ahigh-resolutionmapofhumanevolutionaryconstraintusing29mammals.Nature478:476–482.
56.WoolfeA,GoodsonM,GoodeDK,SnellP,McEwenGK,etal.(2005)Highly
conservednon-codingsequencesareassociatedwithvertebratedevelopment.PLoSBiol3:e7.
57.EmberlyE,RajewskyN,SiggiaED(2003)Conservationofregulatoryelements
betweentwospeciesofDrosophila.BMCBioinformatics4:57.
58.GoncalvesA,Leigh-BrownS,ThybertD,StefflovaK,TurroE,etal.(2012)
Extensivecompensatorycis-transregulationintheevolutionofmousegeneexpression.GenomeRes22:2376–2384.
59.DermitzakisET,ClarkAG(2002)Evolutionoftranscriptionfactorbindingsites
inMammaliangeneregulatoryregions:conservationandturnover.MolBiolEvol19:1114–1121.
60.BradleyRK,LiXY,TrapnellC,DavidsonS,PachterL,etal.(2010)Binding
siteturnoverproducespervasivequantitativechangesintranscriptionfactorbindingbetweencloselyrelatedDrosophilaspecies.PLoSBiol8:e1000343.61.WhitfieldTW,WangJ,CollinsPJ,PartridgeEC,AldredSF,etal.(2012)
Functionalanalysisoftranscriptionfactorbindingsitesinhumanpromoters.GenomeBiol13:R50.62.WassermanWW,SandelinA(2004)Appliedbioinformaticsfortheidentification
ofregulatoryelements.NatRevGenet5:276–287.
63.VaquerizasJM,KummerfeldSK,TeichmannSA,LuscombeNM(2009)A
censusofhumantranscriptionfactors:function,expressionandevolution.NatRevGenet10:252–263.
64.McDaniellR,LeeBK,SongL,LiuZ,BoyleAP,etal.(2010)Heritable
individual-specificandallele-specificchromatinsignaturesinhumans.Science328:235–239.
65.TiroshI,BarkaiN(2011)Inferringregulatorymechanismsfrompatternsof
evolutionarydivergence.MolSystBiol7:530.
66.TiroshI,WeinbergerA,BezalelD,KaganovichM,BarkaiN(2008)Onthe
relationbetweenpromoterdivergenceandgeneexpressionevolution.MolSystBiol4:159.
67.RomeroIG,RuvinskyI,GiladY(2012)Comparativestudiesofgeneexpression
andtheevolutionofgeneregulation.NatRevGenet13:505–516.
68.CusanovichDA,PavlovicB,PritchardJK,GiladY(2014)Thefunctional
consequencesofvariationintranscriptionfactorbinding.PLoSGenet10:e1004226.
69.CavenderJA(1978)Quasi-stationarydistributionsofbirth-and-deathprocesses.
AdvApplProb10:570–586.
70.FelsensteinJ(1973)Maximum-likelihoodestimationofevolutionarytreesfrom
continuouscharacters.AmJHumGenet25:471–492.
71.DempsterAP,LairdNM,RubinDB(1977)MaximumLikelihoodfrom
IncompleteDataViaEmAlgorithm.JournaloftheRoyalStatisticalSocietySeriesB-Methodological39:1–38.
72.SherryST,WardMH,KholodovM,BakerJ,PhanL,etal.(2001)dbSNP:the
NCBIdatabaseofgeneticvariation.NucleicAcidsRes29:308–311.
PLOSComputationalBiology|www.ploscompbiol.org15August2014|Volume10|Issue8|e1003771
因篇幅问题不能全部显示,请点此查看更多更全内容