SortingSignedPermutationsbyReversals
(ProjectReport)
CS290IBioinformaticsSpring’03
AnkurJainHodaMokhtar
ComputerScienceDept.
UniversityofCaliforniaSantaBarbara
ankurj,hmokhtar@cs.ucsb.edu
June8,2003
Abstract
Amodeltostudygenomeevolutioncanbeconstructedbyrepresentinggenomesaspermutationsofgenes,andcomputetheminimumnumberofrearrangementsneededtoconvertonegenometotheother.Underthismodel,theshorterthedistance,thelesseristheevolutionarydistancebetweenthetwospecies.However,everystudyofgenomerearrangementsinvolvessolvingthecombinatorialproblemreferredtoassortingbyreversalsthatfindstheseriesofrequiredrearrangements.Boththeproblemsofcomputingthereversaldistanceandfindingtheoptimalsequenceofsortingreversalshavebeenextensivelystudiedduringthelastdecade.Inthis
.Ourapproachintegratespreviouspaperwepresentapolynomialtimealgorithmforsortingbyreversals
workpresentedin[Ber01]togetherwithagreedybreakpointmethod[Sha95].Themaincontributionofthepaperisobtainingnearoptimalsolutionsusingthesimpleapproachproposed.AbasicfeatureoftheprposedapproachisavoidingtheexpensivestepsintroducedinHannenhalliandPevzneralgorithm[HP95].
1Introduction
Themotivationforstudyinggenomerearrangementarisesfromthefastprogressofthehumangenomeproject,andthefactthatgeneticandDNAdataonmanymodelorganismsareaccumulatingrapidly.Analysisofgenomerearrangementsinmolecularbiologystartedinthelate1930s,whenDobzhanskyandSturtevantpublishedamilestonepaperpresentingarearrangementscenariowith17inversionsbetweenthespeciesofDrosophila[HP95].Inthelate1980’sJeffreyPalmerdemonstratedthatdifferentspeciesmayhaveessentiallythesamegenes,butthegeneordermaydifferbetweenspecies[HRR97].Thisdiscoveryandmanyotherstudiesinthelastdecadeprovedthatthereversaldistancebetweentwogenomesofdifferentspeciesfaithfullyreflectstheevolutionarydistancebetweenthem.
MathematicalanalysisofgenomerearrangementproblemswasinitiatedbySankoff[SCA90,KS95].In1997,Caprara[Cap97]establishedthattheproblemofsortingunsignedpermutationsisNP-hard,usingsomeofthecombinatorialtoolsdevelopedbyBafnaandPevzner[BP96].In1995,HannenhalliandPevzner
fortheproblemofsortingasignedpermutation[HP95]proposedthefirstpolynomialtimealgorithm
byreversals.In1996,BermanandHannenhalli[BH96]describedafasterimplementationthatfindsa
timeusingUnion-Finddatastructure,whereistheinverseminimumsequenceofreversalsin
algorithmforofAckerman’sfunction.Laterin1997,Kaplan,ShamirandTarjan[HRR97]gavean
sortingasignedpermutationofelements,therebyimprovinguponthepreviousbest-knownbound.In
1
2001,Bergeron[Ber01,BS01]presentedabit-vectorimplementationthatcomputestheoptimalsequenceofreversalstosortasignedpermutationusingoriented-sorttechnique.
Sortingbyreversalscanbeformulatedas:Givenapermutationfindtheminimumnumberofreversalstoobtaintheidentitypermutation.Intheremainderofthispaperweusethetermsdefinedbelow.Asignedpermutationisapermutationofintegersbetweenand,
Permutationelementsareprovidedwithaplusorminussigndenotingthegeneorientationalongthechro-mosome.
isanoperationthatreversesablockofconsecutiveelementsstartingfrompositionAreversal
topositionin,togetherwithchangingthesignsoftheelementsbetween
(1)
istheminimumnumberofreversalsthattransformtotheidentitypermuta-Areversaldistancetion.Clearly,sortingbyreversalsisbasicallyfindingtherequiredsequenceofreversalstoobtainthereversaldistance.Areversaliscalledsafeifitreducesreversaldistance(i.e.issafeif).
bedefinedas-=.Extendthepermutationbyaddingand.WeLet
callapairofelementsforofisanadjacencyif.Otherwise,wecallitabreakpoint.
suchthatarebreakpointsandnobreakpointliesAstripofisaninterval
betweenthem.Inotherwords,astripisamaximalrunofincreasingordecreasingelements.
2ASimplePolynomialTimeSortingbyReversalsTechnique
Inthissectionwedescribeourapproachtowardssortingbyreversals.Thealgorithmweproposeconsistsof2stages:1)removingorientedpairsfromthepermutationusingthebit-vectormethod,2)completingthesorting(ifrequired)usingthegreedybreakpointalgorithm.
ThemotivationbehindthisworkistoavoidexpensiveoperationsintroducedbyHannenhalliet.al[HP95],andKaplanet.al[HRR97].Thoseoperationarefindingandclearinghurdles(i.e.unorientedcomponentsintheoverlapgraphwhoseelementsareconsecutive)andfortresses(permutationswithodd
timebesidethetimeneededfornumberofsuperhurdles).Performingthosegraphoperationsrequire
removingtheorientedpairsafterwards.
Inthefirststageofthealgorithmwestartwithimplicitlyconstructinganoverlapgraphusingabitmatrix.Anoverlapgraph(OV-Graph)isanedge-coloredgraphconstructedontheunsignedversionofthesignedpermutation.isobtainedbyreplacingeachpositiveelementbythesequence
,andeachnegativeelementsby.AnOV-Graphofapermutationoflengthhas
edgesandvertices.IngeneralanOV-Graphhastwotypesofedges,blackedgesthatjoinevery
where.Acharacteristicofotherelementinwhilegrayedgesthatjoinadjacentelements
OV-Graphisthateachvertexhasadegreeofexactly2givingauniquecycledecompositionofthegraph.A
withoppositesignsiscalledanorientedpair.pairofconsecutiveintegers,thatis
IntheremainingdiscussionwefocusonarcoverlapgraphwhereeveryedgeintheOV-GraphisavertexinthearcOV-Graph,edgesinthisgraphjoinverticeswhosegrayedgeswereintersectingintheOV-Graph.
Example2.1Considerthepermutationsignedpermutationisgivenby:forisshowninFigure1.
.Thecorrespondingun-.TheOV-Graph
Ingeneral,avertexinthearcOV-Graphisanorientedvertexifitenclosesanoddnumberofelementsin.Theorientedsortalgorithmmakesuseof2basicfacts:1)avertexhasanodddegreeiffitisoriented,
2
Figure1:OverlapGraph
2)ifoneperformsareversalcorrespondingtoanorientedvertex,eachvertexadjacenttowillchangeitsorientation[Ber01].Thesetwofactstogetherwiththetheoremin[BS01]thatstatedthatareversalthatmaximizesthenumberoforientedverticesissafeformthetheoreticalbackbonefortheorientedsortingtechnique.
TheorientedsortalgorithmstartswithconstructinganadjacencymatrixforthearcOV-Graph.Twoextra
areappendedattheendcorrespondingtotheparityandscoreofthevertex.Theparityspecifiesrows
whetheravertexhasevenorodddegree(i.e.unorientedororientedresp.).Thescoreisdefinedasthedifferencebetweenthenumberoforientedandunorientedneighborsofthevertex.
.ThecorrespondingarcOV-GraphisshowninExample2.2Consider
Figure2,whereorientedverticesaremarkedinblack.ThebitmatrixforisshowninFigure3.
Figure2:ArcOverlapGraph
Theorientedsortingalgorithmisasfollows:
WHILEthereexistsanorientedvertexinthebitmatrixBEGIN
Letbeavertexwithmaximumscore
Figure3:BitMatrixofOV-graphinFigure2
FOReachvertexadjacenttoIFisorientedBEGIN
END
ELSE
BEGIN
3
END
END
Applyingtheabovealgorithmonasignedpermutationremovesallorientedpairsandresultsinan
unsignedpermutation.Ifisstillunsortedwepresentittoagreedybreakpointalgorithmthatguaranteestoobtaintheidentityattheend.In[Sha95]itisprovedthatreversaldistanceobtainedfromthegreedybreakpointalgorithmisatmosttwiceasbadastheoptimalsolution.Thegreedyalgorithmproceedsasfollows:
BEGIN
WHILEcontainsabreakpointDOBEGIN
Letbeareversalthatremovesthemostbreakpointsof,resolvingtiesamongthosethatremoveonebreakpointinfavorofreversalsthatleaveadecreasingstrip.
ENDEND
Thecomplexityofbothalgorithmsis.Therefore,theoverallcomplexityofourintegratedap-makingitcompetitivewithothertechniquesthataremorecomplextoapply.proachis
3ExperimentalResults
Inthissectionwepresenttheresultsobtainedbyapplyingourapproachonbothrealandsyntheticdatasets.
ThesyntheticdatathatweusedintheexperimentswereobtainedusingtheJavaappletprovidedbyMantin[Man].Wegeneratedrandomsingedpermutationsofvariouslengthanddifferentevolutionaryrate(i.e.numberofreversals).Ontheotherhand,therealdatawereobtainedfromthetestsetsprovidedbyGRAPPA[oNMoTaA01]fordifferent”Campanulaceae”species(flowerplant),MGR[GP02]forhuman-and-mousegeneorder,and[BP02]forHerpesvirusdata.TheexperimentswereconductedonLinuxplatformonaGHzAMDAthlon(tm)XP+processorwithMBmemory.
Weconducted3differentexperimentsonsyntheticdata.Thefirstexperimentconsideredthesortingperformanceoftheorientedsortingtechniquedefinedasthenumberofsuccessfullysortedpermutations.Figure4presentstheresultsproducedbyourimplementationonfilesofrandompermutationsofdifferent
)eachfilewithpermutationsgeneratedusing[Man]withvalueslengths(
.Theresultsshowthatontheaverage%ofthesequenceswerefortheevolutionrate””:
completelysorted.Theconclusionfromthisexperimentisthatonly%ofthesequencesneedtobefurthertreatedusinggreedybreakpointmethod.
Thesecondexperimentconsideredtherunningtimeoftheorientedsortingtechnique.Figure5presentstheresultsoftherunningtimeofthealgorithmunderthesamesettingsinExperiment1.WefocusedonlyonthesequencesthatwerecompletelysortedinExperiment1.Ourconclusionsisthatastheevolutionrateincreases,therequiredsortingtimeincreasesaswell.Besides,thetimeremainstobealmostlinearfor
whichisanoverestimateofatypicalgenomicsequencelength(i.e.humansequencesoflengthupto
genesmatchingwithmouse).genomehasonly
Thethirdexperimentconsideredtherelationbetweenevolutionrateandtime.Wegeneratedfilesof
andcomputedthetimeneededforvaryinginincrementsof.Therandompermutationsoflength
4
Performance5040Time300250k=20k=30No. SortedTime (sec)3020100k=20k=30k=40503942421003637332003740374003231388003232341600283034200150100500501002004008001600Sequence Lengthk=20k=30k=40k=40Sequence LengthFigure4:SortingPerformance
Evolution Rate vs Time43.532.521.510.502006001003004005007008009001000Figure5:RunningTimePerformance
time (sec)Length=1000kFigure6:EvolutionRateversusTimeFigure7:HerpesVirusPhylogeneticTree
(i.e.slopetendstomainobservationisthatsaturationstateisreachedasevolutionrateapproaches
)decreaseafter
Next,weconducted3experimentsonrealdata.Inexperiment1weconsidereddifferentspeciesof”Campanulaceae”[oNMoTaA01],wecomputedthereversaldistancebetweenthedifferentspeciesandTobacco(i.e.identitypermutation).WecomparedtheresultswiththeoptimalresultsfromGRIMM[G.02].AlthoughTable1showsgoodcomparativeresults,theorientedsortalgorithmfailedtosortPlatyncodon,LegousiaandCodonopsisspeciescompletely.
Experiment2wasdoneonHerpesVirusdata[BP02].Weobservedthatthereversaldistancesandreversalsequencesin[BP02]matchedexactlywiththeonesperformedbyouralgorithm.TheresultsforthisexperimentareshowninFigure7.
ThelastexperimentwasdoneonHumanandMousegeneorderdata[GP02].Inthisexperimentwein-SpeciesCyanathusTriodanusSymphanraReversalDistancetoTobacco
111312GRIMMOptimalDistance
111312
Table1:ReversalDistancetoTobacco
5
12 13 14 15 -9 -8 -7 -6 47 48 -46 -45 -44 -11 -10 -58 -57 -56 92 93 -95 -94 -21 -20 -5 -4 -3 -2 -1 34 35 41 42 43 36 37 38 -64 -63 61 62 65 66 67 68 90 91 -55 -54 51 52 53 39 40 -60 -59 -77 -76 -19 -18 16 17 -97 -96 -75 -74 -73 24 25 78 79 -83 -82 -81 -80 84 85 86 87 -28 -27 -26 22 23 98 99 69 70 -72 -71 -33 -32 -31 -30 -29 88 89 -50 -49 -105 -104 106 107 108 114 115 -117 -116 -103 -102 109 110 111 112 113 -101 -100 118 119 120 121 122 123 (mouse genome and human is identity)40 reversals0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 9339 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 6768 69 7094 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 1243 reversalsIdentityFigure8:Human-MouseGeneOrder
tegratedtheorientedsortalgorithmtogetherwiththegreedybreakpointmethodtoconvertmousetohuman.First,weobtainedasequencefreeoforientedpairsin40reversals.Thissequencerequired3morereversalsbythegreedyalgorithmtoobtainidentity.The43reversalsarecomparabletotheoptimalfigureof41fromGRIMM.TheblockssortedbythegreedyalgorithmareindicatedinitalicsandboldfontinFigure8.
4ConclusionsandFutureWork
Inthisworkwepresentedatechniquethatintegratesthebit-matrixorientedsortingandgreedybreakpointreversaltechniques.Theapproachproposedwastestedonbothrealandsyntheticdataandwasableachievenearoptimalresultsforsortingsignedpermutations.Abasicfeatureoftheproposedmethodisbeingsimpleandrelativelyfast.
Webelievethatthetechniqueweimplementedcanprovidegoodresultsandthatfurtherexperimentscanstrengthenourclaim.Aspartoftheproposedfuturework,wethinkthatimplementingothertechniquescanhelptopresentabettercomparativeanalysis.Besides,wefinditusefultotrytestingthetechniqueondifferentdatasetsincludingexonorderratherthangeneorder,aswellasconsideringdifferentspeciesandtryingtocomputereversaldistancebetweenthemtoconstructphylogenetictrees.
References
[Ber01][BH96]
AnneBergeron.AveryelementarypresentationoftheHannenhalli-Pevznertheory.LectureNotesinComputerScience,2001.
P.BermanandS.Hannenhalli.Fastsortingbyreversal.InD.S.HirschbergandE.W.My-ers,editors,Proceedingsofthe7thAnnualSymposiumonCombinatorialPatternMatching,number1075,pages168–185,LagunaBeach,CA,1996.Springer-Verlag,Berlin.V.BafnaandP.A.Pevzner.Genomerearragementsandsortingbyreversals.SIAMJ.Comput.,25:272–289,1996.
GuillaumeBourqueandPavelA.Pevzner.Genome-scaleevolution:Reconstructinggeneordersintheancestralspecies.GenomeResearch,12(1):26–36,January2002.
[BP96][BP02]
6
[BS01][Cap97][G.02][GP02][HP95][HRR97]
AnneBergeronandF.Strasbourg.Experimentsincomputingsequencesofinversions.InWABI01,pages166–176,Berlin,2001.Springer-Verlag.
A.Caprara.Sortingbyreversalsisdifficult.InNM.ACMPress,editor,1stConferenceonComputationalMolecularBiology(RECOMB97),pages75–83,SantaFe,1997.
TeslerG.Grimm:genomerearrangementswebserver.Bioinformatics,18(3):492–493,March2002.
BourqueG.andPevznerP.Genome-scaleevolution:Reconstructinggeneordersintheancestralspecies.GenomeRes.,12(1):26–36,2002.
SridharHannenhalliandPavelPevzner.Transformingcabbageintoturnip:polynomialalgorithmforsortingsignedpermutationsbyreversals.pages178–189,1995.
KaplanH.,ShamirR.,andTarjanR.Fasterandsimpleralgorithmforsortingsignedper-mutationsbyreversals.InSODA:ACM-SIAMSymposiumonDiscreteAlgorithms(ACon-ferenceonTheoreticalandExperimentalAnalysisofDiscreteAlgorithms),1997.
J.KececiogluandD.Sankoff.Exactandapproximationalgorithmsforsortingbyreversals,withapplicationtogenomerearrangement.Algorithmica,13:180–210,1995.
ItsikMantin.Javaappletfor:Analgorithmforsortingsignedpermutationsbyreversals.http://www.math.tau.ac.il/rshamir/GR/.
[KS95][Man]
[oNMoTaA01]TheUniversityofNewMexicoandTheUniversityofTexasatAustin.Grappa:
Genomerearrangementsanalysisunderparsimonyandotherphylogeneticalgorithms.http://www.cs.unm.edu/moret/GRAPPA/,2001.
[SCA90][Sha95]
D.Sankoff,R.Cedergren,andY.Abel.Genomicdivergencethroughgenerearrangement.MethodsinEnzymology,183:428–438,1990.
RonShamir.Lecturenotes:Genomehttp://www.math.tau.ac.il/rshamir/algmb/95/94-95.html,1995.
rearrangement.
7
因篇幅问题不能全部显示,请点此查看更多更全内容