Benchmarking Cloud Serving Systems with YCSB

Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears

Yahoo! Research, Santa Clara, CA, USA

{cooperb,silberst,etam,ramakris,sears}@yahoo-inc.com

ABSTRACT

While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.

Categories and Subject Descriptors: H.3.4 [Systems and Software]: Performance evaluation

General Terms: Measurement, Performance

1. INTRODUCTION

There has been an explosion of new systems for data storage and management "in the cloud." Open source systems include Cassandra [2, 24], HBase [4], Voldemort [9] and others [3, 5, 7, 8]. Some systems are offered only as cloud services, either directly in the case of Amazon SimpleDB [1] and Microsoft Azure SQL Services [11], or as part of a programming environment like Google's App Engine [6] or Yahoo!'s YQL [13]. Still other systems are used only within a particular company, such as Yahoo!'s PNUTS [17], Google's BigTable [16], and Amazon's Dynamo [18]. Many of these "cloud" systems are also referred to as "key-value stores" or "NoSQL systems," but regardless of the moniker, they share the goals of massive scaling "on demand" (elasticity) and simplified application development and deployment.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SoCC'10, June 10–11, 2010, Indianapolis, Indiana, USA.
Copyright 2010 ACM 978-1-4503-0036-0/10/06 ...$10.00.

The large variety has made it difficult for developers to choose the appropriate system. The most obvious differences are between the various data models, such as the column-group oriented BigTable model used in Cassandra and HBase versus the simple hashtable model of Voldemort or the document model of CouchDB. However, the data models can be documented and compared qualitatively. Comparing the performance of various systems is a harder problem. Some systems have made the decision to optimize for writes by using on-disk structures that can be maintained using sequential I/O (as in the case of Cassandra and HBase), while others have optimized for random reads by using a more traditional buffer-pool architecture (as in the case of PNUTS). Furthermore, decisions about data partitioning and placement, replication, transactional consistency, and so on all have an impact on performance.

Understanding the performance implications of these decisions for a given type of application is challenging. Developers of various systems report performance numbers for the "sweet spot" workloads for their system, which may not match the workload of a target application. Moreover, an apples-to-apples comparison is hard, given numbers for different systems based on different workloads. Thus, developers often have to download and manually evaluate multiple systems. Engineers at Digg [20] reported evaluating eight different data stores in order to implement one feature (the Green Badge, or "what have my friends dugg" feature). There have been multiple similar examples at Yahoo!. This process is time-consuming and expensive.

Our goal is to create a standard benchmark and benchmarking framework to assist in the evaluation of different cloud systems. We focus on serving systems, which are systems that provide online read/write access to data. That is, usually a web user is waiting for a web page to load, and reads and writes to the database are carried out as part of the page construction and delivery. In contrast, batch or analytical systems such as Hadoop or relational OLAP systems provide only near-line or offline queries, and are not typically used to support serving workloads (though the result of a batch computation may be cached into a serving store for low-latency access). Other researchers [26] are doing work to benchmark analytical systems such as these. Similarly, there are existing benchmarks for a variety of data storage systems (such as SQL databases [22] and filesystems [12, 10]). However, the novel interfaces (usually neither SQL nor POSIX), elasticity, and new use cases of cloud serving systems motivate a new benchmark.

We present the Yahoo! Cloud Serving Benchmark (YCSB) framework. We are using this framework to benchmark our own PNUTS system and to compare it to other cloud databases. The framework consists of a workload generating client and a package of standard workloads that cover interesting parts of the performance space (read-heavy workloads, write-heavy workloads, scan workloads, etc.). An important aspect of the YCSB framework is its extensibility: the workload generator makes it easy to define new workload types, and it is also straightforward to adapt the client to benchmark new data serving systems. The YCSB framework and workloads are available in open source so that developers can use it to evaluate systems, and contribute new workload packages that model interesting applications.1 In this paper, we describe the YCSB benchmark, and report performance results for four systems: Cassandra, HBase, PNUTS, and a simple sharded MySQL implementation. Although our focus in this paper is on performance and elasticity, the framework is intended to serve as a tool for evaluating other aspects of cloud systems such as availability and replication, and we discuss approaches to extending it for these purposes.

The paper is organized as follows. Section 2 provides an overview of cloud data serving systems. Section 3 discusses benchmark tiers for performance and scaling. Section 4 discusses the core workloads of the benchmark in detail, while Section 5 examines the architecture and extensibility of the YCSB tool. Section 6 presents benchmarking results on several systems. We propose future benchmark tiers covering availability and replication in Section 7. Section 8 examines related work, and Section 9 presents our conclusions.

2. CLOUD SYSTEM OVERVIEW

2.1 Cloud Serving System Characteristics

Cloud serving systems share common goals, despite the different architectures and design decisions. In general, these systems aim for:

• Scale-out: To support huge datasets (multiple terabytes or petabytes) and very high request rates, cloud systems are architected to scale out, so that large scale is achieved using large numbers of commodity servers, each running copies of the database software. An effective scale-out system must balance load across servers and avoid bottlenecks.

• Elasticity: While scale-out provides the ability to have large systems, elasticity means that we can add more capacity to a running system by deploying new instances of each component, and shifting load to them.

• High availability: Cloud systems must provide high levels of availability. In particular, they are often multitenant systems, which means that an outage affects many different applications. Moreover, the use of commodity hardware means that failures are relatively common, and automated recovery must be a first-class operation of the system.

1 YCSB can be obtained from http://research.yahoo.com/Web_Information_Management/YCSB.

The main motivation for developing new cloud serving systems is the difficulty in providing all of these features (especially scale-out and elasticity) using traditional database systems. As a tradeoff, cloud systems typically sacrifice the complex query capabilities and strong, sophisticated transaction models found in traditional systems. Without the need for complex planning and processing of joins and aggregates, scale-out and elasticity become significantly easier to achieve. Similarly, scale-out (especially to multiple datacenters) is easier to achieve without strong transaction protocols like two-phase commit or Paxos. In particular, it is impossible to simultaneously guarantee availability, consistency and partition tolerance [21]. Because network partitions (or delays and failures which mimic partitions) are unavoidable, systems must prioritize either availability or consistency, and most cloud systems choose availability. As a result, cloud systems typically provide a consistency model that is weaker in various ways than traditional ACID databases.

2.2 Classification of Systems and Tradeoffs

We now examine the different architectural decisions made by cloud systems. As with many types of computer systems, no one system can be best for all workloads, and different systems make different tradeoffs in order to optimize for different applications. The main tradeoffs facing cloud serving systems are:

• Read performance versus write performance: In a serving system, it is difficult to predict which record will be read or written next. Unless all data fits in memory, this means that random I/O to the disk is needed to serve reads (e.g., as opposed to scans). Random I/O can be used for writes as well, but much higher write throughput can be achieved by appending all updates to a sequential disk-based log. However, log-structured systems that only store update deltas can be very inefficient for reads if the data is modified over time, as typically multiple updates from different parts of the log must be merged to provide a consistent record. Writing the complete record to the log on each update avoids the cost of reconstruction at read time, but there is a correspondingly higher cost on update. Log-structured merge trees [29] avoid the cost of reconstructing on reads by using a background process to merge updates and cluster records by primary key, but the disk cost of this process can reduce performance for other operations. Overall, then, there is an inherent tradeoff between optimizing for reads and optimizing for writes.
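The delta-merge behavior described above can be illustrated with a minimal sketch (Python, purely illustrative; no real system's storage engine works exactly like this): writes are cheap sequential appends, a read must find and merge every delta for the key, and a background compaction trades extra I/O for faster subsequent reads.

```python
# Minimal sketch of the log-structured tradeoff: writes are cheap
# appends to a sequential log, but a read must merge all deltas for
# a key. Hypothetical structure for illustration only.

class DeltaLogStore:
    def __init__(self):
        self.log = []  # sequential "disk" log of (key, partial-record) deltas

    def write(self, key, fields):
        # Sequential append: fast, no random I/O.
        self.log.append((key, dict(fields)))

    def read(self, key):
        # Must scan and merge every delta written for this key.
        record = {}
        for k, delta in self.log:
            if k == key:
                record.update(delta)
        return record or None

    def compact(self):
        # Background merge (as in log-structured merge trees): collapse
        # each key's deltas into one complete record, speeding up reads
        # at the cost of extra I/O while compaction runs.
        merged = {}
        for k, delta in self.log:
            merged.setdefault(k, {}).update(delta)
        self.log = [(k, rec) for k, rec in merged.items()]

store = DeltaLogStore()
store.write("user1", {"field0": "a"})
store.write("user1", {"field1": "b"})
store.write("user1", {"field0": "c"})  # overwrites field0
print(store.read("user1"))  # merges three deltas -> {'field0': 'c', 'field1': 'b'}
```

Writing the complete record instead of a delta would make `read` return the last matching entry directly, shifting the cost back to update time, which is exactly the tradeoff the text describes.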

A particular case of this tradeoff is seen in systems such as HBase that are based on filesystems optimized for batch processing (for example, HBase is built on HDFS, which is the data store for Hadoop). For a system to excel at batch processing, it requires high throughput sequential reads and writes, rather than fast random accesses; thus, Hadoop only supports append-only files. Updates to existing records must be handled by using a differential files scheme that shares the same disadvantages as a log-structured filesystem with respect to reads.

• Latency versus durability: Writes may be synched to disk before the system returns success to the user, or they may be stored in memory at write time and synched later. The advantages of the latter approach are that avoiding disk greatly improves write latency, and potentially improves throughput (if multiple writes to the same record can be serviced by a single I/O operation or can be condensed in memory). The disadvantage is risk of data loss if a server crashes and loses unsynched updates.

• Synchronous versus asynchronous replication: Replication is used to improve system availability (by directing traffic to a replica after a failure), avoid data loss (by recovering lost data from a replica), and improve performance (by spreading load across multiple replicas and by making low-latency access available to users around the world). However, there are different approaches to replication. Synchronous replication ensures all copies are up to date, but potentially incurs high latency on updates. Furthermore, availability may be impacted if synchronously replicated updates cannot complete while some replicas are offline. Asynchronous replication avoids high write latency (in particular, making it suitable for wide area replication) but allows replicas to be stale. Furthermore, data loss may occur if an update is lost due to failure before it can be replicated.

• Data partitioning: Systems may be strictly row-based, or allow for column storage. In row-based storage all of a record's fields are stored contiguously on disk. With column storage, different columns or groups of columns can be stored separately (possibly on different servers). Row-based storage supports efficient access to an entire record (including low latency reads and insertion/update in a serving-oriented system), and is ideal if we typically access a few records in their entirety. Column-based storage is more efficient for accessing a subset of the columns, particularly when multiple records are accessed.

2.3 A Brief Survey of Cloud Data Systems

To illustrate these tradeoffs, Table 1 lists several cloud systems and the choices they have made for each dimension. We now examine some of these decisions.

An application developer must match their workload requirements to the best suited cloud database system. Consider the read-optimized versus write-optimized tradeoff. BigTable-like systems such as Cassandra and HBase attempt to always perform sequential I/O for updates. Records on disk are never overwritten; instead, updates are written to a buffer in memory, and the entire buffer is written sequentially to disk. Multiple updates to the same record may be flushed at different times to different parts of the disk. The result is that to perform a read of a record, multiple I/Os are needed to retrieve and combine the various updates. Since all writing is sequential, it is very fast; but reads are correspondingly de-optimized. In contrast, a more traditional buffer-pool architecture, such as that in MySQL and PNUTS, overwrites records when they are updated. Because updates require random I/O, they are slower than the BigTable-like systems; but reads are fast because a single I/O can retrieve the entire, up-to-date record.

Latency versus durability is another important axis. If developers know they can lose a small fraction of writes (for example, web poll votes), they can return success to writes without waiting for them to be synched to disk. Cassandra allows the client to specify on a per-call basis whether the write is durably persisted. MySQL and PNUTS always force log updates to disk when committing a transaction, although this log force can be disabled. HBase does not sync log updates to disk, which provides low latency updates and high throughput. This is appropriate for HBase's target use cases, which are primarily to run batch analytics over serving data, rather than to provide guaranteed durability for such data. For such a system, high throughput sequential reads and writes are favored over durability for random updates. Synchronous replication ensures freshness of replicas, and is used in HBase and Cassandra. Cassandra also supports asynchronous replication, as do MySQL and PNUTS. Asynchronous replication supports wide-area replication without adding significant overhead to the update call itself.

Column storage is advantageous for applications that need only access a subset of columns with each request, and know these subsets in advance. BigTable, HBase, and Cassandra all provide the ability to declare column groups or families, and add columns to any of them. Each group/family is physically stored separately. On the other hand, if requests typically want the entire row, or arbitrary subsets of it, partitioning that keeps the entire row physically together is best. This can be done with row storage (as in PNUTS), or by using a single column group/family in a column store like Cassandra.

The systems we discuss here are representative, rather than comprehensive. A variety of other systems make different decisions. While we cannot survey them all here, we will point out a few interesting characteristics. Dynamo, Voldemort and Cassandra use eventual consistency to balance replication and availability. In this model, writes are allowed anywhere, and conflicting writes to the same object are resolved later. Amazon SimpleDB and Microsoft Azure are hosted cloud serving stores. Both provide transactional functions not found in other serving stores. The caveat is that the user must partition their data into different containers, both in terms of size and request rate. SimpleDB calls these containers domains, while Azure calls them databases. The wide variance in design decisions has significant performance implications, which motivated us to develop a benchmark for quantitatively evaluating those implications.

3. BENCHMARK TIERS

In this section, we propose two benchmark tiers for evaluating the performance and scalability of cloud serving systems. Our intention is to extend the benchmark to deal with more tiers for availability and replication, and we discuss our initial ideas for doing so in Section 7.

3.1 Tier 1—Performance

The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. However, there is an inherent tradeoff between latency and throughput: on a given hardware setup, as the amount of load increases, the latency of individual requests increases as well since there is more contention for disk, CPU, network, and so on. Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. A system with better performance will achieve the desired latency and throughput with fewer servers.

System          Read/Write optimized   Latency/durability   Sync/async replication   Row/column
PNUTS           Read                   Durability           Async                    Row
BigTable        Write                  Durability           Sync                     Column
HBase           Write                  Latency              Async                    Column
Cassandra       Write                  Tunable              Tunable                  Column
Sharded MySQL   Read                   Tunable              Async                    Row

Table 1: Design decisions of various systems.

The Performance tier of the benchmark aims to characterize this tradeoff for each database system by measuring latency as we increase throughput, until the point at which the database system is saturated and throughput stops increasing. In the terminology of the Wisconsin Benchmark, a popular early benchmark used for parallel database systems, our metric is similar to sizeup [19], where the hardware is kept constant but the size of the workload increases.

To conduct this benchmark tier, we need a workload generator which serves two purposes: first, to define the dataset and load it into the database; and second, to execute operations against the dataset while measuring performance. We have implemented the YCSB Client (described in more detail in Sections 4 and 5) for both purposes. A set of parameter files defines the nature of the dataset and the operations (transactions) performed against the data. The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
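The measurement procedure this implies (raise the offered throughput step by step, record latency, stop when achieved throughput flattens) can be sketched as follows; `run_trial` is a hypothetical stand-in for running the real client against a database, with a fake server that saturates at 10,000 ops/sec.

```python
def run_trial(target_throughput):
    # Stand-in for running a benchmark client at a fixed offered
    # throughput: returns (achieved throughput, average latency in ms).
    # This fake server saturates at 10,000 ops/sec, with latency
    # rising sharply as it approaches saturation.
    capacity = 10000.0
    achieved = min(target_throughput, capacity)
    utilization = min(achieved / capacity, 0.99)
    latency_ms = 2.0 / (1.0 - utilization)  # queueing-style blow-up
    return achieved, latency_ms

def latency_vs_throughput_curve(start=1000, step=1000):
    # Raise the offered load until achieved throughput stops increasing,
    # i.e., the database is saturated.
    curve, prev_achieved, target = [], 0.0, start
    while True:
        achieved, latency = run_trial(target)
        curve.append((target, achieved, latency))
        if achieved <= prev_achieved:  # saturation: throughput is flat
            break
        prev_achieved, target = achieved, target + step
    return curve

for target, achieved, latency in latency_vs_throughput_curve():
    print(f"offered={target:6d} achieved={achieved:8.0f} latency={latency:6.1f} ms")
```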

3.2 Tier 2—Scaling

A key aspect of cloud systems is their ability to scale elastically, so that they can handle more load as applications add features and grow in popularity. The Scaling tier of the benchmark examines the impact on performance as more machines are added to the system. There are two metrics to measure in this tier:

Scaleup—How does the database perform as the number of machines increases? In this case, we load a given number of servers with data and run the workload. Then, we delete the data, add more servers, load a larger amount of data on the larger cluster, and run the workload again. If the database system has good scaleup properties, the performance (e.g., latency) should remain constant, as the number of servers, amount of data, and offered throughput scale proportionally. This is equivalent to the scaleup metric from [19].

Elastic speedup—How does the database perform as the number of machines increases while the system is running? In this case, we load a given number of servers with data and run the workload. As the workload is running, we add one or more servers, and observe the impact on performance. A system that offers good elasticity should show a performance improvement when the new servers are added, with a short or non-existent period of disruption while the system is reconfiguring itself to use the new server. This is similar to the speedup metric from [19], with the added twist that the new server is added while the workload is running.

4. BENCHMARK WORKLOADS

We have developed a core set of workloads to evaluate different aspects of a system's performance, called the YCSB Core Package. In our framework, a package is a collection of related workloads. Each workload represents a particular mix of read/write operations, data sizes, request distributions, and so on, and can be used to evaluate systems at one particular point in the performance space. A package, which includes multiple workloads, examines a broader slice of the performance space. While the core package examines several interesting performance axes, we have not attempted to exhaustively examine the entire performance space. Users of YCSB can develop their own packages either by defining a new set of workload parameters, or if necessary by writing Java code. We hope to foster an open-source effort to create and maintain a set of packages that are representative of many different application settings through the YCSB open source distribution. The process of defining new packages is discussed in Section 5.

To develop the core package, we examined a variety of systems and applications to identify the fundamental kinds of workloads web applications place on cloud data systems. We did not attempt to exactly model a particular application or set of applications, as is done in benchmarks like TPC-C. Such benchmarks give realistic performance results for a narrow set of use cases. In contrast, our goal was to examine a wide range of workload characteristics, in order to understand in which portions of the space of workloads systems performed well or poorly. For example, some systems may be highly optimized for reads but not for writes, or for inserts but not updates, or for scans but not for point lookups. The workloads in the core package were chosen to explore these tradeoffs directly.

The workloads in the core package are a variation of the same basic application type. In this application, there is a table of records, each with F fields. Each record is identified by a primary key, which is a string like "user234123". Each field is named field0, field1 and so on. The values of each field are a random string of ASCII characters of length L. For example, in the results reported in this paper, we construct 1,000 byte records by using F = 10 fields, each of L = 100 bytes.
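For concreteness, constructing one such record (key "user<N>", F = 10 fields named field0 through field9, each a random ASCII string of L = 100 bytes) might look like this sketch; the function name and parameters are invented for illustration:

```python
import random
import string

def build_record(key_number, num_fields=10, field_length=100):
    # One YCSB-style record: primary key "user<N>", fields named
    # field0..field9, each a random ASCII string of length L.
    key = f"user{key_number}"
    fields = {
        f"field{i}": "".join(
            random.choices(string.ascii_letters + string.digits, k=field_length))
        for i in range(num_fields)
    }
    return key, fields

key, fields = build_record(234123)
total_bytes = sum(len(v) for v in fields.values())
print(key, len(fields), total_bytes)  # user234123 10 1000
```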

Each operation against the data store is randomly chosen to be one of:

• Insert: Insert a new record.

• Update: Update a record by replacing the value of one field.

• Read: Read a record, either one randomly chosen field or all fields.

• Scan: Scan records in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen.

For scans specifically, the distribution of scan lengths is chosen as part of the workload. Thus, the scan() method takes an initial key and the number of records to scan. Of course, a real application may instead specify a scan interval (i.e., from February 1st to February 15th). The number of records parameter allows us to control the size of these intervals, without having to determine and specify meaningful endpoints for the scan. (All of the database calls, including scan(), are described in Section 5.2.1.)

4.1 Distributions

The workload client must make many random choices when generating load: which operation to perform (Insert, Update, Read or Scan), which record to read or write, how many records to scan, and so on. These decisions are governed by random distributions. YCSB has several built-in distributions:

• Uniform: Choose an item uniformly at random. For example, when choosing a record, all records in the database are equally likely to be chosen.

• Zipfian: Choose an item according to the Zipfian distribution. For example, when choosing a record, some records will be extremely popular (the head of the distribution) while most records will be unpopular (the tail).

• Latest: Like the Zipfian distribution, except that the most recently inserted records are in the head of the distribution.

• Multinomial: Probabilities for each item can be specified. For example, we might assign a probability of 0.95 to the Read operation, a probability of 0.05 to the Update operation, and a probability of 0 to Scan and Insert. The result would be a read-heavy workload.
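A sketch of two of these choices in Python: a multinomial choice picks the operation from specified probabilities, and a crude inverse-power-rank sampler stands in for the Zipfian distribution (Section 5.3 describes the proper generator; this simple version is for illustration only).

```python
import random

def multinomial_choice(weighted_items):
    # Choose an item according to specified probabilities, e.g. a
    # 95% read / 5% update mix as described in the text.
    items, weights = zip(*weighted_items)
    return random.choices(items, weights=weights, k=1)[0]

def zipfian_choice(n, theta=0.99):
    # Crude Zipfian-like sampler over items 0..n-1: item i has weight
    # proportional to 1/(i+1)^theta, so low-numbered items form the
    # popular "head". Illustrative only; not the generator YCSB uses.
    weights = [1.0 / (i + 1) ** theta for i in range(n)]
    return random.choices(range(n), weights=weights, k=1)[0]

mix = [("Read", 0.95), ("Update", 0.05), ("Scan", 0.0), ("Insert", 0.0)]
ops = [multinomial_choice(mix) for _ in range(1000)]
print("reads:", ops.count("Read"), "updates:", ops.count("Update"))
print("sample records:", [zipfian_choice(1000) for _ in range(5)])
```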

Figure 1 illustrates the difference between the uniform, zipfian and latest distributions. The horizontal axes in the figure represent the items that may be chosen (e.g., records) in order of insertion, while the vertical bars represent the probability that the item is chosen. Note that the last inserted item may not be inserted at the end of the keyspace. For example, Twitter status updates might be clustered by user, rather than by timestamp, meaning that two recently inserted items may be far apart in the keyspace.

A key difference between the Latest and Zipfian distributions is their behavior when new items are inserted. Under the Latest distribution, the newly inserted item becomes the most popular, while the previously popular items become less so. Under the Zipfian distribution, items retain their popularity even as new items are inserted, whether or not the newly inserted item is popular. The Latest distribution is meant to model applications where recency matters; for example, only recent blog posts or news stories are popular, and the popularity decays quickly. In contrast, the Zipfian distribution models items whose popularity is independent of their newness; a particular user might be extremely popular, with many views of her profile page, even though she joined many years ago.

Figure 1: Probability distributions. Horizontal axes represent items in order of insertion, and vertical axes represent probability of being chosen.

4.2 The Workloads

We defined the workloads in the core package by assigning different distributions to the two main choices we must make: which operation to perform, and which record to read or write. The various combinations are shown in Table 2. Although we do not attempt to model complex applications precisely (as discussed above), we list a sample application that generally has the characteristics of the workload.

Loading the database is likely to take longer than any individual experiment. In our tests, loads took between 10–20 hours (depending on the database system), while we ran each experiment (e.g., a particular workload at a particular target throughput against a particular database) for 30 minutes. All the core package workloads use the same dataset, so it is possible to load the database once and then run all the workloads. However, workloads A and B modify records, and D and E insert records. If database writes are likely to impact the operation of other workloads (e.g., by fragmenting the on-disk representation) it may be necessary to re-load the database. We do not prescribe a particular database loading strategy in our benchmark, since different database systems have different loading mechanisms (including some that have no special bulk load facility at all).

5. DETAILS OF THE BENCHMARK TOOL

We have developed a tool, called the YCSB Client, to execute the YCSB benchmarks. A key design goal of our tool is extensibility, so that it can be used to benchmark new cloud database systems, and so that new workloads can be developed. We have used this tool to measure the performance of several cloud systems, as we report in the next section. This tool is also available under an open source license, so that others may use and extend the tool, and contribute new workloads and database interfaces.

Workload         Operations               Record selection   Application example
A—Update heavy   Read: 50%, Update: 50%   Zipfian            Session store recording recent actions in a user session
B—Read heavy     Read: 95%, Update: 5%    Zipfian            Photo tagging; add a tag is an update, but most operations are to read tags
C—Read only      Read: 100%               Zipfian            User profile cache, where profiles are constructed elsewhere (e.g., Hadoop)
D—Read latest    Read: 95%, Insert: 5%    Latest             User status updates; people want to read the latest statuses
E—Short ranges   Scan: 95%, Insert: 5%    Zipfian/Uniform*   Threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)

*Workload E uses the Zipfian distribution to choose the first key in the range, and the Uniform distribution to choose the number of records to scan.

Table 2: Workloads in the core package

Figure 2: YCSB client architecture. The client reads a workload file (read/write mix, record size, popularity distribution, etc.) and command line properties (DB to use, workload to use, target throughput, number of threads, etc.); a workload executor drives client threads, which issue operations through a DB interface layer to the cloud serving store and report measurements to a stats module.

In this section we describe the architecture of the YCSB client, and examine how it can be extended. We also describe some of the complexities in producing distributions for the workloads.

5.1 Architecture

The YCSB Client is a Java program for generating the data to be loaded to the database, and generating the operations which make up the workload. The architecture of the client is shown in Figure 2. The basic operation is that the workload executor drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. The threads also measure the latency and achieved throughput of their operations, and report these measurements to the statistics module. At the end of the experiment, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies, and either a histogram or time series of the latencies.
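The statistics module's reporting can be sketched as a small aggregation function (illustrative Python with a nearest-rank percentile and fixed-width histogram buckets, not the tool's actual implementation):

```python
def summarize_latencies(latencies_ms, bucket_ms=1):
    # Aggregate per-operation latency measurements the way the text
    # describes: average, 95th and 99th percentile, plus a histogram.
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest-rank percentile over the sorted samples.
        idx = min(n - 1, max(0, int(p / 100.0 * n) - 1))
        return ordered[idx]

    histogram = {}
    for latency in ordered:
        bucket = int(latency // bucket_ms) * bucket_ms
        histogram[bucket] = histogram.get(bucket, 0) + 1

    return {
        "avg": sum(ordered) / n,
        "p95": percentile(95),
        "p99": percentile(99),
        "histogram": histogram,
    }

stats = summarize_latencies([1.2, 2.5, 2.7, 3.1, 40.0])
print(stats["avg"], stats["p95"], stats["p99"])
```

Note how the single 40 ms outlier inflates the average while leaving the percentiles untouched, which is why serving benchmarks report both.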

The client takes a series of properties (name/value pairs) which define its operation. By convention, we divide these properties into two groups:

• Workload properties: Properties defining the workload, independent of a given database or experimental run. For example, the read/write mix of the database, the distribution to use (zipfian, latest, etc.), and the size and number of fields in a record.

• Runtime properties: Properties specific to a given experiment. For example, the database interface layer to use (e.g., Cassandra, HBase, etc.), properties used to initialize that layer (such as the database service hostnames), the number of client threads, etc.

Thus, there can be workload property files which remain static and are used to benchmark a variety of databases (such as the YCSB core package described in Section 4). In contrast, runtime properties, while also potentially stored in property files, will vary from experiment to experiment, as the database, target throughput, etc., change.
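An illustrative pair of property files might look like the following (the parameter names here are hypothetical, chosen for readability; the actual names are defined by the tool):

```properties
# workload.properties -- static workload definition, reusable across databases
readproportion=0.95
updateproportion=0.05
requestdistribution=zipfian
fieldcount=10
fieldlength=100

# runtime.properties -- varies from experiment to experiment
db=cassandra
hosts=server1,server2
threadcount=100
target=5000
```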

5.2 Extensibility

A primary goal of YCSB is extensibility. In fact, one of our motivations was to make it easy for developers to benchmark the increasing variety of cloud serving systems. The shaded boxes in Figure 2 show the components which can be easily replaced. The Workload Executor contains code to execute both the load and transaction phases of the workload. The YCSB package includes CoreWorkload, a standard workload executor for the core package described in Section 4. Users of YCSB can define new packages in two ways. The most straightforward is to define a set of workloads that use CoreWorkload but define different workload parameters. This allows users to vary several axes of the core package: which operation to perform, the skew in record popularity, and the size and number of records. The second approach is to define a new workload executor class (e.g., by writing Java) and associated parameters. This approach allows for introducing more complex operations, and exploring different tradeoffs, than the core package does; but involves greater effort compared to the former approach.

The Database Interface Layer translates simple requests (such as read()) from the client threads into calls against the database (such as Thrift calls to Cassandra or REST requests to PNUTS). The Workload Executor and Database Interface Layer classes to use for an experiment are specified as properties, and those classes are loaded dynamically when the client starts. Of course, as an open source package, any class in the YCSB tool can be replaced, but the Workload Executor and Database Interface Layer can be replaced most easily. We now discuss in more detail how the YCSB client can be extended with new database backends and workloads.

5.2.1 New Database Backends

The YCSB Client can be used to benchmark new database systems by writing a new class to implement the following methods:

• read()—read a single record from the database, and return either a specified set of fields or all fields.

• insert()—insert a single record into the database.

• update()—update a single record in the database, adding or replacing the specified fields.

• delete()—delete a single record in the database.

• scan()—execute a range scan, reading a specified number of records starting at a given record key.

These operations are quite simple, representing the standard "CRUD" operations: Create, Read, Update, Delete, with Read operations to read one record or to scan records. Despite its simplicity, this API maps well to the native APIs of many of the cloud serving systems we examined.
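A toy backend implementing these five calls against a sorted in-memory table might look like the following sketch (Python for brevity; the real client is Java, and details such as class names and signatures are the tool's own):

```python
class InMemoryDB:
    # Toy "database interface layer" implementing the five calls above
    # against an in-memory table, for illustration only.
    def __init__(self):
        self.table = {}

    def read(self, key, fields=None):
        record = self.table.get(key)
        if record is None:
            return None
        if fields is None:                 # return all fields
            return dict(record)
        return {f: record[f] for f in fields if f in record}

    def insert(self, key, values):
        self.table[key] = dict(values)

    def update(self, key, values):
        # Add or replace the specified fields.
        self.table.setdefault(key, {}).update(values)

    def delete(self, key):
        self.table.pop(key, None)

    def scan(self, start_key, record_count):
        # Range scan: read record_count records in key order from start_key.
        keys = sorted(k for k in self.table if k >= start_key)
        return [(k, dict(self.table[k])) for k in keys[:record_count]]

db = InMemoryDB()
db.insert("user100", {"field0": "a"})
db.insert("user200", {"field0": "b"})
db.update("user100", {"field1": "c"})
print(db.read("user100"))                     # {'field0': 'a', 'field1': 'c'}
print([k for k, _ in db.scan("user100", 2)])  # ['user100', 'user200']
```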

5.2.2 New Workload Executors

A user can define a new workload executor to replace CoreWorkload by extending the Workload class of the YCSB framework. One instance of the workload object is created and shared among the worker threads, which allows the threads to share common distributions, counters and so on. For example, the workload object can maintain a counter used to generate new unique record ids when inserting records. Similarly the workload object can maintain a common LatestGenerator object, which assigns high popularity to the latest record ids generated by the counter. For each operation, the thread will either execute the workload object's doInsert() method (if the client is in the load phase) or the workload object's doTransaction() method (if the client is in the transaction phase).
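The structure described above can be sketched as follows (illustrative Python; the real tool extends a Java Workload class, and the names below are invented for the example): one shared workload object holds a locked counter for unique record ids, and the two phase methods consume it.

```python
import random
import threading

class SimpleWorkload:
    # Sketch of a workload object shared by all worker threads: a common
    # counter generates unique record ids for inserts, and do_insert()/
    # do_transaction() drive the load and transaction phases.
    def __init__(self, read_proportion=0.95):
        self.read_proportion = read_proportion
        self.lock = threading.Lock()
        self.inserted = 0

    def next_key(self):
        with self.lock:                  # shared counter -> unique ids
            key_number = self.inserted
            self.inserted += 1
        return f"user{key_number}"

    def do_insert(self, db):
        # Load phase: insert a new record under a fresh unique key.
        db.insert(self.next_key(), {"field0": "x" * 100})

    def do_transaction(self, db):
        # Transaction phase: a read-heavy mix over already-inserted keys.
        key = f"user{random.randrange(self.inserted)}"
        if random.random() < self.read_proportion:
            db.read(key)
        else:
            db.update(key, {"field0": "y" * 100})
```

The `db` argument is any object with the five-call interface of Section 5.2.1; a Latest-style generator would replace the uniform `randrange` with one that favors the highest ids produced by the counter.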

5.3 Distributions

One unexpectedly complex aspect of implementing the YCSB tool involved implementing the Zipfian and Latest distributions. In particular, we used the algorithm for generating a Zipfian-distributed sequence from Gray et al [23]. However, this algorithm had to be modified in several ways to be used in our tool. The first problem is that the popular items are clustered together in the keyspace. In particular, the most popular item is item 0; the second most popular item is item 1, and so on. For the Zipfian distribution, the popular items should be scattered across the keyspace. In real web applications, the most popular user or blog topic is not necessarily the lexicographically first item.

To scatter items across the keyspace, we hashed the output of the Gray generator. That is, we called a nextItem() method to get the next (integer) item, then took a hash of that value to produce the key that we use. The choice of hash function is critical: the Java built-in String.hashCode() function tended to leave the popular items clustered. Furthermore, after hashing, collisions meant that only about 80 percent of the keyspace would be generated in the sequence. This was true even as we tried a variety of hash functions (FNV, Jenkins, etc.). One approach would be to use perfect hashing, which avoids collisions, with a downside that more setup time is needed to construct the perfect hash (multiple minutes for hundreds of millions of records) [15]. The approach that we took was to construct a Zipfian generator for a much larger keyspace than we actually needed; apply the FNV hash to each generated value; and then take mod N (where N is the size of the keyspace, that is, the number of records in the database). The result was that 99.97% of the keyspace is generated, and the generated keys continued to have a Zipfian distribution.
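The scattering step can be sketched as follows. The FNV-1a 64-bit offset basis and prime are the standard published constants; the Zipfian generator itself is elided, so scatter() only shows the hash-then-mod mapping applied to each draw.

```java
// Sketch of the key-scattering step: hash each (clustered) Zipfian draw
// with 64-bit FNV, then reduce mod N to land in the actual keyspace.
class ScatteredZipfian {
    private static final long FNV_OFFSET = 0xCBF29CE484222325L; // FNV-1a 64-bit offset basis
    private static final long FNV_PRIME  = 0x100000001B3L;      // FNV-1a 64-bit prime

    // Byte-wise FNV over the 8 bytes of a long.
    static long fnvHash64(long val) {
        long hash = FNV_OFFSET;
        for (int i = 0; i < 8; i++) {
            long octet = val & 0xFF; // low byte
            hash ^= octet;
            hash *= FNV_PRIME;
            val >>>= 8;
        }
        return hash;
    }

    // Map a Zipfian draw (from a space much larger than n) onto [0, n).
    // floorMod keeps the result non-negative even if the hash is negative.
    static long scatter(long zipfianItem, long n) {
        return Math.floorMod(fnvHash64(zipfianItem), n);
    }
}
```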

The second issue was dealing with changing numbers of items in the distribution. For some workloads, new records are inserted into the database. The Zipfian distribution should result in the same records being popular, even after insertions, while in the Latest distribution, popularity should shift to the new keys. For the Latest, we computed a new distribution when a record was inserted; to do this cheaply we modified the Gray algorithm of [23] to compute its constants incrementally. For Zipfian, we expanded the initial keyspace to the expected size after inserts. If a dataset had N records, and the workload had T total operations, with an expected fraction I of inserts, then we constructed the Zipfian generator to draw from a space of size N + T × I + ε. We added an additional factor ε since the actual number of inserts depends on the random choice of operations during the workload according to a multinomial distribution. While running the workload, if the generator produced an item which had not been inserted yet, we skipped that value and drew another. Then, the popularity distribution did not shift as new records were inserted.
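The sizing arithmetic above is simple enough to sketch directly. The slack term ε here is an illustrative 10% of the expected inserts; the paper does not specify its value.

```java
// Size the Zipfian generator's key space so the same keys stay popular as
// inserts arrive: N records + expected inserts (T * I) + slack epsilon.
class KeySpaceSizing {
    static long expandedKeySpace(long records, long totalOps, double insertFraction) {
        long expectedInserts = (long) (totalOps * insertFraction);
        long epsilon = (long) (0.1 * expectedInserts); // assumed slack, not YCSB's actual choice
        return records + expectedInserts + epsilon;
    }
}
```

For example, a 1000-record dataset with 10000 operations at a 10% insert mix would draw from a space of 1000 + 1000 + 100 = 2100 keys; draws that fall on not-yet-inserted keys are skipped and redrawn, as described above.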

6. RESULTS

We present benchmarking results for four systems: Cassandra, HBase, PNUTS and sharded MySQL. While both Cassandra and HBase have a data model similar to that of Google's BigTable [16], their underlying implementations are quite different: HBase's architecture is similar to BigTable (using synchronous updates to multiple copies of data chunks), while Cassandra's is similar to Dynamo [18] (e.g., using gossip and eventual consistency). PNUTS has its own data model, and also differs architecturally from the other systems. Our implementation of sharded MySQL (like other implementations we have encountered) does not support elastic growth and data repartitioning. However, it serves well as a control in our experiments, representing a conventional distributed database architecture, rather than a cloud-oriented system designed to be elastic. More details of these systems are presented in Section 2.

In our tests, we ran the workloads of the core package described in Section 4, both to measure performance (benchmark tier 1) and to measure scalability and elasticity (benchmark tier 2). Here we report the average latency of requests. The 95th and 99th percentile latencies are not reported, but followed the same trends as average latency. In summary, our results show:

• The hypothesized tradeoffs between read and write optimization are apparent in practice: Cassandra and HBase have higher read latencies on a read heavy workload than PNUTS and MySQL, and lower update latencies on a write heavy workload.

• PNUTS and Cassandra scaled well as the number of servers and workload increased proportionally. HBase's performance was more erratic as the system scaled.

• Cassandra, HBase and PNUTS were able to grow elastically while the workload was executing. However, PNUTS provided the best, most stable latency while elastically repartitioning data.

Figure 3: Workload A — update heavy: (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.

It is important to note that the results we report here are for particular versions of systems that are undergoing continuous development, and the performance may change and improve in the future. Even during the interval from the initial submission of this paper to the camera ready version, both HBase and Cassandra released new versions that significantly improved the throughput they could support. We provide results primarily to illustrate the tradeoffs between systems and demonstrate the value of the YCSB tool in benchmarking systems. This value is both to users and developers of cloud serving systems: for example, while trying to understand one of our benchmarking results, the HBase developers uncovered a bug and, after simple fixes, nearly doubled throughput for some workloads.

6.1 Experimental Setup

For most experiments, we used six server-class machines (dual 64-bit quad core 2.5 GHz Intel Xeon CPUs, 8 GB of RAM, 6 disk RAID-10 array and gigabit ethernet) to run each system. We also ran PNUTS on a 47 server cluster to successfully demonstrate that YCSB can be used to benchmark larger systems. PNUTS required two additional machines to serve as a configuration server and router, and HBase required an additional machine called the "master server." These servers were lightly loaded, and the results we report here depend primarily on the capacity of the six storage servers. The YCSB Client ran on a separate 8 core machine. The Client was run with up to 500 threads, depending on the desired offered throughput. We observed in our tests that the client machine was not a bottleneck; in particular, the CPU was almost idle as most time was spent waiting for the database system to respond.

We ran Cassandra 0.5.0, HBase 0.20.3, and MySQL 5.1.24 (for PNUTS) and 5.1.32 (for sharded MySQL). For one experiment (elastic speedup), we used Cassandra 0.6.0-beta2, at the suggestion of the Cassandra developers. For Cassandra, we used the OrderedPartitioner with node tokens evenly spaced around the key space. Our sharded MySQL implementation used client-side hashing to determine which server a given record should be stored on.

We configured and tuned each system as well as we knew how. In particular, we received extensive tuning assistance from members of the development teams of the Cassandra, HBase and PNUTS systems. For HBase, we allocated 1 GB of heap to Hadoop, and 5 GB to HBase. For PNUTS and sharded MySQL, we allocated 6 GB of RAM to the MySQL buffer pool. For Cassandra, we allocated 3 GB of heap to the JVM, at the suggestion of Cassandra developers, so the rest of RAM could be used for the Linux filesystem buffer. We disabled replication on each system so that we could benchmark the baseline performance of the system itself. In ongoing work we are examining the impact of replication. For Cassandra, sharded MySQL and PNUTS, all updates were synched to disk before returning to the client. HBase does not sync to disk, but relies on in-memory replication across multiple servers for durability; this increases write throughput and reduces latency, but can result in data loss on failure. We ran HBase experiments with and without client-side buffering; since buffering gave a significant throughput benefit, we mainly report on those numbers. Cassandra, and possibly PNUTS and sharded MySQL, may have benefited if we had given them a dedicated log disk. However, to ensure a fair comparison, we configured all systems with a single RAID-10 array and no dedicated log disk. Users of YCSB are free to set up alternative hardware configurations to see if they can get better performance.

HBase performance is sensitive to the number of log structured files per key range, and the number of writes buffered in memory. HBase shrinks these numbers using compactions and flushes, respectively, and they can be system or user-initiated. We periodically applied these operations during our experiments; but HBase users must evaluate how often such operations are needed in their own environment.

Our database consisted of 120 million 1 KB records, for a total size of 120 GB. Each server thus had an average of 20 GB of data, more than it could cache entirely in RAM. Read operations retrieved an entire record, while updates modified one of the fields (out of ten).

6.2 Workload A — Update Heavy

First, we examined Workload A, which has 50 percent reads and 50 percent updates. Figure 3 shows latency versus throughput curves for each system for both the read and update operations. In each case, we increased the offered throughput until the actual throughput stopped increasing. As the figure shows, for all systems, operation latency increased as offered throughput increased. Cassandra, which is optimized for write-heavy workloads, achieved the best throughput and the lowest latency for reads. At high throughputs, Cassandra's efficient sequential use of disk for updates reduces contention for the disk head, meaning that read latency is lower than for the other systems. PNUTS has slightly higher latency than MySQL, because it has extra logic on top of MySQL for distributed consistency. HBase had very low latency for updates, since updates are buffered in memory. With buffering off, writes are committed to memory on the HBase server, and latency was only 10-50% lower than read latency. Because of efficient sequential updates, Cassandra also provides the best throughput, peaking at 11978 operations/sec, compared to 7904 operations/sec for HBase, 7283 operations/sec for sharded MySQL, and 7448 operations/sec for PNUTS.

Figure 4: Workload B — read heavy: (a) read operations, (b) update operations.

6.3 Workload B — Read Heavy

Workload B, the read heavy workload, provides a different picture than workload A. The results for workload B are shown in Figure 4. As the figure shows, PNUTS and sharded MySQL are now able to provide lower latencies on reads than either Cassandra or HBase. Cassandra and HBase still perform better for writes. The extra disk I/Os being done by Cassandra to assemble records for reading dominates its performance on reads. Note that Cassandra only begins to show higher read latency at high throughputs, indicating that the effects matter primarily when the disk is close to saturation. HBase also has to reconstruct fragments of records from multiple disk pages. However, the read latency is relatively higher because of HBase's log-structured storage implementation. HBase flushes its memtables to disk in separate files, and potentially must search each such file for fragments of the record, even if only some contain relevant fragments. In fact, we observe worse latency and throughput as the number of files grows, and improvement when the number shrinks through compaction. When there is a great deal of fragmentation (for example, after a large number of writes), throughput drops to as low as 4800 operations/sec due to the expense of reconstruction. Future improvements, such as Bloom filters, may cut down such false positive searches.

6.4 Workload E — Short Ranges

We ran workload E (short ranges) using HBase, Cassandra and PNUTS, with ranges up to 100 records. Our sharded MySQL implementation is a hash table and does not support range scans. The results are shown in Figure 5. As the results show, both HBase and PNUTS can sustain similar maximum throughputs (1519 ops/sec for HBase, and 1440 ops/sec for PNUTS) with roughly equivalent latency. Cassandra performs much worse. Cassandra's range scan support is relatively new in version 0.5.0, and is not yet heavily optimized; future versions may or may not be able to provide better scan latency. HBase and PNUTS have similar scan performance, but only for short scans. We ran an experiment where we varied the average range query from 25 to 800 records. The results (not shown) demonstrate that HBase has better scan performance when the range scans retrieve more than 100 records on average. MySQL's on disk storage used a B-Tree to support low latency reads; but B-trees inherently have some fragmentation, so the cost of reading pages with empty space increases the cost of a range scan in PNUTS. HBase stores data more compactly on disk, improving performance for large ranges. For example, when the average range scan is for 800 records, HBase's response time is 3.5 times faster than PNUTS.

Figure 5: Workload E — short scans.

6.5 Other Workloads

Next, we ran the other core YCSB workloads. The results of workload C (read only) were similar to those of the read-heavy workload B: PNUTS and sharded MySQL achieved the lowest latency and highest throughput for the read operations. Workload D (read latest) also showed similar results to workload B. Although the popularity of items shifted over time, the dominant effect was that PNUTS and MySQL were most efficient for reads, and workload D is dominated by read requests. Note that in workload D the recently inserted records are not necessarily clustered on the same server. Such a clustering scheme would not likely be used in practice, as it would result in a severe hotspot on one server while the other servers were underutilized. "Read latest" applications instead typically construct the record keys to cluster records by some other attribute to avoid this hotspot, and we did this as well. This mimics the design of an application like blogging or Twitter, where recent updates are most popular but are likely clustered by user rather than by time. We have omitted the graphs for workloads C and D for lack of space.

We also ran a "read-modify-write" workload that reflects the frequently used pattern of reading a record, modifying it, and writing the changes back to the database. This workload is similar to workload A (50/50 read/write ratio) except that the updates are "read-modify-write" rather than blind writes. The results (not shown) showed the same trends as workload A.

6.6 Scalability

So far, all experiments have used six storage servers. (As mentioned above, we did run one experiment with PNUTS on 47 servers, to verify the scalability of YCSB itself.) However, cloud serving systems are designed to scale out: more load can be handled when more servers are added. We first tested the scaleup capability of Cassandra, HBase and PNUTS by varying the number of storage servers from 2 to 12 (while varying the data size and request rate proportionally). The resulting read latency for workload B is shown in Figure 6. As the results show, latency is nearly constant for Cassandra and PNUTS, indicating good elastic scaling properties. In contrast, HBase's performance varies a lot as the number of servers increases; in particular, performance is better for larger numbers of servers. A known issue in HBase is that its behavior is erratic for very small clusters (less than three servers).

Figure 6: Read performance as cluster size increases.

6.7 Elastic Speedup

We also examined the elastic speedup of Cassandra, HBase and PNUTS. Sharded MySQL is inherently inelastic. In this case, we started two servers, loaded with 120 GB of data. We then added more servers, one at a time, until there were six servers running. After adding each server, we attempted to run long enough to allow the system to stabilize before adding the next server (although in some cases it was difficult to determine if the system had truly stabilized). During the entire run, we set the YCSB client to offer the same throughput; in particular, the offered throughput was 80 percent of that achievable with 6 servers. This models a situation where we attempt to elastically expand an overloaded cluster to a size where it can handle the offered load, a situation that often occurs in practice.

First, we show a slice of the time series of read latencies for Cassandra in Figure 7(a). In this figure, the first ten minutes represent five servers, after performance has stabilized; then, a sixth server is added. As the figure shows, this results in a sharp increase in latency, as well as a wide variance in the latency. This performance degradation results from moving data to the 6th server; regular serving requests compete for disk and network resources with the data repartitioning process, resulting in high and highly variable latency. In fact, under load, it takes many hours for Cassandra to stabilize. In our test, we had to stop the YCSB workload after 5.5 hours to allow the system to complete its repartitioning and quiesce. The high cost of repartitioning is a known issue with Cassandra 0.5.0 and is being optimized in ongoing development. After completing its repartitioning, the performance of the system under load matched that of a system that had started with 6 servers, indicating that eventually, the elastic expansion of the cluster will result in good performance.

Results for HBase are shown in Figure 7(b). As before, this figure represents just one slice of the total experiment. As the figure shows, the read latency spikes initially after the sixth server is added, before the latency stabilizes at a value slightly lower than the latency for five servers. This result indicates that HBase is able to shift read and write load to the new server, resulting in lower latency. HBase does not move existing data to the new server until compactions occur. (It is possible to run the HDFS load balancer to force data to the new servers, but this greatly disrupts HBase's ability to serve data partitions from the same servers on which they are stored.) The result is less latency variance compared to Cassandra since there is no repartitioning process competing for the disk. However, the new server is underutilized, since existing data is served off the old servers.

A similar slice of the time series (adding a sixth server at time=10 minutes) is shown for PNUTS in Figure 7(c). PNUTS also moves data to the new server, resulting in higher latency after the sixth server is added, as well as latency variability. However, PNUTS is able to stabilize more quickly, as its repartitioning scheme is more optimized. After stabilizing at time=80 minutes, the read latency is comparable to a cluster that had started with six servers, indicating that PNUTS provides good elastic speedup.

7. FUTURE WORK

In addition to performance comparisons, it is important to examine other aspects of cloud serving systems. In this section, we propose two more benchmark tiers which we are developing in ongoing work.

7.1 Tier 3 — Availability

A cloud database must be highly available despite failures, and the Availability tier measures the impact of failures on the system. The simplest way to measure availability is to start a workload, kill a server while the workload is running, and observe any resulting errors and performance impact. However, in a real deployed system, a variety of failures can occur at various other levels, including the disk, the network,

Figure 7: Elastic speedup: Time series showing impact of adding servers online.

and the whole datacenter. These failures may be the result of hardware failure (for example, a faulty network interface card), power outage, software faults, and so on. A proper availability benchmark would cause (or simulate) a variety of different faults and examine their impact.

At least two difficulties arise when benchmarking availability. First, injecting faults is not straightforward. While it is easy to ssh to a server and kill the database process, it is more difficult to cause a network fault for example. One approach is to allow each database request to carry a special fault flag that causes the system to simulate a fault. This approach is particularly amenable to benchmarking because the workload generator can add the appropriate flag for different tests to measure the impact of different faults. A "fault-injection" header can be inserted into PNUTS requests, but to our knowledge, a similar mechanism is not currently available in the other systems we benchmarked. An alternate approach to fault injection that works well at the network layer is to use a system like ModelNet [34] or Emulab [33] to simulate the network layer, and insert the appropriate faults.
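A request-level fault flag of the kind suggested above might look like the following sketch. The types and fault names are entirely hypothetical; none of the benchmarked systems exposes exactly this API.

```java
// Hypothetical fault-injection flag carried on a benchmark request.
// A cooperating server, on seeing a non-NONE flag, would simulate the
// named fault while handling this particular request.
enum Fault { NONE, DROP_NETWORK, SLOW_DISK, CRASH_REPLICA }

class BenchmarkRequest {
    final String key;          // record key the operation targets
    final Fault injectedFault; // fault the server should simulate

    BenchmarkRequest(String key, Fault injectedFault) {
        this.key = key;
        this.injectedFault = injectedFault;
    }
}
```

The workload generator could then tag a configurable fraction of requests with each fault type and measure the resulting error rates and latency impact.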

The second difficulty is that different systems have different components, and therefore different, unique failure modes. Thus, it is difficult to design failure scenarios that cover all systems.

Despite these difficulties, evaluating the availability of systems is important and we are continuing to work on developing benchmarking approaches.

7.2 Tier 4 — Replication

Cloud systems use replication for both availability and performance. Replicas can be used for failover, and to spread out read load. Some systems are also designed to split write requests across replicas to improve performance, although a consistency mechanism (such as eventual consistency in Cassandra, or timeline consistency in PNUTS) is needed to avoid corruption due to write-write conflicts.

Replication may be synchronous or asynchronous. HBase, for example, writes synchronously to multiple replicas, while PNUTS performs replication asynchronously. Thus, the following measures are important when evaluating replication:

• Performance cost or benefit — what is the impact to performance as we increase the replication factor on a constant amount of hardware? The extra work to maintain replicas may hurt performance, but the extra replicas can potentially be used to spread out load, improving performance.

• Availability cost or benefit — what is the impact to availability as we increase the replication factor on a constant amount of hardware? In some systems, the read availability may increase but the write availability may decrease if all replicas must be live to commit an update.

• Freshness — are replicas consistent, or are some replicas stale? How much of the data is stale and how stale is it? This is primarily an issue for systems that use asynchronous replication.

• Wide area performance — how does replication perform between datacenters in geographically separate locations? Some replication schemes are optimized to be used within the datacenter or between nearby datacenters, while others work well in globally distributed datacenters.

8. RELATED WORK

Benchmarking

Benchmarking is widely used for evaluating computer systems, and benchmarks exist for a variety of levels of abstraction, from the CPU, to the database software, to complete enterprise systems. Our work is most closely related to database benchmarks. Gray surveyed popular database benchmarks, such as the TPC benchmarks and the Wisconsin benchmark, in [22]. He also identified four criteria for a successful benchmark: relevance to an application domain, portability to allow benchmarking of different systems, scalability to support benchmarking large systems, and simplicity so the results are understandable. We have aimed to satisfy these criteria by developing a benchmark that is relevant to serving systems, portable to different backends through our extensibility framework, scalable to realistic data sizes, and employing simple transaction types.

Despite the existence of database benchmarks, we felt it was necessary to define a new benchmark for cloud serving systems. First, most cloud systems do not have a SQL interface, and support only a subset of relational operations (usually, just the CRUD operations), so that the complex queries of many existing benchmarks were not applicable. Second, the use cases of cloud systems are often different than traditional database applications, so that narrow domain benchmarks (such as the debit/credit style benchmarks like TPC-A or E-commerce benchmarks like TPC-W) may not match the intended usage of the system. Furthermore, our goal was to develop a benchmarking framework that could be used to explore the performance space of different systems, rather than to measure a single performance number representing a particular application. It is for similar reasons that new benchmarks have been developed for other non-traditional database systems (such as XMark for XML systems [28] and Linear Road for stream systems [14]).

Designing an accurate and fair benchmark, and using it to gather accurate results, is non-trivial. Seltzer et al [30] argue that many micro and macro benchmarks do not effectively model real workloads. One approach they propose (the vector approach) is to measure the performance of system operations, and compute the expected performance for a particular application that uses some specified combination of those operations. Our approach is similar to this, except that we directly measure the performance of a particular combination of operations; this allows us to accurately measure the impact of things like disk or cache contention when the operations are used together. Shivam et al [31] describe a workbench tool for efficiently running multiple benchmark tests to achieve high confidence results. Their tool interfaces with a workload generator, like the YCSB Client, to execute each run. We are examining the possibility of using their workbench to run our benchmark.

Cloud systems

The term "cloud" has been used for a variety of different kinds of systems and architectures. A special issue of the Data Engineering Bulletin [25] showcased several aspects of data management in the cloud. We have focused on serving systems like PNUTS, Cassandra, and others. In contrast, batch systems provide near-line or offline analysis, but are not appropriate for online serving. Pavlo et al [26] have benchmarked cloud systems like Hadoop against more traditional relational systems and relational column stores like Vertica/C-Store [32]. Some batch systems might use the same database systems that would be used in a serving environment. For example, HBase can be used both as a serving store and as a storage backend for Hadoop, and is reportedly used this way at StumbleUpon, one of the major developers of HBase [27].

9. CONCLUSIONS

We have presented the Yahoo! Cloud Serving Benchmark. This benchmark is designed to provide tools for apples-to-apples comparison of different serving data stores. One contribution of the benchmark is an extensible workload generator, the YCSB Client, which can be used to load datasets and execute workloads across a variety of data serving systems. Another contribution is the definition of five core workloads, which begin to fill out the space of performance tradeoffs made by these systems. New workloads can be easily created, including generalized workloads to examine system fundamentals, as well as more domain-specific workloads to model particular applications. As an open-source package, the YCSB Client is available for developers to use and extend in order to effectively evaluate cloud systems. We have used this tool to benchmark the performance of four cloud serving systems, and observed that there are clear tradeoffs between read and write performance that result from each system's architectural decisions. These results highlight the importance of a standard framework for examining system performance so that developers can select the most appropriate system for their needs.

10. ACKNOWLEDGEMENTS

We would like to thank the system developers who helped us tune the various systems: Jonathan Ellis from Cassandra, Ryan Rawson and Michael Stack from HBase, and the Sherpa Engineering Team in Yahoo! Cloud Computing.

11. REFERENCES

[1] Amazon SimpleDB. http://aws.amazon.com/simpledb/.
[2] Apache Cassandra. http://incubator.apache.org/cassandra/.
[3] Apache CouchDB. http://couchdb.apache.org/.
[4] Apache HBase. http://hadoop.apache.org/hbase/.
[5] Dynomite Framework. http://wiki.github.com/cliffmoon/dynomite/dynomite-framework.
[6] Google App Engine. http://appengine.google.com.
[7] Hypertable. http://www.hypertable.org/.
[8] mongodb. http://www.mongodb.org/.
[9] Project Voldemort. http://project-voldemort.com/.
[10] Solaris FileBench. http://www.solarisinternals.com/wiki/index.php/FileBench.
[11] SQL Data Services/Azure Services Platform. http://www.microsoft.com/azure/data.mspx.
[12] Storage Performance Council. http://www.storageperformance.org/home.
[13] Yahoo! Query Language. http://developer.yahoo.com/yql/.
[14] A. Arasu et al. Linear Road: a stream data management benchmark. In VLDB, 2004.
[15] F. C. Botelho, D. Belazzougui, and M. Dietzfelbinger. Compress, hash and displace. In Proc. of the 17th European Symposium on Algorithms, 2009.
[16] F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[17] B. F. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. In VLDB, 2008.
[18] G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
[19] D. J. DeWitt. The Wisconsin Benchmark: Past, present and future. In J. Gray, editor, The Benchmark Handbook. Morgan Kaufmann, 1993.
[20] I. Eure. Looking to the future with Cassandra. http://blog.digg.com/?p=966.
[21] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.
[22] J. Gray, editor. The Benchmark Handbook For Database and Transaction Processing Systems. Morgan Kaufmann, 1993.
[23] J. Gray et al. Quickly generating billion-record synthetic databases. In SIGMOD, 1994.
[24] A. Lakshman, P. Malik, and K. Ranganathan. Cassandra: A structured storage system on a P2P network. In SIGMOD, 2008.
[25] B. C. Ooi and S. Parthasarathy. Special issue on data management on cloud computing platforms. IEEE Data Engineering Bulletin, vol. 32, 2009.
[26] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[27] R. Rawson. HBase intro. In NoSQL Oakland, 2009.
[28] A. Schmidt et al. XMark: A benchmark for XML data management. In VLDB, 2002.
[29] R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. In VLDB, 2008.
[30] M. Seltzer, D. Krinsky, K. A. Smith, and X. Zhang. The case for application-specific benchmarking. In Proc. HotOS, 1999.
[31] P. Shivam et al. Cutting corners: Workbench automation for server benchmarking. In Proc. USENIX Annual Technical Conference, 2008.
[32] M. Stonebraker et al. C-store: a column-oriented DBMS. In VLDB, 2005.
[33] B. White et al. An integrated experimental environment for distributed systems and networks. In OSDI, 2002.
[34] K. Yocum et al. Scalability and accuracy in a large-scale network emulator. In OSDI, 2002.
