Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, Russell Sears
Yahoo! Research, Santa Clara, CA, USA
{cooperb,silberst,etam,ramakris,sears}@yahoo-inc.com
ABSTRACT
While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible—it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.
Categories and Subject Descriptors: H.3.4 [Systems and Software]: Performance evaluation
General Terms: Measurement, Performance
1. INTRODUCTION
There has been an explosion of new systems for data storage and management "in the cloud." Open source systems include Cassandra [2, 24], HBase [4], Voldemort [9] and others [3, 5, 7, 8]. Some systems are offered only as cloud services, either directly in the case of Amazon SimpleDB [1] and Microsoft Azure SQL Services [11], or as part of a programming environment like Google's AppEngine [6] or Yahoo!'s YQL [13]. Still other systems are used only within a particular company, such as Yahoo!'s PNUTS [17], Google's BigTable [16], and Amazon's Dynamo [18]. Many of these "cloud" systems are also referred to as "key-value stores" or "NoSQL systems," but regardless of the moniker, they share the goals of massive scaling "on demand" (elasticity) and simplified application development and deployment.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SoCC'10, June 10–11, 2010, Indianapolis, Indiana, USA.
Copyright 2010 ACM 978-1-4503-0036-0/10/06 ...$10.00.
The large variety has made it difficult for developers to choose the appropriate system. The most obvious differences are between the various data models, such as the column-group oriented BigTable model used in Cassandra and HBase versus the simple hashtable model of Voldemort or the document model of CouchDB. However, the data models can be documented and compared qualitatively. Comparing the performance of various systems is a harder problem. Some systems have made the decision to optimize for writes by using on-disk structures that can be maintained using sequential I/O (as in the case of Cassandra and HBase), while others have optimized for random reads by using a more traditional buffer-pool architecture (as in the case of PNUTS). Furthermore, decisions about data partitioning and placement, replication, transactional consistency, and so on all have an impact on performance.
Understanding the performance implications of these decisions for a given type of application is challenging. Developers of various systems report performance numbers for the "sweet spot" workloads for their system, which may not match the workload of a target application. Moreover, an apples-to-apples comparison is hard, given numbers for different systems based on different workloads. Thus, developers often have to download and manually evaluate multiple systems. Engineers at Digg [20] reported evaluating eight different data stores in order to implement one feature (the Green Badge, or "what have my friends dugg" feature). There have been multiple similar examples at Yahoo!. This process is time-consuming and expensive.
Our goal is to create a standard benchmark and benchmarking framework to assist in the evaluation of different cloud systems. We focus on serving systems, which are systems that provide online read/write access to data. That is, usually a web user is waiting for a web page to load, and reads and writes to the database are carried out as part of the page construction and delivery. In contrast, batch or analytical systems such as Hadoop or relational OLAP systems provide only near-line or offline queries, and are not typically used to support serving workloads (though the result of a batch computation may be cached into a serving store for low-latency access). Other researchers [26] are doing work to benchmark analytical systems such as these. Similarly, there are existing benchmarks for a variety of data storage systems (such as SQL databases [22] and filesystems [12, 10]). However, the novel interfaces (usually neither SQL nor POSIX), elasticity, and new use cases of cloud serving systems motivate a new benchmark.
We present the Yahoo! Cloud Serving Benchmark (YCSB) framework. We are using this framework to benchmark our own PNUTS system and to compare it to other cloud databases. The framework consists of a workload generating client and a package of standard workloads that cover interesting parts of the performance space (read-heavy workloads, write-heavy workloads, scan workloads, etc.). An important aspect of the YCSB framework is its extensibility: the workload generator makes it easy to define new workload types, and it is also straightforward to adapt the client to benchmark new data serving systems. The YCSB framework and workloads are available in open source so that developers can use it to evaluate systems, and contribute new workload packages that model interesting applications.¹ In this paper, we describe the YCSB benchmark, and report performance results for four systems: Cassandra, HBase, PNUTS, and a simple sharded MySQL implementation. Although our focus in this paper is on performance and elasticity, the framework is intended to serve as a tool for evaluating other aspects of cloud systems such as availability and replication, and we discuss approaches to extending it for these purposes.
The paper is organized as follows. Section 2 provides an overview of cloud data serving systems. Section 3 discusses benchmark tiers for performance and scaling. Section 4 discusses the core workloads of the benchmark in detail, while Section 5 examines the architecture and extensibility of the YCSB tool. Section 6 presents benchmarking results on several systems. We propose future benchmark tiers covering availability and replication in Section 7. Section 8 examines related work, and Section 9 presents our conclusions.
2. CLOUD SYSTEM OVERVIEW
2.1 Cloud Serving System Characteristics
Cloud serving systems share common goals, despite their different architectures and design decisions. In general, these systems aim for:

• Scale-out: To support huge datasets (multiple terabytes or petabytes) and very high request rates, cloud systems are architected to scale out, so that large scale is achieved using large numbers of commodity servers, each running copies of the database software. An effective scale-out system must balance load across servers and avoid bottlenecks.

• Elasticity: While scale-out provides the ability to have large systems, elasticity means that we can add more capacity to a running system by deploying new instances of each component, and shifting load to them.¹

• High availability: Cloud systems must provide high levels of availability. In particular, they are often multitenant systems, which means that an outage affects many different applications. Moreover, the use of commodity hardware means that failures are relatively common, and automated recovery must be a first-class operation of the system.

¹YCSB can be obtained from http://research.yahoo.com/Web_Information_Management/YCSB.
The main motivation for developing new cloud serving systems is the difficulty in providing all of these features (especially scale-out and elasticity) using traditional database systems. As a tradeoff, cloud systems typically sacrifice the complex query capabilities and strong, sophisticated transaction models found in traditional systems. Without the need for complex planning and processing of joins and aggregates, scale-out and elasticity become significantly easier to achieve. Similarly, scale-out (especially to multiple datacenters) is easier to achieve without strong transaction protocols like two-phase commit or Paxos. In particular, it is impossible to simultaneously guarantee availability, consistency and partition tolerance [21]. Because network partitions (or delays and failures which mimic partitions) are unavoidable, systems must prioritize either availability or consistency, and most cloud systems choose availability. As a result, cloud systems typically provide a consistency model that is weaker in various ways than traditional ACID databases.
2.2 Classification of Systems and Tradeoffs
We now examine the different architectural decisions made by cloud systems. As with many types of computer systems, no one system can be best for all workloads, and different systems make different tradeoffs in order to optimize for different applications. The main tradeoffs facing cloud serving systems are:
• Read performance versus write performance: In a serving system, it is difficult to predict which record will be read or written next. Unless all data fits in memory, this means that random I/O to the disk is needed to serve reads (e.g., as opposed to scans). Random I/O can be used for writes as well, but much higher write throughput can be achieved by appending all updates to a sequential disk-based log. However, log-structured systems that only store update deltas can be very inefficient for reads if the data is modified over time, as typically multiple updates from different parts of the log must be merged to provide a consistent record. Writing the complete record to the log on each update avoids the cost of reconstruction at read time, but there is a correspondingly higher cost on update. Log-structured merge trees [29] avoid the cost of reconstructing on reads by using a background process to merge updates and cluster records by primary key, but the disk cost of this process can reduce performance for other operations. Overall, then, there is an inherent tradeoff between optimizing for reads and optimizing for writes.

A particular case of this tradeoff is seen in systems such as HBase that are based on filesystems optimized for batch processing (for example, HBase is built on HDFS, which is the data store for Hadoop). For a system to excel at batch processing, it requires high throughput sequential reads and writes, rather than fast random accesses; thus, Hadoop only supports append-only files. Updates to existing records must be handled by using a differential files scheme that shares the same disadvantages as a log-structured file system with respect to reads.
• Latency versus durability: Writes may be synched to disk before the system returns success to the user, or they may be stored in memory at write time and synched later. The advantages of the latter approach are that avoiding disk greatly improves write latency, and potentially improves throughput (if multiple writes to the same record can be serviced by a single I/O operation or can be condensed in memory). The disadvantage is risk of data loss if a server crashes and loses unsynched updates.

• Synchronous versus asynchronous replication: Replication is used to improve system availability (by directing traffic to a replica after a failure), avoid data loss (by recovering lost data from a replica), and improve performance (by spreading load across multiple replicas and by making low-latency access available to users around the world). However, there are different approaches to replication. Synchronous replication ensures all copies are up to date, but potentially incurs high latency on updates. Furthermore, availability may be impacted if synchronously replicated updates cannot complete while some replicas are offline. Asynchronous replication avoids high write latency (in particular, making it suitable for wide area replication) but allows replicas to be stale. Furthermore, data loss may occur if an update is lost due to failure before it can be replicated.
• Data partitioning: Systems may be strictly row-based, or allow for column storage. In row-based storage all of a record's fields are stored contiguously on disk. With column storage, different columns or groups of columns can be stored separately (possibly on different servers). Row-based storage supports efficient access to an entire record (including low latency reads and insertion/update in a serving-oriented system), and is ideal if we typically access a few records in their entirety. Column-based storage is more efficient for accessing a subset of the columns, particularly when multiple records are accessed.
2.3 A Brief Survey of Cloud Data Systems
To illustrate these tradeoffs, Table 1 lists several cloud systems and the choices they have made for each dimension. We now examine some of these decisions.
An application developer must match their workload requirements to the best suited cloud database system. Consider the read-optimized versus write-optimized tradeoff. BigTable-like systems such as Cassandra and HBase attempt to always perform sequential I/O for updates. Records on disk are never overwritten; instead, updates are written to a buffer in memory, and the entire buffer is written sequentially to disk. Multiple updates to the same record may be flushed at different times to different parts of the disk. The result is that to perform a read of a record, multiple I/Os are needed to retrieve and combine the various updates. Since all writing is sequential, it is very fast; but reads are correspondingly de-optimized. In contrast, a more traditional buffer-pool architecture, such as that in MySQL and PNUTS, overwrites records when they are updated. Because updates require random I/O, they are slower than the BigTable-like systems; but reads are fast because a single I/O can retrieve the entire, up-to-date record.
Latency versus durability is another important axis. If developers know they can lose a small fraction of writes (for example, web poll votes), they can return success to writes without waiting for them to be synched to disk. Cassandra allows the client to specify on a per-call basis whether the write is durably persisted. MySQL and PNUTS always force log updates to disk when committing a transaction, although this log force can be disabled. HBase does not sync log updates to disk, which provides low latency updates and high throughput. This is appropriate for HBase's target use cases, which are primarily to run batch analytics over serving data, rather than to provide guaranteed durability for such data. For such a system, high throughput sequential reads and writes are favored over durability for random updates. Synchronous replication ensures freshness of replicas, and is used in HBase and Cassandra. Cassandra also supports asynchronous replication, as do MySQL and PNUTS. Asynchronous replication supports wide-area replication without adding significant overhead to the update call itself.
Column storage is advantageous for applications that need only access a subset of columns with each request, and know these subsets in advance. BigTable, HBase, and Cassandra all provide the ability to declare column groups or families, and add columns to any of them. Each group/family is physically stored separately. On the other hand, if requests typically want the entire row, or arbitrary subsets of it, partitioning that keeps the entire row physically together is best. This can be done with row storage (as in PNUTS), or by using a single column group/family in a column store like Cassandra.
The systems we discuss here are representative, rather than comprehensive. A variety of other systems make different decisions. While we cannot survey them all here, we will point out a few interesting characteristics. Dynamo, Voldemort and Cassandra use eventual consistency to balance replication and availability. In this model, writes are allowed anywhere, and conflicting writes to the same object are resolved later. Amazon SimpleDB and Microsoft Azure are hosted cloud serving stores. Both provide transactional functions not found in other serving stores. The caveat is that the user must partition their data into different containers, both in terms of size and request rate. SimpleDB calls these containers domains, while Azure calls them databases. The wide variance in design decisions has significant performance implications, which motivated us to develop a benchmark for quantitatively evaluating those implications.
3. BENCHMARK TIERS
In this section, we propose two benchmark tiers for evaluating the performance and scalability of cloud serving systems. Our intention is to extend the benchmark to deal with more tiers for availability and replication, and we discuss our initial ideas for doing so in Section 7.
3.1 Tier 1—Performance
The Performance tier of the benchmark focuses on the latency of requests when the database is under load. Latency is very important in serving systems, since there is usually an impatient human waiting for a web page to load. However, there is an inherent tradeoff between latency and throughput: on a given hardware setup, as the amount of load increases, the latency of individual requests increases as well since there is more contention for disk, CPU, network, and so on. Typically application designers must decide on an acceptable latency, and provision enough servers to achieve the desired throughput while preserving acceptable latency. A system with better performance will achieve the desired latency and throughput with fewer servers.

System          Read/Write optimized   Latency/durability   Sync/async replication   Row/column
PNUTS           Read                   Durability           Async                    Row
BigTable        Write                  Durability           Sync                     Column
HBase           Write                  Latency              Async                    Column
Cassandra       Write                  Tunable              Tunable                  Column
Sharded MySQL   Read                   Tunable              Async                    Row

Table 1: Design decisions of various systems.
The Performance tier of the benchmark aims to characterize this tradeoff for each database system by measuring latency as we increase throughput, until the point at which the database system is saturated and throughput stops increasing. In the terminology of the Wisconsin Benchmark, a popular early benchmark used for parallel database systems, our metric is similar to sizeup [19], where the hardware is kept constant but the size of the workload increases.
To conduct this benchmark tier, we need a workload generator which serves two purposes: first, to define the dataset and load it into the database; and second, to execute operations against the dataset while measuring performance. We have implemented the YCSB Client (described in more detail in Sections 4 and 5) for both purposes. A set of parameter files defines the nature of the dataset and the operations (transactions) performed against the data. The YCSB Client allows the user to define the offered throughput as a command line parameter, and reports the resulting latency, making it straightforward to produce latency versus throughput curves.
3.2 Tier 2—Scaling
A key aspect of cloud systems is their ability to scale elastically, so that they can handle more load as applications add features and grow in popularity. The Scaling tier of the benchmark examines the impact on performance as more machines are added to the system. There are two metrics to measure in this tier:
Scaleup—How does the database perform as the number of machines increases? In this case, we load a given number of servers with data and run the workload. Then, we delete the data, add more servers, load a larger amount of data on the larger cluster, and run the workload again. If the database system has good scaleup properties, the performance (e.g., latency) should remain constant, as the number of servers, amount of data, and offered throughput scale proportionally. This is equivalent to the scaleup metric from [19].
Elastic speedup—How does the database perform as the number of machines increases while the system is running? In this case, we load a given number of servers with data and run the workload. As the workload is running, we add one or more servers, and observe the impact on performance. A system that offers good elasticity should show a performance improvement when the new servers are added, with a short or non-existent period of disruption while the system is reconfiguring itself to use the new server. This is similar to the speedup metric from [19], with the added twist that the new server is added while the workload is running.
4. BENCHMARK WORKLOADS
We have developed a core set of workloads to evaluate different aspects of a system's performance, called the YCSB Core Package. In our framework, a package is a collection of related workloads. Each workload represents a particular mix of read/write operations, data sizes, request distributions, and so on, and can be used to evaluate systems at one particular point in the performance space. A package, which includes multiple workloads, examines a broader slice of the performance space. While the core package examines several interesting performance axes, we have not attempted to exhaustively examine the entire performance space. Users of YCSB can develop their own packages either by defining a new set of workload parameters, or if necessary by writing Java code. We hope to foster an open-source effort to create and maintain a set of packages that are representative of many different application settings through the YCSB open source distribution. The process of defining new packages is discussed in Section 5.
To develop the core package, we examined a variety of systems and applications to identify the fundamental kinds of workloads web applications place on cloud data systems. We did not attempt to exactly model a particular application or set of applications, as is done in benchmarks like TPC-C. Such benchmarks give realistic performance results for a narrow set of use cases. In contrast, our goal was to examine a wide range of workload characteristics, in order to understand in which portions of the space of workloads systems performed well or poorly. For example, some systems may be highly optimized for reads but not for writes, or for inserts but not updates, or for scans but not for point lookups. The workloads in the core package were chosen to explore these tradeoffs directly.
The workloads in the core package are a variation of the same basic application type. In this application, there is a table of records, each with F fields. Each record is identified by a primary key, which is a string like "user234123". Each field is named field0, field1 and so on. The values of each field are a random string of ASCII characters of length L. For example, in the results reported in this paper, we construct 1,000 byte records by using F = 10 fields, each of L = 100 bytes.
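As a concrete sketch of this data model (illustrative only; the class and method names here are our own, not part of the YCSB tool), a record generator might look like:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch of the core package's data model: keys like "user234123" and
// F fields named field0..field{F-1}, each an L-byte random ASCII string.
public class RecordSketch {
    static final int F = 10;   // fields per record
    static final int L = 100;  // bytes per field value

    static String key(long id) {
        return "user" + id;
    }

    static String randomAscii(Random rng, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) (' ' + rng.nextInt(95))); // printable ASCII range
        }
        return sb.toString();
    }

    static Map<String, String> buildRecord(Random rng) {
        Map<String, String> record = new HashMap<>();
        for (int f = 0; f < F; f++) {
            record.put("field" + f, randomAscii(rng, L));
        }
        return record;
    }

    public static void main(String[] args) {
        Map<String, String> r = buildRecord(new Random(42));
        int bytes = 0;
        for (String v : r.values()) bytes += v.length();
        // F = 10 fields of L = 100 bytes gives 1,000 bytes of field data
        System.out.println(key(234123) + ": " + r.size() + " fields, " + bytes + " bytes");
    }
}
```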
Each operation against the data store is randomly chosen to be one of:

• Insert: Insert a new record.

• Update: Update a record by replacing the value of one field.

• Read: Read a record, either one randomly chosen field or all fields.

• Scan: Scan records in order, starting at a randomly chosen record key. The number of records to scan is randomly chosen.
For scans specifically, the distribution of scan lengths is chosen as part of the workload. Thus, the scan() method takes an initial key and the number of records to scan. Of course, a real application may instead specify a scan interval (i.e., from February 1st to February 15th). The number of records parameter allows us to control the size of these intervals, without having to determine and specify meaningful endpoints for the scan. (All of the database calls, including scan(), are described in Section 5.2.1.)
4.1 Distributions
The workload client must make many random choices when generating load: which operation to perform (Insert, Update, Read or Scan), which record to read or write, how many records to scan, and so on. These decisions are governed by random distributions. YCSB has several built-in distributions:
• Uniform: Choose an item uniformly at random. For example, when choosing a record, all records in the database are equally likely to be chosen.

• Zipfian: Choose an item according to the Zipfian distribution. For example, when choosing a record, some records will be extremely popular (the head of the distribution) while most records will be unpopular (the tail).

• Latest: Like the Zipfian distribution, except that the most recently inserted records are in the head of the distribution.

• Multinomial: Probabilities for each item can be specified. For example, we might assign a probability of 0.95 to the Read operation, a probability of 0.05 to the Update operation, and a probability of 0 to Scan and Insert. The result would be a read-heavy workload.
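The multinomial choice of operations can be sketched as follows (a simplified illustration of the technique, not YCSB's actual generator class): build a cumulative probability table and invert one uniform draw against it.

```java
import java.util.Random;

// Sketch: pick the next operation according to specified probabilities
// (a multinomial distribution), e.g. 95% Read, 5% Update.
public class OperationChooser {
    private final String[] ops;
    private final double[] cumulative; // running sum of probabilities
    private final Random rng;

    public OperationChooser(String[] ops, double[] probs, Random rng) {
        this.ops = ops;
        this.cumulative = new double[probs.length];
        double sum = 0;
        for (int i = 0; i < probs.length; i++) {
            sum += probs[i];
            cumulative[i] = sum;
        }
        this.rng = rng;
    }

    public String next() {
        double r = rng.nextDouble(); // uniform in [0, 1)
        for (int i = 0; i < cumulative.length; i++) {
            if (r < cumulative[i]) return ops[i];
        }
        return ops[ops.length - 1]; // guard against floating-point rounding
    }

    public static void main(String[] args) {
        OperationChooser c = new OperationChooser(
                new String[] {"READ", "UPDATE", "SCAN", "INSERT"},
                new double[] {0.95, 0.05, 0.0, 0.0},
                new Random(1));
        int reads = 0;
        for (int i = 0; i < 10000; i++) {
            if (c.next().equals("READ")) reads++;
        }
        System.out.println(reads + " of 10000 operations were reads");
    }
}
```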
Figure 1 illustrates the difference between the uniform, zipfian and latest distributions. The horizontal axes in the figure represent the items that may be chosen (e.g., records) in order of insertion, while the vertical bars represent the probability that the item is chosen. Note that the last inserted item may not be inserted at the end of the keyspace. For example, Twitter status updates might be clustered by user, rather than by timestamp, meaning that two recently inserted items may be far apart in the keyspace.
A key difference between the Latest and Zipfian distributions is their behavior when new items are inserted. Under the Latest distribution, the newly inserted item becomes the most popular, while the previously popular items become less so. Under the Zipfian distribution, items retain their popularity even as new items are inserted, whether or not the newly inserted item is popular. The Latest distribution is meant to model applications where recency matters; for example, only recent blog posts or news stories are popular, and the popularity decays quickly. In contrast, the Zipfian distribution models items whose popularity is independent of their newness; a particular user might be extremely popular, with many views of her profile page, even though she joined many years ago.

[Figure 1: Probability distributions (Uniform, Zipfian, Latest). Horizontal axes represent items in order of insertion, and vertical axes represent probability of being chosen.]
4.2 The Workloads
We defined the workloads in the core package by assigning different distributions to the two main choices we must make: which operation to perform, and which record to read or write. The various combinations are shown in Table 2. Although we do not attempt to model complex applications precisely (as discussed above), we list a sample application that generally has the characteristics of the workload.
Loading the database is likely to take longer than any individual experiment. In our tests, loads took between 10–20 hours (depending on the database system), while we ran each experiment (e.g., a particular workload at a particular target throughput against a particular database) for 30 minutes. All the core package workloads use the same dataset, so it is possible to load the database once and then run all the workloads. However, workloads A and B modify records, and D and E insert records. If database writes are likely to impact the operation of other workloads (e.g., by fragmenting the on-disk representation) it may be necessary to re-load the database. We do not prescribe a particular database loading strategy in our benchmark, since different database systems have different loading mechanisms (including some that have no special bulk load facility at all).
5. DETAILS OF THE BENCHMARK TOOL
We have developed a tool, called the YCSB Client, to execute the YCSB benchmarks. A key design goal of our tool is extensibility, so that it can be used to benchmark new cloud database systems, and so that new workloads can be developed. We have used this tool to measure the performance of several cloud systems, as we report in the next section. This tool is also available under an open source license, so that others may use and extend the tool, and contribute new workloads and database interfaces.
Workload         Operations               Record selection   Application example
A—Update heavy   Read: 50%, Update: 50%   Zipfian            Session store recording recent actions in a user session
B—Read heavy     Read: 95%, Update: 5%    Zipfian            Photo tagging; add a tag is an update, but most operations are to read tags
C—Read only      Read: 100%               Zipfian            User profile cache, where profiles are constructed elsewhere (e.g., Hadoop)
D—Read latest    Read: 95%, Insert: 5%    Latest             User status updates; people want to read the latest statuses
E—Short ranges   Scan: 95%, Insert: 5%    Zipfian/Uniform*   Threaded conversations, where each scan is for the posts in a given thread (assumed to be clustered by thread id)

*Workload E uses the Zipfian distribution to choose the first key in the range, and the Uniform distribution to choose the number of records to scan.

Table 2: Workloads in the core package
[Figure 2: YCSB client architecture. A workload parameter file (read/write mix, record size, popularity distribution, etc.) and command line properties (DB to use, target throughput, number of threads, etc.) drive the YCSB Client, whose workload executor and client threads issue operations against the cloud serving store through a database interface layer and report measurements to a stats module.]
In this section we describe the architecture of the YCSB client, and examine how it can be extended. We also describe some of the complexities in producing distributions for the workloads.
5.1 Architecture
The YCSB Client is a Java program for generating the data to be loaded to the database, and generating the operations which make up the workload. The architecture of the client is shown in Figure 2. The basic operation is that the workload executor drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer, both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. The threads also measure the latency and achieved throughput of their operations, and report these measurements to the statistics module. At the end of the experiment, the statistics module aggregates the measurements and reports average, 95th and 99th percentile latencies, and either a histogram or time series of the latencies.
The client takes a series of properties (name/value pairs) which define its operation. By convention, we divide these properties into two groups:
• Workload properties: Properties defining the workload, independent of a given database or experimental run. For example, the read/write mix of the database, the distribution to use (zipfian, latest, etc.), and the size and number of fields in a record.

• Runtime properties: Properties specific to a given experiment. For example, the database interface layer to use (e.g., Cassandra, HBase, etc.), properties used to initialize that layer (such as the database service hostnames), the number of client threads, etc.
Thus, there can be workload property files which remain static and are used to benchmark a variety of databases (such as the YCSB core package described in Section 4). In contrast, runtime properties, while also potentially stored in property files, will vary from experiment to experiment, as the database, target throughput, etc., change.
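For example, the two groups might look like this (the property names below are illustrative of the convention, not necessarily the exact names used by the tool):

```
# Workload properties: static, reusable across databases
readproportion=0.95
updateproportion=0.05
requestdistribution=zipfian
fieldcount=10
fieldlength=100

# Runtime properties: vary per experiment (often given on the command line)
db=com.yahoo.ycsb.CassandraClient
hosts=server1,server2
threadcount=50
target=1000
```

The split means one workload file can be checked in once and replayed unchanged against Cassandra, HBase, PNUTS, and MySQL, with only the runtime properties differing between runs.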
5.2 Extensibility
A primary goal of YCSB is extensibility. In fact, one of our motivations was to make it easy for developers to benchmark the increasing variety of cloud serving systems. The shaded boxes in Figure 2 show the components which can be easily replaced. The Workload Executor contains code to execute both the load and transaction phases of the workload. The YCSB package includes CoreWorkload, a standard workload executor for the core package described in Section 4. Users of YCSB can define new packages in two ways. The most straightforward is to define a set of workloads that use CoreWorkload but define different workload parameters. This allows users to vary several axes of the core package: which operation to perform, the skew in record popularity, and the size and number of records. The second approach is to define a new workload executor class (e.g., by writing Java) and associated parameters. This approach allows for introducing more complex operations, and exploring different tradeoffs, than the core package does; but involves greater effort compared to the former approach.

The Database Interface Layer translates simple requests (such as read()) from the client threads into calls against the database (such as Thrift calls to Cassandra or REST requests to PNUTS). The Workload Executor and Database Interface Layer classes to use for an experiment are specified as properties, and those classes are loaded dynamically when the client starts. Of course, as an open source package, any class in the YCSB tool can be replaced, but the Workload Executor and Database Interface Layer can be replaced most easily. We now discuss in more detail how the YCSB client can be extended with new database backends and workloads.
5.2.1 New Database Backends
The YCSB Client can be used to benchmark new database systems by writing a new class to implement the following methods:
• read()—read a single record from the database, and return either a specified set of fields or all fields.

• update()—update a single record in the database, adding or replacing the specified fields.

• insert()—insert a single record into the database.

• delete()—delete a single record in the database.

• scan()—execute a range scan, starting at a given record key, reading a specified number of records.

These operations are quite simple, representing the standard "CRUD" operations: Create, Read, Update, Delete, with Read operations to read one record or to scan records. Despite its simplicity, this API maps well to the native APIs of many of the cloud serving systems we examined.
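A new backend therefore amounts to one class providing these five calls. As a minimal illustration (an in-memory stand-in with simplified signatures, not one of the real bindings and not the tool's exact API), such a class might look like:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.Vector;

// Sketch of a YCSB-style database binding: the five CRUD-plus-scan calls,
// backed by an in-memory sorted map. A real binding would translate these
// into Thrift calls (Cassandra), REST requests (PNUTS), etc.
public class InMemoryDB {
    private final TreeMap<String, Map<String, String>> table = new TreeMap<>();

    // Return the requested fields of a record, or all fields if fields == null.
    public Map<String, String> read(String key, Set<String> fields) {
        Map<String, String> row = table.get(key);
        if (row == null || fields == null) return row;
        Map<String, String> result = new HashMap<>();
        for (String f : fields) result.put(f, row.get(f));
        return result;
    }

    // Add or replace the specified fields of an existing record.
    public void update(String key, Map<String, String> values) {
        table.computeIfAbsent(key, k -> new HashMap<>()).putAll(values);
    }

    public void insert(String key, Map<String, String> values) {
        table.put(key, new HashMap<>(values));
    }

    public void delete(String key) {
        table.remove(key);
    }

    // Range scan: read up to recordCount records starting at startKey.
    public Vector<Map<String, String>> scan(String startKey, int recordCount) {
        Vector<Map<String, String>> results = new Vector<>();
        for (Map<String, String> row : table.tailMap(startKey).values()) {
            if (results.size() >= recordCount) break;
            results.add(row);
        }
        return results;
    }
}
```

Note how scan() takes a start key and a record count, matching the workload definition in Section 4 rather than a key interval.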
5.2.2 New Workload Executors
A user can define a new workload executor to replace CoreWorkload by extending the Workload class of the YCSB framework. One instance of the workload object is created and shared among the worker threads, which allows the threads to share common distributions, counters and so on. For example, the workload object can maintain a counter used to generate new unique record ids when inserting records. Similarly the workload object can maintain a common LatestGenerator object, which assigns high popularity to the latest record ids generated by the counter. For each operation, the thread will either execute the workload object's doInsert() method (if the client is in the load phase) or the workload object's doTransaction() method (if the client is in the transaction phase).
5.3 Distributions
One unexpectedly complex aspect of implementing the YCSB tool involved implementing the Zipfian and Latest distributions. In particular, we used the algorithm for generating a Zipfian-distributed sequence from Gray et al. [23]. However, this algorithm had to be modified in several ways to be used in our tool. The first problem is that the popular items are clustered together in the keyspace. In particular, the most popular item is item 0; the second most popular item is item 1, and so on. For the Zipfian distribution, the popular items should be scattered across the keyspace. In real web applications, the most popular user or blog topic is not necessarily the lexicographically first item.
Toscatteritemsacrossthekeyspace,wehashedtheout-putoftheGraygenerator.Thatis,wecalledanextItem()methodtogetthenext(integer)item,thentookahashofthatvaluetoproducethekeythatweuse.Thechoiceofhashfunctioniscritical:theJavabuilt-inString.hashCode()functiontendedtoleavethepopularitemsclustered.Fur-thermore,afterhashing,collisionsmeantthatonlyabout80percentofthekeyspacewouldbegeneratedinthesequence.Thiswastrueevenaswetriedavarietyofhashfunctions(FNV,Jenkins,etc.).Oneapproachwouldbetouseperfecthashing,whichavoidscollisions,withadownsidethatmoresetuptimeisneededtoconstructtheperfecthash(multipleminutesforhundredsofmillionsofrecords)[15].Theap-proachthatwetookwastoconstructaZipfiangeneratorfor
a much larger keyspace than we actually needed; apply the FNV hash to each generated value; and then take mod N (where N is the size of the keyspace, that is, the number of records in the database). The result was that 99.97% of the keyspace is generated, and the generated keys continued to have a Zipfian distribution.
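The hash-and-mod scattering described above can be sketched as follows. This is a simplified version of the Gray et al. [23] generator with standard FNV-1a 64-bit constants; the exact constants, seeds, and the incremental zeta computation used in the real tool may differ, and summing the zeta constant directly as done here is only practical for small keyspaces.

```python
import random

FNV_OFFSET64 = 0xCBF29CE484222325   # FNV-1a 64-bit offset basis
FNV_PRIME64 = 0x100000001B3         # FNV-1a 64-bit prime

def fnv1a_64(value: int) -> int:
    """FNV-1a hash of an integer's 8 little-endian bytes."""
    h = FNV_OFFSET64
    for byte in value.to_bytes(8, "little"):
        h = ((h ^ byte) * FNV_PRIME64) & 0xFFFFFFFFFFFFFFFF
    return h

class ZipfianGenerator:
    """Simplified sketch of the Gray et al. Zipfian sequence generator.
    Item 0 is the most popular, item 1 the next, and so on."""
    def __init__(self, n, theta=0.99, rng=None):
        self.n, self.theta = n, theta
        self.rng = rng or random.Random(42)
        self.zetan = sum(1.0 / i ** theta for i in range(1, n + 1))
        self.zeta2 = 1.0 + 0.5 ** theta
        self.alpha = 1.0 / (1.0 - theta)
        self.eta = ((1 - (2.0 / n) ** (1 - theta)) /
                    (1 - self.zeta2 / self.zetan))

    def next_item(self):
        u = self.rng.random()
        uz = u * self.zetan
        if uz < 1.0:
            return 0                 # most popular item
        if uz < self.zeta2:
            return 1
        return int(self.n * (self.eta * u - self.eta + 1.0) ** self.alpha)

def scrambled_zipfian(gen, n_records):
    """Scatter the clustered popular items across the keyspace:
    hash the drawn item, then take mod N."""
    return fnv1a_64(gen.next_item()) % n_records
```

Drawing the generator's items from a space much larger than N before hashing and taking mod N is what recovers nearly the full keyspace despite hash collisions.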
The second issue was dealing with changing numbers of items in the distribution. For some workloads, new records are inserted into the database. The Zipfian distribution should result in the same records being popular, even after insertions, while in the Latest distribution, popularity should shift to the new keys. For the Latest distribution, we computed a new distribution when a record was inserted; to do this cheaply, we modified the Gray algorithm of [23] to compute its constants incrementally. For the Zipfian distribution, we expanded the initial keyspace to the expected size after inserts. If a dataset had N records, and the workload had T total operations with an expected fraction I of inserts, then we constructed the Zipfian generator to draw from a space of size N + T × I plus an additional factor, since the actual number of inserts depends on the random choice of operations during the workload according to a multinomial distribution. While running the workload, if the generator produced an item which had not been inserted yet, we skipped that value and drew another. Then, the popularity distribution did not shift as new records were inserted.
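The insert-aware sizing and skip-and-redraw logic above can be sketched as follows. The slack term standing in for the paper's unnamed additional factor is an assumption made purely for illustration.

```python
import random

def expanded_keyspace(n_records, total_ops, insert_fraction, slack=0.1):
    """Size the generator's key space as N + T*I plus some slack, since the
    actual number of inserts varies (multinomial choice of operations).
    The 10% slack value is a hypothetical stand-in."""
    expected_inserts = total_ops * insert_fraction
    return int(n_records + expected_inserts * (1 + slack))

def draw_existing(draw, inserted_count):
    """Skip-and-redraw: reject generated keys that have not been inserted
    yet, so the popularity distribution does not shift as records arrive."""
    while True:
        item = draw()
        if item < inserted_count:
            return item

space = expanded_keyspace(n_records=1000, total_ops=5000, insert_fraction=0.05)
rng = random.Random(7)
item = draw_existing(lambda: rng.randrange(space), inserted_count=1000)
```

Here a uniform draw stands in for the Zipfian draw; the rejection step is the same either way.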
6. RESULTS
We present benchmarking results for four systems: Cassandra, HBase, PNUTS and sharded MySQL. While both Cassandra and HBase have a data model similar to that of Google's BigTable [16], their underlying implementations are quite different: HBase's architecture is similar to BigTable's (using synchronous updates to multiple copies of data chunks), while Cassandra's is similar to Dynamo's [18] (e.g., using gossip and eventual consistency). PNUTS has its own data model, and also differs architecturally from the other systems. Our implementation of sharded MySQL (like other implementations we have encountered) does not support elastic growth and data repartitioning. However, it serves well as a control in our experiments, representing a conventional distributed database architecture rather than a cloud-oriented system designed to be elastic. More details of these systems are presented in Section 2.
In our tests, we ran the workloads of the core package described in Section 4, both to measure performance (benchmark tier 1) and to measure scalability and elasticity (benchmark tier 2). Here we report the average latency of requests. The 95th and 99th percentile latencies are not reported, but followed the same trends as average latency. In summary, our results show:
• The hypothesized tradeoffs between read and write optimization are apparent in practice: Cassandra and HBase have higher read latencies on a read heavy workload than PNUTS and MySQL, and lower update latencies on a write heavy workload.
• PNUTS and Cassandra scaled well as the number of servers and the workload increased proportionally. HBase's performance was more erratic as the system scaled.
• Cassandra, HBase and PNUTS were able to grow elastically while the workload was executing. However, PNUTS provided the best, most stable latency while elastically repartitioning data.

Figure 3: Workload A — update heavy: (a) read operations, (b) update operations. Throughput in this (and all figures) represents total operations per second, including reads and writes.
It is important to note that the results we report here are for particular versions of systems that are undergoing continuous development, and the performance may change and improve in the future. Even during the interval from the initial submission of this paper to the camera ready version, both HBase and Cassandra released new versions that significantly improved the throughput they could support. We provide results primarily to illustrate the tradeoffs between systems and demonstrate the value of the YCSB tool in benchmarking systems. This value is both to users and developers of cloud serving systems: for example, while trying to understand one of our benchmarking results, the HBase developers uncovered a bug and, after simple fixes, nearly doubled throughput for some workloads.
6.1 Experimental Setup

For most experiments, we used six server-class machines (dual 64-bit quad core 2.5 GHz Intel Xeon CPUs, 8 GB of RAM, a 6-disk RAID-10 array, and gigabit ethernet) to run each system. We also ran PNUTS on a 47 server cluster to successfully demonstrate that YCSB can be used to benchmark larger systems. PNUTS required two additional machines to serve as a configuration server and router, and HBase required an additional machine called the "master server." These servers were lightly loaded, and the results we report here depend primarily on the capacity of the six storage servers. The YCSB Client ran on a separate 8 core machine. The Client was run with up to 500 threads, depending on the desired offered throughput. We observed in our tests that the client machine was not a bottleneck; in particular, the CPU was almost idle, as most time was spent waiting for the database system to respond.

We ran Cassandra 0.5.0, HBase 0.20.3, and MySQL 5.1.24 (for PNUTS) and 5.1.32 (for sharded MySQL). For one experiment (elastic speedup), we used Cassandra 0.6.0-beta2, at the suggestion of the Cassandra developers. For Cassandra, we used the OrderedPartitioner with node tokens evenly spaced around the key space. Our sharded MySQL implementation used client-side hashing to determine which server a given record should be stored on.

We configured and tuned each system as well as we knew how. In particular, we received extensive tuning assistance from members of the development teams of the Cassandra, HBase and PNUTS systems. For HBase, we allocated 1 GB of heap to Hadoop, and 5 GB to HBase. For PNUTS and sharded MySQL, we allocated 6 GB of RAM to the MySQL buffer pool. For Cassandra, we allocated 3 GB of heap to the JVM, at the suggestion of the Cassandra developers, so the rest of RAM could be used for the Linux filesystem buffer. We disabled replication on each system so that we could benchmark the baseline performance of the system itself. In ongoing work we are examining the impact of replication. For Cassandra, sharded MySQL and PNUTS, all updates were synched to disk before returning to the client. HBase does not sync to disk, but relies on in-memory replication across multiple servers for durability; this increases write throughput and reduces latency, but can result in data loss on failure. We ran HBase experiments with and without client-side buffering; since buffering gave a significant throughput benefit, we mainly report those numbers. Cassandra, and possibly PNUTS and sharded MySQL, may have benefited if we had given them a dedicated log disk. However, to ensure a fair comparison, we configured all systems with a single RAID-10 array and no dedicated log disk. Users of YCSB are free to set up alternative hardware configurations to see if they can get better performance.

HBase performance is sensitive to the number of log-structured files per key range, and the number of writes buffered in memory. HBase shrinks these numbers using compactions and flushes, respectively, which can be system- or user-initiated. We periodically applied these operations during our experiments, but HBase users must evaluate how often such operations are needed in their own environment.

Our database consisted of 120 million 1 KB records, for a total size of 120 GB. Each server thus had an average of 20 GB of data, more than it could cache entirely in RAM. Read operations retrieved an entire record, while updates modified one of the fields (out of ten).
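Client-side hash sharding, as described for the sharded MySQL setup, can be sketched as follows. The hash function and server names are illustrative, not the authors' exact scheme.

```python
import hashlib

# Hypothetical server list; every client computes the same key-to-server
# mapping, so no router process is needed.
SERVERS = ["mysql0", "mysql1", "mysql2", "mysql3", "mysql4", "mysql5"]

def shard_for(key: str) -> str:
    """Map a record key to one of the MySQL shards deterministically."""
    digest = hashlib.md5(key.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(SERVERS)
    return SERVERS[index]
```

Because the mapping is a fixed function of the server count, adding a server changes nearly every key's shard; this is one reason such a setup is inelastic, as noted elsewhere in this section.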
6.2 Workload A — Update Heavy
First, we examined Workload A, which has 50 percent reads and 50 percent updates. Figure 3 shows latency versus throughput curves for each system for both the read and update operations. In each case, we increased the offered throughput until the actual throughput stopped increasing. As the figure shows, for all systems, operation latency increased as offered throughput increased. Cassandra, which is optimized for write-heavy workloads, achieved the best throughput and the lowest latency for reads. At high throughputs, Cassandra's efficient sequential use of disk for updates reduces contention for the disk head, meaning that read latency is lower than for the other systems. PNUTS has slightly higher latency than MySQL, because it has extra logic on top of MySQL for distributed consistency. HBase had very low latency for updates, since updates are buffered in memory. With buffering off, writes are committed to memory on the HBase server, and latency was only 10-50% lower than read latency. Because of efficient sequential updates, Cassandra also provides the best throughput, peaking at 11978 operations/sec, compared to 7904 operations/sec for HBase, 7283 operations/sec for sharded MySQL, and 7448 operations/sec for PNUTS.

Figure 4: Workload B — read heavy: (a) read operations, (b) update operations.
Figure 5: Workload E — short scans.

6.3 Workload B — Read Heavy

Workload B, the read heavy workload, provides a different picture than Workload A. The results for Workload B are shown in Figure 4. As the figure shows, PNUTS and sharded MySQL are now able to provide lower latencies on reads than either Cassandra or HBase. Cassandra and HBase still perform better for writes. The extra disk I/Os being done by Cassandra to assemble records for reading dominate its performance on reads. Note that Cassandra only begins to show higher read latency at high throughputs, indicating that the effects matter primarily when the disk is close to saturation. HBase also has to reconstruct fragments of records from multiple disk pages. However, the read latency is relatively higher because of HBase's log-structured storage implementation. HBase flushes its memtables to disk in separate files, and potentially must search each such file for fragments of the record, even if only some contain relevant fragments. In fact, we observe worse latency and throughput as the number of files grows, and improvement when the number shrinks through compaction. When there is a great deal of fragmentation (for example, after a large number of writes), throughput drops to as low as 4800 operations/sec due to the expense of reconstruction. Future improvements, such as Bloom filters, may cut down such false positive searches.

6.4 Workload E — Short Ranges

We ran Workload E (short ranges) using HBase, Cassandra and PNUTS, with ranges up to 100 records. Our sharded MySQL implementation is a hash table and does not support range scans. The results are shown in Figure 5. As the results show, both HBase and PNUTS can sustain similar maximum throughputs (1519 ops/sec for HBase, and 1440 ops/sec for PNUTS) with roughly equivalent latency. Cassandra performs much worse. Cassandra's range scan support is relatively new in version 0.5.0, and is not yet heavily optimized; future versions may or may not be able to provide better scan latency. HBase and PNUTS have similar scan performance, but only for short scans. We ran an experiment where we varied the average range query from 25 to 800 records. The results (not shown) demonstrate that HBase has better scan performance when the range scans retrieve more than 100 records on average. MySQL's on-disk storage uses a B-tree to support low latency reads, but B-trees inherently have some fragmentation, so the cost of reading pages with empty space increases the cost of a range scan in PNUTS. HBase stores data more compactly on disk, improving performance for large ranges. For example, when the average range scan is for 800 records, HBase's response time is 3.5 times faster than PNUTS's.

6.5 Other Workloads

Next, we ran the other core YCSB workloads. The results of Workload C (read only) were similar to those of the read-heavy Workload B: PNUTS and sharded MySQL achieved the lowest latency and highest throughput for the read operations. Workload D (read latest) also showed similar results to Workload B. Although the popularity of items shifted over time, the dominant effect was that PNUTS and MySQL were most efficient for reads, and Workload D is dominated by read requests. Note that in Workload D the recently inserted records are not necessarily clustered on the
Figure 6: Read performance as cluster size increases.
same server. Such a clustering scheme would not likely be used in practice, as it would result in a severe hotspot on one server while the other servers were underutilized. "Read latest" applications instead typically construct the record keys to cluster records by some other attribute to avoid this hotspot, and we did this as well. This mimics the design of an application like blogging or Twitter, where recent updates are most popular but are likely clustered by user rather than by time. We have omitted the graphs for Workloads C and D for lack of space.
We also ran a "read-modify-write" workload that reflects the frequently used pattern of reading a record, modifying it, and writing the changes back to the database. This workload is similar to Workload A (50/50 read/write ratio) except that the updates are "read-modify-write" rather than blind writes. The results (not shown) showed the same trends as Workload A.
6.6 Scalability
So far, all experiments have used six storage servers. (As mentioned above, we did run one experiment with PNUTS on 47 servers, to verify the scalability of YCSB itself.) However, cloud serving systems are designed to scale out: more load can be handled when more servers are added. We first tested the scaleup capability of Cassandra, HBase and PNUTS by varying the number of storage servers from 2 to 12 (while varying the data size and request rate proportionally). The resulting read latency for Workload B is shown in Figure 6. As the results show, latency is nearly constant for Cassandra and PNUTS, indicating good elastic scaling properties. In contrast, HBase's performance varies a lot as the number of servers increases; in particular, performance is better for larger numbers of servers. A known issue in HBase is that its behavior is erratic for very small clusters (less than three servers).
6.7 Elastic Speedup
We also examined the elastic speedup of Cassandra, HBase and PNUTS. Sharded MySQL is inherently inelastic. In this case, we started two servers, loaded with 120 GB of data. We then added more servers, one at a time, until there were six servers running. After adding each server, we attempted to run long enough to allow the system to stabilize before adding the next server (although in some cases it was difficult to determine if the system had truly stabilized). During the entire run, we set the YCSB Client to offer the same throughput; in particular, the offered throughput was 80
percent of that achievable with 6 servers. This models a situation where we attempt to elastically expand an overloaded cluster to a size where it can handle the offered load, a situation that often occurs in practice.
First, we show a slice of the time series of read latencies for Cassandra in Figure 7(a). In this figure, the first ten minutes represent five servers, after performance has stabilized; then, a sixth server is added. As the figure shows, this results in a sharp increase in latency, as well as a wide variance in the latency. This performance degradation results from moving data to the sixth server; regular serving requests compete for disk and network resources with the data repartitioning process, resulting in high and highly variable latency. In fact, under load, it takes many hours for Cassandra to stabilize. In our test, we had to stop the YCSB workload after 5.5 hours to allow the system to complete its repartitioning and quiesce. The high cost of repartitioning is a known issue with Cassandra 0.5.0 and is being optimized in ongoing development. After completing its repartitioning, the performance of the system under load matched that of a system that had started with 6 servers, indicating that eventually, the elastic expansion of the cluster will result in good performance.

Results for HBase are shown in Figure 7(b). As before, this figure represents just one slice of the total experiment. As the figure shows, the read latency spikes initially after the sixth server is added, before the latency stabilizes at a value slightly lower than the latency for five servers. This result indicates that HBase is able to shift read and write load to the new server, resulting in lower latency. HBase does not move existing data to the new server until compactions occur.² The result is less latency variance compared to Cassandra, since there is no repartitioning process competing for the disk. However, the new server is underutilized, since existing data is served off the old servers.
A similar slice of the time series (adding a sixth server at time=10 minutes) is shown for PNUTS in Figure 7(c). PNUTS also moves data to the new server, resulting in higher latency after the sixth server is added, as well as latency variability. However, PNUTS is able to stabilize more quickly, as its repartitioning scheme is more optimized. After stabilizing at time=80 minutes, the read latency is comparable to that of a cluster that had started with six servers, indicating that PNUTS provides good elastic speedup.
7. FUTURE WORK
In addition to performance comparisons, it is important to examine other aspects of cloud serving systems. In this section, we propose two more benchmark tiers which we are developing in ongoing work.
7.1 Tier 3 — Availability
²It is possible to run the HDFS load balancer to force data to the new servers, but this greatly disrupts HBase's ability to serve data partitions from the same servers on which they are stored.

Figure 7: Elastic speedup: Time series showing impact of adding servers online. (a) Cassandra, (b) HBase, (c) PNUTS.

A cloud database must be highly available despite failures, and the Availability tier measures the impact of failures on the system. The simplest way to measure availability is to start a workload, kill a server while the workload is running, and observe any resulting errors and performance impact. However, in a real deployed system, a variety of failures can occur at various other levels, including the disk, the network,
and the whole datacenter. These failures may be the result of hardware failure (for example, a faulty network interface card), power outage, software faults, and so on. A proper availability benchmark would cause (or simulate) a variety of different faults and examine their impact.
At least two difficulties arise when benchmarking availability. First, injecting faults is not straightforward. While it is easy to ssh to a server and kill the database process, it is more difficult to cause a network fault, for example. One approach is to allow each database request to carry a special fault flag that causes the system to simulate a fault. This approach is particularly amenable to benchmarking because the workload generator can add the appropriate flag for different tests to measure the impact of different faults. A "fault-injection" header can be inserted into PNUTS requests, but to our knowledge, a similar mechanism is not currently available in the other systems we benchmarked. An alternate approach to fault injection that works well at the network layer is to use a system like ModelNet [34] or Emulab [33] to simulate the network layer and insert the appropriate faults.
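The fault-flag idea above can be sketched as follows. The flag names and request shape here are invented for illustration; the paper only states that PNUTS supports a "fault-injection" header, not its format.

```python
# Hypothetical server-side handling of a per-request fault flag: the
# workload generator tags a request, and a cooperating server simulates
# the named fault instead of serving the request normally.
class SimulatedFault(Exception):
    pass

def handle_request(request, store):
    fault = request.get("fault_flag")
    if fault == "timeout":
        raise SimulatedFault("simulated network timeout")
    if fault == "disk_error":
        raise SimulatedFault("simulated disk read error")
    return store.get(request["key"])
```

The benchmarking appeal is that the generator controls exactly which requests experience which fault, so the measured impact can be attributed cleanly.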
The second difficulty is that different systems have different components, and therefore different, unique failure modes. Thus, it is difficult to design failure scenarios that cover all systems.
Despite these difficulties, evaluating the availability of systems is important, and we are continuing to work on developing benchmarking approaches.
7.2 Tier 4 — Replication

Cloud systems use replication for both availability and performance. Replicas can be used for failover, and to spread out read load. Some systems are also designed to split write requests across replicas to improve performance, although a consistency mechanism (such as eventual consistency in Cassandra, or timeline consistency in PNUTS) is needed to avoid corruption due to write-write conflicts.

Replication may be synchronous or asynchronous. HBase, for example, writes synchronously to multiple replicas, while PNUTS performs replication asynchronously. Thus, the following measures are important when evaluating replication:

• Performance cost or benefit — what is the impact on performance as we increase the replication factor on a constant amount of hardware? The extra work to maintain replicas may hurt performance, but the extra replicas can potentially be used to spread out load, improving performance.

• Availability cost or benefit — what is the impact on availability as we increase the replication factor on a constant amount of hardware? In some systems, the read availability may increase but the write availability may decrease if all replicas must be live to commit an update.

• Freshness — are replicas consistent, or are some replicas stale? How much of the data is stale, and how stale is it? This is primarily an issue for systems that use asynchronous replication.

• Wide area performance — how does replication perform between datacenters in geographically separate locations? Some replication schemes are optimized to be used within a datacenter or between nearby datacenters, while others work well in globally distributed datacenters.

8. RELATED WORK

Benchmarking

Benchmarking is widely used for evaluating computer systems, and benchmarks exist for a variety of levels of abstraction, from the CPU, to the database software, to complete enterprise systems. Our work is most closely related to database benchmarks. Gray surveyed popular database benchmarks, such as the TPC benchmarks and the Wisconsin benchmark, in [22]. He also identified four criteria for a successful benchmark: relevance to an application domain, portability to allow benchmarking of different systems, scalability to support benchmarking large systems, and simplicity so the results are understandable. We have aimed to satisfy these criteria by developing a benchmark that is relevant to serving systems, portable to different backends through our extensibility framework, scalable to realistic data sizes, and employing simple transaction types.

Despite the existence of database benchmarks, we felt it was necessary to define a new benchmark for cloud serving systems. First, most cloud systems do not have a SQL interface, and support only a subset of relational operations (usually, just the CRUD operations), so that the complex queries of many existing benchmarks were not applicable. Second, the use cases of cloud systems are often different than traditional database applications, so that narrow domain benchmarks (such as the debit/credit-style benchmarks like TPC-A, or e-commerce benchmarks like TPC-W) may not match the intended usage of the system. Furthermore, our goal was to develop a benchmarking framework that could be used to explore the performance space of different systems, rather than to measure a single performance number representing a particular application. It is for similar reasons that new benchmarks have been developed for other non-traditional database systems (such as XMark for XML systems [28] and Linear Road for stream systems [14]).

Designing an accurate and fair benchmark, and using it to gather accurate results, is non-trivial. Seltzer et al. [30] argue that many micro and macro benchmarks do not effectively model real workloads. One approach they propose (the vector approach) is to measure the performance of system operations, and compute the expected performance for a particular application that uses some specified combination of those operations. Our approach is similar to this, except that we directly measure the performance of a particular combination of operations; this allows us to accurately measure the impact of things like disk or cache contention when the operations are used together. Shivam et al. [31] describe a workbench tool for efficiently running multiple benchmark tests to achieve high confidence results. Their tool interfaces with a workload generator, like the YCSB Client, to execute each run. We are examining the possibility of using their workbench to run our benchmark.
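The vector approach discussed above amounts to a weighted sum over per-operation measurements. A minimal sketch, with invented latency numbers, makes clear what it can and cannot capture: the estimate composes isolated measurements, so interaction effects such as disk or cache contention (which YCSB measures directly by running the combined workload) are invisible to it.

```python
# Vector-style estimate: measure each operation's latency in isolation,
# then combine by an application's operation mix. Latencies are invented
# for illustration.
def expected_latency_ms(op_latency, op_mix):
    """Weighted average latency for an operation mix whose fractions sum to 1."""
    assert abs(sum(op_mix.values()) - 1.0) < 1e-9
    return sum(frac * op_latency[op] for op, frac in op_mix.items())

op_latency = {"read": 8.0, "update": 2.0, "scan": 40.0}  # hypothetical ms
mix_update_heavy = {"read": 0.5, "update": 0.5}          # a Workload A-like mix
estimate = expected_latency_ms(op_latency, mix_update_heavy)
```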
Cloud systems
The term "cloud" has been used for a variety of different kinds of systems and architectures. A special issue of the Data Engineering Bulletin [25] showcased several aspects of data management in the cloud. We have focused on serving systems like PNUTS, Cassandra, and others. In contrast, batch systems provide near-line or offline analysis, but are not appropriate for online serving. Pavlo et al. [26] have benchmarked cloud systems like Hadoop against more traditional relational systems and relational column stores like Vertica/C-Store [32]. Some batch systems might use the same database systems that would be used in a serving environment. For example, HBase can be used both as a serving store and as a storage backend for Hadoop, and is reportedly used this way at StumbleUpon, one of the major developers of HBase [27].
9. CONCLUSIONS
We have presented the Yahoo! Cloud Serving Benchmark. This benchmark is designed to provide tools for apples-to-apples comparison of different serving data stores. One contribution of the benchmark is an extensible workload generator, the YCSB Client, which can be used to load datasets and execute workloads across a variety of data serving systems. Another contribution is the definition of five core workloads, which begin to fill out the space of performance tradeoffs made by these systems. New workloads can be easily created, including generalized workloads to examine system fundamentals, as well as more domain-specific workloads to model particular applications. As an open-source package, the YCSB Client is available for developers to use and extend in order to effectively evaluate cloud systems. We have used this tool to benchmark the performance of four cloud serving systems, and observed that there are clear tradeoffs between read and write performance that result from each system's architectural decisions. These results
highlight the importance of a standard framework for examining system performance, so that developers can select the most appropriate system for their needs.
10. ACKNOWLEDGEMENTS
We would like to thank the system developers who helped us tune the various systems: Jonathan Ellis from Cassandra, Ryan Rawson and Michael Stack from HBase, and the Sherpa Engineering Team in Yahoo! Cloud Computing.
11. REFERENCES
[1] Amazon SimpleDB. http://aws.amazon.com/simpledb/.
[2] Apache Cassandra. http://incubator.apache.org/cassandra/.
[3] Apache CouchDB. http://couchdb.apache.org/.
[4] Apache HBase. http://hadoop.apache.org/hbase/.
[5] Dynomite Framework. http://wiki.github.com/cliffmoon/dynomite/dynomite-framework.
[6] Google App Engine. http://appengine.google.com.
[7] Hypertable. http://www.hypertable.org/.
[8] MongoDB. http://www.mongodb.org/.
[9] Project Voldemort. http://project-voldemort.com/.
[10] Solaris FileBench. http://www.solarisinternals.com/wiki/index.php/FileBench.
[11] SQL Data Services/Azure Services Platform. http://www.microsoft.com/azure/data.mspx.
[12] Storage Performance Council. http://www.storageperformance.org/home.
[13] Yahoo! Query Language. http://developer.yahoo.com/yql/.
[14] A. Arasu et al. Linear Road: a stream data management benchmark. In VLDB, 2004.
[15] F. C. Botelho, D. Belazzougui, and M. Dietzfelbinger. Compress, hash and displace. In Proc. of the 17th European Symposium on Algorithms, 2009.
[16] F. Chang et al. Bigtable: A distributed storage system for structured data. In OSDI, 2006.
[17] B. F. Cooper et al. PNUTS: Yahoo!'s hosted data serving platform. In VLDB, 2008.
[18] G. DeCandia et al. Dynamo: Amazon's highly available key-value store. In SOSP, 2007.
[19] D. J. DeWitt. The Wisconsin Benchmark: Past, present and future. In J. Gray, editor, The Benchmark Handbook. Morgan Kaufmann, 1993.
[20] I. Eure. Looking to the future with Cassandra. http://blog.digg.com/?p=966.
[21] S. Gilbert and N. Lynch. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):51–59, 2002.
[22] J. Gray, editor. The Benchmark Handbook For Database and Transaction Processing Systems. Morgan Kaufmann, 1993.
[23] J. Gray et al. Quickly generating billion-record synthetic databases. In SIGMOD, 1994.
[24] A. Lakshman, P. Malik, and K. Ranganathan. Cassandra: A structured storage system on a P2P network. In SIGMOD, 2008.
[25] B. C. Ooi and S. Parthasarathy. Special issue on data management on cloud computing platforms. IEEE Data Engineering Bulletin, vol. 32, 2009.
[26] A. Pavlo et al. A comparison of approaches to large-scale data analysis. In SIGMOD, 2009.
[27] R. Rawson. HBase intro. In NoSQL Oakland, 2009.
[28] A. Schmidt et al. XMark: A benchmark for XML data management. In VLDB, 2002.
[29] R. Sears, M. Callaghan, and E. Brewer. Rose: Compressed, log-structured replication. In VLDB, 2008.
[30] M. Seltzer, D. Krinsky, K. A. Smith, and X. Zhang. The case for application-specific benchmarking. In Proc. HotOS, 1999.
[31] P. Shivam et al. Cutting corners: Workbench automation for server benchmarking. In Proc. USENIX Annual Technical Conference, 2008.
[32] M. Stonebraker et al. C-store: a column-oriented DBMS. In VLDB, 2005.
[33] B. White et al. An integrated experimental environment for distributed systems and networks. In OSDI, 2002.
[34] K. Yocum et al. Scalability and accuracy in a large-scale network emulator. In OSDI, 2002.