ronment [J]. Machine Tool & Hydraulics,2019,47(6) :98 - 103.An improved Apriori mining algorithm based on Hadoop framework in big data environmentYi ZENG1' , Xiang-zhen ZHOU2(1 Guangxi University Xingjiart College of Science and Liberal Arts Computer and
Information Engineering Department, Nanning 530005, China) (2Chinese Academy of Social Sciences, Beijing 100102, China)Abstract: Aiming at the efficiency of user behavior big data mining under Hadoop framework, an improved Apriori mining algorithm for association rules is proposed in this paper. Firstly, the modeling of item sets classification under Hadoop framework is realized. Then through the analysis of the mining steps of the traditional association rule Apriori algorithm, the generation method of the candidate item set is improved, and the useless information is removed with the flag information, which could effectively reduce the number of transactions and projects, thereby shorten the task processing time. In the specific implementation process, Map Reduce processing is performed on the improved Apriori algorithm flow. Simulation experiments show that the improved Apriori mining algorithm has higher execution efficiency than that of the traditional Apriori algorithm.
Key words: Big data, Hadoop framework, Apriori, litem set classification, Execution efficiency doi:10.3969/j. issn. 1001-3881.2019.06.017 Document code: A CLC number: TP393With the rapid development of computer technology ,various new things derived from the Internet are emerging, such as cloud computing technology [ 1 -5 ]. With the continuous promotion of the above technologies, various large amounts of data information have been generated and are continuously increasing. Big data has become the most popular technology at pres・ ent. Many research institutions and research scholars have already taken big data as the focus of research [6]. The analysis of big data in various IT industries has huge market prospects [7-9]. The most important method in big data analysis is data mining. The essen・ tial goal of data mining is to extract the hidden higher- level knowledge with the required meaning from the massive data.As a main research direction in data mining algorithms ,the association rule problem has gradually become a hot topic in data mining research [ 10 ]. At this stage, the association mles have many algorithms that generate frequent item sets. However, when these algorithms generate frequent item sets, there are many times to connect the database to generate candidate i・ tern sets, which will reduce the execution efficiency and increase the time cost.As one of the three major distributed computing systems ,Hadoop can easily complete the collection of data of different structure types. It can provide a dis・ tributed storage computing environment across computer clusters [ 11-12]. Hadoop has unique advantages in data analysis. The information resources under the big
Received date: 2018 -04 -25Foundation item: Science and Technology Key Project of Henan Province ( 182102110277) , The Basis of Scientific Research Ability for Young Teachers Improve Project, Guangxi (2018 KY0783,2019 KY0960)* Corresponding author: Yi ZENG, Lecturer. E-mail: 19908103@ qq. comAn improved Apriori mining algorithm based on Hadoop framework in big data environment99data environment have open characteristics. In addi・ tion, because of the frequent uploading and download・ ing of big data, it is particularly suitable for managing on the Hadoop platform. And considering the big data throughput problem, the fluency of resource interac・ tion is particularly important in the user behavior data mining process.This paper uses the Hadoop distributed file system to manage the big data in order to achieve item set classification, thereby improving the efficiency of association rules execution. For user behavior big data analysis, this paper proposes an improved Apriori mining algorithm. The proposed algorithm first analy
zes the time weights of user behaviors and behaviors,
and weights the user behaviors according to the weights of the two, so as to realize the modeling of i- tem sets classification under the Hadoop framework. Then by optimizing the orderliness of the item set to determine whether the database connection is needed, and using the flag information to eliminate useless transactions, the number of transactions and items could be effectively reduced, and the processing time will be shortened.1 Item set classification under Hadoop
FrameworkUnder the big data environment, the user behavior statistics are used as the modeling basis to better extract user behavior data and achieve effective big data mining.First, the function F is used to represent the function mapping between user and user behavior. In order to describe the relationship between user and behavior, this paper uses the number of occurrences of the
user as the description object, F represents the number of times the user generates the behavior, suppose i is the user name, j is the behavior category, F(i, j) represent the number of times that the i-th user has taken 7-type actions. For convenience of statistics and analysis, F(i, j) is normalized in formula (1 ).H =- Max(F(i,j))F(i, j) - Min(F(i, -M j)) ()加(F(i, j)) ')
Where, Max ( F (i, j) ) and Min ( F(i, j) ) are the maximum number and the minimum number of j-type behaviors of the /-th user in a certain period time respectively. In order to establish a database of behavior categories for different users and recommend resources
suitable for users, it is necessary to perform weight nalysis on the behavior categories of users. Weight a・ nalysis can clearly show the preference of different users before the different types of behaviors, and more intuitively reflect the user's behavior. In order to reflect the weight value, the concept of entropy in information theory [8] is introduced in this paper, and the entropy of the user? s behavior is calculated. Let the weight be 出,and the proportion of behavior j in all the behaviors of the user is PR-, then the entropy cab culation of behavior j is as follows:Ij =-》log(P(/j)) *P(H,j)
i = l⑵i = lWhere, Hi represents the number of times that i-th user has behavior j, and n represents the total number of all behavior categories of the i-th user. According to the above formula, one could obtain the weight of behavior j, the calculation method is as follows:叫=——1 一厶 (3)/j = lWhere, m represents the total number of all activity categories. From the above equation one can get that the entropy, and the weight of behavior j are inversely proportional to each other, and there is:S WB] = 1
(4)j = lAlthough the user behavior preferences described a- bove can be described in terms of user behavior weights, this method only considers the user' s behavior characteristics in a certain period of time. Accord・ ing to the user behavior weights in the current time period ,resources that are the same as or similar to the behavior may be recommended for the user. However, considering the timeliness of user online learning, the time factor must be taken into account in the classifi・ cation of user behavior [9-10]. The following will analyze the timeliness of user behavior in order to be ble to recommend the most suitable resource service for the current user based on the latest behavior of the user. Suppose the weight of i-th user'sj-class behav・ ior is x, which could be expressed as follows:laat(i. j)
j)WTltj = e t(i)
(5)Where, L(i) is calculated as follows:LG) = TprcscM - first(0
(6)
100Where, last (i, j) represents the time at which user i last generated behavior j, 71 p65432
resent y represents the current system time, and first (i, j) z represents the time at which user i first generated behavior j.Based on the analysis of user behavior and behavioral time weight, user behavior can be weighted according to their weights. Assume that the behavior m generated by user i is S(i ,m) , then the weight estimation is:MRB-j =工 (7)m = 1andu(: 二 二 Min(F(i,m)) (R)''丿 ~ Max(F(i,m)) - Min( F( f ,/n) ) k 7 98
Where, H(i is the normalization of F(i ,m).2 Association rules and related algorithms2. 1 Introduction to association rulesAssociation rules represent some hidden relationship between two things under one rule. The purpose of da・ ta mining algorithms is to discover these hidden relationships. Association rules can be represented by X —where X represents the premise of the association rule and Y represents the follow-up of the association mle. In addition, the association rule algorithm has a definition of support and confidence, and the calcula・ tion formula is as shown in formula (9):Let the set be (兀】,衍,,…,%”),and the estimated parameter is (p, then the likelihood function could be expressed as follows:P(A I B) =
(9)Where, A and B represent different events, respectively.The correspondence between support and confidence can be calculated through probability formula (10).conf( Y I X)二 P(YI X)P(X)The number of XY occurrences The number of X occurrences(10)2. 2 Traditional Apriori algorithmAs well known, the main idea of the typical Apriori algorithm is to use the prior knowledge of frequent i・ tern sets to complete iterative calculations through lay- er・by・layer search [ 13-14]. The general mining steps
Yi ZENG, et al.
for the typical Apriori algorithm to implement association analysis are:(1 ) Set the required minimum support minSup and minimum Confidence minConf ;(2) Connect and scan the data set in turn, determine the number of supports for each item, and select the frequent item set 1 that meets the requirement through the minimum support level set in step 1 ;(3) Random combination of two frequent item sets 1 to generate candidate frequent item sets 2, then connect the datasets in sequence and complete the support degree calculation of candidate frequent item set 2, and finally filter the frequent item set 2 according to step 2;(4) Repeat step 3 until an empty highestrder frequent item set is generated ;(5) Output the result and the algorithm ends.2. 3 Improved apriori algorithm flowThrough the analysis of the above traditional Apriori algorithm steps, it can be seen that the database needs to be connected and scanned during the repeated exe・ cution of step 2 and 3, which will significantly reduce the execution efficiency. Therefore, to reduce the number of database connections, try to generate fewer useless candidate item sets. This paper has improved the process of generating a candidate item set Ck (con・ nection step) , as shown in Table 1.Table 1 Process that produces association rules in parallel1 for each lxLk_};2
Ifor each lyLb 】3
Iif仏[1] = Zy[l] AZJ2]詁[2] AJO 2] = ly[k-2]AZJA-1] C& = c U Cg ;I ;7 else break;II8 end for9 return ^//Returns the set of items generated by the con nection ;An improved Apriori mining algorithm based on Hadoop framework in big data environment1012. 4 Implementation of Apriori algorithm in Hadoop frameworkIn order to implement the proposed improved Aprio・ ri algorithm in the cloud platform Hadoop framework. In this paper, Map Reduce processing is performed to improve the Apriori algorithm flow, as shown in Fig. 1.Fig. 1 improved Apriori algorithm flow based on Map Reduce3 Simulation experimentsIn order to verify the performance of the algorithm, the improved algorithm is compared with the tradition・ al Apriori algorithm. The algorithm was written using the C# development language in Visual Studio2010. In order to simulate the cloud platform operating environment, 10 computing nodes were used. Each node has an Intel i5 processor with a CPU clocked at 3. 2 GHz and 6GB of memory. The database is SQL Server 2000. The version of Hadoop installed on the compute nodes is 1. 1.0, and the JDK version is 1.7.3. 1 Evaluation indexPrecision rate ( Precision) and recall rate ( Recall) are commonly used evaluation indexes of recommendation algorithms for probabilistic statistics. Let user u get the resource set R(u) that is suitable for the user through the recommendation algorithm, and the user' s actual behavior set is /(zz). Then:x I R(u) n Ku) iRecall = —-------------------------ueUX I (11)IM Ix I R(u) n IPrecision = -------------------------ueUX I (12)RM IIn order to evaluate the rationality and validity of the algorithm more accurately, considering the accuracy and recall rate of the algorithm together, the scoring values of Precision and Recall are calculated.2 * Precision * Recall Precision + Recall(13)3. 2 Test resultsThe transaction database used in the experiment is shown in Table 2. The minimum support is set to 20% , so for the 10 transactions in the table, the minimum number of transactions supported is 2.Table 2 Transaction data sheetNo. TID ID listA , ?2,厶,‘4,?5,?6人仏厶,厶,厶A,【2,【3,6 T64【4,【5,【6【2,厶,【4,厶厶,厶10 &4According to formula ( 3 ) , the user' s behavior weights can be calculated, and the weights are estimated to be 0. 33, 0. 35 , 0. 17, and 0. 15 , respective・ ly. Then use formula (8) for normalization to get the user-behavior matrix.Then read the SQL database and save the two-dimensional array to the SQL database, as shown in Table 3. Connect and complete the two-dimensional ar ray scan, get厶】={人,厶,厶,厶,厶,厶1,and set the T6 flag M equal to 1, as shown in Table 4, then it will not be scanned again. Due to limited space, only one of the steps was shown. The remaining steps are 102the same and will not be repeated here.Table 3Transaction data sheet 1 with mark-bitNo. TID M 人 I2 /3 /4 /5 401 11 11 12 T2 0113 r3 01 - 1 -4 r4 01 111 1 1-1--7T-, 0---11 11 111110^10 01Table 4 Transaction data sheet 2 with mark-bitNo. TID M /I I2 I3 /4 A h10 1111 1 1201 -1 1301- 1401 115011 11 - 101 1 1011 101 - 11001Then select 1/2 of the user-behavior set to verify the mining algorithm. The remaining 1/2 set is used as the test standard comparison set. The simulation suits are shown in Fig. 2.Number of recommended behaviors NFig. 2 Comparison of accuracy resultsIn Fig. 2, the abscissa indicates the recommended Yi ZENG, et al.number of behaviors N ; the ordinate indicates the ac・ curacy. The final execution result of the simulation experiment is shown in Table 5. It can be seen from Table 4 that the improved Apriori algorithm reduces the number of linker comparisons and effectively a・ voids the generation of useless transactions (candidate item sets) , thereby it effectively improves the exec tion efficiency.Table 5 Comparison of the algorithm before and after improvementKScan dataConnection comNumber of candidate base sizeparison timessets generatedf8/80/06/628/628/1014/1438/56/48/448/33/32/058/20/00/04 ConclusionFor the big data mining efficiency problem, this paper proposes an improved Apriori mining algorithm. Firstly, the modeling of item sets classification under Hadoop framework is realized. Then through the analysis of the mining steps of the traditional association rule Apriori algorithm, the generation method of the candidate item set is improved, and the useless information is removed with the flag information, which effectively reduces the number of transactions and projects ,thereby shortening the task processing time. In the specific implementation process, Map Reduce processing is performed on the improved Apriori algorithm flow. The experimental results show that the improved Apriori algorithm could effectively reduce the process・ ing time and improve the execution efficiency.References[1 ] WEI L, WEI Y J, GAO C Y. Improvement of Apriori Algo rithm Based on Bigtable and MapReduce [ J ]. Computer Sci ence, 2015, 42(10):208-210.[2] HUANG J, Li M Q, GUO W Q. Research on Improved Apriori Algorithm Based on Hadoop [ J ]. Computer Science, 2017, 44(7): 262-266.[3] YING Y, REN K, LIU Z T. Data Mining Based on Cloud Computing Technology [ J ]. Microelectronics & Computer, An improved Apriori mining algorithm based on Hadoop framework in big data environment2013, 30(2):161-164.[4] LEE Y. Toward scalable internet traffic measurement and a- 103networks [ J ]. Data Mining & Knowledge Discovery, 2013, 27 (27) :294-320.ACHAR[10]nalysis with Hadoop [ J ]. Acm Sigcomm Computer Communication Review, 2012, 43( 1) :5-13.[5] GERONIMO L D, FERRUCCI F, MUROLO A, et al. A A, LAXMAN S, SASTRY P S. A unified view of the apriori-based algorithms for frequent episode discovery [ J ]. Knowledge & Information Systems, 2012, 31(2) :223-250.[11JACHAR A, LAXMAN S, SASTRY P S. A unified view of the apriori-based algorithms for frequent episode discovery [ J ]. Parallel Genetic Algorithm Based on Hadoop MapReduce for the Automatic Generation of JUnit Test Suites [ J ]. Cognitive Computation ,2012, 8(1) :52-68.[6] REN Z, WAN J, SHI W, et al. Workload Analysis, Impli Knowledge & Information Systems, 2012, 31(2) :223-250.[12]MODI C N, PATEL D R, PATEL A, et al. Integrating Signature Apriori based Network Intrusion Detection System (NIDS) in Cloud Computing [J]. Procedi a Technology, 2012, cations ,and Optimization on a Production Hadoop Cluster: A Case Study on Taobao [J]. IEEE Transactions on Services Com puting, 2014, 7(2) :307-321.[7] ADDAIR T G, DODGE D A, WALTER W R, et al. Large- 6(4) :905-912.[13 ] JIAO Y. Research of an Improved Apriori Algorithm in Data Mining Association Rules [ J ]. International Journal of Computer & Communication Engineering, 2013, 2( 1) :25-27.scale seismic signal analysis with Hadoopf J]. Computers & Geosciences ,2014, 66(2): 145-154.[8] KANNOUJIA S, KOSTA A. Hybrid Apriori- An Improvement [ J ]. Gut, 2015, 31(5) :504-8.[14] KHALILI A, SAMI A. SysDetect: A systematic approach to critical state determination for Industrial Intrusion Detection [9] BERLINGERIO M, PINELLI F, CALABRESE F. ABACUS: Apriori-BAsed Community discovery in multidimensional Systems using Apriori algorithm [J]. Journal of Process Control, 2015, 32(11) :154-160.大数据环境下基于Hadoop框架的改进Apriori挖掘算法曾毅「,周湘贞$1 .广西大学行健文理学院理工学部计算机与信息工程系,南宁5300052.中国社会科学院,北京100028摘要:针对Hadoop框架下的用户行为大数据挖掘效率问题,提出了一种改进的关联规则Apriori挖掘算法。该算法首先实 现了 Hadoop框架下的项集分类建模。然后通过传统关联规则Apriori算法的挖掘步骤分析,对候选项目集丝生成方式进 行了改进,并结合标志位信息实现无用事务去除,有效压缩了事务和项目的数量,从而缩短了任务处理时间。在具体实现 过程中对改进Apriori算法流程进行了 Map Reduce处理。仿真实验表明:相比于传统Apriori算法,改进后的Apriori挖掘 算法具有更高的执行效率。关键词:大数据;Hadoop框架;Apriori;项集分类;执行效率^-X 3KK: :XX:3►--«3KX: 9-0« XX»O«: »-'>« »■'« :KX: YI基金项目:河南省科技厅2018年度科技攻关项目(182102110277);广西壮族自治区教育厅2018年度广西高校中青年教师基础能力!J提升项目(2018KY0783);广西壮族自治区教育厅2019年度广西高校中青年教师科研基础能力提升项目(2019KY0960)»本文引用格式:曾毅,周湘贞.大数据环境下基于Hadoop框架的改进Apriori挖掘算法[J].机床与液压,2019,47(6) :98 -103. 因篇幅问题不能全部显示,请点此查看更多更全内容