Abstract
Data mining is an important topic in information processing, and classification is one of its central tasks; it is widely applied in telecommunications, banking, insurance, retail, healthcare and many other industries. Decision tree algorithms are extensively studied and applied in classification because they are fast, accurate, and produce rules that are easy to understand.
Motivated by the practical need to classify automobile quality evaluations, this thesis focuses on the two core technologies of decision trees: tree construction algorithms and tree pruning algorithms. On the construction side, it examines the building process and performance of six typical algorithms, including ID3 and C4.5, and concludes with a comprehensive comparison of their strengths and weaknesses. On the pruning side, it examines the strategies of five pruning algorithms, including CCP and REP, and summarizes the characteristics of each strategy through comparison.
Building on an analysis of the shortcomings of previous work, this thesis proposes a rule post-pruning algorithm based on rule information content and an antecedent-simplification threshold. Guided by the REP pruning strategy, the algorithm uses rule information content and rule confidence to extract highly credible rules from the original rule set, and uses attribute importance together with an antecedent-simplification threshold to simplify rule antecedents. Compared with REP, the proposed algorithm extracts fewer rules, yields rules with higher confidence, achieves higher classification accuracy, is more stable, and can be controlled flexibly, thereby mitigating the problems of noisy data and overfitting.
Finally, a decision tree classifier was implemented in C#, and four groups of experiments led to the following conclusions: (1) in terms of tree size, C4.5 constructs a smaller tree than ID3, and its classification accuracy is slightly less affected by the data; (2) rule post-pruning yields fewer optimized rules than REP pruning while maintaining high classification accuracy; (3) the extracted rules perform best when the rule confidence lies in the interval [0.3, 0.7] and the antecedent-simplification threshold lies in [0.8, 1.4]; (4) the optimized rules meet the needs of automobile quality evaluation and classification.
1. Background and Significance
Classification is a very important task in data mining. Classification and prediction techniques are the subject of extensive research and have been widely applied in many industries such as telecommunications, banking, insurance, retail and healthcare, with a profound impact on future business development and on people's work and daily life. The purpose of classification is to learn a classification function or model that can map a data item in a database to one of a given set of categories. Once the model has been derived automatically from historical data records, it provides a generalized description of the data and can therefore be used to predict future data.
Classification and prediction methods in data mining include decision trees, Bayesian classification, back-propagation classification, classification based on association rule mining, and nearest-neighbor classification, among which decision tree algorithms are the most widely used. A decision tree is a flowchart-like directed graph and a form of knowledge representation. A decision tree algorithm learns a tree from a training set in order to discover potentially valuable information. Decision tree algorithms have the following advantages. First, their complexity is low, so a decision tree can be generated quickly. Second, they are robust to noise: many decision tree algorithms can handle noisy data and missing attribute values. Third, the rules extracted from a decision tree are simple and easy to understand, which facilitates classification and prediction. Fourth, decision tree algorithms are scalable: they can handle both small data sets and massive amounts of data, meeting practical classification and prediction demands. For these reasons, decision tree algorithms play a valuable decision-support role in clinical diagnosis, equity investment, product quality evaluation, company operations and market control, and research on decision tree classification algorithms therefore has high practical value.
Previous studies on decision tree algorithms have produced many results; different algorithms have their own advantages, disadvantages and fields of application. It is therefore necessary to learn the core ideas of decision tree algorithms from this prior work, to master the common construction and pruning processes, and to summarize and compare their characteristics, all of which is important for practical classification and prediction. In this thesis, motivated by the practical needs of automobile quality evaluation, a decision tree classifier is implemented to evaluate and predict automobile quality, and the shortcomings identified in existing methods are addressed by improving the original algorithms, thereby reducing the size of the decision tree and improving classification and prediction accuracy. Both the theoretical study and the practical application deepen our understanding of classification and prediction in data mining.
Decision tree algorithms are widely used classification methods. The first decision tree learning system, a concept learning system, was developed by Hunt et al. in 1966. Among construction algorithms, J. R. Quinlan proposed the famous ID3 algorithm in 1979. Because ID3 can only handle discrete-valued attributes and is sensitive to noisy data, later work adapted the approach to different data characteristics and practical classification needs. In 1984 L. Breiman et al. proposed the CART algorithm, which selects the attribute with the minimum Gini index as the test attribute; compared with ID3, CART generates a binary tree and improves classification accuracy. In 1992 K. Kira and L. Rendell proposed the RELIEF algorithm, based on the interdependence of attributes. In 1993 J. R. Quinlan proposed the C4.5 algorithm, which overcomes the deficiencies of ID3 in application: C4.5 uses the information gain ratio as the attribute selection measure and can handle continuous attributes and missing attribute values, inheriting the advantages of ID3 while expanding the scope of application of decision trees. In 1996 M. Mehta, R. Agrawal et al. proposed the SLIQ algorithm, a fast and scalable method suited to large-scale data, and in the same year J. Shafer, R. Agrawal et al. proposed SPRINT, a scalable, parallel induction decision tree algorithm; both SLIQ and SPRINT generate binary trees, offer good scalability and parallelism, and achieve high classification and prediction accuracy. From 1998 onwards, further widely used construction algorithms such as PUBLIC and RainForest have been proposed.
Among pruning algorithms, L. Breiman and J. Friedman proposed the error-complexity pruning (ECP) algorithm in 1984, together with the first post-pruning algorithm, cost-complexity pruning (CCP); in the same year T. Niblett proposed minimum-error pruning (MEP); in 1987 J. Mingers and J. R. Quinlan proposed critical-value pruning (CVP) and reduced-error pruning (REP), respectively. The pruning algorithms above are mainly post-pruning algorithms, which are also the ones used most often in practice. Traditional pruning operates on the original decision tree itself. In recent years, many researchers have proposed rule pruning methods: after the original tree is built, all rules are extracted, and rough set theory is used to simplify the rule set through attribute equivalence; information theory, fuzzy mathematics and rule information content are then used to measure the importance of rule antecedents and consequents so that the rules themselves can be pruned. These methods achieve good pruning results, produce more streamlined and simplified rules, and classify and predict well.
The main content of this thesis is the core technology of decision tree algorithms, including tree construction and tree pruning. By studying the construction and optimization methods of the various existing decision tree algorithms, a comprehensive summary and comparison is formed so as to grasp the characteristics of each algorithm. At the same time, driven by the practical need for automobile quality evaluation and classification, tree construction and pruning optimization are implemented through programming to provide classification and prediction functions. The main content of this thesis is summarized as follows:
(1) Study the relevant theoretical knowledge of decision tree algorithms;
(2) Study the typical construction and pruning algorithms from previous research, learn about the latest optimizations of decision tree algorithms, and master the process of building and optimizing a decision tree;
(3) Propose an optimization scheme that addresses the deficiencies of previously studied algorithms and formulate it as an algorithm;
(4) Implement a decision tree classifier, carry out experiments, and summarize the experimental and application results.
The structure of the thesis is summarized as follows:
Chapter 1 briefly introduces the background and significance of the topic, the current state of research, and the main content of the thesis.
Chapter 2 presents the theoretical basis of decision trees, describing from a macroscopic perspective the theory involved in decision tree technology and laying the foundation for the specific research on decision tree algorithms that follows.
Chapter 3 studies the core technologies of decision trees and compares their performance. It describes in detail the typical construction and pruning algorithms from previous research, highlighting the construction ideas of ID3 and C4.5, including the related concepts, mathematical calculations, construction processes, and advantages and disadvantages of each algorithm, and concludes with a summary comparison of the typical construction algorithms. For pruning, it focuses on REP and other pruning algorithms and concludes with a summary comparison of the typical pruning algorithms.
Chapter 4 presents an improved and optimized tree pruning algorithm. It summarizes previous results, analyzes their shortcomings, and proposes a more reasonable and effective improvement to enhance the performance of the decision tree algorithm.
Chapter 5 covers the design of the decision tree classifier and the comparative performance tests of the core algorithms. The construction and pruning algorithms are implemented in C#, experimental data are used to compare the core algorithms, and the experimental results are reported; the optimized decision tree rules are then applied to classification and prediction.
Chapter 6 gives the summary and outlook: it summarizes the research work of this thesis, reflects on the research experience, and outlines future work.
2. Definitions and Representation of Decision Trees
A decision tree is a flowchart-like tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root. The purpose of constructing a decision tree is to discover the relationship between attributes and classes and to use it to predict the class of records whose class is unknown; a system with this predictive capability is called a decision tree classifier.
Figure 1, for example, represents the concept buys_computer and predicts whether a customer is likely to buy a computer. Internal nodes are drawn as rectangles and leaf nodes as ellipses. To classify an unknown sample, its attribute values are tested against the decision tree; the path from the root to a leaf node stores the prediction for that sample. [1]
[Figure 1. Decision tree for the concept buys_computer; branch labels: ≤30, 31~40, >40; no, yes; excellent, fair.]
Decision tree representation means inferring classification rules, in the form of a decision tree, from a set of unordered, irregular instances. Decision tree construction is a greedy algorithm that works top-down and recursively: attribute values are compared at the internal nodes of the tree, branches are followed downward according to those values, and conclusions are reached at the leaf nodes. Each path from the root to a leaf corresponds to a conjunctive rule, and the whole tree corresponds to a set of rules in disjunctive form. [2]
A decision tree is generated top-down by a divide-and-conquer process, and the fundamental difference among decision tree algorithms lies in their attribute selection criteria; therefore, choosing the test attribute and deciding how to partition the samples are the keys to generating the decision tree. [3]
Taking the well-known ID3 algorithm as an example, the tree generation process can be summarized as follows: starting from the root with all the training samples, select an attribute to partition the samples and generate a branch for each attribute value; move the subset of samples corresponding to each attribute value to the newly generated child node; and apply the above process recursively to each child node until all the samples at a node belong to the same class. [4]
When constructing a decision tree, an attribute selection measure is used to choose the attribute that best separates the tuples into different classes, and the comparison of attributes under this measure determines the topology of the tree. The key step in tree construction is splitting on an attribute, that is, dividing a node into different branches according to the characteristics of the chosen attribute, with the goal of making each resulting partition as "pure" as possible. [5] Attribute splitting falls into three cases:
1. The attribute is discrete and a binary tree is not required: each value of the attribute forms its own branch;
2. The attribute is discrete and a binary tree is required: a subset of the attribute values is used as the test, and the samples are divided into two branches according to whether they "belong to this subset" or "do not belong to this subset";
3. The attribute is continuous: a split point split_point is determined, and two branches are generated according to > split_point and ≤ split_point. [6] (A brief sketch of these three cases follows below.)
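To make the three splitting situations concrete, the following is a small, hedged C# sketch showing which branch key a sample would follow under each scheme; the class name SplitCases and the returned branch labels are illustrative assumptions, not part of any specific algorithm described here.

```csharp
using System.Collections.Generic;

static class SplitCases
{
    // Case 1: discrete attribute, multi-way split - one branch per attribute value.
    public static string Branch(string discreteValue) => discreteValue;

    // Case 2: discrete attribute, binary split on a chosen subset of values.
    public static string Branch(string discreteValue, ISet<string> subset) =>
        subset.Contains(discreteValue) ? "in subset" : "not in subset";

    // Case 3: continuous attribute, binary split on a threshold split_point.
    public static string Branch(double continuousValue, double splitPoint) =>
        continuousValue <= splitPoint ? "<= split_point" : "> split_point";
}
```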
In general, the smaller the tree, the stronger its predictive ability; to construct a decision tree that is as small as possible, the key is the attribute selection measure. Since a good test attribute cannot be chosen by logical judgment alone, heuristic strategies are used, and the choice of attribute depends on measures of the impurity of the example subsets. Impurity measures include information gain, information gain ratio, the Gini index, distance measures, the J-measure, the G statistic, the χ² statistic, weight of evidence, minimum description length (MDL), orthogonality, relevance and Relief. Different measures have different effects, especially for multi-valued attributes. [7]
When a tree is built, many branches may reflect anomalies in the training data caused by noise and outliers. To deal with this overfitting of the data, tree pruning methods are used: the least reliable branches are cut off, reducing the size of the decision tree, which speeds up classification and improves the tree's ability to correctly classify data independent of the training set. Two common pruning approaches are pre-pruning and post-pruning.
Pre-pruning halts the construction of the tree early, according to certain rules, before the training set has been completely and correctly classified. Once growth stops at a node, that node becomes a leaf, which may hold the most frequent class among its subset of samples or the class probability distribution of those samples. During construction, statistical measures such as information gain are used to assess the quality of each split; if splitting the samples at a node would fall below a predetermined threshold, further division of the subset stops. This approach is relatively simple and efficient and is suitable for large-scale problems. [8]
Post-pruning removes unreliable branches from a "fully grown" tree by deleting branch nodes and cutting the corresponding subtrees. It uses the training samples or a separate pruning set to check the prediction accuracy of each decision subtree for the target variable and computes the corresponding error rate. The user can set a maximum allowable error rate in advance; when pruning has reached a certain depth, it stops if the computed error rate exceeds the maximum allowed, and continues otherwise. The result is the decision tree with the smallest expected error rate. [8]
Good scalability means that a decision tree algorithm can handle large amounts of data, or can speed up the mining process, quickly and accurately discovering the main classification rules hidden in a large data set. Scalability involves both enhancing the decision tree algorithm itself and extending it. The main considerations for the scalability of decision tree algorithms are the following:
(1) Enhancing the accuracy and effectiveness of decision tree classification. For example, if an attribute takes a whole range of discrete or continuous values, the ID3 algorithm, which assumes all attributes are discrete, needs to be improved to adapt to the new attribute characteristics; when an attribute is multi-valued, using information gain as the selection measure in ID3 affects the accuracy and validity of the classification, in which case the information gain ratio, or other selection measures such as the Gini index or the G statistic, can be used to suit multi-valued attributes.
(2) The space complexity of the algorithm. Common algorithms such as ID3 and C4.5 require the samples to reside in main memory; when training samples must be swapped in and out of main memory, the algorithm may become inefficient.
(3) The time complexity of the algorithm. When the number of samples is very large, each sample has many attributes, and each attribute has many values, the problem becomes complex, and decision tree algorithms often require multiple scans of the entire database; improving scalability and reducing running time is therefore critical. [9]
Relevant indicators are often used to evaluate a decision tree algorithm and to reflect the merits of different algorithms. The main evaluation criteria are: tree size, accuracy, complexity, robustness, scalability and interpretability.
(1) Tree size. When building a decision tree, provided the performance requirements are met, one always tries to reduce the size of the tree. There are two measures of tree size: the number of nodes and the number of leaf nodes. The simpler the generated decision tree, the stronger its classification and prediction ability. [3]
(2) Accuracy. The accuracy of a decision tree algorithm is its classification accuracy, measured by the percentage of misclassified records among all records in the test set. Accuracy is improved mainly through decision tree pruning techniques. [5]
(3) Complexity. Computational complexity depends on the specific implementation details and the hardware environment; in data mining, where the objects of operation are massive databases, space and time complexity are very important aspects of the problem.
(4) Robustness. Robustness mainly concerns whether the decision tree algorithm can deal effectively with noisy data or missing values in the given data, so as to ensure the correctness of classification and prediction.
(5) Scalability. Scalability mainly concerns whether, for huge amounts of data, the decision tree algorithm can quickly and accurately find the main classification rules hidden in a large data set.
(6) Interpretability. A decision tree is a representation of knowledge; the classification rules extracted from the generated tree should be clear and easy to understand, since only then is a decision tree algorithm meaningful. [5]
This chapter has introduced the knowledge involved in the whole decision tree process: the basic concepts, related terminology and underlying ideas of decision tree algorithms, giving a preliminary understanding and laying the theoretical foundation for the core decision tree technologies discussed next.
3. Research on the Core Technologies of Decision Trees and Performance Comparison
The ID3 algorithm is the typical representative of decision tree algorithms. It was proposed by Quinlan and is based on information gain; the vast majority of decision tree algorithms are improvements implemented on the basis of ID3. ID3 always chooses the attribute with the highest information gain as the test attribute of the current node. This choice minimizes the amount of information needed to classify the samples in the resulting partition and reflects the least randomness, or "impurity", of the partition. This information-theoretic approach minimizes the expected number of tests needed to classify an object and tries to ensure that a simple tree is found.
Let S be a set of s data samples. Suppose the class label attribute has m distinct values, defining m distinct classes Ci (i = 1, 2, ..., m), and let si be the number of samples in class Ci. The expected information (entropy) needed to classify a given sample is given by
I(s1, s2, ..., sm) = −∑ᵢ pi log2(pi), the sum running over i = 1, ..., m,
where pi is the probability that an arbitrary sample belongs to Ci, generally estimated by si/s. The logarithm is taken to base 2 because the information is encoded in bits.
Suppose attribute A has v distinct values {a1, a2, ..., av}. Attribute A can be used to partition S into v subsets {S1, S2, ..., Sv}, where Sj contains those samples of S that have value aj on A. If A is selected as the test attribute (i.e., the best splitting attribute), these subsets correspond to the branches grown from the node containing the set S.
Let sij be the number of samples of class Ci in subset Sj. The entropy, or expected information, of the partition of S by A is given by the weighted sum
E(A) = ∑ⱼ [(s1j + s2j + ... + smj) / s] · I(s1j, s2j, ..., smj), the sum running over j = 1, ..., v.
Here the term (s1j + s2j + ... + smj)/s acts as the weight of the j-th subset and equals the number of samples in that subset (i.e., those with A = aj) divided by the total number of samples in S. The smaller the weighted entropy, the higher the purity of the partition. Note that, according to the expected-information formula given above, for a given subset Sj the entropy is computed as
I(s1j, s2j, ..., smj) = −∑ᵢ pij log2(pij),
where pij = sij / |Sj| is the probability that a sample in Sj belongs to class Ci.
From the expected information and the entropy of the partition, the corresponding gain value is obtained. The information gain obtained by branching on A is given by
Gain(A) = I(s1, s2, ..., sm) − E(A).
The ID3 algorithm computes the information gain of every attribute and selects the attribute with the highest information gain for the given set S as the test attribute. A node is created and labeled with the selected test attribute, a branch is created for each value of that attribute, and the samples are partitioned accordingly. [9]
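To make the attribute-selection measure concrete, the following is a minimal C# sketch (not the thesis's classifier implementation) of the entropy I(S) and the information gain Gain(A) defined above; the class name Id3Measures and the tuple-based sample representation are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Id3Measures
{
    // I(s1, ..., sm) = -sum_i pi * log2(pi), with pi estimated from class counts.
    public static double Entropy(IEnumerable<string> classLabels)
    {
        var counts = classLabels.GroupBy(c => c).Select(g => (double)g.Count()).ToList();
        double total = counts.Sum();
        return -counts.Sum(count =>
        {
            double p = count / total;
            return p * Math.Log(p, 2);
        });
    }

    // Gain(A) = I(S) - E(A), where E(A) weights the entropy of each subset Sj
    // (one subset per value of attribute A) by its relative size.
    public static double InformationGain(IList<(string attributeValue, string classLabel)> samples)
    {
        double before = Entropy(samples.Select(s => s.classLabel));
        double weighted = samples
            .GroupBy(s => s.attributeValue)
            .Sum(g => (double)g.Count() / samples.Count * Entropy(g.Select(s => s.classLabel)));
        return before - weighted;
    }
}
```

On an example such as buys_computer, InformationGain would be evaluated once per candidate attribute, and the attribute with the largest value would label the current node.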
Assume the training samples samples are given, with attributes that take discrete values, and that attribute_list is the set of candidate attributes.
Method Generate_decision_tree(samples, attribute_list):
(1) Create a node N;
(2) if all samples are in the same class C then
(3) return N as a leaf node labeled with class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the most common class in samples; // majority voting
(6) select best_attribute, the attribute in attribute_list with the highest information gain;
(7) label node N with best_attribute;
(8) for each known value ai of best_attribute // partition samples
(9) grow a branch from node N for the condition best_attribute = ai;
(10) let si be the set of samples in samples for which best_attribute = ai; // one partition
(11) if si is empty then
(12) attach a leaf labeled with the most common class in samples;
(13) else attach the node returned by Generate_decision_tree(si, attribute_list − best_attribute);
In this way the ID3 algorithm constructs the decision tree recursively, top-down, and the tree can be expressed as a set of If...Then classification rules. [9]
As the most typical decision tree algorithm, ID3 has many advantages:
(1) The hypothesis space of ID3 contains all possible decision trees; it is a complete space of finite discrete-valued functions of the existing attributes, which avoids the risk of searching an incomplete hypothesis space;
(2) ID3 uses all of the training data, so it can exploit the statistical properties of the entire training set when making decisions, thereby reducing the impact of noisy data;
(3) The classification rules derived from the decision tree generated by ID3 are easy to understand;
(4) ID3 uses a top-down strategy that searches only part of the whole space, which keeps the number of tests small and classification fast;
(5) The decision tree generated by ID3 clearly shows which attributes are more important. [3]
ID3 also has a number of disadvantages:
(1) ID3 can only handle discrete attributes, not continuous ones;
(2) ID3 builds each node using only a single attribute as the branching test attribute, which limits the classification rules the tree can express;
(3) ID3 depends strongly on the quality of the training set and is susceptible to noise and to the amount of data;
(4) ID3 builds a multi-way tree in which the number of children of a node equals the number of values of the branching attribute, which is unfavorable when an attribute has a large number of values;
(5) The ID3 decision tree cannot easily be changed without rebuilding the entire tree. [10]
In any case, because of its clear theory, simple method and good learning ability, ID3 is suitable for large-scale learning problems and is widely applied in machine learning and data mining.
The C4.5 algorithm was proposed by Quinlan in 1993. It evolved from ID3 and, in addition to the functionality of ID3, introduces new methods and adds new features.
(1) Selecting the test attribute with the information gain ratio
Let attribute A have v distinct values {a1, a2, ..., av}, so that A partitions S into v subsets {S1, S2, ..., Sv}, where Sj contains the samples of S whose value on A is aj. If the samples are partitioned with respect to attribute A, the split information SplitI(A) is the entropy of this partition, in the sense of the entropy defined earlier.
The information gain ratio develops the concept of information gain; the gain ratio of an attribute is given by
GainRatio(A) = Gain(A) / SplitI(A),
where
SplitI(A) = −∑ⱼ (|Sj| / s) log2(|Sj| / s), the sum running over j = 1, ..., v.
C4.5 first computes the information gain of every attribute and then applies the gain-ratio test only to those attributes whose information gain is above the average.
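As a hedged illustration of the gain ratio, the following sketch reuses the Id3Measures helpers from the ID3 sketch above; the class name C45Measures is an assumption, and the guard against a near-zero SplitI(A) anticipates the instability noted later for very small split entropies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class C45Measures
{
    // SplitI(A): the entropy of the partition of S induced by the values of A.
    public static double SplitInfo(IList<(string attributeValue, string classLabel)> samples)
    {
        return -samples.GroupBy(s => s.attributeValue)
                       .Sum(g =>
                       {
                           double p = (double)g.Count() / samples.Count;
                           return p * Math.Log(p, 2);
                       });
    }

    // GainRatio(A) = Gain(A) / SplitI(A).
    public static double GainRatio(IList<(string attributeValue, string classLabel)> samples)
    {
        double split = SplitInfo(samples);
        return split < 1e-12 ? 0.0 : Id3Measures.InformationGain(samples) / split;
    }
}
```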
(2) Handling continuous attribute values
If attribute A is continuous, its values in the training set can be arranged in ascending order {a1, a2, ..., am} (where m is the size of the training set), giving n distinct values. For each value vj (j = 1, 2, ..., n), all the records are divided into two parts: those whose value of A falls at or below vj and those whose value is greater than vj. C4.5 computes the information gain of each candidate vj, selects the value with the highest information gain as the threshold, and divides the original range of A into the two intervals A ≤ vj and A > vj. The information gain ratio of each such binary partition is then computed, and the discretization with the maximum gain ratio is chosen for the corresponding attribute.
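The threshold search can be sketched as follows, again reusing Id3Measures.Entropy from the earlier sketch; taking the midpoint between adjacent values as the candidate cut is an assumption for illustration, since the text above only says that each candidate value is tried.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ContinuousSplit
{
    // Returns the threshold v with the highest information gain for a continuous
    // attribute A, so that records can then be split into A <= v and A > v.
    public static double BestThreshold(IList<(double value, string classLabel)> samples)
    {
        var sorted = samples.OrderBy(s => s.value).ToList();
        double baseEntropy = Id3Measures.Entropy(sorted.Select(s => s.classLabel));
        double bestGain = double.NegativeInfinity;
        double bestThreshold = sorted[0].value;

        for (int i = 0; i < sorted.Count - 1; i++)
        {
            if (sorted[i].value == sorted[i + 1].value) continue; // no cut between equal values
            double threshold = (sorted[i].value + sorted[i + 1].value) / 2.0;

            var left = sorted.Take(i + 1).Select(s => s.classLabel).ToList();
            var right = sorted.Skip(i + 1).Select(s => s.classLabel).ToList();
            double weighted = (double)left.Count / sorted.Count * Id3Measures.Entropy(left)
                            + (double)right.Count / sorted.Count * Id3Measures.Entropy(right);

            double gain = baseEntropy - weighted;
            if (gain > bestGain) { bestGain = gain; bestThreshold = threshold; }
        }
        return bestThreshold;
    }
}
```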
(3) Handling training samples with unknown attribute values
C4.5 can process samples that contain unknown attribute values. Its approach is to replace an unknown value with the most commonly used value of the attribute, or with the value most common within the same class. More precisely, a probabilistic method based on the known values of the attribute is used: each possible value is assigned a probability, and these probabilities depend on the frequencies of the known values of the attribute. [11]
(4) Pruning the tree after construction
After the decision tree is generated, C4.5 applies pruning techniques to correct the overfitting of the classification, cutting branches that do not improve prediction accuracy. If classifying all the samples at a node N as a single class does not cause more classification errors than keeping the extra branches below N, then N is not split further and its child nodes are cut off. [12]
Assume the training samples samples are given; attribute_list is the set of candidate attributes (attributes may take continuous values). Method Generate_decision_tree(samples, attribute_list):
(1) Create a node N;
(2) if the training set is empty, return node N labeled Failure;
(3) if all records in the training set belong to the same class, label node N with that class;
(4) if the candidate attribute list is empty, return N as a leaf node labeled with the most common class in the training set;
(5) for each candidate attribute in attribute_list
(6) if the candidate attribute is continuous then
(7) discretize the attribute;
(8) select D, the candidate attribute in attribute_list with the highest information gain ratio;
(9) label node N with attribute D;
(10) for each known value di of attribute D
(11) grow a branch from node N for the condition D = di;
(12) let si be the set of training samples for which D = di;
(13) if si is empty
(14) attach a leaf node labeled with the most common class in the training set;
(15) else attach the node returned by Generate_decision_tree(si, attribute_list − {D}); [3]
Compared with ID3, C4.5 makes many improvements on the basis of the ID3 algorithm. Its advantages are:
(1) C4.5 uses the information gain ratio as the attribute selection criterion; practice shows that this gives better prediction results than information gain and makes the algorithm more robust;
(2) C4.5 overcomes the shortcomings of ID3: it can deal directly with continuous-valued attributes and can also handle training samples containing unknown attribute values;
(3) The decision tree generated by C4.5 has fewer branches and strong predictive performance, and it produces classification rules that are easier to understand;
(4) C4.5 supports incremental learning. [13]
However, C4.5 also has some shortcomings:
(1) Although C4.5 has advantages in speed and prediction accuracy, it is at a disadvantage when it comes to constructing a smaller decision tree;
(2) As the decision tree is constructed, the data are repeatedly divided into smaller and smaller parts, which easily leads to the fragmentation problem, reducing the robustness of the decision tree and lowering its prediction accuracy; [14]
(3) When discretizing a continuous attribute, C4.5 computes the information gain ratio of every split point in order to find the optimal threshold, which is time-consuming when the attribute has many values; moreover, if the split entropy is very small, the information gain ratio becomes unstable;
(4) C4.5 evaluates the tree mainly by its error rate and depth, without considering the number of nodes, yet the average depth of the tree affects prediction speed, while the number of nodes represents the size of the tree. [3]
ID3 and C4.5 are the construction algorithms that have received the most attention in the decision tree field, and previous research has produced many other construction algorithms. These algorithms differ mainly in their attribute selection criteria and measures and in their strategies for handling different types of data, so that they can handle more practical problems while maintaining good algorithmic complexity and classification accuracy, meeting the classification needs of different application fields. The remaining four common construction algorithms are briefly described below.
(1) CART: Classification and Regression Trees (CART) is a technique that produces a binary decision tree. The CART algorithm was first proposed by Breiman et al. and has been widely used in the field of statistics. It selects the attribute with the smallest Gini index as the test attribute: the smaller the Gini value, the higher the purity of the samples and the better the split. CART uses recursive binary partitioning, splitting the current sample set into two subsets so that every non-leaf node of the generated tree has exactly two branches. CART therefore generates a binary decision tree with a simple structure. [15]
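The following is a minimal sketch of the Gini measure used by CART, under the same illustrative sample representation as the earlier sketches; the split with the smallest weighted Gini value is the one preferred.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CartMeasures
{
    // Gini(S) = 1 - sum_i pi^2; smaller values mean a purer sample set.
    public static double Gini(IList<string> classLabels)
    {
        return 1.0 - classLabels.GroupBy(c => c)
                                .Sum(g => Math.Pow((double)g.Count() / classLabels.Count, 2));
    }

    // Weighted Gini of a binary split into two subsets; the best split minimizes this.
    public static double GiniSplit(IList<string> leftLabels, IList<string> rightLabels)
    {
        double total = leftLabels.Count + rightLabels.Count;
        return leftLabels.Count / total * Gini(leftLabels)
             + rightLabels.Count / total * Gini(rightLabels);
    }
}
```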
(2) CHAID: CHAID stands for "Chi-squared Automatic Interaction Detection" and was proposed by Kass in 1980. It is suited to the analysis of categorical and ordinal data and is a decision tree algorithm with target-oriented variable selection and clustering capabilities. [16] CHAID starts from the raw data: it first selects the categorical target variable, then crosses the target classification with each candidate classification variable, producing a series of two-way frequency tables; a chi-square statistic or likelihood-ratio statistic is computed for each two-way table, the sizes of the statistics are compared, and the two-way table with the largest statistic is taken as the best initial classification. The procedure then continues to classify the target variable within each branch using the best classification variables, repeating the process until the classification conditions are met. [17] CHAID generates a multi-way tree with good performance. [18]
(3) SLIQ: The SLIQ algorithm uses three kinds of data structures to construct the decision tree: attribute lists, a class list and class histograms. An attribute list contains two fields, the attribute value and the sample number, the latter pointing into the class list as an index. The class list also contains two fields: the class label of the sample and a pointer to the leaf node to which the sample belongs. The k-th record of the class list corresponds to the k-th sample of the training set (sample number k), which establishes the association between the attribute lists and the class list. Because the class list must always indicate how the samples are currently partitioned, it must remain resident in memory, whereas the attribute list of each attribute can reside on disk. A class histogram is attached to a leaf node and describes the class distribution of an attribute at that node. For a continuous attribute it consists of a set of pairs <class, number of samples of that class>; for a discrete attribute it consists of a set of triples <attribute value, class, number of samples of that class taking that value>. The histogram values are updated continuously as the algorithm runs. [22]
In the tree-construction phase, SLIQ adopts a strategy that combines pre-sorting of continuous attributes with breadth-first growth, and uses a fast subsetting algorithm to determine the splitting condition for discrete attributes. SLIQ generates a binary decision tree. In the pruning phase, SLIQ prunes based on the MDL principle, that is, the pruning of the decision tree is guided by the cost of encoding it, so as to find the tree with minimum cost. [12]
(4) SPRINT: The SPRINT algorithm is an improvement on SLIQ. Because SLIQ requires the class list to be memory-resident, the algorithm cannot proceed when the training set grows so large that the class list no longer fits in memory, which limits the maximum amount of data SLIQ can process. [19] SPRINT therefore defines two data structures, the attribute list and the histogram. An attribute list consists of three fields: the attribute value, the class label and the sample number. Attribute lists are split as the nodes are expanded and divided, and the pieces are attached to the corresponding child nodes. When a node is split into two children, its attribute lists are also split into two lists; since the list of a continuous attribute is sorted in advance, the lists remain ordered after splitting. SPRINT has no class list, and all attribute lists can be swapped in and out of memory, so there is effectively no limit on the size of the data set. A histogram is attached to a node and describes the class distribution of an attribute at that node. For a continuous attribute, the node is associated with two histograms, Cbelow and Cabove: the former describes the class distribution of the samples already processed, the latter that of the samples not yet processed, and both are updated as the algorithm runs. For a discrete attribute, the node is associated with a single histogram, the count matrix. [12]
The construction process of SPRINT is: (1) generate the root node and create an attribute list for every attribute, pre-sorting the attribute lists of continuous attributes; (2) if the samples at the node all belong to one class, the algorithm stops, otherwise go to (3); (3) use the attribute lists to find the split with the smallest Gini value as the best partitioning scheme; (4) according to this scheme, generate the two child nodes N1 and N2 of the node; (5) split each attribute list of the node so that its entries are associated with N1 or N2; (6) apply step (2) to N1 and N2 respectively. [12]
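As a hedged illustration of the attribute-list structure described above (attribute value, class label, sample number), the following C# sketch builds a pre-sorted attribute list for a continuous attribute; the type and field names are assumptions for illustration, not SPRINT's actual implementation.

```csharp
using System.Collections.Generic;
using System.Linq;

// One entry of a SPRINT-style attribute list: <attribute value, class label, sample id>.
public record AttributeListEntry(double AttributeValue, string ClassLabel, int SampleId);

public static class AttributeLists
{
    // A continuous attribute list is sorted once up front and stays sorted
    // when it is later split between the two child nodes.
    public static List<AttributeListEntry> Build(
        IEnumerable<(double value, string classLabel, int sampleId)> samples)
    {
        return samples
            .Select(s => new AttributeListEntry(s.value, s.classLabel, s.sampleId))
            .OrderBy(e => e.AttributeValue)
            .ToList();
    }
}
```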
The six typical construction algorithms described above differ in the attribute data types they can handle, in the key techniques they use, and in the performance of the decision trees they generate. Based on the collected material and the study of each algorithm, the six algorithms are compared below in tabular form.
From the comparison of the six construction algorithms it can be seen that the different algorithms have their own characteristics and advantages in the data they can process, in their core construction techniques, and in the performance of the generated trees: C4.5 can handle continuous-valued attributes; when dealing with large data sets, algorithms that use the information gain ratio, the Gini index or statistical measures perform better than ID3; [18] SLIQ and SPRINT, after introducing the corresponding data structures, have good scalability and parallelism and more stable performance in practical applications; and, as far as the type of generated tree is concerned, binary decision trees are less prone to data fragmentation and tend to be more accurate. [12] In practical applications, therefore, the appropriate decision tree algorithm should be chosen according to the specific situation, in order to better achieve the classification and prediction goals.
The CCP algorithm was the first post-pruning algorithm; it was used by Breiman et al. in 1984 in the famous CART system. In CCP, the "cost" of pruning a subtree Tt and replacing it with a leaf is the resulting increase in the total number of misclassified samples, while the "complexity" is the reduction in the number of leaf nodes when Tt is pruned.
Let α express the relationship between cost and complexity after pruning; it is defined as
α = (R(t) − R(Tt)) / (|Nt| − 1),
where |Nt| is the number of leaf nodes of the subtree Tt; R(t) is the error produced when the subtree Tt is pruned and replaced by the single leaf node t, computed as R(t) = r(t)·p(t), with r(t) the misclassification rate of the samples at node t and p(t) the proportion of all samples that fall at node t; and R(Tt) is the error of the unpruned subtree Tt, computed as R(Tt) = ∑ᵢ R(i), where i ranges over the leaf nodes of Tt.
The result of CCP pruning is a compromise between the number of misclassified samples and the size of the tree. Its pruning proceeds in two steps:
(1) Compute the α value of every non-leaf node of the complete tree Tmax, and repeatedly prune the subtree with the smallest α until only the root node remains; this step yields a series of pruned trees {T0, T1, T2, ..., Tm}, where T0 is Tmax and Ti+1 is the result of pruning Ti;
(2) Use a new pruning set to evaluate the trees obtained in step (1) and select the best pruned tree among the Ti as the final result. [20] (An illustrative computation of α is sketched below.)
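To illustrate step (1), the following sketch computes the cost-complexity measure α for a subtree rooted at a node t, using a simple illustrative node type; the type name, field names and the error bookkeeping are assumptions, not the CART implementation.

```csharp
using System.Collections.Generic;
using System.Linq;

public class PruneNode
{
    public List<PruneNode> Children { get; } = new List<PruneNode>();
    public int ErrorsAsLeaf { get; set; }   // misclassified samples if this node is replaced by a leaf
    public bool IsLeaf => Children.Count == 0;
}

public static class CcpMeasure
{
    public static int LeafCount(PruneNode t) => t.IsLeaf ? 1 : t.Children.Sum(LeafCount);

    // R(Tt): the error of the unpruned subtree is the sum of the errors at its leaves.
    public static int SubtreeErrors(PruneNode t) =>
        t.IsLeaf ? t.ErrorsAsLeaf : t.Children.Sum(SubtreeErrors);

    // alpha = (R(t) - R(Tt)) / (|Nt| - 1): the error increase per leaf removed;
    // the internal node with the smallest alpha is pruned first.
    public static double Alpha(PruneNode t, int totalSamples)
    {
        double rLeaf = (double)t.ErrorsAsLeaf / totalSamples;
        double rSubtree = (double)SubtreeErrors(t) / totalSamples;
        return (rLeaf - rSubtree) / (LeafCount(t) - 1);
    }
}
```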
The PEP algorithm uses the training set itself for pruning, so no separate pruning set is needed. Since the same training set is used both to generate the decision tree and to prune it, the resulting subtree error rate r(t) is biased. Quinlan therefore introduced a continuity correction based on the binomial distribution to correct the error rate r(t) measured on the training set, which yields a more reasonable corrected number of misclassified samples.
Assume the misclassification rate at node t on the training set is r(t) = e(t)/n(t), where e(t) is the number of misclassified samples at t and n(t) the number of samples at t. After the continuity correction, the misclassification rate becomes r'(t) = (e(t) + 1/2) / n(t).
For simplicity, the number of misclassified samples is used below instead of the misclassification rate. After the continuity correction, the number of misclassified samples produced when node t is pruned becomes e'(t) = e(t) + 1/2, while the number of misclassified samples of the unpruned subtree Tt becomes e'(Tt) = ∑ₛ (e(s) + 1/2), where s ranges over all the leaf nodes of Tt. Assuming the errors of the subtree Tt follow a binomial distribution, the standard error of the subtree is computed as
SE(e'(Tt)) = sqrt( e'(Tt) · (n(t) − e'(Tt)) / n(t) ).
The PEP algorithm traverses the complete tree Tmax in top-down order and, for each internal node t, compares e'(t) with e'(Tt) + SE(e'(Tt)); when e'(t) ≤ e'(Tt) + SE(e'(Tt)), the subtree Tt rooted at t is pruned and replaced by a leaf node, whose class is determined by the majority principle. [20]
For a k-class problem, MEP defines the expected probability that a sample reaching node t belongs to class i as
pi(t) = (ni(t) + pai·m) / (n(t) + m),
where ni(t) is the number of samples of class i at node t, n(t) is the total number of samples at t, pai is the prior probability of class i obtained from the training set, and m expresses the degree to which the prior probability influences the expected probability pi(t), thereby reflecting the degree of pruning; for simplicity, m is taken to be the same for all classes.
When a new sample arriving at node t is to be classified, the expected error rate of node t is
EER(t) = min over i of [1 − pi(t)];
when m = k and pai = 1/k, this reduces to EER(t) = (n(t) − nc(t) + k − 1) / (n(t) + k), where nc(t) is the number of samples of the majority class at t.
In MEP, the expected error rate of each internal node t is computed bottom-up. The expected error when t is treated as a leaf is called the static error STE(t) = EER(t); the expected error of the subtree Tt is called the dynamic error DYE(t), which is the weighted sum of the errors of the child nodes of t. Suppose t has child nodes t1, t2, ..., tn, where n is the number of children of t; then
DYE(t) = p1·E(t1) + p2·E(t2) + ... + pn·E(tn),
where p1, p2, ..., pn are the proportions of the samples at t that fall into the child nodes t1, t2, ..., tn, and E(tj) is the (static or dynamic) error of child tj.
The MEP algorithm traverses the complete tree Tmax in bottom-up order, computing the static error STE(t) and the dynamic error DYE(t) of each subtree Tt; when the static error of the subtree is no greater than its dynamic error, the subtree is pruned and replaced by a leaf node, whose class is determined by the majority principle. [20]
The REP pruning algorithm, proposed by Quinlan, is the simplest bottom-up post-pruning algorithm and has been successfully applied in the C4.5 system. The algorithm requires part of the data to be set aside from the training set as a pruning set, which is used to compute the accuracy of each subtree. Because the pruning set is not used to build the decision tree, the bias when estimating the error rate is smaller.
Working bottom-up, for every subtree S of the tree T, REP turns S into a leaf node whose class is determined by the majority principle, generating a new tree. If, on the pruning set, the new tree makes no more classification errors than the original, and the subtree S contains no subtree with the same property, then S is removed and replaced by a leaf node. This process is repeated until replacing any further subtree by a leaf node would increase the classification error on the pruning set.
In this way, nodes that were added because of coincidental regularities in the training set are likely to be removed, because the same coincidences are unlikely to appear in the pruning set. By repeatedly comparing error rates, the node whose deletion most improves the accuracy of the tree on the pruning set is invariably selected, and pruning continues until further pruning would reduce the accuracy of the decision tree on the pruning set. [21]
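A minimal sketch of REP in this spirit is given below, assuming each node already stores how many pruning-set samples it would misclassify if collapsed to its majority-class leaf; the node type and field names are illustrative assumptions, not the C4.5 implementation.

```csharp
using System.Collections.Generic;
using System.Linq;

public class RepNode
{
    public List<RepNode> Children { get; } = new List<RepNode>();
    public int PruningSetErrorsAsLeaf { get; set; }  // pruning-set errors if collapsed to a leaf
    public bool IsLeaf => Children.Count == 0;

    public int PruningSetErrorsAsSubtree() =>
        IsLeaf ? PruningSetErrorsAsLeaf
               : Children.Sum(c => c.PruningSetErrorsAsSubtree());
}

public static class RepPruning
{
    // Bottom-up: prune the children first, then collapse this subtree if the leaf
    // makes no more errors on the pruning set than the subtree does.
    public static void Prune(RepNode node)
    {
        foreach (var child in node.Children) Prune(child);

        if (!node.IsLeaf && node.PruningSetErrorsAsLeaf <= node.PruningSetErrorsAsSubtree())
            node.Children.Clear();
    }
}
```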
The MDL algorithm is a pruning strategy based on the minimum description length principle: it prunes the decision tree according to the cost of encoding it. The goal is a tree that is consistent with most of the training samples, while the samples that do not fit are encoded as exceptions, so that the sum of the following two quantities is minimized: the minimum number of bits required to encode the decision tree and the minimum number of bits required to encode the exceptions.
MDL pruning involves four kinds of encoding: data encoding, model encoding, tree encoding and split encoding. There are three coding schemes for the tree. Code1: a node either has two subtrees or has none, so 1 bit is needed; Code2: a node may have two subtrees, no subtree, only a left subtree, or only a right subtree, so 2 bits are needed; Code3: only internal nodes are encoded, in which case a node may have two subtrees, only a left subtree, or only a right subtree, so log2(3) bits are needed.
MDL pruning evaluates the code length at each internal node of the decision tree to decide whether the node should be converted into a leaf, whether its left (or right) subtree should be deleted, or whether the node should remain unchanged. To make this choice, the code lengths of the following four situations are computed:
(1) the cost when t is converted into a leaf, which includes the classification errors at the leaf, i.e., the number of samples at node t whose class label is inconsistent with the class assigned to t;
(2) the cost of encoding the test at the internal node t, plus the cost of retaining the left subtree t1 and the cost of retaining the right subtree t2;
(3) the cost when the right subtree t2 is deleted;
(4) the cost when the left subtree t1 is deleted.
Based on these computations, three pruning strategies are possible:
(1) Full pruning: if converting t into a leaf gives a code length no larger than retaining both subtrees, delete the left and right children so that t becomes a leaf node; the tree is then encoded with Code1.
(2) Partial pruning: compute all four costs above and choose the option with the shortest code length; the tree is then encoded with Code2.
(3) Hybrid pruning: prune in two steps: first use full pruning to select a smaller tree, then consider options (2), (3) and (4) above for further pruning. [22]
The main issues in decision tree pruning are the size and the accuracy of the tree. The purpose of pruning is to simplify the original tree as much as possible without reducing accuracy, so that the pruned tree is smaller and the classification rules derived from it are easier to understand. The five major post-pruning methods described above each have their own characteristics in terms of error estimation and computational complexity. In general, REP is the simplest method but is not suitable for small data sets; PEP is considered the most accurate of the current pruning algorithms; CCP produces trees smaller than those of REP; the trees optimized by MEP are larger than those of PEP and less accurate; and MDL has low computational complexity and does not require an independent pruning set, its philosophy being that the simplest solution is the most desirable. [21]
Therefore, in practice, since sample data sets differ in the degree of association among attributes and in the completeness of the collected data, the choice of pruning method for a particular type of data set should be made according to the specific circumstances.
Conclusion
This chapter has focused on the core concepts of decision tree technology: the construction strategies and performance of ID3 and C4.5, a comparative analysis of the performance characteristics of the remaining typical construction algorithms, and a study of typical pruning algorithms such as CCP and REP, examining the characteristics of the various pruning strategies from four performance perspectives. Studying and comparing the performance of these core technologies gives a more comprehensive understanding of how decision trees are built and applied, and lays a solid theoretical foundation for the optimization and implementation of the decision tree classifier that follows.
Post-pruning is the commonly used technique for optimizing a decision tree, but existing post-pruning algorithms have several shortcomings: (1) they prune the decision tree itself, and the result of pruning is either to remove completely a decision node that meets the pruning condition or to keep it unchanged; if the root of a subtree is deleted while parts below it need to be preserved, how to reorganize the tree becomes a problem [21]; (2) the pruning process often uses the training set itself to test the temporarily pruned tree and decides whether to prune according to its accuracy on the training set; if the training data are sparse, this causes excessive pruning and over-generalized rules, and accuracy drops; (3) because the importance of the rule antecedents corresponding to the non-leaf nodes of the tree decreases from the root toward the leaves, top-down rule pruning often removes the more important antecedent attributes first, while bottom-up pruning, although it guarantees that the less important attribute of a rule antecedent is deleted first, has stopping conditions that often end the pruning process too early, leaving redundant parts untrimmed; (4) the stopping condition of a post-pruning algorithm controls the size of the pruned tree, but post-pruning does not effectively consider reducing the number of unnecessary rules or improving the confidence of the rules.
Rule post-pruning simplifies the rule set extracted from the tree, using the information content of a rule as the optimization criterion. Compared with pruning the decision tree itself, operating directly on the rule set, whose antecedents are conjunctions of conditions, is more flexible and gives better overall optimization performance: it keeps the pruned model small while improving classification accuracy. The basic steps of rule post-pruning can be summarized as follows:
(1) Create one rule for each path from the root to a leaf of the tree, obtaining an equivalent rule set;
(2) Evaluate the rule set: for each rule, if omitting it does not increase the classification error of the remaining rules on the training set, delete the rule;
(3) Sort the pruned rules by their estimated accuracy, and apply them in this order when classifying subsequent examples. (A small sketch of step (2) is given below.)
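A hedged sketch of step (2) follows: a rule is dropped whenever removing it does not increase the error on the training set. The Rule shape, the first-match classification strategy and the default class are assumptions made for illustration, not the thesis's implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A rule: a conjunction of antecedent tests over a sample plus a predicted class.
public record Rule(IReadOnlyList<Func<IDictionary<string, string>, bool>> Antecedents, string Consequent);

public static class RulePruning
{
    // First-match classification; samples covered by no rule get the default (majority) class.
    public static string Classify(IList<Rule> rules, IDictionary<string, string> sample, string defaultClass) =>
        rules.FirstOrDefault(r => r.Antecedents.All(test => test(sample)))?.Consequent ?? defaultClass;

    public static int Errors(IList<Rule> rules,
        IList<(IDictionary<string, string> sample, string label)> data, string defaultClass) =>
        data.Count(d => Classify(rules, d.sample, defaultClass) != d.label);

    // Step (2): try removing each rule; keep the removal if the training error does not increase.
    public static List<Rule> Prune(List<Rule> rules,
        IList<(IDictionary<string, string> sample, string label)> training, string defaultClass)
    {
        var kept = new List<Rule>(rules);
        foreach (var rule in rules)
        {
            var without = kept.Where(r => !ReferenceEquals(r, rule)).ToList();
            if (Errors(without, training, defaultClass) <= Errors(kept, training, defaultClass))
                kept = without;
        }
        return kept;
    }
}
```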
Aiming at the deficiencies of REP and combining it with the rule post-pruning strategy, this thesis proposes a "rule post-pruning algorithm based on rule information content and an antecedent-simplification threshold". Guided by the REP pruning strategy, the algorithm uses the information content of each rule to filter the useful rules out of the rule set, uses rule confidence to keep only the rules with high confidence, uses attribute importance to determine the order in which antecedent conditions are deleted, and uses the antecedent-simplification threshold to control the scale of antecedent simplification.