
Data Mining: Classification


Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Classification vs. Prediction

  • Classification:

    • predicts categorical class labels
    • constructs a model based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
  • Prediction:

    • models continuous-valued functions, i.e., predicts unknown or missing values
  • Typical Applications

    • credit approval
    • target marketing
    • medical diagnosis
    • treatment effectiveness analysis


Classification—A Two-Step Process

  • Model construction: describing a set of predetermined classes

    • Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    • The set of tuples used for model construction: training set
    • The model is represented as classification rules, decision trees, or mathematical formulae
  • Model usage: for classifying future or unknown objects

    • Estimate accuracy of the model
      • The known label of each test sample is compared with the classified result from the model
      • Accuracy rate is the percentage of test-set samples that are correctly classified by the model
      • The test set is independent of the training set; otherwise the accuracy estimate is overly optimistic (over-fitting); see the sketch below
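
A minimal Python sketch of this accuracy estimate (the model callable and the test lists are illustrative names, not part of the original slides):

    def accuracy(model, test_samples, test_labels):
        # Fraction of test samples whose predicted label matches the known label
        correct = sum(model(x) == y for x, y in zip(test_samples, test_labels))
        return correct / len(test_labels)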


Classification Process (1): Model Construction



Classification Process (2): Use the Model in Prediction



Supervised vs. Unsupervised Learning

  • Supervised learning (classification)

    • Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
    • New data is classified based on the training set
  • Unsupervised learning (clustering)

    • The class labels of the training data are unknown
    • Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data


Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Issues (1): Data Preparation

  • Data cleaning

    • Preprocess data in order to reduce noise and handle missing values
  • Relevance analysis (feature selection)

    • Remove the irrelevant or redundant attributes
  • Data transformation

    • Generalize and/or normalize data


Issues (2): Evaluating Classification Methods

  • Predictive accuracy

  • Speed and scalability

    • time to construct the model
    • time to use the model
  • Robustness

    • handling noise and missing values
  • Scalability

    • efficiency in disk-resident databases
  • Interpretability

    • understanding and insight provided by the model
  • Goodness of rules

    • decision tree size
    • compactness of classification rules


Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Classification by Decision Tree Induction

  • Decision tree

    • A flow-chart-like tree structure
    • Internal node denotes a test on an attribute
    • Branch represents an outcome of the test
    • Leaf nodes represent class labels or class distribution
  • Decision tree generation consists of two phases

    • Tree construction
      • At start, all the training examples are at the root
      • Partition examples recursively based on selected attributes
    • Tree pruning
      • Identify and remove branches that reflect noise or outliers
  • Use of decision tree: Classifying an unknown sample

    • Test the attribute values of the sample against the decision tree


Training Dataset



Output: A Decision Tree for “buys_computer”



Algorithm for Decision Tree Induction

  • Basic algorithm (a greedy algorithm)

    • Tree is constructed in a top-down recursive divide-and-conquer manner
    • At start, all the training examples are at the root
    • Attributes are categorical (if continuous-valued, they are discretized in advance)
    • Examples are partitioned recursively based on selected attributes
    • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  • Conditions for stopping partitioning

    • All samples for a given node belong to the same class
    • There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf
    • There are no samples left
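
A compact Python sketch of this greedy loop, under the stated assumptions (categorical attributes, information gain as the selection measure); samples are dicts of attribute values, and the helper names are illustrative:

    import math
    from collections import Counter

    def entropy(labels):
        # I(p, n), generalized to any number of classes
        total = len(labels)
        return -sum(c / total * math.log2(c / total)
                    for c in Counter(labels).values())

    def build_tree(rows, labels, attributes):
        if len(set(labels)) == 1:           # all samples in one class
            return labels[0]
        if not attributes:                  # no attributes left: majority vote
            return Counter(labels).most_common(1)[0][0]

        def gain(a):                        # information gain of splitting on a
            subsets = {}
            for row, y in zip(rows, labels):
                subsets.setdefault(row[a], []).append(y)
            rem = sum(len(s) / len(labels) * entropy(s)
                      for s in subsets.values())
            return entropy(labels) - rem

        best = max(attributes, key=gain)    # heuristic attribute selection
        tree = {}
        for v in {row[best] for row in rows}:   # partition recursively
            pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
            srows, slabels = zip(*pairs)
            tree[v] = build_tree(list(srows), list(slabels),
                                 [a for a in attributes if a != best])
        return (best, tree)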


Attribute Selection Measure

  • Information gain (ID3/C4.5)

    • All attributes are assumed to be categorical
    • Can be modified for continuous-valued attributes
  • Gini index (IBM IntelligentMiner)

    • All attributes are assumed continuous-valued
    • Assume there exist several possible split values for each attribute
    • May need other tools, such as clustering, to get the possible split values
    • Can be modified for categorical attributes


Information Gain (ID3/C4.5)

  • Select the attribute with the highest information gain

  • Assume there are two classes, P and N

    • Let the set of examples S contain p elements of class P and n elements of class N
    • The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
      I(p, n) = -(p/(p+n)) · log2(p/(p+n)) - (n/(p+n)) · log2(n/(p+n))


Information Gain in Decision Tree Induction

  • Assume that using attribute A a set S will be partitioned into sets {S1, S2 , …, Sv}

    • If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
      E(A) = Σ(i=1..v) ((pi + ni)/(p + n)) · I(pi, ni)
  • The encoding information that would be gained by branching on A is
    Gain(A) = I(p, n) - E(A)



Attribute Selection by Information Gain Computation

  • Class P: buys_computer = “yes”

  • Class N: buys_computer = “no”

  • I(p, n) = I(9, 5) = 0.940

  • Compute the entropy for age, as worked through below:
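
For the standard 14-sample buys_computer training set (age <=30: 2 yes / 3 no; age 31…40: 4 yes / 0 no; age >40: 3 yes / 2 no), a quick check in Python:

    import math

    def I(p, n):
        total = p + n
        return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

    print(round(I(9, 5), 3))                        # 0.940
    E_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
    print(round(E_age, 3))                          # 0.694
    print(round(I(9, 5) - E_age, 3))                # Gain(age) = 0.246

Hence Gain(age) = 0.940 - 0.694 = 0.246, so age would be chosen at the root in this example.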



Gini Index (IBM IntelligentMiner)

  • If a data set T contains examples from n classes, the gini index, gini(T), is defined as
    gini(T) = 1 - Σj pj²

  • where pj is the relative frequency of class j in T.

  • If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data, gini_split(T), is defined as
    gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)

  • The attribute that provides the smallest gini_split(T) is chosen to split the node (all possible splitting points for each attribute need to be enumerated); a small sketch follows below.
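
A small Python sketch of these two formulas (the label lists for the two sides of a candidate split are illustrative):

    from collections import Counter

    def gini(labels):
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(left, right):
        n = len(left) + len(right)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

    # e.g., a candidate split of 9 "yes" / 5 "no" samples:
    print(gini_split(["yes"] * 6 + ["no"], ["yes"] * 3 + ["no"] * 4))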



Extracting Classification Rules from Trees

  • Represent the knowledge in the form of IF-THEN rules

  • One rule is created for each path from the root to a leaf

  • Each attribute-value pair along a path forms a conjunction

  • The leaf node holds the class prediction

  • Rules are easier for humans to understand

  • Example

    • IF age = “<=30” AND student = “no” THEN buys_computer = “no”
    • IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
    • IF age = “31…40” THEN buys_computer = “yes”
    • IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
    • IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
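
Transcribed directly into Python, the five paths above become a plain if/else chain (a sketch; the attribute values are the strings used in the rules):

    def buys_computer(age, student=None, credit_rating=None):
        if age == "<=30":
            return "yes" if student == "yes" else "no"
        if age == "31…40":
            return "yes"
        # age == ">40"
        return "yes" if credit_rating == "excellent" else "no"

    print(buys_computer("<=30", student="yes"))   # "yes"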


Avoid Overfitting in Classification

  • The generated tree may overfit the training data

    • Too many branches, some may reflect anomalies due to noise or outliers
    • The result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting

    • Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold
      • Difficult to choose an appropriate threshold
    • Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees
      • Use a set of data different from the training data to decide which is the “best pruned tree”


Approaches to Determine the Final Tree Size

  • Separate training (2/3) and testing (1/3) sets

  • Use cross validation, e.g., 10-fold cross validation

  • Use all the data for training

    • but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
  • Use minimum description length (MDL) principle:

    • halting growth of the tree when the encoding is minimized
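
A minimal sketch of the 10-fold cross-validation option above (the train and evaluate callables are hypothetical):

    import random

    def cross_validate(samples, labels, train, evaluate, k=10, seed=0):
        idx = list(range(len(samples)))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]        # k disjoint folds
        scores = []
        for held_out in folds:
            held = set(held_out)
            tr = [i for i in idx if i not in held]
            model = train([samples[i] for i in tr], [labels[i] for i in tr])
            scores.append(evaluate(model,
                                   [samples[i] for i in held_out],
                                   [labels[i] for i in held_out]))
        return sum(scores) / k                       # average held-out score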


Enhancements to basic decision tree induction

  • Allow for continuous-valued attributes

    • Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
  • Handle missing attribute values

    • Assign the most common value of the attribute
    • Assign probability to each of the possible values
  • Attribute construction

    • Create new attributes based on existing ones that are sparsely represented
    • This reduces fragmentation, repetition, and replication


Classification in Large Databases

  • Classification—a classical problem extensively studied by statisticians and machine learning researchers

  • Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed

  • Why decision tree induction in data mining?

    • relatively faster learning speed (than other classification methods)
    • convertible to simple and easy to understand classification rules
    • can use SQL queries for accessing databases
    • comparable classification accuracy with other methods


Scalable Decision Tree Induction Methods in Data Mining Studies

  • SLIQ (EDBT’96 — Mehta et al.)

    • builds an index for each attribute; only the class list and the current attribute list reside in memory
  • SPRINT (VLDB’96 — J. Shafer et al.)

    • constructs an attribute list data structure
  • PUBLIC (VLDB’98 — Rastogi & Shim)

    • integrates tree splitting and tree pruning: stop growing the tree earlier
  • RainForest (VLDB’98 — Gehrke, Ramakrishnan & Ganti)

    • separates the scalability aspects from the criteria that determine the quality of the tree
    • builds an AVC-list (attribute, value, class label)


Data Cube-Based Decision-Tree Induction

  • Integration of generalization with decision-tree induction (Kamber et al’97).

  • Classification at primitive concept levels

    • E.g., precise temperature, humidity, outlook, etc.
    • Low-level concepts, scattered classes, bushy classification-trees
    • Semantic interpretation problems.
  • Cube-based multi-level classification

    • Relevance analysis at multi-levels.
    • Information-gain analysis with dimension + level.


Presentation of Classification Results



Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Bayesian Classification: Why?

  • Probabilistic learning: Calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems

  • Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.

  • Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities

  • Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured



Bayesian Theorem

  • Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
    P(h|D) = P(D|h) · P(h) / P(D)

  • MAP (maximum a posteriori) hypothesis:
    h_MAP = argmax_h P(h|D) = argmax_h P(D|h) · P(h)

  • Practical difficulty: require initial knowledge of many probabilities, significant computational cost



Naïve Bayes Classifier (I)

    • A simplified assumption: attributes are conditionally independent given the class:
      P(x1, …, xk | C) = P(x1|C) · … · P(xk|C)

  • Greatly reduces the computation cost: only the class distributions need to be counted.



Naive Bayesian Classifier (II)

  • Given a training set, we can compute the probabilities



Bayesian classification

  • The classification problem may be formalized using a-posteriori probabilities:

  • P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C.

  • E.g. P(class=N | outlook=sunny,windy=true,…)

  • Idea: assign to sample X the class label C such that P(C|X) is maximal



Estimating a-posteriori probabilities

  • Bayes theorem:

  • P(C|X) = P(X|C)·P(C) / P(X)

  • P(X) is constant for all classes

  • P(C) = relative freq of class C samples

  • C such that P(C|X) is maximum = C such that P(X|C)·P(C) is maximum

  • Problem: computing P(X|C) is infeasible!



Naïve Bayesian Classification

  • Naïve assumption: attribute independence

  • P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)

  • If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi for the i-th attribute in class C

  • If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function

  • Computationally easy in both cases; a small sketch follows below
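
A minimal categorical naive Bayes sketch along these lines (illustrative helper names; rows are dicts of attribute values, and no smoothing is applied, so unseen values give probability 0):

    from collections import Counter, defaultdict

    def train_nb(rows, labels):
        class_counts = Counter(labels)                 # for P(C)
        value_counts = defaultdict(Counter)            # for P(xi|C)
        for row, c in zip(rows, labels):
            for a, v in row.items():
                value_counts[(a, c)][v] += 1
        return class_counts, value_counts

    def classify_nb(model, row):
        class_counts, value_counts = model
        n = sum(class_counts.values())

        def score(c):                                  # P(C) * prod P(xi|C)
            p = class_counts[c] / n
            for a, v in row.items():
                p *= value_counts[(a, c)][v] / class_counts[c]
            return p

        return max(class_counts, key=score)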



Play-tennis example: estimating P(xi|C)



Play-tennis example: classifying X

  • An unseen sample: X = <rain, hot, high, false>

  • P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582

  • P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286

  • Sample X is classified in class n (don’t play)



The independence hypothesis…

  • … makes computation possible

  • … yields optimal classifiers when satisfied

  • … but is seldom satisfied in practice, as attributes (variables) are often correlated.

  • Attempts to overcome this limitation:

    • Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
    • Decision trees, which reason on one attribute at a time, considering the most important attributes first


Bayesian Belief Networks (I)



Bayesian Belief Networks (II)

  • A Bayesian belief network allows a subset of the variables to be conditionally independent

  • A graphical model of causal relationships

  • Several cases of learning Bayesian belief networks

    • Given both network structure and all the variables: easy
    • Given network structure but only some variables
    • When the network structure is not known in advance


Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Neural Networks

  • Advantages

    • prediction accuracy is generally high
    • robust, works when training examples contain errors
    • output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
    • fast evaluation of the learned target function
  • Criticism

    • long training time
    • difficult to understand the learned function (weights)
    • not easy to incorporate domain knowledge


A Neuron

  • The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping, e.g., y = f(Σi wi·xi - μk), where the wi are weights, μk is a bias (threshold), and f is a nonlinear activation function



Network Training

  • The ultimate objective of training

    • obtain a set of weights that classifies almost all the tuples in the training data correctly
  • Steps

    • Initialize weights with random values
    • Feed the input tuples into the network one by one
    • For each unit
      • Compute the net input to the unit as a linear combination of all the inputs to the unit
      • Compute the output value using the activation function
      • Compute the error
      • Update the weights and the bias
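
A single-neuron sketch of these steps (sigmoid activation and a delta-rule update; the learning rate and toy data are illustrative assumptions):

    import math, random

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train(samples, targets, epochs=1000, lr=0.5, seed=0):
        rng = random.Random(seed)
        w = [rng.uniform(-0.5, 0.5) for _ in samples[0]]   # random initial weights
        b = rng.uniform(-0.5, 0.5)                         # bias
        for _ in range(epochs):
            for x, t in zip(samples, targets):             # one tuple at a time
                net = sum(wi * xi for wi, xi in zip(w, x)) + b  # net input
                y = sigmoid(net)                           # output via activation
                delta = (t - y) * y * (1 - y)              # error term
                w = [wi + lr * delta * xi for wi, xi in zip(w, x)]  # update weights
                b += lr * delta                            # update bias
        return w, b

    w, b = train([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 0, 1])  # learn AND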


Multi-Layer Perceptron



Network Pruning and Rule Extraction

  • Network pruning

    • Fully connected network will be hard to articulate
    • N input nodes, h hidden nodes and m output nodes lead to h(m+N) weights
    • Pruning: Remove some of the links without affecting classification accuracy of the network
  • Extracting rules from a trained network

    • Discretize activation values; replace individual activation values by the cluster average while maintaining the network accuracy
    • Enumerate the output from the discretized activation values to find rules between activation value and output
    • Find the relationship between the input and activation value
    • Combine the above two to have rules relating the output to input


Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Association-Based Classification

  • Several methods for association-based classification

    • ARCS: Quantitative association mining and clustering of association rules (Lent et al’97)
      • It beats C4.5 in (mainly) scalability and also accuracy
    • Associative classification: (Liu et al’98)
      • It mines high support and high confidence rules in the form of “cond_set => y”, where y is a class label
    • CAEP (Classification by aggregating emerging patterns) (Dong et al’99)
      • Emerging patterns (EPs): the itemsets whose support increases significantly from one class to another
      • Mine EPs based on minimum support and growth rate


Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Other Classification Methods

  • k-nearest neighbor classifier

  • Case-based reasoning

  • Genetic algorithms

  • Rough set approach

  • Fuzzy set approaches



Instance-Based Methods

  • Instance-based learning:

    • Store training examples and delay the processing (“lazy evaluation”) until a new instance must be classified
  • Typical approaches

    • k-nearest neighbor approach
      • Instances represented as points in a Euclidean space.
    • Locally weighted regression
      • Constructs local approximation
    • Case-based reasoning
      • Uses symbolic representations and knowledge-based inference


The k-Nearest Neighbor Algorithm

  • All instances correspond to points in the n-D space.

  • The nearest neighbors are defined in terms of Euclidean distance.

  • The target function could be discrete- or real-valued.

  • For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq.

  • Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
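
A minimal k-NN sketch for the discrete-valued case (points as coordinate tuples; k and the toy data are illustrative):

    import math
    from collections import Counter

    def knn_classify(points, labels, xq, k=3):
        # Sort training points by Euclidean distance to the query xq
        nearest = sorted(zip(points, labels),
                         key=lambda pl: math.dist(pl[0], xq))[:k]
        # Majority vote among the k nearest neighbors
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    print(knn_classify([(0, 0), (1, 0), (5, 5)], ["N", "N", "P"], (0.5, 0.2)))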



Discussion on the k-NN Algorithm

  • The k-NN algorithm for continuous-valued target functions

    • Calculate the mean values of the k nearest neighbors
  • Distance-weighted nearest neighbor algorithm

    • Weight the contribution of each of the k neighbors according to their distance to the query point xq
      • giving greater weight to closer neighbors
    • Similarly, for real-valued target functions
  • Robust to noisy data by averaging k-nearest neighbors

  • Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes.

    • To overcome it, stretch the axes or eliminate the least relevant attributes.


Case-Based Reasoning

  • Also uses: lazy evaluation + analyze similar instances

  • Difference: Instances are not “points in a Euclidean space”

  • Example: Water faucet problem in CADET (Sycara et al’92)

  • Methodology

    • Instances represented by rich symbolic descriptions (e.g., function graphs)
    • Multiple retrieved cases may be combined
    • Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
  • Research issues

    • Indexing based on syntactic similarity measures; on failure, backtracking and adapting to additional cases


Remarks on Lazy vs. Eager Learning

  • Instance-based learning: lazy evaluation

  • Decision-tree and Bayesian classification: eager evaluation

  • Key differences

    • Lazy method may consider query instance xq when deciding how to generalize beyond the training data D
    • Eager methods cannot, since they have already committed to a global approximation by the time they see the query
  • Efficiency: lazy methods spend less time training but more time predicting

  • Accuracy

    • Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form its implicit global approximation to the target function
    • Eager: must commit to a single hypothesis that covers the entire instance space


Genetic Algorithms

  • GA: based on an analogy to biological evolution

  • Each rule is represented by a string of bits

  • An initial population is created consisting of randomly generated rules

    • e.g., IF A1 AND NOT A2 THEN C2 can be encoded as “100”
  • Based on the notion of survival of the fittest, a new population is formed, consisting of the fittest rules and their offspring

  • The fitness of a rule is represented by its classification accuracy on a set of training examples

  • Offspring are generated by crossover and mutation
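
A tiny sketch of the two genetic operators on bit-string rules such as “100” (single-point crossover and bit-flip mutation; the mutation rate is illustrative):

    import random

    rng = random.Random(0)

    def crossover(a, b):
        cut = rng.randrange(1, len(a))      # single crossover point
        return a[:cut] + b[cut:], b[:cut] + a[cut:]

    def mutate(rule, rate=0.1):
        return "".join(bit if rng.random() > rate else str(1 - int(bit))
                       for bit in rule)

    child1, child2 = crossover("100", "011")
    print(child1, child2, mutate(child1))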



Rough Set Approach

  • Rough sets are used to approximately or “roughly” define equivalence classes

  • A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)

  • Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix can be used to reduce the computational effort



Fuzzy Sets

  • Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (e.g., via a fuzzy membership graph)

  • Attribute values are converted to fuzzy values

    • e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
  • For a given new sample, more than one fuzzy value may apply

  • Each applicable rule contributes a vote for membership in the categories

  • Typically, the truth values for each predicted category are summed
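
A sketch of the income example above (the piecewise-linear membership functions and their breakpoints are illustrative assumptions):

    def fuzzy_income(x):
        # Degrees of membership in {low, medium, high}; they need not be 0 or 1
        low = max(0.0, min(1.0, (30000 - x) / 10000))
        high = max(0.0, min(1.0, (x - 40000) / 10000))
        medium = max(0.0, 1.0 - low - high)
        return {"low": low, "medium": medium, "high": high}

    print(fuzzy_income(28000))   # partly "low" (0.2), mostly "medium" (0.8)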



Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



What Is Prediction?

  • Prediction is similar to classification

    • First, construct a model
    • Second, use model to predict unknown value
      • Major method for prediction is regression
        • Linear and multiple regression
        • Non-linear regression
  • Prediction is different from classification

    • Classification predicts categorical class labels
    • Prediction models continuous-valued functions


Predictive Modeling in Databases

  • Predictive modeling: Predict data values or construct generalized linear models based on the database data.

  • One can only predict value ranges or category distributions

  • Method outline:

    • Minimal generalization
    • Attribute relevance analysis
    • Generalized linear model construction
    • Prediction
  • Determine the major factors which influence the prediction

    • Data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
  • Multi-level prediction: drill-down and roll-up analysis



Regression Analysis and Log-Linear Models in Prediction

  • Linear regression: Y = α + β X

    • Two parameters, α and β, specify the line and are to be estimated by using the data at hand
    • using the least squares criterion on the known values of Y1, Y2, …, X1, X2, …
  • Multiple regression: Y = b0 + b1 X1 + b2 X2

    • Many nonlinear functions can be transformed into the above
  • Log-linear models:

    • The multi-way table of joint probabilities is approximated by a product of lower-order tables
    • Probability: p(a, b, c, d) = α_ab · β_ac · χ_ad · δ_bcd, where each factor comes from a lower-order table
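
A least-squares sketch for the linear case (closed-form estimates of α and β; the toy data are illustrative):

    def fit_line(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                / sum((x - mx) ** 2 for x in xs))
        alpha = my - beta * mx               # the line passes through the means
        return alpha, beta

    alpha, beta = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
    print(alpha, beta)                       # roughly Y = 0.15 + 1.94 X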


Prediction: Numerical Data



Prediction: Categorical Data



Classification and Prediction

  • What is classification? What is prediction?

  • Issues regarding classification and prediction

  • Classification by decision tree induction

  • Bayesian Classification

  • Classification by backpropagation

  • Classification based on concepts from association rule mining

  • Other Classification Methods

  • Prediction

  • Classification accuracy

  • Summary



Classification Accuracy: Estimating Error Rates

