Title: Data Mining - Knowledge Discovery
Description: Data Mining - Knowledge Discovery

Data Mining - Knowledge Discovery
What is Knowledge Discovery?
Some people do not differentiate between data mining and
knowledge discovery; others consider data mining to
be an essential step in the knowledge discovery process
...

Data Integration − In this step, multiple data sources
are combined
...

Data Transformation − In this step, data is transformed
or consolidated into forms appropriate for mining by
performing summary or aggregation operations
...

Pattern Evaluation − In this step, data patterns are
evaluated
...
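
As a rough illustration of the data integration and data transformation steps listed above, the sketch below combines two sources and then aggregates them into monthly totals. The record layout, the sample values, and the function names `integrate` and `monthly_totals` are assumptions made for this example; they do not come from the notes.

```python
from collections import defaultdict

# Hypothetical records from two separate sources (illustrative data only).
store_sales = [
    {"date": "2024-01-05", "item": "laptop", "amount": 900.0},
    {"date": "2024-01-20", "item": "mouse", "amount": 25.0},
]
online_sales = [
    {"date": "2024-01-11", "item": "laptop", "amount": 950.0},
    {"date": "2024-02-02", "item": "mouse", "amount": 20.0},
]


def integrate(*sources):
    """Data integration: combine several sources into one list of records."""
    combined = []
    for source in sources:
        combined.extend(source)
    return combined


def monthly_totals(records):
    """Data transformation: aggregate amounts per (month, item) pair."""
    totals = defaultdict(float)
    for record in records:
        month = record["date"][:7]  # keep only "YYYY-MM"
        totals[(month, record["item"])] += record["amount"]
    return dict(totals)


all_sales = integrate(store_sales, online_sales)
summary = monthly_totals(all_sales)
```

The aggregated `summary` is the kind of consolidated form that a later mining step could work on instead of the raw records.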


[Diagram: the process of knowledge discovery]

Data Mining - Systems
A wide range of data mining systems are on the market
...


Classification Based on the Databases Mined
A data mining system can be categorized based on the types of
databases it mines
...
Additionally, the
data mining system can be categorized appropriately
...

Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of
knowledge mined
...
We can describe these techniques according to
the degree of user interaction involved or the methods of
analysis employed
...
These applications are as follows −
• Finance
• Telecommunications
• DNA
• Stock Markets
• E-mail
Integrating a Data Mining System with a DB/DW System


If a data mining system is not integrated with a database or
data warehouse system, there will be no system for it to
communicate with
...

The development of efficient and effective algorithms for mining
the given data sets is the primary emphasis of this strategy
...
It fetches the data from a particular source
and processes that data using some data mining
algorithms
...

Loose Coupling − In this scheme, the data mining
system may use some of the functions of a database and
data warehouse system
...
It then stores the mining result either in a file or in a
designated place in a database or in a data warehouse
...

Tight coupling − In this coupling scheme, the data
mining system is smoothly integrated into the
database or data warehouse system
...
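
The loose coupling scheme described above can be sketched with Python's built-in `sqlite3` module: the database fetches the data, the mining runs in application code, and the result is written back to a designated table. The table names `purchases` and `mining_result`, and the use of frequency counting as a stand-in for a real mining algorithm, are assumptions made for this example.

```python
import sqlite3
from collections import Counter


def mine_item_counts(conn):
    """Fetch rows through the database, mine them outside it, store the result."""
    # Step 1: use the database's own query facility to fetch the data.
    rows = conn.execute("SELECT item FROM purchases").fetchall()
    # Step 2: stand-in mining step (frequency counting) in application code.
    counts = Counter(item for (item,) in rows)
    # Step 3: store the mining result in a designated place in the database.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mining_result "
        "(item TEXT PRIMARY KEY, freq INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO mining_result VALUES (?, ?)", counts.items()
    )
    conn.commit()
    return counts


# Hypothetical in-memory database with a few purchases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (cust_id INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(1, "laptop"), (2, "mouse"), (3, "laptop")],
)
result = mine_item_counts(conn)
stored = dict(conn.execute("SELECT item, freq FROM mining_result").fetchall())
```

In a tightly coupled design the counting itself would run inside the database system rather than in a separate program.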


Data Mining - Query Language
Han, Fu, Wang, and others suggested the Data Mining Query
Language (DMQL) for the DBMiner data mining system
...
Ad hoc and interactive data
mining can be supported via data mining query languages
...

DMQL can work with both databases and data
warehouses
...
In
particular, we look at how to define data marts and warehouses
in DMQL
...

Characterization
The syntax for characterization is −
mine characteristics [as pattern_name]
analyze {measure(s) }
The analyze clause specifies aggregate measures, such as count,
sum, or count%
...

mine characteristics as customerPurchasing
analyze count%
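
To make the count and count% measures concrete, here is a small Python sketch that computes both for each distinct value in a column. It does not execute DMQL; the function name `count_percent` and the sample item types are assumptions made for this example.

```python
from collections import Counter


def count_percent(values):
    """Return {value: (count, percentage of all values)} for a column."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: (n, 100.0 * n / total) for value, n in counts.items()}


# Hypothetical item types bought by the customers being characterized.
item_types = ["computer", "computer", "printer", "phone"]
profile = count_percent(item_types)
```

Here `profile["computer"]` holds the raw count together with its share of all purchases, which is what count% reports.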
Discrimination
The syntax for Discrimination is −
mine comparison [as {pattern_name}]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}
For example, a user may define big spenders as customers who
purchase items that cost $100 or more on average, and
budget spenders as customers who purchase items at less than
$100 on average
...
price) ≥$100
versus budgetSpenders where avg(I
...
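
The split between big spenders and budget spenders can be illustrated in Python by comparing each customer's average item price against the $100 threshold. The function name `split_spenders` and the purchase data are assumptions made for this example, and every customer is assumed to have at least one purchase.

```python
from statistics import mean


def split_spenders(purchases, threshold=100.0):
    """Split customers into (big spenders, budget spenders) by average price.

    purchases maps a customer name to a non-empty list of item prices.
    """
    big, budget = [], []
    for customer, prices in purchases.items():
        if mean(prices) >= threshold:
            big.append(customer)      # target class: average >= $100
        else:
            budget.append(customer)   # contrast class: average < $100
    return big, budget


# Hypothetical purchase histories.
purchases = {"ann": [250.0, 80.0], "bob": [40.0, 60.0], "cy": [100.0]}
big, budget = split_spenders(purchases)
```

Once the two classes are formed, a discrimination task compares their measures side by side.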

Classification
The syntax for Classification is −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

For example, to mine patterns classifying customer credit rating,
where the classes are determined by the attribute credit_rating
and the mined classification is named
classifyCustomerCreditRating
...
, 39} < level1: young
level3: {40,
...
, 89} < level1: senior
-operation-derived hierarchies

define hierarchy age_hierarchy for age on customer as
{age_category(1),
...
05
with confidence threshold = 0
...

display as {result_form}
For example −
display as table

Full Specification of DMQL
As a market manager of a company, you would like to
characterize the buying habits of customers who can purchase
items priced at no less than $100, with respect to the customer's
age, the type of item purchased, and the place where the item was
purchased
...
In particular, you are only interested
in purchases made in Canada, and paid with an American Express
credit card
...

use database AllElectronics_db
use hierarchy location_hierarchy for B
...
age,I
...
place_made
from customer C, item I, purchase P, items_sold S, branch B
where I
...
item_ID and P
...
cust_ID and
P
...
address = "Canada" and I
...

Improves interoperability among multiple data mining
systems and functions
...

Promotes the use of data mining systems in industry
and society
...
These two forms are as follows −
• Classification
• Prediction

Classification models predict categorical class labels, and
prediction models predict continuous-valued functions
...

What is classification?
The following are examples of cases where the data analysis task
is Classification −
• A bank loan officer wants to analyze the data in order
to know which customers (loan applicants) are risky and
which are safe
...


In both of the above examples, a model or classifier is
constructed to predict the categorical labels
...

What is prediction?
Following are the examples of cases where the data analysis task
is Prediction −
Suppose the marketing manager needs to predict how much a
given customer will spend during a sale at his company
...
Therefore
the data analysis task is an example of numeric prediction
...

Note − Regression analysis is a statistical methodology that is
most often used for numeric prediction
...
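
As a minimal sketch of numeric prediction by regression, the code below fits a straight line by ordinary least squares and uses it to predict a spending amount. Simple one-variable linear regression is chosen here only for illustration; the function name `fit_line` and the income and spending figures are assumptions, not data from the notes.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all identical
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical customers: income (in thousands) and amount spent in a sale.
incomes = [20.0, 30.0, 40.0, 50.0]
spend = [110.0, 160.0, 210.0, 260.0]

slope, intercept = fit_line(incomes, spend)
predicted = slope * 35.0 + intercept  # predicted spend for an income of 35
```

The output is a continuous value rather than a class label, which is what separates prediction from classification.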

The Data Classification process includes two steps −
• Building the Classifier or Model
• Using Classifier for Classification
Building the Classifier or Model
• This step is the learning step or the learning phase
...

• The classifier is built from the training set made up of
database tuples and their associated class labels
...
These tuples can also be
referred to as samples, objects, or data points
...
Here the test
data is used to estimate the accuracy of classification rules
...
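
The two steps can be sketched in Python as follows: a classifier is built from labeled training tuples, and held-out test data is then used to estimate its accuracy. A 1-nearest-neighbour classifier is used purely as a simple stand-in; the function names, the (age, income) features, and the labels are assumptions made for this example.

```python
def build_classifier(training_set):
    """Learning step: for 1-nearest-neighbour, just keep the labeled tuples."""
    return list(training_set)


def classify(model, features):
    """Classification step: return the label of the closest training tuple."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = min(model, key=lambda pair: squared_distance(pair[0], features))
    return nearest[1]


def accuracy(model, test_set):
    """Fraction of test tuples whose predicted label matches the true label."""
    correct = sum(
        1 for features, label in test_set if classify(model, features) == label
    )
    return correct / len(test_set)


# Hypothetical loan applicants: (age, income in thousands) -> class label.
training_set = [
    ((25, 20.0), "risky"),
    ((30, 25.0), "risky"),
    ((45, 80.0), "safe"),
    ((50, 90.0), "safe"),
]
test_set = [
    ((28, 22.0), "risky"),
    ((48, 85.0), "safe"),
    ((35, 70.0), "risky"),
]

model = build_classifier(training_set)
acc = accuracy(model, test_set)
```

Keeping the test tuples separate from the training tuples matters, because accuracy measured on the training data itself tends to be optimistic.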


Classification and Prediction Issues
The major issue is preparing the data for Classification and
Prediction
...
The noise is
removed by applying smoothing techniques, and the
problem of missing values is solved by replacing a
missing value with the most commonly occurring value for
that attribute
...
Correlation analysis is used to
know whether any two given attributes are related
...

o Normalization − The data is transformed
using normalization
...
Normalization is used when neural networks or
methods involving measurements are used in the
learning step
...
For this purpose we can use the
concept hierarchies
...
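
Two of the preparation steps above can be sketched in Python: filling a missing value with the most commonly occurring value for the attribute, and normalizing a numeric attribute. Min-max scaling to the range 0 to 1 is only one possible normalization and is an assumption here, as are the function names and the sample values.

```python
from collections import Counter


def fill_missing_with_mode(values, missing=None):
    """Replace missing entries with the most common non-missing value."""
    present = [v for v in values if v != missing]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v == missing else v for v in values]


def min_max_normalize(values):
    """Scale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute columns.
ratings = ["good", None, "good", "fair", None]
incomes = [20.0, 50.0, 80.0]

cleaned = fill_missing_with_mode(ratings)
scaled = min_max_normalize(incomes)
```

Scaling keeps an attribute with large values, such as income, from dominating an attribute with small values, such as age.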


Comparison of Classification and Prediction Methods
Here are the criteria for comparing the methods of Classification
and Prediction −

Accuracy − The accuracy of a classifier refers to the ability of
the classifier
...

Speed − This refers to the computational cost in
generating and using the classifier or predictor
...

Scalability − Scalability refers to the ability to construct
the classifier or predictor efficiently, given a large
amount of data
...


Data Mining - Decision Tree Induction
A decision tree is a structure that includes a root node, branches,
and leaf nodes
...
The topmost node in the tree is the
root node
...
Each internal node represents a test on an
attribute
...


The benefits of having a decision tree are as follows −
• It does not require any domain knowledge
...

• The learning and classification steps of a decision tree
are simple and fast
...
Ross Quinlan in 1980 developed
a decision tree algorithm known as ID3 (Iterative Dichotomiser)
...
5, which was the successor of ID3
...
5 adopt a greedy approach
...

Generating a decision tree from training tuples of data partition
D
Algorithm : Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples
and their associated class labels
...

Attribute selection method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes
...

Output:
A Decision Tree
Method
create a node N;
if tuples in D are all of the same class, C then
   return N as a leaf node labeled with class C;
if attribute_list is empty then
   return N as a leaf node labeled
   with the majority class in D; // majority voting

apply attribute_selection_method(D, attribute_list)
to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and
   multiway splits allowed then // not restricted to binary trees
   attribute_list = attribute_list - splitting_attribute; // remove splitting attribute
for each outcome j of splitting_criterion
   // partition the tuples and grow subtrees for each partition
   let Dj be the set of data tuples in D satisfying outcome j; // a partition
   if Dj is empty then
      attach a leaf labeled with the majority
      class in D to node N;
   else
      attach the node returned by
      Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
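
The algorithm above can be sketched in Python for discrete-valued attributes with multiway splits. Information gain is used as the attribute selection method, which is an assumption made for this example (the pseudocode leaves the method open). The dictionary-based tree, the function names, and the sample tuples are likewise illustrative.

```python
import math
from collections import Counter


def entropy(rows):
    """Entropy of the class labels in rows, a list of (features, label)."""
    counts = Counter(label for _, label in rows)
    total = len(rows)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def majority_class(rows):
    """Most common class label in rows (majority voting)."""
    return Counter(label for _, label in rows).most_common(1)[0][0]


def best_attribute(rows, attributes):
    """Attribute selection method: pick the highest information gain."""
    def gain(attr):
        partitions = {}
        for features, label in rows:
            partitions.setdefault(features[attr], []).append((features, label))
        remainder = sum(
            len(part) / len(rows) * entropy(part) for part in partitions.values()
        )
        return entropy(rows) - remainder

    return max(attributes, key=gain)


def generate_decision_tree(rows, attributes):
    """Return a class label (leaf) or a dict describing an internal node."""
    labels = {label for _, label in rows}
    if len(labels) == 1:          # all tuples belong to the same class
        return labels.pop()
    if not attributes:            # no attributes left: majority voting
        return majority_class(rows)
    attr = best_attribute(rows, attributes)
    remaining = [a for a in attributes if a != attr]  # remove splitting attribute
    node = {"attribute": attr, "default": majority_class(rows), "branches": {}}
    for value in {features[attr] for features, _ in rows}:
        subset = [(f, lab) for f, lab in rows if f[attr] == value]
        node["branches"][value] = generate_decision_tree(subset, remaining)
    return node


def predict(tree, features):
    """Walk the tree; unseen attribute values fall back to the majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(features[tree["attribute"]], tree["default"])
    return tree


# Hypothetical training tuples: does the customer buy a computer?
rows = [
    ({"income": "high", "student": "no"}, "no"),
    ({"income": "high", "student": "yes"}, "yes"),
    ({"income": "low", "student": "no"}, "no"),
    ({"income": "low", "student": "yes"}, "yes"),
]
tree = generate_decision_tree(rows, ["income", "student"])
answer = predict(tree, {"income": "low", "student": "yes"})
```

Because this sketch only branches on values that occur in the partition, the empty-partition case of the pseudocode never arises during construction; the `default` entry plays that role at prediction time instead.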
Tree Pruning
Tree pruning is performed in order to remove anomalies in the
training data due to noise or outliers
...

Tree Pruning Approaches
There are two approaches to pruning a tree −

Pre-pruning − The tree is pruned by halting its
construction early
...

Cost Complexity
The cost complexity is measured by the following two
parameters −
• Number of leaves in the tree, and
• Error rate of the tree
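
One common way to combine these two parameters is to add a penalty per leaf to the error rate, as in CART-style cost complexity pruning. The notes do not give a formula, so the form below, the weight `alpha`, and the example figures are all assumptions made for illustration.

```python
def cost_complexity(error_rate, num_leaves, alpha=0.01):
    """Error rate plus a penalty of alpha per leaf (assumed formulation)."""
    return error_rate + alpha * num_leaves


# Hypothetical trees: the pruned tree makes slightly more errors
# but has far fewer leaves.
full_tree = cost_complexity(error_rate=0.10, num_leaves=20)
pruned_tree = cost_complexity(error_rate=0.12, num_leaves=8)
prefer_pruned = pruned_tree < full_tree
```

With this weighting the smaller tree scores lower, so a pruning procedure would keep it; a smaller `alpha` would favor the larger tree.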