Title: Data Mining - Bayesian Classification
Description: Data Mining - Bayesian Classification
Data Mining - Bayesian Classification
Bayes' theorem is the cornerstone of Bayesian
classification
...
Bayesian classifiers can predict class membership
probabilities, such as the probability that a given tuple
belongs to a particular class
...
There are two
types of probabilities −
• Posterior Probability [P(H/X)]
• Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis
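As a rough illustration of how the prior and the posterior relate, Bayes' theorem can be written as P(H/X) = P(X/H) P(H) / P(X). The sketch below applies that formula; the function name and all the numbers are made up for illustration and do not come from the notes.

```python
def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Return P(H/X) = P(X/H) * P(H) / P(X)."""
    if evidence_x <= 0:
        raise ValueError("P(X) must be positive")
    return likelihood_x_given_h * prior_h / evidence_x


# Hypothetical values, chosen only to show the calculation.
p_h = 0.3              # prior probability P(H)
p_x_given_h = 0.8      # P(X/H): probability of the tuple X if H holds
p_x_given_not_h = 0.2  # P(X/not H)

# P(X) by the law of total probability.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

p_h_given_x = posterior(p_h, p_x_given_h, p_x)
print(f"P(H) = {p_h}, P(H/X) = {p_h_given_x:.3f}")
```

With these assumed numbers, observing X raises the belief in H above its prior value.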
...
They are also known as Belief Networks, Bayesian
Networks, or Probabilistic Networks
...
It provides a graphical model of causal relationships on
which learning can be performed
...
There are two components that define a Bayesian Belief
Network −
• Directed acyclic graph
• A set of conditional probability tables
Directed Acyclic Graph
• Each node in a directed acyclic graph represents a
random variable
...
• These variables may correspond to the actual
attribute given in the data
...
The arcs in the diagram allow causal knowledge to be represented
...
Once we know that the patient has lung cancer, the variable
PositiveXray is independent of the patient's smoking status and
family history of lung cancer
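This conditional independence can be checked numerically. The sketch below assumes a four-node network in which FamilyHistory and Smoker are the parents of LungCancer, and LungCancer is the only parent of PositiveXray. That structure and every probability in the tables are assumptions made for illustration; the notes do not give them.

```python
from itertools import product

# Hypothetical conditional probability tables (not taken from the notes).
P_FH = {True: 0.1, False: 0.9}            # P(FamilyHistory)
P_S = {True: 0.3, False: 0.7}             # P(Smoker)
P_LC = {                                  # P(LungCancer=True | FH, S)
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}
P_PX = {True: 0.9, False: 0.2}            # P(PositiveXray=True | LungCancer)


def joint(fh, s, lc, px):
    """Joint probability as the product of each node given its parents."""
    p_lc = P_LC[(fh, s)] if lc else 1 - P_LC[(fh, s)]
    p_px = P_PX[lc] if px else 1 - P_PX[lc]
    return P_FH[fh] * P_S[s] * p_lc * p_px


def prob_positive_xray(lc, **evidence):
    """P(PositiveXray=True | LungCancer=lc, evidence) by enumeration.

    `evidence` may fix `fh` and/or `s` to True or False.
    """
    numerator = denominator = 0.0
    for fh, s, px in product([True, False], repeat=3):
        assignment = {"fh": fh, "s": s}
        if any(assignment[name] != value for name, value in evidence.items()):
            continue
        p = joint(fh, s, lc, px)
        denominator += p
        if px:
            numerator += p
    return numerator / denominator


print(prob_positive_xray(True))
print(prob_positive_xray(True, s=True))
print(prob_positive_xray(True, s=False, fh=True))
```

All three calls give the same value (up to floating-point rounding), because once LungCancer is known the other two variables add no information about PositiveXray.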
...
We can express a rule in the following form −
IF condition THEN conclusion
Let us consider a rule R1,
R1: IF age = youth AND student = yes
THEN buy_computer = yes
Points to remember −
• The IF part of the rule is called the rule
antecedent or precondition
...
• The antecedent part, the condition, consists of one or
more attribute tests, and these tests are logically
ANDed
...
Note − We can also write rule R1 as follows −
R1: (age = youth) ^ (student = yes) ⇒ (buy_computer = yes)
If the condition holds true for a given tuple, then the antecedent
is satisfied
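To make the idea of a satisfied antecedent concrete, here is a minimal sketch that stores rule R1 as a dictionary of attribute tests and checks it against tuples. The data layout, the function names, and the two sample tuples are assumptions made for this illustration.

```python
# Rule R1: IF age = youth AND student = yes THEN buy_computer = yes
rule_r1 = {
    "antecedent": {"age": "youth", "student": "yes"},
    "consequent": ("buy_computer", "yes"),
}


def covers(rule, data_tuple):
    """True if every attribute test in the antecedent holds (logical AND)."""
    return all(
        data_tuple.get(attribute) == value
        for attribute, value in rule["antecedent"].items()
    )


def apply_rule(rule, data_tuple):
    """Return the rule's conclusion if the tuple is covered, else None."""
    return rule["consequent"] if covers(rule, data_tuple) else None


# Hypothetical tuples.
x = {"age": "youth", "student": "yes", "income": "medium"}
y = {"age": "senior", "student": "yes", "income": "low"}

print(apply_rule(rule_r1, x))  # antecedent satisfied
print(apply_rule(rule_r1, y))  # antecedent not satisfied
```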
...
Points to remember −
To extract a rule from a decision tree −
• One rule is created for each path from the root to the
leaf node
...
• The leaf node holds the class prediction, forming the
rule consequent
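The path-to-rule idea can be sketched as a short recursive walk. The tree below is a hypothetical example stored as nested dictionaries (an internal node has an `attr` and `branches`, a leaf is just a class label); neither the tree nor this representation comes from the notes.

```python
def extract_rules(node, conditions=()):
    """Return one (antecedent, class_label) pair per root-to-leaf path."""
    if not isinstance(node, dict):
        # Leaf: the accumulated tests form the antecedent,
        # and the leaf's label is the rule consequent.
        return [(list(conditions), node)]
    rules = []
    for value, child in node["branches"].items():
        test = (node["attr"], value)
        rules.extend(extract_rules(child, conditions + (test,)))
    return rules


# Hypothetical decision tree for the class attribute buy_computer.
tree = {
    "attr": "age",
    "branches": {
        "youth": {"attr": "student", "branches": {"yes": "yes", "no": "no"}},
        "middle_aged": "yes",
        "senior": {
            "attr": "credit_rating",
            "branches": {"fair": "yes", "excellent": "no"},
        },
    },
}

rules = extract_rules(tree)
for antecedent, label in rules:
    tests = " AND ".join(f"{attr} = {value}" for attr, value in antecedent)
    print(f"IF {tests} THEN buy_computer = {label}")
```

This tree has five leaves, so the walk produces five rules, one per path.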
...
We don't need to initially
create a decision tree
...
Sequential Covering Algorithms include AQ, CN2, and RIPPER,
among others
...
Each time a rule is learned, the tuples covered by that
rule are removed, and the procedure is repeated for the
remaining tuples
...
Note − Decision tree induction can be considered as learning
a set of rules simultaneously
...
When learning a rule for a
class Ci, we want the rule to cover all the tuples from class Ci only
and no tuple from any other class
...
Output: A Set of IF-THEN rules
...
The rule may perform well on training
data but less well on subsequent data
...
The rule is pruned by removing a conjunct
...
FOIL is one of the simple and effective methods for rule pruning
...
Note − This value will increase with the accuracy of R on the
pruning set
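The pruning measure itself is not shown in these extracts. A commonly cited textbook form is FOIL_Prune(R) = (pos − neg) / (pos + neg), where pos and neg are the numbers of positive and negative tuples that R covers in the pruning set. The sketch below assumes that form; the function name and the counts are illustrative only.

```python
def foil_prune(pos, neg):
    """FOIL_Prune(R) = (pos - neg) / (pos + neg) on the pruning set.

    pos: positive tuples covered by rule R
    neg: negative tuples covered by rule R
    """
    if pos + neg == 0:
        raise ValueError("the rule covers no tuples in the pruning set")
    return (pos - neg) / (pos + neg)


# Hypothetical counts for a rule R and for R with one conjunct removed.
original_score = foil_prune(pos=8, neg=2)
pruned_score = foil_prune(pos=18, neg=2)

# Keep the pruned version only if its score is higher.
keep_pruned = pruned_score > original_score
print(original_score, pruned_score, keep_pruned)
```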
...
Miscellaneous Classification Methods
Here we will discuss other classification methods such as Genetic
Algorithms, Rough Set Approach, and Fuzzy Set Approach
...
In a genetic algorithm, an initial population is created
first
...
Each rule can be represented by a string of
bits
...
Additionally,
this particular training set includes the classes C1 and C2
...
In this bit representation, the two leftmost bits
represent the attributes A1 and A2, respectively
...
Note − If the attribute has K values where K>2, then we can use
K bits to encode the attribute values
...
Points to remember −
• Based on the notion of survival of the fittest, a new
population is formed that consists of the fittest rules in
the current population as well as the offspring of these
rules
...
• The genetic operators such as crossover and mutation
are applied to create offspring
...
• In mutation, randomly selected bits in a rule's string
are inverted
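A minimal sketch of the two operators on bit-string rules follows. It assumes single-point crossover (the parents swap their tails at a chosen position), which is one common variant and not necessarily the one the notes have in mind. The bit strings, the mutation rate, and the function names are illustrative.

```python
import random


def crossover(parent_a, parent_b, point):
    """Single-point crossover: swap the substrings after `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b


def mutate(rule, rate, rng):
    """Invert each bit independently with probability `rate`."""
    flipped = {"0": "1", "1": "0"}
    return "".join(
        flipped[bit] if rng.random() < rate else bit for bit in rule
    )


rng = random.Random(0)  # fixed seed so runs are repeatable

# Two hypothetical 3-bit rules (two attribute bits and one class bit).
child_a, child_b = crossover("100", "001", point=1)
print(child_a, child_b)

print(mutate(child_a, rate=0.1, rng=rng))
```

In a full genetic algorithm these operators would be applied to the fittest rules of each generation, with fitness measured, for example, by classification accuracy on a set of training samples.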
...
Note − This approach can only be applied to discrete-valued
attributes
...
The Rough Set Theory is based on the establishment of
equivalence classes within the given training data
...
It means the
samples are identical with respect to the attributes describing
the data
...
We can
use rough sets to roughly define such classes
...
Upper Approximation of C − The upper approximation
of C consists of all the tuples that, based on the
knowledge of the attributes, cannot be described as not
belonging to C
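The approximations can be computed directly from the equivalence classes. In the sketch below, tuples with identical attribute values fall into the same equivalence class; the lower approximation (the usual counterpart of the upper one, defined here as the tuples that certainly belong to C) collects classes made up only of C tuples, and the upper approximation collects every class containing at least one C tuple. The data set and function name are hypothetical.

```python
from collections import defaultdict


def approximations(tuples, attributes, target):
    """Return (lower, upper) approximations of class `target` as sets of ids.

    tuples: list of (tuple_id, attribute_dict, class_label)
    """
    # Group tuples that are indiscernible on the chosen attributes.
    equivalence_classes = defaultdict(list)
    for tuple_id, attrs, label in tuples:
        key = tuple(attrs[name] for name in attributes)
        equivalence_classes[key].append((tuple_id, label))

    lower, upper = set(), set()
    for members in equivalence_classes.values():
        ids = {tuple_id for tuple_id, _ in members}
        labels = {label for _, label in members}
        if labels == {target}:   # every member is in C: certainly in C
            lower |= ids
        if target in labels:     # some member is in C: cannot rule C out
            upper |= ids
    return lower, upper


# Hypothetical training data: tuples 3 and 4 are indiscernible
# but carry different class labels.
data = [
    (1, {"a": 0, "b": 0}, "C"),
    (2, {"a": 0, "b": 0}, "C"),
    (3, {"a": 0, "b": 1}, "C"),
    (4, {"a": 0, "b": 1}, "not C"),
    (5, {"a": 1, "b": 1}, "not C"),
]

lower, upper = approximations(data, ["a", "b"], "C")
print(sorted(lower), sorted(upper))
```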
...
This theory was
proposed by Lotfi Zadeh in 1965 as an alternative to two-valued
logic and probability theory
...
It also provides us with the means for
dealing with imprecise measurement of data
...
For example, membership in the set of high incomes is
inexact (e.g., if $50,000 is high, then what about $49,000 and
$48,000?)
...
For example, the income value $49,000 belongs to both the
medium and high fuzzy sets but to differing degrees
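Graded membership of this kind can be sketched with simple piecewise-linear membership functions. The breakpoints below ($20,000, $30,000, $40,000, $50,000) are invented for illustration, so the resulting degrees will not match the values used in the notes.

```python
def ramp_up(x, low, high):
    """0 below `low`, 1 above `high`, linear in between."""
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)


def high_income(income):
    """Hypothetical membership function for the 'high' fuzzy set."""
    return ramp_up(income, 40_000, 50_000)


def medium_income(income):
    """Hypothetical 'medium' set: rises from 20k to 30k, falls from 40k to 50k."""
    return min(ramp_up(income, 20_000, 30_000),
               1.0 - ramp_up(income, 40_000, 50_000))


income = 49_000
print(f"medium: {medium_income(income):.2f}, high: {high_income(income):.2f}")
```

Under these assumed functions, $49,000 belongs mostly to the high set and only slightly to the medium set, rather than falling on one side of a sharp cutoff.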
...
15 and mhigh_income($49k)=0
...
This
notation can be shown diagrammatically as follows −
Data Mining - Cluster Analysis
A cluster is a collection of items from the same class
...
What is Clustering?
Clustering is the process of grouping a set of abstract objects
into classes of similar objects
...
While doing cluster analysis, we first partition the set
of data into groups based on data similarity and then
assign the labels to the groups
...
Applications of Cluster Analysis
• Clustering analysis is broadly used in many applications
such as market research, pattern recognition, data
analysis, and image processing
...
And they can
characterize their customer groups based on the
purchasing patterns
...
• Clustering also helps in identification of areas of similar
land use in an earth observation database
...
• Clustering also helps in classifying documents on the
web for information discovery
...
• As a data mining function, cluster analysis serves as a
tool to gain insight into the distribution of data to
observe characteristics of each cluster
...
• Ability to deal with different kinds of attributes −
Algorithms should be applicable to any kind
of data, such as interval-based (numerical) data,
categorical data, and binary data
...
They should not be
limited to distance measures that tend to find
small spherical clusters
...
• Ability to deal with noisy data − Databases contain
noisy, missing or erroneous data
...
• Interpretability − The clustering results should be
interpretable, comprehensible, and usable
...
Each
partition will represent a cluster and k ≤ n
...
Each object must belong to exactly one group
...
• Then it uses the iterative relocation technique to
improve the partitioning by moving objects from one
group to another
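One well-known partitioning method that follows this pattern is k-means, used here as a stand-in because the notes do not name a specific algorithm. The sketch starts from a naive initial partitioning (the first k points act as centres) and then relocates objects to the nearest centre until nothing moves. The sample points are made up.

```python
def kmeans(points, k, max_iterations=100):
    """Partition `points` (tuples of numbers) into k groups.

    Returns (assignment, centroids); assignment[i] is the group of points[i],
    so every object belongs to exactly one group.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must satisfy 0 < k <= n")

    centroids = [points[i] for i in range(k)]  # naive initialisation
    assignment = [None] * len(points)

    for _ in range(max_iterations):
        # Relocation step: move each object to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # no object changed group, so the partitioning is stable
        assignment = new_assignment

        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignment, centroids


# Hypothetical 2-D data with two visible groups.
points = [(1, 1), (1.5, 2), (8, 8), (9, 9.5), (1, 0.5), (8.5, 9)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The result depends on the initial centres; practical implementations usually try several random starts.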
...
We can classify hierarchical methods on the
basis of how the hierarchical decomposition is formed
...
In this,
we start with each object forming a separate group
...
It
keeps on doing so until all of the groups are merged into one or
until the termination condition holds
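The bottom-up procedure can be sketched in a few lines. The version below makes two assumptions that the notes do not state: the distance between two groups is the smallest distance between any pair of their members (single linkage), and the termination condition is a target number of clusters. The one-dimensional data is made up.

```python
def agglomerative(points, num_clusters):
    """Merge the two closest groups until `num_clusters` groups remain."""
    clusters = [[p] for p in points]  # each object starts as its own group

    def distance(group_a, group_b):
        # Single linkage: closest pair of members (an assumed choice).
        return min(abs(a - b) for a in group_a for b in group_b)

    while len(clusters) > num_clusters:
        # Find the pair of groups that are closest to each other.
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge group j into group i
        del clusters[j]
    return clusters


data = [1, 2, 10, 11, 20]
print(agglomerative(data, num_clusters=3))
print(agglomerative(data, num_clusters=1))
```

Setting `num_clusters=1` corresponds to merging until all of the groups have become one.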
...
In this,
we start with all of the objects in the same cluster
...
This
is done until each object is in its own cluster or the termination
condition holds
...
Approaches to Improve Quality of Hierarchical Clustering
Here are the two approaches that are used to improve the
quality of hierarchical clustering −
• Perform careful analysis of object linkages at each
hierarchical partitioning
...
Density-based Method
This method is based on the notion of density
...
Grid-based Method
In this, the objects together form a grid
...
Advantages
The major advantage of this method is fast processing
time
...
Model-based Methods
In this method, a model is hypothesized for each cluster to find
the best fit of data for a given model
...
It reflects spatial
distribution of the data points
...
It therefore yields robust clustering methods
...
A constraint refers to
the user expectation or the properties of desired clustering
results
...
Constraints can be
specified by the user or the application requirement
...
They
collect this information from several sources such as news
articles, books, digital libraries, e-mail messages, web pages, etc
...
In many of the text databases, the data is
semi-structured
...
But along with the
structured data, the document also contains unstructured text
components, such as the abstract and contents
...
Users require tools to compare the documents
and rank their importance and relevance
...
Information Retrieval
Information retrieval deals with the retrieval of information from
a large number of text-based documents
...
Examples of
information retrieval systems include −
• Online Library catalogue system
• Online Document Management Systems
• Web Search Systems etc
...
This kind of user query consists of some keywords
describing an information need
...
This is appropriate
when the user has an ad-hoc information need, i.e., a short-term
need
...
This kind of access to information is called Information Filtering
...
Basic Measures for Text Retrieval
We need to check the accuracy of a system when it retrieves a
number of documents on the basis of the user's input
...
The set of documents
that are relevant and retrieved can be denoted as {Relevant} ∩
{Retrieved}
...
Precision can be defined as −
Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the
query and were in fact retrieved
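Both measures are easy to compute once the relevant and retrieved documents are held as sets. In the sketch below, recall is taken as |{Relevant} ∩ {Retrieved}| / |{Relevant}|, the standard counterpart of the precision formula; the document ids and function names are made up for illustration.

```python
def precision(relevant, retrieved):
    """|{Relevant} ∩ {Retrieved}| / |{Retrieved}|"""
    if not retrieved:
        return 0.0  # convention chosen here for an empty result list
    return len(relevant & retrieved) / len(retrieved)


def recall(relevant, retrieved):
    """|{Relevant} ∩ {Retrieved}| / |{Relevant}|"""
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(relevant & retrieved) / len(relevant)


# Hypothetical query result: 4 relevant documents exist, 3 were returned,
# and 2 of the returned documents are relevant.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}

print(f"precision = {precision(relevant, retrieved):.2f}")
print(f"recall    = {recall(relevant, retrieved):.2f}")
```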
...
An information
retrieval system often needs to trade off recall for precision or vice
versa
...
Challenges in Web Mining
The web poses great challenges for resource and knowledge
discovery based on the following observations −
• The web is too huge − The web is very large
and rapidly growing
...
• Complexity of web pages − Web pages do not have a
unifying structure
...
There is a huge number
of documents in the digital library of the web
...
• The web is a dynamic information source − The information
on the web is rapidly updated
...
, are
regularly updated
...
These users have
different backgrounds, interests, and usage purposes
...
• Relevancy of information − A particular person is
generally interested in only a small portion of the web,
while the rest of the web contains information that is
not relevant to the user and may swamp the desired
results
...
The DOM structure refers to a tree-like
structure in which each HTML tag in the page corresponds to a node
in the DOM tree
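The tag-to-node correspondence can be sketched with Python's standard `html.parser` module. This is a deliberately simplified builder, not a browser-grade DOM: it records tag names only, ignores text and attributes, and uses a small hand-picked set of void tags. The class and function names are made up for this illustration.

```python
from html.parser import HTMLParser


class DomBuilder(HTMLParser):
    """Build a simplified tree with one node per HTML tag."""

    # Tags that never have a closing tag (partial list, for illustration).
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in self.VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest matching open element, tolerating unclosed tags.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index]["tag"] == tag:
                del self.stack[index:]
                break


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


page = "<html><body><h1>Title</h1><p>Text<br>more</p></body></html>"
print("\n".join(outline(build_dom(page))))
```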
...
HTML syntax is flexible; therefore,
many web pages do not follow the W3C specifications
...
The DOM structure was initially introduced for presentation in
the browser and not for description of semantic structure of the
web page
...
Vision-based page segmentation (VIPS)
• The purpose of VIPS is to extract the semantic structure
of a web page based on its visual presentation
...
In this tree each node corresponds to a
block
...
This value is called the
Degree of Coherence
...
• The VIPS algorithm first extracts all the suitable blocks
from the HTML DOM tree
...
• The separators refer to the horizontal or vertical lines
in a web page that visually cross with no blocks
...
The following figure shows the procedure of VIPS algorithm −
Data Mining - Applications & Trends
Data mining is widely used in diverse areas
...
In this tutorial, we will discuss
the applications and trends of data mining
...
Some of the typical cases are as follows −
• Design and construction of data warehouses for
multidimensional data analysis and data mining
...
• Classification and clustering of customers for targeted
marketing
...
Retail Industry
Data mining has great application in the retail industry because
the industry collects large amounts of data on sales, customer
purchasing history, goods transportation, consumption, and
services
...
Data mining in the retail industry helps identify customer
buying patterns and trends that lead to improved quality of
customer service and good customer retention and satisfaction
...
• Multidimensional analysis of sales, customers,
products, time and region
...
• Customer Retention
...
Telecommunication Industry
Today the telecommunication industry is one of the most rapidly
emerging industries, providing various services such as fax, pager,
cellular phone, internet messenger, images, e-mail, web data
transmission, etc
...
This is why data mining has become
very important in helping to understand the business
...
Here is the list of examples for which data mining improves
telecommunication services −
• Multidimensional Analysis of Telecommunication data
...
Identification of unusual patterns
...
Mobile Telecommunication services
...
Biological Data Analysis
In recent times, we have seen tremendous growth in fields
of biology such as genomics, proteomics, functional genomics,
and biomedical research
...
Following are the aspects in
which data mining contributes to biological data analysis −
• Semantic integration of heterogeneous, distributed
genomic and proteomic databases
...
• Discovery of structural patterns and analysis of genetic
networks and protein pathways
...
• Visualization tools in genetic data analysis
...
Huge amounts of data have been collected from
scientific domains such as geosciences, astronomy, etc
...
Following are the applications of data mining in the field of
Scientific Applications −
• Data Warehouses and data preprocessing
...
Visualization and domain specific knowledge
...
In this
world of connectivity, security has become a major issue
...
Here is the list of areas in which data mining
technology may be applied for intrusion detection −
• Development of data mining algorithms for intrusion
detection
...
• Analysis of Stream data
...
• Visualization and query tools
...
New data mining systems
and applications are being added to the previous systems
...
Choosing a Data Mining System
The selection of a data mining system depends on the following
features −
• Data Types − The data mining system may work with
relational data, record-based data, and structured text
...
Consequently, we need to determine the precise format
that the data mining system can handle
...
One data mining system may run on only one operating
system or on several
...
• Data Sources − Data sources refer to the data formats
on which the data mining system will operate
...
The data mining
system should also support ODBC connections or OLE
DB for ODBC connections
...
Coupling data mining with databases or data
warehouse systems − Data mining systems need to be
coupled with a database or a data warehouse system
...
Here are the
types of coupling listed below −
o No coupling
o Loose Coupling
o Semi tight Coupling
o Tight Coupling
Scalability − There are two scalability issues in data
mining −
o Row (Database size) Scalability − A data
mining system is considered as row scalable
when the number of rows is enlarged 10
times
...
o Column (Dimension) Scalability − A data
mining system is considered as column
scalable if the mining query execution time
increases linearly with the number of
columns
...
Unlike relational database systems, data
mining systems do not share an underlying data mining
query language
...
Scalable and interactive data mining methods
...
Standardization of data mining query language
...
New methods for mining complex types of data
...
Data mining and software engineering
...
Distributed data mining
...
Multi database data mining
...