Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Types of Knowledge to Be Mined
Classification/prediction
Background Knowledge
Concept hierarchies
- schema hierarchy
- Ex. street < city < province_or_state < country
- set-grouping hierarchy
- Ex. {20-39} = young, {40-59} = middle_aged
- operation-derived hierarchy
- Ex. e-mail address: login-name < department < university < country
- rule-based hierarchy
- low_profit(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
Pattern Interestingness Measurements
Certainty
Ex. confidence, P(A|B) = Card(A ∩ B) / Card(B)
Utility
potential usefulness
Ex. support, P(A ∩ B) = Card(A ∩ B) / # tuples
Novelty
not previously known, surprising
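The certainty and utility measures above reduce to counting tuples. The following is a minimal Python sketch, not part of DMQL itself: the helper names `support` and `confidence` and the rental records are hypothetical, and conditions A and B are modelled as predicates over a tuple.

```python
def support(tuples, a, b):
    """Fraction of all tuples that satisfy both condition a and condition b."""
    if not tuples:
        raise ValueError("support is undefined for an empty data set")
    both = sum(1 for t in tuples if a(t) and b(t))
    return both / len(tuples)


def confidence(tuples, a, b):
    """P(A|B): among the tuples satisfying b, the fraction that also satisfy a."""
    card_b = sum(1 for t in tuples if b(t))
    if card_b == 0:
        raise ValueError("confidence is undefined when no tuple satisfies B")
    both = sum(1 for t in tuples if a(t) and b(t))
    return both / card_b


# Hypothetical rental records, invented purely for illustration.
rentals = [
    {"age_group": "young", "type": "game"},
    {"age_group": "young", "type": "game"},
    {"age_group": "young", "type": "movie"},
    {"age_group": "senior", "type": "movie"},
]


def is_game(t):
    return t["type"] == "game"


def is_young(t):
    return t["age_group"] == "young"


s = support(rentals, is_game, is_young)     # 2 of 4 tuples satisfy both
c = confidence(rentals, is_game, is_young)  # 2 of the 3 "young" tuples are games
```

Both functions guard against a zero denominator, since neither measure is defined in that case.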
Visualization of Discovered Patterns
Different backgrounds and purposes may require different forms of representation
- Ex., rules, tables, crosstabs, pie/bar charts, etc.
Concept hierarchies are also important
- Discovered knowledge may be easier to understand when represented at a high concept level.
- Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data.
Different knowledge requires different representation.
Data Mining Operations Outline
What is the motivation for an ad-hoc mining process?
What defines a data mining task?
Can we define an ad-hoc mining language?
A Data Mining Query Language (DMQL)
Motivation
- A DMQL can provide the ability to support ad-hoc and interactive data mining.
- By providing a standardized language like SQL, we hope to achieve the same effect that SQL has had on relational databases.
Design
- DMQL is designed with the primitives described earlier.
Syntax for specification of
- task-relevant data
- the kind of knowledge to be mined
- concept hierarchy specification
- interestingness measure
- pattern presentation and visualization
Putting it all together -- a DMQL query
Syntax for Task-relevant Data Specification
use database database_name,
or use data warehouse data_warehouse_name
from relation(s)/cube(s) [where condition]
in relevance to att_or_dim_list
Syntax for Specifying the Kind of Knowledge to be Mined
Characterization
mine characteristics [as pattern_name] analyze measure(s)
Discrimination
mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)
- Association
mine associations [as pattern_name]
- Classification
mine classification [as pattern_name] analyze classifying_attribute_or_dimension
- Prediction
mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i = value_i}}
Syntax for Concept Hierarchy Specification
To specify what concept hierarchies to use
use hierarchy <hierarchy> for <attribute_or_dimension>
Different syntax is used to define different types of hierarchies
- schema hierarchies
define hierarchy time_hierarchy on date as [date, month, quarter, year]
- set-grouping hierarchies
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
- operation-derived hierarchies
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
- rule-based hierarchies
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
if (price - cost) ≤ $50
level_1: medium_profit_margin < level_0: all
if ((price - cost) > $50) and ((price - cost) ≤ $250)
level_1: high_profit_margin < level_0: all
if (price - cost) > $250
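To make the hierarchy definitions above concrete, the sketch below maps raw values to their level-1 concepts in Python. It is an illustration under stated assumptions, not an implementation of DMQL: the function names are hypothetical, the boundary comparisons at $50 and $250 are assumed to be inclusive on the lower class, and the unspecified `cluster(default, age, 5)` operation is replaced by simple equal-width binning over the ages 20 to 89.

```python
def age_level1(age):
    """Set-grouping hierarchy: map a level-2 age to its level-1 concept."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    if 60 <= age <= 89:
        return "senior"
    return None  # ages outside 20..89 are not covered by the hierarchy


def age_category(age, lo=20, hi=89, k=5):
    """Operation-derived hierarchy, with equal-width binning standing in for
    cluster(default, age, 5). Returns a category number in 1..k."""
    if not lo <= age <= hi:
        raise ValueError("age outside the range covered by the hierarchy")
    width = (hi - lo + 1) / k          # 70 ages split into 5 bins of 14
    return min(k, int((age - lo) // width) + 1)


def profit_margin_level1(price, cost):
    """Rule-based hierarchy: classify an item by its profit margin."""
    margin = price - cost
    if margin <= 50:                   # assumed inclusive boundary
        return "low_profit_margin"
    if margin <= 250:                  # assumed inclusive boundary
        return "medium_profit_margin"
    return "high_profit_margin"
```

For example, `age_level1(25)` yields `"young"` and `profit_margin_level1(400, 100)` yields `"high_profit_margin"`. A real clustering operation would derive the category boundaries from the data rather than fixing them in advance.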
Syntax for Interestingness Measure Specification
Interestingness measures and thresholds can be specified by the user with the statement:
with <interest_measure_name> threshold = threshold_value
Example:
with support threshold = 0.05
with confidence threshold = 0.7
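One way to read these two statements is as a filter over candidate patterns. The Python sketch below assumes the thresholds are inclusive minimums; the `Rule` class, the `interesting` helper, and the candidate rules with their scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    description: str
    support: float
    confidence: float


def interesting(rules, min_support=0.05, min_confidence=0.7):
    """Keep only the rules that meet both thresholds (assumed inclusive)."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Hypothetical candidate rules with made-up scores.
candidates = [
    Rule("young => rents games", support=0.12, confidence=0.80),
    Rule("senior => rents games", support=0.01, confidence=0.90),  # too rare
    Rule("young => pays by Visa", support=0.20, confidence=0.55),  # too uncertain
]
kept = interesting(candidates)
```

Only the first candidate passes both thresholds, so `kept` contains a single rule.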
Syntax for Pattern Presentation and Visualization Specification
Users can specify that discovered patterns be displayed in one or more forms:
display as <result_form>
To facilitate interactive viewing at different concept levels, the following syntax is defined:
Multilevel_Manipulation ::= roll up on attribute_or_dimension
    | drill down on attribute_or_dimension
    | add attribute_or_dimension
    | drop attribute_or_dimension
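The effect of rolling up and drilling down on a dimension can be sketched with a concept hierarchy stored as a child-to-parent mapping. The Python below is an illustration only: the helper names, the fragment of the location hierarchy, and the counts are hypothetical.

```python
from collections import Counter

# Hypothetical fragment of a location hierarchy: city < province_or_state.
CITY_TO_PROVINCE = {
    "Edmonton": "Alberta",
    "Calgary": "Alberta",
    "Vancouver": "British Columbia",
}


def roll_up(counts, parent_of):
    """Aggregate counts one level up the hierarchy."""
    rolled = Counter()
    for value, n in counts.items():
        rolled[parent_of[value]] += n
    return dict(rolled)


def drill_down(child_counts, parent_of, parent):
    """Show the lower-level counts that make up one higher-level value."""
    return {c: n for c, n in child_counts.items() if parent_of[c] == parent}


# Made-up rental counts per city.
city_counts = {"Edmonton": 3, "Calgary": 2, "Vancouver": 4}
province_counts = roll_up(city_counts, CITY_TO_PROVINCE)
alberta_detail = drill_down(city_counts, CITY_TO_PROVINCE, "Alberta")
```

Drilling down needs the detailed counts to be retained, because the aggregated view alone cannot be split back into its parts.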
Putting It All Together: the Full Specification of a DMQL Query
use database OurVideoStore_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerRenting
analyze count%
in relevance to C.age, I.type, I.place_made
from customer C, item I, rentals R, items_rent S, works_at W, branch B
where I.item_ID = S.item_ID and S.trans_ID = R.trans_ID
and R.cust_ID = C.cust_ID and R.method_paid = "Visa"
and R.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
and B.address = "Alberta" and I.price >= 100
with noise threshold = 0.05
display as table
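The task-relevant data part of this query (the `in relevance to`, `from` and `where` clauses) corresponds to an ordinary relational join. The sketch below runs that part as SQL through Python's `sqlite3` module on an in-memory database. The column types and all rows are hypothetical, and `branch.address` is stored already generalized to the province level, so the `location_hierarchy` step is skipped. The mining step, the noise threshold and the display clause are not modelled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer   (cust_ID INTEGER, age INTEGER);
CREATE TABLE item       (item_ID INTEGER, type TEXT, place_made TEXT, price REAL);
CREATE TABLE rentals    (trans_ID INTEGER, cust_ID INTEGER, empl_ID INTEGER,
                         method_paid TEXT);
CREATE TABLE items_rent (trans_ID INTEGER, item_ID INTEGER);
CREATE TABLE works_at   (empl_ID INTEGER, branch_ID INTEGER);
CREATE TABLE branch     (branch_ID INTEGER, address TEXT);

-- Hypothetical rows, invented purely for illustration.
INSERT INTO customer   VALUES (1, 25), (2, 47);
INSERT INTO item       VALUES (10, 'game', 'Canada', 120.0),
                              (11, 'movie', 'USA', 30.0);
INSERT INTO rentals    VALUES (100, 1, 7, 'Visa'), (101, 2, 7, 'Cash');
INSERT INTO items_rent VALUES (100, 10), (100, 11), (101, 10);
INSERT INTO works_at   VALUES (7, 1);
INSERT INTO branch     VALUES (1, 'Alberta');
""")

# Same join and selection conditions as the DMQL query above.
rows = conn.execute("""
SELECT C.age, I.type, I.place_made
FROM customer C, item I, rentals R, items_rent S, works_at W, branch B
WHERE I.item_ID = S.item_ID AND S.trans_ID = R.trans_ID
  AND R.cust_ID = C.cust_ID AND R.method_paid = 'Visa'
  AND R.empl_ID = W.empl_ID AND W.branch_ID = B.branch_ID
  AND B.address = 'Alberta' AND I.price >= 100
""").fetchall()
```

With these rows, only the Visa transaction for the item priced at 100 or more survives the selection. A characterization step would then generalize and count the selected tuples.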