Task-Relevant Data

  • Database or data warehouse name

  • Database tables or data warehouse cubes

  • Condition for data selection

  • Relevant attributes or dimensions

  • Data grouping criteria

      Types of Knowledge to Be Mined

    • Characterization

    • Discrimination

    • Association

    • Classification/prediction

    • Clustering

    • Outlier analysis

    • Other data mining tasks

    Background Knowledge

  • Concept hierarchies

    • schema hierarchy
      • Ex. street < city < province_or_state < country
    • set-grouping hierarchy
      • Ex. {20-39} = young, {40-59} = middle_aged
    • operation-derived hierarchy
      • Ex. e-mail address: login-name < department < university < country
    • rule-based hierarchy
      • Ex. low_profit(X) <= price(X, P1) and cost(X, P2)
        and (P1 - P2) < $50
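
A minimal Python sketch of how the schema and set-grouping hierarchies above could be represented. The names `SCHEMA_HIERARCHY`, `AGE_GROUPS`, `is_lower_level`, and `age_concept` are hypothetical, chosen only for illustration; the slides do not prescribe any implementation.

```python
# Schema hierarchy: attribute names in order, lowest level first
# (street < city < province_or_state < country).
SCHEMA_HIERARCHY = ["street", "city", "province_or_state", "country"]

def is_lower_level(a, b):
    """Return True if attribute a sits below attribute b in the hierarchy."""
    return SCHEMA_HIERARCHY.index(a) < SCHEMA_HIERARCHY.index(b)

# Set-grouping hierarchy: sets of raw values mapped to a named concept
# ({20-39} = young, {40-59} = middle_aged).
AGE_GROUPS = {
    "young": range(20, 40),
    "middle_aged": range(40, 60),
}

def age_concept(age):
    """Generalize a raw age to its concept, or None if no group covers it."""
    for concept, ages in AGE_GROUPS.items():
        if age in ages:
            return concept
    return None
```

For instance, `age_concept(25)` generalizes to `"young"`, while an age outside both groups yields `None`.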

    Pattern Interestingness Measurements

  • Simplicity

      Ex. rule length
  • Certainty

      Ex. confidence, P(A|B) = Card(A ∩ B) / Card(B)
  • Utility

      potential usefulness

      Ex. support, P(A ∩ B) = Card(A ∩ B) / # tuples
  • Novelty

      not previously known, surprising
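
The certainty and utility measures above can be sketched in a few lines of Python. This is an illustration only: conditions A and B are modelled as predicates over tuples, and the `rentals` data and function names are hypothetical.

```python
def support(tuples, a, b):
    """Fraction of all tuples that satisfy both a and b."""
    both = sum(1 for t in tuples if a(t) and b(t))
    return both / len(tuples)

def confidence(tuples, a, b):
    """P(A|B): among tuples satisfying b, the fraction that also satisfy a."""
    n_b = sum(1 for t in tuples if b(t))
    both = sum(1 for t in tuples if a(t) and b(t))
    return both / n_b if n_b else 0.0

# Hypothetical toy data: each tuple is the set of items in one rental.
rentals = [
    {"comedy", "popcorn"},
    {"comedy"},
    {"drama", "popcorn"},
    {"comedy", "popcorn"},
]

def has_comedy(t):
    return "comedy" in t

def has_popcorn(t):
    return "popcorn" in t

sup = support(rentals, has_popcorn, has_comedy)      # 2 of the 4 tuples
conf = confidence(rentals, has_popcorn, has_comedy)  # 2 of the 3 comedy tuples
```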

      Visualization of Discovered Patterns

    • Different backgrounds and purposes may require different forms of representation

      • Ex., rules, tables, crosstabs, pie/bar chart, etc.
    • Concept hierarchies are also important

      • Discovered knowledge may be easier to understand when represented at a high concept level.
      • Interactive drill up/down, pivoting, slicing, and dicing provide different perspectives on the data.
    • Different kinds of knowledge require different representations.

      Data Mining Operations Outline

    • What is the motivation for an ad-hoc mining process?

    • What defines a data mining task?

    • Can we define an ad-hoc mining language?

      A Data Mining Query Language (DMQL)

    • Motivation

      • A DMQL can provide the ability to support ad-hoc and interactive data mining.
      • By providing a standardized language like SQL, we hope to achieve the same effect that SQL has had on relational databases.
    • Design

      • DMQL is designed with the primitives described earlier.

      Syntax for DMQL

    • Syntax for specification of

      • task-relevant data
      • the kind of knowledge to be mined
      • concept hierarchy specification
      • interestingness measure
      • pattern presentation and visualization
    • Putting it all together -- a DMQL query

    Syntax for Task-relevant Data Specification

  • use database database_name,

    or use data warehouse data_warehouse_name

  • from relation(s)/cube(s) [where condition]

  • in relevance to att_or_dim_list

  • order by order_list

  • group by grouping_list

  • having condition

    Syntax for Specifying the Kind of Knowledge to be Mined

  • Characterization

    mine characteristics [as pattern_name] analyze measure(s)

  • Discrimination

      mine comparison [as pattern_name] for target_class where target_condition {versus contrast_class_i where contrast_condition_i} analyze measure(s)


  • Association

    mine associations [as pattern_name]

  • Classification
    mine classification [as pattern_name] analyze classifying_attribute_or_dimension
  • Prediction
      mine prediction [as pattern_name] analyze prediction_attribute_or_dimension {set {attribute_or_dimension_i= value_i}}

    Syntax for Concept Hierarchy Specification

  • To specify what concept hierarchies to use

      use hierarchy <hierarchy> for <attribute_or_dimension>
  • Different syntax is used to define different types of hierarchies

    • schema hierarchies
        define hierarchy time_hierarchy on date as [date, month, quarter, year]
    • set-grouping hierarchies
      define hierarchy age_hierarchy for age on customer as
      level1: {young, middle_aged, senior} < level0: all
      level2: {20, ..., 39} < level1: young
      level2: {40, ..., 59} < level1: middle_aged
      level2: {60, ..., 89} < level1: senior

    • operation-derived hierarchies
      define hierarchy age_hierarchy for age on customer as
      {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
    • rule-based hierarchies
      define hierarchy profit_margin_hierarchy on item as
      level_1: low_profit_margin < level_0: all
      if (price - cost) ≤ $50
      level_1: medium_profit_margin < level_0: all
      if ((price - cost) > $50) and ((price - cost) ≤ $250)
      level_1: high_profit_margin < level_0: all
      if (price - cost) > $250
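
As an illustration, the rule-based profit margin hierarchy could be evaluated as in the Python sketch below. It assumes the boundary conditions are inclusive on the upper side (margin ≤ $50 is low, margin ≤ $250 is medium); the function names are hypothetical and not part of DMQL.

```python
def profit_margin_level(price, cost):
    """Map an item to its level_1 concept using the three rules."""
    margin = price - cost
    if margin <= 50:       # assumed inclusive bound
        return "low_profit_margin"
    if margin <= 250:      # assumed inclusive bound
        return "medium_profit_margin"
    return "high_profit_margin"

def generalize(price, cost, level):
    """Return the concept at the requested level (0 is the root, 'all')."""
    if level == 0:
        return "all"
    return profit_margin_level(price, cost)
```

Because the three rules cover every possible margin without overlap, each item maps to exactly one level_1 concept.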

    Syntax for Interestingness Measure Specification

  • Interestingness measures and thresholds can be specified by the user with the statement:

      with <interest_measure_name> threshold = threshold_value
  • Example:

      with support threshold = 0.05

      with confidence threshold = 0.7
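
A small Python sketch of what applying such thresholds might look like. It assumes each threshold is a minimum that a pattern must meet or exceed; the `Rule` class, the `interesting` function, and the sample rules are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    description: str
    support: float
    confidence: float

def interesting(rules, thresholds):
    """Keep the rules whose measures meet every threshold (assumed minimums)."""
    return [
        r for r in rules
        if all(getattr(r, name) >= value for name, value in thresholds.items())
    ]

# Made-up rules, for illustration only.
rules = [
    Rule("a", support=0.08, confidence=0.75),
    Rule("b", support=0.02, confidence=0.90),  # fails the support threshold
    Rule("c", support=0.10, confidence=0.60),  # fails the confidence threshold
]
kept = interesting(rules, {"support": 0.05, "confidence": 0.7})
```

Only rule `a` passes both thresholds here.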

    Syntax for Pattern Presentation and Visualization Specification

  • The following syntax lets users specify that discovered patterns be displayed in one or more forms:

      display as <result_form>
  • To facilitate interactive viewing at different concept levels, the following syntax is defined:



      Multilevel_Manipulation ::= roll up on attribute_or_dimension | drill down on attribute_or_dimension | add attribute_or_dimension | drop attribute_or_dimension
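
To make two of these operations concrete, here is a minimal Python sketch of rolling up and dropping an attribute over a table of counts. The data, the age ranges, and the function names are assumptions for illustration, not part of DMQL.

```python
from collections import Counter

def age_group(age):
    """Map a raw age to a higher-level concept (ranges assumed)."""
    if 20 <= age <= 39:
        return "young"
    if 40 <= age <= 59:
        return "middle_aged"
    return "other"

def roll_up(counts, generalize):
    """Replace each key by its parent concept and merge the counts."""
    rolled = Counter()
    for value, n in counts.items():
        rolled[generalize(value)] += n
    return dict(rolled)

def drop_attribute(counts, position):
    """Remove one attribute from tuple keys and merge the counts."""
    merged = Counter()
    for key, n in counts.items():
        merged[key[:position] + key[position + 1:]] += n
    return dict(merged)

counts_by_age = {25: 3, 31: 2, 45: 4, 58: 1}
by_group = roll_up(counts_by_age, age_group)

counts_by_age_and_type = {
    ("young", "game"): 2,
    ("young", "movie"): 3,
    ("middle_aged", "movie"): 5,
}
by_type = drop_attribute(counts_by_age_and_type, 0)
```

Drilling down is the inverse of rolling up, so it needs the lower-level counts to be kept (or recomputed from the data) rather than derived from the rolled-up table.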

    Putting It All Together: the Full Specification of a DMQL Query

    use database OurVideoStore_db
    use hierarchy location_hierarchy for B.address
    mine characteristics as customerRenting
    analyze count%
    in relevance to C.age, I.type, I.place_made
    from customer C, item I, rentals R, items_rent S, works_at W, branch B
    where I.item_ID = S.item_ID and S.trans_ID = R.trans_ID
      and R.cust_ID = C.cust_ID and R.method_paid = "Visa"
      and R.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
      and B.address = "Alberta" and I.price >= 100
    with noise threshold = 0.05
    display as table