algorithm

Approaches to Structured Data Mining

Statistical Relational Learning (SRL)

  • PRMs (probabilistic relational models, for associative mining)
    • Bayesian networks
  • FOPL (First Order Probabilistic Languages)
  • Mr-SBC (2003 Ceci)
  • Graph-NB (2005, Hongyan, “cutting off” weakly linked tables)
  • SRG (solves statistical skew in Graph-NB)
  • Tuffy
  • Felix

Inductive Logic Programming (ILP)

  • WARMR (Dehaspe, 1998), association mining, exploits monotonicity
  • FARMER, RADAR (extensions of WARMR)
  • FOIL (Quinlan, 1990), top-down approach (generic-to-specific)
  • nFOIL (FOIL + Naive Bayes, dynamic propositionalization), creates only attributes improving classification accuracy
  • QuickFOIL (Qiang, 2014)
  • LIME (McCreath and Sharma, 1998)
  • GOLEM (Muggleton, 1990), specific-to-generic search
  • PROGOL (Muggleton, 1995), inverse resolution
  • TILDE (Blockeel and De Raedt, 1997), downloaded, tree induction
  • ALEPH (Srinivasan, 2003), Prolog, downloaded
  • SCART, tree like?
  • HiFi
  • ICL
  • GBI, FSG (Association mining)
  • SAYU (Davis, 2005)
  • Rminer (De Bie, Association mining)
  • XMuSer (2011, sequential discovery - works with time)
  • Castor (2016, denormalizes the database on the fly to make tables wider)
  • First Order Logic, Description Logic
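The monotonicity that WARMR is noted to exploit can be illustrated on plain itemsets: if a pattern is infrequent, every extension of it is infrequent too, so it never has to be counted. The sketch below is a propositional Apriori stand-in, not WARMR itself (WARMR searches over first-order queries); the function name `frequent_itemsets` and the toy data are made up for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise search; returns {itemset: support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support]
    frequent = {}
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        size = len(level[0]) + 1
        current = set(level)
        # Join step: merge frequent itemsets that differ in a single item.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Monotonicity pruning: drop a candidate before counting it
        # if any of its (size - 1)-subsets is not frequent.
        level = [c for c in candidates
                 if all(frozenset(sub) in current
                        for sub in combinations(c, size - 1))
                 and support(c) >= min_support]
    return frequent

# Hypothetical toy data.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = frequent_itemsets(transactions, min_support=2)
```

Here the candidate {a, b, c} is discarded without a support count because its subset {b, c} is infrequent.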

Graph Mining

  • Gaston (C++, downloaded)
  • MolFea (Kramer, 2001, molecule mining)
  • Subdue (C, downloaded)
  • NetKit-SRL (it is unclear to me how graph mining differs from statistical mining here)
  • AGN (association mining)
  • FSG (association mining)
  • GBI (association mining, heuristic techniques)

Relational decision trees

  • TILDE-RT (old)
  • MRDTL/MRDTL-2 (Multi-Relational Decision Tree Learning) algorithm implemented by Leiva (2003)
  • Mr-SMOTI (regression tree) by Appice (2003)
  • Relational decision tree (RDC)
  • Mr.G-Tree
  • FORF (2013, Schulte)

Regression

  • Structural Logistic Regression by Ungar (2003) - uses a hashing function to identify duplicate features
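The duplicate-feature idea can be sketched as follows: fingerprint each generated feature column with a hash and keep only the first column per fingerprint. This is only a guess at the general mechanism; the hashing actually used in Structural Logistic Regression is not described here, and `column_fingerprint`, `drop_duplicate_features` and the feature names are hypothetical.

```python
import hashlib

def column_fingerprint(values):
    """Hash the sequence of values of one feature column."""
    digest = hashlib.sha1()
    for value in values:
        digest.update(repr(value).encode("utf-8"))
        digest.update(b"\x00")  # separator so that [1, 23] and [12, 3] differ
    return digest.hexdigest()

def drop_duplicate_features(features):
    """features: dict name -> list of values. Keeps the first feature per fingerprint."""
    seen = set()
    kept = {}
    for name, values in features.items():
        fingerprint = column_fingerprint(values)
        if fingerprint in seen:
            continue  # an identical column was already kept
        seen.add(fingerprint)
        kept[name] = values
    return kept

# Hypothetical features generated from two different join paths.
features = {
    "count_orders": [2, 0, 5],
    "count_order_rows": [2, 0, 5],   # same values, different definition
    "max_amount": [10.0, 0.0, 7.5],
}
kept = drop_duplicate_features(features)
```

Comparing fingerprints instead of whole columns keeps the check cheap when many candidate features are generated.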

Relational distance methods

  • weighted-vote Relational Neighbor classifier (KNN)
  • TRANSC
  • RIBL (1996)
  • simple relational neighbor (RN) classifier (Provost, 2003)
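A minimal sketch of the weighted-vote idea behind the relational neighbor classifiers above: an unlabeled node takes the class with the largest total edge weight among its labeled neighbors. It is a single-pass simplification that ignores any iterative collective inference, and the graph, the labels and the name `wvrn_scores` are assumptions made for illustration.

```python
from collections import defaultdict

def wvrn_scores(node, edges, labels):
    """Return {class: normalized weighted vote} from the labeled neighbors of node.

    edges: dict mapping an undirected pair (a, b) to a non-negative weight.
    labels: dict mapping already labeled nodes to their class.
    """
    votes = defaultdict(float)
    for (a, b), weight in edges.items():
        if a == node:
            neighbor = b
        elif b == node:
            neighbor = a
        else:
            continue
        if neighbor in labels:
            votes[labels[neighbor]] += weight
    total = sum(votes.values())
    if total == 0:
        return {}  # no labeled neighbors, no estimate
    return {cls: vote / total for cls, vote in votes.items()}

# Hypothetical toy graph: node "x" is unlabeled.
edges = {("a", "x"): 2.0, ("b", "x"): 1.0, ("c", "x"): 1.0, ("a", "b"): 1.0}
labels = {"a": "pos", "b": "neg", "c": "pos"}
scores = wvrn_scores("x", edges, labels)
prediction = max(scores, key=scores.get)
```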

Relational Apriori

  • Mr-Radix (smart representation in Radix tree, avoids repeated joins with ItemMaps)

Kernel based

  • Horváth
  • SVILP
  • CHEM
  • MIK
  • PLS

Emerging Pattern based

  • JEP-Classifier
  • Mr-CAEP
  • DeEPs
  • ConsEPMiner
  • Mr.-EP
  • Mr-PEPC

Ensembles

  • MVC/MVC-IM (Hongyu, 2008)
  • Bagging Aleph (Jiang, 2006) (96% on Mutagenesis)
  • Boosted FFoil
  • Boosted WeakILP
  • MRC (Thakkar, 2012)

Propositionalization

  • MIDOS (predecessor of RELAGGS)
  • Linus (Lavrač, 1991) and its successors Dinus (Lavrač, 1994) and Sinus (Lavrač, 2001), but no aggregates
  • stochastic propositionalization (Kramer et al., 1998)
  • STILL, (Sebag, 1997, good accuracy)
  • REPART (Zucker, 1998)
  • Tolkien (Brockhausen, 1996, SQL)
  • RollUp/Polka/Safarii (Knobbe et al. 2001, repeated aggregation)
  • RELAGGS (Krogel and Wrobel 2001, propagates key)
  • Proper (Reutemann, 2004, extension of RELAGGS)
  • DBMiner (1996, possibly the first classifier working with SQL and OLAP cubes, provides both classification and regression)
  • CrossMine (Xiaoxin 2004, Tuple propagation, logic based)
  • ACORA (Perlich, 2005, aggregate categorical sets)
  • RSD (Zelezny, 2006)
  • PRORED (Gjorgjioski, ~2007, extension of RollUp, stochastic selection)
  • CLAMF (Frank, 2007, slope & correlation features)
  • DARA (Rayner, 2007, clustering of set of bag)
  • eDARA (2013, extension of DARA by ensembles)
  • MrCAR (Yingqin, 2009, Association Rules + Tuple propagation)
  • Wordification (Perovšek, 2012, TF-IDF → feature selection is unsupervised like in DARA)
  • Deep Feature Synthesis (Kanter, 2015)
  • Dataconda (Samorani, 2015)
  • Featuretools (2016)
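Aggregation-based propositionalization (the family that RollUp and RELAGGS belong to) can be sketched roughly like this: a one-to-many child table is summarized per target key, and the summaries become ordinary columns of a single flat table. This is a simplified illustration, not the implementation of any tool listed above; the table names, column names and the function `propositionalize` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def propositionalize(targets, children, key, column):
    """Append count/min/max/mean of children[column] to each target row."""
    groups = defaultdict(list)
    for row in children:
        groups[row[key]].append(row[column])

    table = []
    for target in targets:
        values = groups.get(target[key], [])
        features = dict(target)
        features[f"{column}_count"] = len(values)
        # Targets without child rows get missing values, not zeros.
        features[f"{column}_min"] = min(values) if values else None
        features[f"{column}_max"] = max(values) if values else None
        features[f"{column}_mean"] = mean(values) if values else None
        table.append(features)
    return table

# Hypothetical target table and one-to-many child table.
accounts = [{"account_id": 1, "label": "good"},
            {"account_id": 2, "label": "bad"},
            {"account_id": 3, "label": "good"}]
transactions = [{"account_id": 1, "amount": 100.0},
                {"account_id": 1, "amount": 300.0},
                {"account_id": 2, "amount": 50.0}]
flat = propositionalize(accounts, transactions, "account_id", "amount")
```

The resulting `flat` table has one row per account and can be passed to any attribute-value learner.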

Algorithm agnostic (aggregate & propagation methods)

  • SESP (Guo, 2007, table pruning method)
  • MODL (Boullé, 2014, stochastic aggregate selection)
  • FARS (Jun, 2010, table and column selection)
  • FBE (Struyf, 2006, approximate look-ahead)

Comparison by accuracy

  • MIDOS < Linus < Dinus, Sinus < RSD < RollUp (but faster than RELAGGS) < RELAGGS
  • FOIL < Tilde < Propal < RDBC < GDBI < RELAGGS < Crossmine < RIBL < MVC < MVC-IM < MRDTL < MrCAR
  • FOIL < Tilde < Graph-NB < Crossmine < Entropy-based
  • FOIL < Progol < ICL
  • Boosted WeakILP < CHEM < MIK < PLS < SVILP < STILL < Bagging Aleph

Comparison by functionality

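Algorithm | Time | Sampling    | N targets | Regr | Heu | MIL | Scales | Transduct | BK  | N db | GUI
RollUp    | No   | No          | No        | Yes  | No  | No  | ?      | No        | No  | No   | Yes
RELAGGS   | No   | No          | No        | Yes  | No  | Yes | ?      | Yes       | Yes | No   | Yes
Crossmine | No   | Yes         | No        | ?    | No  | No  | No     | No        | ?   | No   | ?
MVC       | No   | No (manual) | No        | Yes  | No  | ?   | Yes    | ?         | No  | No   | ?
IMP       | No   | No          | Yes       | Yes  | No  | ?   | Yes    | No        | ?   | No   | ?
CoTReC    | No   | No          | No        | Yes  | No  | ?   | ?      | Yes       | ?   | No   | ?
CoMoVi    | Yes  | No          | No        | Yes  | No  | ?   | ?      | Yes       | Yes | No   | ?
RSCC      | No   | ?           | No        | ?    | Yes | ?   | ?      | ?         | No  | No   | ?
CLAMF     | Yes  | ?           | ?         | ?    | ?   | ?   | ?      | ?         | No  | No   | ?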
  • one-class classification (Khot)
  • multi-label (MLCC)
  • multi-instance
  • multi-task (multiple targets) (kFOIL)
  • hierarchical (kFOIL)
  • stream mining (Džeroski)
  • recommendation
  • unstructured data (like in DARA)
  • segmentation (like in DARA)
  • active/proactive learning (like FLIP)
  • transductive (like TRANS)
  • refinement (like )
  • unbalanced (like R-NB)
  • parallelized (like pCRIS)
  • collective classification (like kLog)
  • reinforcement (Elkan)
  • Inverse Reinforcement Learning (Inverse Reinforcement Learning in Rel)
  • ordinal class classifier
  • cost sensitive
  • updatable
  • learn to rank
  • PU learning (positive and mixed classes)
  • Preference learning (object ranking)
  • Sequence labeling (as in part-of-speech tagging)
  • one-shot learning
  • zero-shot learning

Comparison by accuracy on datasets

Algorithm         | Financial             | Run time      | Source
Predictor Factory | 98%                   | 21 s          |
CLAMF             | 100% (precision!)     | not published | 11
RollUp            | 100% (retrospective!) | not published | 3
PRORED            | 100% (retrospective!) | not published | 9
RELAGGS           | 97% (retrospective!)  | 30 s          | 1, 2, 13, 14
RSD               | 95%                   | 5 s           | 2
DARA              | 95%                   | not published | 8, 12
MRC               | 93%                   | 6 s           | 13
Crossmine         | 91%                   | 15 s          | 4, 5, 13, 15
PIC               | 91%                   | 28 s          | 10
TILDE             | 89%                   | 650 s         | 4, 8, 13, 15
MVC               | 88%                   | 6 s           | 6
TreeLiker-Poly    | 88%                   | 41 s          | 7
TreeLiker-Relf    | 87%                   | 62 s          | 7
Graph-NB          | 85%                   | 2 s           | 4
RDC               | 83%                   | not published | 15
FOIL              | 80%                   | 3479 s        | 4, 8, 15

Algorithm         | Mutagenesis | Run time      | Source
Predictor Factory | 90%         | 1 s           |
DARA              | 97%         | not published | 8
Bagging Aleph     | 96%         | not published | 11
RSD               | 94%         | 300 s         | 2
STILL             | 94%         | 28 s          | 10
MIK               | 93%         | not published | 11
DRM               | 90%         | not published | 9
RollUp            | 89%         | not published | 3
RELAGGS           | 89%         | 30 s          | 1, 2
MrCAR             | 89%         | 3 s           | 11
Crossmine         | 89%         | 1 s           | 4, 5
Aleph             | 89%         | not published | 11
MMV               | 89%         | not published | 12
ICL               | 88%         | not published | 11
MRDTL-2           | 88%         | not published | 16
MBN               | 87%         | not published | 17
SVILP             | 87%         | not published | 9
MVC               | 87%         | 3 s           | 6
Graph-NB          | 86%         | 1 s           | 4
TILDE             | 85%         | 142 s         | 4, 8
RUMBLE            | 84%         | not published | 12
Mr-SBC            | 82%         | not published | 16
MRDTL             | 81%         | not published | 16
kFOIL             | 81%         | not published | 9
PROGOL            | 79%         | 40 000 s      | 9, 10, 11
nFOIL             | 75%         | not published | 9
MBN               | 68%         | 6 s           | 15
FOIL              | 61%         | 1 s           | 4, 8, 11