A Comparison of Mining Incomplete and Inconsistent Data

Patrick G. Clark, Cheng Gao, Jerzy Grzymala-Busse

Abstract


We present experimental results comparing incompleteness and inconsistency in data. We used two interpretations of missing attribute values: lost values and "do not care" conditions. Our experiments were conducted on 204 data sets, including 71 data sets with lost values, 71 data sets with "do not care" conditions and 62 inconsistent data sets, all created from eight original numerical data sets. We used the Modified Learning from Examples Module version 2 (MLEM2) rule induction algorithm for data mining, combined with three types of probabilistic approximations: lower, middle and upper. We used the error rate, computed by ten-fold cross validation, as the criterion of quality. There is experimental evidence that incompleteness is worse than inconsistency for data mining (two-tailed test, 5% level of significance). Additionally, lost values are better than "do not care" conditions, again with regard to the error rate, and there is a small difference in the error rate between the three types of probabilistic approximations.
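The two interpretations of missing values and the three probabilistic approximations named in the abstract can be illustrated with a minimal Python sketch. It follows the definitions commonly used in the rough-set literature on incomplete data (characteristic sets built from attribute-value blocks) and is not the MLEM2 implementation used in the experiments. The toy table, the attribute names, the function names, the symbols `?` (lost value) and `*` ("do not care" condition), and the choice of 1/1000 as the small positive threshold for the upper approximation are all assumptions made for illustration.

```python
from fractions import Fraction

LOST, DONT_CARE = "?", "*"  # assumed markers for the two kinds of missing values

def characteristic_sets(table):
    """Return, for each case, the set of cases indistinguishable from it.

    A lost value is excluded from every block of its attribute, while a
    "do not care" value is included in every block of its attribute.
    """
    universe = set(range(len(table)))
    attrs = list(table[0].keys())
    blocks = {}
    for a in attrs:
        specified = {row[a] for row in table if row[a] not in (LOST, DONT_CARE)}
        for v in specified:
            blocks[(a, v)] = {
                i for i, row in enumerate(table)
                if row[a] == v or row[a] == DONT_CARE
            }
    result = []
    for row in table:
        k = set(universe)
        for a in attrs:
            if row[a] not in (LOST, DONT_CARE):  # missing values do not restrict K(x)
                k &= blocks[(a, row[a])]
        result.append(k)
    return result

def probabilistic_approximation(concept, char_sets, alpha):
    """Union of K(x), x in the concept, with Pr(concept | K(x)) >= alpha."""
    approx = set()
    for x in concept:
        k = char_sets[x]
        if Fraction(len(k & concept), len(k)) >= alpha:
            approx |= k
    return approx

# Hypothetical four-case table; the decision (flu) is kept separately as a concept.
table = [
    {"temperature": "high",   "headache": "yes"},
    {"temperature": LOST,     "headache": "yes"},
    {"temperature": "high",   "headache": DONT_CARE},
    {"temperature": "normal", "headache": "no"},
]
flu_yes = {0, 1}  # cases whose decision value is "yes"

K = characteristic_sets(table)
lower = probabilistic_approximation(flu_yes, K, Fraction(1))         # alpha = 1
middle = probabilistic_approximation(flu_yes, K, Fraction(1, 2))     # alpha = 0.5
upper = probabilistic_approximation(flu_yes, K, Fraction(1, 1000))   # small alpha > 0

print(K)
print(lower, middle, upper)
```

In this toy table the lower approximation is empty, because no characteristic set of a flu case lies entirely inside the concept, while the middle and upper approximations coincide. A rule induction algorithm would then be run on the chosen approximation rather than on the raw concept.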

DOI: http://dx.doi.org/10.5755/j01.itc.46.2.17330


Keywords


Incomplete data, lost values, "do not care" conditions, inconsistent data, rough set theory, probabilistic approximations, MLEM2 rule induction algorithm.

Print ISSN: 1392-124X 
Online ISSN: 2335-884X