A Comparison of Mining Incomplete and Inconsistent Data

  • Patrick G. Clark
  • Cheng Gao
  • Jerzy Grzymala-Busse, University of Kansas
Keywords: Incomplete data, lost values, "do not care" conditions, inconsistent data, rough set theory, probabilistic approximations, MLEM2 rule induction algorithm.


We present experimental results on a comparison of incompleteness and inconsistency. We used two interpretations of missing attribute values: lost values and "do not care" conditions. Our experiments were conducted on 204 data sets, including 71 data sets with lost values, 71 data sets with "do not care" conditions and 62 inconsistent data sets, created from eight original numerical data sets. We used the Modified Learning from Examples Module version 2 (MLEM2) rule induction algorithm for data mining, combined with three types of probabilistic approximations: lower, middle and upper. We used the error rate, computed by ten-fold cross validation, as the criterion of quality. There is experimental evidence that incompleteness is worse than inconsistency for data mining (two-tailed test, 5% level of significance). Additionally, lost values are better than "do not care" conditions, again with regard to the error rate, and there is little difference in the error rate among the three types of probabilistic approximations.
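The three probabilistic approximations named above can be illustrated with a minimal sketch for inconsistent (but complete) data. It assumes the usual rough-set formulation, in which a block of indiscernible cases is included in the approximation of a concept when the conditional probability of the concept given the block is at least a threshold alpha: alpha = 1 for the lower approximation, alpha = 0.5 for the middle one, and a small positive alpha for the upper one. The function name, the toy data and the value 1/1000 used for the upper approximation are assumptions made for illustration only; they are not taken from the paper, and the incomplete-data case (which relies on characteristic sets rather than equivalence classes) is not covered.

```python
from collections import defaultdict
from fractions import Fraction


def probabilistic_approximation(cases, concept, alpha):
    """Return indices of cases in the alpha-probabilistic approximation.

    cases   -- list of (attribute_tuple, decision) pairs (hypothetical format)
    concept -- decision value whose approximation is computed
    alpha   -- threshold in (0, 1]
    """
    # Group cases that are indiscernible on their attribute values.
    blocks = defaultdict(set)
    for index, (attributes, _) in enumerate(cases):
        blocks[attributes].add(index)

    # Cases that actually belong to the concept.
    target = {i for i, (_, decision) in enumerate(cases) if decision == concept}

    approximation = set()
    for block in blocks.values():
        # Conditional probability of the concept given the block.
        probability = Fraction(len(block & target), len(block))
        if probability >= alpha:
            approximation |= block
    return approximation


# Toy inconsistent data set: identical attribute values, conflicting decisions.
CASES = [
    (("high",), "yes"),
    (("high",), "yes"),
    (("high",), "no"),
    (("low",), "yes"),
    (("low",), "no"),
    (("low",), "no"),
    (("mid",), "yes"),
]

lower = probabilistic_approximation(CASES, "yes", Fraction(1))
middle = probabilistic_approximation(CASES, "yes", Fraction(1, 2))
# Assumed small positive threshold standing in for "any probability above 0".
upper = probabilistic_approximation(CASES, "yes", Fraction(1, 1000))

print(sorted(lower), sorted(middle), sorted(upper))
```

On this toy data the lower approximation contains only case 6, the middle approximation adds the "high" block (cases 0 to 2, where two of three cases are "yes"), and the upper approximation contains all seven cases.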

DOI: http://dx.doi.org/10.5755/j01.itc.46.2.17330