Discretization of continuous features

In statistics and machine learning, discretization is the process of converting or partitioning continuous attributes, features, or variables into discretized or nominal ones, typically a finite set of intervals. This can be useful when creating probability mass functions, that is, in density estimation. It is a special case of discretization in general and a form of binning, as in making a histogram. Whenever continuous data is discretized, some discretization error is introduced. The goal is to reduce that error to a level considered negligible for the modeling purposes at hand.

Typically, data is discretized into K partitions of equal width (equal intervals) or into K partitions that each contain roughly the same share of the data (equal frequencies).^[1]
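The two unsupervised schemes can be illustrated with a short sketch. It assumes NumPy is available; the helper names (equal_width_edges, equal_frequency_edges, assign_bins) and the sample data are hypothetical and chosen only for illustration. Equal-width edges are spaced evenly between the minimum and maximum, while equal-frequency edges are placed at the quantiles of the data.

```python
import numpy as np

def equal_width_edges(x, k):
    """Return k+1 edges that split [min(x), max(x)] into k equal-width bins."""
    x = np.asarray(x, dtype=float)
    return np.linspace(x.min(), x.max(), k + 1)

def equal_frequency_edges(x, k):
    """Return k+1 edges at the 0, 1/k, ..., 1 quantiles of x."""
    x = np.asarray(x, dtype=float)
    return np.quantile(x, np.linspace(0.0, 1.0, k + 1))

def assign_bins(x, edges):
    """Map each value to a bin index in 0..k-1.

    Only the interior edges are passed to np.digitize, so the minimum
    falls in bin 0 and the maximum falls in bin k-1.
    """
    return np.digitize(x, edges[1:-1])

# Hypothetical sample with one large outlier.
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])

width_bins = assign_bins(values, equal_width_edges(values, 4))
freq_bins = assign_bins(values, equal_frequency_edges(values, 4))

print(width_bins.tolist())  # [0, 0, 0, 0, 0, 0, 0, 3]
print(freq_bins.tolist())   # [0, 0, 1, 1, 2, 2, 3, 3]
```

In this sample the outlier pulls almost every value into the first equal-width bin, whereas the equal-frequency bins each hold two values. This is one reason the choice between the two schemes depends on the distribution of the data.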

Mechanisms for discretizing continuous data include Fayyad & Irani's MDL method,^[2] which uses mutual information to recursively define the best bins, as well as CAIM, CACC, Ameva, and many others.^[3]
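The recursive idea behind the MDL method can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the commonly described formulation in which the cut point minimizing the class entropy of the two sides is chosen, and the split is accepted only if its information gain exceeds the MDL-based threshold (log2(N − 1) + Δ) / N. The function names (entropy, best_cut, mdlp_cuts) are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(vals, labs):
    """Return (index, weighted entropy) of the best boundary in sorted data,
    or None when no boundary separates two distinct values."""
    n = len(vals)
    best = None
    for i in range(1, n):
        if vals[i] == vals[i - 1]:
            continue  # a cut must fall between two distinct values
        e = (i * entropy(labs[:i]) + (n - i) * entropy(labs[i:])) / n
        if best is None or e < best[1]:
            best = (i, e)
    return best

def _split(vals, labs, cuts):
    """Recursively add accepted cut points to the list `cuts`."""
    found = best_cut(vals, labs)
    if found is None:
        return
    i, e_split = found
    n = len(vals)
    left, right = labs[:i], labs[i:]
    ent_s = entropy(labs)
    gain = ent_s - e_split
    k, k1, k2 = len(set(labs)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (
        k * ent_s - k1 * entropy(left) - k2 * entropy(right)
    )
    # MDL stopping criterion: reject the cut if the gain is too small.
    if gain <= (math.log2(n - 1) + delta) / n:
        return
    cuts.append((vals[i - 1] + vals[i]) / 2)  # midpoint between neighbours
    _split(vals[:i], left, cuts)
    _split(vals[i:], right, cuts)

def mdlp_cuts(values, labels):
    """Return the sorted cut points found for one continuous feature."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    vals = [v for v, _ in pairs]
    labs = [l for _, l in pairs]
    cuts = []
    _split(vals, labs, cuts)
    return sorted(cuts)

# Hypothetical data: the class changes once, between 5 and 6.
cuts = mdlp_cuts(list(range(1, 11)), ["a"] * 5 + ["b"] * 5)
print(cuts)  # [5.5]
```

Unlike equal-width or equal-frequency binning, this approach is supervised: the number of bins is not fixed in advance but follows from the class labels, and a feature with no class-related structure may receive no cut points at all.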

Many machine learning algorithms are known to produce better models when continuous attributes are discretized first.^[4]

Software

This is a partial list of software that implements the MDL algorithm.

discretize4crf, a tool designed to work with popular CRF implementations (C++)
mdlp in the R package discretization
Discretize in the R package RWeka

References

^ Clarke, E. J.; Barton, B. A. (2000). "Entropy and MDL discretization of continuous variables for Bayesian belief networks" (PDF). International Journal of Intelligent Systems. 15: 61–92. doi:10.1002/(SICI)1098-111X(200001)15:1<61::AID-INT4>3.0.CO;2-O. Retrieved 2008-07-10.
^ Fayyad, Usama M.; Irani, Keki B. (1993). "Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning" (PDF). Proc. 13th Int. Joint Conf. on Artificial Intelligence (Q334 .I571 1993). pp. 1022–1027. hdl:2014/35171.
^ Dougherty, J.; Kohavi, R.; Sahami, M. (1995). "Supervised and Unsupervised Discretization of Continuous Features". In A. Prieditis & S. J. Russell, eds. Work. Morgan Kaufmann, pp. 194-202
^ Kotsiantis, S.; Kanellopoulos, D. (2006). "Discretization Techniques: A recent survey". GESTS International Transactions on Computer Science and Engineering. 32 (1): 47–58. CiteSeerX 10.1.1.109.3084.
