Friday, April 23, 2010

Top 10 Algorithms in Data Mining

The authors here invited ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to each nominate up to 10 of the best-known algorithms in data mining, including the algorithm name, a justification for the nomination, and a representative publication reference. Other IEEE and ACM award winners then voted on the nominations to narrow them down to a top 10 list. These algorithms are used for association analysis, classification, clustering, statistical learning, and much more. You can read the paper here.

Here are the winners:
  1. C4.5
  2. The k-Means algorithm
  3. Support Vector Machines
  4. The Apriori algorithm
  5. Expectation-Maximization
  6. PageRank
  7. AdaBoost
  8. k-Nearest Neighbor Classification
  9. Naive Bayes
  10. CART (Classification and Regression Trees)
The 2007 paper gives a brief overview of what each method is commonly used for and how it works, along with lots of references. It also describes how these winners were selected in much more detail than I have here.
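
To give a flavor of how compact some of these methods are at their core, here is a minimal sketch of k-nearest neighbor classification (number 8 on the list) in plain Python. It is not taken from the paper: the function name `knn_predict`, the choice of Euclidean distance, and the toy data are all assumptions made up for illustration, and a real analysis would use a tested library implementation.

```python
from collections import Counter
import math


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs. Distance is Euclidean,
    which is an assumption of this sketch, not something the paper requires.
    """
    # Sort training points by distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    # Count the labels among those neighbors and return the most common one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration.
train = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((0.9, 1.1), "a"),
    ((5.0, 5.0), "b"),
    ((5.2, 4.9), "b"),
    ((4.8, 5.1), "b"),
]

print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.1, 5.0)))
```

The whole method is "find the closest labeled examples and let them vote", which is part of why it shows up so often as a baseline.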

The exciting thing is that I've seen nearly all of these algorithms used to mine genetic data for complex patterns of genetic and environmental exposures that influence complex disease. See some recent papers at EvoBio and PSB. Further, many of these methods are implemented in R packages.

Top 10 Algorithms in Data Mining (PDF)


  1. This comment has been removed by the author.

  2. Interesting article, but are you sure that you've linked to the correct file? The PDF that you refer to was published in 2007...

  3. Sir, please answer my question:

    Why is the QUEST algorithm called Quick?
    Why is the QUEST algorithm called Unbiased?
    Why is the QUEST algorithm called Efficient?

    Best regards.

  4. What about the recently famous symbolic regression? (google "Introducing Robo-Scientist" if you haven't heard)

  5. Number One is Logistic Regression! All scorecards are based on logistic regression. Furthermore, logistic regression is a simple version of a neural network.

  6. While I wouldn't go so far as to say that Logistic Regression is "Number One" (meant tongue in cheek, no doubt), I was surprised to not see it on the list...

  7. This is a top 10 by popularity, not by efficiency. Hierarchical Bayes and Markov Fields are superior (by far) to Naive Bayes. Also, there's no mention of statistical algorithms such as EM, Logistic Regression, etc.


Getting Genetics Done by Stephen Turner is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported License.