Posted by: Chris Brew | March 26, 2011

How to do comparisons between machine learning schemes

Nice paper comparing 16 model selection and weighting schemes on 58 benchmark datasets. The data analysis was done in the following way:

– for each dataset, rank the schemes, then average the ranks across datasets;
– use the Friedman test to test whether the average ranks are all equal;
– if the ranks are not all equal, use the Nemenyi post-hoc test (covered in papers by Demšar and by García et al.).

Ying Yang, Geoffrey I. Webb, Jesús Cerquides, Kevin B. Korb, Janice R. Boughton, Kai Ming Ting: To Select or To Weigh: A Comparative Study of Linear Combination Schemes for SuperParent-One-Dependence Estimators. IEEE Trans. Knowl. Data Eng. 19(12): 1652–1665 (2007). ISSN 1041-4347.
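The rank-and-test procedure can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: it assumes accuracies are stored as a datasets × schemes matrix, and the Nemenyi critical-difference constant `q_alpha` must be looked up in published tables (e.g. in Demšar's paper).

```python
import math

def average_ranks(scores):
    """scores[i][j] = accuracy of scheme j on dataset i (higher is better).
    Returns the mean rank of each scheme across datasets (rank 1 = best)."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        # rank schemes within this dataset; ties share the average rank
        order = sorted(range(k), key=lambda j: -row[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # average of 1-based positions i..j
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            totals[j] += ranks[j]
    return [t / n for t in totals]

def friedman_statistic(scores):
    """Friedman chi-square statistic over n datasets and k schemes."""
    n, k = len(scores), len(scores[0])
    r = average_ranks(scores)
    ss = sum((rj - (k + 1) / 2) ** 2 for rj in r)
    return 12 * n / (k * (k + 1)) * ss

def nemenyi_cd(q_alpha, k, n):
    """Nemenyi critical difference: two schemes differ significantly if
    their average ranks differ by at least this much. q_alpha comes
    from the Studentized-range tables (e.g. 2.343 for k=3, alpha=0.05)."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))
```

If the Friedman statistic is below the chi-square critical value, stop there: the schemes are statistically indistinguishable on these datasets, and the Nemenyi step is unnecessary.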



  1. So, when I read this post, the first thing I thought was: why do a comparison between machine learning systems? In fact, just the other day I was in a pre-comps committee meeting, and the grad student in question was proposing using five different classification algorithms and comparing their performance on the task of predicting… well, I don’t want to spill her dissertation topic, but it involves something medical, and textual data. I asked her why she was intent on doing the work of comparing five different algorithms – what hypothesis *about prediction of something medical from textual data* could she possibly test by running her data through five (mostly arbitrarily chosen) classification algorithms? I kinda stick by that, but I see the point of doing such a comparison better after reading the abstract – comparing running times without a formal analysis isn’t very scientific, but I guess I do have a post-doc working on a problem that’s entirely motivated by crappy processing times of someone else’s system. So, yeah – it’s not exactly hypothesis-driven science, but not much of what we do is…

    • The performance mentioned in the paper is not the speed of the algorithm, but the quality of the results. The point of the paper is to advocate for sensible ways of answering the question “is there any important difference in accuracy between these algorithms?”. If the answer to this question is “No”, then we can choose on the basis of speed, implementation difficulty, or any other reasonable criterion. The authors are selling the idea that when you compare a lot of different alternatives, you need to apply the standard stats methods that correct for multiple comparisons. If you don’t, you will too often delude yourself into believing in differences between the algorithms when there really are none.

      In an applied machine learning context, a scattershot comparison of five algorithms is definitely not something that would excite me. It could make sense if, for example, everyone believes that algorithms A-D (which, let’s say, require a supercomputer) are the only ones that will deliver acceptable results, but you think that algorithm E (which runs on a cheap laptop) will do the job faster, and at least as well.
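The multiple-comparisons point is worth making concrete with a little arithmetic. With 16 schemes there are 16·15/2 = 120 pairwise comparisons, and if each is tested at α = 0.05 the chance of at least one spurious “significant difference” is nearly certain. A tiny sketch (assuming independent tests, which is a simplification):

```python
def fwer(alpha, m):
    """Familywise error rate: probability of at least one false positive
    across m independent tests, each run at significance level alpha."""
    return 1 - (1 - alpha) ** m

pairs = 16 * 15 // 2           # 120 pairwise comparisons among 16 schemes
print(fwer(0.05, pairs))       # close to 1: a "difference" will almost surely appear
print(0.05 / pairs)            # Bonferroni-corrected per-test threshold
```

Corrections like Bonferroni (or the Nemenyi procedure the paper uses, which is tailored to rank comparisons) shrink the per-test threshold so the overall error rate stays near the nominal 0.05.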
