No Free Lunch Theorem
The No Free Lunch theorem states that, averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points.
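The averaging claim can be checked exhaustively on a toy domain. The sketch below is my own illustration (the 4-point domain and both learners are hypothetical): it enumerates every possible boolean labeling of four points, trains two trivial learners on two of them, and measures accuracy on the two unobserved points. Averaged over all 16 target functions, both learners score exactly 0.5.

```python
from itertools import product

# Toy NFL check: domain of 4 points, train on the first 2,
# evaluate off-training-set accuracy on the last 2, averaged
# over all 2^4 possible boolean target functions.
train_idx, test_idx = [0, 1], [2, 3]

def always_zero(train_labels):
    # Ignores the data entirely.
    return lambda x: 0

def majority(train_labels):
    # Predicts the majority label seen in training.
    m = 1 if sum(train_labels) > len(train_labels) / 2 else 0
    return lambda x: m

def avg_test_accuracy(learner):
    accs = []
    for f in product([0, 1], repeat=4):  # every possible labeling
        predict = learner([f[i] for i in train_idx])
        correct = sum(predict(i) == f[i] for i in test_idx)
        accs.append(correct / len(test_idx))
    return sum(accs) / len(accs)

print(avg_test_accuracy(always_zero))  # 0.5
print(avg_test_accuracy(majority))     # 0.5
```

Any learner whose predictions depend only on the training data ends up at 0.5 here, because the unobserved labels are independent of what was observed once we average over all target functions.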
Stated simply: there is no single model that works best for every problem. The assumptions that work amazingly well for one problem may fail on another.
The computational cost of finding a solution, averaged over all problems in the class, is the same for any solution method. No solution method offers a shortcut. There are no universal solutions to ML problems: if we make no assumptions about the input data, then all ML approaches are equally good.
Does this mean that a random search for a solution is a good approach?
It is very important for us to understand the space in which our models will operate. If we choose the right normed space, the right constraints, and a well-suited objective function, then optimization can handle the rest.
Another aspect of this theorem is that no search/optimization algorithm is expected to perform better than any other, averaged over all problems. Here performance is measured by the number of candidate solutions that must be scanned before a solution is found.
Does this mean Gradient Descent and Adam perform similarly? And what about super-convergence then?
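NFL only says the optimizers tie when averaged over all problems; on any particular problem they can behave very differently. A minimal sketch, assuming a single ill-conditioned quadratic f(w) = w0^2 + 100*w1^2 as the test problem (my own choice, not from the theorem), compares plain gradient descent against a hand-rolled Adam update:

```python
# Compare plain GD and Adam on one ill-conditioned quadratic.
# f(w) = w0^2 + 100*w1^2, minimum at the origin.

def loss(w):
    return w[0] ** 2 + 100 * w[1] ** 2

def grad(w):
    return [2 * w[0], 200 * w[1]]

def run_gd(w, lr=0.009, steps=200):
    # lr must stay below 2/200 = 0.01 for stability on this problem.
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def run_adam(w, lr=0.1, steps=200, b1=0.9, b2=0.999, eps=1e-8):
    m, v = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, steps + 1):
        g = grad(w)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi ** 2 for vi, gi in zip(v, g)]
        mhat = [mi / (1 - b1 ** t) for mi in m]          # bias correction
        vhat = [vi / (1 - b2 ** t) for vi in v]
        w = [wi - lr * mi / (vi ** 0.5 + eps)
             for wi, mi, vi in zip(w, mhat, vhat)]
    return w

w0 = [1.0, 1.0]
print(loss(run_gd(list(w0))))    # GD: step size throttled by the stiff axis
print(loss(run_adam(list(w0))))  # Adam: per-coordinate scaling helps here
```

The point is not which one wins on this quadratic, but that the gap between them is a property of the problem's structure (here, the conditioning), which is exactly the kind of assumption NFL says we must make to prefer one method over another.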
Well, to me NFL gives a framework for choosing between two learning algorithms by running cross-validation on a dataset; in other words, for asking which algorithm will generalize better.
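That selection recipe can be sketched in a few lines of plain Python. In the sketch below, the toy 1-D dataset and both learners (a 1-nearest-neighbour rule and a majority-class baseline) are hypothetical illustrations; each learner is scored by its average held-out accuracy across k folds, and the higher-scoring one is the better bet for generalization on data like this:

```python
def k_fold_score(fit, X, y, k=5):
    """Average held-out accuracy of `fit` across k folds."""
    fold = len(X) // k
    scores = []
    for i in range(k):
        lo, hi = i * fold, (i + 1) * fold
        X_tr, y_tr = X[:lo] + X[hi:], y[:lo] + y[hi:]   # train split
        X_te, y_te = X[lo:hi], y[lo:hi]                 # held-out split
        predict = fit(X_tr, y_tr)
        scores.append(sum(predict(x) == t
                          for x, t in zip(X_te, y_te)) / len(X_te))
    return sum(scores) / k

def fit_1nn(X, y):
    # Predict the label of the closest training point.
    return lambda q: y[min(range(len(X)), key=lambda i: abs(X[i] - q))]

def fit_majority(X, y):
    # Always predict the most common training label.
    m = max(set(y), key=y.count)
    return lambda q: m

# Toy 1-D dataset: class 0 clusters near 0, class 1 near 10.
X = [0.1, 0.3, 0.2, 9.8, 10.1, 9.9, 0.4, 10.2, 0.0, 9.7]
y = [0, 0, 0, 1, 1, 1, 0, 1, 0, 1]

print(k_fold_score(fit_1nn, X, y))
print(k_fold_score(fit_majority, X, y))
```

Note that the verdict is tied to this dataset: NFL guarantees there are other data-generating distributions on which the ranking would flip.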
"The contribution of NFL is that it tells us choosing an appropriate algorithm requires making assumptions about the kinds of target functions the algorithm is being used for. With no assumptions, no 'meta-algorithm', such as the scientific method, performs better than random choice." - Wikipedia
To me, NFL is interesting; I am not sure whether it is useful.
"This means that the goal of machine learning research is not to seek a universal learning algorithm or the absolute best learning algorithm. Instead, our goal is to understand what kinds of distributions are relevant to the 'real world' that an AI agent experiences, and what kinds of machine learning algorithms perform well on data drawn from the kinds of data-generating distributions we care about." - deeplearningbook.org
However, NFL leads me to think of Zermelo–Fraenkel set theory, which freed mathematics from Russell's paradox.