Building a data-driven product differs in many ways from creating a more conventional software product. A Machine Learning system is still a software system, but the process of developing it is different. All stakeholders need to understand these differences, and a shared view of them is key for a project to be successful. In this post I will briefly try to explain the Machine Learning process and why it calls for a different approach and mindset.
In order to adopt this mindset, the most important thing to understand is that developing a Machine Learning system is much more like a scientific process than a traditional software development process. However, the solution as a whole still requires a lot of software engineering practices. Let’s see how the processes differ.
Machine Learning vs traditional programming
In traditional programming, you write down all the rules that the program needs in order to accomplish a specific task and produce the desired result. The program takes some data as input, processes it as stated by the rules, and will hopefully, in the end, return the correct result. In contrast, a machine learning system is trained rather than explicitly programmed. The input to such a system is not just the data but also the expected result for that data, and the output is a set of rules (called a model in Machine Learning vocabulary).
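To make the contrast concrete, here is a minimal sketch in plain Python. The spam scenario, the function names, and the toy data are all made up for illustration, and the "training" is a deliberately simple threshold search, not a real learning algorithm. The first function is a hand-written rule; the second derives its rule from data plus expected results.

```python
def rule_based_is_spam(num_links: int) -> bool:
    # Traditional programming: a human decides the rule (the cutoff 3 is arbitrary).
    return num_links > 3


def train_threshold(examples):
    """Learn a cutoff from (value, expected_label) pairs.

    Tries the midpoints between consecutive distinct values and keeps the
    one that misclassifies the fewest examples. Assumes at least two
    distinct values are present.
    """
    values = sorted({value for value, _ in examples})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum((value > threshold) != label for value, label in examples)

    return min(candidates, key=errors)


# Input: the data AND the expected result for each data point (toy values).
examples = [(0, False), (1, False), (2, False), (5, True), (7, True), (9, True)]

# Output: the learned rule, i.e. the "model" (here a single number).
threshold = train_threshold(examples)


def learned_is_spam(num_links: int) -> bool:
    return num_links > threshold


print(threshold)  # 3.5 for the toy data above
```

Note that nobody typed the value 3.5 into the program; it came out of the examples. Change the examples and the rule changes with them.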
An iterative process
The main difference from implementing a software feature is that all the abilities of the Machine Learning system need to be learned along the way; you cannot design them upfront. It is this characteristic that calls for a different style of working. It is a very empirical, experiment-oriented procedure: we need to try and test things out in an iterative fashion before we know whether the result is good enough.
There are many potential obstacles along the way. For example, there is always the risk that the data you have does not contain any predictive power, i.e. there is no signal that can be used to train the system. Then we need to go back to see if it is possible to find other data sources that could be used. But even if the data shows some promising results, we still need to iterate through different algorithms, different versions of the data, different settings of the model parameters, and so on in order to find the best possible model. All this is done by trial and error; you have a hypothesis and you try it out, you learn something new from the results and this generates a new set of ideas to try out in the next iteration.
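The trial-and-error loop described above can be sketched as follows. This is only an illustration under stated assumptions: the data is invented, the model is a tiny hand-rolled k-nearest-neighbours classifier, and the only "setting" being varied is `k`. In a real project the loop would also cover different algorithms and versions of the data.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Predict the label of x by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, validation, k):
    """Fraction of validation points that are predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in validation)
    return correct / len(validation)


# Toy data: (feature, label). The points at 34 and 76 are deliberately noisy.
train = [(10, "a"), (20, "a"), (30, "a"), (34, "b"),
         (60, "b"), (70, "b"), (80, "b"), (76, "a")]
validation = [(12, "a"), (33, "a"), (62, "b"), (77, "b")]

# Each value of k is one hypothesis; the validation score is the experiment.
results = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(results, key=results.get)

print(results)  # {1: 0.5, 3: 1.0, 5: 1.0}
print(best_k)   # 3 (the first of the tied best settings)
```

The point is not the algorithm but the shape of the work: you cannot know in advance that `k=1` will be fooled by the noisy points; you find out by running the experiment and then decide what to try next.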
One could argue that this process is basically an application of the scientific method. As a consequence, it is often impossible to say upfront whether the trained Machine Learning system will be able to produce the desired outcome. This also makes it important to decide early how progress should be measured.
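One possible way to decide early how progress is measured is to agree on a metric, a trivial baseline, and a target before any modelling starts. The sketch below assumes accuracy as the metric, a "predict the most common label" baseline, and a target of 0.9; all three, along with the labels, are hypothetical choices for illustration.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the expected labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline(train_labels, n):
    """Always predict the most common training label, n times."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * n


# Toy labels, invented for this example.
train_labels = ["ok"] * 8 + ["fault"] * 2
test_labels = ["ok", "ok", "ok", "fault", "ok"]

baseline_accuracy = accuracy(
    majority_baseline(train_labels, len(test_labels)), test_labels
)

TARGET_ACCURACY = 0.9  # hypothetical target agreed on with stakeholders


def good_enough(model_accuracy):
    """A model counts as progress only if it beats the baseline and meets the target."""
    return model_accuracy > baseline_accuracy and model_accuracy >= TARGET_ACCURACY


print(baseline_accuracy)  # 0.8
```

With a baseline of 0.8, a model scoring 0.85 is an improvement but not yet good enough by this definition, which is exactly the kind of statement that is useful to be able to make after each iteration.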
CRISP-DM (Cross-Industry Standard Process for Data Mining) is one process model that describes the iterative process of building Machine Learning models. Figure: Kenneth Jensen / CC BY-SA (https://creativecommons.org/licenses/by-sa/3.0)
There are several aspects that bring uncertainty into a Machine Learning project. First, there is the data itself, which can vary in quality and be incomplete, for example with missing data points. How the data was gathered, and whether the sample is a good representation of the problem domain, also matters. Then, of course, the models themselves are imperfect because they were built from such data.
As for the uncertainty in any time estimate, you will almost certainly find things along the way that you did not expect, and you will need to reconsider earlier choices. You need to experiment and try things out in order to see if you are on the right track. Also, depending on the size of the data and the complexity of the model, the training phase can take from hours to days before you can test the new version. Then, depending on the result, it could be back to the drawing board. This is why it can be very difficult to give a proper time estimate: there are many unknowns involved, and new clues emerge with every iteration.
There should, of course, be a time limit set for the project. But although this is important, you still cannot guarantee that the project will be successful within this time frame. However, it will rarely be a waste of resources no matter the result, because you will learn a lot along the way, and this will be valuable for taking the next step in the right direction.
Machine Learning is not like any other technology, but it is in many cases the only technology that can solve certain problems. We need to ensure that all people involved in the project have a common understanding of what is required, how the process works, and that we have a realistic view of what is possible with the tools at hand. To boil down all this to its core components we could consider a few important rules:
- create a common ground of understanding; this will ensure the right mindset
- state early how progress should be measured
- communicate clearly how different machine learning concepts work
- acknowledge and consider the inherent uncertainty; it is part of the process
I hope this helps clarify how these kinds of projects work and what to consider before starting out. Happy Machine Learning!
References and Resources