Machine learning is an essential part of data science – a field which covers a range of activities from data acquisition and cleaning, through to analytics and data visualisation. It can be helpful to think in terms of a machine learning workflow that puts some structure around some of these processes. This post looks at a few existing data science workflows and suggests some other approaches you might want to consider.
There are plenty of data science and machine learning workflows already out there to draw on. A common starting point is the CRISP-DM model (Cross-Industry Standard Process for Data Mining, often shortened to CRISP), which originates in the field of data mining. It can be broken down into six stages which broadly cover areas of exploration, implementation and deployment:
- Business understanding
- Data understanding
- Data preparation
- Modelling
- Evaluation
- Deployment
CRISP itself is relatively well established, and is probably already general enough to fit many situations. That said, there are some workflows which attempt to make CRISP more relevant to data science and machine learning. Some of these are quite specialist, such as those focussing on data science for Kaggle competitions or on using data to change public services, and you may also want to look at them.
A Machine Learning Workflow
The following is my attempt to make sense of different possible approaches to data science and machine learning.
1. Start with a Purpose
It’s important to try to link your analysis to something in the ‘real world’ as far as possible. Putting this concept first helps avoid going on a ‘fishing trip’ into huge data sets without really knowing why you are doing it and what you hope to achieve.
Starting with a purpose could involve trying to influence a decision, add insight or knowledge to an important area of academic research, or solve a particular problem. Not understanding the problem or issue can result in wasted time with few good or impactful results.
If you are trying to solve problems or help in a business or organisation, a good place to start is Crunchy Questions.
If you are undertaking a data science project as a learning or intellectual exercise, it is still worth trying to think about the impact and exactly what you are trying to answer. There are many resources available on how to select a good research question, such as these from the University of Birmingham or MIT.
2. Get the Data
Once you’ve got a good sense of the issue you are going to work on, you need to get your hands on the right data. What exactly this stage means will depend on your project. Some issues that you may encounter include:
- Negotiating access/usage rights
- Finding public data sets
- Overcoming challenges with open data sets
- Purchasing third party data sets
You may find there is an iterative process between stages one and two as the problem definition becomes clearer and your (or your client’s) knowledge of the data improves.
3. Review the Literature
To avoid re-inventing the wheel you should undertake a literature review. You should already have some sense of the problem area from the time you spent identifying your issue. This stage is an opportunity to develop that knowledge further, and explore what progress has already been made in the area, and whether there are lessons you can learn before you start.
It may seem a rather dry and unappealing task, but time spent reviewing the current state of the art will pay dividends in avoiding wasted effort later. Some data scientists recommend spending as much as 10 – 20% of the project’s time on the literature review.
Your review of the literature could include academic writings, blogs and solutions from Kaggle competitions or any other sources of information. You should try to familiarise yourself with domain-specific information relating to the business area, as well as the problem or question that you are trying to solve or answer. You might find that someone has already published a neat approach to a similar problem, or even that your issue is already solved!
Your literature review doesn’t need to be a formal report as such, although some of the principles of formal scientific literature reviews may be helpful. Plenty of researchers and data scientists have generously published their literature reviews online, so take inspiration from these too. Some examples include research papers [pdf], Masters theses [pdf], and general project reports [pdf].
Exactly what seems appropriate to you may depend on your project and levels of experience. Some key things you should try to include are:
- Digging deeper into the problem area
- Exploring previous solutions and the current ‘state of the art’
- Identifying possible approaches and pitfalls
4. Create an Evaluation Framework
An ‘evaluation framework’ will give you some assurance that you have done the right analysis on the right problem and have achieved the right level of performance in your models.
There are a few ways you can approach this. I quite like the description of evaluation in the Becoming a Data Scientist blog, which reminds us that getting data science right requires us to understand the data and generate insight, as well as to implement the analysis well technically.
You should be happy with the answers to questions about the quality, history and suitability of the data; that you will be able to add (testable) insight with your analysis; and that your analysis has been correctly implemented (the right analysis for the right problem).
Understanding how you will measure the success of your modelling can be an important part of establishing an evaluation framework. This can start with defining clear acceptance criteria, and having a good understanding of your issue. Measuring the success and performance of your models is also likely to come down to metrics for assessing them. Metrics are what give you an indication of whether your model is a suitable approximation of the real world.
There are plenty of lists [1, 2, 3] of common model evaluation techniques, and scikit-learn also has a fairly comprehensive list of metrics you may wish to use.
Cross validation is also an important step to consider, as it helps give some sense of how well your model would perform against an unseen data set. In its simplest form (often called a holdout split), the initial data set is separated into a ‘training set’ on which the model learns and a ‘test set’ which is used to validate the model.
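As a rough illustration of the holdout split and metrics described above, here is a minimal sketch using scikit-learn. The synthetic data set, the choice of logistic regression, the 25% test size and the two metrics (accuracy and F1) are all assumptions made for the example, not recommendations for any particular project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real project data set (an assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold back 25% of the rows as a test set the model never sees in training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training set only, then predict on the held-back test set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Two common classification metrics; which ones matter depends on the problem.
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")
```

Agreeing up front which metric counts as success, and what value is good enough, gives you acceptance criteria you can check the model against later.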
If you are working on a competition entry or similar, it may be that your metrics for success are already established. In ‘real world’ scenarios you may need to work with your client, or explore yourself, to figure out sensible ways to measure the performance of your models.
5. Explore the Data
Once you have a sense of how you will assess the success of your analytics you can start to probe and explore your data. Exploratory data analysis is the stage where you can really get to grips with your data and start to answer initial questions.
Common goals in EDA include:
- Identify important variables
- Detect outliers and anomalies
- Begin to develop models
- Explore the data structure
- Assess statistical assumptions
- Select appropriate techniques
The techniques used in EDA often involve data visualisation to help identify outliers, trends or patterns that could be of further interest and use later in the workflow. EDA also includes the commonly used five-number summary (minimum, lower quartile, median, upper quartile and maximum), which can be used to quickly characterise data. Other commonly used techniques include box plots, histograms and principal component analysis.
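To make the five-number summary concrete, here is a small sketch using NumPy. The tip amounts are made-up values, and flagging outliers with fences at 1.5 times the interquartile range is the usual box plot convention rather than anything specific to this workflow.

```python
import numpy as np


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = np.asarray(values, dtype=float)
    return tuple(np.percentile(data, [0, 25, 50, 75, 100]))


# Hypothetical tip amounts, purely for illustration.
tips = [1.0, 2.0, 2.0, 3.0, 3.5, 4.0, 5.0, 5.0, 10.0]
minimum, q1, median, q3, maximum = five_number_summary(tips)

# Box plot convention: points beyond 1.5 * IQR from the quartiles are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in tips if v < lower_fence or v > upper_fence]

print(minimum, q1, median, q3, maximum)
print("possible outliers:", outliers)
```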
EDA, however, is more about an approach or way of thinking than any particular tool or technique – delving deeper into the data can reveal phenomena not covered by simply fitting a regression. For example, a simple regression investigating tipping behaviour can hide a range of other activity, such as a preference for tipping whole-number amounts, gender differences and variance in the data. Each one of these avenues could be relevant to the question being studied.
See some more case study examples of EDA
6. Pre-process the Data
Closely related to exploratory data analysis is the pre-processing of your data itself. This stage is about getting your data ready to solve the problem or issue you identified earlier, and helps to avoid the problem of ‘garbage in, garbage out’ when you run your models.
Your EDA may have highlighted issues such as having multiple, separate data sources, or incomplete or noisy data. Your data may also not be in the right format to feed into your preferred algorithms. This is another rather unglamorous stage in the workflow, but it will pay dividends to spend some time over it. Getting the data pre-processing wrong could at best mean slow-running, inefficient analysis, or at worst mean spurious results from incorrectly used algorithms with insufficient data.
In theory this can be a simple process of working through standard procedures. You may find things more difficult if you have many datasets, large amounts of data or many fields to deal with. In these situations a more formal ‘ETL‘ (Extract-Transform-Load) process may be helpful.
Typical tasks during the data pre-processing stage might include:
- Data cleaning – e.g. imputation, smoothing, outlier removal. This is such a common and important task that there are many guides out there to help you whichever tools you use, including SAS [pdf], Excel, Python, and R.
- Data integration – This will be a particular issue if you have data from multiple sources: for example customer data, transaction data, and product data. The goal at this stage is to be able to query the ‘data’ from one place irrespective of the source. For larger projects there are tools available to help with data integration.
- Data transformation – Transformation in this sense refers to operations which will help your analysis run more efficiently, such as normalisation of values, scaling, and aggregation.
- Data reduction – This may be more of an issue the bigger your data is. Having too many attributes can be a challenge, and techniques such as principal component analysis, filtering or clustering can help.
- Data discretisation to convert continuous variables to discrete intervals.
Exactly which steps are appropriate will depend on many factors such as the type and quality of your data, the tools you are using, and the algorithms and models you intend to apply.
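Three of the tasks listed above (cleaning, transformation and discretisation) can be sketched with scikit-learn as follows. The small age and income table is invented for the example, and median imputation, standard scaling and three equal-width bins are assumptions chosen only to keep the sketch short.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Hypothetical raw data: columns are age and income, with missing values (NaN).
raw = np.array([
    [25.0, 50000.0],
    [32.0, np.nan],
    [47.0, 82000.0],
    [np.nan, 61000.0],
    [51.0, 120000.0],
    [38.0, 74000.0],
])

# Data cleaning: fill each missing value with the median of its column.
cleaned = SimpleImputer(strategy="median").fit_transform(raw)

# Data transformation: rescale each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(cleaned)

# Data discretisation: convert the age column into three equal-width bins.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
age_bins = binner.fit_transform(cleaned[:, [0]]).ravel()

print(cleaned)
print(age_bins)
```

In a real project the imputer, scaler and binner would be fitted on the training data only and then applied to the test data, so that no information leaks from the test set.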
7. Engineer Features
Feature engineering boils down to selecting the input or ‘x’ values for your model. In this context a feature is anything that helps you make a prediction with your model. Having the right number of good features will help improve the strength of your model, and make you less dependent on picking exactly the right algorithm.
There is no single approach you can take to feature engineering, as the features you will encounter will vary as much as the problems you are working on. Your literature review and EDA may well have identified some likely candidates for good features. Domain knowledge can help, but it isn’t always necessary as there are other techniques you can use to carry out feature selection, such as:
- Uni-variate feature selection – selects features by determining the significance or strength of the relationship between each input variable and the output variable.
- Recursive feature elimination – starts by fitting a model with all of the variables, and then repeatedly removes those with the smallest coefficients until the predictive accuracy of the model is sharply affected.
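Both selection techniques are available in scikit-learn, and the sketch below shows one possible way to use them. The synthetic regression data and the decision to keep three features are assumptions. Note also that scikit-learn's `RFE` stops at a fixed number of features rather than watching for a drop in accuracy; `RFECV` is the variant that uses cross-validated scores to decide how many to keep.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 input features, of which only 3 carry real signal.
X, y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Univariate selection: score each feature against y on its own, keep the top 3.
univariate = SelectKBest(score_func=f_regression, k=3).fit(X, y)
univariate_idx = univariate.get_support(indices=True)

# Recursive feature elimination: fit on all features, drop the one with the
# smallest coefficient, refit, and repeat until 3 features remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=3).fit(X, y)
rfe_idx = rfe.get_support(indices=True)

print("univariate picks:", univariate_idx)
print("RFE picks:", rfe_idx)
```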
To take feature engineering to the next level you can also do feature creation. An example of where this might be important is in predicting house prices. You would expect the land area of the plot to be important in predicting prices, but your data may only have the length and width of the plot. It may not be clear to the model (or you) how the price is linked to these characteristics. Instead you can create a new feature – the land area – from the length multiplied by the width.
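The land area example can be written in a couple of lines with pandas. The column names and the plot dimensions here are hypothetical.

```python
import pandas as pd

# Hypothetical plots, described only by their length and width in metres.
plots = pd.DataFrame({
    "length_m": [20.0, 30.0, 15.0],
    "width_m": [10.0, 12.0, 40.0],
})

# Feature creation: derive the land area from the two existing columns.
plots["land_area_m2"] = plots["length_m"] * plots["width_m"]

print(plots)
```

A linear model given only length and width cannot represent their product, which is why handing it the area directly can help.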
This process of transforming and combining existing features is often regarded more as an art than a science. Having a deep understanding of the problem area and the original features themselves can be very beneficial.
Selecting good features means getting a robust model and reducing the risk of overfitting, and also makes you less dependent on selecting exactly the right algorithm. The ‘added value’ of data scientists often comes at the feature engineering stage.
8. Select, Validate and Tune Your Models
The type of model or models you ultimately choose depends on the problem area you are working on and your data. For example, are you trying to classify your data or predict trends? Do you have historic data you can use to train your model? Your literature search and exploratory data analysis may have given you some ideas, and there is a handy guide from scikit-learn which may help you work out which models you may wish to consider.
Once you have an idea of the type of model you should be using, you can use a technique called cross validation to estimate the predictive abilities of your model. There are other techniques you could use, but cross validation is a common way to check that your model is not just a good fit to your training data.
In a simple version of cross validation you would split your sample data into two pots – a training set and a testing set. The training set gives you your model parameters, and your test set gives you confidence that they are more widely applicable than just to the training set. Cross validation won’t guarantee that your model is sensible, but it does help reduce the risk that it is not. Run these validation techniques on different models to check which model or models are most appropriate for your problem.
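A common extension of the two-pot split is k-fold cross validation, where the data is divided into k parts and each part takes a turn as the test set. The sketch below uses scikit-learn's `cross_val_score` to compare two candidate models in this way. The synthetic data, the two candidates and the choice of five folds are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real project data set (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Two hypothetical candidate models to compare.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

# 5-fold cross validation: each model is trained on 4 folds, scored on the 5th,
# and the process is repeated so every fold is used for testing once.
mean_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = fold_scores.mean()

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best on this data:", best_name)
```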
To take the tuning of your model further you can use ‘hyper-parameter’ optimisation. This involves optimising settings such as the learning rate or the complexity of the model. Hyperparameters are not learned by the model in the same way that conventional parameters are learned; instead, hyperparameter optimisation generally works by searching through the hyperparameter space of each model to find an optimum solution.
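One straightforward way to search the hyperparameter space is an exhaustive grid search, sketched below with scikit-learn's `GridSearchCV`. The support vector classifier, the grid of `C` and `gamma` values and the synthetic data are assumptions for the example. Randomised and Bayesian searches are alternatives when the grid becomes too large.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data stands in for a real project data set (an assumption).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "gamma": ["scale", 0.01],
}

# Each combination is scored with 5-fold cross validation on the data.
search = GridSearchCV(SVC(), param_grid, cv=5).fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```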
A final step you can take to improve your machine learning solutions is to use an ensemble of models. By combining the predictions of multiple models it can be possible to improve on the predictions of any single one of them. There is a range of ensemble approaches you could explore, including:
- Bagging – works by combining the predictions of models trained on randomly generated training sets.
- Boosting – similar to bagging, but rather than using randomly selected training sets, each new model is trained with an emphasis that depends on the previous models’ predictions.
- Blending and Stacking – The outputs of a first layer of trained models are used to train the second layer, and so on.
Ensemble solutions are frequently used in Kaggle competitions.
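The three ensemble approaches listed above each have a counterpart in scikit-learn, and the sketch below compares them on synthetic data. The base models, the numbers of estimators and the data are assumptions. Gradient boosting is used here as one boosting variant, in which each new tree is fitted to the errors left by the trees before it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real project data set (an assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

ensembles = {
    # Bagging: many trees, each trained on a random bootstrap sample of rows.
    "bagging": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=25, random_state=0
    ),
    # Boosting: trees are added one at a time, each correcting earlier errors.
    "boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    # Stacking: first-layer predictions become inputs to a second-layer model.
    "stacking": StackingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
            ("logreg", LogisticRegression(max_iter=1000)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
}

# Score each ensemble with 3-fold cross validation.
scores = {
    name: cross_val_score(model, X, y, cv=3).mean()
    for name, model in ensembles.items()
}
print(scores)
```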
9. Finish Up
Exactly what this looks like will depend on your initial issue, problem or business area. Were you doing some research? In which case, publish your results. Were you entering a competition such as Kaggle? In which case, submit your entry and share your working. If you were supporting a business decision or operational function, you may need to work with the business to make sure your model is implemented correctly, and be prepared to respond to feedback from the business and to any new operational data.