Category Archives: Data Science

Get Started With PySpark

PySpark brings together the analytical power and popularity of Python with the distributed-computing capability of Spark. In this post I show how you can use a Docker container with PySpark and Spark pre-loaded to let you play with PySpark in a Jupyter notebook, rather than having to configure your own Spark cluster first. Use Jupyter… Read More »

Software Testing vs Scientific Rigour

The COVID-19 global pandemic has thrust modelling and analysis into the public eye in a way rarely seen before. One particular example is the COVID-19 model developed by MRC-IDE, which contributed to the UK response to the pandemic. The research team behind this model published their code on GitHub. As might be expected… Read More »

Save and Load Scikit-learn Models

Once you have trained a scikit-learn model, it is not obvious how to deploy it and use it to score unseen data. This post shows you how to save and load scikit-learn models so you can run them against unseen data. Train Your Model The first step is to train the model… Read More »