Category Archives: Data Science

What Are Confidence Intervals?

Confidence intervals are the ranges often quoted alongside other statistics. As their name suggests, they convey a degree of uncertainty in the statistic quoted, but what actually are confidence intervals? And how can we get an intuitive understanding of them? This post goes into some of the theory behind confidence intervals and points you… Read More »

Wikipedia Data Stream

Streaming data is an important part of modern data processing. If you are just starting out, and perhaps don’t yet work somewhere with access to a big data streaming infrastructure, it can be hard to know where to start. This post talks you through a simple Wikipedia data stream example from the Wikimedia documentation. Wikipedia… Read More »

String Comparison Techniques

String comparison is important for topics such as natural language processing and record linkage. This post gives a few examples of string comparison techniques that you may wish to consider. Each of these string comparison techniques makes different assumptions or simplifications. You may wish to try several techniques or use a hybrid… Read More »

Get Started With PySpark

PySpark brings together the analytical power and popularity of Python with the distributed-computing capability of Spark. In this post I show how you can use a Docker container with PySpark and Spark pre-loaded to let you play with PySpark in a Jupyter notebook, rather than having to configure your own Spark cluster first. Use Jupyter… Read More »

Software Testing vs Scientific Rigour

The COVID-19 global pandemic has thrust modelling and analysis into the public eye in a way rarely seen before. One particular example is the COVID model developed by MRC-IDE, which contributed to the UK response to the pandemic. The research team behind this model published their code on GitHub. As might be expected… Read More »