Category Archives: Data Science

Monte Carlo Simulation

Monte Carlo simulation is a technique for approximating future behaviour based on randomly sampled numbers. By sampling from different probability distributions, it is possible to apply Monte Carlo simulation to a range of situations, including physical systems, computer games, and finance. This post gives a simple example of Monte Carlo simulation to give some… Read More »

Monte Carlo Integration

Monte Carlo integration uses random numbers to approximate the solutions to integrals. While not as sophisticated as some other numerical integration techniques, Monte Carlo integration is still a valuable tool to have in your toolbox. Monte Carlo integration is one type of Monte Carlo method – a family of techniques which use randomly generated numbers… Read More »
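As a rough illustration of the idea in this excerpt, the sketch below estimates an integral by averaging a function at uniformly random points and scaling by the interval width. The function name `mc_integrate`, the sample count, and the fixed seed are assumptions made for this example and are not taken from the post itself.

```python
import random


def mc_integrate(f, a, b, n=100_000, seed=0):
    """Estimate the integral of f over [a, b] from n uniform random samples."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    # The mean of f over the samples, times the interval width, approximates the integral.
    return (b - a) * total / n


# Integral of x^2 over [0, 1]; the exact value is 1/3.
estimate = mc_integrate(lambda x: x * x, 0.0, 1.0)
print(estimate)
```

The error of this kind of estimate shrinks roughly with the square root of the number of samples, which is why more specialised numerical methods are often preferred in low dimensions.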

What Are Confidence Intervals?

Confidence intervals are the ranges often provided alongside other statistics. As their name suggests, they convey a degree of uncertainty in the statistic quoted, but what actually are confidence intervals? And how can we get an intuitive understanding of them? This post goes into some of the theory behind confidence intervals and points you… Read More »
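To make the idea concrete, here is a minimal sketch of one common way to compute a confidence interval for a sample mean. It assumes a normal approximation with z = 1.96 for a roughly 95% interval; the function name `mean_confidence_interval` and the sample data are made up for illustration, and the post itself may take a different approach.

```python
import math
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Return (low, high) for the sample mean using a normal approximation.

    z=1.96 corresponds to roughly 95% coverage; for small samples a
    t-distribution critical value would usually be more appropriate.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * sem, mean + z * sem


# Hypothetical measurements, used only to show the call.
low, high = mean_confidence_interval([2, 4, 4, 4, 5, 5, 7, 9])
print(low, high)
```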

Wikipedia Data Stream

Streaming data is an important part of modern data processing. If you are just starting out, and perhaps don’t yet work somewhere with access to a big data streaming infrastructure, it can be hard to know where to start. This post talks you through a simple Wikipedia data stream example from the Wikimedia documentation. Wikipedia… Read More »

String Comparison Techniques

String comparison is important for topics such as natural language processing and record linkage. This post gives a few examples of string comparison techniques that you may wish to consider. Each of these string comparison techniques makes different assumptions or simplifications. You may wish to try several techniques or use a hybrid… Read More »
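The excerpt does not say which techniques the post covers, so the sketch below shows one widely used option, Levenshtein (edit) distance, purely as an assumed example. It counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
```

Edit distance treats every character change equally, which is one of the simplifying assumptions a technique like this makes; record linkage work often combines it with other measures.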