Tag Archives: data science

Python Fake Data With Faker

Fake data can be invaluable for testing or demonstrating behaviour without using live, production data. This lets you protect your production data or help you get started when you don’t yet have a production system set up. This post gives an overview of the Python fake data package faker, which is invaluable for generating this… Read More »

Monte Carlo Simulation

Monto Carlo simulation is a technique for approximating future behaviour based on randomly sampled numbers. By sampling from different probability distributions it is possible to use Monte Carlo simulation for a range of different situations including physical systems, computer games or finance. This post gives a simple example of Monte Carlo simulation to give some… Read More »

Get Started With PySpark

Pyspark brings together the analytical power and popularity of Python with the distributed-computing capability of Spark. In this post I show how you can use a docker container with pyspark and spark pre-loaded to let you play with pyspark in a Jupyter notebook, rather than having to configure your own spark cluster first. Use Jupyter… Read More »

The ROC Curve

The ROC Curve is a commonly used method for and evaluating the performance of classification models. ROC curves use a combination the false positive rate (i.e. occurrences that were predicted positive, but actually negative) and true positive rate (i.e. occurrences that were correctly predicted) to build up a summary picture of the classification performance. ROC… Read More »