Tag Archives: data science

Python Web Scraping

Web scraping is the process of automatically identifying and downloading data from a webpage. This blog post looks at a few Python web scraping options. Alternatives to web scraping include: Using an API, which is usually a better option than web scraping if one is available (see also Creating an API with Flask). Open data… Read More »

Python Fake Data With Faker

Fake data can be invaluable for testing or demonstrating behaviour without using live, production data. It lets you protect your production data and helps you get started when you don’t yet have a production system set up. This post gives an overview of the Python fake data package Faker, which is invaluable for generating this… Read More »

Python Compare Wikipedia Pages

Wikimedia has an API which lets you compare Wikipedia pages, and in some cases modify pages and information within the Wikimedia group. The main page for all Wikimedia API information is here: In this post I am most interested in the Wikipedia compare API, to show how you use it to see differences between versions… Read More »

Monte Carlo Simulation

Monte Carlo simulation is a technique for approximating future behaviour based on randomly sampled numbers. By sampling from different probability distributions it is possible to use Monte Carlo simulation for a range of different situations including physical systems, computer games or finance. This post gives a simple example of Monte Carlo simulation to give some… Read More »

Get Started With PySpark

PySpark brings together the analytical power and popularity of Python with the distributed-computing capability of Spark. In this post I show how you can use a Docker container with PySpark and Spark pre-loaded to let you play with PySpark in a Jupyter notebook, rather than having to configure your own Spark cluster first. Use Jupyter… Read More »