Wikipedia Data Stream

Streaming data is an important part of modern data processing. If you are just starting out, and perhaps don't yet work somewhere with access to big data streaming infrastructure, it can be hard to know where to begin. This post talks you through a simple Wikipedia data stream example from the Wikimedia documentation. Wikipedia…

String Comparison Techniques

String comparison is important for topics such as natural language processing and record linkage. This post gives a few examples of string comparison techniques that you may wish to consider. String Comparison Techniques Each of these string comparison techniques makes different assumptions or simplifications. You may wish to try several techniques or use a hybrid…

Get Started With PySpark

PySpark brings together the analytical power and popularity of Python with the distributed-computing capability of Spark. In this post I show how you can use a Docker container with PySpark and Spark pre-loaded to let you play with PySpark in a Jupyter notebook, rather than having to configure your own Spark cluster first. Use Jupyter…

Save and Load Sci-kit Learn Models

Once you have trained a scikit-learn model, it is not obvious how to deploy it and use it to score unseen data. This post shows you how to save and load scikit-learn models so you can run them against unseen data. Train Your Model The first step is to train the model…