Python Fake Data With Faker

By | December 28, 2020

Fake data can be invaluable for testing or demonstrating behaviour without using live, production data. This lets you protect your production data or help you get started when you don’t yet have a production system set up. This post gives an overview of the Python fake data package faker, which is invaluable for generating this data.

Faker Basics

Faker is easily able to handle basic biographic information such as name, address, phone number, sign-post other providers. An example of some of these basics is below:


from faker import Faker

fake_data_generator = Faker()
Faker.seed(123)


# Some basics
# (defaults to US locale)

# Name
fake_data_generator.name()

# Address
fake_data_generator.address()

# Phone number
fake_data_generator.phone_number()

# Generate a whole profile
fake_data_generator.profile()

Faker uses the concept of a ‘provider’ to contain similar types of fake data. A provider can hold multiple methods, which are what you call to generate the data. The methods in each provider are available via the main faker object.

For example the address provider contains the ‘address’ method to produce a whole address, as well as a ‘building_number’ to only generate a building number.

A list of available providers is available here.

There are also community-made providers, see here.

Creating a Custom Provider

The pre-made providers are often sufficient, but you may want to create your own if you have uncommon data you wish to fake, or find that the built-in providers do not have the richness you require.

Creating a custom provider is a matter of creating a class for your provider, along with generator methods for the fake data you wish to generate. Finally, the custom provider is added to the maker Faker generator object so the new provider methods are accessible.

In the following example we create a new generator for nationalities.

import random 
# we'll need this later to make a random choice from a list of possibles

fake_nationality_generator = Faker()

from faker.providers import BaseProvider
class NationalityProvider(BaseProvider):
    def nationality(self):
        nationality = ['French', 'Indian', 'Chinese']

        return random.choice(nationality)

fake_nationality_generator.add_provider(NationalityProvider)

fake_nationality_generator.nationality()

Working With Different Locales

When working with fake data it is likely that you’ll want to have different information depending on the locale, for example different names and addresses for different nationalities. Not all providers are available for all locales. You can see a list of providers and locales here.

Setting the locale you are interested in can be done by specifying the locale when we instantiate the Faker object. In the example below we instantiate first a German faker object and then a Faker object based on Italian, US and Japanese locales.

Read more about working with locales.

from faker import Faker

german_generator = Faker(['de_DE'])
german_generator.name()

# It is also possible to 
multi_locale_generator = Faker(['it_IT', 'en_US', 'ja_JP'])
multi_locale_generator.name()

Generating Larger Multiple Records

You may find you want to use Faker to generate large amounts of data, rather than just single rows or items. You can wrap the faker in a for loop to generate multiple records (or use list comprehension).


from faker import Faker
multi_faker = Faker()

for i in range(10):
    print(multi_faker.name())

# or use list comprehension:

[multi_faker.name() for i in range(10)]

You can also include those list comprehensions in a pandas data frame include a list comprehension in the data frame creation code


from faker import Faker
import pandas as pd
multi_locale_generator = Faker(['de_DE', 'en_US', 'fr_FR'])
multi_locale_generator.name()

num_records = 5
df = pd.DataFrame({'name': [multi_locale_generator.name() for i in range(num_records)],
                   'date_of_birth': [multi_locale_generator.date_of_birth() for i in range(num_records)],
                   'address': [multi_locale_generator.address() for i in range(num_records)]
                   })

If you have performance issues with generating large volumes of fake data then you may want to check out a similar project mimesis.