Generate Fake Data using Python

As a data engineer, one of the tasks you do almost daily is loading huge amounts of data into your data warehouse or data lake. Sometimes, to benchmark load times or reproduce performance tuning issues in your test environment, you need test datasets. While there are a lot of very good, huge open datasets available on Kaggle and AWS, sometimes you don't need actual data; all you need is a CSV file with dummy data in it.

Fear not, Python comes to the rescue. Python is the golden goose of the information age: not only can it help you sort through massive amounts of data, it can also help you generate data.

Faker is a Python package that generates fake data for you. First, install it with pip (this exercise uses Python 3.7.2):

$ python -m pip install faker

— Script to Generate a CSV file with Fake Data and 1 Billion Rows —

Caution: The file size will be about 1.3GB, and generating it can really hammer your machine. I have an EC2 instance on which I generate this test data and leave it running in the background. You can use multiprocessing in Python and hammer all cores, but that is a discussion worthy of its own blog post.
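Because the final size depends on how long the generated values are, it can help to estimate it from a small sample before starting a long run. The sketch below writes a sample to memory and extrapolates. `estimate_csv_bytes` and `stand_in_row` are hypothetical names, and the stand-in row uses only the standard library; in practice you would pass a function that returns your Faker-based row.

```python
import csv
import io
import random
import string


def estimate_csv_bytes(make_row, fieldnames, total_rows, sample_rows=1000):
    """Extrapolate total CSV size in bytes from a small in-memory sample."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    header_bytes = len(buffer.getvalue().encode("utf-8"))
    for _ in range(sample_rows):
        writer.writerow(make_row())
    sample_bytes = len(buffer.getvalue().encode("utf-8")) - header_bytes
    # Average row size from the sample, scaled up to the full row count.
    return header_bytes + (sample_bytes * total_rows) // sample_rows


def stand_in_row():
    """Hypothetical placeholder row; swap in your Faker-based dict."""
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(8))
    return {"username": name, "likesports": random.choice([True, False, None])}


if __name__ == "__main__":
    size = estimate_csv_bytes(stand_in_row, ["username", "likesports"], 1_000_000)
    print(f"Estimated size: {size / 1e9:.2f} GB")
```

The estimate is only as good as the sample, so treat it as a rough guide for checking free disk space rather than an exact figure.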

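import csv

from faker import Faker

RECORD_COUNT = 1000000000
fake = Faker()

# Column order for the CSV header; must match the keys written below.
FIELDNAMES = [
    'userid', 'username', 'firstname', 'lastname', 'city', 'state',
    'email', 'phone', 'cardno', 'likesports', 'liketheatre',
    'likeconcerts', 'likejazz', 'likeclassical', 'likeopera',
    'likerock', 'likevegas', 'likebroadway', 'likemusicals',
]


def create_csv_file():
    # newline='' lets the csv module control line endings itself.
    with open('users1.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        writer.writeheader()
        for _ in range(RECORD_COUNT):
            writer.writerow(
                {
                    'userid': fake.ean8(),
                    'username': fake.user_name(),
                    'firstname': fake.first_name(),
                    'lastname': fake.last_name(),
                    'city': fake.city(),
                    'state': fake.state_abbr(),
                    'email': fake.email(),
                    'phone': fake.phone_number(),
                    'cardno': fake.credit_card_number(card_type=None),
                    'likesports': fake.null_boolean(),
                    'liketheatre': fake.null_boolean(),
                    'likeconcerts': fake.null_boolean(),
                    'likejazz': fake.null_boolean(),
                    'likeclassical': fake.null_boolean(),
                    'likeopera': fake.null_boolean(),
                    'likerock': fake.null_boolean(),
                    'likevegas': fake.null_boolean(),
                    'likebroadway': fake.null_boolean(),
                    'likemusicals': fake.null_boolean(),
                }
            )


if __name__ == '__main__':
    create_csv_file()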

This will create a file, users1.csv, with a billion rows of generated fake data that looks almost like real data.
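The caution above mentions using multiprocessing to keep all cores busy. One way to approach it, sketched here under assumptions rather than as the author's method, is to split the row count into chunks and have each worker process write its own part file. `split_rows`, `write_part`, and `generate_parallel` are hypothetical names, the part file naming is made up for illustration, and only two columns are written to keep the sketch short.

```python
import csv
import multiprocessing as mp


def split_rows(total_rows, parts):
    """Split total_rows into `parts` chunk sizes that differ by at most one."""
    base, extra = divmod(total_rows, parts)
    return [base + (1 if i < extra else 0) for i in range(parts)]


def write_part(args):
    """Worker: write `rows` fake records to its own part file."""
    part_index, rows = args
    # Imported inside the worker so each process sets up Faker on its own.
    from faker import Faker

    fake = Faker()
    # Distinct seed per part so workers do not produce identical rows.
    fake.seed_instance(part_index)
    path = f"users1_part{part_index}.csv"  # hypothetical naming scheme
    with open(path, "w", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["username", "email"])
        writer.writeheader()
        for _ in range(rows):
            writer.writerow({"username": fake.user_name(), "email": fake.email()})
    return path


def generate_parallel(total_rows, workers=None):
    """Write total_rows across one part file per worker; return the paths."""
    workers = workers or mp.cpu_count()
    chunks = list(enumerate(split_rows(total_rows, workers)))
    with mp.Pool(processes=workers) as pool:
        return pool.map(write_part, chunks)
```

Calling `generate_parallel(1000000)` under an `if __name__ == '__main__':` guard would produce one part file per core. Each part has its own header row, so the parts need to be concatenated (dropping the extra headers) or loaded individually.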
