Simple Script to Connect & Query AWS Redshift from Python 3.7

Pre-requisite

You need the psycopg2 adapter for Python, which works with any PostgreSQL-compatible database, including Redshift.

To install psycopg2

$ sudo python3.7 -m pip install psycopg2

Sometimes installing from source fails with an error; in that case, install the pre-built binary package instead:

$ sudo python3.7 -m pip install psycopg2-binary

import psycopg2

# Obtain a connection to Redshift
con = psycopg2.connect(dbname='dev', host='your-redshift-cluster.redshift.amazonaws.com',
                       port='5439', user='YourUser', password='YourPassword')

# Open a cursor and run a query
cur = con.cursor()
cur.execute("SELECT SYSDATE;")

# Print the output
print(cur.fetchall())

# Clean up
cur.close()
con.close()

Output:

[(datetime.datetime(2019, 4, 18, 23, 57, 0, 289196),)]
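
If you run this from a script or a cron job, you probably do not want to hard-code credentials. Below is a minimal sketch of the same query, assuming the connection details are exported as environment variables (the variable names here are my own choice) and using psycopg2's connection and cursor context managers:

import os
import psycopg2

# Assumed environment variable names -- adjust to whatever you actually use
con = psycopg2.connect(dbname=os.environ.get('REDSHIFT_DB', 'dev'),
                       host=os.environ['REDSHIFT_HOST'],
                       port=os.environ.get('REDSHIFT_PORT', '5439'),
                       user=os.environ['REDSHIFT_USER'],
                       password=os.environ['REDSHIFT_PASSWORD'])

# The connection context manager commits (or rolls back) the transaction,
# and the cursor context manager closes the cursor for us
with con:
    with con.cursor() as cur:
        cur.execute("SELECT SYSDATE;")
        print(cur.fetchall())

# The connection itself still has to be closed explicitly
con.close()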

Generate Fake Data using Python

As a data engineer, one of the tasks you have to do almost daily is load huge amounts of data into your data warehouse or data lake. Sometimes, to benchmark load times or reproduce performance-tuning issues in your test environment, you need test datasets. There are plenty of very good, large open datasets available on Kaggle and AWS.

But often you do not need actual data; all you need is a CSV file with dummy data in it. Fear not, Python comes to the rescue. Python is the golden goose of the information age: not only can it help you sort through massive amounts of data, it can also help you generate data.

Faker is a Python package that generates fake data for you. First you need to pip install faker. For this exercise we are using Python 3.7.2

$ python -m pip install faker
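
Before generating a full file, it is worth poking around in a Python shell to see what Faker produces. A quick sketch (the values in the comments are just illustrative; every run returns different data):

from faker import Faker

fake = Faker()

# Each call returns a new random value
print(fake.user_name())   # e.g. 'jsmith42'
print(fake.city())        # e.g. 'Lake Amandafort'
print(fake.state_abbr())  # e.g. 'TX'
print(fake.email())       # e.g. 'mary.jones@example.org'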

— Script to Generate a CSV file with Fake Data and 1 Billion Rows —

Caution: The file will be about 1.3GB and generating it can really hammer your machine. I have an EC2 instance on which I generate this test data and leave it running in the background. You could use multiprocessing in Python and hammer all the cores, but that is a discussion worthy of its own blog post.

import csv
import random
from time import time
from decimal import Decimal
from faker import Faker

RECORD_COUNT = 1000000000
fake = Faker()


def create_csv_file():
    # Write the fake records to users1.csv as they are generated,
    # so the whole dataset never has to fit in memory
    with open('users1.csv', 'w', newline='') as csvfile:
        fieldnames = ['userid', 'username', 'firstname', 'lastname', 'city',
                      'state', 'email', 'phone', 'cardno', 'likesports',
                      'liketheatre', 'likeconcerts', 'likejazz', 'likeclassical',
                      'likeopera', 'likerock', 'likevegas', 'likebroadway',
                      'likemusicals']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for i in range(RECORD_COUNT):
            writer.writerow(
                {
                    'userid': fake.ean8(),
                    'username': fake.user_name(),
                    'firstname': fake.first_name(),
                    'lastname': fake.last_name(),
                    'city': fake.city(),
                    'state': fake.state_abbr(),
                    'email': fake.email(),
                    'phone': fake.phone_number(),
                    'cardno': fake.credit_card_number(card_type=None),
                    'likesports': fake.null_boolean(),
                    'liketheatre': fake.null_boolean(),
                    'likeconcerts': fake.null_boolean(),
                    'likejazz': fake.null_boolean(),
                    'likeclassical': fake.null_boolean(),
                    'likeopera': fake.null_boolean(),
                    'likerock': fake.null_boolean(),
                    'likevegas': fake.null_boolean(),
                    'likebroadway': fake.null_boolean(),
                    'likemusicals': fake.null_boolean(),
                }
            )


if __name__ == '__main__':
    create_csv_file()

This will create a file called users1.csv with a billion rows of generated fake data that looks almost like real data.
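
If you then want to load the generated file into Redshift, the usual route is to upload it to S3 and COPY it in. Here is a rough sketch using the psycopg2 connection from the first section; the bucket, target table and IAM role are placeholders, and the target table must already exist with matching columns:

import psycopg2

con = psycopg2.connect(dbname='dev', host='your-redshift-cluster.redshift.amazonaws.com',
                       port='5439', user='YourUser', password='YourPassword')
cur = con.cursor()

# Placeholders: replace the bucket/key, the target table name and the IAM role ARN
cur.execute("""
    COPY users
    FROM 's3://your-bucket/users1.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
""")
con.commit()

cur.close()
con.close()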


AWS Redshift Insert into Table without S3

Sometimes for quick tests you just want to generate random data directly into a table, without staging anything in S3. The following does exactly that.

drop table if exists seed;

create table seed ( n int8 );

insert into seed (
SELECT
p0.n
+ p1.n*2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as number
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
Order by 1
);

commit;

drop table if exists test_table;

create table test_table(
ingest_time timestamp encode zstd,
doi date encode zstd,
id int encode bytedict,
value float encode zstd,
data_sig varchar(32) encode zstd
) DISTKEY(id) SORTKEY(ingest_time);

commit;

insert into test_table (
select dateadd('msec', -10*n, getdate()) as ingest_time,
       trunc(dateadd('msec', -10*n, getdate())) as doi,
       id,
       n::float / 1000000 as value,
       'sig-' || to_hex(n % 16) as data_sig
from (select (a.n + b.n + c.n + d.n) as n, (random() * 1000)::int as id
      from seed a
      cross join (select n*256 as n from seed) b
      cross join (select n*65536 as n from seed) c
      cross join (select n*16777216 as n from (select distinct (n/16)::int as n from seed) e) d
     ) t
order by ingest_time
);

commit;

analyze test_table;

select count(*) from test_table;

Consecutive runs of the above insert query will add about 268 million rows per execution.

This way you can create a table with about 1 billion rows (roughly four runs of the insert) in 8 minutes on a ds2.xlarge cluster.
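
If you would rather script the repeated inserts than paste the SQL by hand, here is a rough sketch that reuses the psycopg2 connection from the first section. It assumes the seed table and test_table above already exist and that the big INSERT statement has been saved to a file called insert_test_table.sql (the file name is my own choice):

import psycopg2

# Read the INSERT ... SELECT statement shown above from a local file
with open('insert_test_table.sql') as f:
    insert_sql = f.read()

con = psycopg2.connect(dbname='dev', host='your-redshift-cluster.redshift.amazonaws.com',
                       port='5439', user='YourUser', password='YourPassword')
cur = con.cursor()

# Each execution adds roughly 268 million rows, so 4 runs gives ~1 billion
for run in range(4):
    cur.execute(insert_sql)
    con.commit()
    cur.execute("SELECT COUNT(*) FROM test_table;")
    print('rows after run', run + 1, ':', cur.fetchone()[0])

cur.close()
con.close()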

Move Oracle Database 12c from On-Premise to AWS RDS Oracle Instance using SQL Developer

Amazon Web Services has been gaining popularity in the last few years as cloud computing has moved into the spotlight, and traditional enterprises are slowly making the journey to the cloud. Oracle is considered one of the most mission-critical applications in the enterprise, and moving an Oracle database to the cloud can bring benefits from both an operational and a financial perspective.

In this exercise we will move an on-premise Oracle DB schema to an AWS RDS instance running Oracle 12cR1.

Pre-requisites:

1. You already have a source Oracle database installed

2. You know how to provision an AWS RDS Oracle Instance

3. You have access to both instances

4. You have basic understanding of AWS S3 and AWS console

5. You have the latest version of SQL Developer installed on your machine

Source DB:

Oracle 12cR1 (12.1.0.2) running on CentOS 7.1

Destination DB:

Oracle 12cR1 running on AWS RDS Instance

High Level Steps to Migrate:

1. Create the destination Oracle 12cR1 instance on AWS. Provisioning an Oracle DB on AWS RDS is one of the easiest parts of the process

2. Connect to both the source (on-prem) and destination (AWS) databases from SQL Developer

3. Go to Tools > Database Copy and select the source and destination databases

I prefer to do a Tablespace Copy, since most of the apps I work with reside in a single tablespace, but this is up to you. You can choose objects, schemas, or even entire tablespaces to be copied across.

IMPORTANT: Make sure you have created the source schema in the destination database before proceeding to the next step, else you will get a "User does not exist" error.

In the destination AWS RDS instance, run the commands below:

SQL> create user <source-schema-name> identified by <password123>;

SQL> grant dba to <source-schema-name>;

4. Start the Database Copy

5. Check the Performance Insights console to see what is happening in the background

6. Query the destination database to see if the objects are valid and have arrived

SQL> select * from user_tables;

SQL> select * from dba_objects where status='INVALID';
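
If you want to script this post-migration check instead of running it by hand in SQL Developer, here is a minimal sketch using the cx_Oracle package (not covered above; the endpoint, service name and credentials are placeholders):

import cx_Oracle

# Placeholders -- replace with your RDS endpoint, service name and credentials
dsn = cx_Oracle.makedsn('your-rds-endpoint.rds.amazonaws.com', 1521, service_name='ORCL')
con = cx_Oracle.connect(user='source_schema_name', password='password123', dsn=dsn)
cur = con.cursor()

# How many tables landed in the migrated schema?
cur.execute("SELECT COUNT(*) FROM user_tables")
print('tables:', cur.fetchone()[0])

# Any objects left invalid by the copy?
cur.execute("SELECT object_name, object_type FROM user_objects WHERE status = 'INVALID'")
for object_name, object_type in cur:
    print('INVALID:', object_type, object_name)

cur.close()
con.close()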