AWS Lambda Function to Load Data from ‘Requestor Pays’ S3 bucket in One Account to Redshift Cluster in Another Account

Part A : Create Redshift Spectrum Cross-Account Access for S3 


Company Account A: Redshift Cluster Account: 24xxxxxx16
Role: RoleA

Company Account B: S3 Bucket Account: 8xxxxxxxx11
Role: RoleB
Bucket Name (Create with Option “Requestor Pays”): s3://shadmha-us-east-2

Use Case:  1. Read Data from S3 Bucket in different account into Spectrum Table
                   2. Unload Data from Redshift Cluster to S3 bucket in different account

======================================================

Step 1: In Redshift Cluster Account 24xxxxxx16, do this

a)    Go to IAM > Roles > Create Role
b)    Create Role > Redshift > Redshift – Customizable.
c)    No need to add policies or tags, go ahead and save this role as “RoleA”
d)    Add this role to your Redshift cluster. Goto Redshift Console > Select Cluster > Manage IAM > Add “RoleA” to Cluster

Step 2: In account which has the S3 Bucket Account 8xxxxxxxx11, do this: 

a)    Go to IAM > Policies > Create policy
b)    Select the JSON tab and add below IAM policy, replace my bucket name ‘shadmha-us-east-2’ with your bucket name

{
    “Version”: “2012-10-17”,
    “Statement”: [
        {
            “Effect”: “Allow”,
            “Action”: [
                “s3:GetBucketLocation”,
                “s3:GetObject”,
                “s3:PutObject”,
                “s3:ListMultipartUploadParts”,
                “s3:ListBucket”,
                “s3:ListBucketMultipartUploads”
            ],
            “Resource”: [
                “arn:aws:s3:::shadmha-us-east-2”,
                “arn:aws:s3:::shadmha-us-east-2/*”
            ]
        },
        {
            “Effect”: “Allow”,
            “Action”: [
                “glue:CreateDatabase”,
                “glue:DeleteDatabase”,
                “glue:GetDatabase”,
                “glue:GetDatabases”,
                “glue:UpdateDatabase”,
                “glue:CreateTable”,
                “glue:DeleteTable”,
                “glue:BatchDeleteTable”,
                “glue:UpdateTable”,
                “glue:GetTable”,
                “glue:GetTables”,
                “glue:BatchCreatePartition”,
                “glue:CreatePartition”,
                “glue:DeletePartition”,
                “glue:BatchDeletePartition”,
                “glue:UpdatePartition”,
                “glue:GetPartition”,
                “glue:GetPartitions”,
                “glue:BatchGetPartition”
            ],
            “Resource”: “*”
        }
    ]
}

Chose Review Policy & Save the policy as let’s say ‘s3-cross-account-policy’

c)    Go to Roles > Create Role > Select type of trusted entity as ‘Another AWS Account’ tab
d)    Enter Account ID of Redshift Cluster Account ‘24xxxxxx16’ > Permissions > Search policy created in a)  “s3-cross-account-policy’
e)    Go next > create role > save it as “RoleB”
f)    Go to Roles > Select “RoleB” > “Trust Relationships” tab > Edit trust telationships. Add the below policy:

{
  “Version”: “2012-10-17”,
  “Statement”: [
    {
      “Effect”: “Allow”,
      “Principal”: {
        “AWS”: “arn:aws:iam::24xxxxxx16:root”
      },
      “Action”: “sts:AssumeRole”,
      “Condition”: {}
    }
  ]
}

Update the trust policy

Step 3: Go back to Account under which Redshift Cluster is created

a)    Go to IAM > Roles > Select role which you created earlier “RoleA”
b)    Add inline policy to this role and add the below policy and save it

{
    “Version”: “2012-10-17”,
    “Statement”: [
        {
            “Sid”: “Stmt1487639602000”,
            “Effect”: “Allow”,
            “Action”: [
                “sts:AssumeRole”
            ],
            “Resource”: “arn:aws:iam::80xxxxx11:role/RoleB”
        }
    ]
}
c)    Create policy and Save it to role

Part B: Deploy a Lambda Function Using Attached Code(S3-to-Redshift.zip). And Change Your Cluster and Bucket Details Accordingly

Add a Cloud Watch Event Trigger with Cron Expression : cron(0 2 ? * FRI *)

Increase Timeout & Memory of Lambda Function

Configure Test Event

Execute the Lambda Function to Test

Python Code for Lambda Function

#######################################################################################
# Author         :      Shadab Mohammad
# Create Date    :      13-05-2019
# Modified Date  :      26-09-2019
# Name           :      Load Dataset from AWS S3 bucket to your Redshift Cluster
# Dependencies   :      Requires Python 3.6+. Python Libraries required ‘psycopg2’
#######################################################################################
import psycopg2
import csv
import time
import sys
import os
import datetime
from datetime import date
datetime_object = datetime.datetime.now()


print (“###### Load Data From S3 to Redshift ######”)
print (“”)
print (“Start TimeStamp”)
print (“—————“)
print(datetime_object)
print (“”)

def lambda_handler(event, context):
        #Obtaining the connection to RedShift
    con=psycopg2.connect(dbname= ‘testdb’, host=’shadmha-us-east-2.crhzd8dtwytq.us-east-2.redshift.amazonaws.com’, port= ‘5439’, user= ‘awsuser’, password= ‘SomeP@ssword’)

    copy_command_1=”copy connection_log from ‘s3://shadmha-us-east-2/cross-acct-test/connection_events.csv’ delimiter ‘,’ csv iam_role ‘arn:aws:iam::241135536116:role/RoleA,arn:aws:iam::804739925711:role/RoleB’ ignoreheader 1;”

    #Opening a cursor and run truncate query
    cur = con.cursor()
    query= f”’
    DROP TABLE IF EXISTS connection_log CASCADE;

    CREATE TABLE connection_log(
    username varchar(50),
    event varchar(50),
    count int8);
    COMMIT;”’
    cur.execute(query)
    con.commit()

    #Opening a cursor and run copy query
    cur.execute(copy_command_1)
    con.commit()
    #Close the cursor and the connection
    cur.close()
    con.close()

    # Progress Bar Code Ends here

    datetime_object_2 = datetime.datetime.now()
    print (“End TimeStamp”)
    print (“————-“)
    print (datetime_object_2)
    print (“”)

Lambda Function Code : https://github.com/shadabshaukat/serverless/blob/98f42c7867d6eb4d9e602d2b703764ad891fdfed/S3-to-Redshift.zip

How to Create an Immutable Account Ledger on Amazon QLDB

Amazon QLDB is a new specialist database technology which uses the concept of blockchain for database which maintains a history of every transaction using an immutable & cryptographically verifiable block hash for each record. QLDB uses an immutable transactional log, known as a journal, for data storage. The journal tracks every change to your data and maintains a complete and verifiable history of changes over time.

Sourcing data from the journal, QLDB uses a cryptographic hash function (SHA-256) with a Merkle tree–based model to generate a secure output file of your ledger’s full hash chain. This output file is known as a digest and acts as a fingerprint of your data’s entire change history as of a point in time. It enables you to look back and validate the integrity of your data revisions relative to that fingerprint.

Each ledge in QLDB has exactly one journal. A journal can have multiple strands, which are partitions of the journal.

QLDB is not a de-centralized blockchain network like Bitcoin or Ethereum, if you are looking for a managed block chain network then this is what you must be after –> Amazon Managed Blockchain

/* QLDB is now Generally Available in many regions. Ensure you update your awscli to the latest version to have access to QLDB from awscli. */

$ aws –version
aws-cli/1.16.240 Python/3.7.4 Darwin/17.7.0 botocore/1.12.230

$aws qldb help
NAME
       qldb –

DESCRIPTION
       The control plane for Amazon QLDB

AVAILABLE COMMANDS
       o create-ledger

       o delete-ledger

       o describe-journal-s3-export

       o describe-ledger

       o export-journal-to-s3

       o get-block

       o get-digest

       o get-revision

       o help

Head over to your AWS Console > Amazon QLDB

  1. Create New Ledger called ‘ctransactions’


2. Go to Query Editor > Create Tables for the Account Ledger

CREATE TABLE Cardholder

CREATE TABLE Log_Record

CREATE TABLE Account



3. Create Indexes to help speed up queries against each table

CREATE INDEX ON Cardholder (CARD_NO)

CREATE INDEX ON Log_Record (TRXN_ID)

CREATE INDEX ON Account (ACCT_ID)



4. Insert 1 Record into the Ledger table ‘Account’. QLDB documents are stored in Amazon Ion format which is a Superset of JSON


INSERT INTO Account
 {
‘FirstName’ : ‘Shadab’,
‘LastName’ : ‘Mohammad’,
‘ACCT_ID’ : ‘AMZN10000001’,
‘ACCTYPE’ : ‘Current’,
‘Address’ : ‘2000, 2 Park Street, CBD’
}

documentId
   “7oOLmXPyHjXBSBCio76rJo”

This is the document ID for this transaction
   

5. Create more records and change the first record from ‘Current’ to ‘Saving’ account

Important :
Because QLDB does not enforce schema, you can insert the same document into a table
multiple times. Each insert statement commits a separate document entry to the journal, and
QLDB assigns each document a unique ID.

INSERT INTO Account
 << {
‘FirstName’ : ‘Shadab’,
‘LastName’ : ‘Mohammad’,
‘ACCT_ID’ : ‘AMZN10000001’,
‘ACCTYPE’ : ‘Saving’,
‘Address’ : ‘2000, 2 Park Street, CBD’
},
 {
‘FirstName’ : ‘William’,
‘LastName’ : ‘Tell’,
‘ACCT_ID’ : ‘AMZN10000002’,
‘ACCTYPE’ : ‘Saving’,
‘Address’ : ‘2013, 29 Campbell PDE, Bondi’
},
 {
‘FirstName’ : ‘Chris’,
‘LastName’ : ‘Wall’,
‘ACCT_ID’ : ‘AMZN10000003’,
‘ACCTYPE’ : ‘Current’,
‘Address’ : ‘2142, 15 Lane Cove, Marickville’
} >>

documentId
    “L7S9f2cONhz1JQhj8Gj8bk”
    “0007IXW5xSNDvq9gj3CZN8”
    “JELcBIyjv8uKRHBfhbCvHp”


6. Now let us check for all the changes done to this account using below query. We can check for all modifications made to the Account ID. Each modification or inser generated a unique hash we can be traced back to what that modification was

SELECT
    ACCT_ID FROM Account AS a BY ACCT_ID WHERE a.ACCT_ID = 'AMZN10000001'


ACCT_ID
    “7oOLmXPyHjXBSBCio76rJo”
    “L7S9f2cONhz1JQhj8Gj8bk”
    
As you can see there are 2 modifications to this record and we can verify what each of those modifications are
    

7. Check for the history for each new documentID for the Account
    

SELECT * FROM history(Account) AS h WHERE h.metadata.id
    = '7oOLmXPyHjXBSBCio76rJo'


{
  blockAddress: {
    strandId: “KQrxUDo7jDX4scpbKkhEVZ”,
    sequenceNo: 26
  },
  hash: {{6HVuwag6GMEki+ZRFaMdqidROpKZmsTsYcmWiAMZdmw=}},
  data: {
    FirstName: “Shadab”,
    LastName: “Mohammad”,
    ACCT_ID: “AMZN10000001”,
    ACCTYPE: “Current”,
    Address: “2000, 2 Park Street, CBD”
  },
  metadata: {
    id: “7oOLmXPyHjXBSBCio76rJo”,
    version: 0,
    txTime: 2019-09-19T05:56:20.874Z,
    txId: “2EOkrgLG2Xa7qBwEwnKzDJ”
  }
}

SELECT * FROM
    history(Account) AS h WHERE h.metadata.id = 'L7S9f2cONhz1JQhj8Gj8bk'


{
  blockAddress: {
    strandId: “KQrxUDo7jDX4scpbKkhEVZ”,
    sequenceNo: 41
  },
  hash: {{+qFvFfRx1pi9QOkotc9MAuAWs6Tw7jn3sZqlPpEjSDA=}},
  data: {
    FirstName: “Shadab”,
    LastName: “Mohammad”,
    ACCT_ID: “AMZN10000001”,
    ACCTYPE: “Saving”,
    Address: “2000, 2 Park Street, CBD”
  },
  metadata: {
    id: “L7S9f2cONhz1JQhj8Gj8bk”,
    version: 0,
    txTime: 2019-09-19T06:14:04.214Z,
    txId: “2EOkrgMRCo4CBTUdJWSjmT”
  }
}

Using the first hash we can see the account type was “Current”, and after we modified the document and checked it with the 2nd hash it now shows “Saving”. And we have a history of all modifications applied to that account to traverse back and check all modifications made to that Account ID.

8. To Verify a Document Revision

SELECT r.metadata.id, r.blockAddress FROM _ql_committed_Account AS r WHERE r.data.ACCT_ID = 'AMZN10000001'

id                                                   blockAddress
“7oOLmXPyHjXBSBCio76rJo”    {strandId:”KQrxUDo7jDX4scpbKkhEVZ”,sequenceNo:95}

“L7S9f2cONhz1JQhj8Gj8bk”        {strandId:”KQrxUDo7jDX4scpbKkhEVZ”,sequenceNo:95}


This is just the beginning of what can be possible to create in the FinTech world. No more complex audit reports and expensive 3rd part tools which are an additional overhead to manage, but your database itself keeps a history of any changes made to a transaction or account or debit/credit card.

References:

[1] Accessing Amazon QLDB Using the AWS CLI – https://docs.aws.amazon.com/qldb/latest/developerguide/Tools.CLI.html
[2] botocore – https://github.com/boto/botocore/commit/f646b7ea334c9dcf94af6ad201a45886c48c2648
[3] aws-cli – https://github.com/aws/aws-cli/commit/b3c7bd5a1b93d42ab1b75d99c33fc2b62b0dd6a4
[4] Installing the AWS CLI – Installing the AWS CLI Using pip – https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html#install-tool-pip

Build and store a Hive mestastore outside an EMR cluster in a RDS MySQL database and Connect a Redshift cluster to an EMR cluster

This document addresses the specific configuration points that needs to be in place in order to build and store a Hive mestastore outside an EMR cluster in a RDS MySQL database. It also covers the steps to connect a Redshift cluster to an EMR cluster so Redshift can create and access the tables stored within the external metastore.

Resources Used:

Redshift Cluster

• RDS MySQL Instance

• EMR Cluster

Note: All resources must be in same VPC and same region for this practice.

Creating the RDS MySQL:

1 – First, start creating a RDS MySQL instance if you don’t have one already. Open AWS RDS Console and create an MySQL instance that will be used during this practice.

Note: Please make note of RDS security group, endpoint, Master User and Master Password. We will need that information later on.

2 – Once the RDS MySQL instance is created, modify its security groups to add a rule for All traffic on all Port Range to be allowed from the VPC’s default security group.

Note: This VPC’s default Security Group will be used while creating the EMR cluster later on as well but it needs to be whitelisted beforehand otherwise the EMR launching will fail while trying to reach out to the RDS MySQL.

Before creating the EMR Cluster:

3 – After creating the RDS MySQL (and open its security group to EMR) but right before creating the EMR cluster, a JSON configuration file needs to be created. This file will be ingested by EMR during the bootstrapping phase of EMR’s creation, it will basically tell EMR how to access the remote RDS MySQL database.

4 – Copy the JSON property structure from the following link (use Copy icon): https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html

5 – Paste it in a text editor and modify it carefully with the RDS details you noted earlier.

Note: Be careful, the value property can not contain any spaces or carriage returns. It should appear all on one line. Save it as “hiveConfiguration.json”.

6 – The final JSON configuration file should look like the following:

[
    {
      “Classification”: “hive-site”,
      “Properties”: {

        “javax.jdo.option.ConnectionURL”: “jdbc:mysql:\/\/database-1.cefjr3enh3dk.us-east-2.rds.amazonaws.com:3306\/hive?createDatabaseIfNotExist=true“,      

        “javax.jdo.option.ConnectionDriverName”:”org.mariadb.jdbc.Driver”,
        “javax.jdo.option.ConnectionUserName”:”admin“,

        “javax.jdo.option.ConnectionPassword”: “*********
      }
    }
]

Note 1: replace <hostname>, <username>, <password> with your own details:

Note 2: The part “hive?createDatabaseIfNotExist=true” determines the name of the database to be created in the MySQL RDS, in this case the database will be called “hive”.

7 – After creating above file, upload it to an S3 bucket/folder of your choice (in the same region of your resources).

Creating the EMR:

8 – Now, it is time to create the EMR cluster. To do this, open AWS EMR console and click Create Cluster button. This will prompt the Quick Options page but we won’t be using that. Click on Go to advanced options on the top of the page.

9 – This will send you to the Advanced Options page. There, under Software Configuration, select the following Applications:

Hadoop, Ganglia, Hive, Hue, Tez, Pig, Mahout

10 – In the same page, under Edit Software Settings section, click Load JSON from S3 and select the S3 bucket/path where you uploaded the previous created file “hiveConfiguration.json“. Select the file there and hit Select.

11 – In the Hardware Configuration page, make sure that the EMR cluster is in the same VPC as your MySQL RDS instance. Hit Next if you don’t want to change any Network configuration or Node types.

12 – Hit Next in the General Options page if you don’t want to change anything, although you might want to change the name of your EMR cluster here.

13 – In the next page, Security Options, make sure you have an EC2 Key Pair in that region and select it. Otherwise, create one!

Note: Create one now (if you don’t have one) before creating the EMR as you CAN’T add it later!!!

14 – Still in the Security Options page, expand the EC2 security groups panel and change both, Master and Core & Task instances to use the VPC’s default security group (the same whitelisted in the RDS MySQL security group earlier).

15 – Hit Create cluster and wait the EMR to be created. It will take some time…

Confirming that the metastore was created in the RDS MySQL

16 – Once the EMR is created, another rule needs to be added to the VPC’s default security group, one that allows SSHing into the EMR cluster on port 22 from your local IP. It should look like the following:

17 – With the right rules in place, try to connect to your EMR cluster from your local machine:

– – – chmod 600 article_key.pem 
–   ssh -i article_key.pem hadoop@ec2-18-XX-XX-XX.us-east-2.compute.amazonaws.com

18 – EMR has a MySQL client installed, use this client to connect to your MySQL database and perform few tests such as if the Security Groups are working properly and if the “hive” database was created properly

Note: You can do a telnet test from within EMR box as well to test Security Group access.

19 – To connect to the RDS MySQL, run the following command from your EMR box:

 mysql -h <rds-endpoint> -P 3306 -u <rds master user> -p <rds master password>

Examplemysql -h database-1.cefjr3enh3dk.us-east-2.rds.amazonaws.com -P 3306 -u admin123 -pPwD12345

20 – Once connected, use the following commands to verify if the Hive metastore was indeed created in the RDS. You should be able to see a database named “hive” there:

show databases;       à Lists all databases – “hive” should be there
use hive;             à Connects you to “hive” database
show tables;          à Lists all the meta tables within hive database
select * from TBLS;   à Lists all tables created in hive. At this point there’s none

Setting up necessary Spectrum Roles and Network requirements for Redshift and EMR

Note 1: Following steps assume that you already have a Redshift cluster and that you can connect to it. It will not guide you on how to create and access the Redshift cluster. 

Note 2: Since EMR, RDS MySQL share the same VPC’s default security group, they should be able to communicate to each other already. If that’s the case, you can skip Step 22 and go straight to Step 23, otherwise, If EMR and Redshift use different security groups, please do the step 22 first.

21 – Create a Role for Spectrum and attach it to your Redshift cluster. Follow the instructions here:

•       To Create the Role: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html

•       To Associate the Role: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html

22 – (Optional) Now that Redshift can access S3, Redshift also needs to access EMR cluster and vice-versa. Follow the steps listed under section “Enabling Your Amazon Redshift Cluster to Access Your Amazon EMR Cluster” in the following link: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html#c-spectrum-enabling-emr-access

Note: In summary, this creates an EC2 security group with Redshift’s Security Group and the EMR’s master node’s security groups inside it. Redshift’s Security Group must allow TCP in every port (0 – 65535) while EMR’s Security Group must allow TCP in port 9083 (Hive’s default). Next, you attach this newly created security group to both of your Redshift and EMR clusters.

23 – Once this is done, you should now be able to create the External Schema in Redshift, query the external tables from Redshift and also be able to create/see the schemas/tables from EMR Hive as well. However, at this point there’s no tables created yet.

Creating Tables on Hive First

24 – Log to Hive console and run the following:

> show databases;
default  (that’s the only database so far)
 
> create external table hive_table (col1 int, col2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘|’
location ‘s3://<your_bucket>/<your_folder>/‘;

> show tables;
hive_table  (that’s the table we just created) 

25 – Log back to your MySQL database and run the following commands:

show databases;       à Lists all databases – “hive” should be there
use hive;             à Connects you to “hive” database
show tables;          à Lists all the meta tables within hive database
select * from TBLS;   à  Lists all tables created in hive, “hive_table” is there!

Note: Now you will be able to see the newly created table “hive_table” showing on your External MySQL catalog.

Creating Schemas and Tables on Redshift Now

26 – On Redshift side, an External Schema must be created first before creating or querying the Hive tables, like following:

CREATE EXTERNAL SCHEMA emr_play                     à It can be any name, that’s a schema valid only for Redshift.
FROM HIVE METASTORE DATABASE ‘default’              à Use default database to match the database we have in Hive.  
URI ‘172.XXX.XXX.XXX‘ PORT 9083                     à EMR’s Private IP of the Master Instance. Hive’s default port is 9083.
IAM_ROLE ‘arn:aws:iam::000000000000:role/spectrum‘; à A valid Spectrum Role attached Redshift.

27 – Create the table(s):

create external table emr_play.redshift_table (col1 int, col2 varchar)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘|’
location ‘s3://<your_bucket>/<your_folder>/‘;

28 – Simply query the table now:

select * from emr_play.redshift_table;

29 – One more time, log back to your MySQL database and run the following commands again:

show databases;       à Lists all databases – “hive” should be there
use hive;             à Connects you to “hive” database
show tables;          à Lists all the meta tables within hive database
select * from TBLS;   à  Lists all tables created in hive, both tables are there!

 Note: You should be able to see the both Hive and Redshift tables now showing on your External MySQL catalog. You can also query the tables and create new tables on both Hive and Redshift side.