A Comprehensive Guide to Running PostgreSQL on Docker: One Database, Many Personalities

July 9, 2026 ~ Shadab Mohammad

PostgreSQL is much more than a conventional relational database. With the right extensions, the same core engine can become a vector store, a time-series platform, a geospatial database, an analytical engine, an extension laboratory, or the data layer behind an AI agent.

In this hands-on lab, I run several PostgreSQL personalities side by side with Docker, then connect one of them to a secured PostgreSQL Model Context Protocol (MCP) server. The aim is not to declare one image “best.” It is to demonstrate just how broad the PostgreSQL ecosystem has become, and to show the Docker details that make a multi-image lab reliable.

Lab, not production blueprint: these examples prioritize learning and isolation. Before production use, add backups, monitoring, resource limits, TLS, a secrets manager, tested upgrades, and a deliberately designed high-availability architecture.

What we are building

Container	Image	Capability	Host port	Persistent volume
`postgres18_server`	`postgres:18`	Vanilla PostgreSQL baseline	5432	`pg18_vanilla_data`
`postgres_pgvector18`	`pgvector/pgvector:0.8.4-pg18-trixie`	Vector similarity search	5433	`pgvector18_data`
`postgres_timescale18_node1`	`timescale/timescaledb-ha:pg18`	Time-series plus vector extensions	5434	`timescale18_node1_data`
`postgres_timescale18_node2`	`timescale/timescaledb-ha:pg18`	Second independent TimescaleDB instance	5435	`timescale18_node2_data`
`postgres_pglayers18`	`ghcr.io/pglayers/pglayers-full:18`	Large extension catalogue	5436	`pglayers18_data`
`postgres_ai_exts17`	`postgresai/extended-postgres:17-0.7.0`	PostgresAI/DBLab extension set	5437	`postgresai17_data`
`postgres-mcp`	`postgres-mcp-server:latest`	Agent-safe database discovery and diagnostics	8899	None

Every database receives a unique host port and a unique volume. Inside the Docker network, however, every PostgreSQL container still listens on its normal container port, 5432.
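
The two addressing rules can be captured in a small lookup helper. This is only a sketch: the function name `lab_endpoint` is hypothetical, and the port numbers are copied from the table above.

```shell
# Hypothetical helper: print the host-side and network-side endpoints
# for one lab container. Host ports are copied from the lab table.
lab_endpoint() {
  case "$1" in
    postgres18_server)          host_port=5432 ;;
    postgres_pgvector18)        host_port=5433 ;;
    postgres_timescale18_node1) host_port=5434 ;;
    postgres_timescale18_node2) host_port=5435 ;;
    postgres_pglayers18)        host_port=5436 ;;
    postgres_ai_exts17)         host_port=5437 ;;
    *) echo "unknown container: $1" >&2; return 1 ;;
  esac
  # From the Docker host: loopback plus the mapped host port.
  echo "host=127.0.0.1:${host_port}"
  # From another container on postgres-lab: container name plus port 5432.
  echo "network=$1:5432"
}

lab_endpoint postgres_pgvector18
```

For `postgres_pgvector18` this prints `host=127.0.0.1:5433` and `network=postgres_pgvector18:5432`. Only the first value changes between containers; the network-side port is always 5432.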

Prerequisites and safety

Docker Desktop, or Docker Engine plus the Compose plugin on Linux.
Enough disk space for six independent database clusters.
At least 8 GB of RAM if several extension-heavy images will run together; stop containers you are not actively testing.
psql, pgAdmin, or DBeaver if you want to connect from the host.
Node.js 18 or newer, including npm/npx, on the computer where Claude Desktop runs the mcp-remote bridge.
Node.js 20 or newer if you rebuild or test the PostgreSQL MCP server source outside Docker.

All passwords and tokens below are placeholders. Generate new values for your own lab. Never paste a real MCP bearer token into a blog post, source repository, screenshot, or shared configuration file.

Set reusable lab variables (add them to .bash_profile)

			
# Choose the bootstrap administrator used by the standard PostgreSQL images.
export POSTGRES_ADMIN_USER='my_admin_user'
# Replace this value before running the lab. Use a long, unique password.
export POSTGRES_ADMIN_PASSWORD='replace-with-a-long-random-password'
# This is the application database created during first initialization.
export POSTGRES_DATABASE='my_database'
# Wait up to two minutes for a container to accept PostgreSQL connections.
# Arguments: container name, database role, and database name.
wait_for_postgres() {
  container_name="$1"
  role_name="$2"
  database_name="$3"
  for attempt in $(seq 1 60); do
    if docker exec "$container_name" \
      pg_isready -U "$role_name" -d "$database_name" >/dev/null 2>&1; then
      return 0
    fi
    sleep 2
  done
  docker logs --tail 100 "$container_name"
  return 1
}
# Confirm that Docker is available before creating anything.
docker version
docker ps

		

Environment variables are convenient for a lab, but they remain visible to processes in the shell and can appear in container metadata. Use Docker secrets or your platform’s secret manager for production.
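
Even in a lab, it helps to refuse the placeholder before any container is created. The guard below is a minimal sketch: the function name `check_lab_password` and the 20-character threshold are assumptions chosen for illustration, not a standard.

```shell
# Hypothetical guard: reject an empty, placeholder, or short password.
check_lab_password() {
  password="$1"
  if [ -z "$password" ]; then
    echo "password is empty" >&2
    return 1
  fi
  # This literal is the placeholder used in the variable block above.
  if [ "$password" = 'replace-with-a-long-random-password' ]; then
    echo "password is still the placeholder" >&2
    return 1
  fi
  # 20 characters is an arbitrary lab threshold, not a recommendation.
  if [ "${#password}" -lt 20 ]; then
    echo "password is shorter than 20 characters" >&2
    return 1
  fi
  return 0
}
```

A typical call would be `check_lab_password "$POSTGRES_ADMIN_PASSWORD" || echo "fix the password first"`, run once before the first `docker run`.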

The Docker foundation that prevents most PostgreSQL problems

Create one private network and one volume per server

			
# Create a user-defined bridge network if it does not already exist.
docker network inspect postgres-lab >/dev/null 2>&1 || \
docker network create postgres-lab
# Create independent persistent storage for each PostgreSQL personality.
docker volume create pg18_vanilla_data
docker volume create pgvector18_data
docker volume create timescale18_node1_data
docker volume create timescale18_node2_data
docker volume create pglayers18_data
docker volume create postgresai17_data

		

Never share PGDATA. Mounting one data directory into two PostgreSQL containers can corrupt the cluster. A different image or major version does not make the files interchangeable.
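
One way to check this rule is to ask Docker which containers reference a volume. The sketch below relies on the `volume` filter of `docker ps`; the function names are hypothetical.

```shell
# Hypothetical helper: list every container (running or stopped)
# that mounts the named volume.
volume_users() {
  docker ps --all --filter "volume=$1" --format '{{.Names}}'
}

# Fail when more than one container references the same volume.
assert_volume_unshared() {
  users="$(volume_users "$1")"
  # grep -c . counts non-empty lines; "|| true" keeps a zero count
  # from being treated as an error.
  count="$(printf '%s\n' "$users" | grep -c . || true)"
  if [ "$count" -gt 1 ]; then
    echo "volume $1 is mounted by $count containers:" >&2
    printf '%s\n' "$users" >&2
    return 1
  fi
}
```

For example, `assert_volume_unshared pg18_vanilla_data` should succeed silently as long as only `postgres18_server` uses that volume.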

Understand the three image-specific storage paths

Image family	Correct container mount target	Why
PostgreSQL 18, pgvector PG18, pglayers PG18	`/var/lib/postgresql`	PostgreSQL 18 stores the cluster below a version-specific subdirectory such as `/var/lib/postgresql/18/docker`.
TimescaleDB HA PG18	`/home/postgres/pgdata`	The HA image defines `PGDATA=/home/postgres/pgdata/data`. Mounting the parent directory persists everything the image keeps under it, including the cluster in `data`.
PostgresAI Extended PostgreSQL 17	`/var/lib/postgresql/data`	This is the image’s declared PGDATA/VOLUME. The lab uses a fresh named volume with `volume-nocopy` so initialization begins in an empty target.
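
Rather than trusting a table, you can read the `PGDATA` value an image declares before choosing a mount target. This is a sketch under two assumptions: the image has already been pulled locally, and it sets `PGDATA` as an image-level environment variable. The function name `image_pgdata` is hypothetical.

```shell
# Hypothetical helper: print the PGDATA value declared in an image's
# environment. Prints nothing when the image does not set PGDATA.
image_pgdata() {
  docker image inspect \
    --format '{{range .Config.Env}}{{println .}}{{end}}' "$1" |
    sed -n 's/^PGDATA=//p'
}
```

Calling `image_pgdata timescale/timescaledb-ha:pg18` after a `docker pull` shows the path to compare against the table; mount the directory the table recommends, which for two of the three image families is the parent of `PGDATA` rather than `PGDATA` itself.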

Common `docker run` parameters

Parameter	Purpose
`--detach`	Runs the container in the background and prints its container ID.
`--name NAME`	Assigns a stable name used by `docker exec`, logs, health checks, and Docker DNS.
`--network postgres-lab`	Places the container on the private user-defined network. Other containers can reach it by name.
`--publish 127.0.0.1:H:C`	Maps host port `H` to container port `C`, but only on host loopback. Use an SSH tunnel for remote access.
`--volume VOLUME:PATH`	Persists database files outside the writable container layer.
`--mount type=volume,source=V,target=PATH,volume-nocopy`	Uses the explicit mount syntax and prevents Docker from pre-populating a new volume with files already present at the image path.
`--env NAME=value`	Supplies initialization or runtime settings. `POSTGRES_*` initialization settings only apply when PGDATA is empty.
`--shm-size=1g`	Raises `/dev/shm` above Docker’s 64 MB default, which is useful for parallel queries and index builds.
`--health-cmd`	Defines the command Docker uses to test database readiness.
`--health-interval`	Controls how often Docker runs the health check.
`--health-timeout`	Limits how long one health check may run.
`--health-retries`	Sets how many consecutive failures make the container unhealthy.
`IMAGE`	Selects the exact PostgreSQL distribution and tag.
`postgres -c name=value`	Overrides the image command and passes a startup-only PostgreSQL setting directly to the server.

Related command-line flags and shell syntax

Flag or syntax	Purpose
`docker network inspect NAME`	Checks whether a named network already exists and displays its metadata.
`docker network create NAME`	Creates a user-defined network with built-in container-name DNS.
`docker volume create NAME`	Creates a Docker-managed persistent volume.
`docker exec --interactive`	Keeps standard input open so `psql` can read a heredoc or accept input.
`docker exec --tty`	Allocates a terminal for a human-driven interactive `psql` session.
`docker logs --follow`	Streams new log records until you press Ctrl+C.
`docker logs --tail N`	Shows only the newest `N` log lines.
`docker logs --since 2m`	Shows logs produced during the last two minutes.
`docker inspect --format TEMPLATE`	Extracts a selected value, such as health status, from Docker metadata.
`docker update --restart=unless-stopped`	Adds a restart policy to a verified container without recreating it.
`docker restart NAME`	Stops and starts an existing container, preserving its command, environment, mounts, and published ports.
`docker manifest inspect --verbose`	Displays the platforms and detailed manifest data published for an image tag.
`docker build --tag NAME .`	Builds the Dockerfile in the current directory and assigns the resulting image a name and tag.
`docker compose config`	Resolves variables and validates the Compose model before deployment.
`docker compose up --detach`	Creates or reconciles the Compose services and leaves them running in the background.
`psql -h HOST`	Selects the database host; omitting it normally uses a local Unix socket.
`psql -p PORT`	Selects the PostgreSQL TCP port.
`psql -U ROLE`	Selects the PostgreSQL login role.
`psql -d DATABASE`	Selects the database to connect to.
`psql -W`	Forces a password prompt before connecting.
`psql -c SQL`	Runs one SQL command and exits.
`psql -v ON_ERROR_STOP=1`	Makes scripted `psql` stop immediately when any statement fails.
`ssh -N`	Creates forwarding only and does not run a remote shell command.
`ssh -L LPORT:HOST:RPORT`	Forwards a local port through SSH to a host and port visible from the remote machine.
`ssh -i KEY`	Selects the private key used to authenticate to the remote host.
`openssl rand -hex N`	Generates `N` random bytes and encodes them as twice as many hexadecimal characters.
`[ -z "${VAR:-}" ]`	Tests safely whether a shell variable is unset or empty.
`${VAR:?MESSAGE}`	Aborts the current command with `MESSAGE` when a required variable is unset or empty; in a non-interactive script, the shell exits.
`chmod 600 FILE`	Allows only the file owner to read or write a secret-bearing configuration file on POSIX systems.
`>/dev/null 2>&1`	Suppresses both normal and error output from the idempotent network-existence check.
`COMMAND_A || COMMAND_B`	Runs the second command only if the first command fails.
`wait_for_postgres CONTAINER ROLE DATABASE`	Calls the helper defined above, retrying `pg_isready` for up to two minutes and printing recent logs if startup fails.
`<<'SQL'`	Feeds a literal heredoc into `psql`; the quoted marker prevents shell expansion inside the SQL body.

I deliberately omit a restart policy during the first boot. Once a database is healthy, enable unless-stopped. This makes startup errors visible instead of hiding them inside a rapid restart loop.
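
That rule can be wrapped in a helper that applies the policy only when Docker reports the container as healthy. It is a minimal sketch; the function name `enable_restart_when_healthy` is hypothetical, and it assumes the container was started with a health check.

```shell
# Hypothetical helper: add the restart policy only after the
# health check reports "healthy".
enable_restart_when_healthy() {
  status="$(docker inspect --format '{{.State.Health.Status}}' "$1")" || return 1
  if [ "$status" = "healthy" ]; then
    docker update --restart=unless-stopped "$1"
  else
    echo "$1 is $status; leaving the restart policy unset" >&2
    return 1
  fi
}
```

A call such as `enable_restart_when_healthy postgres18_server` can then replace the bare `docker update` step in each lab.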

Lab 1: Vanilla PostgreSQL 18—the baseline

The official PostgreSQL image is the control group for the lab: no third-party extensions, no custom process supervisor, and the standard PostgreSQL 18 data layout.

			
# Start vanilla PostgreSQL 18 on host port 5432.
# The database is reachable by other lab containers as postgres18_server:5432.
docker run --detach \
  --name postgres18_server \
  --network postgres-lab \
  --publish 127.0.0.1:5432:5432 \
  --volume pg18_vanilla_data:/var/lib/postgresql \
  --env POSTGRES_USER="$POSTGRES_ADMIN_USER" \
  --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
  --env POSTGRES_DB="$POSTGRES_DATABASE" \
  --shm-size=1g \
  --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=12 \
  postgres:18
# Do not continue until the server accepts connections.
wait_for_postgres postgres18_server \
  "$POSTGRES_ADMIN_USER" "$POSTGRES_DATABASE"

		

			
# Review startup logs; press Ctrl+C to leave follow mode.
docker logs --follow postgres18_server
# In another terminal, inspect Docker's health result.
docker inspect --format '{{.State.Health.Status}}' postgres18_server
# Connect from inside the container; no host port is required here.
docker exec --interactive --tty postgres18_server \
  psql -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE"
# After the first healthy boot, enable automatic restart after host reboots.
docker update --restart=unless-stopped postgres18_server

		

Docker Compose alternative

Use this instead of the preceding docker run, not at the same time. Save it as compose.yaml.

			
# Compose specification for the vanilla PostgreSQL service.
services:
  db:
    image: postgres:18
    container_name: postgres18_server
    networks:
      - postgres-lab
    ports:
      - "127.0.0.1:5432:5432"
    environment:
      POSTGRES_USER: ${POSTGRES_ADMIN_USER}
      POSTGRES_PASSWORD: ${POSTGRES_ADMIN_PASSWORD}
      POSTGRES_DB: ${POSTGRES_DATABASE}
    shm_size: 1gb
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U $$POSTGRES_USER -d $$POSTGRES_DB"]
      interval: 10s
      timeout: 5s
      retries: 12
    volumes:
      - pg18_vanilla_data:/var/lib/postgresql
# Reuse the network created earlier.
networks:
  postgres-lab:
    external: true
# Reuse the volume created earlier instead of making a project-prefixed volume.
volumes:
  pg18_vanilla_data:
    external: true

		

Compose key	Purpose
`services` / `db`	Defines the application services and gives this PostgreSQL service its Compose-local name.
`image`	Selects the container image and tag.
`container_name`	Assigns the same stable Docker name used by the `docker run` example.
`networks`	Attaches the service to `postgres-lab`.
`ports`	Publishes host-loopback port 5432 to container port 5432.
`environment`	Passes the three shell variables into the container. Compose resolves the single-dollar expressions.
`shm_size`	Allocates 1 GB for the container’s `/dev/shm`.
`healthcheck.test`	Runs `pg_isready` through a container shell. Double dollar signs defer variable expansion to that shell.
`interval`, `timeout`, `retries`	Set the health-check cadence, per-check limit, and failure threshold.
`volumes`	Mounts the persistent volume at the PostgreSQL 18 parent data directory.
`external: true`	Tells Compose to reuse the pre-created network and volume rather than make project-prefixed replacements.

			
# Validate the Compose file, then start it in detached mode.
docker compose config
docker compose up --detach

After the service is healthy, add restart: unless-stopped beneath container_name, then apply the edited Compose model:

			
# Reconcile the service after adding the verified restart policy.
docker compose up --detach

Lab 2: pgvector—PostgreSQL as a vector database

The pgvector image extends the official PostgreSQL image with the vector data type, exact distance operations, and approximate indexes such as HNSW and IVFFlat. The pinned tag below provides pgvector 0.8.4 on PostgreSQL 18 and Debian Trixie.

			
# Start an independent pgvector cluster on host port 5433.
docker run --detach \
  --name postgres_pgvector18 \
  --network postgres-lab \
  --publish 127.0.0.1:5433:5432 \
  --volume pgvector18_data:/var/lib/postgresql \
  --env POSTGRES_USER="$POSTGRES_ADMIN_USER" \
  --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
  --env POSTGRES_DB="$POSTGRES_DATABASE" \
  --shm-size=1g \
  --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=12 \
  pgvector/pgvector:0.8.4-pg18-trixie
# Wait for initialization before running extension SQL.
wait_for_postgres postgres_pgvector18 \
  "$POSTGRES_ADMIN_USER" "$POSTGRES_DATABASE"

		

			
# Create and exercise the extension in my_database. Run the whole block as one command.
docker exec --interactive postgres_pgvector18 \
  psql -v ON_ERROR_STOP=1 \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" <<'SQL'
-- Extensions are installed per database, not once per server.
CREATE EXTENSION IF NOT EXISTS vector;
-- Create a tiny three-dimensional vector table.
CREATE TABLE IF NOT EXISTS vector_demo (
  id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  label text NOT NULL UNIQUE,
  embedding vector(3) NOT NULL
);
-- Insert the sample embeddings once; reruns update the existing labels.
INSERT INTO vector_demo (label, embedding)
VALUES
  ('alpha', '[1,0,0]'),
  ('beta',  '[0,1,0]'),
  ('gamma', '[0.8,0.2,0]')
ON CONFLICT (label) DO UPDATE
SET embedding = EXCLUDED.embedding;
-- The <-> operator returns Euclidean/L2 distance; smaller is closer.
SELECT label, embedding <-> '[1,0,0]' AS distance
FROM vector_demo
ORDER BY distance
LIMIT 3;
SQL
# Enable restart only after the health check succeeds.
docker update --restart=unless-stopped postgres_pgvector18

		
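
To sanity-check the ordering that the query should return, the same Euclidean distances can be computed outside the database. The sketch below uses `awk` for the arithmetic; the function name `l2_distance` is hypothetical, and it works on the three sample vectors from the SQL above.

```shell
# Hypothetical helper: Euclidean (L2) distance between two
# comma-separated vectors of equal length.
l2_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, ",")
    split(b, y, ",")
    sum = 0
    for (i = 1; i <= n; i++) {
      d = x[i] - y[i]
      sum += d * d
    }
    printf "%.4f\n", sqrt(sum)
  }'
}

# Distances from the query vector [1,0,0], closest first.
printf 'alpha %s\n' "$(l2_distance 1,0,0 1,0,0)"
printf 'gamma %s\n' "$(l2_distance 0.8,0.2,0 1,0,0)"
printf 'beta  %s\n' "$(l2_distance 0,1,0 1,0,0)"
```

This prints 0.0000 for alpha, 0.2828 for gamma, and 1.4142 for beta, so the query should list the rows in that order. The values shown by `psql` may differ in the trailing digits because pgvector stores vector elements in single precision.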

Lab 3: TimescaleDB—time-series and high-performance vectors

The TimescaleDB HA image combines PostgreSQL with TimescaleDB and other packaged extensions. It uses PGDATA=/home/postgres/pgdata/data, so the named volume is mounted at its parent, /home/postgres/pgdata, rather than the official image’s /var/lib/postgresql path.

Two containers do not automatically form a highly available cluster. The commands below create two independent instances for comparison and failover experiments. Patroni, a distributed configuration store, replication, and a routing layer require separate configuration.
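
Once both containers from the commands below are running, their independence can be confirmed from PostgreSQL itself. Separately initialized clusters receive different system identifiers, whereas a physical replica shares its primary's identifier. The sketch uses hypothetical function names and connects to the built-in `postgres` database.

```shell
# Hypothetical helper: print "system_identifier|in_recovery" for one node.
node_identity() {
  docker exec "$1" psql -U postgres -d postgres -At -c \
    "SELECT system_identifier, pg_is_in_recovery() FROM pg_control_system();"
}

# Succeed when the two nodes report different system identifiers.
nodes_are_independent() {
  id1="$(node_identity "$1")" || return 2
  id2="$(node_identity "$2")" || return 2
  echo "$1: $id1"
  echo "$2: $id2"
  # Compare only the identifier, which precedes the "|" separator.
  [ "${id1%%|*}" != "${id2%%|*}" ]
}
```

Running `nodes_are_independent postgres_timescale18_node1 postgres_timescale18_node2` should succeed and show `f` in the recovery column for both nodes, meaning each one is a standalone primary.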

			
# Start independent TimescaleDB instance 1 on host port 5434.
docker run --detach \
  --name postgres_timescale18_node1 \
  --network postgres-lab \
  --publish 127.0.0.1:5434:5432 \
  --volume timescale18_node1_data:/home/postgres/pgdata \
  --env POSTGRES_USER=postgres \
  --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
  --env POSTGRES_DB="$POSTGRES_DATABASE" \
  --shm-size=1g \
  --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=12 \
  timescale/timescaledb-ha:pg18
# Start independent TimescaleDB instance 2 with a different port and volume.
docker run --detach \
  --name postgres_timescale18_node2 \
  --network postgres-lab \
  --publish 127.0.0.1:5435:5432 \
  --volume timescale18_node2_data:/home/postgres/pgdata \
  --env POSTGRES_USER=postgres \
  --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
  --env POSTGRES_DB="$POSTGRES_DATABASE" \
  --shm-size=1g \
  --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=12 \
  timescale/timescaledb-ha:pg18
# Wait for both independent instances before executing SQL or enabling restarts.
wait_for_postgres postgres_timescale18_node1 postgres "$POSTGRES_DATABASE"
wait_for_postgres postgres_timescale18_node2 postgres "$POSTGRES_DATABASE"

		

			
# Verify and enable the TimescaleDB and pgvectorscale extensions on node 1.
docker exec --interactive postgres_timescale18_node1 \
  psql -v ON_ERROR_STOP=1 \
  -U postgres -d "$POSTGRES_DATABASE" <<'SQL'
-- Confirm that the required extension packages are available.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('timescaledb', 'vector', 'vectorscale')
ORDER BY name;
-- Enable TimescaleDB in this database.
CREATE EXTENSION IF NOT EXISTS timescaledb;
-- CASCADE also enables pgvector when vectorscale requires it.
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
-- Create a simple time-series table.
CREATE TABLE IF NOT EXISTS sensor_readings (
  observed_at timestamptz NOT NULL,
  sensor_id text NOT NULL,
  temperature_c double precision NOT NULL
);
-- Convert the table to a TimescaleDB hypertable.
SELECT create_hypertable(
  'sensor_readings',
  by_range('observed_at'),
  if_not_exists => TRUE
);
SQL
# Enable restart policies after both instances report healthy.
docker update --restart=unless-stopped postgres_timescale18_node1
docker update --restart=unless-stopped postgres_timescale18_node2

		

Lab 4: pglayers Full—an extension-rich PostgreSQL 18

pglayers publishes PostgreSQL extensions as composable image layers and also provides a full profile containing more than fifty extensions. Although the project documents multi-architecture layer support, the live pglayers-full:18 tag resolved to Linux AMD64 only when this article was reviewed. Check the manifest again before using it on ARM64.

This image preloads many libraries that register background workers. PostgreSQL’s default worker limit is eight, while the pglayers test suite uses 64. The full profile also configures components around the canonical postgres role, so this lab intentionally keeps POSTGRES_USER=postgres. Because this walkthrough does not initialize DocumentDB, its internal PostgreSQL background worker is disabled to suppress the missing-role warning. This setting does not disable the separately preloaded MongoDB wire-gateway library.

			
# Start pglayers Full on host port 5436.
# max_worker_processes=64 prevents the extension workers from exhausting the default pool.
# The DocumentDB worker is disabled until that extension is deliberately installed.
docker run --detach \
  --name postgres_pglayers18 \
  --network postgres-lab \
  --publish 127.0.0.1:5436:5432 \
  --volume pglayers18_data:/var/lib/postgresql \
  --env POSTGRES_USER=postgres \
  --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
  --env POSTGRES_DB="$POSTGRES_DATABASE" \
  --shm-size=1g \
  --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=12 \
  ghcr.io/pglayers/pglayers-full:18 \
  postgres \
    -c max_worker_processes=64 \
    -c documentdb.enableBackgroundWorker=off
# Extension-rich images can take longer to initialize.
wait_for_postgres postgres_pglayers18 postgres "$POSTGRES_DATABASE"

		

			
# Check the live image architecture and database health.
docker manifest inspect --verbose \
  ghcr.io/pglayers/pglayers-full:18
docker inspect --format '{{.State.Health.Status}}' postgres_pglayers18
# Run the inspection SQL as one fail-fast script.
docker exec --interactive postgres_pglayers18 \
  psql -v ON_ERROR_STOP=1 \
  -U postgres -d "$POSTGRES_DATABASE" <<'SQL'
-- Confirm the expanded worker pool.
SHOW max_worker_processes;
-- Count and inspect the extension packages available in this image.
SELECT count(*) AS available_extensions
FROM pg_available_extensions;
SELECT name, default_version, installed_version
FROM pg_available_extensions
ORDER BY name;
SQL

		

			
# After verification, enable the restart policy from the shell.
docker update --restart=unless-stopped postgres_pglayers18

The full image makes extensions available; it does not mean every extension should be created in every database. Some extensions have background workers, database-role requirements, or mutual conflicts. Enable only what your experiment needs. To test DocumentDB, stop and recreate this container against the same named volume without the disabling -c option; command arguments cannot be changed by a simple restart. Then follow the project’s documented DocumentDB installation sequence in the configured postgres database.
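
The stop-and-recreate step could look like the sketch below. It repeats the earlier `docker run` with the DocumentDB option removed and assumes the lab variables are still exported; the function name is hypothetical. Removing a container does not remove its named volume, and the `POSTGRES_*` initialization settings are ignored on this second start because PGDATA is already populated.

```shell
# Hypothetical helper: recreate the pglayers container on the same
# volume, this time leaving the DocumentDB background worker enabled.
recreate_pglayers_for_documentdb() {
  docker stop postgres_pglayers18 || return 1
  # The named volume pglayers18_data survives container removal.
  docker rm postgres_pglayers18 || return 1
  docker run --detach \
    --name postgres_pglayers18 \
    --network postgres-lab \
    --publish 127.0.0.1:5436:5432 \
    --volume pglayers18_data:/var/lib/postgresql \
    --env POSTGRES_USER=postgres \
    --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
    --env POSTGRES_DB="$POSTGRES_DATABASE" \
    --shm-size=1g \
    --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
    --health-interval=10s \
    --health-timeout=5s \
    --health-retries=12 \
    ghcr.io/pglayers/pglayers-full:18 \
    postgres -c max_worker_processes=64
}
```

The environment variables are kept so that the health check, which reads `POSTGRES_USER` and `POSTGRES_DB` inside the container, keeps working.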

Lab 5: PostgresAI Extended PostgreSQL 17

The postgresai/extended-postgres image is primarily designed for PostgresAI Database Lab workflows. Its default startup script expects an existing cluster and deliberately keeps the container alive if PostgreSQL stops. For a fresh standalone lab, appending postgres activates the inherited official initialization path. Mounting a brand-new named volume at the image’s declared PGDATA path with volume-nocopy guarantees an empty initialization target.

			
# Start the AMD64 PostgresAI PostgreSQL 17 image on host port 5437.
# Mount the image's declared PGDATA and prevent Docker from copying image-layer files.
# The final "postgres" argument is essential for first-run initialization.
docker run --detach \
  --name postgres_ai_exts17 \
  --network postgres-lab \
  --publish 127.0.0.1:5437:5432 \
  --mount type=volume,source=postgresai17_data,target=/var/lib/postgresql/data,volume-nocopy \
  --env POSTGRES_USER="$POSTGRES_ADMIN_USER" \
  --env POSTGRES_PASSWORD="$POSTGRES_ADMIN_PASSWORD" \
  --env POSTGRES_DB="$POSTGRES_DATABASE" \
  --shm-size=1g \
  --health-cmd='pg_isready -U "$POSTGRES_USER" -d "$POSTGRES_DB"' \
  --health-interval=10s \
  --health-timeout=5s \
  --health-retries=12 \
  postgresai/extended-postgres:17-0.7.0 \
  postgres
# Wait for the inherited PostgreSQL entrypoint to finish initialization.
wait_for_postgres postgres_ai_exts17 \
  "$POSTGRES_ADMIN_USER" "$POSTGRES_DATABASE"

		

			
# Inspect initialization before trying to use psql.
docker logs --tail 100 postgres_ai_exts17
docker inspect --format '{{.State.Health.Status}}' postgres_ai_exts17
# List a few useful extensions supplied by the image.
docker exec --interactive postgres_ai_exts17 \
  psql -v ON_ERROR_STOP=1 \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" <<'SQL'
-- See whether selected extension packages are available.
SELECT name, default_version, installed_version
FROM pg_available_extensions
WHERE name IN ('vector', 'hypopg', 'pg_stat_statements', 'timescaledb')
ORDER BY name;
-- Enable only the extensions needed by this database.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS hypopg;
SQL
# Enable automatic restart after a successful first boot.
docker update --restart=unless-stopped postgres_ai_exts17

		

If initdb reports that PGDATA “exists but is not empty,” do not delete files until you know what they are. Stop the container, inspect the volume, and use a new empty volume for a disposable lab. PostgreSQL will not initialize over unrelated or partial files.
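
A read-only listing is a safe way to see what the volume holds before deciding anything. The sketch below mounts the volume into a throwaway container; the `alpine:3` image, the `/inspect` path, and the function name are assumptions made for illustration.

```shell
# Hypothetical helper: list a volume's top-level contents without
# modifying it. Stop the database container that uses the volume first.
inspect_volume() {
  docker run --rm \
    --mount "type=volume,source=$1,target=/inspect,readonly" \
    alpine:3 ls -la /inspect
}
```

For example, `inspect_volume postgresai17_data` shows whether the target is empty, holds a complete cluster, or holds partial files from an interrupted initialization.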

Comparing the PostgreSQL personalities

Stack	Best suited to	Initialization	Architecture note	Main caution
Official PostgreSQL 18	Baseline relational and JSON workloads	Automatic on empty volume	Multi-architecture	Add extensions yourself
pgvector PG18	Embeddings and similarity search	`CREATE EXTENSION vector`	AMD64 and ARM64 tags available	One extension-focused image, not an AI platform by itself
TimescaleDB HA PG18	Time-series, telemetry, PostGIS, and vectorscale	Create/verify extensions per database	AMD64 and ARM64	Two containers are not automatically HA
pglayers Full PG18	Discovering and testing a wide extension catalogue	Create selected extensions per database	Current full tag: AMD64; verify the live manifest	Many preloaded workers; keep the `postgres` role and raise worker slots
PostgresAI Extended PG17	Database Lab and advanced extension experiments	Override the command with `postgres` for a fresh standalone cluster	Published tag is AMD64	Default startup assumes existing PGDATA

Connect with psql, pgAdmin, or DBeaver

From inside a database container, use the container’s normal port 5432—or omit the port entirely. From the host, use the mapped port from the lab table.

			
# From inside the vanilla container: local socket, no host port mapping involved.
docker exec --interactive --tty postgres18_server \
  psql -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE"
# From the Docker host: connect to the vanilla instance on host port 5432.
psql -h 127.0.0.1 -p 5432 \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" -W
# From the Docker host: connect to pgvector on its unique host port 5433.
psql -h 127.0.0.1 -p 5433 \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" -W
# From the Docker host: connect to pglayers on host port 5436.
psql -h 127.0.0.1 -p 5436 \
  -U postgres -d "$POSTGRES_DATABASE" -W

		

For pgAdmin or DBeaver, use host 127.0.0.1, the mapped host port, the configured database, and the matching user. On an EC2 host, keep Docker bound to loopback and use an SSH tunnel instead of opening every database port to the internet.

			
# Forward local laptop port 5432 securely to the EC2 host's loopback port 5432.
ssh -N \
  -L 5432:127.0.0.1:5432 \
  -i /absolute/path/to/key.pem \
  ec2-user@YOUR_EC2_HOST

		

Add the PostgreSQL MCP server

The Postgres MCP Server exposes schema discovery, object inspection, bounded SQL execution, query-plan diagnostics, index recommendations, workload analysis, database monitoring, and optional Prometheus metrics. One MCP process connects to one PostgreSQL database URI. For simultaneous targets, each additional MCP instance needs its own Docker container name, host port, database role, Claude configuration key and URL, plus an allowed-origin entry matching that URL.
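
To keep those per-instance values consistent, they can be derived from one port number. This is a sketch only: the function name, the `postgres-mcp-` name prefix, port 8900, and the choice to use the same port inside and outside the container are all assumptions; the `/mcp` path and origin format follow the single-instance command later in this guide.

```shell
# Hypothetical helper: print the settings that must differ for an
# additional MCP instance. Arguments: short name, port, database host.
mcp_instance_settings() {
  name="$1"
  port="$2"
  db_host="$3"
  echo "container_name=postgres-mcp-${name}"
  echo "publish=127.0.0.1:${port}:${port}"
  echo "database_host=${db_host}:5432"
  echo "claude_url=http://127.0.0.1:${port}/mcp"
  echo "allowed_origins=http://localhost:${port},http://127.0.0.1:${port}"
}

mcp_instance_settings pgvector 8900 postgres_pgvector18
```

Each additional instance also needs its own database role in the target cluster, which this helper does not create.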

Build the MCP image

			
# Clone the MCP server and enter its repository before building.
git clone https://github.com/shadabshaukat/postgres-mcp-server.git
cd postgres-mcp-server
# Option A: build an unchanged checkout from its tracked build output.
docker build --tag postgres-mcp-server:latest .

		

If you modify the TypeScript source, use the following validation-and-build path instead of the final build command above.

			
# Option B: install the locked dependencies, validate the source, and rebuild.
# If you edit TypeScript source, validate and test it before rebuilding the image.
# These commands require Node.js 20 or newer on the host.
npm ci
npm run check
npm run test:unit
npm run build
docker build --tag postgres-mcp-server:latest .

		

Create a least-privilege database role

MCP_DB_MODE=restricted adds application-level safeguards, but PostgreSQL privileges remain the real security boundary. Do not connect the MCP service as a superuser.

			
# Generate a fresh URL-safe database password and keep it in this private shell.
# Hexadecimal output contains no URI delimiter characters.
export MCP_DB_PASSWORD="$(openssl rand -hex 24)"
# Stop before creating the role if OpenSSL failed.
: "${MCP_DB_PASSWORD:?OpenSSL did not generate an MCP database password}"
# Create the MCP role and grants as one fail-fast script.
# psql safely quotes the password and database-name variables in the SQL below.
docker exec --interactive postgres18_server \
  psql -v ON_ERROR_STOP=1 \
  -v mcp_password="$MCP_DB_PASSWORD" \
  -v target_db="$POSTGRES_DATABASE" \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" <<'SQL'
-- Create a login dedicated to MCP read access.
CREATE ROLE mcp_reader
  WITH LOGIN
  PASSWORD :'mcp_password';
-- Allow the role to connect to this database and inspect the public schema.
GRANT CONNECT ON DATABASE :"target_db" TO mcp_reader;
GRANT USAGE ON SCHEMA public TO mcp_reader;
-- Grant read access to current tables.
GRANT SELECT ON ALL TABLES IN SCHEMA public TO mcp_reader;
-- Grant read access to future tables created by this administrator.
ALTER DEFAULT PRIVILEGES IN SCHEMA public
  GRANT SELECT ON TABLES TO mcp_reader;
SQL

		

Roles are local to a PostgreSQL cluster. Before pointing MCP at pgvector, TimescaleDB, pglayers, or PostgresAI, repeat the dedicated-role and grant step in that target cluster with its administrator and database name. Do not merely change the hostname in the URI.
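
Before changing the URI, a quick lookup in `pg_roles` shows whether the target cluster already has the role. The function name `has_mcp_role` is hypothetical; the arguments are the container name, its administrator role, and a database to connect to.

```shell
# Hypothetical helper: succeed only when mcp_reader exists in the
# cluster served by the given container.
has_mcp_role() {
  found="$(docker exec "$1" psql -U "$2" -d "$3" -At -c \
    "SELECT 1 FROM pg_roles WHERE rolname = 'mcp_reader';")" || return 2
  [ "$found" = "1" ]
}
```

For instance, `has_mcp_role postgres_pgvector18 "$POSTGRES_ADMIN_USER" "$POSTGRES_DATABASE"` fails until the role-and-grant script has been repeated against the pgvector cluster.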

Optional: grant deeper MCP observability

The least-privilege role above can inspect schemas and selected data, but some workload and monitoring tools will return partial results. The vanilla image does not preload pg_stat_statements, and ordinary roles cannot see every session’s query text. If that wider visibility is acceptable in your lab, enable it explicitly:

			
# Configure the bundled statistics library; this setting needs a restart.
docker exec --interactive postgres18_server \
  psql -v ON_ERROR_STOP=1 \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" <<'SQL'
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';
SQL
# Restart so PostgreSQL can preload the library.
docker restart postgres18_server
# Wait until PostgreSQL accepts connections before running the next SQL script.
wait_for_postgres postgres18_server \
  "$POSTGRES_ADMIN_USER" "$POSTGRES_DATABASE"
# Create the extension and grant the broad built-in monitoring role.
docker exec --interactive postgres18_server \
  psql -v ON_ERROR_STOP=1 \
  -U "$POSTGRES_ADMIN_USER" -d "$POSTGRES_DATABASE" <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
GRANT pg_monitor TO mcp_reader;
SQL

		

pg_monitor exposes cluster-wide monitoring information, so grant it only after reviewing that visibility. HypoPG is not packaged in the vanilla image; MCP can still recommend indexes there, but hypothetical-index validation remains unavailable unless you choose an image that supplies HypoPG.

Run the MCP server on the private Docker network

			
# Generate a fresh 64-character bearer token for this MCP instance.
# Keep it out of shell history, recordings, screenshots, and source control.
export POSTGRES_MCP_TOKEN="$(openssl rand -hex 32)"
# Stop immediately if OpenSSL failed and left the token empty.
: "${POSTGRES_MCP_TOKEN:?OpenSSL did not generate an MCP token}"
# Start a restricted, bearer-authenticated Streamable HTTP MCP endpoint.
# Docker DNS resolves postgres18_server directly on the private network.
docker run --detach \
  --name postgres-mcp \
  --network postgres-lab \
  --publish 127.0.0.1:8899:8899 \
  --read-only \
  --tmpfs /tmp \
  --security-opt no-new-privileges:true \
  --env "DATABASE_URI=postgresql://mcp_reader:${MCP_DB_PASSWORD:?MCP_DB_PASSWORD is not set}@postgres18_server:5432/${POSTGRES_DATABASE:?POSTGRES_DATABASE is not set}?sslmode=disable" \
  --env PGSSLMODE=disable \
  --env MCP_TRANSPORT=http \
  --env MCP_HTTP_HOST=0.0.0.0 \
  --env MCP_HTTP_PORT=8899 \
  --env MCP_HTTP_PATH=/mcp \
  --env MCP_DB_MODE=restricted \
  --env "MCP_AUTH_TOKEN=${POSTGRES_MCP_TOKEN:?POSTGRES_MCP_TOKEN is not set}" \
  --env 'MCP_ALLOWED_HOSTS=localhost,127.0.0.1' \
  --env 'MCP_ALLOWED_ORIGINS=http://localhost:8899,http://127.0.0.1:8899' \
  postgres-mcp-server:latest

		

Claude needs the same bearer token. While this private shell or SSH session is still open, transfer it directly into your password manager or secure clipboard. If you must display it, do so once in a private, non-recorded terminal and clear the terminal scrollback afterward:

			
# Display the token only in a private terminal so it can be copied to Claude.
printf '%s\n' "$POSTGRES_MCP_TOKEN"

MCP parameter reference

Parameter	Purpose
`--network postgres-lab`	Lets the MCP container reach the selected database by container name and internal port 5432.
`--publish 127.0.0.1:8899:8899`	Exposes MCP only on host loopback. It is not directly reachable from the network.
`--read-only`	Makes the MCP container filesystem read-only.
`--tmpfs /tmp`	Provides a temporary writable in-memory directory required by some runtime operations.
`--security-opt no-new-privileges:true`	Prevents processes from gaining additional Linux privileges.
`DATABASE_URI`	Selects exactly one PostgreSQL target. Use container DNS and port 5432 on the shared network.
`PGSSLMODE=disable`	Disables TLS only for this trusted, private container network. Use certificate verification for remote databases.
`MCP_TRANSPORT=http`	Enables Streamable HTTP. The legacy value `sse` is only an alias; legacy SSE endpoints require a separate opt-in.
`MCP_HTTP_HOST=0.0.0.0`	Listens on all interfaces inside the container. The host-side publish remains safely bound to 127.0.0.1.
`MCP_HTTP_PORT=8899`	Sets the HTTP listener port inside the container.
`MCP_HTTP_PATH=/mcp`	Sets the Streamable HTTP MCP endpoint path.
`MCP_DB_MODE=restricted`	Enables read-oriented SQL inspection, read-only transactions, row limits, and timeouts.
`MCP_AUTH_TOKEN`	Sets the static Bearer token. The server requires at least 16 characters.
`MCP_ALLOWED_HOSTS`	Restricts accepted HTTP Host values when the internal listener is non-loopback.
`MCP_ALLOWED_ORIGINS`	Restricts browser-style Origin values when an Origin header is present.

			
# Confirm the MCP process, database connection, and readiness endpoint.
docker logs postgres-mcp
curl --fail --silent --show-error http://127.0.0.1:8899/healthz
curl --fail --silent --show-error http://127.0.0.1:8899/readyz
# Enable automatic restart only after readiness succeeds.
docker update --restart=unless-stopped postgres-mcp

		

To point MCP at another lab server, first create mcp_reader and its grants in that cluster, then recreate the MCP container with that target’s hostname, database, and generated password in DATABASE_URI. The private-network port remains 5432. A simultaneous second MCP target also needs a unique --name, a different host-side published port, a matching allowed origin, and a distinct client configuration entry.
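
A generated password may contain characters such as @, / or : that have a special meaning inside a URI. The following is a minimal Python sketch, not part of the MCP server itself, of how the DATABASE_URI value could be assembled with the credentials percent-encoded. The function name build_database_uri and the sample values are hypothetical.

```python
from urllib.parse import quote


def build_database_uri(user, password, host, database, port=5432, sslmode="disable"):
    """Return a postgresql:// URI with user, password, and database percent-encoded."""
    # safe="" also encodes "/" and ":" so they cannot be mistaken for URI delimiters.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}:{port}/{quote(database, safe='')}?sslmode={sslmode}"
    )


if __name__ == "__main__":
    # Hypothetical values for illustration only; use your own target and password.
    print(build_database_uri("mcp_reader", "p@ss/w:rd", "postgres_pgvector18", "appdb"))
```

The host is the container name on the shared Docker network, and the port stays at 5432 regardless of which host port the target publishes.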

Configure Claude Desktop

Claude Desktop starts the community mcp-remote bridge as a local process and forwards it to the loopback-only Streamable HTTP endpoint. The example pins version 0.1.38, verified when this article was reviewed, instead of downloading an unspecified future release. Replace the placeholder with the token generated above; the value must include the Bearer prefix.

			
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": [
        "-y",
        "mcp-remote@0.1.38",
        "http://127.0.0.1:8899/mcp",
        "--allow-http",
        "--transport",
        "http-only",
        "--header",
        "Authorization:${AUTH_HEADER}"
      ],
      "env": {
        "AUTH_HEADER": "Bearer REPLACE_WITH_THE_GENERATED_TOKEN"
      }
    }
  }
}

		

-y allows npx to install/run the pinned bridge without an interactive confirmation.
--allow-http is acceptable here only because the endpoint is local loopback.
--transport http-only selects Streamable HTTP.
--header adds the required Authorization header.
If Claude cannot locate npx, replace it with the absolute path returned by command -v npx.
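
Before restarting Claude Desktop, it can help to check the file for the mistakes that most often break this setup. The sketch below is an illustrative helper, not an official Claude or mcp-remote tool. It assumes the configuration layout shown above, and the name check_claude_config is hypothetical.

```python
import json

PLACEHOLDER = "REPLACE_WITH_THE_GENERATED_TOKEN"


def check_claude_config(text, server_name="postgres"):
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno}, column {exc.colno})"]

    servers = config.get("mcpServers") if isinstance(config, dict) else None
    server = servers.get(server_name) if isinstance(servers, dict) else None
    if not isinstance(server, dict):
        return [f"no '{server_name}' entry under 'mcpServers'"]

    problems = []
    env = server.get("env")
    header = env.get("AUTH_HEADER", "") if isinstance(env, dict) else ""
    if not header.startswith("Bearer "):
        problems.append("AUTH_HEADER must start with the Bearer prefix")
    if PLACEHOLDER in header:
        problems.append("AUTH_HEADER still contains the placeholder token")
    return problems
```

Calling check_claude_config(open(path).read()) with the path to claude_desktop_config.json returns an empty list when the JSON parses, the server entry exists, and the header carries a real Bearer token.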

Restrict the Claude configuration file to your account where the operating system supports POSIX permissions—for example, chmod 600 "$HOME/Library/Application Support/Claude/claude_desktop_config.json" on macOS. The token is still plaintext in that file, so do not share it. Completely quit and reopen Claude Desktop after changing the configuration. A cloud-hosted connector cannot reach 127.0.0.1 on your computer; this is a local Claude Desktop configuration.

MCP on EC2: tunnel it instead of publishing it

			
# Run this on the laptop that hosts Claude Desktop.
# It maps laptop port 8899 to the EC2 host's loopback-only MCP endpoint.
ssh -N \
  -L 8899:127.0.0.1:8899 \
  -i /absolute/path/to/key.pem \
  ec2-user@YOUR_EC2_HOST

		

Claude still connects to http://127.0.0.1:8899/mcp. Do not open port 8899 in the EC2 security group merely to make the demo reachable.

Troubleshooting playbook

Symptom	Likely cause	Fix
`port is already allocated`	Two containers publish the same host port.	Use the unique host-port map in this article. Container port 5432 remains unchanged.
`initdb: directory exists but is not empty`	The mounted PGDATA contains files, perhaps from an earlier or incorrect mount.	Inspect it first. For a disposable lab, create a new empty volume.
Missing `.s.PGSQL.5432` socket	PostgreSQL did not finish starting.	Run `docker logs --tail 200 CONTAINER`; do not treat a running container as proof of a running database.
Host port is listening but PostgreSQL refuses connections	Docker published the port even though the database process failed.	Check container health, logs, and `pg_isready`.
pglayers repeatedly says to increase `max_worker_processes`	The full profile exhausted PostgreSQL’s default worker pool.	Start it with `postgres -c max_worker_processes=64`.
pglayers reports `role "postgres" does not exist`	A custom `POSTGRES_USER` replaced the canonical bootstrap role while bundled workers expect `postgres`.	For a fresh full-profile lab, initialize with `POSTGRES_USER=postgres`.
DocumentDB worker role warning	The DocumentDB library is preloaded but its extension has not created the required role.	Create DocumentDB in its configured database or disable that background worker if DocumentDB is not part of the test.
PostgresAI container is up but PostgreSQL is not	The default DBLab-oriented script expects initialized PGDATA.	Use an empty PG17 volume and append `postgres` to the image command.
Changed `POSTGRES_USER` or `POSTGRES_DB` has no effect	The volume already contains an initialized cluster.	Change roles/databases with SQL, or initialize a new empty volume.
MCP cannot reach PostgreSQL	The URI uses `localhost` inside the MCP container.	Use the shared Docker network and the PostgreSQL container name.
MCP returns HTTP 401	The bearer token is missing, stale, or lacks the `Bearer` prefix.	Use the same generated token in the container and client header.
Claude rejects its configuration	Malformed JSON, missing closing brace, or unavailable `npx`.	Validate the JSON and use an absolute `npx` path if necessary.
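
The first row of the playbook, a host-port collision, can be caught before any container starts. Here is a small Python sketch that checks a planned port map for duplicates. The function name find_port_conflicts is hypothetical, and the map simply restates the host ports used in this lab.

```python
from collections import defaultdict


def find_port_conflicts(published):
    """Map each host port claimed more than once to the sorted container names claiming it."""
    by_port = defaultdict(list)
    for container, host_port in published.items():
        by_port[host_port].append(container)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}


# Host ports from the lab table; every container still listens on 5432 internally.
LAB_PORTS = {
    "postgres18_server": 5432,
    "postgres_pgvector18": 5433,
    "postgres_timescale18_node1": 5434,
    "postgres_timescale18_node2": 5435,
    "postgres_pglayers18": 5436,
    "postgres_ai_exts17": 5437,
    "postgres-mcp": 8899,
}

if __name__ == "__main__":
    print(find_port_conflicts(LAB_PORTS))
```

An empty result means every container has its own host port. Adding a second container on an already used port returns that port together with both container names.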

Production hardening checklist

Pin immutable image tags or digests and test upgrades before deployment.
Use a secret manager rather than plaintext environment variables.
Bind database and MCP ports to private interfaces; prefer SSH tunnels, private networks, or VPN access.
Use TLS with hostname and certificate verification for remote PostgreSQL endpoints.
Give MCP a dedicated least-privilege PostgreSQL role; keep restricted mode enabled.
Do not enable EXPLAIN ANALYZE or unrestricted MCP mode without understanding that queries or writes will execute.
Add tested backups, restore drills, monitoring, WAL management, disk alerts, and capacity limits.
Do not call two standalone TimescaleDB containers “HA” until replication, leader election, routing, and failover have been configured and tested.
Prefer a deliberately composed extension image over an everything-enabled bundle for production.

Conclusion

This lab demonstrates why PostgreSQL earns the “Swiss Army knife” description. The core database remains familiar, while extensions change the workload it can address: pgvector adds similarity search, TimescaleDB adds time-series behavior, pglayers turns extension discovery into a composable workflow, PostgresAI packages a broad Database Lab toolset, and MCP makes PostgreSQL safely inspectable by AI clients.

The real lesson is not only PostgreSQL’s flexibility. It is that Docker isolation matters: unique ports, unique volumes, image-correct PGDATA paths, explicit health checks, canonical roles where an image expects them, and least-privilege connections. Get those foundations right and the PostgreSQL ecosystem becomes an unusually capable platform for experimentation.

Primary references

Amazon Aurora DSQL First Preview – Create Multi-Region Cluster

December 4, 2024 ~ Shadab Mohammad

Amazon just launched the new Distributed SQL Aurora database today.

Aurora DSQL is already available as a public preview in the US Regions. In this article I want to give you a first look at creating a cluster and connecting to it with the psql client.

Go to this link to get started : https://console.aws.amazon.com/dsql/

1. Create the DSQL Cluster

We will create a Multi-Region cluster with a linked Region and a witness Region.

us-east-1 (N Virginia) -> Writer
us-east-2 (Ohio) -> Writer

us-west-2 (Oregon) -> Quorum

2. Wait for Cluster Creation to complete to get the Endpoint

3. Generate an auth token to log in to Aurora DSQL

https://docs.aws.amazon.com/aurora-dsql/latest/userguide/authentication-token-cli.html

aws dsql generate-db-connect-admin-auth-token \
--expires-in 3600 \
--region us-east-1 \
--hostname <dsql-cluster-endpoint>

The full output will be the password, like below :

v4********4u.dsql.us-east-1.on.aws/?Action=DbConnectAdmin&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AK*****04%2Fus-east-1%2Fdsql%2Faws4_request&X-Amz-Date=202**X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Signature=41e15*****ddfc49
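
If a login fails, it is useful to see which endpoint and lifetime a token was signed for. The Python sketch below only inspects the string. It assumes the token keeps the layout of the sample above (endpoint, then "/?", then signed query parameters), and the function name describe_token is hypothetical.

```python
from urllib.parse import parse_qs


def describe_token(token):
    """Split a DSQL auth token into its endpoint and selected signed query parameters."""
    endpoint, _, query = token.partition("/?")
    params = parse_qs(query)
    expires = params.get("X-Amz-Expires")
    return {
        "endpoint": endpoint,
        "action": params.get("Action", [None])[0],
        "expires_in_seconds": int(expires[0]) if expires else None,
    }


if __name__ == "__main__":
    # Made-up token for illustration; a real one carries a credential and signature.
    sample = "example.dsql.us-east-1.on.aws/?Action=DbConnectAdmin&X-Amz-Expires=3600"
    print(describe_token(sample))
```

The token is a credential, so treat any string you pass to this helper like a password.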

4. Connect with PSQL

https://docs.aws.amazon.com/aurora-dsql/latest/userguide/getting-started.html#getting-started-create-cluster

PGSSLMODE=require \
psql --dbname postgres \
--username admin \
--host v4*******u.dsql.us-east-1.on.aws

Password for user admin: <paste-full-string-of-auth-token-output>
psql (17.2, server 16.5)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_128_GCM_SHA256, compression: off, ALPN: none)
Type "help" for help.

postgres=>

We can connect with a Standard PSQL client!!

5. Create some test objects

https://docs.aws.amazon.com/aurora-dsql/latest/userguide/getting-started.html#getting-started-create-cluster

CREATE SCHEMA app;

CREATE TABLE app.orders (
order_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id INTEGER,
product_id INTEGER,
product_description VARCHAR(500),
order_delivery_address VARCHAR(500),
order_date_taken DATE,
order_misc_notes VARCHAR(500)
);

Sample CSV File to Load Data to Orders Table :

sample_orders.csv Download

\copy app.orders (order_id,customer_id,product_id,product_description,order_delivery_address,order_date_taken,order_misc_notes) FROM '/Users/shadab/Downloads/sample_orders.csv' DELIMITER ',' CSV HEADER;

Note: Keep the \copy command on a single line; psql does not accept meta-commands that span multiple lines.

6. Run SQL Query

[a] Query to Find the Top 5 Customers by Total Orders Within the Last 6 Months

WITH recent_orders AS (
SELECT
customer_id,
product_id,
COUNT(*) AS order_count
FROM
app.orders
WHERE
order_date_taken >= CURRENT_DATE - INTERVAL '6 months'
GROUP BY
customer_id, product_id
)
SELECT
customer_id,
SUM(order_count) AS total_orders,
STRING_AGG(DISTINCT product_id::TEXT, ', ') AS ordered_products
FROM
recent_orders
GROUP BY
customer_id
ORDER BY
total_orders DESC
LIMIT 5;

[b] Query to Find the Most Common Delivery Address Patterns

SELECT
LEFT(order_delivery_address, POSITION(',' IN order_delivery_address) - 1) AS address_prefix,
COUNT(*) AS order_count
FROM
app.orders
GROUP BY
address_prefix
ORDER BY
order_count DESC
LIMIT 10;
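
Query [b] has an edge case: an address without a comma. The Python sketch below mimics the standard PostgreSQL behaviour of POSITION (0 when the substring is absent) and LEFT (a negative length drops characters from the end) to show what the expression returns. This assumes Aurora DSQL follows the same semantics, which was not verified here, and the helper names are hypothetical.

```python
def sql_position(needle, text):
    """Mimic PostgreSQL POSITION(): 1-based index, or 0 when the substring is absent."""
    return text.find(needle) + 1


def sql_left(text, n):
    """Mimic PostgreSQL LEFT(): a negative n returns all but the last |n| characters."""
    return text[:n]


def address_prefix(address):
    """Mirror LEFT(address, POSITION(',' IN address) - 1) from query [b]."""
    return sql_left(address, sql_position(",", address) - 1)


def first_segment(address):
    """Alternative: everything before the first comma, or the whole string if none."""
    return address.split(",", 1)[0]


if __name__ == "__main__":
    for sample in ("12 High St, Sydney", "PO Box 42"):
        print(repr(address_prefix(sample)), repr(first_segment(sample)))
```

With a comma both helpers agree. Without one, the query expression silently drops the last character of the address, so it may be worth filtering out or handling comma-free rows before grouping.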

[c] Query to Calculate Monthly Order Trends by Product

SELECT
TO_CHAR(order_date_taken, 'YYYY-MM') AS order_month,
product_id,
COUNT(*) AS total_orders,
AVG(LENGTH(order_misc_notes)) AS avg_note_length -- Example of additional insight
FROM
app.orders
GROUP BY
order_month, product_id
ORDER BY
order_month DESC, total_orders DESC;

7. Check Latency

You can check latency from AWS Cloud Shell using traceroute to your Aurora DSQL endpoints from different regions

us-east-1 (N Virginia)

$ traceroute v*****u.dsql.us-east-1.on.aws

traceroute to v****u.dsql.us-east-1.on.aws (44.223.172.242), 30 hops max, 60 byte packets
1 * * 216.182.237.241 (216.182.237.241) 1.566 ms

ap-southeast-2 (Sydney)

$ traceroute v*****u.dsql.us-east-1.on.aws

traceroute to v********u.dsql.us-east-1.on.aws (44.223.172.242), 30 hops max, 60 byte packets
1 244.5.0.119 (244.5.0.119) 1.224 ms * 244.5.0.115 (244.5.0.115) 5.922 ms
2 100.65.22.0 (100.65.22.0) 4.048 ms 100.65.23.112 (100.65.23.112) 5.203 ms 100.65.22.224 (100.65.22.224) 3.309 ms
3 100.66.9.110 (100.66.9.110) 25.430 ms 100.66.9.176 (100.66.9.176) 7.950 ms 100.66.9.178 (100.66.9.178) 3.966 ms
4 100.66.10.32 (100.66.10.32) 0.842 ms 100.66.11.36 (100.66.11.36) 2.745 ms 100.66.11.96 (100.66.11.96) 3.638 ms
5 240.1.192.3 (240.1.192.3) 0.263 ms 240.1.192.1 (240.1.192.1) 0.278 ms 240.1.192.3 (240.1.192.3) 0.244 ms
6 240.0.236.32 (240.0.236.32) 197.174 ms 240.0.184.33 (240.0.184.33) 197.206 ms 240.0.236.13 (240.0.236.13) 199.076 ms
7 242.3.84.161 (242.3.84.161) 200.891 ms 242.2.212.161 (242.2.212.161) 202.113 ms 242.2.212.33 (242.2.212.33) 197.571 ms
8 240.0.32.47 (240.0.32.47) 196.768 ms 240.0.52.96 (240.0.52.96) 196.935 ms 240.3.16.65 (240.3.16.65) 197.235 ms
9 242.7.128.1 (242.7.128.1) 234.734 ms 242.2.168.185 (242.2.168.185) 203.477 ms 242.0.208.5 (242.0.208.5) 204.263 ms
10 * 100.66.10.209 (100.66.10.209) 292.168 ms *

References:

[1] Aurora DSQL : https://aws.amazon.com/rds/aurora/dsql/features/

[2] Aurora DSQL User Guide : https://docs.aws.amazon.com/aurora-dsql/latest/userguide/getting-started.html#getting-started-create-cluster

[3] Use the AWS CLI to generate a token in Aurora DSQL : https://docs.aws.amazon.com/aurora-dsql/latest/userguide/authentication-token-cli.html

[4] DSQL Vignette: Aurora DSQL, and A Personal Story : https://brooker.co.za/blog/2024/12/03/aurora-dsql.html

——————————————————————————————–

Build and store a Hive metastore outside an EMR cluster in an RDS MySQL database and Connect a Redshift cluster to an EMR cluster

March 7, 2020 ~ Shadab Mohammad

This document addresses the specific configuration points that need to be in place in order to build and store a Hive metastore outside an EMR cluster in an RDS MySQL database. It also covers the steps to connect a Redshift cluster to an EMR cluster so Redshift can create and access the tables stored within the external metastore.

Resources Used:

• Redshift Cluster

• RDS MySQL Instance

• EMR Cluster

Note: All resources must be in same VPC and same region for this practice.

Creating the RDS MySQL:

1 – First, create an RDS MySQL instance if you don’t have one already. Open the AWS RDS Console and create a MySQL instance that will be used during this practice.

Note: Please make note of RDS security group, endpoint, Master User and Master Password. We will need that information later on.

2 – Once the RDS MySQL instance is created, modify its security groups to add a rule for All traffic on all Port Range to be allowed from the VPC’s default security group.

Note: This VPC’s default Security Group will be used while creating the EMR cluster later on as well but it needs to be whitelisted beforehand otherwise the EMR launching will fail while trying to reach out to the RDS MySQL.

Before creating the EMR Cluster:

3 – After creating the RDS MySQL (and open its security group to EMR) but right before creating the EMR cluster, a JSON configuration file needs to be created. This file will be ingested by EMR during the bootstrapping phase of EMR’s creation, it will basically tell EMR how to access the remote RDS MySQL database.

4 – Copy the JSON property structure from the following link (use Copy icon): https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-external.html

5 – Paste it in a text editor and modify it carefully with the RDS details you noted earlier.

Note: Be careful, the value property can not contain any spaces or carriage returns. It should appear all on one line. Save it as “hiveConfiguration.json”.

6 – The final JSON configuration file should look like the following:

[
    {
      "Classification": "hive-site",
      "Properties": {
        "javax.jdo.option.ConnectionURL": "jdbc:mysql:\/\/database-1.cefjr3enh3dk.us-east-2.rds.amazonaws.com:3306\/hive?createDatabaseIfNotExist=true",
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": "admin",
        "javax.jdo.option.ConnectionPassword": "*********"
      }
    }
]

Note 1: Replace the hostname, user name and password shown in the example with your own details.

Note 2: The part “hive?createDatabaseIfNotExist=true” determines the name of the database to be created in the MySQL RDS, in this case the database will be called “hive”.
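
Because a stray space or line break in a value is easy to miss when editing by hand, the file can also be generated. The following Python sketch builds the same structure and rejects values containing whitespace. The function name build_hive_site and the sample values are hypothetical; the property names are the ones shown above.

```python
import json


def build_hive_site(host, user, password, database="hive", port=3306):
    """Build the hive-site classification for an external MySQL metastore."""
    properties = {
        "javax.jdo.option.ConnectionURL": (
            f"jdbc:mysql://{host}:{port}/{database}?createDatabaseIfNotExist=true"
        ),
        "javax.jdo.option.ConnectionDriverName": "org.mariadb.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    # The values must sit on one line without spaces or carriage returns.
    for key, value in properties.items():
        if any(ch.isspace() for ch in value):
            raise ValueError(f"{key} must not contain spaces or line breaks")
    return [{"Classification": "hive-site", "Properties": properties}]


if __name__ == "__main__":
    # Hypothetical values for illustration; substitute your RDS endpoint and credentials.
    config = build_hive_site("my-rds-endpoint.example.com", "admin", "change-me")
    print(json.dumps(config, indent=2))
```

Redirecting the printed output to hiveConfiguration.json gives a file ready for the upload in the next step. json.dumps writes plain slashes rather than the escaped form, which is equivalent JSON.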

7 – After creating above file, upload it to an S3 bucket/folder of your choice (in the same region of your resources).

Creating the EMR:

8 – Now, it is time to create the EMR cluster. To do this, open AWS EMR console and click Create Cluster button. This will prompt the Quick Options page but we won’t be using that. Click on Go to advanced options on the top of the page.

9 – This will send you to the Advanced Options page. There, under Software Configuration, select the following Applications:

Hadoop, Ganglia, Hive, Hue, Tez, Pig, Mahout

10 – On the same page, under the Edit Software Settings section, click Load JSON from S3 and browse to the S3 bucket/path where you uploaded the previously created file “hiveConfiguration.json”. Pick the file there and hit Select.

11 – In the Hardware Configuration page, make sure that the EMR cluster is in the same VPC as your MySQL RDS instance. Hit Next if you don’t want to change any Network configuration or Node types.

12 – Hit Next in the General Options page if you don’t want to change anything, although you might want to change the name of your EMR cluster here.

13 – In the next page, Security Options, make sure you have an EC2 Key Pair in that region and select it. Otherwise, create one!

Note: Create one now (if you don’t have one) before creating the EMR as you CAN’T add it later!!!

14 – Still in the Security Options page, expand the EC2 security groups panel and change both, Master and Core & Task instances to use the VPC’s default security group (the same whitelisted in the RDS MySQL security group earlier).

15 – Hit Create cluster and wait for the EMR cluster to be created. It will take some time…

Confirming that the metastore was created in the RDS MySQL

16 – Once the EMR is created, another rule needs to be added to the VPC’s default security group, one that allows SSHing into the EMR cluster on port 22 from your local IP. It should look like the following:

17 – With the right rules in place, try to connect to your EMR cluster from your local machine:

chmod 600 article_key.pem
ssh -i article_key.pem hadoop@ec2-18-XX-XX-XX.us-east-2.compute.amazonaws.com

18 – EMR has a MySQL client installed. Use this client to connect to your MySQL database and run a few tests, such as whether the Security Groups are working properly and whether the “hive” database was created properly.

Note: You can do a telnet test from within EMR box as well to test Security Group access.
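
If telnet is not installed on the EMR box, the same reachability check can be done with Python's standard library. This is a minimal sketch; the function name can_connect is hypothetical, and a successful TCP connection only proves the security group path is open, not that the MySQL credentials work.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

For example, can_connect("<rds-endpoint>", 3306) should return True from the EMR master node once the RDS security group allows traffic from the VPC's default security group.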

19 – To connect to the RDS MySQL, run the following command from your EMR box (there is no space between -p and the password):

mysql -h <rds-endpoint> -P 3306 -u <rds master user> -p<rds master password>

Example: mysql -h database-1.cefjr3enh3dk.us-east-2.rds.amazonaws.com -P 3306 -u admin123 -pPwD12345

20 – Once connected, use the following commands to verify if the Hive metastore was indeed created in the RDS. You should be able to see a database named “hive” there:

show databases;       -- Lists all databases; "hive" should be there
use hive;             -- Connects you to the "hive" database
show tables;          -- Lists all the meta tables within the hive database
select * from TBLS;   -- Lists all tables created in Hive. At this point there are none

Setting up necessary Spectrum Roles and Network requirements for Redshift and EMR

Note 1: Following steps assume that you already have a Redshift cluster and that you can connect to it. It will not guide you on how to create and access the Redshift cluster.

Note 2: EMR and RDS MySQL share the VPC’s default security group. If your Redshift cluster uses that security group as well, the clusters should already be able to communicate with each other; in that case you can skip Step 22 and go straight to Step 23. Otherwise, if EMR and Redshift use different security groups, please do Step 22 first.

21 – Create a Role for Spectrum and attach it to your Redshift cluster. Follow the instructions here:

• To Create the Role: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-create-role.html

• To Associate the Role: https://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum-add-role.html

22 – (Optional) Now that Redshift can access S3, Redshift also needs to access EMR cluster and vice-versa. Follow the steps listed under section “Enabling Your Amazon Redshift Cluster to Access Your Amazon EMR Cluster” in the following link: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-schemas.html#c-spectrum-enabling-emr-access

Note: In summary, this creates an EC2 security group with Redshift’s Security Group and the EMR’s master node’s security groups inside it. Redshift’s Security Group must allow TCP in every port (0 – 65535) while EMR’s Security Group must allow TCP in port 9083 (Hive’s default). Next, you attach this newly created security group to both of your Redshift and EMR clusters.

23 – Once this is done, you should be able to create the External Schema in Redshift, query the external tables from Redshift and also create/see the schemas/tables from EMR Hive. However, at this point there are no tables created yet.

Creating Tables on Hive First

24 – Log in to the Hive console and run the following:

> show databases;
default (that’s the only database so far)

> create external table hive_table (col1 int, col2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location 's3://<your_bucket>/<your_folder>/';

> show tables;
hive_table (that’s the table we just created)

25 – Log back in to your MySQL database and rerun the verification commands from step 20 (use hive; followed by select * from TBLS;):

Note: Now you will be able to see the newly created table “hive_table” showing on your External MySQL catalog.

Creating Schemas and Tables on Redshift Now

26 – On the Redshift side, an External Schema must be created before creating or querying the Hive tables, like the following:

CREATE EXTERNAL SCHEMA emr_play                     -- It can be any name; this schema is valid only for Redshift.
FROM HIVE METASTORE DATABASE 'default'              -- Use the default database to match the database we have in Hive.
URI '172.XXX.XXX.XXX' PORT 9083                     -- Private IP of EMR's master instance. Hive's default port is 9083.
IAM_ROLE 'arn:aws:iam::000000000000:role/spectrum'; -- A valid Spectrum role attached to Redshift.

27 – Create the table(s):

create external table emr_play.redshift_table (col1 int, col2 varchar)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location 's3://<your_bucket>/<your_folder>/';

28 – Simply query the table now:

select * from emr_play.redshift_table;

29 – One more time, log back in to your MySQL database and rerun the verification commands from step 20 (use hive; followed by select * from TBLS;):

Note: You should now be able to see both the Hive and Redshift tables in your external MySQL catalog. You can also query the tables and create new tables on both the Hive and Redshift sides.

Create-Modify-Destroy Redshift Cluster Using Terraform

June 13, 2019 ~ Shadab Mohammad

DevOps tools have become quite popular in the last few years. Infrastructure automation tools like Chef, Ansible, CloudFormation and Terraform are increasingly being used to provision cloud infrastructure. They were once used only for provisioning compute resources, but agile data analytics needs mean that resources such as data warehouses are now being added to the DevOps cycle as well. Many of these tools, e.g. SaltStack, Ansible, Chef and Puppet, are widely used in the industry, but one of them stands out among the rest: Terraform.

What makes Terraform different from most of the others is its declarative nature backed by state tracking. Many infrastructure automation tools are procedural rather than declarative (our very own CloudFormation is also declarative). Let me explain the difference between declarative and procedural.

Let’s say you want to provision 10 EC2 instances using an automation approach. With a tool like Ansible, your template would look something like the following, using a procedural declaration.

- ec2:
    count: 10
    image: ami-v1
    instance_type: t2.micro

Same code in Terraform using a declarative approach looks like

resource "aws_instance" "example" {
  count         = 10
  ami           = "ami-v1"
  instance_type = "t2.micro"
}

The two approaches look similar, but suppose you want to add 5 more servers to the configuration. The Ansible code is of little help here because Ansible does not maintain state. If you change the count to 15, Ansible will create 15 additional EC2 instances, because it has no way of knowing what it did in the past. To reach a total of 15 servers you have to ask for only the additional 5:

- ec2:
    count: 5
    image: ami-v1
    instance_type: t2.micro

With Terraform this is the big game changer. Terraform maintains the state of your infrastructure and is aware of anything it created in the past. Therefore, to deploy 5 more servers, all you have to do is go back to the same Terraform template and update the count from 10 to 15:

resource "aws_instance" "example" {
  count         = 15
  ami           = "ami-v1"
  instance_type = "t2.micro"
}

When you execute this template, Terraform knows it created 10 instances before, so it adds only the 5 new instances. With the declarative approach, the end goal is what matters. This makes Terraform the winner among all the others, IMHO. So in this example, once we are done with the test, we can delete the cluster by running just one command without specifying any additional details, because Terraform maintains a record that it created a Redshift cluster with a given name.
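
The difference can be illustrated with a toy model in Python. This is a sketch of the idea only, not how Terraform or Ansible are implemented, and both function names are hypothetical.

```python
def procedural_run(existing, count):
    """A stateless 'create N' step: every run adds count instances to what exists."""
    return existing + count


def declarative_plan(recorded, desired):
    """Compare the recorded state with the desired count and return the actions needed."""
    delta = desired - recorded
    return {"add": max(delta, 0), "destroy": max(-delta, 0)}


if __name__ == "__main__":
    # Ten instances already exist; the template now says 15.
    print(procedural_run(10, 15))    # total instances after a stateless run
    print(declarative_plan(10, 15))  # only the difference is planned
    print(declarative_plan(15, 0))   # a desired count of zero plans a full teardown
```

The stateless run ends with 25 instances, while the state-aware plan adds only 5. The last call mirrors why a single destroy command is enough: the recorded state already says what exists.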

Let’s now jump in and create a Redshift dc1.large cluster in region ‘us-east-1’ using Terraform

1. Download and Install Terraform for Linux from the Terraform Website : https://www.terraform.io/downloads.html

Note : Install awscli and configure your AWS credentials before we begin

On Linux the download is a zip file containing only 1 file. Unzip to any directory and copy the file ‘terraform’ to /usr/bin

2. Create a Terraform configuration file in a new directory

mkdir redshift_tf

cd redshift_tf

vim redshift.tf

provider "aws" {
  region = "us-east-1"
}
resource "aws_redshift_cluster" "default" {
  cluster_identifier  = "terraform-rs-cluster"
  database_name       = "testdb"
  master_username     = "awsuser"
  master_password     = "SomePassword1"
  node_type           = "dc1.large"
  cluster_type        = "single-node"
  skip_final_snapshot = true
}

3. Initiate Terraform

$ terraform init

Initializing the backend…

Initializing provider plugins…
– Checking for available provider plugins…
– Downloading plugin for provider “aws” (terraform-providers/aws) 2.14.0…

The following providers do not have any version constraints in configuration,
so the latest version was installed.

To prevent automatic upgrades to new major versions that may contain breaking
changes, it is recommended to add version = “…” constraints to the
corresponding provider blocks in configuration, with the constraint strings
suggested below.

* provider.aws: version = “~> 2.14”

Terraform has been successfully initialized!

4. Apply Terraform Configuration

Note 1: From Terraform 0.11 onward you do not have to run the ‘terraform plan’ command separately; ‘terraform apply’ shows the execution plan and asks for confirmation before making changes.

Note 2: For security purposes, it is not good practice to store access_key or secret_key in the .tf file. If you have installed and configured awscli, Terraform will take your AWS credentials from ‘~/.aws/credentials’ or from IAM credentials.

$ terraform apply

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
+ create

Terraform will perform the following actions:

# aws_redshift_cluster.default will be created
+ resource “aws_redshift_cluster” “default” {
+ allow_version_upgrade = true
+ automated_snapshot_retention_period = 1
+ availability_zone = (known after apply)
+ bucket_name = (known after apply)
+ cluster_identifier = “terraform-rs-cluster”
+ cluster_parameter_group_name = (known after apply)
+ cluster_public_key = (known after apply)
+ cluster_revision_number = (known after apply)
+ cluster_security_groups = (known after apply)
+ cluster_subnet_group_name = (known after apply)
+ cluster_type = “single-node”
+ cluster_version = “1.0”
+ database_name = “testdb”
+ dns_name = (known after apply)
+ enable_logging = (known after apply)
+ encrypted = false
+ endpoint = (known after apply)
+ enhanced_vpc_routing = (known after apply)
+ iam_roles = (known after apply)
+ id = (known after apply)
+ kms_key_id = (known after apply)
+ master_password = (sensitive value)
+ master_username = “awsuser”
+ node_type = “dc1.large”
+ number_of_nodes = 1
+ port = 5439
+ preferred_maintenance_window = (known after apply)
+ publicly_accessible = true
+ s3_key_prefix = (known after apply)
+ skip_final_snapshot = false
+ vpc_security_group_ids = (known after apply)
}

Plan: 1 to add, 0 to change, 0 to destroy.

aws_redshift_cluster.default: Creation complete after 3m33s [id=terraform-rs-cluster]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

5. Check the state of your infrastructure

You can check in your AWS console > Redshift Dashboard and you will see the cluster. To see it from Terraform, run the command below.

$ terraform show

6. Destroy the Redshift cluster

As I mentioned at the beginning of this article, the beauty of Terraform is that it maintains the state of your infrastructure. You can remove the Redshift cluster by running just one simple command.

$ terraform destroy

Check Redshift Table and Send SMS Programmatically using Amazon SNS

May 25, 2019 ~ Shadab Mohammad

Requirement: Check if an UPDATE was run on a Redshift table and send an SMS programmatically using Amazon SNS. This script can be used in a variety of scenarios; e.g. you can use the same logic to check for load errors on your cluster or to check for INSERT or DELETE commands.

Check if an UPDATE has occurred on a table and send an SMS every time the table is updated

Pre-requisites :

1. BOTO3 Python SDK installed for Python3.7

2. AWS CLI installed

3. Psycopg2 package installed for Python3.7

4. Redshift Cluster

5. Amazon SNS configured to send SMS

6. Basic understanding of Python scripting, BOTO3 and Redshift.

Environment:

EC2 Instance Running CENTOS

Python3.7 Installed

AWS CLI installed and Configured Account credentials

Solution:

We will create a Python script that checks svl_statementtext for UPDATE statements on a table named TEST. The script can be scheduled in crontab to run every minute, and if an UPDATE has occurred it dispatches an SMS using SNS.

1. Create the TEST table in your Redshift cluster and Insert some data into it

testdb=# create table test (id int8, name varchar(20));
CREATE TABLE
testdb=# insert into test values(1,'John');
INSERT 0 1
testdb=# insert into test values(2,'Matt');
INSERT 0 1
testdb=# insert into test values(3,'Chris');

2. Run an UPDATE statement on the table and check svl_statementtext

testdb=# select * from test;
 id | name
----+-------
  1 | John
  2 | Matt
  3 | Chris
(3 rows)

testdb=# update test set name='Tim' where id=1;
UPDATE 1

testdb=# select * from svl_statementtext where text ilike 'update%test%' and starttime > date_trunc('minute', sysdate);
 userid |  xid   |  pid  | label   |         starttime          |          endtime           | sequence | type  | text
    100 | 506858 | 20017 | default | 2019-05-25 14:44:41.657139 | 2019-05-25 14:44:49.955101 |        0 | QUERY | update test set name='Tim' where id=1;
(1 row)

As you can see, svl_statementtext logs the UPDATE command and displays its text. Now we will run another UPDATE and query the same view, this time using count(*).

testdb=# update test set name='John' where id=1;
UPDATE 1

testdb=# select count(*) from svl_statementtext where text ilike 'update%test%' and starttime > date_trunc('minute', sysdate);
 count
-------
1
(1 row)

The count correctly shows that one UPDATE statement ran against table TEST since the start of the current minute. Using this logic, we can poll svl_statementtext every minute to see whether a matching statement was logged, and if so, send an SMS via SNS.
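
One detail of this filter is worth checking before scheduling the poll: date_trunc('minute', sysdate) returns the start of the current minute, not a point 60 seconds in the past. The sketch below mimics that comparison with Python's datetime module. The function names are illustrative and not part of the original script; the UPDATE timestamp is borrowed from the svl_statementtext output above.

```python
from datetime import datetime, timedelta


def minute_floor(now):
    """Mimic date_trunc('minute', sysdate): drop seconds and microseconds."""
    return now.replace(second=0, microsecond=0)


def seen_by_truncated_window(statement_start, now):
    """True if the statement started after the start of the current minute."""
    return statement_start > minute_floor(now)


def seen_by_rolling_window(statement_start, now):
    """True if the statement started within the 60 seconds before 'now'."""
    return statement_start > now - timedelta(seconds=60)


update_time = datetime(2019, 5, 25, 14, 44, 41, 657139)

# A check later in the same minute sees the UPDATE.
same_minute = datetime(2019, 5, 25, 14, 44, 55)
# A check at the top of the next minute (when cron typically fires) does not,
# unless the window is a rolling 60 seconds.
next_minute = datetime(2019, 5, 25, 14, 45, 0)

print(seen_by_truncated_window(update_time, same_minute))  # True
print(seen_by_truncated_window(update_time, next_minute))  # False
print(seen_by_rolling_window(update_time, next_minute))    # True
```

If the poll runs from cron at the start of each minute, a rolling condition such as starttime > dateadd(minute, -1, sysdate) may catch statements that the truncated window skips. Treat that as a suggestion to test against your own cluster rather than a drop-in fix.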

3. Python script (attached as svl_statementtext_1min.py) to check for UPDATE statements in the last minute and send an SMS if the count is not 0

import boto3
import psycopg2

# Obtain the connection to Redshift
con = psycopg2.connect(dbname='testdb', host='redshift-dc2-test.ctzrqaulg0u6.us-east-1.redshift.amazonaws.com',
                       port='5439', user='awsuser', password='********')

# Open a cursor and run the SQL query
cur = con.cursor()
cur.execute("select count(*) from svl_statementtext where text ilike 'update%test%' and starttime > date_trunc('minute', sysdate);")
count = cur.fetchone()[0]  # fetchone() returns a one-element tuple such as (0,)
print(count)
con.commit()

# Close the cursor and the connection
cur.close()
con.close()

# Compare the count against the threshold
if count == 0:
    print("NO UPDATES IN LAST 1 MINUTE ON TABLE TEST")
else:
    print("UPDATES IN LAST 1 MINUTE ON TABLE TEST")
    # Create an SNS client
    client = boto3.client("sns", region_name="us-east-1")
    # Create the topic if it doesn't exist
    topic = client.create_topic(Name="invites-for-push-notifications")
    topic_arn = topic['TopicArn']  # get its Amazon Resource Name
    # List of contacts
    list_of_contacts = ["+6144*********"]  # <-- you can add multiple mobile numbers here
    # Add SMS subscribers
    for number in list_of_contacts:
        client.subscribe(
            TopicArn=topic_arn,
            Protocol='sms',
            Endpoint=number  # <-- number that will receive the SMS
        )
    # Publish a message
    client.publish(Message="Hello World!", TopicArn=topic_arn, MessageAttributes={
        'AWS.SNS.SMS.SenderID': {
            'DataType': 'String',
            'StringValue': 'EASYORADBA'  # <-- name of sender, not available in USA
        },
        'AWS.SNS.SMS.SMSType': {'DataType': 'String', 'StringValue': 'Transactional'}})

Open two sessions. From one session, run an UPDATE command on table TEST; from the other, execute the Python script. If you configured everything properly, you will get an SMS from sender 'EASYORADBA' with the message text "Hello World!"
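
The script routes the SMS through a topic and its subscribers. For a single recipient, the SNS publish call also accepts a PhoneNumber argument, which skips the topic entirely. The sketch below only builds the keyword arguments for such a call, so it runs without AWS credentials. The helper name build_sms_request and the placeholder phone number are hypothetical and not part of the original script.

```python
def build_sms_request(phone_number, message, sender_id="EASYORADBA"):
    """Build keyword arguments for a direct-to-phone SNS publish call."""
    if not phone_number.startswith("+"):
        raise ValueError("phone number should start with '+' and a country code")
    return {
        "PhoneNumber": phone_number,
        "Message": message,
        "MessageAttributes": {
            "AWS.SNS.SMS.SenderID": {"DataType": "String", "StringValue": sender_id},
            "AWS.SNS.SMS.SMSType": {"DataType": "String", "StringValue": "Transactional"},
        },
    }


# Hypothetical placeholder number, used only to show the shape of the request.
request = build_sms_request("+10000000000", "UPDATE detected on table TEST")
print(sorted(request))

# Sending it would look like this (needs boto3 and configured AWS credentials):
# import boto3
# boto3.client("sns", region_name="us-east-1").publish(**request)
```

Keeping the request construction separate from the network call makes the message attributes easy to inspect before any SMS is actually sent.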

4. Save Script & Schedule to run every minute in Crontab

* * * * * /usr/local/bin/python3.7 svl_statementtext_1min.py

This script can be used in a variety of scenarios to dispatch an SMS based on some count logic. Another scenario is checking for load errors on your Redshift cluster: run the script during your data loading window and check the STL_LOAD_ERRORS table. If there was a data loading issue, the Data Engineering team can be notified via SMS. I am attaching the script (stl_load_errors.py) that checks for data loading errors. You can change the granularity of the time window by changing the time interval passed to the date_trunc function.

SQL: select count(*) from stl_load_errors where starttime > date_trunc('minute', sysdate);
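
As a minimal sketch of changing that granularity, the helper below builds the same count query for a chosen date_trunc unit. The function name, the allow-list, and the set of units in it are assumptions made for illustration; they are not taken from the attached stl_load_errors.py.

```python
ALLOWED_UNITS = {"minute", "hour", "day"}


def build_load_error_query(unit="minute"):
    """Return the stl_load_errors count query for the given date_trunc unit."""
    if unit not in ALLOWED_UNITS:
        raise ValueError(f"unsupported time unit: {unit!r}")
    # The unit is checked against a fixed allow-list before it is interpolated,
    # so arbitrary text never reaches the SQL string.
    return (
        "select count(*) from stl_load_errors "
        f"where starttime > date_trunc('{unit}', sysdate);"
    )


print(build_load_error_query("hour"))
```

The returned string can be passed to cur.execute() in place of the hard-coded query.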

AWS Redshift Insert into Table without S3

April 14, 2019 ~ Shadab Mohammad

The following SQL generates random data in a table without using S3, which is handy for quick tests.

drop table if exists seed;

create table seed ( n int8 );

insert into seed (
SELECT
p0.n
+ p1.n*2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as number
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
Order by 1
);

commit;

drop table if exists test_table;

create table test_table(
ingest_time timestamp encode zstd,
doi date encode zstd,
id int encode bytedict,
value float encode zstd,
data_sig varchar(32) encode zstd
) DISTKEY(id) SORTKEY(ingest_time);

commit;

insert into test_table (
select dateadd('msec', -10 * n, getdate()) as ingest_time, trunc(dateadd('msec', -10 * n, getdate())) as doi, id,
n::float / 1000000 as value, 'sig-' || to_hex(n % 16) as data_sig
FROM (select (a.n + b.n + c.n + d.n) as n, (random() * 1000)::int as id from seed a cross join (select n*256 as n from seed) b cross join (select n*65536 as n from seed) c
cross join (select n*16777216 as n from ( select distinct (n/16)::int as n from seed ) ) d)
) order by ingest_time;

commit;

analyze test_table;

select count(*) from test_table;

Each consecutive run of the insert query above adds about 268 million rows.

You can create a table with about 1 billion rows in 8 minutes on a ds2.xlarge cluster
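
The row counts can be checked with a little arithmetic. The sketch below reproduces the seed construction in Python and counts the combinations produced by the cross joins. It assumes the joins scale the seed values by 256, 65536, and 16777216, and that n/16 is integer division yielding 16 distinct values; the variable names are illustrative.

```python
from itertools import product

# The seed insert combines eight 0/1 values as binary digits.
seed = sorted(
    sum(bit << position for position, bit in enumerate(bits))
    for bits in product((0, 1), repeat=8)
)

# Distinct values of n/16 (integer division) over the seed table.
high_digits = sorted({n // 16 for n in seed})

# Three full copies of the seed cross-joined with the high digits.
rows_per_run = len(seed) ** 3 * len(high_digits)
largest_n = (
    seed[-1]
    + seed[-1] * 256
    + seed[-1] * 65536
    + high_digits[-1] * 16777216
)

print(len(seed), len(high_digits))
print(rows_per_run)
print(largest_n == rows_per_run - 1)
print(4 * rows_per_run)
```

Under those assumptions each run yields 268,435,456 rows, and four runs pass the one billion mark.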

Common `docker run` parameters