Migrating 43M Records from DynamoDB to PostgreSQL

Migrated 43 million records from DynamoDB to Aurora PostgreSQL using a resumable CLI migration tool built in Node.js (TypeScript).

Final runtime: 2h15m
Rows inserted: 12.2M
Errors: 0

Idempotent design, file-level checkpointing, and predictable restart behavior made the system reliable under failure.

No scaling miracle. Just calm engineering.

Who This Is For

This guide is for engineers who need to migrate large datasets between databases and want a system that handles failure gracefully. You'll find it useful if you've dealt with partial writes, crashed processes, or the anxiety of restarting a migration at 3am.

Context: Large-Scale DynamoDB to Aurora PostgreSQL Migration

I've had to run this kind of migration more than once.

Data moves. Architectures evolve. What was "good enough" two years ago often isn't anymore.

Eventually someone says: "We need to migrate this."

I don't decide that part. But I'm usually the one responsible for making sure it finishes, doesn't duplicate data, and doesn't explode at 3am.

Migration Context

Property	Value
Source	DynamoDB
Target	Aurora PostgreSQL
Table size	43,378,852 items
Data size	~8.3 GB
Export format	DynamoDB Export to S3

The export produced 16 gzipped JSON files, each between 150–166 MB.

On paper: Read → Transform → Insert.

In practice, data migration is a reliability problem.

DynamoDB to PostgreSQL Migration Challenges

Long-running migrations fail in many ways:

Process crash
Partial writes
Network interruption
DB timeouts
Infrastructure restart
Malformed rows

The real questions become:

What was written?
What wasn't?
Can I restart safely?
Will I duplicate data?

After dealing with this uncertainty multiple times, I stopped treating migration as a scripting problem. It's a failure-handling problem.

Why AWS Lambda Failed for Long-Running Data Migration

My first attempt used AWS Lambda. It looked clean: stateless execution, easy scaling, no servers, chunk-based processing.

But migrations run long. Very long.

Problem: 30-Minute Timeout

Lambda's timeout forced me to build logic to detect near-timeout, persist checkpoint, and reinvoke itself.

Technically possible. Operationally? Messy.

Operational Issues

Logs fragmented across executions
Difficult to understand global progress
Hard to correlate runs
CloudWatch streams everywhere
Monitoring became painful
More orchestration logic than migration logic

Instead of solving data migration, I was building a mini scheduler. That's clever in the wrong direction.

So I killed it.

Final Approach: A Boring, Controlled CLI Migration Tool

I switched to a long-running CLI process in Node.js (TypeScript).

No reinvocation logic. No distributed coordination. No orchestration complexity.

Just: Stream file → Parse → Transform → Batch → Upsert → Log → Repeat.

Sometimes boring is better.

Designing an Idempotent Migration System

Before writing the tool, I defined one rule:

If I restart the migration from the beginning, the final database state must be identical.

This simplified everything. Instead of trying to eliminate duplicate work entirely, I made duplicate work harmless.

Implementation

The core of idempotency relies on PostgreSQL's ON CONFLICT clause:

sql

INSERT INTO target_table (id, data, updated_at)
VALUES ($1, $2, $3)
ON CONFLICT (id) DO UPDATE SET
  data = EXCLUDED.data,
  updated_at = EXCLUDED.updated_at;

Key elements:

Deterministic primary key derived from source ID
PostgreSQL ON CONFLICT DO UPDATE for upsert-based writes
No external state required to determine if a row was already processed

That means:

Retrying a batch is safe
Restarting the whole process is safe
Reprocessing a file is safe

Even if something runs twice, the final state is consistent. That removed most of the stress.

Checkpoint Strategy: File-Level Instead of Row-Level

I considered row-level checkpointing. It adds offset storage, metadata persistence, edge-case complexity, and more reasoning under failure.

Instead, I chose file-level checkpointing.

Each S3 file is:

Processed entirely
Marked complete only after success

If the process crashes halfway through a file, the file is reprocessed from the beginning. Because the system is idempotent, this is safe.

Less precise. Much simpler. Much more predictable.

Handling Bad Records During Bulk Inserts

Large DynamoDB exports always contain problematic records.

If a batch insert fails:

Split batch in half
Retry each half
Keep splitting
Isolate the bad record
Skip it

Everything else continues. One bad row should not stop hours of work.

Streaming Strategy for Large JSON Exports

Each .json.gz file is:

Streamed directly from S3
Decompressed on the fly
Parsed line by line
Inserted in batches of 500

Why Not Parallelize?

I intentionally used single worker, sequential file processing.

Yes, parallel execution would increase throughput. But it complicates restart logic, pressure control, and Aurora resource stability.

Predictability > maximum throughput.

Real Migration Execution Metrics

Here's one full production run:

Metric	Value
Duration	2h15m
Batch size	500
Files processed	16 / 16
Items scanned	37,783,238
Rows inserted	12,205,544
Skipped	0
Errors	0

Why 37M Scanned but Only 12.2M Inserted?

Not every DynamoDB item mapped directly to a PostgreSQL row. The source data used a single-table design where multiple entity types (users, sessions, events) lived in one table with composite keys.

For the target relational model:

Some rows were grouped (e.g., denormalized event sequences → single aggregate row)
Some entity types weren't needed in PostgreSQL
Only target-valid entities were inserted

Everything else was ignored by design. No errors. No manual cleanup afterward.

Aurora PostgreSQL Performance During Migration

I monitored Aurora CloudWatch metrics throughout the run.

Observed Behavior

Metric	Value
Commit throughput	~3–6 commits/sec
Commit latency	~200–300µs
Replica lag	~15–25 seconds during peak bursts
CPU credit usage	Stable
Buffer cache hit ratio	~100%

Nothing exploded. No replication issues. No WAL pressure disaster.

Batch size of 500 turned out to be a good balance: high throughput with controlled database pressure.

Migration Architecture Structure

┌─────────────────────────────────────────────────────────┐
│                     Core Engine                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐      │
│  │ Checkpoints │  │   Retries   │  │Orchestration│      │
│  └─────────────┘  └─────────────┘  └─────────────┘      │
└─────────────────────────────────────────────────────────┘
          │                                    │
          ▼                                    ▼
┌─────────────────────┐          ┌─────────────────────────┐
│   Source Dialect    │          │     Target Dialect      │
│  ─────────────────  │          │  ─────────────────────  │
│  DynamoDB S3 Export │          │  PostgreSQL Abstraction │
│  Reader             │          │  Layer                  │
└─────────────────────┘          └─────────────────────────┘
          │                                    │
          └──────────────┬─────────────────────┘
                         ▼
              ┌─────────────────────┐
              │    Profile Layer    │
              │  ─────────────────  │
              │  Domain-specific    │
              │  transformation     │
              │  logic              │
              └─────────────────────┘

Clear separation of concerns:

Core Engine: Checkpoints, retries, orchestration logic
Source Dialect: DynamoDB S3 export reader
Target Dialect: PostgreSQL abstraction layer
Profile Layer: Domain-specific transformation logic

The engine knows nothing about business logic. The business logic knows nothing about checkpoint mechanics. That makes reuse possible.

Failure Scenarios Considered

The system was designed assuming failure is normal:

Failure	Recovery Behavior
Process crash	Restart from last incomplete file
EC2 reboot	Same as process crash
Network interruption	Batch retry with exponential backoff
Database timeout	Connection pool refresh + retry
Malformed row	Isolate via binary search, skip, continue
Manual stop	Resume from checkpoint

In every case, restart behavior is deterministic. That clarity mattered more than optimization.

Tradeoffs Accepted

File-level checkpoints instead of row-level
Single worker instead of parallel
CLI instead of serverless
Continue-on-error for isolated bad records

Each decision favors recovery safety over cleverness.

Migration systems are not where I try to be fancy. They are where I try to sleep well.

What I Would Improve

Failure-injection testing (e.g., Chaos Monkey style)
Better metrics export (Prometheus/Grafana integration)
Possibly safe parallelism with bounded concurrency

The core engine does not need rewriting.

Key Lessons from Large-Scale Data Migration

The hardest part of migration is not moving data. It's designing a system that behaves calmly when things go wrong.

This reinforced:

Idempotency > micro-optimization
Deterministic recovery > precision checkpointing
Simple boundaries > clever orchestration
Observability > elegance

It's not flashy. But it's something I would trust to run overnight.

And that matters more.

Source Code

If you're in a similar situation and want to adapt it:

github.com/ny-randriantsarafara/dataflux

Have questions or built something similar? I'd love to hear about your approach.

DynamoDB to Aurora PostgreSQL: Migrating 43M Records with a Resumable CLI Tool

Who This Is For

Context: Large-Scale DynamoDB to Aurora PostgreSQL Migration

Migration Context

DynamoDB to PostgreSQL Migration Challenges

Why AWS Lambda Failed for Long-Running Data Migration

Problem: 30-Minute Timeout

Operational Issues

Final Approach: A Boring, Controlled CLI Migration Tool

Designing an Idempotent Migration System

Implementation

Checkpoint Strategy: File-Level Instead of Row-Level

Handling Bad Records During Bulk Inserts

Streaming Strategy for Large JSON Exports

Why Not Parallelize?

Real Migration Execution Metrics

Why 37M Scanned but Only 12.2M Inserted?

Aurora PostgreSQL Performance During Migration

Observed Behavior

Migration Architecture Structure

Failure Scenarios Considered

Tradeoffs Accepted

What I Would Improve

Key Lessons from Large-Scale Data Migration

Source Code

Comments

More from this blog

A brief introduction into the art of software craftsmanship

Command Palette

Who This Is For

Context: Large-Scale DynamoDB to Aurora PostgreSQL Migration

Migration Context

DynamoDB to PostgreSQL Migration Challenges

Why AWS Lambda Failed for Long-Running Data Migration

Problem: 30-Minute Timeout

Operational Issues

Final Approach: A Boring, Controlled CLI Migration Tool

Designing an Idempotent Migration System

Implementation

Checkpoint Strategy: File-Level Instead of Row-Level

Handling Bad Records During Bulk Inserts

Streaming Strategy for Large JSON Exports

Why Not Parallelize?

Real Migration Execution Metrics

Why 37M Scanned but Only 12.2M Inserted?

Aurora PostgreSQL Performance During Migration

Observed Behavior

Migration Architecture Structure

Failure Scenarios Considered

Tradeoffs Accepted

What I Would Improve

Key Lessons from Large-Scale Data Migration

Source Code

Comments

More from this blog