PostgreSQL's COPY command is the fastest way to push CSV data into a table — but it has one rigid behavior that makes it dangerous for production pipelines: a single malformed row aborts the entire transaction. When you're ingesting a million-row file at 3 AM and row 700,000 has a stray comma in an unquoted text field, you don't want the whole load to roll back. You want the good rows committed and the bad row quarantined for review.

pgLoader is the standard tool for that job in the Postgres ecosystem. It wraps the COPY protocol with error handling, schema discovery, data transformation, and a declarative command-file format that fits into version control. For the underlying COPY mechanics, see the COPY vs INSERT comparison; this article focuses on when pgLoader is the better choice and how to configure it for scale.

pgLoader vs COPY at a glance

Capability	pgLoader	COPY / \copy
Bad row handling	Quarantines bad rows, keeps loading	Aborts entire transaction
Schema discovery	Auto-creates table from headers	Table must already exist
Data transformation	Lisp-based casts and lambdas	None — load as-is or pre-process
Remote sources	HTTP, S3, archives, FTP	Local filesystem only
Parallelism	Multi-threaded reader/writer	Single-threaded per command
Throughput on clean data	Noticeably slower (workload-dependent)	Fastest available
Installation	External binary required	Built into PostgreSQL
Configuration format	Declarative .load file (version-controllable)	Inline SQL or shell args

When to Reach for pgLoader

pgLoader is not a universal replacement for COPY. It is the right tool for a specific set of pipeline problems where rigid all-or-nothing loading is the bottleneck.

Untrusted or vendor-supplied data

When the CSV comes from a third party (sales export, partner feed, scraped data), you cannot guarantee every row is well-formed. pgLoader's reject log lets the load complete and gives you a focused file of rows to investigate.

Recurring nightly jobs that must not fail

A pipeline that runs every night needs to make forward progress even when a few rows are bad. With pgLoader, you alert on the reject log size but the data warehouse stays current.

Unknown or evolving schema

When you receive a new export format and don't want to hand-build the table DDL, pgLoader will infer column types from the CSV. Useful for ad-hoc exploration of unfamiliar files.

Cleanup happens during the load

Currency symbols, date format inconsistencies, "Y/N" booleans, padded whitespace — pgLoader's CAST clause applies the transformation as data streams in, removing the need for a separate scrub step.

Source files are remote

pgLoader fetches directly from HTTPS URLs and S3 buckets without an intermediate download. The file is streamed and parsed in memory.

When to Stick With COPY

For internal exports of known shape — your nightly database-to-database dump, your QA seed file, a one-off migration of trusted data — COPY remains the right tool. It is faster, requires no installation, and produces a cleaner failure: if the data is wrong, you want to know immediately, not discover it in a reject log a day later. pgLoader's value appears only when you actually need its resilience features.

Installing pgLoader

Platform	Command	Notes
macOS	brew install pgloader	Latest stable; depends on SBCL
Ubuntu/Debian	apt-get install pgloader	May lag upstream by a release
Docker	docker pull dimitri/pgloader	Best for CI/CD; pin to a tagged version
RHEL/CentOS	Build from source	No official package; use Docker instead
Windows	Use WSL2 + Ubuntu install	No native Windows binary

Anatomy of a `.load` Command File

pgLoader uses a declarative DSL stored in a .load file. The file describes the source, the target, the parsing options, and any setup SQL to run before or after the load. The structure is intentionally readable — you write it once, commit it to Git, and team members can review changes the same way they review code.

Sections of a .load file

Clause	Purpose
LOAD CSV FROM	Source location — local path, HTTP URL, or s3://
INTO postgresql://	Connection string for the target database
TARGET TABLE	Destination table name (created if missing)
WITH	Parsing flags — skip header, field separator, quote character, truncate
SET	Per-session Postgres parameters — work_mem, client_encoding, statement_timeout
BEFORE LOAD DO	SQL statements to run once before ingestion starts (drop indexes, create tables)
CAST	Type conversion and transformation rules for specific columns
AFTER LOAD DO	Post-load SQL — rebuild indexes, ANALYZE, refresh materialized views

A minimal command file looks like this:

LOAD CSV
  FROM 'data/sales_2023.csv' (id, product, price, sold_at)
  INTO postgresql://user:password@localhost/sales_db
  TARGET TABLE sales_data

  WITH truncate,
       skip header = 1,
       fields terminated by ',',
       fields optionally enclosed by '"'

  SET work_mem to '128MB',
      client_encoding to 'utf8';

Run it with a single command: pgloader import_sales.load. On completion, pgLoader prints a summary table showing rows read, rows imported, rows rejected, and the time spent in each phase (read, prepare, copy, index).

Error Resilience: The Reject Log Pattern

This is the single feature that justifies installing pgLoader for most teams. When a row fails to parse or fails a type cast, pgLoader writes the raw row to a .rejects file and the reason to a .errors file, then continues with the next row. The pipeline does not abort.

The two-log pattern

Always configure both REJECTED DATA LOG (the raw row, ready to re-ingest after fixing) and REJECTED DATA ERRORS (the parse or type error message). Without the errors file you cannot diagnose why a row failed; without the rejects file you have nothing to re-ingest after the fix.

Combined with rows per transaction, this turns a fragile script into a resumable, observable pipeline. A typical configuration commits every 10,000 rows, quarantines bad rows, and produces a daily reject file that an on-call engineer reviews each morning. The database stays current, the load never wakes anyone up at 3 AM, and bad data has a defined remediation path.

Reject-log clauses

Clause	What it does
rows per transaction	Commit interval — smaller batches limit rollback scope on connection errors
REJECTED DATA LOG	Path to the file that captures raw bad rows for re-ingestion
REJECTED DATA ERRORS	Path to the file that captures the parse or cast error for each rejected row
on error stop	Override default behavior and abort on first error (rarely used)
errors per second	Threshold above which pgLoader aborts — catches catastrophic schema mismatches

On-the-Fly Transformation With CAST

The CAST clause lets you transform a column's value while it is streaming into Postgres, instead of loading raw text and running a cleanup pass afterward. This is faster (one pass instead of two), cleaner (no temporary "dirty" state in the target table), and easier to reason about (the transformation lives next to the load definition).

Common transformation patterns

Source data problem	CAST approach
Currency prefix ("$19.99")	Strip the dollar sign with remove-chars, cast to numeric
"Y" / "N" instead of true/false	Map with a lambda that returns t for "Y", nil otherwise
Padded whitespace in text fields	Use the built-in trim type-modifier on text columns
Mixed date formats	Cast to timestamptz using format string detection
Empty string vs NULL	Use null if blanks to treat empty strings as NULL
Localized decimals ("1.234,56")	Custom lambda to swap separators before numeric cast

The transformation language is Common Lisp, which is unfamiliar to most teams. In practice you copy-paste from the pgLoader documentation, adjust the column name and the function, and move on. You rarely need to write Lisp from scratch.

Parallel Ingestion

pgLoader is multi-threaded by default. While one worker reads and parses the CSV, another applies casts, and a third streams the result to Postgres via the COPY protocol. On a multi-core machine with a fast SSD, this pipeline-style parallelism keeps the network connection saturated even when the source CSV is large or the transformations are non-trivial.

This is different from running multiple COPY commands in parallel, which is also possible but requires you to manually split the source file and manage concurrent connections. With pgLoader, the parallelism is automatic and configured per-run via workers and concurrency settings.

Production Tuning Tips

Increase work_mem in SET

Give pgLoader more memory for sorts and hash operations during transformation. 128MB to 512MB is a reasonable range for most workloads — anything higher rarely helps and risks exhausting server memory under concurrency.

Consider UNLOGGED tables for wipe-and-load

If the target table is truncated before every load, set it to UNLOGGED to bypass the write-ahead log entirely. Throughput can improve substantially, with the actual gain depending on disk speed, row width, and indexes. Switch back to LOGGED after the load if you need replication or crash recovery for that table.

Version-control .load files

A .load file contains your full ingestion logic — source, schema, transformations, and reject handling. Treat it like application code: store in Git, code-review changes, and tie revisions to the data shape changes that triggered them.

Pin the Docker image tag

pgLoader behavior can change between versions, particularly around CAST defaults and reject-log formats. Pin to a specific image tag in CI/CD so pipeline output stays reproducible.

Drop indexes BEFORE LOAD, rebuild AFTER LOAD

For multi-million-row loads into existing tables, dropping non-essential indexes in BEFORE LOAD and recreating them in AFTER LOAD is significantly faster than maintaining them row-by-row. The same pattern as raw COPY tuning.

UNLOGGED has a real cost

An UNLOGGED table is not replicated, is not crash-safe, and is truncated to empty after a Postgres crash. It is appropriate for staging tables or wipe-and-load patterns where the source CSV is the system of record. Never make a transactional table UNLOGGED for the sake of a faster load.

Decision Summary

Internal export, known schema, clean data

Use \copy (or COPY for server-local files)

Third-party data, any chance of bad rows

Use pgLoader with REJECTED DATA logs

Need schema inference from CSV headers

Use pgLoader with auto-create

Need transformation during load (currency, dates, booleans)

Use pgLoader with CAST clauses

Source is on S3 / HTTPS / FTP

Use pgLoader with remote source URI

Single-shot migration, you control the file

Use \copy — simpler, faster, no install

Scalable Data Ingestion: Using pgLoader for Advanced CSV-to-Postgres Pipelines

When to Reach for pgLoader

When to Stick With COPY

Installing pgLoader

Anatomy of a `.load` Command File

Error Resilience: The Reject Log Pattern

On-the-Fly Transformation With CAST

Parallel Ingestion

Production Tuning Tips

Decision Summary

No .load files, no reject logs, no Lisp

When to Reach for pgLoader

When to Stick With COPY

Installing pgLoader

Anatomy of a .load Command File

Error Resilience: The Reject Log Pattern

On-the-Fly Transformation With CAST

Parallel Ingestion

Production Tuning Tips

Decision Summary

No .load files, no reject logs, no Lisp

Anatomy of a `.load` Command File