PostgreSQL's COPY command is the fastest way to push CSV data into a table — but it has one rigid behavior that makes it dangerous for production pipelines: a single malformed row aborts the entire transaction. When you're ingesting a million-row file at 3 AM and row 700,000 has a stray comma in an unquoted text field, you don't want the whole load to roll back. You want the good rows committed and the bad row quarantined for review.
pgLoader is the standard tool for that job in the Postgres ecosystem. It wraps the COPY protocol with error handling, schema discovery, data transformation, and a declarative command-file format that fits into version control. For the underlying COPY mechanics, see the COPY vs INSERT comparison; this article focuses on when pgLoader is the better choice and how to configure it for scale.
pgLoader vs COPY at a glance
| Capability | pgLoader | COPY / \copy |
|---|---|---|
| Bad row handling | Quarantines bad rows, keeps loading | Aborts entire transaction |
| Schema discovery | Auto-creates table from headers | Table must already exist |
| Data transformation | Lisp-based casts and lambdas | None — load as-is or pre-process |
| Remote sources | HTTP, S3, archives, FTP | Local filesystem only |
| Parallelism | Multi-threaded reader/writer | Single-threaded per command |
| Throughput on clean data | Noticeably slower (workload-dependent) | Fastest available |
| Installation | External binary required | Built into PostgreSQL |
| Configuration format | Declarative .load file (version-controllable) | Inline SQL or shell args |
When to Reach for pgLoader
pgLoader is not a universal replacement for COPY. It is the right tool for a specific set of pipeline problems where rigid all-or-nothing loading is the bottleneck.
Untrusted or vendor-supplied data
When the CSV comes from a third party (sales export, partner feed, scraped data), you cannot guarantee every row is well-formed. pgLoader's reject log lets the load complete and gives you a focused file of rows to investigate.
Recurring nightly jobs that must not fail
A pipeline that runs every night needs to make forward progress even when a few rows are bad. With pgLoader, you alert on the reject log size but the data warehouse stays current.
Unknown or evolving schema
When you receive a new export format and don't want to hand-build the table DDL, pgLoader will infer column types from the CSV. Useful for ad-hoc exploration of unfamiliar files.
Cleanup happens during the load
Currency symbols, date format inconsistencies, "Y/N" booleans, padded whitespace — pgLoader's CAST clause applies the transformation as data streams in, removing the need for a separate scrub step.
Source files are remote
pgLoader fetches directly from HTTPS URLs and S3 buckets without an intermediate download. The file is streamed and parsed in memory.
When to Stick With COPY
For internal exports of known shape — your nightly database-to-database dump, your QA seed file, a one-off migration of trusted data — COPY remains the right tool. It is faster, requires no installation, and produces a cleaner failure: if the data is wrong, you want to know immediately, not discover it in a reject log a day later. pgLoader's value appears only when you actually need its resilience features.
Installing pgLoader
| Platform | Command | Notes |
|---|---|---|
| macOS | brew install pgloader | Latest stable; depends on SBCL |
| Ubuntu/Debian | apt-get install pgloader | May lag upstream by a release |
| Docker | docker pull dimitri/pgloader | Best for CI/CD; pin to a tagged version |
| RHEL/CentOS | Build from source | No official package; use Docker instead |
| Windows | Use WSL2 + Ubuntu install | No native Windows binary |
Anatomy of a .load Command File
pgLoader uses a declarative DSL stored in a .load file. The file describes the source, the target, the parsing options, and any setup SQL to run before or after the load. The structure is intentionally readable — you write it once, commit it to Git, and team members can review changes the same way they review code.
Sections of a .load file
| Clause | Purpose |
|---|---|
| LOAD CSV FROM | Source location — local path, HTTP URL, or s3:// |
| INTO postgresql:// | Connection string for the target database |
| TARGET TABLE | Destination table name (created if missing) |
| WITH | Parsing flags — skip header, field separator, quote character, truncate |
| SET | Per-session Postgres parameters — work_mem, client_encoding, statement_timeout |
| BEFORE LOAD DO | SQL statements to run once before ingestion starts (drop indexes, create tables) |
| CAST | Type conversion and transformation rules for specific columns |
| AFTER LOAD DO | Post-load SQL — rebuild indexes, ANALYZE, refresh materialized views |
A minimal command file looks like this:
LOAD CSV
FROM 'data/sales_2023.csv' (id, product, price, sold_at)
INTO postgresql://user:password@localhost/sales_db
TARGET TABLE sales_data
WITH truncate,
skip header = 1,
fields terminated by ',',
fields optionally enclosed by '"'
SET work_mem to '128MB',
client_encoding to 'utf8';Run it with a single command: pgloader import_sales.load. On completion, pgLoader prints a summary table showing rows read, rows imported, rows rejected, and the time spent in each phase (read, prepare, copy, index).
Error Resilience: The Reject Log Pattern
This is the single feature that justifies installing pgLoader for most teams. When a row fails to parse or fails a type cast, pgLoader writes the raw row to a .rejects file and the reason to a .errors file, then continues with the next row. The pipeline does not abort.
The two-log pattern
Always configure both REJECTED DATA LOG (the raw row, ready to re-ingest after fixing) and REJECTED DATA ERRORS (the parse or type error message). Without the errors file you cannot diagnose why a row failed; without the rejects file you have nothing to re-ingest after the fix.
Combined with rows per transaction, this turns a fragile script into a resumable, observable pipeline. A typical configuration commits every 10,000 rows, quarantines bad rows, and produces a daily reject file that an on-call engineer reviews each morning. The database stays current, the load never wakes anyone up at 3 AM, and bad data has a defined remediation path.
Reject-log clauses
| Clause | What it does |
|---|---|
| rows per transaction | Commit interval — smaller batches limit rollback scope on connection errors |
| REJECTED DATA LOG | Path to the file that captures raw bad rows for re-ingestion |
| REJECTED DATA ERRORS | Path to the file that captures the parse or cast error for each rejected row |
| on error stop | Override default behavior and abort on first error (rarely used) |
| errors per second | Threshold above which pgLoader aborts — catches catastrophic schema mismatches |
On-the-Fly Transformation With CAST
The CAST clause lets you transform a column's value while it is streaming into Postgres, instead of loading raw text and running a cleanup pass afterward. This is faster (one pass instead of two), cleaner (no temporary "dirty" state in the target table), and easier to reason about (the transformation lives next to the load definition).
Common transformation patterns
| Source data problem | CAST approach |
|---|---|
| Currency prefix ("$19.99") | Strip the dollar sign with remove-chars, cast to numeric |
| "Y" / "N" instead of true/false | Map with a lambda that returns t for "Y", nil otherwise |
| Padded whitespace in text fields | Use the built-in trim type-modifier on text columns |
| Mixed date formats | Cast to timestamptz using format string detection |
| Empty string vs NULL | Use null if blanks to treat empty strings as NULL |
| Localized decimals ("1.234,56") | Custom lambda to swap separators before numeric cast |
The transformation language is Common Lisp, which is unfamiliar to most teams. In practice you copy-paste from the pgLoader documentation, adjust the column name and the function, and move on. You rarely need to write Lisp from scratch.
Parallel Ingestion
pgLoader is multi-threaded by default. While one worker reads and parses the CSV, another applies casts, and a third streams the result to Postgres via the COPY protocol. On a multi-core machine with a fast SSD, this pipeline-style parallelism keeps the network connection saturated even when the source CSV is large or the transformations are non-trivial.
This is different from running multiple COPY commands in parallel, which is also possible but requires you to manually split the source file and manage concurrent connections. With pgLoader, the parallelism is automatic and configured per-run via workers and concurrency settings.
Production Tuning Tips
Increase work_mem in SET
Give pgLoader more memory for sorts and hash operations during transformation. 128MB to 512MB is a reasonable range for most workloads — anything higher rarely helps and risks exhausting server memory under concurrency.
Consider UNLOGGED tables for wipe-and-load
If the target table is truncated before every load, set it to UNLOGGED to bypass the write-ahead log entirely. Throughput can improve substantially, with the actual gain depending on disk speed, row width, and indexes. Switch back to LOGGED after the load if you need replication or crash recovery for that table.
Version-control .load files
A .load file contains your full ingestion logic — source, schema, transformations, and reject handling. Treat it like application code: store in Git, code-review changes, and tie revisions to the data shape changes that triggered them.
Pin the Docker image tag
pgLoader behavior can change between versions, particularly around CAST defaults and reject-log formats. Pin to a specific image tag in CI/CD so pipeline output stays reproducible.
Drop indexes BEFORE LOAD, rebuild AFTER LOAD
For multi-million-row loads into existing tables, dropping non-essential indexes in BEFORE LOAD and recreating them in AFTER LOAD is significantly faster than maintaining them row-by-row. The same pattern as raw COPY tuning.
UNLOGGED has a real cost
An UNLOGGED table is not replicated, is not crash-safe, and is truncated to empty after a Postgres crash. It is appropriate for staging tables or wipe-and-load patterns where the source CSV is the system of record. Never make a transactional table UNLOGGED for the sake of a faster load.
Decision Summary
Continue along the pipeline
No .load files, no reject logs, no Lisp
Elvity ingests messy CSV data into Postgres with automatic schema inference, type detection, and per-row error handling — without writing or maintaining a pgLoader command file.