Home/Articles/pgLoader for CSV-to-Postgres pipelines

Scalable Data Ingestion: Using pgLoader for Advanced CSV-to-Postgres Pipelines

COPY is fast but brittle — one bad row rolls back the whole load. pgLoader trades a little speed for error resilience, schema discovery, and on-the-fly transformation. Here is when each one is the right tool, and how to build a production pgLoader pipeline.

9 min read·PostgreSQL · Data Pipelines

PostgreSQL's COPY command is the fastest way to push CSV data into a table — but it has one rigid behavior that makes it dangerous for production pipelines: a single malformed row aborts the entire transaction. When you're ingesting a million-row file at 3 AM and row 700,000 has a stray comma in an unquoted text field, you don't want the whole load to roll back. You want the good rows committed and the bad row quarantined for review.

pgLoader is the standard tool for that job in the Postgres ecosystem. It wraps the COPY protocol with error handling, schema discovery, data transformation, and a declarative command-file format that fits into version control. For the underlying COPY mechanics, see the COPY vs INSERT comparison; this article focuses on when pgLoader is the better choice and how to configure it for scale.

pgLoader vs COPY at a glance

CapabilitypgLoaderCOPY / \copy
Bad row handlingQuarantines bad rows, keeps loadingAborts entire transaction
Schema discoveryAuto-creates table from headersTable must already exist
Data transformationLisp-based casts and lambdasNone — load as-is or pre-process
Remote sourcesHTTP, S3, archives, FTPLocal filesystem only
ParallelismMulti-threaded reader/writerSingle-threaded per command
Throughput on clean dataNoticeably slower (workload-dependent)Fastest available
InstallationExternal binary requiredBuilt into PostgreSQL
Configuration formatDeclarative .load file (version-controllable)Inline SQL or shell args

When to Reach for pgLoader

pgLoader is not a universal replacement for COPY. It is the right tool for a specific set of pipeline problems where rigid all-or-nothing loading is the bottleneck.

Untrusted or vendor-supplied data

When the CSV comes from a third party (sales export, partner feed, scraped data), you cannot guarantee every row is well-formed. pgLoader's reject log lets the load complete and gives you a focused file of rows to investigate.

Recurring nightly jobs that must not fail

A pipeline that runs every night needs to make forward progress even when a few rows are bad. With pgLoader, you alert on the reject log size but the data warehouse stays current.

Unknown or evolving schema

When you receive a new export format and don't want to hand-build the table DDL, pgLoader will infer column types from the CSV. Useful for ad-hoc exploration of unfamiliar files.

Cleanup happens during the load

Currency symbols, date format inconsistencies, "Y/N" booleans, padded whitespace — pgLoader's CAST clause applies the transformation as data streams in, removing the need for a separate scrub step.

Source files are remote

pgLoader fetches directly from HTTPS URLs and S3 buckets without an intermediate download. The file is streamed and parsed in memory.

When to Stick With COPY

For internal exports of known shape — your nightly database-to-database dump, your QA seed file, a one-off migration of trusted data — COPY remains the right tool. It is faster, requires no installation, and produces a cleaner failure: if the data is wrong, you want to know immediately, not discover it in a reject log a day later. pgLoader's value appears only when you actually need its resilience features.

Installing pgLoader

PlatformCommandNotes
macOSbrew install pgloaderLatest stable; depends on SBCL
Ubuntu/Debianapt-get install pgloaderMay lag upstream by a release
Dockerdocker pull dimitri/pgloaderBest for CI/CD; pin to a tagged version
RHEL/CentOSBuild from sourceNo official package; use Docker instead
WindowsUse WSL2 + Ubuntu installNo native Windows binary

Anatomy of a .load Command File

pgLoader uses a declarative DSL stored in a .load file. The file describes the source, the target, the parsing options, and any setup SQL to run before or after the load. The structure is intentionally readable — you write it once, commit it to Git, and team members can review changes the same way they review code.

Sections of a .load file

ClausePurpose
LOAD CSV FROMSource location — local path, HTTP URL, or s3://
INTO postgresql://Connection string for the target database
TARGET TABLEDestination table name (created if missing)
WITHParsing flags — skip header, field separator, quote character, truncate
SETPer-session Postgres parameters — work_mem, client_encoding, statement_timeout
BEFORE LOAD DOSQL statements to run once before ingestion starts (drop indexes, create tables)
CASTType conversion and transformation rules for specific columns
AFTER LOAD DOPost-load SQL — rebuild indexes, ANALYZE, refresh materialized views

A minimal command file looks like this:

LOAD CSV
  FROM 'data/sales_2023.csv' (id, product, price, sold_at)
  INTO postgresql://user:password@localhost/sales_db
  TARGET TABLE sales_data

  WITH truncate,
       skip header = 1,
       fields terminated by ',',
       fields optionally enclosed by '"'

  SET work_mem to '128MB',
      client_encoding to 'utf8';

Run it with a single command: pgloader import_sales.load. On completion, pgLoader prints a summary table showing rows read, rows imported, rows rejected, and the time spent in each phase (read, prepare, copy, index).

Error Resilience: The Reject Log Pattern

This is the single feature that justifies installing pgLoader for most teams. When a row fails to parse or fails a type cast, pgLoader writes the raw row to a .rejects file and the reason to a .errors file, then continues with the next row. The pipeline does not abort.

The two-log pattern

Always configure both REJECTED DATA LOG (the raw row, ready to re-ingest after fixing) and REJECTED DATA ERRORS (the parse or type error message). Without the errors file you cannot diagnose why a row failed; without the rejects file you have nothing to re-ingest after the fix.

Combined with rows per transaction, this turns a fragile script into a resumable, observable pipeline. A typical configuration commits every 10,000 rows, quarantines bad rows, and produces a daily reject file that an on-call engineer reviews each morning. The database stays current, the load never wakes anyone up at 3 AM, and bad data has a defined remediation path.

Reject-log clauses

ClauseWhat it does
rows per transactionCommit interval — smaller batches limit rollback scope on connection errors
REJECTED DATA LOGPath to the file that captures raw bad rows for re-ingestion
REJECTED DATA ERRORSPath to the file that captures the parse or cast error for each rejected row
on error stopOverride default behavior and abort on first error (rarely used)
errors per secondThreshold above which pgLoader aborts — catches catastrophic schema mismatches

On-the-Fly Transformation With CAST

The CAST clause lets you transform a column's value while it is streaming into Postgres, instead of loading raw text and running a cleanup pass afterward. This is faster (one pass instead of two), cleaner (no temporary "dirty" state in the target table), and easier to reason about (the transformation lives next to the load definition).

Common transformation patterns

Source data problemCAST approach
Currency prefix ("$19.99")Strip the dollar sign with remove-chars, cast to numeric
"Y" / "N" instead of true/falseMap with a lambda that returns t for "Y", nil otherwise
Padded whitespace in text fieldsUse the built-in trim type-modifier on text columns
Mixed date formatsCast to timestamptz using format string detection
Empty string vs NULLUse null if blanks to treat empty strings as NULL
Localized decimals ("1.234,56")Custom lambda to swap separators before numeric cast

The transformation language is Common Lisp, which is unfamiliar to most teams. In practice you copy-paste from the pgLoader documentation, adjust the column name and the function, and move on. You rarely need to write Lisp from scratch.

Parallel Ingestion

pgLoader is multi-threaded by default. While one worker reads and parses the CSV, another applies casts, and a third streams the result to Postgres via the COPY protocol. On a multi-core machine with a fast SSD, this pipeline-style parallelism keeps the network connection saturated even when the source CSV is large or the transformations are non-trivial.

This is different from running multiple COPY commands in parallel, which is also possible but requires you to manually split the source file and manage concurrent connections. With pgLoader, the parallelism is automatic and configured per-run via workers and concurrency settings.

Production Tuning Tips

Increase work_mem in SET

Give pgLoader more memory for sorts and hash operations during transformation. 128MB to 512MB is a reasonable range for most workloads — anything higher rarely helps and risks exhausting server memory under concurrency.

Consider UNLOGGED tables for wipe-and-load

If the target table is truncated before every load, set it to UNLOGGED to bypass the write-ahead log entirely. Throughput can improve substantially, with the actual gain depending on disk speed, row width, and indexes. Switch back to LOGGED after the load if you need replication or crash recovery for that table.

Version-control .load files

A .load file contains your full ingestion logic — source, schema, transformations, and reject handling. Treat it like application code: store in Git, code-review changes, and tie revisions to the data shape changes that triggered them.

Pin the Docker image tag

pgLoader behavior can change between versions, particularly around CAST defaults and reject-log formats. Pin to a specific image tag in CI/CD so pipeline output stays reproducible.

Drop indexes BEFORE LOAD, rebuild AFTER LOAD

For multi-million-row loads into existing tables, dropping non-essential indexes in BEFORE LOAD and recreating them in AFTER LOAD is significantly faster than maintaining them row-by-row. The same pattern as raw COPY tuning.

UNLOGGED has a real cost

An UNLOGGED table is not replicated, is not crash-safe, and is truncated to empty after a Postgres crash. It is appropriate for staging tables or wipe-and-load patterns where the source CSV is the system of record. Never make a transactional table UNLOGGED for the sake of a faster load.

Decision Summary

Internal export, known schema, clean data
Use \copy (or COPY for server-local files)
Third-party data, any chance of bad rows
Use pgLoader with REJECTED DATA logs
Need schema inference from CSV headers
Use pgLoader with auto-create
Need transformation during load (currency, dates, booleans)
Use pgLoader with CAST clauses
Source is on S3 / HTTPS / FTP
Use pgLoader with remote source URI
Single-shot migration, you control the file
Use \copy — simpler, faster, no install

No .load files, no reject logs, no Lisp

Elvity ingests messy CSV data into Postgres with automatic schema inference, type detection, and per-row error handling — without writing or maintaining a pgLoader command file.