
Getting Started with DuckLake 1.0: A SQL-Based Data Lake Format

Last updated: 2026-05-03 22:36:44 · Technology

Overview

DuckLake 1.0 introduces a fresh approach to managing data lake metadata. Instead of scattering metadata across numerous files in object storage, it centralizes table metadata in a SQL database—making updates, sorting, and partitioning more efficient. Built as a DuckDB extension, DuckLake integrates seamlessly with existing workflows and offers compatibility with Iceberg-style features. This guide walks you through its setup, core operations, and common pitfalls.
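The core idea, storing table metadata as rows in a SQL database rather than as files in object storage, can be sketched with Python's built-in sqlite3 module standing in for the catalog database. This is a simplified illustration of the concept, not DuckLake's actual catalog schema:

```python
import sqlite3

# Stand-in catalog: one row per data file, instead of one metadata
# file per snapshot in object storage. Simplified sketch only; the
# table and column names here are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE data_files (
        table_name TEXT,
        file_path  TEXT,
        partition  TEXT,
        row_count  INTEGER
    )
""")

# Registering a new Parquet file is a single transactional INSERT,
# not a new metadata file written to the object store.
con.execute(
    "INSERT INTO data_files VALUES ('sales', 's3://bucket/sales/1.parquet', 'East', 1000)"
)
con.commit()

# Planning a query means reading the catalog, not listing object storage.
files = con.execute(
    "SELECT file_path FROM data_files WHERE table_name = 'sales'"
).fetchall()
print(files)  # [('s3://bucket/sales/1.parquet',)]
```

Because the catalog is an ordinary SQL database, metadata changes get transactions, indexes, and concurrent access for free.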

Source: www.infoq.com

Prerequisites

  • DuckDB: Version 1.3.0 or higher (command-line interface or Python package).
  • Object Storage: A bucket or directory (e.g., S3, MinIO, local filesystem) for storing Parquet files.
  • SQL Database: For the catalog—DuckDB itself works for local testing; for production, use PostgreSQL or MySQL.
  • DuckLake Extension: Install via INSTALL ducklake; LOAD ducklake;.

Step-by-Step Instructions

1. Install and Load the DuckLake Extension

Open DuckDB and run:

INSTALL ducklake;
LOAD ducklake;

This registers DuckLake’s functions and types. Verify with SELECT * FROM ducklake_version();

2. Create a DuckLake Catalog

A catalog holds all table metadata. Use CREATE DUCKLAKE CATALOG:

CREATE DUCKLAKE CATALOG my_catalog
  DATABASE 'duckdb'  -- can be 'postgresql' or 'mysql'
  CONNECTION_STRING 'file:///path/to/catalog.db';

-- Switch to the catalog
USE my_catalog;

Tip: For remote databases, use a connection string like postgresql://user:pass@host/db.
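Connection strings like the one in the tip break easily when passwords contain characters such as @ or /. A small helper that percent-encodes the password avoids this; the function name and structure here are hypothetical, not part of DuckLake:

```python
from urllib.parse import quote

def pg_connection_string(user: str, password: str, host: str, db: str) -> str:
    """Build a postgresql://user:pass@host/db connection string.

    Hypothetical helper for illustration; the password is
    percent-encoded so special characters do not break the URL.
    """
    return f"postgresql://{user}:{quote(password, safe='')}@{host}/{db}"

print(pg_connection_string("duck", "p@ss/word", "db.internal", "catalog"))
# postgresql://duck:p%40ss%2Fword@db.internal/catalog
```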

3. Create a DuckLake Table

Define a table with partitioning and sorting:

CREATE DUCKLAKE TABLE sales (
    order_id INTEGER,
    amount DECIMAL(10,2),
    order_date DATE,
    region VARCHAR
)
PARTITIONED BY (region)
SORTED BY (order_date);

This creates a logical table. Data is stored as Parquet files in your object storage.
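To make the storage side concrete, here is a rough sketch of how a partitioned write could lay out Parquet files under a table prefix. The Hive-style region=East path convention is an assumption for illustration only; in DuckLake, which files belong to which partition is recorded in the catalog, not inferred from path naming:

```python
import uuid

def partition_file_path(table_root: str, partition_col: str, value: str) -> str:
    # Hypothetical Hive-style layout: one directory per partition value,
    # with a unique file name per write so concurrent inserts never collide.
    return f"{table_root}/{partition_col}={value}/{uuid.uuid4().hex}.parquet"

path = partition_file_path("s3://bucket/sales", "region", "East")
print(path)  # e.g. s3://bucket/sales/region=East/3f2a9c....parquet
```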

4. Insert Data

Insert directly or from a SELECT:

INSERT INTO sales VALUES
    (1, 150.00, '2025-01-15', 'East'),
    (2, 200.50, '2025-01-16', 'West');

DuckLake automatically writes new Parquet files per partition and updates the catalog.

5. Query the Table

Standard SQL works—DuckLake reads the catalog to locate files:

SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE order_date >= '2025-01-01'
GROUP BY region;

Partition pruning and sorting are applied automatically.
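The pruning that makes this query cheap can be sketched in plain Python: the planner consults per-file statistics recorded in the catalog and skips files that cannot match the predicate. The manifest structure below is invented for illustration:

```python
from datetime import date

# Invented per-file metadata, as a catalog might record it:
# partition value plus min/max of the sort column.
manifest = [
    {"path": "east-jan.parquet", "region": "East",
     "min_date": date(2025, 1, 1), "max_date": date(2025, 1, 31)},
    {"path": "west-dec.parquet", "region": "West",
     "min_date": date(2024, 12, 1), "max_date": date(2024, 12, 31)},
]

def prune(manifest, min_order_date):
    # Keep only files whose date range can overlap the predicate;
    # a region filter would prune on the partition value the same way.
    return [f["path"] for f in manifest if f["max_date"] >= min_order_date]

print(prune(manifest, date(2025, 1, 1)))  # ['east-jan.parquet']
```

Sorting matters here: because order_date is clustered within each file, the min/max ranges are tight and more files can be skipped.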


6. Manage Partitions and Small Updates

DuckLake supports incremental updates without rewriting whole partitions. Use MERGE or DELETE:

DELETE FROM sales WHERE order_id = 1;

MERGE INTO sales AS target
USING (VALUES (3, 300.00, DATE '2025-01-20', 'East'))
    AS src(order_id, amount, order_date, region)
ON target.order_id = src.order_id
WHEN MATCHED THEN UPDATE SET amount = src.amount
WHEN NOT MATCHED THEN INSERT (order_id, amount, order_date, region)
    VALUES (src.order_id, src.amount, src.order_date, src.region);

The catalog tracks these small changes efficiently.
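The MERGE above follows standard upsert semantics: update on key match, insert otherwise. A dictionary-based sketch of the same logic (an illustration of the semantics, not how DuckLake executes it):

```python
def merge(table: dict, src_rows: list):
    # table maps order_id -> (amount, order_date, region)
    for order_id, amount, order_date, region in src_rows:
        if order_id in table:
            # WHEN MATCHED: update the amount only, as in the SQL above.
            old = table[order_id]
            table[order_id] = (amount, old[1], old[2])
        else:
            # WHEN NOT MATCHED: insert the full row.
            table[order_id] = (amount, order_date, region)

sales = {2: (200.50, "2025-01-16", "West")}
merge(sales, [(3, 300.00, "2025-01-20", "East")])
print(sales[3])  # (300.0, '2025-01-20', 'East')
```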

7. Iceberg Compatibility

DuckLake can read Iceberg tables if you enable compatibility mode:

SET ducklake_iceberg_compat = true;
SELECT * FROM iceberg_scan('s3://bucket/iceberg_table');

Write support is limited to DuckLake-native tables.

Common Mistakes

  • Forgetting to load the extension: Always run LOAD ducklake; after installation.
  • Wrong catalog connection string: Ensure the path or database URL is correct and accessible.
  • Partition key mismatch: When inserting, include the partition column; missing it causes errors.
  • Overwriting small files: DuckLake handles small updates, but avoid frequent tiny inserts—compact periodically with OPTIMIZE TABLE sales;.
  • Ignoring sorting: Define a sort column to speed up range queries; otherwise full scans occur.
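The small-file concern from the last two points can be illustrated with a toy compaction planner: group files below a size threshold per partition and rewrite each group as one larger file, which is roughly what a compaction command does behind the scenes. The threshold and grouping rule here are invented for illustration:

```python
from collections import defaultdict

SMALL = 1_000_000  # bytes; illustrative threshold, not a DuckLake default

def plan_compaction(files):
    """Group small files by partition; each group becomes one rewrite job."""
    groups = defaultdict(list)
    for path, partition, size in files:
        if size < SMALL:
            groups[partition].append(path)
    # Only partitions with two or more small files are worth rewriting.
    return {p: fs for p, fs in groups.items() if len(fs) >= 2}

files = [
    ("a.parquet", "East", 200_000),
    ("b.parquet", "East", 150_000),
    ("c.parquet", "West", 5_000_000),
]
print(plan_compaction(files))  # {'East': ['a.parquet', 'b.parquet']}
```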

Summary

DuckLake 1.0 simplifies data lake management by storing metadata in SQL, enabling faster updates and smarter partitioning. With its DuckDB extension, you get a lightweight yet powerful alternative to Hive or Iceberg for analytical workloads. Start small, tune your partitions, and enjoy seamless SQL-driven data lakes.