Reporting on data in DuckDB

Query your Parquet exports directly from S3 without downloading files.

This feature is currently in closed beta.
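
Once the prerequisites below are in place, a single query can read an export where it sits. A quick illustration (the bucket and date are placeholders, and the name/email columns assume the person schema used throughout this article):

-- Query one table of one export straight from S3
INSTALL httpfs;
LOAD httpfs;
SELECT name, email
FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet')
LIMIT 10;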

Prerequisites

# Install DuckDB (one-time)
# macOS
brew install duckdb

# Linux
curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip && unzip duckdb.zip

Set up AWS credentials

DuckDB reads AWS credentials from standard locations:

# Option 1: AWS CLI (recommended)
aws sso login
eval $(aws configure export-credentials --format env)
aws sts get-caller-identity

# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
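
Alternatively, newer DuckDB versions can resolve the same credential chain from inside a session via the aws extension. A minimal sketch (the secret name trelica_s3 is arbitrary):

-- Let DuckDB pick up credentials the same way the AWS CLI does
INSTALL aws;
LOAD aws;
CREATE SECRET trelica_s3 (
    TYPE s3,
    PROVIDER credential_chain
);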

Recommended setup – automated initialization

The export generates initialization scripts to help you query the data. Choose based on your use case:

Local files

init_duckdb_local.sql - Creates views that query local Parquet files directly (no data duplication)
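
The exact contents vary by export, but each view in init_duckdb_local.sql amounts to roughly the following (the table name and path here are illustrative):

-- Illustrative shape of one view from init_duckdb_local.sql
CREATE OR REPLACE VIEW person AS
SELECT * FROM read_parquet('./2025-10-22/person.parquet');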

S3 files

For BI tools: Use init_duckdb_s3.sql (TABLES)

  • Downloads data once from S3 into DuckDB for fast queries
  • Best for: Dashboards, reporting, BI tools in containers (e.g. Metabase)
  • Performance: Fast (data cached in DuckDB)
  • Trade-off: Requires manual refresh to see new data

For ad-hoc queries: Use init_duckdb_s3_views.sql (VIEWS)

  • Queries S3 Parquet files directly on each SELECT
  • Best for: DuckDB CLI exploration, always-fresh data
  • Performance: Slower (reads from S3 on each query)
  • Trade-off: Always shows latest data without refresh (the sketch below shows how the two scripts differ)
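
The practical difference between the two scripts boils down to CREATE TABLE versus CREATE VIEW over the same read_parquet call. A simplified sketch (paths illustrative; the real scripts use the same name for the table or view, shown here with distinct names so both statements can run side by side):

-- init_duckdb_s3.sql style: materialize the data once, then query locally
CREATE OR REPLACE TABLE person AS
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');

-- init_duckdb_s3_views.sql style: hit S3 on every SELECT
CREATE OR REPLACE VIEW person_view AS
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');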

Example: BI tool setup (tables)

# Download the init script (tables version)
aws s3 cp s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3.sql .

# Initialize the database
duckdb trelica_dw.db < init_duckdb_s3.sql

Example: ad-hoc queries (views)

# Download the init script (views version)
aws s3 cp s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3_views.sql .

# Initialize the database
duckdb trelica_dw.db < init_duckdb_s3_views.sql

Now you can query all tables using simple names:

# Start DuckDB
duckdb trelica_dw.db -ui

-- Query tables directly
SELECT * FROM person WHERE email LIKE '%@example.com';

-- Join across tables
SELECT
    p.name,
    p.email,
    t.name as team_name
FROM person p
JOIN person_team pt ON p.person_id = pt.person_id
JOIN team t ON pt.team_id = t.team_id;

Query multiple dates

Track changes over time by querying multiple export dates:

-- Count persons in each export
SELECT
    regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1) as export_date,
    COUNT(*) as person_count
FROM read_parquet(
    's3://<YOUR-BUCKET>/trelica/*/person.parquet',
    filename=true
)
GROUP BY export_date
ORDER BY export_date;

-- Query all data across dates
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/*/person.parquet');

-- Include export date in results
SELECT
    *,
    regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1) as export_date
FROM read_parquet(
    's3://<YOUR-BUCKET>/trelica/*/person.parquet',
    filename=true
);
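
Building on the filename examples above, you can also diff two exports. A sketch (the earlier date below is a hypothetical example; adjust to dates that exist in your bucket):

-- People present in the 2025-10-22 export but not in an earlier one
SELECT name, email
FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet')
WHERE person_id NOT IN (
    SELECT person_id
    FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-15/person.parquet')
);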

Common queries

Using the automated setup, query with simple table names:

-- Count records per table
SELECT COUNT(*) FROM person;
SELECT COUNT(*) FROM app_user;

-- Check schema
DESCRIBE SELECT * FROM person;

-- Find people by email domain
SELECT name, email, department
FROM person
WHERE email LIKE '%@example.com';

-- License usage by team
SELECT
    t.name as team,
    COUNT(DISTINCT p.person_id) as person_count,
    COUNT(DISTINCT au.app_user_id) as app_user_count
FROM person p
LEFT JOIN person_team pt ON p.person_id = pt.person_id
LEFT JOIN team t ON pt.team_id = t.team_id
LEFT JOIN app_user au ON p.person_id = au.person_id
GROUP BY t.name
ORDER BY person_count DESC;

Export query results

-- Export to CSV
COPY (
    SELECT * FROM person
) TO 'person_export.csv' (HEADER, DELIMITER ',');

-- Export filtered results
COPY (
    SELECT * FROM person WHERE department = 'Engineering'
) TO 'engineering_team.csv' (HEADER, DELIMITER ',');

-- Export joined data to Parquet
COPY (
    SELECT
        p.name,
        p.email,
        t.name as team
    FROM person p
    JOIN person_team pt ON p.person_id = pt.person_id
    JOIN team t ON pt.team_id = t.team_id
) TO 'person_teams.parquet' (FORMAT PARQUET);

Refreshing data

Tables (init_duckdb_s3.sql)

When using tables, you need to manually refresh to see new data:

# Re-run the init script to reload data from S3
duckdb trelica_dw.db < init_duckdb_s3.sql

This is ideal for BI tools where you want controlled refreshes (e.g., nightly updates).
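
A refresh is just re-materializing the tables, so you can also rebuild a single table from the DuckDB prompt rather than re-running the whole script. A sketch, assuming the script defines person along these lines (path illustrative):

-- Rebuild one table in place from the latest export
CREATE OR REPLACE TABLE person AS
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');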

Views (init_duckdb_s3_views.sql)

When using views, data is always fresh - no refresh needed! Each query reads directly from S3.

Advanced: IAM role assumption

If you are using an IAM role instead of access keys:

-- Set role ARN in DuckDB
SET s3_role_arn='arn:aws:iam::123456789012:role/TrelicaDWRole';

-- Query as normal - DuckDB will assume the role automatically
SELECT * FROM person LIMIT 10;
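
If the s3_role_arn setting is not available in your DuckDB version, the aws extension's credential-chain secret should be able to assume the role configured in your AWS profile instead. A hedged sketch (the secret name is arbitrary, and 'sts' support depends on your DuckDB and aws extension versions):

-- Alternative: resolve credentials via AWS STS role assumption
INSTALL aws;
LOAD aws;
CREATE SECRET trelica_s3_role (
    TYPE s3,
    PROVIDER credential_chain,
    CHAIN 'sts'
);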

Troubleshooting

Access denied

# Verify AWS credentials work
aws s3 ls s3://<YOUR-BUCKET>/trelica/

# If SSO, ensure session is active
aws sso login

-- In DuckDB, check the region is set correctly
SELECT current_setting('s3_region');

-- Update if needed (must match the bucket region)
SET s3_region='us-east-1';

httpfs extension not found

-- Install and load the extension
INSTALL httpfs;
LOAD httpfs;

IO Error: No files found

Check that:

  1. You used an S3 init script (init_duckdb_s3.sql or init_duckdb_s3_views.sql), not the local one (init_duckdb_local.sql)
  2. The export date exists in S3
  3. httpfs is installed and loaded

# Verify the date exists
aws s3 ls s3://<YOUR-BUCKET>/trelica/2025-10-22/

# Verify init scripts are uploaded
aws s3 ls s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3.sql
aws s3 ls s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3_views.sql

Performance tips

1. Filter early - DuckDB pushes filters down to Parquet

-- Good: Filter pushed to Parquet
SELECT * FROM person WHERE email LIKE '%@example.com';

-- Bad: wrapping the column in a function blocks pushdown
SELECT * FROM person WHERE lower(email) LIKE '%@example.com';

2. Select only needed columns - Parquet is columnar

-- Only reads name and email columns from Parquet
SELECT name, email FROM person;

3. Use date partitioning - Read only relevant dates

-- Read only October 2025 exports
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-*/person.parquet');

Integration with other tools

Once tables or views are created, connect any tool that supports DuckDB:

  • Python: duckdb.connect('trelica_dw.db')
  • Tableau: DuckDB JDBC driver
  • Superset: DuckDB database connection
  • Jupyter: %load_ext sql + %sql duckdb:///trelica_dw.db

Example in Python:

import duckdb

con = duckdb.connect('trelica_dw.db')
df = con.execute("""
    SELECT * FROM person
    WHERE email LIKE '%@example.com'
""").df()
print(df)
