Query your Parquet exports directly from S3 without downloading files.
Prerequisites
# Install DuckDB (one-time)
# macOS
brew install duckdb
# or Linux:
# curl -L https://github.com/duckdb/duckdb/releases/latest/download/duckdb_cli-linux-amd64.zip -o duckdb.zip && unzip duckdb.zip
Set up AWS credentials
DuckDB reads AWS credentials from standard locations:
# Option 1: AWS CLI (recommended)
aws sso login
eval $(aws configure export-credentials --format env)
aws sts get-caller-identity
# Option 2: Environment variables
export AWS_ACCESS_KEY_ID="AKIA..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="us-east-1"
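If DuckDB doesn't pick these up automatically, you can also hand them to DuckDB explicitly as a secret. A minimal sketch (the key values are placeholders, as above):
-- Store S3 credentials in a DuckDB secret (in-memory by default)
CREATE SECRET (
    TYPE s3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
);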
Recommended setup – automated initialization
The export generates initialization scripts to help you query the data. Choose based on your use case:
Local files
init_duckdb_local.sql - Creates views that query local Parquet files directly (no data duplication)
S3 files
For BI tools: Use init_duckdb_s3.sql (TABLES)
- Downloads data once from S3 into DuckDB for fast queries
- Best for: Dashboards, reporting, BI tools in containers (e.g. Metabase)
- Performance: Fast (data cached in DuckDB)
- Trade-off: Requires manual refresh to see new data
For ad-hoc queries: Use init_duckdb_s3_views.sql (VIEWS)
- Queries S3 Parquet files directly on each SELECT
- Best for: DuckDB CLI exploration, always-fresh data
- Performance: Slower (reads from S3 on each query)
- Trade-off: Always shows latest data without refresh
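To make the trade-off concrete, the two scripts differ roughly as follows (a sketch, not the generated scripts verbatim; bucket, date, and table names are illustrative):
-- Tables version: copies the data into DuckDB once at init time
CREATE OR REPLACE TABLE person AS
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');
-- Views version: re-reads S3 every time the view is queried
CREATE OR REPLACE VIEW person AS
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');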
Example: BI tool setup (tables)
# Download the init script (tables version)
aws s3 cp s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3.sql .
# Initialize the database
duckdb trelica_dw.db < init_duckdb_s3.sql
Example: ad-hoc queries (views)
# Download the init script (views version)
aws s3 cp s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3_views.sql .
# Initialize the database
duckdb trelica_dw.db < init_duckdb_s3_views.sql
Now you can query all tables using simple names:
# Start DuckDB (the -ui flag opens DuckDB's local UI)
duckdb trelica_dw.db -ui
-- Query tables directly
SELECT * FROM person WHERE email LIKE '%@example.com';
-- Join across tables
SELECT
p.name,
p.email,
t.name as team_name
FROM person p
JOIN person_team pt ON p.person_id = pt.person_id
JOIN team t ON pt.team_id = t.team_id;
Query multiple dates
Track changes over time by querying multiple export dates:
-- Count persons in each export
SELECT
regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1) as export_date,
COUNT(*) as person_count
FROM read_parquet(
's3://<YOUR-BUCKET>/trelica/*/person.parquet',
filename=true
)
GROUP BY export_date
ORDER BY export_date;
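You can also diff two exports directly. A sketch, assuming person_id is the stable key (the dates are illustrative):
-- People present in the newer export but not the older one
SELECT p_new.person_id, p_new.email
FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet') p_new
WHERE NOT EXISTS (
    SELECT 1
    FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-15/person.parquet') p_old
    WHERE p_old.person_id = p_new.person_id
);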
-- Query all data across dates
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/*/person.parquet');
-- Include export date in results
SELECT
*,
regexp_extract(filename, '(\d{4}-\d{2}-\d{2})', 1) as export_date
FROM read_parquet(
's3://<YOUR-BUCKET>/trelica/*/person.parquet',
filename=true
);
Common queries
Using the automated setup, query with simple table names:
-- Count records per table
SELECT COUNT(*) FROM person;
SELECT COUNT(*) FROM app_user;
-- Check schema
DESCRIBE SELECT * FROM person;
-- Find people by email domain
SELECT name, email, department
FROM person
WHERE email LIKE '%@example.com';
-- License usage by team
SELECT
t.name as team,
COUNT(DISTINCT p.person_id) as person_count,
COUNT(DISTINCT au.app_user_id) as app_user_count
FROM person p
LEFT JOIN person_team pt ON p.person_id = pt.person_id
LEFT JOIN team t ON pt.team_id = t.team_id
LEFT JOIN app_user au ON p.person_id = au.person_id
GROUP BY t.name
ORDER BY person_count DESC;
Export query results
-- Export to CSV
COPY (
SELECT * FROM person
) TO 'person_export.csv' (HEADER, DELIMITER ',');
-- Export filtered results
COPY (
SELECT * FROM person WHERE department = 'Engineering'
) TO 'engineering_team.csv' (HEADER, DELIMITER ',');
-- Export joined data to Parquet
COPY (
SELECT
p.name,
p.email,
t.name as team
FROM person p
JOIN person_team pt ON p.person_id = pt.person_id
JOIN team t ON pt.team_id = t.team_id
) TO 'person_teams.parquet' (FORMAT PARQUET);
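To snapshot every table at once rather than writing one COPY per table, DuckDB's EXPORT DATABASE command can be used (the target directory name is illustrative):
-- Write all tables (schema plus data) to a local directory as Parquet
EXPORT DATABASE 'trelica_backup' (FORMAT parquet);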
Refreshing data
Tables (init_duckdb_s3.sql)
When using tables, you need to manually refresh to see new data:
# Re-run the init script to reload data from S3
duckdb trelica_dw.db < init_duckdb_s3.sql
This is ideal for BI tools where you want controlled refreshes (e.g., nightly updates).
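If only one table needs updating, you can re-run the relevant statement by hand instead of replaying the whole script. A sketch (table name and date are illustrative):
-- Reload just the person table from the latest export
CREATE OR REPLACE TABLE person AS
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');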
Views (init_duckdb_s3_views.sql)
When using views, data is always fresh - no refresh needed! Each query reads directly from S3.
Advanced: IAM role assumption
If you use an IAM role instead of long-lived access keys, one approach is to set role_arn in an AWS profile (~/.aws/config) and let DuckDB resolve the role through the standard AWS credential chain:
-- Load the aws extension and create a secret that resolves
-- credentials (including assumed roles) via the AWS credential chain
INSTALL aws;
LOAD aws;
CREATE SECRET (TYPE s3, PROVIDER credential_chain);
-- Query as normal - DuckDB picks up the role credentials automatically
SELECT * FROM person LIMIT 10;
Troubleshooting
Access denied
# Verify AWS credentials work
aws s3 ls s3://<YOUR-BUCKET>/trelica/
# If SSO, ensure session is active
aws sso login
-- In DuckDB, check region is set correctly
SELECT current_setting('s3_region');
-- Update if needed (must match bucket region)
SET s3_region='us-east-1';
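To review every S3-related setting at once, you can query DuckDB's built-in duckdb_settings() view:
-- List all S3 settings and their current values
SELECT name, value FROM duckdb_settings() WHERE name LIKE 's3%';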
httpfs extension not found
-- Install and load the extension
INSTALL httpfs;
LOAD httpfs;
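You can confirm the extension's state with the built-in duckdb_extensions() function:
-- Should report installed = true and loaded = true for httpfs
SELECT extension_name, installed, loaded
FROM duckdb_extensions()
WHERE extension_name = 'httpfs';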
IO Error: No files found
Check that:
- You used an S3 init script (init_duckdb_s3.sql or init_duckdb_s3_views.sql), not the local one (init_duckdb_local.sql)
- The export date exists in S3
- httpfs is installed and loaded
# Verify the date exists
aws s3 ls s3://<YOUR-BUCKET>/trelica/2025-10-22/
# Verify init scripts are uploaded
aws s3 ls s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3.sql
aws s3 ls s3://<YOUR-BUCKET>/trelica/2025-10-22/init_duckdb_s3_views.sql
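As a further sanity check, bypass the init scripts entirely and point read_parquet at one file's full path (same bucket and date as above):
-- If this works, S3 access is fine and the problem is in the init script paths
SELECT COUNT(*) FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-22/person.parquet');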
Performance tips
1. Filter early - DuckDB pushes filters down to Parquet
-- Good: Filter pushed to Parquet
SELECT * FROM person WHERE email LIKE '%@example.com';
-- Bad: Reads entire file first
SELECT * FROM (SELECT * FROM person) WHERE email LIKE '%@example.com';
2. Select only needed columns - Parquet is columnar
-- Only reads name and email columns from Parquet
SELECT name, email FROM person;
3. Use date partitioning - Read only relevant dates
-- Read only October 2025 exports
SELECT * FROM read_parquet('s3://<YOUR-BUCKET>/trelica/2025-10-*/person.parquet');
Integration with other tools
Once tables or views are created, connect any tool that supports DuckDB:
- Python: duckdb.connect('trelica_dw.db')
- Tableau: DuckDB JDBC driver
- Superset: DuckDB database connection
- Jupyter: %load_ext sql followed by %sql duckdb:///trelica_dw.db
Example Python:
import duckdb

# Open the database created by the init script
con = duckdb.connect('trelica_dw.db')

# Run SQL and fetch the result as a pandas DataFrame
df = con.execute("""
    SELECT * FROM person
    WHERE email LIKE '%@example.com'
""").df()
print(df)