Analytics in Snowflake

Query your Parquet exports from S3 in Snowflake using a storage integration and an external stage, with tables loaded from the staged Parquet files.

Note: This feature is currently in closed beta.

Prerequisites

  1. Snowflake account with ACCOUNTADMIN role (for initial setup)
  2. AWS IAM role configured for Snowflake access
  3. Database and schema created in Snowflake (see the sketch below)
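
If you have not yet created the database and schema, a minimal sketch (assuming the data_warehouse and public names used in the examples later in this article) is:

CREATE DATABASE IF NOT EXISTS data_warehouse;
CREATE SCHEMA IF NOT EXISTS data_warehouse.public;
USE SCHEMA data_warehouse.public;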

Quick start

Each export automatically generates an init_snowflake_s3.sql script containing all of the configuration you need.

Step 1: Download the initialization script

# Download from S3
aws s3 cp s3://<YOUR-BUCKET>/trelica/2025-10-22/init_snowflake_s3.sql .

Step 2: Create the storage integration

IMPORTANT: You must configure the AWS IAM trust policy (step 3) before creating the stage and tables, otherwise you'll get an error like: User: arn:aws:iam::829545265689:user/externalstages/... is not authorized to perform: sts:AssumeRole

First, create only the storage integration in Snowflake (requires ACCOUNTADMIN role):

-- Copy ONLY the storage integration part from init_snowflake_s3.sql
CREATE STORAGE INTEGRATION IF NOT EXISTS dw_s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::908027406242:role/TrelicaDWRole'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<YOUR-BUCKET>/trelica/');
Don't run the rest of the script yet - you need to configure AWS first.

Step 3: Configure AWS IAM trust policy

Now get Snowflake's AWS identity so you can configure the IAM trust policy.

3a. Get Snowflake's AWS identity

In Snowflake, run:

DESC STORAGE INTEGRATION dw_s3_integration;

Look for these two properties in the output:

  • STORAGE_AWS_IAM_USER_ARN - The IAM user ARN that Snowflake will use (e.g., arn:aws:iam::829545265689:user/externalstages/cic8na0000)
  • STORAGE_AWS_EXTERNAL_ID - The external ID for additional security (e.g., ABC12345_SFCRole=1_abcdefg)

Copy these values - you'll need them in the next step.
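
If you prefer to pull just these two values with a query, one option is to run RESULT_SCAN over the DESC output (a sketch; the quoted lowercase column names come from the DESC result):

DESC STORAGE INTEGRATION dw_s3_integration;

SELECT "property", "property_value"
FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
WHERE "property" IN ('STORAGE_AWS_IAM_USER_ARN', 'STORAGE_AWS_EXTERNAL_ID');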

3b. Update the IAM role trust policy in AWS

You need to update the trust policy of the IAM role specified in your storage integration (e.g., TrelicaDWRole).

Option 1: AWS Console

  1. Navigate to AWS IAM Console > Roles.
  2. Search for your role, for example TrelicaDWRole.
  3. Select the role name.
  4. Open the Trust relationships tab.
  5. Select Edit trust policy.
  6. Update the policy so that it includes an additional Statement for Snowflake's IAM user, keeping any existing statements in place:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::829545265689:user/externalstages/cic8na0000"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "ABC12345_SFCRole=1_abcdefg"
        }
      }
    }
  ]
}

Replace the AWS value with your STORAGE_AWS_IAM_USER_ARN and the sts:ExternalId value with your STORAGE_AWS_EXTERNAL_ID from step 3a.

Select Update policy.

Option 2: AWS CLI

# First, get your current trust policy
aws iam get-role --role-name TrelicaDWRole --query 'Role.AssumeRolePolicyDocument' > trust-policy.json

# Edit trust-policy.json to add the Snowflake user (see JSON above)

# Update the trust policy
aws iam update-assume-role-policy --role-name TrelicaDWRole --policy-document file://trust-policy.json

3c. Verify the trust policy is configured

AWS trust policy changes take effect immediately, so there is nothing to wait for; access is verified in the next step, when Snowflake creates the stage and reads the staged files. Proceed to step 4.

Step 4: Complete the Snowflake setup

Now that the IAM trust policy is configured, run the rest of the init_snowflake_s3.sql script in Snowflake:

-- Copy and paste the rest of init_snowflake_s3.sql
-- (file format, stage, external tables, and tables)

The script will:

  1. Create a Parquet file format
  2. Create a stage pointing to your S3 location (this is where the IAM trust relationship is exercised)
  3. Create materialized tables by loading data from the Parquet files, so BI tools get fast queries (see the sketch below)
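
For reference, the generated statements follow the general shape below. This is only a sketch - your init_snowflake_s3.sql is authoritative and contains the full column lists:

-- Parquet file format
CREATE FILE FORMAT IF NOT EXISTS parquet_format
  TYPE = PARQUET;

-- Stage pointing at the export location
CREATE STAGE IF NOT EXISTS dw_stage
  URL = 's3://<YOUR-BUCKET>/trelica/2025-10-22/'
  STORAGE_INTEGRATION = dw_s3_integration
  FILE_FORMAT = parquet_format;

-- One table per exported Parquet file, e.g. person
CREATE OR REPLACE TABLE person AS
SELECT
  $1:person_id::VARCHAR as person_id,
  $1:email::VARCHAR as email,
  -- ... other columns from the generated script ...
FROM @dw_stage/person.parquet;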

If you get authorization errors when creating tables:

  • Double-check the IAM user ARN matches exactly in the trust policy
  • Verify the External ID is correct. External IDs are case-sensitive.
  • Make sure the IAM role has S3 read permissions (see troubleshooting section)

You can verify the stage has access and tables were created:

-- List files in stage
LIST @dw_stage;

-- Show created tables
SHOW TABLES;

Step 5: Query your data

-- Query tables normally (fast - data is materialized in Snowflake)
SELECT * FROM person LIMIT 10;

-- Join across tables
SELECT
    p.name,
    p.email,
    t.name as team_name
FROM person p
JOIN person_team pt ON p.person_id = pt.person_id
JOIN team t ON pt.team_id = t.team_id;

Common queries

Schema inspection

-- Show tables
SHOW TABLES;

-- Describe table structure
DESCRIBE TABLE person;

-- Show views
SHOW VIEWS;

Data queries

-- Count records
SELECT COUNT(*) FROM person;
SELECT COUNT(*) FROM app_account;

-- Find people by email domain
SELECT name, email, department
FROM person
WHERE email LIKE '%@example.com';

-- License usage by team
SELECT
    t.name as team,
    COUNT(DISTINCT p.person_id) as person_count,
    COUNT(DISTINCT au.app_account_id) as app_account_count
FROM person p
LEFT JOIN person_team pt ON p.person_id = pt.person_id
LEFT JOIN team t ON pt.team_id = t.team_id
LEFT JOIN app_account au ON p.person_id = au.person_id
GROUP BY t.name
ORDER BY person_count DESC;

Working with multiple export dates

Query across multiple export dates by updating the stage URL:

-- Create stage for different date
CREATE STAGE dw_stage_oct21
  URL = 's3://<YOUR-BUCKET>/trelica/2025-10-21/'
  STORAGE_INTEGRATION = dw_s3_integration
  FILE_FORMAT = parquet_format;

-- Create view for historical data
CREATE VIEW person_oct21 AS
SELECT
  $1:person_id::VARCHAR as person_id,
  $1:email::VARCHAR as email,
  -- ... other columns ...
FROM @dw_stage_oct21/person.parquet;

-- Compare counts across dates
SELECT
    'Oct 21' as export_date,
    COUNT(*) as person_count
FROM person_oct21
UNION ALL
SELECT
    'Oct 22' as export_date,
    COUNT(*) as person_count
FROM person;

Data refresh

When new exports are uploaded to S3, refresh the tables to load the latest data:

Option 1: Re-run the entire script

# Download the latest script from new export date
aws s3 cp s3://<YOUR-BUCKET>/trelica/2025-10-23/init_snowflake_s3.sql .

# Run in Snowflake (only the table creation parts)
# This recreates all tables with fresh data

Option 2: Update stage and refresh individual tables

-- Update stage to point to new date
CREATE OR REPLACE STAGE dw_stage
  URL = 's3://<YOUR-BUCKET>/trelica/2025-10-23/'  -- New date
  STORAGE_INTEGRATION = dw_s3_integration
  FILE_FORMAT = parquet_format;

-- Refresh specific table by recreating it
CREATE OR REPLACE TABLE person AS
SELECT
  $1:person_id::VARCHAR as person_id,
  $1:email::VARCHAR as email,
  $1:name::VARCHAR as name,
  -- ... all other columns ...
FROM @dw_stage/person.parquet;

-- Check updated data
SELECT COUNT(*) FROM person;

Tip: Save the CREATE TABLE statements from the init script to make refreshes easy.

Performance optimization

Partitioning

For large datasets across multiple dates, use multiple stages or wildcard patterns:

-- Create a stage at the parent prefix so it covers every export date
CREATE STAGE IF NOT EXISTS dw_stage_all
  URL = 's3://<YOUR-BUCKET>/trelica/'
  STORAGE_INTEGRATION = dw_s3_integration
  FILE_FORMAT = parquet_format;

-- Create a view that spans multiple dates
CREATE VIEW person_all_dates AS
SELECT
  $1:person_id::VARCHAR as person_id,
  $1:email::VARCHAR as email,
  -- ... other columns ...
  REGEXP_SUBSTR(METADATA$FILENAME, '\\d{4}-\\d{2}-\\d{2}') as export_date
FROM @dw_stage_all (FILE_FORMAT => 'parquet_format', PATTERN => '.*person[.]parquet');

-- Query with date filter
SELECT * FROM person_all_dates
WHERE export_date = '2025-10-22';

Clustering

For frequently filtered columns, add clustering keys. If you created views over the stage rather than tables, materialize the data first:

-- Materialize view data
CREATE TABLE person_materialized AS
SELECT * FROM person;

-- Add clustering
ALTER TABLE person_materialized
CLUSTER BY (email, department);

Cost management

Tables are loaded once from S3, then stored in Snowflake:

  • Per-load S3 costs: Data transfer each time tables are created or refreshed
  • Ongoing Snowflake costs: Storage and compute for queries

Cost optimization tips:

  1. Use clustering for frequently filtered columns (email, department, etc.)
  2. Refresh only when needed - Don't recreate tables unnecessarily
  3. Select specific columns - Only query columns you need
  4. Drop unused tables - Remove old tables if no longer needed
For example:

-- Good: filters on a specific column and returns only what you need
SELECT email FROM person WHERE department = 'Engineering';

-- Bad: selects every column and a very large number of rows
SELECT * FROM person LIMIT 1000000;
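
For tip 4, dropping artifacts from older export dates (for example the date-suffixed stage and view created earlier in this article) keeps storage costs down:

-- Drop historical objects you no longer query
DROP VIEW IF EXISTS person_oct21;
DROP STAGE IF EXISTS dw_stage_oct21;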

Troubleshooting

Access denied errors

Error: "User: arn:aws:iam::829545265689:user/externalstages/... is not authorized to perform: sts:AssumeRole"

This error occurs when creating tables or listing the stage if the IAM role trust policy hasn't been configured.

When does this happen?

  • When running CREATE TABLE commands (which read from the stage)
  • When running LIST @dw_stage
  • When querying stage files

Solution: Configure the IAM role trust policy (Step 3 in Quick Start above)

Quick fix:

  1. In Snowflake: DESC STORAGE INTEGRATION dw_s3_integration;
  2. Copy the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values
  3. Update your IAM role's trust policy in AWS Console (IAM → Roles → TrelicaDWRole → Trust relationships)
  4. Add the Snowflake user ARN and external ID to the trust policy
  5. Retry creating the tables

Other access errors

-- Check storage integration is configured
SHOW INTEGRATIONS LIKE 'dw_s3_integration';

-- Verify stage can access S3
LIST @dw_stage;

If LIST fails after trust policy is configured:

  1. Verify IAM role has S3 read permissions on the bucket
  2. Check the External ID matches exactly (case-sensitive)
  3. Confirm the IAM user ARN matches exactly
  4. Ensure the S3 bucket exists and the path is correct

No data in tables

-- Check if files exist in stage
LIST @dw_stage;

-- Verify table exists
SHOW TABLES LIKE 'person';

-- Check table row count
SELECT COUNT(*) FROM person;

-- Verify stage is accessible
SELECT * FROM @dw_stage/person.parquet LIMIT 1;

If tables are empty, recreate them using the CREATE TABLE statements from the init script.

Slow query performance

Tables are already materialized, so queries should be fast. If they are still slow:

-- Check query execution plan
EXPLAIN SELECT * FROM person WHERE email LIKE '%@example.com';

-- Add clustering keys for frequently filtered columns
ALTER TABLE person
CLUSTER BY (email, department);

-- Verify clustering
SELECT SYSTEM$CLUSTERING_INFORMATION('person', '(email, department)');

Security best practices

  1. Use storage integrations instead of access keys
  2. Limit STORAGE_ALLOWED_LOCATIONS to specific prefixes
  3. Grant minimal permissions on tables and views
  4. Use row-level security for multi-tenant access
  5. Enable query auditing for compliance (see the sketch after the example below)
For example:

-- Grant read access to an analyst role
GRANT USAGE ON DATABASE data_warehouse TO ROLE analyst;
GRANT USAGE ON SCHEMA data_warehouse.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA data_warehouse.public TO ROLE analyst;

-- Row-level security example
CREATE ROW ACCESS POLICY person_department_policy
  AS (department VARCHAR) RETURNS BOOLEAN ->
    CURRENT_ROLE() = 'ACCOUNTADMIN'
    OR department = CURRENT_USER();

ALTER TABLE person ADD ROW ACCESS POLICY person_department_policy ON (department);
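
For item 5 (query auditing), a minimal sketch, assuming your role has access to the SNOWFLAKE.ACCOUNT_USAGE share, could look like:

-- Review recent queries that touched the person table
SELECT query_text, user_name, start_time
FROM snowflake.account_usage.query_history
WHERE query_text ILIKE '%person%'
  AND start_time > DATEADD('day', -7, CURRENT_TIMESTAMP())
ORDER BY start_time DESC;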
