In real-world contracts, headers often carry subtle variations: numbering, punctuation, or word orderbut they refer to the same section. Your mission: unify these headers across multiple companies using AI-powered similarity and canonicalization, so downstream comparisons all speak the same language.
What you’ll do this week
- Write a Snowflake SQL pipeline that discovers “almost duplicates” among section headers using AI_SIMILARITY, then clusters them into stable groups.
- For each group, use AI_COMPLETE to choose one crisp canonical header (3–5 words, no numbering or punctuation).
- Map every original header in your dataset to its normalized label.
- Run diagnostics: show how many unique raw headers existed vs how many normalized header groups you produced, and present a few examples where messy variants collapse cleanly.
- All of this should run hands-off—no external tools, just Snowpark-native AI in Snowflake.
start-up code
CREATE OR REPLACE SCHEMA FROSTY_DB.week_148;
CREATE OR REPLACE TABLE week_148.HEADERS_SAMPLE (
COMPANY STRING,
DOC_ID STRING,
HEADER_ORDER INTEGER,
HEADER_TEXT STRING
);
INSERT INTO week_148.HEADERS_SAMPLE (COMPANY, DOC_ID, HEADER_ORDER, HEADER_TEXT) VALUES
-- Company A
('AIG', 'A1', 1, '1. Coverage Overview'),
('AIG', 'A1', 2, '1.1 Policy Benefits'),
('AIG', 'A1', 3, '2. Exclusions & Limitations'),
('AIG', 'A1', 4, '3. Claims Process'),
-- Company B (variants)
('Zurich', 'Z1', 1, 'Coverage - Overview'),
('Zurich', 'Z1', 2, 'Policy Benefit(s)'),
('Zurich', 'Z1', 3, 'Exclusions and Limitations'),
('Zurich', 'Z1', 4, 'Claims – Process'),
-- Company C (more variants)
('Allianz', 'AL1', 1, '01) COVERAGE OVERVIEW'),
('Allianz', 'AL1', 2, 'Policy: Benefits'),
('Allianz', 'AL1', 3, 'Exclusions / Limitations'),
('Allianz', 'AL1', 4, 'Claims Processing'),
-- Company D (edge cases)
('AXA', 'AX1', 1, 'Coverage overview & scope'),
('AXA', 'AX1', 2, 'Benefit of Policy'),
('AXA', 'AX1', 3, 'Limitations and Exclusions'),
('AXA', 'AX1', 4, 'Filing a Claim');
SELECT * FROM week_148.HEADERS_SAMPLE ORDER BY COMPANY, DOC_ID, HEADER_ORDER;
Your objective :






