Summit was nearly two months ago, and we’re still not through creating challenges on features put into GA and PuPr. This one is a pretty cool one, this week, we’re using Cortex Fine Tuning.
Before we begin, as of time of publishing, this feature is only available in:
- AWS US West 2 (Oregon)
- AWS US East 1 (N. Virginia)
- AWS Europe Central 1 (Frankfurt)
If you don’t have an account in any of those regions, get a trial account here (I’m sure one day the dream of a free tier will come).
THE CHALLENGE
You have two datasets:
You need an LLM to help you classify which line items belong to which account code. This is the prompt you are to use:
PROMP
“You are an agent that helps categorise invoices for our bookkeeping team.
They need to categories each line item per account code and account name,
the following account codes, and their account names:
8003 (Legal Fees),
7003 (Consulting Fees),
8001 (Cloud Subscriptions),
8011 (General IT),
7005 (Customs and Duties),
7002 (Travel Expenses),
7001 (Distribution Costs),
7004 (Marketing Expenses),
8007 (Facility Rentals),
8010 (Events & Entertainment),
5001 (Raw Materials),
6002 (Repairs and Maintenance),
8009 (Environmental Compliance),
5004 (Research and Development),
6005 (Training and Development),
5007 (Inventory Shrinkage),
5003 (Equipment Purchases),
5006 (Packaging Materials),
6006 (Waste Disposal),
8008 (Financial Services),
6001 (Employee Benefits),
6007 (Telecommunications),
8002 (Insurance),
8005 (Licensing Fees),
6003 (Utilities),
5002 (IoT Materials),
6004 (Office Supplies),
8004 (IT Services),
8006 (Security Services),
5005 (Quality Control)
Try doing it for this invoice line item.
The line item is: `<>` and the vendor is: `<>`.”
Load the PROCESSED DATA into an table and use the following query to see what COMPLETE with mistral-7b can do:
CODE
with data_prep as (
select
-- Turn invoice number into just a number
replace(invoice_number, 'INV-')::int as invoice_number,
-- Extract line item valies
value:"AccountCode"::string as account_code,
value:"AccountName"::string as account_name,
value:"Item"::string as item,
vendor_name
from accpay,
-- flatten line items
lateral flatten(input=>parse_json(line_items))
)
select
-- Create prompt
concat(
'You are an agent that helps categorise invoices for our bookkeeping team.
They need to categories each line item per account code and account name,
the following account codes, and their account names:\n\n' ||
(
select array_to_string(array_distinct(array_agg(
account_code || ' (' || account_name || ')')), ',\n'
) from data_prep
)
|| '\n\nTry doing it for this invoice line item.'
|| '\n\nThe line item is: `' || item
|| '` and the vendor is: `'|| vendor_name || '`.'
) as prompt,
-- Assert the real account_code as completion
account_code as completion,
-- Test the model
snowflake.cortex.complete(
'mistral-7b',
prompt
) as llm_response,
-- Get the account code from the response
regexp_substr(llm_response, '\\d{4}', 1, 1, 'e') as llm_account_code,
-- Check whether the model got the answer right
completion = llm_account_code as verification
from data_prep
Then:
- Create training and validation datasets with the PROCESSED DATA
- Create a fine tuning job on those datasets
- Load in the UNPROCESSED DATA
- Query the unprocessed data using your fine tuned model, which should return a result that looks a little like:
Leave a Reply
You must be logged in to post a comment.