Your favourite Professor has just got back from his holidays in India. While, evidently, a genius in the fields of data, cloud warehousing, AI and much more… he unfortunately lacks mastery of Hindi. In order to get by, of course, he had a Cortex app translating his every erudite observation for all to marvel at.
Clearly we wouldn’t expect you to go as fas as our eminent professor, but let us give you a challenge that translates Hindi into English.
To begin with, you have this function:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def scrape_text_from_url(url):
try:
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0 Safari/537.36'
}
response = requests.get(url, headers=headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')
title_tag = soup.title.get_text().strip().upper() if soup.title else ""
main_content = soup.find('body')
if not main_content:
print("Main content not found.")
return None
elements = main_content.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p'])
start_collecting = False
filtered_elements = []
for element in elements:
if element.name == 'h1' and element.get_text(strip=True).upper() == 'प्राचीन भारत में किलों का इतिहास':
start_collecting = True
if start_collecting:
filtered_elements.append(element)
rows = []
if title_tag:
rows.append(["TITLE", title_tag])
for element in filtered_elements:
tag_name = element.name.upper()
text_content = element.get_text().strip()
rows.append([tag_name, text_content])
df = pd.DataFrame(rows, columns=["TAG", "CONTENT"])
return df
except requests.exceptions.RequestException as e:
print(f"Error fetching the URL: {e}")
return None
url = 'https://indianculture.gov.in/hi/node/2730054'
result_df = scrape_text_from_url(url)
Your task is to translate in Snowpark using Cortex and produce the following:
Good luck!!
Leave a Reply
You must be logged in to post a comment.