Week 123 – Intermediate

Your favourite Professor has just got back from his holidays in India. While, evidently, a genius in the fields of data, cloud warehousing, AI and much more… he unfortunately lacks mastery of Hindi. In order to get by, of course, he had a Cortex app translating his every erudite observation for all to marvel at.

Clearly we wouldn’t expect you to go as fas as our eminent professor, but let us give you a challenge that translates Hindi into English.

To begin with, you have this function:

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_text_from_url(url):
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0 Safari/537.36'
        }
        response = requests.get(url, headers=headers)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        title_tag = soup.title.get_text().strip().upper() if soup.title else ""

        main_content = soup.find('body')
        if not main_content:
            print("Main content not found.")
            return None

        elements = main_content.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'p'])

        start_collecting = False
        filtered_elements = []

        for element in elements:
            if element.name == 'h1' and element.get_text(strip=True).upper() == 'प्राचीन भारत में किलों का इतिहास':
                start_collecting = True
            if start_collecting:
                filtered_elements.append(element)

        rows = []

        if title_tag:
            rows.append(["TITLE", title_tag])

        for element in filtered_elements:
            tag_name = element.name.upper()
            text_content = element.get_text().strip()
            rows.append([tag_name, text_content])

        df = pd.DataFrame(rows, columns=["TAG", "CONTENT"])

        return df
    except requests.exceptions.RequestException as e:
        print(f"Error fetching the URL: {e}")
        return None


url = 'https://indianculture.gov.in/hi/node/2730054'

result_df = scrape_text_from_url(url)

Your task is to translate in Snowpark using Cortex and produce the following:

Good luck!!

Reader Interactions

Leave a Reply Cancel reply