
Project Overview

Developed a streamlined data extraction module that converts the HTML tables on Wikipedia pages into structured CSV files for analysis. The tool reduces the manual data collection work for researchers and analysts.

Key Features

  • Automated Extraction: Scrapes tabular data directly from user-provided Wikipedia URLs.
  • Polite Scraping: Sends a descriptive custom User-Agent header with each request, in line with Wikimedia's User-Agent policy, so that data retrieval stays identifiable and respectful.
  • Instant Conversion: Automatically transforms HTML tables into cleaned pandas DataFrames for immediate CSV export.
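
The features above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the names fetch_html, extract_tables, to_csv_text, and the USER_AGENT string are hypothetical, and the sketch assumes the target tables carry the "wikitable" CSS class and contain no merged (rowspan/colspan) cells.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Hypothetical identifier: replace with your own project name and contact details.
USER_AGENT = "WikiTableExtractor/0.1 (https://example.org/contact; contact@example.org)"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page, sending a descriptive User-Agent header."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_tables(html: str) -> list[pd.DataFrame]:
    """Convert each <table class="wikitable"> in the HTML into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    frames = []
    for table in soup.find_all("table", class_="wikitable"):
        rows = []
        for tr in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
            if cells:
                rows.append(cells)
        if len(rows) < 2:
            continue  # header only, or empty table: nothing to export
        header, *body = rows
        # Simplification: drop rows whose width differs from the header,
        # which is what merged cells would produce.
        body = [row for row in body if len(row) == len(header)]
        frames.append(pd.DataFrame(body, columns=header))
    return frames


def to_csv_text(frame: pd.DataFrame) -> str:
    """Serialize a DataFrame to CSV text without the index column."""
    return frame.to_csv(index=False)
```

A typical call chain would be extract_tables(fetch_html(url)) followed by to_csv_text on the table of interest. Parsing is kept separate from fetching so the conversion step can be exercised on a saved HTML string without making a network request.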

Impact & Results

  • Live Deployment: Deployed as a publicly accessible Streamlit web application.
  • Versatility: Serves as a foundational module for larger data analysis pipelines and research projects.

Tech Stack

Python | BeautifulSoup4 | Pandas | Streamlit | Requests