Web scraping with Python conference teaching guide

Thank you for volunteering to teach this one-hour session at our conference! This teaching guide explains our setup and the material we try to cover.

The exercises live in the "Web scraping with Python" Jupyter notebook. The other notebook has a refresher on some basic Python syntax for folks who are new to Python or could use a reference.

Take it for a spin

At the conference, this repository will already be on the classroom computers, with the virtual environment created and the dependencies (jupyterlab, requests, bs4) installed and tested.

If you're in a "BYO laptop" room, you might check out Google Colab or a similar cloud environment that can load git repos.

Run locally

You'll need uv installed. (Or use your own dependency management software.)

  1. Clone or download/unzip this repo onto your computer
  2. cd into the folder
  3. uv sync
  4. uv run jupyter lab

Session description

This session will show you how to use the Python programming language to scrape data from simple websites.

This session is good for: People with some experience working with data. Experience with Python and/or HTML is a plus but not necessary.

Session goals

Some of the ground you'll want to cover:

  • How to write and run Python code in a Jupyter notebook
  • Browser tools for inspecting the source code of a web page
  • How to use the requests library to fetch the HTML code for a web page
  • How to use the beautifulsoup4 library to parse the HTML
  • Using beautifulsoup4's find() and find_all() methods to target and extract information
  • Writing the results of a scrape to a CSV (if time)
  • Where to find instructions for installing Python on their own machines (or tell them about JupyterLab Desktop, or direct them to your install guide of choice)
  • How to find help when they get stuck
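
The whole pipeline these goals describe -- fetch, parse, target, write to CSV -- can be sketched in a few lines. The HTML below is an inline stand-in for a page you'd fetch with `requests.get(url).text`; the table layout and column names are hypothetical, not from the class notebook:

```python
import csv

from bs4 import BeautifulSoup

# Stand-in for HTML you'd fetch live with requests.get(url).text
html = """
<table>
  <tr><th>Company</th><th>Date</th></tr>
  <tr><td>Acme Corp</td><td>2024-01-05</td></tr>
  <tr><td>Globex</td><td>2024-02-17</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Target the table, then pull the text out of each row's <td> cells
rows = []
for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has <th> cells, so its list is empty -- skip it
        rows.append(cells)

# Write the results to a CSV file
with open("results.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["company", "date"])
    writer.writerows(rows)
```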

General approach

I Do, We Do, You Do. Demonstrate a concept, work through it together, then give them plenty of time to experiment on their own while you and your coach walk around and answer questions (see sections marked ✍️ Try it yourself).

The pace will be slower than you think, and that's OK! It's not the end of the world if you don't get through everything. Many people who come to this class will have zero experience with programming.

Class setup

We'll have the latest version of Python 3 installed. We're using uv to manage the virtual environment and project dependencies (jupyterlab, bs4 and requests), which will already have been installed and tested prior to your session.

Class outline

Start up the notebook server

Begin the class by walking everyone through the process of activating their virtual environments and launching JupyterLab. Or, if you prefer a different tool, such as the Jupyter extension for VS Code, get them set up with that instead.

  1. Open the command-line interface
  2. cd into your class directory (or, on a Mac, you could have them right-click the class folder and select Services > New Terminal at Folder)
  3. uv run jupyter lab

It will take everyone a few minutes to get going. You'll also probably get some questions about what you're doing at this step. Try to avoid a lengthy digression into virtual environments -- it's beyond the scope of this hourlong session, so maybe offer to talk to them after class, or send 'em our way: training@ire.org.

Once everyone is good to go, toggle back to the terminal and show them what's going on: A Jupyter server is running in the background, so don't close the terminal window.

Go over some notebook basics: Adding cells, writing code and running cells, etc. A common gotcha: Writing code that later cells depend on but forgetting to run that cell first, so the names it defines don't exist yet.
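
A two-cell toy example (the variable name is arbitrary) makes the gotcha concrete:

```python
# Cell 1 -- defines a variable that a later cell depends on
url = "https://example.com"

# Cell 2 -- raises NameError if Cell 1 was never run,
# because `url` doesn't exist in the notebook's namespace yet
print(url)
```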

Main course content

Start working your way through the notebook: Practice inspecting a web page, fetch a web page, parse the HTML, target and extract the data, write to CSV (if time). Pause frequently to ask if anyone has questions. There's a bunch of text at the beginning of the notebook that's mostly for them to read and reference, not necessarily a list of things to cover.

Any time you see ✍️ Try it yourself, hit the brakes and give everyone a little time to play around with whatever concept you're discussing.

In our experience, you'll want to budget more time than you'd think for showing how to parse data out of the BeautifulSoup object.
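
A small example you could riff on live, showing the difference between find() (first match or None) and find_all() (a list), plus text and attribute extraction -- the HTML snippet and class name here are made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="listing">
  <h2>First notice</h2>
  <a href="/notices/1">Details</a>
</div>
<div class="listing">
  <h2>Second notice</h2>
  <a href="/notices/2">Details</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag (or None if nothing matches)
first = soup.find("div", class_="listing")
print(first.h2.get_text())  # the text inside the first <h2>

# find_all() returns a list of every match
for div in soup.find_all("div", class_="listing"):
    print(div.a["href"])  # the href attribute of each listing's link
```

Note the trailing underscore in `class_` -- plain `class` is a reserved word in Python, a detail that trips up nearly every class.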

If you have Internet problems, you can pivot and work on the HTML file saved in this directory, sd-warn.html -- there's a cell with some commented-out code that folks can run to read in the HTML.
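
Reading a saved file into BeautifulSoup works the same as parsing a fetched page; the notebook's commented-out cell may differ in detail. This sketch writes a tiny sample file first so it's self-contained -- in class you'd open the bundled sd-warn.html directly:

```python
from pathlib import Path

from bs4 import BeautifulSoup

# Create a tiny stand-in file for illustration; in class,
# you'd open the sd-warn.html saved in this directory instead
Path("sample.html").write_text("<h1>WARN notices</h1>", encoding="utf-8")

# Read the saved HTML from disk instead of fetching over the network
html = Path("sample.html").read_text(encoding="utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.h1.get_text())
```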

Debugging

If you can, find an opportunity when someone has gotten an error and take a few minutes to walk through basic debugging strategy: Reading the traceback error from bottom to top, strategic Googling, etc.

If you have extra time at the end

You can set them on the extra credit problems at the end of the notebook or oversee some unstructured lab time -- they can practice scraping other web pages or look up additional methods for navigating the souped HTML, etc.

Ending the session

  1. Have everyone close out of their notebook tabs
  2. In terminal, Ctrl+C to kill the server process
  3. Close the terminal window

About

Teaching guide for a one-hour hands-on session at an IRE/NICAR conference on scraping web data using Python.
