Tracking Positive COVID-19 Cases in My School District
By Tomi Chen, September 5, 2021
Last month, school started in person. I go to Castro Valley High School in Castro Valley, an unincorporated area in Alameda County, California. While there is a mask requirement, students are still crowded into classrooms with no social distancing.
With the pandemic still happening, our school district has published a Positive Case Dashboard that lists the number of students and staff on campus who have tested positive for COVID-19. However, it only lists totals for each month, without information about the number of new cases each day. I decided to use Simon Willison’s git scraping technique to store historical data about positive cases and track changes over time.
The repository for this project is publicly available on GitHub. Check it out if you want to adapt it for yourself, or take a look at how cases have changed over time.
We’ll also be logging the data to a Google Sheet, which allows us to do some calculations and visualizations. Look at my hawt graphs (they’re live-updating, too)!
Fetching the data
To get the data from the district’s spreadsheet, convert it to CSV for a cleaner format, and save it back to the repository, I decided to use a Python script with the requests and pandas libraries. While you could probably do this with bash scripting, I was more familiar with Python.
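To give a sense of the approach, here’s a minimal sketch of what the fetch step might look like. The export URL and file paths are placeholders, not the district’s real ones:

import requests
import pandas as pd
from io import StringIO

# placeholder: a published Google Sheet can be exported as CSV from a URL like this
SHEET_URL = 'https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv'

resp = requests.get(SHEET_URL)
resp.raise_for_status()

# parse with pandas for a cleaner format, then save back into the repository
df = pd.read_csv(StringIO(resp.text))
df.to_csv('data/cases.csv', index=False)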
I’m using GitHub Actions to run this script on a cron schedule and re-commit the data back to the repository.
name: Scrape data
on:
  workflow_dispatch:
  schedule:
    # run every 4 hours from 5:06 AM PDT (4 AM PST) to 9:06 PM PDT (8 PM PST) Mon-Fri
    - cron: '6 12,16,20,0,4 * * 1-5'
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Fetch latest data
        run: python main.py
        env:
          LOGGER_DEPLOYMENT_ID: ${{ secrets.LOGGER_DEPLOYMENT_ID }}
          LOGGER_SECRET: ${{ secrets.LOGGER_SECRET }}
      - name: Commit data
        run: |-
          if [[ `git status --porcelain` ]]; then
            git config user.name github-actions
            git config user.email github-actions@github.com
            git add -A
            git commit -m "add latest data"
            git push
          fi
Visualizing the data
After I got the git scraping thing to work, I decided I wanted to graph the data in a Google Sheet. Since I’m scared of Google’s API authentication system, I used a Google Apps Script deployed as a web app so I could avoid authenticating to Google in the Python script. If you haven’t tried it, Apps Script is a pretty powerful tool for interacting with and connecting across Google services. It’s just JavaScript and the documentation is decent, so I’d recommend trying it out!
To create an Apps Script attached to a Sheet, go to Tools > Script Editor, which should open up an Apps Script editor. Besides deploying as a web app, you can also write custom spreadsheet functions!
My Apps Script is below:
const SECRET_KEY = 'lol no'

function doGet(e) {
  // these are http query parameters
  // sig - hmac sha256 of "time|date|studentCases|staffCases"
  // time - unix time in seconds
  const { sig, time } = e.parameter
  const { date, studentCases, staffCases } = e.parameter

  const currentTime = Math.round(Date.now() / 1000) // time in seconds
  if (
    Math.abs(currentTime - parseInt(time)) > 30 || // invalid if timestamp isn't within 30 seconds of now
    !verifySig(`${time}|${date}|${studentCases}|${staffCases}`, sig)
  ) {
    return ContentService.createTextOutput('invalid signature!')
  }

  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Raw')
  sheet.appendRow([date, studentCases, staffCases])
  return ContentService.createTextOutput('ok!')
}

function verifySig(content, sig) {
  // computeHmacSha256Signature returns signed bytes (-128 to 127), so
  // negative values are shifted into 0-255 before hex encoding
  const computedSig = Utilities.computeHmacSha256Signature(
    content,
    SECRET_KEY
  ).reduce((acc, char) => {
    char = (char < 0 ? char + 256 : char).toString(16).padStart(2, '0')
    return acc + char
  }, '')
  return sig === computedSig
}
Because I’m exposing the script to the internet, I thought it might be better to add some kind of authentication using an HMAC of the data and a timestamp. This probably isn’t necessary since the deployment ID of the Apps Script is a secret in GitHub Actions, but it can’t hurt, right?
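For the curious, the sending side looks roughly like this in Python. It’s a sketch rather than the repository’s exact code (the function name and payload assembly are mine), but the signed message matches what the Apps Script verifies:

import hashlib
import hmac
import time

import requests

def send_to_sheet(deployment_id, secret, date, student_cases, staff_cases):
    # sign "time|date|studentCases|staffCases" with the shared secret
    ts = str(int(time.time()))
    message = f'{ts}|{date}|{student_cases}|{staff_cases}'
    sig = hmac.new(secret.encode(), message.encode(), hashlib.sha256).hexdigest()
    # Apps Script web apps are served from the deployment ID's /exec endpoint
    url = f'https://script.google.com/macros/s/{deployment_id}/exec'
    params = {
        'sig': sig,
        'time': ts,
        'date': date,
        'studentCases': student_cases,
        'staffCases': staff_cases,
    }
    return requests.get(url, params=params)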
Spreadsheet organization
An example copy of the spreadsheet is available if you want to follow along.
My spreadsheet has two sub-sheets: Raw and Filtered. Raw data from the Apps Script gets appended to the bottom of the Raw sheet, and cleanup/analysis is performed in the Filtered sheet.
Since the GitHub Action runs every 4 hours and sends data to the Apps Script even if the data doesn’t change, the Raw sheet contains duplicates (that I occasionally remove by hand). This means we only need to keep the latest data for each date, discarding the rest. We can do this with the UNIQUE function, as well as our friends ARRAYFORMULA and VLOOKUP.
Cell Filtered!G2 de-dupes the date values with =UNIQUE(Raw!$A$2:$A), and the cells to its right (starting at H2) pull in the matching data points with =ARRAYFORMULA(IFNA(VLOOKUP($G$2:$G, Raw!$A$2:$D, 2))).
Let’s break down this formula:
- ARRAYFORMULA allows us to use ranges in functions so we don’t need to copy the same formula into each cell
- VLOOKUP searches the first column of a range for a given value, then returns the value from a specified column of the matching row
- IFNA prevents errors from being displayed if VLOOKUP fails
In this case, the ARRAYFORMULA does mess with VLOOKUP’s range input, but it still gives the desired effect (as long as the latest data for each date sits closer to the bottom of the Raw sheet), so I’m not too concerned.
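As an aside, if you’d rather de-dupe in the Python script instead of the sheet, the same keep-the-latest-row logic is a one-liner in pandas (the file and column names here are illustrative):

import pandas as pd

df = pd.read_csv('data/raw_log.csv')
# rows are appended in order, so the bottom-most row for each date is the latest
deduped = df.drop_duplicates(subset='date', keep='last')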
There is another, more important issue, though. The district’s data resets the numbers every month instead of displaying a running total, so if I directly graph the numbers, there will be a sudden drop at the beginning of each month. This is where the 10 other columns in the Filtered sheet come in.
At a high level, I’m pulling out the month and previous month from each data point (columns E:F), finding the last values from each month (columns K:N), then adding the final number from the previous month (columns A:D). After all this work, we can finally graph these columns.
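If it helps to see that logic outside the spreadsheet, here’s the same month-reset correction as a pandas sketch (again, not the repository’s code, and the column names are illustrative):

import pandas as pd

df = pd.read_csv('data/cases.csv', parse_dates=['date']).sort_values('date')

# within a month the district's numbers already accumulate, so the true
# running total is each value plus the final counts of all earlier months
month = df['date'].dt.to_period('M')
last_of_month = df.groupby(month)['studentCases'].last()
carry = last_of_month.cumsum().shift(1, fill_value=0)
df['studentTotal'] = df['studentCases'] + month.map(carry)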
Tip: If you don’t want these calculation columns to take up space, you can hide them! Select the columns, then right-click and select “Hide Columns E - N”.
Conclusion
By creating this COVID-19 tracker, we’ve combined different techniques to scrape, store, and visualize data. Hopefully, our district can keep these numbers low with continued masking, vaccinations, testing, and quarantine. But even if we don’t, at least we have some cool graphs!