Tomi Chen

Tracking Positive COVID-19 Cases in My School District

By Tomi Chen September 5, 2021

Last month, school started in person. I go to Castro Valley High School in Castro Valley, an unincorporated area in Alameda County, California. While there is a mask requirement, students are still crowded into classrooms with no social distancing.

With the pandemic still happening, our school district has published a Positive Case Dashboard that lists the number of students and staff on campuses that test positive for COVID-19. However, it only lists totals for each month, without information about the number of new cases each day. I decided to use Simon Willison’s git scraping technique to store historical data about positive cases and track changes over time.

The repository for this project is publicly available on GitHub. Check it out if you want to adapt it for yourself, or take a look at how cases have changed over time.

Fetching the data 🔗

To get the data from the district’s spreadsheet, convert it to CSV for a cleaner format, and save it back to the repository, I decided to use a Python script with the requests and pandas libraries. While you could probably do this with bash scripting, I was more familiar with Python.

I’m using GitHub Actions to run this script on a cron schedule and re-commit the data back to the repository.

name: Scrape data

on:
  workflow_dispatch:
  schedule:
    # run every 4 hours from 5:06 AM PDT (4 AM PST) to 9:06 PM PDT (8 PM PST) Mon-Fri
    - cron: '6 12,16,20,0,4 * * 1-5'

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Fetch latest data
        run: python main.py
        env:
          LOGGER_DEPLOYMENT_ID: $
          LOGGER_SECRET: $

      - name: Commit data
        run: |-
          if [[ `git status --porcelain` ]]; then
            git config user.name github-actions
            git config user.email github-actions@github.com
            git add -A
            git commit -m "add latest data"
            git push
          fi

Visualizing the data 🔗

After I got the git scraping thing to work, I decided I wanted to graph the data in a Google Sheet. Since I’m scared of Google’s API authentication system, I used a Google Apps Script deployed as a web app so I could avoid authenticating to Google in the Python script. If you haven’t tried it, Apps Script is a pretty powerful tool for interacting and connecting across Google Services. It’s just JavaScript and the documentation is decent, so I’d recommend trying it out!

To create an App Script attached to a Sheet, go to Tools > Script Editor, which should open up an Apps Script editor. Besides deploying as a web app, you can also write custom spreadsheet functions!

My App Script is below:

const SECRET_KEY = "lol no"

function doGet(e) {
  // these are http query parameters
  // sig - hmac sha256 of "time|date|student|staff"
  // time - unix time in seconds
  const { sig, time } = e.parameter
  const { date, studentCases, staffCases } = e.parameter

  const currentTime = Math.round(Date.now() / 1000) // time in seconds
  if (
    Math.abs(currentTime - parseInt(time)) > 30 || // invalid if timestamp isn't within 30 seconds of now
    !verifySig(`${time}|${date}|${studentCases}|${staffCases}`, sig)
  ) {
    return ContentService.createTextOutput("invalid signature!")
  }

  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Raw")
  sheet.appendRow([date, studentCases, staffCases])

  return ContentService.createTextOutput("ok!");
}

function verifySig(content, sig) {
  const computedSig = Utilities.computeHmacSha256Signature(content, SECRET_KEY).reduce((acc, char) => {
    char = (char < 0 ? char + 256 : char).toString(16).padStart(2, "0");
    return acc + char
  }, "")

  return sig === computedSig
}

Because I’m exposing the script to the internet, I thought it might be better to add some kind of authentication using an HMAC of the data and a timestamp. This probably isn’t necessary since the deployment ID of the Apps Script is a secret in GitHub Actions, but it can’t hurt, right?

Spreadsheet organization 🔗

An example copy of the spreadsheet is available if you want to follow along.

My spreadsheet has two sub-sheets: Raw and Filtered. Raw data from the App Script gets appended to the bottom of the Raw sheet, and cleanup/analysis is performed in the Filtered sheet.

Since the GitHub Action runs every 4 hours and sends data to the App Script even if the data doesn’t change, the Raw sheet contains duplicates (that I occasionally manually remove). This means that we only need to keep the latest data for each date, discarding the rest. We can do this with the UNIQUE function, as well as our friends ARRAYFORMULA and VLOOKUP.

Cell Filtered!G2 de-dupes date values with =UNIQUE(Raw!$A$2:$A), and cells H2:G2 find the respective data point with =ARRAYFORMULA(IFNA(VLOOKUP($G$2:$G, Raw!$A$2:$D, 2))).

Let’s break down this formula:

In this case, the ARRAYFORMULA does mess with VLOOKUP’s range input, but it still gives the desired effect (when the Raw sheet has the latest data closer to the bottom), so I’m not too concerned.

There is another, more important issue, though. The district’s data resets the numbers every month instead of displaying a running total, so if I directly graph the numbers, there will be a sudden drop at the beginning of each month. This is where the 10 other columns in the Filtered sheet come in.

At a high level, I’m pulling out the month and previous month from each data point (columns E:F), finding the last values from each month (columns K:N), then adding the final number from the previous month (columns A:D). After all this work, we can finally graph these columns.

Tip: If you don’t want these calculation columns to take up space, you can hide them! Select the columns, then right-click and select “Hide Columns E - N”.

Conclusion 🔗

By creating this COVID-19 tracker, we’ve combined different techniques to scrape, store, and visualize data. Hopefully, our district can keep these numbers low with continued masking, vaccinations, testing, and quarantine. But even if we don’t, at least we have some cool graphs!

PREV POST