Tracking Positive COVID-19 Cases in My School District
By Tomi Chen, September 5, 2021
Last month, school started in person. I go to Castro Valley High School in Castro Valley, an unincorporated area in Alameda County, California. While there is a mask requirement, students are still crowded into classrooms with no social distancing.
With the pandemic still happening, our school district has published a Positive Case Dashboard that lists the number of students and staff on campus who have tested positive for COVID-19. However, it only lists totals for each month, without information about the number of new cases each day. I decided to use Simon Willison’s git scraping technique to store historical data about positive cases and track changes over time.
The repository for this project is publicly available on GitHub. Check it out if you want to adapt it for yourself, or take a look at how cases have changed over time.
We’ll also be logging the data to a Google Sheet, which allows us to do some calculations and visualizations. Look at my hawt graphs (they’re live-updating, too)!
Fetching the data
To get the data from the district’s spreadsheet, convert it to CSV for a cleaner format, and save it back to the repository, I decided to use a Python script with the requests and pandas libraries. While you could probably do this with bash scripting, I was more familiar with Python.
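To give a sense of the approach, here’s a minimal sketch of what the fetch step might look like. The export URL and file paths are placeholders, not the district’s real ones:

import requests
import pandas as pd
from io import StringIO

# placeholder: a published Google Sheet can be exported as CSV from a URL like this
SHEET_URL = 'https://docs.google.com/spreadsheets/d/SHEET_ID/export?format=csv'

resp = requests.get(SHEET_URL)
resp.raise_for_status()

# parse with pandas for a cleaner format, then save back into the repository
df = pd.read_csv(StringIO(resp.text))
df.to_csv('data/cases.csv', index=False)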
I’m using GitHub Actions to run this script on a cron schedule and re-commit the data back to the repository.
name: Scrape data
on:
  workflow_dispatch:
  schedule:
    # run every 4 hours from 5:06 AM PDT (4 AM PST) to 9:06 PM PDT (8 PM PST) Mon-Fri
    - cron: '6 12,16,20,0,4 * * 1-5'
jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Fetch latest data
        run: python main.py
        env:
          LOGGER_DEPLOYMENT_ID: ${{ secrets.LOGGER_DEPLOYMENT_ID }}
          LOGGER_SECRET: ${{ secrets.LOGGER_SECRET }}
      - name: Commit data
        run: |-
          if [[ `git status --porcelain` ]]; then
            git config user.name github-actions
            git config user.email github-actions@github.com
            git add -A
            git commit -m "add latest data"
            git push
          fi
Visualizing the data
After I got the git scraping thing to work, I decided I wanted to graph the data in a Google Sheet. Since I’m scared of Google’s API authentication system, I used a Google Apps Script deployed as a web app so I could avoid authenticating to Google in the Python script. If you haven’t tried it, Apps Script is a pretty powerful tool for interacting with and connecting across Google services. It’s just JavaScript and the documentation is decent, so I’d recommend trying it out!
To create an Apps Script attached to a Sheet, go to Tools > Script Editor, which should open up an Apps Script editor. Besides deploying as a web app, you can also write custom spreadsheet functions!
My Apps Script is below:
const SECRET_KEY = 'lol no'

function doGet(e) {
  // these are http query parameters
  // sig - hmac sha256 of "time|date|studentCases|staffCases"
  // time - unix time in seconds
  const { sig, time } = e.parameter
  const { date, studentCases, staffCases } = e.parameter

  const currentTime = Math.round(Date.now() / 1000) // time in seconds
  if (
    Math.abs(currentTime - parseInt(time)) > 30 || // invalid if timestamp isn't within 30 seconds of now
    !verifySig(`${time}|${date}|${studentCases}|${staffCases}`, sig)
  ) {
    return ContentService.createTextOutput('invalid signature!')
  }

  const sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Raw')
  sheet.appendRow([date, studentCases, staffCases])
  return ContentService.createTextOutput('ok!')
}

function verifySig(content, sig) {
  // computeHmacSha256Signature returns signed bytes (-128 to 127), so
  // negative values are shifted into 0-255 before hex encoding
  const computedSig = Utilities.computeHmacSha256Signature(
    content,
    SECRET_KEY
  ).reduce((acc, char) => {
    char = (char < 0 ? char + 256 : char).toString(16).padStart(2, '0')
    return acc + char
  }, '')
  return sig === computedSig
}
Because I’m exposing the script to the internet, I thought it might be better to add some kind of authentication using an HMAC of the data and a timestamp. This probably isn’t necessary since the deployment ID of the Apps Script is a secret in GitHub Actions, but it can’t hurt, right?
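For the curious, the sending side looks roughly like this in Python. It’s a sketch rather than the repository’s exact code (the function name and payload assembly are mine), but the signed message matches what the Apps Script verifies:

import hashlib
import hmac
import time

import requests

def send_to_sheet(deployment_id, secret, date, student_cases, staff_cases):
    # sign "time|date|studentCases|staffCases" with the shared secret
    ts = str(int(time.time()))
    message = f'{ts}|{date}|{student_cases}|{staff_cases}'
    sig = hmac.new(secret.encode(), message.encode(), hashlib.sha256).hexdigest()
    # Apps Script web apps are served from the deployment ID's /exec endpoint
    url = f'https://script.google.com/macros/s/{deployment_id}/exec'
    params = {
        'sig': sig,
        'time': ts,
        'date': date,
        'studentCases': student_cases,
        'staffCases': staff_cases,
    }
    return requests.get(url, params=params)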
Spreadsheet organization
An example copy of the spreadsheet is available if you want to follow along.
My spreadsheet has two sub-sheets: Raw and Filtered. Raw data from the Apps Script gets appended to the bottom of the Raw sheet, and cleanup/analysis is performed in the Filtered sheet.
Since the GitHub Action runs every 4 hours and sends data to the Apps Script even if the data doesn’t change, the Raw sheet contains duplicates (that I occasionally remove by hand). This means we only need to keep the latest data for each date, discarding the rest. We can do this with the UNIQUE function, as well as our friends ARRAYFORMULA and VLOOKUP.
Cell Filtered!G2 de-dupes the date values with =UNIQUE(Raw!$A$2:$A), and the cells to its right (starting at H2) pull in the matching data points with =ARRAYFORMULA(IFNA(VLOOKUP($G$2:$G, Raw!$A$2:$D, 2))).
Let’s break down this formula:
- ARRAYFORMULA allows us to use ranges in functions so we don’t need to copy the same formula into each cell
- VLOOKUP searches the first column of a range for a given value, then returns the value from a specified column of the matching row
- IFNA prevents errors from being displayed if VLOOKUP fails
In this case, the ARRAYFORMULA does mess with VLOOKUP’s range input, but it still gives the desired effect (as long as the latest data for each date sits closer to the bottom of the Raw sheet), so I’m not too concerned.
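As an aside, if you’d rather de-dupe in the Python script instead of the sheet, the same keep-the-latest-row logic is a one-liner in pandas (the file and column names here are illustrative):

import pandas as pd

df = pd.read_csv('data/raw_log.csv')
# rows are appended in order, so the bottom-most row for each date is the latest
deduped = df.drop_duplicates(subset='date', keep='last')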
There is another, more important issue, though. The district’s data resets the numbers every month instead of displaying a running total, so if I directly graph the numbers, there will be a sudden drop at the beginning of each month. This is where the 10 other columns in the Filtered sheet come in.
At a high level, I’m pulling out the month and previous month from each data point (columns E:F), finding the last values from each month (columns K:N), then adding the final number from the previous month (columns A:D). After all this work, we can finally graph these columns.
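If it helps to see that logic outside the spreadsheet, here’s the same month-reset correction as a pandas sketch (again, not the repository’s code, and the column names are illustrative):

import pandas as pd

df = pd.read_csv('data/cases.csv', parse_dates=['date']).sort_values('date')

# within a month the district's numbers already accumulate, so the true
# running total is each value plus the final counts of all earlier months
month = df['date'].dt.to_period('M')
last_of_month = df.groupby(month)['studentCases'].last()
carry = last_of_month.cumsum().shift(1, fill_value=0)
df['studentTotal'] = df['studentCases'] + month.map(carry)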
Tip: If you don’t want these calculation columns to take up space, you can hide them! Select the columns, then right-click and select “Hide Columns E - N”.
Conclusion
By creating this COVID-19 tracker, we’ve combined different techniques to scrape, store, and visualize data. Hopefully, our district can keep these numbers low with continued masking, vaccinations, testing, and quarantine. But even if we don’t, at least we have some cool graphs!