Add logic for separating old versions of records, consolidating different calendar dates. #24

Open
zimzoom opened this issue Mar 16, 2023 · 1 comment

zimzoom commented Mar 16, 2023

Court records are sometimes updated. Right now, the scraper creates a hash of each HTML file so that it can tell whether the file has changed since the last time it was scraped, but it never actually checks that hash.
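For illustration, here is a minimal sketch of what checking the hash could look like. The function names are hypothetical, and SHA-256 is an assumption; the scraper may already use a different algorithm, in which case that one should be kept so stored hashes stay comparable.

```python
import hashlib
from typing import Optional


def html_hash(html: str) -> str:
    """Return a hex digest of the page content (SHA-256 is an assumption)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def has_changed(html: str, stored_hash: Optional[str]) -> bool:
    """True when no hash is stored yet or the content hash differs."""
    return stored_hash is None or html_hash(html) != stored_hash
```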

We want to keep old versions of the records, differentiating each version by a field called "revision id". The first version of a record would have revision id = 1, and the latest version would be the one with the highest revision id.

Add database migration:

  • add a field for revision id
  • add a field for hash (for easier lookup)
  • add a field for case number (for easier lookup)
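A rough sketch of the migration, assuming a SQLite database and a hypothetical table named `records` (the project's actual database, table name, and migration tooling may differ). The column names and the index are also assumptions; the index is just one way to get the "easier lookup" mentioned above.

```python
import sqlite3


def migrate(conn: sqlite3.Connection, table: str = "records") -> None:
    """Add revision_id, hash and case_number columns if they are missing.

    The table name and column names are hypothetical placeholders.
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    new_columns = {
        "revision_id": "INTEGER NOT NULL DEFAULT 1",
        "hash": "TEXT",
        "case_number": "TEXT",
    }
    for name, declaration in new_columns.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {declaration}")
    # Index to speed up lookups by case number (an assumption, not a requirement).
    conn.execute(
        f"CREATE INDEX IF NOT EXISTS idx_{table}_case_rev "
        f"ON {table} (case_number, revision_id)"
    )
    conn.commit()
```

Existing rows would get revision id = 1 through the column default; `hash` and `case_number` would still need to be backfilled for them.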

Add the following logic to the scraper:

  1. if it's a brand new case number, store it with revision id = 1
  2. if it's an old case number, but a new hash, store it with a higher revision id
  3. if it's an old case number and the same hash, don't store the record
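The three cases above could be sketched roughly as follows. This assumes SQLite and a hypothetical `records` table with `case_number`, `hash`, `revision_id`, and `html` columns; the function name is made up for illustration. It also makes two interpretive assumptions: "a higher revision id" is taken to mean the current highest plus one, and "the same hash" is checked against every stored revision of the case, not only the latest.

```python
import sqlite3
from typing import Optional


def store_record(
    conn: sqlite3.Connection, case_number: str, content_hash: str, html: str
) -> Optional[int]:
    """Store a scraped record; return its revision id, or None if skipped."""
    rows = conn.execute(
        "SELECT revision_id, hash FROM records WHERE case_number = ?",
        (case_number,),
    ).fetchall()

    if not rows:
        # Case 1: brand new case number.
        revision_id = 1
    elif any(stored_hash == content_hash for _, stored_hash in rows):
        # Case 3: old case number and the same hash, so nothing is stored.
        return None
    else:
        # Case 2: old case number but a new hash (highest + 1 is an assumption).
        revision_id = max(rev for rev, _ in rows) + 1

    conn.execute(
        "INSERT INTO records (case_number, hash, revision_id, html) "
        "VALUES (?, ?, ?, ?)",
        (case_number, content_hash, revision_id, html),
    )
    conn.commit()
    return revision_id
```

The latest version of a case can then be fetched with `ORDER BY revision_id DESC LIMIT 1` on its case number.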

(Note: the above logic should also solve the problem of multiple court calendar dates pointing to the same record, all of which are currently being saved. If that problem remains, create a new issue for it.)

@zimzoom zimzoom self-assigned this Mar 16, 2023
@tpadmanabhan

Add Joshua L as an assignee
