Huge difference in Total deaths with other RKI sources. #227

Open
joemrich opened this issue Nov 12, 2020 · 7 comments
Labels
feedback feedback and questions.


@joemrich

I'm working on my Bachelor Thesis and I am searching for the right parameters for a SIR-like Model my Professor created for CoV-19.
While your data is really helpful for my research, I noticed huge differences in the total deaths for Germany between this data and the data the RKI released in their daily situation reports. For example, for March 31st your data states 2662 deaths in Germany, while the attached report mentions 583 deaths.
2020-03-31-de.pdf

I've noticed that there might be some delay in the reports, as the total number of reported cases in these situation reports is about 2-3 days behind your data. But this gap is too large to be explained by delay alone. The total deaths in the reports also don't add up to this much until 2 or 3 weeks later. Do you know of, or can you think of, any reason why the difference is this large?

If you want to look at all those daily reports (I guess you already know these, but here is the link anyways):
https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/Situationsberichte/Gesamt.html

Again thank you very much for your work and also thanks in advance for any answer you might give.
Cheers Jonas

@jgehrcke
Owner

jgehrcke commented Nov 12, 2020

Hey. Thanks for the inquiry. In short, I am pretty confident that 2020-03-31-de.pdf is very much out of date: it shows the data the RKI had at that time. The RKI has been back-filling data all the time, making the "history" much more correct/credible over time -- but only in their ArcGIS system, not in their old reports (especially not in report artifacts such as PDF files -- those are obviously not updated in hindsight). The always-improving view into the past is exactly the major motivation behind this repository, and behind the approach I have chosen for updating the data files. Given the big discrepancy that you've pointed out it's of course the right thing to validate, and re-validate, and re-re-validate :-).

Maybe you can try doing that yourself, by looking at other data sources providing 'time series'?

Otherwise, I'll try to get back to you soon, looking again at other places (such as at https://github.com/CSSEGISandData/COVID-19 and also at the Risklayer-provided time series data about deaths).

@jgehrcke jgehrcke added the feedback feedback and questions. label Nov 12, 2020
@joemrich
Author

Thank you for the quick response.
I will try to find some sources to validate your data, although I wasn't very successful with that so far.
If I can confirm or refute your data I will definitely contact you about that.

Could you maybe link your data on the reported deaths? I can't find it in your documentation, and deaths are not listed per date in the RKI's ArcGIS.
If you happen to have any data on recovered COVID patients I would love to see that as well, as I can only find very poorly estimated numbers.

Thanks so much in advance
Jonas

@joemrich
Author

I just reviewed the Data presented by the JHU again, which is, by my knowledge, said to be the most accurate source in terms of taking report delays into account.
I noticed that the numbers are kind of on par with the ones you show in your deaths-rki-ags.csv with a 6-7 days Delay.
Could it be that the dates in your data set are not aligned correctly with the data?
All other sources I found (e.g. JHU, WHO, RKI) lag your data by around one week.

@jgehrcke
Owner

jgehrcke commented Nov 15, 2020

I just reviewed the Data presented by the JHU again, which is, by my knowledge, said to be the most accurate source in terms of taking report delays into account. I noticed that the numbers are kind of on par with the ones you show in your deaths-rki-ags.csv with a 6-7 days Delay.

Okay. Thanks for doing that. That of course motivated me to invest a little more time towards the re-re-validation I described above :-).

Key difference between the RL and RKI data sets

The JHU data set, as far as I remember, is largely based on the Risklayer GmbH-initiated crowd-sourcing effort.

So, let's ignore JHU for now and look at the RL (Risklayer) data.

I tend to explain the advantage of the RL data set as "most credible for now", because they have the fastest pipeline from the individual Gesundheitsamt (local health office) to their aggregation spreadsheet.

In the main README here in this project I therefore describe this (Risklayer) data set as

Crowdsourcing data (fresh view into the last 1-2 days)

In other words, the Risklayer data set is a reasonably good source for media to state things about today/yesterday.

Now, what's the decisive difference between the Risklayer data set and the RKI data set?

Risklayer do not seem to post-process data from the past to the same extent the RKI does.

My impression has always been that the RKI constantly updates history based on new insights. They apply corrections to historical data as they come up, and as they have time and resources. These amendments can reach back far into the past (weeks, months).

That is, the individual data point (say, the total number of deaths for all Germany for the specific day 2020-03-30) evolves in the RKI data set; over time, as they implement more and more corrections to their time series data.

For that reason, I describe the RKI data as

RKI data (most credible view into the past)

in the main README of this project here.

About:

I just reviewed the Data presented by the JHU again, which is, by my knowledge, said to be the most accurate source in terms of taking report delays into account.

"Yes", in the sense that RL/JHU data is good "for today". However, this statement is not true for historical data.

Total number of deaths for all Germany for 2020-03-30

Now, let's look at the specific discrepancy you've pointed out.

Let's call the metric of interest sum_deaths_germany_2020-03-30.

(the total number of deaths for all Germany for the specific day 2020-03-30)

sum_deaths_germany_2020-03-30 from RKI data set on Nov 15, 2020: ~2300

Indeed, in https://github.com/jgehrcke/covid-19-germany-gae/blob/0c91cc2a5c4033412337821684f608f773fad06e/deaths-rki-by-state.csv (from Nov 15, 2020) the sum_deaths data point for 2020-03-30 is 2288.

sum_deaths_germany_2020-03-30 from RL data set on Nov 15, 2020: ~700

Using this: https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/edit#gid=1875294686

[Screenshot from 2020-11-15 17-39-31, made today (Nov 15, 2020)]

Intermediate conclusion

The discrepancy you've found for sum_deaths_germany_2020-03-30 between the RKI data set in this repository here and other data sources is real. Thank you for reporting!

The natural question that this raises is: did the RKI really update sum_deaths_germany_2020-03-30 over time, in a meaningful way? How does (or better: did) sum_deaths_germany_2020-03-30 evolve over time in the RKI data set?

Let's have a look.

How did sum_deaths_germany_2020-03-30 evolve in the RKI data set?

This question can be answered by going through the history of this repository. That is an interesting additional value that this data set / git repository provides over other data sources: it provides a history of history. Through the git history we can see how the time series data changed over time.

I took the time today to write a corresponding tool to walk the git history of the file deaths-rki-by-state.csv, extracted sum_deaths_germany_2020-03-30 for every commit to deaths-rki-by-state.csv in this repository, and plotted the result:

[Plot: sum_deaths_germany_2020-03-30_evolution_rki_db]
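For anyone who wants to reproduce this kind of analysis, here is a minimal sketch of the idea. It is not the actual tool used for the plot above. It assumes a local clone of the repository, `git` on the PATH, a `sum_deaths` column in the CSV file, and an ISO 8601 timestamp in the first column (the last point is an assumption about the file layout). The function names are hypothetical.

```python
import csv
import io
import subprocess


def sum_deaths_on(csv_text, day, column="sum_deaths"):
    """Return the value of `column` for the first row whose timestamp starts with `day`.

    Assumption: the first CSV column holds an ISO 8601 timestamp.
    Returns None if the column or the day cannot be found.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None or column not in header:
        return None
    idx = header.index(column)
    for row in reader:
        if row and row[0].startswith(day):
            try:
                return int(float(row[idx]))
            except (ValueError, IndexError):
                return None
    return None


def commits_touching(path, repo="."):
    """List (commit_hash, commit_date) for every commit that changed `path`, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--format=%H %cI", "--", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return [tuple(line.split(" ", 1)) for line in out.splitlines() if line]


def evolution(path, day, repo="."):
    """Yield (commit_date, value) for the data point of `day` across the file's git history."""
    for commit, commit_date in commits_touching(path, repo):
        shown = subprocess.run(
            ["git", "-C", repo, "show", f"{commit}:{path}"],
            capture_output=True, text=True,
        )
        if shown.returncode != 0:
            continue  # e.g. the file does not exist at this commit
        value = sum_deaths_on(shown.stdout, day)
        if value is not None:
            yield commit_date, value
```

Run from inside a clone, `list(evolution("deaths-rki-by-state.csv", "2020-03-30"))` would give one (commit date, value) pair per commit, which can then be plotted with any plotting library.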

Observations and conclusions

So, how did the total number of deaths for all Germany for the specific day 2020-03-30 evolve in the RKI data set?

sum_deaths_germany_2020-03-30 evolved gradually in more or less regular increments, from below 800 before April to over 2200 by the end of May. sum_deaths_germany_2020-03-30 evolved in a monotonically increasing fashion. The curve shows convergence: most of the change happened within April (from below 800 to 2000). Another significant correction was applied within May (from 2000 to 2200), and even within June and thereafter noticeable changes have been made.

This appears to be intentional.

Given the additional knowledge we have (what you describe as "a 6-7 days delay") I think it's fair to assume that the individual corrections intentionally back-dated certain incidents of death (as opposed to actually adding new incidents).

For specific details about these corrections we would have to ask the RKI about why they did this.

It's fair to assume that these corrections had the intent to make the time series more correct. I think we can stand by our conclusion: the RKI data set provides the most credible/correct view into the past (because they try to!).

As a final remark, the impact of these corrections (left-shifting the initial deaths spike by about a week) is probably rather small for most considerations and topics of discussion.
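To illustrate the back-dating argument with a toy example: the numbers below are synthetic, not RKI or Risklayer data, and the helper names are made up for this sketch. The same deaths are counted once by (assumed) date of death and once by a report date seven days later. The final total is identical, but the cumulative value for an early day differs a lot, and the lag can be estimated by shifting one curve against the other.

```python
from itertools import accumulate

LAG_DAYS = 7  # assumed reporting delay, for illustration only

# Synthetic daily deaths, indexed by date of death (NOT real data).
daily_by_death_date = [1, 2, 4, 7, 12, 20, 30, 45, 60, 80, 95, 100]

# The same events, indexed by report date: everything shows up LAG_DAYS later.
daily_by_report_date = [0] * LAG_DAYS + daily_by_death_date
# Pad the back-dated series so that both cover the same number of days.
daily_by_death_date_padded = daily_by_death_date + [0] * LAG_DAYS

cum_by_death_date = list(accumulate(daily_by_death_date_padded))
cum_by_report_date = list(accumulate(daily_by_report_date))


def estimate_lag(early, late, max_lag=14):
    """Return the shift k (in days) for which late[t + k] best matches early[t]."""
    def mean_abs_diff(k):
        pairs = list(zip(early, late[k:]))
        if not pairs:
            return float("inf")
        return sum(abs(a - b) for a, b in pairs) / len(pairs)
    return min(range(max_lag + 1), key=mean_abs_diff)


day = 9  # an early day in the synthetic time series
print("cumulative deaths on day", day)
print("  by date of death:", cum_by_death_date[day])
print("  by report date:  ", cum_by_report_date[day])
print("final totals:", cum_by_death_date[-1], cum_by_report_date[-1])
print("estimated lag in days:", estimate_lag(cum_by_death_date, cum_by_report_date))
```

The point of the sketch: back-dating does not add deaths, it only moves them to earlier days. That alone is enough for a single historical data point to differ by a large factor between two sources.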

@jgehrcke
Owner

For the record: I added the tooling I used for the analysis/plot above here: #234

@joemrich
Author

Wow!
I really wasn't expecting such a detailed answer to my simple question.
Now I definitely have a better understanding on how the data in your files is put together and why it differs so much from broadly available data.
I think for my Thesis these rather 'small' shifts of a week will have a quite big impact, as I am only reviewing the first month of the Pandemic in Germany (March) or rather a smaller region like the 'Rhein-Main-Gebiet' before the Covid-safety measures like the quarantine or making masks mandatory in most public buildings were passed.

Thank you so much for all the work and help!
I will try to make good use of it.

@jgehrcke
Owner

jgehrcke commented Nov 16, 2020

Thanks for the response, and the kind words!

Now I definitely have a better understanding on how the data in your files is put together and why it differs so much from broadly available data.

Thanks for acking this explicitly -- popularity of data is rarely correlated with its quality :P especially in times of the "AI/ML/data science" hype.

I think for my Thesis these rather 'small' shifts of a week will have a quite big impact

Ha, I guess that's for you to find out then :) Good luck!
In any case I am glad that we had this discussion!

Maybe -- for the purpose of your thesis -- why don't you reach out to the RKI, asking about the shift, maybe pointing them to this discussion thread here? :) If you do that: please report back -- I am super curious, and I'd also like to keep the dots connected.

About the "quite big impact" -- I understand that certain models for certain dynamics could be rather sensitive to this shift. But (being a physicist myself) I think that it's important to be skeptical here -- to consider that this then might be an over-sensitivity. But yeah, just a superficial intuition.

Please keep coming back with good questions when you have them. Cheers!


2 participants