Make sitemap.xml.gz slightly more reproducible #3460

Merged
oprypin merged 3 commits into master from gz on Dec 8, 2023

Conversation

oprypin
Contributor

@oprypin oprypin commented Nov 10, 2023

The gzip format stores a timestamp inside it, but there's no real point to it being correct.

If a site is rebuilt twice from exactly the same inputs, the timestamp metadata of the files will differ, sure, but this gzip file was the only one whose actual content also differed each time.

Now, instead, the date stored in the gzip file will change only once per day, based on the pages' update date. The sitemap.xml itself already changes once per day.
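
To illustrate the problem described above, here is a minimal sketch using only Python's standard `gzip` module. It is not the code from this PR: the helper name `gzip_bytes` and the sample timestamps are made up for the example. It shows that two compressions with the same `mtime` are byte-identical, and that changing `mtime` alters only the 4-byte timestamp field in the header.

```python
import gzip
import io
from typing import Optional


def gzip_bytes(data: bytes, mtime: Optional[float] = None) -> bytes:
    """Compress data in memory. With mtime=None, gzip embeds the current time."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()


# Placeholder content standing in for a real sitemap.xml.
sitemap = b"<urlset>...</urlset>"

# Same content, same embedded timestamp: identical output.
fixed_a = gzip_bytes(sitemap, mtime=1699574400)
fixed_b = gzip_bytes(sitemap, mtime=1699574400)

# Same content, different embedded timestamp: the archives differ,
# but only in header bytes 4..7, where the timestamp is stored.
other = gzip_bytes(sitemap, mtime=1699660800)

print(fixed_a == fixed_b)
print(fixed_a == other)
print(fixed_a[:4] == other[:4] and fixed_a[8:] == other[8:])
```

Leaving `mtime` unset is what makes otherwise identical builds produce different archive bytes.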

@ultrabug
Member

Aren't you worried that giving the file static metadata could cause unexpected behavior in third-party implementations?

Off the top of my head, indexing robots might look at it to decide whether to refresh a sitemap and reindex. I don't really know, though; just sharing a thought.

@oprypin
Contributor Author

oprypin commented Nov 14, 2023

Valid point. And I don't know how to be fully sure.

There are like 4 places where timestamps could come into question:

  • Server says that each file should be cached for N more seconds - just by default without any input
  • Timestamp of the .gz file (if a server decides to use timestamps and report them in Last-Modified, surely that's what it would expose)
  • Timestamp inside the content of the .gz file (who would choose to download an archive and then ignore it just based on a timestamp?) - I am removing this
  • Individual timestamps of files inside the sitemap e.g. https://www.mkdocs.org/sitemap.xml - of course this is already the content itself. I am not removing it, it's going to keep changing on a daily basis.
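
For reference, the third item in the list above (the timestamp inside the .gz content) lives in the gzip member header, separate from the file's filesystem timestamp. A small sketch of reading it back, assuming a single-member gzip stream; the helper name `embedded_mtime` and the sample value are illustrative only:

```python
import gzip
import struct
from datetime import datetime, timezone


def embedded_mtime(gz_data: bytes) -> int:
    """Return the MTIME header field (bytes 4..7, little-endian) of a gzip stream."""
    if gz_data[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    (mtime,) = struct.unpack("<I", gz_data[4:8])
    return mtime


# Build a small archive with a known embedded timestamp (Python 3.8+).
blob = gzip.compress(b"<urlset/>", mtime=1700000000)

stamp = embedded_mtime(blob)
print(stamp)
print(datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat())
```

A client only sees this value after it has already downloaded the archive, which is why it is the least likely of the four to influence caching decisions.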

@ultrabug
Member

Since sometimes code is faster than words, I'd like to humbly propose another approach with #3468

If you like the idea, I can work on fixing tests and improving the code ofc

@oprypin
Contributor Author

oprypin commented Dec 1, 2023

I had an even better idea: the reproducible value won't be something fake, but instead the max of all the dates mentioned in the sitemap. I'll try to rework this PR around that idea.
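
A rough sketch of that idea, not the code that was eventually merged: parse the `lastmod` dates out of the sitemap, take the latest one, and pass it as the gzip `mtime`. The function name `latest_lastmod`, the sample URLs, and the assumption that every `lastmod` is a plain `YYYY-MM-DD` date are all illustrative.

```python
import gzip
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content for the example.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2023-11-30</lastmod></url>
  <url><loc>https://example.com/about/</loc><lastmod>2023-12-01</lastmod></url>
</urlset>"""


def latest_lastmod(sitemap_xml: bytes) -> int:
    """Return the newest <lastmod> date as a Unix timestamp (midnight UTC)."""
    root = ET.fromstring(sitemap_xml)
    dates = [
        datetime.strptime(el.text.strip(), "%Y-%m-%d").replace(tzinfo=timezone.utc)
        for el in root.iterfind("sm:url/sm:lastmod", NS)
    ]
    return int(max(dates).timestamp())


# The embedded timestamp is derived from the content, so rebuilding
# the same sitemap yields the same archive bytes.
mtime = latest_lastmod(SITEMAP)
first = gzip.compress(SITEMAP, mtime=mtime)
second = gzip.compress(SITEMAP, mtime=mtime)
print(first == second)
```

With this approach the archive changes only when the newest date in the sitemap changes.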

@oprypin
Contributor Author

oprypin commented Dec 1, 2023

Oh what, the sitemap populates lastmod based on just the current date, not the file's modification date? 🤦

@oprypin oprypin changed the title from "Make sitemap.xml.gz content always reproducible" to "Make sitemap.xml.gz slightly more reproducible" on Dec 1, 2023
@oprypin
Contributor Author

oprypin commented Dec 1, 2023

Now, instead, the date stored in the gzip file will change only once per day, based on the pages' update date. The sitemap.xml itself already changes once per day.

Member

@ultrabug ultrabug left a comment

Sounds like a nice bargain indeed

@oprypin oprypin merged commit ccf011d into master Dec 8, 2023
30 checks passed
@oprypin oprypin deleted the gz branch December 8, 2023 20:39
@giordano

Are there plans to release a new version that includes this change? Avoiding needless changes to the compressed archive in static websites maintained on GitHub would be nice.

Successfully merging this pull request may close these issues.

Reproducible Builds
3 participants