This repository has been archived by the owner on Feb 25, 2022. It is now read-only.

Revisit the feed concept #794

Open
kptdobe opened this issue Mar 17, 2021 · 12 comments

Comments

@kptdobe
Contributor

kptdobe commented Mar 17, 2021

For theblog, I had to debug the feed concept and I think it must be revisited (or killed?).

Cache issues

https://blog.adobe.com/feeds/jp.xml is cached empty. How can it be updated? The live version is correct: https://theblog--adobe.hlx.live/feeds/jp.xml.

More generally, if the xml pages are cached and we want the feeds to be up to date, then they must all be flushed on every query-index change (a topic could be added or removed, a page could be added or removed...).

ESI include issues

A feed is a collection of 10 ESI includes, which is impossible to get right on hlx.page: a timeout or some other failure occurs and the xml stream is cut off in the middle. This should definitely be implemented differently.

cc @trieloff @davidnuescheler

@kptdobe
Contributor Author

kptdobe commented Mar 17, 2021

Sidekick is out of scope. The problem is that the author visits neither the "source content" page (the xml definition in the git repo) nor the query-index. It should rather be a backend job that maintains all the feed pages.

@rofe
Contributor

rofe commented Mar 17, 2021

Sidekick would only work as a browser extension on an xml or json document anyway. The bookmarklet won't load.

@kptdobe
Contributor Author

kptdobe commented Mar 19, 2021

Some details on the ESI include issue.

The problem is easy to reproduce: just open https://theblog--adobe.hlx.page/feeds/jp.xml.

The definition of the feed is here: https://github.com/adobe/theblog/blob/master/feeds/jp.xml
The action code that executes the rendering is here: https://github.com/adobe/helix-pages/blob/master/cgi-bin/feed.js

The xml page ends up being a set of 10 ESI includes to individual {entry.id}.embed.html requests (the definition comes from an XLSX spreadsheet, but the idea is to dump an html block inside a <![CDATA[...]]> section). When it fails, you can flush the last (broken) include (curl -X PURGE {entry.id}.embed.html) and reload https://theblog--adobe.hlx.page/feeds/jp.xml: this usually produces a failure at a different place.
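
To make the shape of such a page concrete, here is a minimal Python sketch of a feed whose entry bodies are ESI includes wrapped in CDATA. It is not the code in feed.js: the Atom-style element names, the `build_feed` helper and the entry fields are assumptions for illustration only.

```python
from xml.sax.saxutils import escape, quoteattr


def build_feed(title, entries):
    """Return feed XML with one ESI include per entry.

    `entries` is a list of dicts with "id", "title" and "updated" keys
    (hypothetical field names). Element names are Atom-like by assumption.
    """
    parts = ["<feed>", f"  <title>{escape(title)}</title>"]
    for entry in entries:
        # Each entry body is fetched by the CDN from {entry.id}.embed.html.
        src = f"{entry['id']}.embed.html"
        parts += [
            "  <entry>",
            f"    <id>{escape(entry['id'])}</id>",
            f"    <title>{escape(entry['title'])}</title>",
            f"    <updated>{escape(entry['updated'])}</updated>",
            f"    <content><![CDATA[<esi:include src={quoteattr(src)}/>]]></content>",
            "  </entry>",
        ]
    parts.append("</feed>")
    return "\n".join(parts)
```

With 10 entries the CDN has to resolve 10 includes before the response is complete, so a single slow or failing include is enough to affect the whole document.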

I have added some logs that are visible in the AWS console (CloudWatch > LogGroups > pages--cgi-bin-feed). But this will probably not help for the ESI include debugging.

cc @stefan-guggisberg tell me if this is not clear or you need more info.

@kptdobe
Contributor Author

kptdobe commented Mar 19, 2021

For the cache issue, @trieloff mentioned that lowering the TTL to 15 or 30 mins on the outer CDN for feeds "should" do the trick. Something we can explore if we solve the ESI issue.

@stefan-guggisberg
Contributor

stefan-guggisberg commented Mar 23, 2021

I could boil down the ESI include issue to the following simple example:

I put the following static XML snippet with 5 ESI includes in my blog fork: https://github.com/stefan-guggisberg/theblog/blob/master/entries.xml

When I request it through Fastly (https://theblog--stefan-guggisberg.hlx.page/entries.xml) the response is always corrupted.

With fewer ESI includes it works, e.g. https://theblog--stefan-guggisberg.hlx.page/entries2.xml

The issue we're facing seems to be a combination of some Fastly timeout for ESI processing (to be verified) and slow delivery of included resources.
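
When comparing the two test documents, a small check like the following can help classify a response as corrupted. This is only a sketch using the Python standard library; `check_feed` is a hypothetical helper, and "corrupted" is assumed here to mean either truncated (not well-formed) XML or an include that was never resolved.

```python
import xml.etree.ElementTree as ET
from typing import List


def check_feed(body: str) -> List[str]:
    """Return a list of problems found in a feed response body."""
    problems = []
    try:
        # A stream cut in the middle is no longer well-formed XML.
        ET.fromstring(body)
    except ET.ParseError as err:
        problems.append(f"not well-formed XML: {err}")
    # A fully processed response should contain no ESI tags anymore.
    if "<esi:include" in body:
        problems.append("unprocessed <esi:include> left in the response")
    return problems
```

An empty list means the body parsed and no include was left behind; it says nothing about whether the included content itself is correct.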

@stefan-guggisberg
Contributor

stefan-guggisberg commented Mar 25, 2021

Regarding the caching issues:

AFAICU, in order to completely purge a feed the following steps are required (in this exact order):

  1. purge the blog posts in the feed if they were modified
  2. purge the cgi-bin request, e.g. /cgi-bin/feed.xml?src=/jp/query-index.json%3Flimit%3D10&id=path&title=title&updated=date
  3. purge the feed request, e.g. /feeds/jp.xml

on each of:

  1. inner CDN (hlx.page)
  2. outer CDN (hlx.live)
  3. Skyline (only the feed request, e.g. /feeds/jp.xml)
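
The two lists above can be combined into an ordered purge plan. The Python sketch below only builds the list of URLs and sends nothing. It assumes the three steps are run per layer, from the inner CDN outwards, which is one reading of the lists; `purge_plan`, the layer names and the post paths are hypothetical.

```python
from typing import List, Sequence, Tuple


def purge_plan(
    layers: Sequence[Tuple[str, str]],
    post_paths: Sequence[str],
    cgi_path: str,
    feed_path: str,
    feed_only_layers: Sequence[str] = (),
) -> List[Tuple[str, str]]:
    """Return (layer, url) pairs in the order they should be purged.

    `layers` is an ordered list of (name, host) pairs. Layers named in
    `feed_only_layers` only get the feed request purged.
    """
    plan = []
    for name, host in layers:
        if name not in feed_only_layers:
            # Step 1: the modified blog posts that appear in the feed.
            plan += [(name, f"https://{host}{path}") for path in post_paths]
            # Step 2: the cgi-bin request that renders the feed.
            plan.append((name, f"https://{host}{cgi_path}"))
        # Step 3: the feed request itself.
        plan.append((name, f"https://{host}{feed_path}"))
    return plan
```

Each resulting URL would then be purged in list order, for example with curl -X PURGE as mentioned earlier in the thread. The host to use for the Skyline layer is not stated in this thread and has to be supplied by the caller.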

See also https://github.com/adobe/project-helix/pull/540, which probably led to confusion.

@trieloff Please review

@kptdobe
Contributor Author

kptdobe commented Mar 25, 2021

Thanks, this is helpful. Just one "detail": the request to a blog post for rendering in the feed uses the .embed selector. Do we really purge the path with this selector when we purge a blog post?

@stefan-guggisberg
Contributor

Do we really purge the path with this selector when we purge a blog post?

I seriously doubt it. @rofe might know for sure.

@stefan-guggisberg
Contributor

Re adobe/project-helix#540: I went ahead and applied the change to theblog--adobe.hlx.live.

@trieloff
Contributor

I've been thinking that we would probably benefit from fetching the included posts on the server side using helix-fetch instead of ESI to make use of the greater concurrency and better error handling this affords us.

This would also reduce the number of Fastly caches in play.

@kptdobe
Contributor Author

kptdobe commented Mar 25, 2021

That's what I suggested as an alternative if we cannot solve the ESI issue. I am just afraid of the 60s limit to retrieve those 10 responses.

@trieloff
Contributor

30 seconds. We can start all requests in parallel and skip the entries that are not fast enough.
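
The "start everything in parallel and skip what is too slow" idea can be sketched as follows. This is a Python asyncio illustration, not helix-fetch code: `fetch_within_deadline` and the `fetch` callback are hypothetical, and the deadline passed in is assumed to sit safely below the 30 second limit.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Sequence


async def fetch_within_deadline(
    entry_ids: Sequence[str],
    fetch: Callable[[str], Awaitable[str]],
    deadline_s: float,
) -> Dict[str, str]:
    """Fetch all entries concurrently; keep only those done by the deadline."""
    if not entry_ids:
        return {}
    # Start every request at once and remember which task belongs to which entry.
    tasks = {asyncio.ensure_future(fetch(entry_id)): entry_id for entry_id in entry_ids}
    done, pending = await asyncio.wait(tasks.keys(), timeout=deadline_s)
    # Entries that are too slow are cancelled and skipped.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    results = {}
    for task in done:
        # Failed requests are skipped as well instead of breaking the feed.
        if not task.cancelled() and task.exception() is None:
            results[tasks[task]] = task.result()
    # Preserve the original feed order.
    return {entry_id: results[entry_id] for entry_id in entry_ids if entry_id in results}
```

The feed would then be rendered from the returned entries only, so a slow or failing post shortens the feed rather than cutting the xml stream in the middle.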
