Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizer: cache runtime artifacts on filesystem #747

Merged
merged 6 commits into from May 6, 2020
Merged

Conversation

sebastianbenz
Copy link
Collaborator

This PR addresses the problem that currently transformations rely on working access to the internet:

  • a new FileSystemCache stores downloaded runtime artifacts (validator rules, latest amp runtime version, v0.css)
  • transformation never blocks on network access if a cached version is available
  • download AMP versions have a max-age of 10 minutes, after that we check if a new AMP version is available (not sure if this is a good frequency). If one is available it's downloaded in the background.
  • option to disable caching (used in tests)
  • add a postinstall hook that will download the latest runtime artifacts to prime the cache

Fixes #378 #719 #650

@mdmower would be great if you could check if this works on windows.

Copy link
Collaborator

@fstanis fstanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some additional thoughts:

  • FileSystemCache has no eviction mechanism and no size limit on values you put in the map. You may run out of memory. I'd limit the number of keys at least (but better to use a proper LRU cache).
  • You'll make copies of duplicate files (two urls returning same data). This can be avoided you keep an index of url -> hash(data) and use hash(data) instead of hash(url) for the filename.
  • If you can't write a generic fetch-compatible wrapper, perhaps you can narrow down what fetch does to something simpler, e.g. async function that only takes a URL and returns a string? That would make it easier to wrap in a cached call.

*/
'use strict';
const {join} = require('path');
const {rmdir, readdir, readFile, writeFile, unlink, existsSync, mkdirSync} = require('fs');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering you have this many imports, I'd argue destructuring isn't a good idea, some of these are more readable as part of fs (e.g. fs.unlink vs unlink).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol - yeah. Frog in boiling water...

'use strict';
const {join} = require('path');
const {rmdir, readdir, readFile, writeFile, unlink, existsSync, mkdirSync} = require('fs');
const {promisify} = require('util');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need this? Unless you want to support Node.js 9 and below (which is probably no longer needed... oldest supported LTS is 10), use fs.promises.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Old habit...fixed.

const unlinkAsync = promisify(unlink);

class FileSystemCache {
static get(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get to me implies a singleton-like pattern, but this actually creates a new instance (also, you used create below for a similar pattern). Also, do you really need this, since it doesn't do much other than provide a default (which you can just do in the constructor).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to create. Leftover from a previous implementation.

* limitations under the License.
*/
'use strict';
const {join} = require('path');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you're importing path twice. Also, I suggest not destructuring, join can be very confusing (as opposed to path.join).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

await Promise.all(
entries.map((entry) => {
let fullPath = path.join(dir, entry.name);
return entry.isDirectory() ? this.deleteDir_(fullPath) : unlinkAsync(fullPath);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is potentially dangerous: directories can be cyclic. That said, I guess this is not a big deal and hard to fix. I'd consider using fs.rmdir which is experimental or rimraf. Or add a max depth.

Actually, if I understand correctly, nested directories are never created by this cache. I'd throw if there are non-files in a directory.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rmdir is unfortunately only Node >=12. rimraf has to many dependencies for a non-dev dependency.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this further, there's another danger here: if there's a symlink outside this directory (worst case, to ~ or /), this will delete everything it can. You can argue that's not very likely, but I'd prefer if this was impossible.

A few options:

  • Ignore directories and only delete files (since directories are never created this way anyway).
  • Check if a directory is a symlink first.
  • Resolve symlinks and check if the directory has the initial directory as a prefix.

Regarding fs.rmdir, you can add an if to check if it's available and fallback to your own implementation if not. But probably not needed anyway. Just make sure it doesn't recursively delete my home folder. :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sg - changed to only delete files in cache dir

/**
* @private
*/
async function downloadAmpRuntimeStyles_(config, runtimeCssUrl) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed, consider moving to a more generic fetch wrapper.


async function warmupCaches() {
// run a dummy transformation to pre-fill the caches
await ampOptimizer.transformHtml('<h1>hello world</h1>', {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: just return it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

const cacheFile = this.createCacheFileName(key);
try {
const content = await readFileAsync(cacheFile, 'utf-8');
value = JSON.parse(content);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why must it be JSON? If you use a raw buffer, your cache becomes binary-compatible (and less error prone and less code).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you got an example for this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const content = await fs.readFile(cacheFile); // returns Buffer
this.cache.set(key, value); // Buffer in Map
...
fs.writeFile(cacheFile, value); // writes Buffer to file, no need to specify encoding

JSON.parse also works on a Buffer, by coercing it to string first (i.e. JSON.parse(str) === JSON.parse(Buffer.from(str))).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, so the advantage is that you don't need to mess with encoding?

Comment on lines 42 to 44
await initValidatorRules(runtimeParameters, customRuntimeParameters, config);
await initRuntimeVersion(runtimeParameters, customRuntimeParameters, config);
await initRuntimeStyles(runtimeParameters, config);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These will execute sequentially, but it can be done in parallel with one (annoying) tranformation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good tip! done

Copy link
Collaborator Author

@sebastianbenz sebastianbenz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Filip! Comments:

  • re LRU Cache: not if that's a realistic scenario with releases once a week, but better save than sorry. Using an LRU Cache now.
  • same value for different URLs is not a use case as we only store version specific artifacts which are immutable.
  • fetch wrapper: I don't see the need here. There are three different fetches happening (ValidatorRules with built-int fetch, JSON and text), some require a max-age some don't. Unifying this feels like over-engineering at this point. This might change if we need to fetch more artifacts.

Comment on lines 42 to 44
await initValidatorRules(runtimeParameters, customRuntimeParameters, config);
await initRuntimeVersion(runtimeParameters, customRuntimeParameters, config);
await initRuntimeStyles(runtimeParameters, config);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good tip! done

*/
'use strict';
const {join} = require('path');
const {rmdir, readdir, readFile, writeFile, unlink, existsSync, mkdirSync} = require('fs');
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol - yeah. Frog in boiling water...

'use strict';
const {join} = require('path');
const {rmdir, readdir, readFile, writeFile, unlink, existsSync, mkdirSync} = require('fs');
const {promisify} = require('util');
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Old habit...fixed.

const unlinkAsync = promisify(unlink);

class FileSystemCache {
static get(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed to create. Leftover from a previous implementation.

const cacheFile = this.createCacheFileName(key);
try {
const content = await readFileAsync(cacheFile, 'utf-8');
value = JSON.parse(content);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you got an example for this?


async function warmupCaches() {
// run a dummy transformation to pre-fill the caches
await ampOptimizer.transformHtml('<h1>hello world</h1>', {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

* limitations under the License.
*/
'use strict';
const {join} = require('path');
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

* @param {Number} timestampInMs time when max-age value was received
* @param {Number} value max-age value in seconds
**/
static fromJson(timestampInMs, value) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, renamed to from/toObject. I miss Java in these cases...

@sebastianbenz sebastianbenz requested a review from fstanis May 6, 2020 12:56
const cacheFile = this.createCacheFileName(key);
try {
const content = await readFileAsync(cacheFile, 'utf-8');
value = JSON.parse(content);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

const content = await fs.readFile(cacheFile); // returns Buffer
this.cache.set(key, value); // Buffer in Map
...
fs.writeFile(cacheFile, value); // writes Buffer to file, no need to specify encoding

JSON.parse also works on a Buffer, by coercing it to string first (i.e. JSON.parse(str) === JSON.parse(Buffer.from(str))).

async set(key, value) {
const cacheFile = this.createCacheFileName(key);
try {
this.cache[key] = value;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the test worked because it was always getting a cache miss and loading from a file. You could maybe add a test that clears the folder on the filesystem after a set to make sure it loads it from the LRU.

await Promise.all(
entries.map((entry) => {
let fullPath = path.join(dir, entry.name);
return entry.isDirectory() ? this.deleteDir_(fullPath) : unlinkAsync(fullPath);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this further, there's another danger here: if there's a symlink outside this directory (worst case, to ~ or /), this will delete everything it can. You can argue that's not very likely, but I'd prefer if this was impossible.

A few options:

  • Ignore directories and only delete files (since directories are never created this way anyway).
  • Check if a directory is a symlink first.
  • Resolve symlinks and check if the directory has the initial directory as a prefix.

Regarding fs.rmdir, you can add an if to check if it's available and fallback to your own implementation if not. But probably not needed anyway. Just make sure it doesn't recursively delete my home folder. :)

@sebastianbenz
Copy link
Collaborator Author

@fstanis agree, only removing the files now

Thanks everyone for the review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create one way to handle network dependencies (validator, etc.)
4 participants