New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: implement ErrorSnapshotter for error context capture #2332
feat: implement ErrorSnapshotter for error context capture #2332
Conversation
…280) This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.
Hey @B4nan, I have completed the pull request that resolves the issue with saving Save screenshot/HTML on first occurrence of error in error statistics. However, when I was trying things out, I ran into a few hiccups. I used the debugger and set up a quick test file with Cheerio and the browser to check if everything was working fine. The good news is that the object passed the test, but I couldn't get around to testing the file saved in the key-value store properly. Normal tests ran fine, but seems there are errors but I don't believe they are related to my changes. Unfortunately, I was unable to run the e2e tests or build it due to some errors in the terminal, and I wasn't able to figure out what caused them. I'm using
|
Can you send the full build log you have from the build command? There should be more information 👀 |
Full build log:
|
Just checked CI logs too, and yeah.. probably updating imports with what I sent might fix it |
Yes, it's fixed now. |
A couple of things: 1- I need to handle saving the local file path to the screenshot and HTML, currently, I only save the Apify path which will not work locally.
It's not possible to save the HTML or take a screenshot if an error happens before navigation, and I have not yet looked at the case for post-navigation |
Please add some tests for the new functionality. Also, note that what you did (making the |
Thanks. I will check it a bit later this week |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have some example stats output file that this now generates?
- For local dev, we could use a full local filesystem path (like
/Users/lukas/apify/public-actors/facebook/storage/key_value_stores/${kvName}/${filename}
but not sure what is the best practice in dev world - Seems good enough but I want to see example files
- Not sure too, you should test it manually with a scrape but also do unit tests for this
- For now to the configured default store, the same place where the stats are stored. We can add some configuration later, up to @B4nan how much of it we should have
- We should be able to capture snapshots of e.g. pages that navigated but then were rejected as blocked by status code. Those should already have a body and perhaps the page object. But I'm not sure if we can get hold of it in time.
Here is an example, clicking on the (path + cmd) opens the file:
|
@HamzaAlwan When adding tests, let's add a few real-world errors so we can properly trim very long error messages. I wonder if the 30 chars error length could be longer but let's see how the real world errors will look like. Thanks. |
Everything is done, please look it over and provide any feedback. What is left is this:
This code works for browsers because we can access the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@B4nan @vladfrangu Could you guys review the latest version? Thanks
@B4nan @vladfrangu @barjin Hey guys, let's get this merged pls. It is really useful for us and we believe for the community as well. We are ready to refactor it if needed. Thanks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
left few comments, nothing huge, overall it looks pretty good
i have two points outside of that:
- the use of
APIFY_IS_AT_HOME
: I pointed out the recent PR (fix: Fixed double extension for screenshots #2419) which might resolve this so we don't need it. overwise this would be literally the only place in the whole crawlee codebase we would rely on apify env vars (for anything else than improved logging). lets try to find a way around that and not use this env var. - the whole functionality seems to be hardcoded, this needs to be configurable - and i am not entirely sure it should be enabled by default
this.result = Object.create(null); | ||
this.total = 0; | ||
} | ||
|
||
add(error: ErrnoException) { | ||
async add(error: ErrnoException, context?: CrawlingContext) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am still not very keen of this breaking change either (making sync methods async)
and looking at the implementation, i'd say it can stay sync, you dont really need to wait for the screenshots to resolve, right? and we can still return the promise to allow optional awaiting
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, not awaiting the snapshot actually sounds like a good idea, just make sure it is try/caught inside so we don't get a promise rejection crash, then the add can return promise or error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and we can still return the promise to allow optional awaiting
I'm not sure how to implement this
packages/playwright-crawler/src/internals/utils/playwright-utils.ts
Outdated
Show resolved
Hide resolved
@HamzaAlwan Please look at @B4nan 's remarks. Let's prioritize this so we can finish it finally. cc @AndreyBykov |
…d revert changes to add function
Great that you addressed the comments just now @HamzaAlwan, we want to ship v3.10 in the afternoon most probably, so let's try to squeeze this in. Don't forget to rebase and resolve the conflicts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another round of comments. its looking good, but i really want to get rid of the apify related hardcoded stuff with KVS and paths. i can try to deal with this myself if you don't know how
// v3.0.0 used `crawlee_storage` as the default, we changed this in v3.0.1 to just `storage`, | ||
// this function handles it without making BC breaks - it respects existing `crawlee_storage` | ||
// directories, and uses the `storage` only if it's not there. | ||
const defaultStorageDir = () => { | ||
if (pathExistsSync(resolve('./crawlee_storage'))) { | ||
return 'crawlee_storage'; | ||
} | ||
|
||
return 'storage'; | ||
}; | ||
|
||
this.KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores'; | ||
this.KEY_VALUE_STORE_LOCAL_PATH = `file://${process.env.PWD}/${defaultStorageDir()}/key_value_stores`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets remove this, there is no point in supporting crawlee 3.0.0 when this feature will land in 3.10.0
// v3.0.0 used `crawlee_storage` as the default, we changed this in v3.0.1 to just `storage`, | |
// this function handles it without making BC breaks - it respects existing `crawlee_storage` | |
// directories, and uses the `storage` only if it's not there. | |
const defaultStorageDir = () => { | |
if (pathExistsSync(resolve('./crawlee_storage'))) { | |
return 'crawlee_storage'; | |
} | |
return 'storage'; | |
}; | |
this.KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores'; | |
this.KEY_VALUE_STORE_LOCAL_PATH = `file://${process.env.PWD}/${defaultStorageDir()}/key_value_stores`; | |
this.KEY_VALUE_PLATFORM_PATH = 'https://api.apify.com/v2/key-value-stores'; | |
this.KEY_VALUE_STORE_LOCAL_PATH = `file://${process.env.PWD}/storage/key_value_stores`; |
i also very much dislike the fact that we hardcode some apify URL here (i dislike its here, its not relevant whether you get the value from env vars or hardcode it). why do you need to handle this manually? we already have abstractions in place, you should just use KVS and it will do something locally and something else on the platform, why is that not enough and you need to deal with different paths explicitly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess Hamza would do that if he would be more aware of how the storage classes work behind the scenes :) Can you just suggest what to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would remove this manual URL handling completely and just call KVS.setValue (and deal with the URLs elsewhere, e.g. via that getPublicUrl
method in SDK (which we can add to crawlee too if we want it to provide URLs to the locally stored files)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i can try to improve this myself if needed, would very much like to have this ready by the end of the day or tomorrow morning. also its probably not a blocker for merging, @HamzaAlwan please focus on the other parts if this one is not clear (but let's remove the crawlee_storage
conditional handling as suggested).
edit: maybe it will be better of left for myself, as it might require more internal adjustments... lets focus on the other comments mainly
if (process.env.APIFY_IS_AT_HOME) { | ||
const platformPath = `${this.KEY_VALUE_PLATFORM_PATH}/${keyValueStore.id}/records`; | ||
return { | ||
screenshotFileUrl: screenshotFilename ? `${platformPath}/${screenshotFilename}` : undefined, | ||
htmlFileUrl: htmlFileName ? `${platformPath}/${htmlFileName}` : undefined, | ||
}; | ||
} | ||
|
||
const localPath = `${this.KEY_VALUE_STORE_LOCAL_PATH}/${keyValueStore.name || 'default'}`; | ||
return { | ||
screenshotFileUrl: screenshotFilename ? `${localPath}/${screenshotFilename}` : undefined, | ||
htmlFileUrl: htmlFileName ? `${localPath}/${htmlFileName}` : undefined, | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned above, this is bad, it hardcodes handling for platform vs local, which is completely abstracted away from user (based on the Actor.init
call, and different configuration options), you should just use a KVS and let it do its job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI we have this KVS.getPublicUrl()
method defined in the SDK variant of the class, i would rather see an approach in this direction instead of hardcoding things on wrong level.
https://github.com/apify/apify-sdk-js/blob/master/packages/apify/src/key_value_store.ts#L12
i don't understand why would you need to care about local paths, but we could define the same method on the crawlee variant of the KVS class to resolve to that, sounds like a much cleaner solution
errno?: number | undefined; | ||
code?: string | number | undefined; | ||
path?: string | undefined; | ||
syscall?: string | undefined; | ||
cause?: any; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the | undefined
, its not doing anything since you already define the props as optional via ?
also i have doubts about the cause
being required, that should be already defined on the base Error
class
looks like this was there already, but please lets fix it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was added there because of this:
this.add(error.cause);
this.add expects ErrnoException type and it shows an error, as cause
type is unknown in Error
class
test/core/error_tracker.test.ts
Outdated
@@ -157,7 +155,7 @@ test('no code is null code', () => { | |||
expect(tracker.getUniqueErrorCount()).toBe(1); | |||
}); | |||
|
|||
test('can hide error code', () => { | |||
test('can hide error code', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i mentioned one place, and you only fixed only one place. all of those changes in this file should be removed (except the import)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah sorry I forgot the rest
@HamzaAlwan please try to work on this asap if you want to see this in v3.10.0 |
I might be late to the party, but why |
it does not store just screenshots, but also html snapshots, right? edit: in fact it does not store any screenshots when used with non-browser crawlers |
We use the term snapshot as "store representation of the page" like screenshot, HTML (potentially HAR , JS etc. if we would want that). I agree the term is confusing but we didn't find better one. We already have https://crawlee.dev/api/3.7/playwright-crawler/namespace/playwrightUtils#saveSnapshot |
…rty names for snapshot files
317a5a5
to
4a59c07
Compare
This commit introduces the ErrorSnapshotter class to the crawlee package, providing functionality to capture screenshots and HTML snapshots when an error occurs during web crawling.
This functionality is opt-in, and can be enabled via the crawler options:
Closes #2280