
Releases: apify/crawlee

v0.21.5

30 Sep 12:23

This is a very minor release that fixes some issues that were preventing
use of the SDK with Node 14.

  • Update the request serialization process which is used in RequestList
    to work with Node 10+ and not only 10 and 12.
  • Update some TypeScript types that were preventing build due to changes
    in typed dependencies.

v0.21.4

02 Sep 20:06

The request statistics that you may remember from logs are now persisted in the key-value store,
so you won't lose count when your actor restarts. We've also added many new stats
that can be useful after a run finishes. Besides that,
we fixed some bugs and annoyances and improved the TypeScript experience a bit.

  • Add persistence to Statistics class and automatically persist it in BasicCrawler.
  • Fix issue where inaccessible Apify Proxy would cause ProxyConfiguration to throw
    a timeout error.
  • Update default user agent to Chrome 85
  • Bump Puppeteer to 5.2.1 which uses Chromium 85
  • TypeScript: Fix RequestAsBrowserOptions missing some values and add RequestQueueInfo
    as a return value from requestQueue.getInfo()

v0.21.3

27 Jul 18:09
  • Fix useless logging in Session.

v0.21.2

27 Jul 17:24
  • Fix cookies with leading dot in domain (as extracted from Puppeteer) not being correctly added to Sessions.
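
The note above does not say how the fix works. As a rough illustration of the problem only, the sketch below strips a leading dot from a cookie domain, which RFC 6265 says to ignore when parsing the Domain attribute. The helper `normalizeCookieDomain` is hypothetical and is not the SDK's implementation; the cookie object is shaped like those returned by Puppeteer's `page.cookies()`.

```javascript
// Hypothetical helper, not the SDK's implementation: strips one leading dot
// from a cookie domain, which RFC 6265 says to ignore when parsing Domain.
function normalizeCookieDomain(domain) {
    return domain.startsWith('.') ? domain.slice(1) : domain;
}

// Cookie object shaped like the ones Puppeteer's page.cookies() returns.
const puppeteerCookie = { name: 'sid', value: 'abc123', domain: '.example.com', path: '/' };

// Copy the cookie and replace only the domain.
const normalized = { ...puppeteerCookie, domain: normalizeCookieDomain(puppeteerCookie.domain) };

console.log(normalized.domain); // example.com
```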

v0.21.1

21 Jul 12:59
0017b47

We fixed some bugs, improved a few things and bumped Puppeteer to match latest Chrome 84.

  • Allow Apify.createProxyConfiguration to be used seamlessly with the proxy component
    of Actor Input UI.
  • Fix integration of plugins into CheerioCrawler with the crawler.use() function.
  • Fix a race condition which caused RequestQueueLocal to fail handling requests.
  • Fix broken debug logging in SessionPool.
  • Improve ProxyConfiguration error message for missing password / token.
  • Update Puppeteer to 5.2.0
  • Improve docs, update packages and so on.

v0.21.0

06 Jun 14:30

This release comes with breaking changes that will affect most, if not all, of your projects. See the migration guide for more information and examples.

The first large change is a redesigned proxy configuration. Cheerio and Puppeteer crawlers now accept a proxyConfiguration parameter, which is an instance of ProxyConfiguration. This class now exclusively manages both Apify Proxy and custom proxies. See the new proxy management guide for details.

We also removed Apify.utils.getRandomUserAgent() as it was no longer effective in avoiding bot detection and changed the default values for empty properties in Request instances.

  • BREAKING: Removed Apify.getApifyProxyUrl(). To get an Apify Proxy url, use proxyConfiguration.newUrl([sessionId]).
  • BREAKING: Removed useApifyProxy, apifyProxyGroups and apifyProxySession parameters from all applications in the SDK. Use proxyConfiguration in crawlers and proxyUrl in requestAsBrowser and Apify.launchPuppeteer.
  • BREAKING: Removed Apify.utils.getRandomUserAgent() as it was no longer effective in avoiding bot detection.
  • BREAKING: Request instances no longer initialize empty properties with null, which means that:
    • empty errorMessages are now represented by [], and
    • empty loadedUrl, payload and handledAt are undefined.
  • Add Apify.createProxyConfiguration() async function to create ProxyConfiguration instances. ProxyConfiguration itself is not exposed.
  • Add proxyConfiguration to CheerioCrawlerOptions and PuppeteerCrawlerOptions.
  • Add proxyInfo to CheerioHandlePageInputs and PuppeteerHandlePageInputs. You can use this object to retrieve information about the currently used proxy in Puppeteer and Cheerio crawlers.
  • Add click buttons and scroll up options to Apify.utils.puppeteer.infiniteScroll().
  • Fixed a bug where intercepted requests would never continue.
  • Fixed a bug where Apify.utils.requestAsBrowser() would get into redirect loops.
  • Fix Apify.utils.getMemoryInfo() crashing the process on AWS Lambda and on systems running in Docker without memory cgroups enabled.
  • Update Puppeteer to 3.3.0.
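
The change to empty Request properties mostly affects code that compared those properties against null. The following is a minimal sketch, using plain objects as stand-ins for Request instances, of checks that behave the same for the old and new empty values. The helpers `hasValue` and `errorCount` are hypothetical and not part of the SDK, and the "legacy" object assumes, as the note implies, that all four properties were previously null.

```javascript
// Plain objects standing in for Request instances; they are not created by the SDK.
// Assumption: before 0.21.0 all four empty properties were null.
const legacyRequest = { loadedUrl: null, payload: null, handledAt: null, errorMessages: null };
const currentRequest = { loadedUrl: undefined, payload: undefined, handledAt: undefined, errorMessages: [] };

// Hypothetical helper: `!= null` is false for both null and undefined,
// so the old and the new empty values are treated the same way.
function hasValue(value) {
    return value != null;
}

// Hypothetical helper: counts errors whether the empty state is null or [].
function errorCount(request) {
    return (request.errorMessages || []).length;
}

console.log(hasValue(legacyRequest.loadedUrl), hasValue(currentRequest.loadedUrl)); // false false
console.log(errorCount(legacyRequest), errorCount(currentRequest)); // 0 0
```

Strict comparisons such as `request.loadedUrl === null` are the ones most likely to need updating after the upgrade.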

v0.20.4

11 May 11:11
7558eb3
  • Add Apify.utils.waitForRunToFinish() which simplifies waiting for an actor run to finish.
  • Add standard prefixes to log messages to improve readability and orientation in logs.
  • Add support for async handlers in Apify.utils.puppeteer.addInterceptRequestHandler()
  • EXPERIMENTAL: Add cheerioCrawler.use() function to enable attaching a CrawlerExtension
    to the crawler to modify its behavior. A CrawlerExtension is a plugin that extends the crawler's functionality.
  • Fix bug with cookie expiry in SessionPool.
  • Fix issues in documentation.
  • Updated @apify/http-request to fix issue in the proxy-agent package.
  • Updated Puppeteer to 3.0.2

v0.20.3

14 Apr 17:05
91e0d3e
  • DEPRECATED: CheerioCrawlerOptions.requestOptions is now deprecated. Please use
    CheerioCrawlerOptions.prepareRequestFunction instead.
  • Add limit option to Apify.utils.enqueueLinks() for situations when full crawls are not needed.
  • Add suggestResponseEncoding and forceResponseEncoding options to CheerioCrawler to allow
    users to provide a fall-back or forced encoding of responses in situations where websites
    serve invalid encoding information in their headers.
  • Add a number of new examples and update existing ones to documentation.
  • Fix duplicate file extensions in Apify.utils.puppeteer.saveSnapshot() when used locally.
  • Fix encoding of multi-byte characters in CheerioCrawler.
  • Fix formatting of navigation buttons in documentation.

v0.20.2

09 Mar 17:06
  • Fix an error where persistence of SessionPool would fail if a cookie included invalid
    expires value.
  • Version 0.20.1 was skipped because of an error in publishing via CI.

v0.20.0

03 Mar 13:06
  • BREAKING: Apify.utils.requestAsBrowser() no longer aborts the request on status code 406
    or when a content type other than text/html is received. Use options.abortFunction if you want to
    retain this functionality.
  • BREAKING: Added useInsecureHttpParser option to Apify.utils.requestAsBrowser() which
    is true by default and forces the function to use an HTTP parser that is less strict than
    the default Node 12 parser, but also less secure. It is needed to bypass certain
    anti-scraping walls and fetch websites that do not comply with the HTTP spec.
  • BREAKING: RequestList now removes all the elements from the sources array on
    initialization. If you need to use the sources somewhere else, make a copy. This change
    was added as one of several measures to improve memory management of RequestList
    in scenarios with a very large number of Request instances.
  • DEPRECATED: RequestListOptions.persistSourcesKey is now deprecated. Please use
    RequestListOptions.persistRequestsKey.
  • RequestListOptions.sources can now be an array of string URLs as well.
  • Added sourcesFunction to RequestListOptions. It enables dynamic fetching of sources
    and will only be called if persisted Requests were not retrieved from key-value store.
    Use it to reduce memory spikes and also to make sure that your sources are not re-created
    on actor restarts.
  • Updated stealth hiding of webdriver to avoid recent detections.
  • Apify.utils.log now points to an updated logger instance which prints colored logs (in TTY)
    and supports overriding with custom loggers.
  • Improved Apify.launchPuppeteer() code to prevent triggering bugs in Puppeteer by passing
    more options than required to puppeteer.launch().
  • Documented BasicCrawler.autoscaledPool property, and added CheerioCrawler.autoscaledPool
    and PuppeteerCrawler.autoscaledPool properties.
  • SessionPool now persists state on teardown. Before, it only persisted state every minute.
    This ensures that after a crawler finishes, the state is correctly persisted.
  • Added TypeScript typings and typedef documentation for all entities used throughout SDK.
  • Upgraded proxy-chain NPM package from 0.2.7 to 0.4.1 and many other dependencies
  • Removed all usage of the now deprecated request package.
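
For projects that relied on the old abort behavior of Apify.utils.requestAsBrowser(), a predicate along the following lines could be supplied as options.abortFunction. This is only a sketch: the name `abortOnNonHtml` is hypothetical, and it assumes the function receives a response object exposing `statusCode` and lower-cased `headers`, as a Node.js `http.IncomingMessage` does.

```javascript
// Hypothetical predicate intended for options.abortFunction.
// Assumes the response exposes statusCode and lower-cased header names,
// as Node.js http.IncomingMessage does. Returning true means "abort".
function abortOnNonHtml(response) {
    const contentType = response.headers['content-type'] || '';
    // Ignore parameters such as "; charset=utf-8".
    const mimeType = contentType.split(';')[0].trim().toLowerCase();
    return response.statusCode === 406 || mimeType !== 'text/html';
}

// Response-like plain objects used only to exercise the predicate.
const htmlOk = { statusCode: 200, headers: { 'content-type': 'text/html; charset=utf-8' } };
const notAcceptable = { statusCode: 406, headers: { 'content-type': 'text/html' } };
const json = { statusCode: 200, headers: { 'content-type': 'application/json' } };

console.log(abortOnNonHtml(htmlOk)); // false
console.log(abortOnNonHtml(notAcceptable)); // true
console.log(abortOnNonHtml(json)); // true
```

Note that this sketch also aborts when the content-type header is missing entirely; whether that matches the pre-0.20.0 behavior is not stated in the notes above.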