style: get rid of mentions of what is new, modern, current
We're not historians of the web; we're trying to teach people,
and removing these will make the text more concise, future-proof,
and less subjective.
honzajavorek committed Apr 24, 2024
1 parent 060ffbb commit d81fd3b
Showing 20 changed files with 45 additions and 28 deletions.
2 changes: 1 addition & 1 deletion sources/academy/glossary/concepts/dynamic_pages.md
@@ -11,7 +11,7 @@ slug: /concepts/dynamic-pages

---

In the modern web, single-page applications (SPAs) are becoming increasingly popular, especially due to JavaScript libraries like [React.js](https://reactjs.org/) and [Vue.js](https://vuejs.org/) pushing their development to the mainstream. Oftentimes, single-page applications (and loads of non-SPAs too) have dynamic content.
Oftentimes, web pages load additional information dynamically, long after their main body is loaded in the browser. A subset of dynamic pages takes this approach further and loads all of its content dynamically. Websites built this way are called single-page applications (SPAs), and the approach is widespread thanks to popular JavaScript libraries, such as [React](https://reactjs.org/) or [Vue](https://vuejs.org/).

As you progress in your scraping journey, you'll quickly realize that different websites load their content and populate their pages with data in different ways. Some pages are rendered entirely on the server, some retrieve the data dynamically, and some use a combination of both those methods.

4 changes: 2 additions & 2 deletions sources/academy/glossary/concepts/html_elements.md
@@ -17,7 +17,7 @@ You can also add **attributes** to an element to provide additional information
<img src="image.jpg" alt="A description of the image">
```

In modern JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page. For example, you can use the [`querySelector()` method](./querying_css_selectors.md) to select an element by its [CSS selector](./css_selectors.md), like this:
In JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page. For example, you can use the [`querySelector()` method](./querying_css_selectors.md) to select an element by its [CSS selector](./css_selectors.md), like this:

```js
const myElement = document.querySelector('#myId');
@@ -37,4 +37,4 @@ const myElements = document.getElementsByTagName('p');

Once you have selected an element, you can use JavaScript to change its content, style, or behavior.
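
For instance, an illustrative snippet (reusing the `myElement` variable selected above):

```js
// Change the element's text content and styling, and react to clicks.
myElement.textContent = 'Hello, world!';
myElement.style.color = 'rebeccapurple';
myElement.addEventListener('click', () => {
  console.log('The element was clicked!');
});
```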

In summary, an HTML element is a building block of a web page, and it is defined by a **tag**, it can also have **attributes** which provide additional information or control how the element behaves and in modern JavaScript, you can use the **DOM** (Document Object Model) to interact with elements on a web page.
In summary, an HTML element is a building block of a web page. It is defined by a **tag** with **attributes**, which provide additional information or control how the element behaves. You can use the **DOM** (Document Object Model) to interact with elements on a web page.
2 changes: 1 addition & 1 deletion sources/academy/platform/getting_started/apify_api.md
@@ -47,7 +47,7 @@ https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-it

Additional parameters can be passed to this endpoint. You can learn about them [here](/api/v2#/reference/actors/run-actor-synchronously-and-get-dataset-items/run-actor-synchronously-with-input-and-get-dataset-items)

> Note: It is safer to put your API token in the **Authorization** header like so: `Authorization: Bearer YOUR_TOKEN`. This is very easy to configure in [Postman](../../glossary/tools/postman.md), [Insomnia](../../glossary/tools/insomnia.md), or any other modern HTTP client.
> Note: It is safer to put your API token in the **Authorization** header like so: `Authorization: Bearer YOUR_TOKEN`. This is very easy to configure in popular HTTP clients, such as [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md).
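
For instance, a rough sketch of such a request in JavaScript (the username, token, and input fields are placeholders, not values from this tutorial):

```js
// Run the Actor synchronously and fetch its dataset items, with the token
// passed in the Authorization header rather than in the URL.
const response = await fetch(
    'https://api.apify.com/v2/acts/YOUR_USERNAME~adding-actor/run-sync-get-dataset-items',
    {
        method: 'POST',
        headers: {
            Authorization: 'Bearer YOUR_TOKEN',
            'Content-Type': 'application/json',
        },
        body: JSON.stringify({ num1: 5, num2: 7 }), // illustrative input only
    },
);

console.log(await response.json());
```
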
## Sending the request {#sending-the-request}

@@ -91,7 +91,7 @@ The API tab gives you a quick overview of all the available API calls in case yo
## [](#scraping-theory) Scraping theory

Since this is a tutorial, we'll be scraping our own website. [Apify Store](https://apify.com/store) is a great candidate for some scraping practice. It's a page that uses modern web technologies and displays a lot of different items in various categories, just like an online store, a typical scraping target, would.
Since this is a tutorial, we'll be scraping our own website. [Apify Store](https://apify.com/store) is a great candidate for some scraping practice. It's a page built on popular technologies that displays a lot of different items in various categories, just like an online store, a typical scraping target, would.

### [](#the-goal) The goal

@@ -365,8 +365,7 @@ be automatically enqueued to the request queue. Use a label to let the scraper k
### [](#waiting-for-dynamic-content) Waiting for dynamic content

Before we talk about paginating, we need to have a quick look at dynamic content. Since Apify Store is a JavaScript
application (as many, if not most, modern websites are), the button might not exist in the page when the scraper
runs the `pageFunction`.
application (a popular approach), the button might not exist in the page when the scraper runs the `pageFunction`.

How is this possible? Because the scraper only waits with executing the `pageFunction` for the page to load its HTML.
If there's additional JavaScript that modifies the DOM afterwards, the `pageFunction` may execute before this
3 changes: 1 addition & 2 deletions sources/academy/tutorials/apify_scrapers/web_scraper.md
@@ -263,8 +263,7 @@ be automatically enqueued to the request queue. Use a label to let the scraper k
### [](#waiting-for-dynamic-content) Waiting for dynamic content

Before we talk about paginating, we need to have a quick look at dynamic content. Since Apify Store is a JavaScript
application (as many, if not most, modern websites are), the button might not exist in the page when the scraper
runs the `pageFunction`.
application (a popular approach), the button might not exist in the page when the scraper runs the `pageFunction`.

How is this possible? Because the scraper only waits with executing the `pageFunction` for the page to load its HTML.
If there's additional JavaScript that modifies the DOM afterwards, the `pageFunction` may execute before this
@@ -17,7 +17,7 @@ The `Target closed` error happens when you try to access the `page` object (or s

![Chrome crashed tab](./images/chrome-crashed-tab.png)

A modern browser creates a separate process for each tab. That means each tab lives with a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot simply close your old tabs to free extra memory so it will usually kill your current memory hungry tab.
Browsers create a separate process for each tab. That means each tab lives in a separate memory space. If you have a lot of tabs open, you might run out of memory. The browser cannot simply close your old tabs to free extra memory, so it will usually kill your current, memory-hungry tab.

### Memory solution

2 changes: 1 addition & 1 deletion sources/academy/tutorials/node_js/optimizing_scrapers.md
@@ -21,7 +21,7 @@ Before we dive into the practical side of things, let us diverge with an analogy

## Game development analogy {#analogy}

Modern games are extremely complicated beasts. Every frame (usually 60 times a second), the game has to calculate the physics of the world, run AI, user input, and render everything into a beautiful scene. You can imagine that running all of that every 16 ms in a complicated game is a developer's nightmare. That's why a significant portion of game development is spent on optimizations. Every little waste matters.
Games are extremely complicated beasts. Every frame (usually 60 times a second), the game has to calculate the physics of the world, run AI, user input, and render everything into a beautiful scene. You can imagine that running all of that every 16 ms in a complicated game is a developer's nightmare. That's why a significant portion of game development is spent on optimizations. Every little waste matters.

This is mainly true in the programming heart of the game - the engine. The engine is responsible for the heavy lifting of performance critical parts like physics, animation, AI, and rendering. Once the engine is built, you can design the game on top of it. You can add different spells, conversation chains, items, animations etc. to make your game cool. Those extra things may not run every frame and don't need to be optimized as heavily as the engine itself.

@@ -142,7 +142,7 @@ Those are similar to the ones above with an important caveat. Once you click the

## Frontend navigations

Modern websites typically won't navigate away just to fetch the next set of results. They will do it in the background and just update the displayed data. To paginate websites like that is quite easy actually and it can be done in both Web Scraper and Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Just click the next button to load the next set of courses.
Websites often won't navigate away just to fetch the next set of results. They will do it in the background and just update the displayed data. Paginating such websites is actually quite easy, and it can be done in both Web Scraper and Puppeteer Scraper. Try it on [Udemy](https://www.udemy.com/topic/javascript/) for example. Just click the next button to load the next set of courses.

```js
// Web Scraper\
2 changes: 1 addition & 1 deletion sources/academy/webscraping/anti_scraping/index.md
@@ -24,7 +24,7 @@ If you don't have time to read about the theory behind anti-scraping protections

- Use high-quality proxies. [Residential proxies](/platform/proxy/residential-proxy) are the least blocked. You can find many providers out there like Apify, BrightData, Oxylabs, NetNut, etc.
- Set **real-user-like HTTP settings** and **browser fingerprints**. [Crawlee](https://crawlee.dev/) uses statistically generated realistic HTTP headers and browser fingerprints by default for all of its crawlers.
- Use a modern browser to pass bot capture challenges. We recommend [Playwright with Firefox](https://crawlee.dev/docs/examples/playwright-crawler-firefox) because it is not that common for scraping. You can also play with [non-headless mode](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#headless) and adjust other [fingerprint settings](https://crawlee.dev/api/browser-pool/interface/FingerprintGeneratorOptions).
- Use a browser to pass bot capture challenges. We recommend [Playwright with Firefox](https://crawlee.dev/docs/examples/playwright-crawler-firefox) because it is not that common for scraping. You can also play with [non-headless mode](https://crawlee.dev/api/playwright-crawler/interface/PlaywrightCrawlerOptions#headless) and adjust other [fingerprint settings](https://crawlee.dev/api/browser-pool/interface/FingerprintGeneratorOptions).
- Consider extracting data from **[private APIs](../api_scraping/index.md)** or **mobile app APIs**. They are usually much less protected.
- Increase the number of request retries significantly to at least 10 with [`maxRequestRetries: 10`](https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions#maxRequestRetries). Rotate sessions after every error with [`maxErrorScore: 1`](https://crawlee.dev/api/core/interface/SessionOptions#maxErrorScore), as shown in the sketch below.
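
A rough sketch of how several of these tips can be combined in one Crawlee crawler (the proxy URL is a placeholder, and the options should be double-checked against the Crawlee docs linked above):

```js
import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';
import { firefox } from 'playwright';

const crawler = new PlaywrightCrawler({
    // High-quality proxies; this URL is a placeholder for your provider's.
    proxyConfiguration: new ProxyConfiguration({
        proxyUrls: ['http://username:password@proxy.example.com:8000'],
    }),
    // Firefox is less common for scraping than Chromium.
    launchContext: { launcher: firefox },
    // Retry blocked requests more aggressively.
    maxRequestRetries: 10,
    // Rotate the session after every error.
    sessionPoolOptions: {
        sessionOptions: { maxErrorScore: 1 },
    },
    async requestHandler({ request, page }) {
        // Crawlee already generates realistic headers and fingerprints by default.
        console.log(`Visited ${request.url}`);
    },
});

await crawler.run(['https://example.com']);
```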

29 changes: 24 additions & 5 deletions sources/academy/webscraping/api_scraping/index.md
@@ -20,15 +20,34 @@ In this module, we will discuss the benefits and drawbacks of API scraping, how

## What's an API? {#what-is-api}

An API is a custom service that lives on the server of any given website. They provide an intuitive way for the website's client-side pages to send and receive data to and from the server, where it can be stored in a database, manipulated, or used to perform an operation. Though not **all** sites have APIs, the majority do - especially modern web-applications. Learn more about APIs [in this article](https://blog.apify.com/what-is-an-api/).
An API is a custom service that lives on the server of a given website. It provides an intuitive way for the website's client-side pages to send and receive data to and from the server, where it can be stored in a database, manipulated, or used to perform an operation. Though not **all** sites have APIs, many do, especially those built as complex web applications. Learn more about APIs [in this article](https://blog.apify.com/what-is-an-api/).

## Different types of APIs

The vast majority of APIs out there are standard REST APIs that have different endpoints. Even though this is the case, it is important to be aware that newer API technologies such as [GraphQL](https://graphql.org/) are becoming more popular, and therefore more utilized in modern web applications.
There are two mainstream approaches to APIs: REST and GraphQL. While REST is a loosely defined architectural style based on conventions, GraphQL has a specification everyone must follow.

A GraphQL API works differently from a standard API. All requests are `POST` requests to a single endpoint - typically `https://targetdomain.com/graphql`. Queries are passed as a payload, and are very specific (which can be difficult to deal with). Additionally, GraphQL is a language in of itself; therefore, one must at least skim their documentation for a light understanding of the technology in order to be able to effectively scrape an API that uses it.
A REST API usually consists of many so-called endpoints, to which you can send your requests. The responses provide information about various resources, such as users or products. Examples of typical REST API requests:

> **Note:** these concepts will be covered more in-depth later on, but it is good to be aware of them now.
```text
GET https://api.example.com/users/123
GET https://api.example.com/comments/abc123?limit=100
POST https://api.example.com/orders
```

In a GraphQL API, all requests are `POST` and point to a single URL, typically something like `https://api.example.com/graphql`. To get data, you send along a query in the GraphQL query language, optionally with variables. An example of such a query:

```graphql
query($number_of_repos: Int!) {
  viewer {
    name
    repositories(last: $number_of_repos) {
      nodes {
        name
      }
    }
  }
}
```
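
To illustrate, a sketch of how such a query might be sent from JavaScript (the endpoint URL is a placeholder, and a real API would usually also require authentication headers):

```js
const query = `
  query($number_of_repos: Int!) {
    viewer {
      name
      repositories(last: $number_of_repos) {
        nodes { name }
      }
    }
  }
`;

// GraphQL requests are POSTs with the query and variables in a JSON body.
const response = await fetch('https://api.example.com/graphql', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ query, variables: { number_of_repos: 3 } }),
});

const { data } = await response.json();
console.log(data);
```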

## Advantages of API scraping {#advantages}

@@ -60,7 +79,7 @@ Many APIs will require the session cookie, an API key, or some other special val

### 2. Potential overhead

For complex APIs that require certain headers and/or payloads in order to make a successful request, return encoded data, have rate limits, or that use modern technologies such as GraphQL, there can be a slight overhead in figuring out how to utilize them in a scraper.
For complex APIs that require certain headers and/or payloads in order to make a successful request, return encoded data, have rate limits, or that use GraphQL, there can be a slight overhead in figuring out how to utilize them in a scraper.

<!-- These will be articles in the future -->

@@ -387,9 +387,9 @@ If we remember correctly, Facebook has 115 GitHub repositories (at the time of w

## Lazy-loading pagination {#lazy-loading-pagination}

Though page number-based pagination is quite straightforward to automate the pagination process with, and though it is still an extremely common implementation, [lazy-loading](https://en.wikipedia.org/wiki/Lazy_loading) is becoming extremely popular on the modern web, which makes it an important and relevant topic to discuss.
Pagination based on page numbers is straightforward to automate, but many websites use [lazy-loading](https://en.wikipedia.org/wiki/Lazy_loading) instead.

> Note that on websites with lazy-loading pagination, [API scraping](../../api_scraping/index.md) is usually a viable option, and a much better one due to reliability and performance.
> On websites with lazy-loading pagination, if [API scraping](../../api_scraping/index.md) is a viable option, it is a much better approach due to reliability and performance.
Take a moment to look at and scroll through the women's clothing section [on About You's website](https://www.aboutyou.com/c/women/clothing-20204). Notice that the items are loaded as you scroll, and that there are no page numbers. Because of how drastically different this pagination implementation is from the previous one, it also requires a different workflow to scrape.

@@ -50,7 +50,7 @@ This means that when using TS (a popular acronym for "TypeScript") on a large pr
1. The ability to **optionally** [statically type](https://developer.mozilla.org/en-US/docs/Glossary/Static_typing) your variables and functions.
2. [Type Inference](https://www.typescriptlang.org/docs/handbook/type-inference.html), which provides you the benefits of using types, but without having to actually statically type anything. For example, if you create a variable like this: `let num = 5`, TypeScript will automatically infer that `num` is of a **number** type.
3. Access to the newest features in JavaScript before they are officially supported everywhere.
4. Fantastic support with [IntelliSense](https://en.wikipedia.org/wiki/Intelligent_code_completion) and epic autocomplete when writing functions, accessing object properties, etc. Most modern IDEs have TypeScript support.
4. Fantastic support with [IntelliSense](https://en.wikipedia.org/wiki/Intelligent_code_completion) and epic autocomplete when writing functions, accessing object properties, etc. Most IDEs have TypeScript support.
5. Access to exclusive TypeScript features such as [Enums](https://www.typescriptlang.org/docs/handbook/enums.html).
<!-- and [Decorators](https://www.typescriptlang.org/docs/handbook/decorators.html). -->

@@ -29,7 +29,7 @@ Define any [constant variables](https://softwareengineering.stackexchange.com/qu

> If you have a whole lot of constant variables, they can be in a folder named **constants** organized into different files.
### Use modern ES6 JavaScript {#use-es6}
### Use ES6 JavaScript {#use-es6}

If you're writing your scraper in JavaScript, use [ES6](https://www.w3schools.com/js/js_es6.asp) features and ditch the old ones which they replace. This means using `const` and `let` instead of `var`, `includes` instead of `indexOf`, etc.
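
An illustrative comparison (hypothetical variable names):

```js
// Instead of var and indexOf...
var colors = ['red', 'green', 'blue'];
var hasRed = colors.indexOf('red') !== -1;

// ...prefer const/let and includes.
const sizes = ['small', 'medium', 'large'];
const hasSmall = sizes.includes('small');
```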

@@ -18,7 +18,7 @@ A headless browser is simply a browser that runs without a user interface (UI).

## Building a Playwright scraper {#playwright-scraper}

> We'll focus on Playwright today, as it was developed by the same team that created Puppeteer, and it's a more modern library with extra features and better documentation.
> Our focus will be on Playwright, which boasts additional features and better documentation. Notably, it originates from the same team responsible for Puppeteer.
Building a Playwright scraper with Crawlee is extremely easy. To show you how easy it really is, we'll reuse the Cheerio scraper code from the previous lesson. By changing only a few lines of code, we'll turn it into a full headless scraper.
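
As a rough preview, a minimal skeleton of such a scraper might look like this (the start URL is a placeholder; the lesson's actual code builds on the Cheerio example):

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ page, request }) {
        // page is a Playwright Page, so the full browser API is available here.
        const title = await page.title();
        console.log(`${title} (${request.url})`);
    },
});

await crawler.run(['https://example.com']);
```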

@@ -72,7 +72,7 @@ If some of the code is hard for you to understand, please review the [Basics of

:::caution

We are using modern JS syntax like `import` statements and top-level `await`. If you see errors like Cannot use import outside of a module. Please review the [Project setup lesson](../data_extraction/project_setup.md#modern-javascript) where we explain how to enable those features.
We are using JavaScript features like `import` statements and top-level `await`. If you see errors like _Cannot use import outside of a module_, please review the [Project setup lesson](../data_extraction/project_setup.md#modern-javascript), where we explain how to enable those features.

:::
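
One common way to enable these features is to declare the project as an ES module in `package.json` (a minimal example; the linked lesson covers the full setup):

```json
{
  "type": "module"
}
```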

