
Commit

fix: be specific about 'something'
honzajavorek committed Apr 24, 2024
1 parent 318025f commit 6c68ac1
Showing 1 changed file with 1 addition and 1 deletion.
@@ -20,7 +20,7 @@ In this section, we will take a look at moving between web pages, which we call

## How do you crawl? {#how-to-crawl}

-Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](../data_extraction/index.md) course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case something happens to our scraper, we won't have to extract them again. And, finally, we will visit those URLs one by one.
+Crawling websites is a fairly straightforward process. We'll start by opening the first web page and extracting all the links (URLs) that lead to the other pages we want to visit. To do that, we'll use the skills learned in the [Basics of data extraction](../data_extraction/index.md) course. We'll add some extra filtering to make sure we only get the correct URLs. Then, we'll save those URLs, so in case our scraper crashes with an error, we won't have to extract them again. And, finally, we will visit those URLs one by one.
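
To make the steps in the paragraph above concrete, here is a minimal sketch of that crawl in Python, assuming `httpx` and BeautifulSoup as the tooling; the starting URL and the `.product a` selector are hypothetical placeholders, not part of the course text:

```python
import json
from urllib.parse import urljoin

import httpx
from bs4 import BeautifulSoup

LISTING_URL = "https://example.com/products"  # hypothetical starting page

# 1. Open the first page and parse it
response = httpx.get(LISTING_URL)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# 2. Extract the links, filtering so we keep only the correct URLs
urls = []
for link in soup.select(".product a"):  # hypothetical selector
    href = link.get("href")
    if href and "/product/" in href:  # keep only product detail pages
        urls.append(urljoin(LISTING_URL, href))

# 3. Save the URLs, so a crash doesn't force us to extract them again
with open("urls.json", "w") as file:
    json.dump(urls, file)

# 4. Visit the saved URLs one by one
for url in urls:
    detail = httpx.get(url)
    detail.raise_for_status()
    # ...extract data from each detail page here...
```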

At any point, we can extract URLs, data, or both. Crawling can be separate from data extraction, but it's not a requirement and, in most projects, it's actually easier and faster to do both at the same time. To summarize, it goes like this:

