Test htmldate on further web pages and report bugs #8

Open
adbar opened this issue Jan 3, 2020 · 15 comments
Labels: good first issue, up for grabs

Comments

@adbar
Owner

adbar commented Jan 3, 2020

I have mostly tested htmldate on a set of English, German and French web pages I had run into by surfing or during web crawls. There are definitely further web pages and cases in other languages for which the extraction of a date doesn't work so far.

Please install the dateparser library beforehand, as it significantly extends linguistic coverage: pip install -U dateparser (or pip3 install -U dateparser), or pip install -U htmldate[all].

Corresponding bug reports can either be filed as a list in an issue like this one or in the code as XPath expressions in core.py (see DATE_EXPRESSIONS and ADDITIONAL_EXPRESSIONS).

Thanks!

adbar added the labels "good first issue" and "up for grabs" on Jan 3, 2020
@adbar
Owner Author

adbar commented Sep 16, 2021

@rahulbot
Contributor

rahulbot commented Jul 21, 2022

@adbar
Owner Author

adbar commented Jul 21, 2022

The first example is especially tricky: the date in the right column is tagged as a proper date in the HTML, whereas the date in the main content isn't.

@kinoute

kinoute commented Aug 3, 2022

from htmldate import find_date
import requests
resp = requests.get('https://www.lesechos.fr/1991/01/saddam-hussein-menace-larabie-saoudite-939083')

find_date(
  resp.content.decode(errors='ignore'),
  extensive_search=True,
  outputformat='%Y-%m-%d %H:%M:%S',
)
Result: 2022-07-26 00:00:00

But in the HTML source code there is a meta entry with the correct date:

<meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00"/>

I thought htmldate would look at this first. Am I missing something?
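
For reference, the value in that meta entry is a plain ISO 8601 timestamp, so it can be checked independently of htmldate. The following is a minimal sketch using only the standard library; the class and function names are hypothetical and this is not how htmldate itself reads metadata.

```python
from datetime import datetime
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Remember the content of the first article:published_time meta tag."""

    def __init__(self):
        super().__init__()
        self.published = None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        if tag != "meta" or self.published is not None:
            return
        attributes = dict(attrs)
        if attributes.get("property") == "article:published_time":
            self.published = attributes.get("content")


def published_time(html):
    """Return the published time as a datetime, or None if the tag is absent."""
    parser = MetaDateParser()
    parser.feed(html)
    if parser.published is None:
        return None
    return datetime.fromisoformat(parser.published)


sample = (
    '<meta data-rh="true" property="article:published_time" '
    'content="1991-01-02T01:01:00+01:00"/>'
)
print(published_time(sample))
```

A check like this only confirms that the markup carries a parseable date; it says nothing about which candidates htmldate accepts or rejects.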

@adbar
Owner Author

adbar commented Aug 3, 2022

Hi @kinoute, htmldate considers that the date 1991-01-02 isn't valid. You can try to set the parameter min_date in find_date() to change this, e.g. min_date="1990-01-01".
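
To illustrate the idea of a date window, here is a rough sketch of a lower/upper bound check written with the standard library only. It is not htmldate's validator: the function name is hypothetical, and the default lower bound of 1995-01-01 is an assumption made for the example.

```python
from datetime import datetime

# Assumption for illustration: a default earliest acceptable date.
DEFAULT_MIN_DATE = datetime(1995, 1, 1)


def is_plausible(candidate, min_date=None, max_date=None):
    """Check whether a YYYY-MM-DD string falls inside the accepted window."""
    earliest = datetime.strptime(min_date, "%Y-%m-%d") if min_date else DEFAULT_MIN_DATE
    latest = datetime.strptime(max_date, "%Y-%m-%d") if max_date else datetime.now()
    parsed = datetime.strptime(candidate, "%Y-%m-%d")
    return earliest <= parsed <= latest


# Rejected under the assumed default bound, accepted once the bound is lowered.
print(is_plausible("1991-01-02"))
print(is_plausible("1991-01-02", min_date="1990-01-01"))
```

With a window like this, a 1991 date is only accepted when the lower bound is moved back explicitly, which is the effect min_date is meant to have.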

@kinoute

kinoute commented Aug 3, 2022

@adbar It still doesn't work with your min_date

Here is the debugging without the min_date:

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'

With min_date at "1990-01-01":

DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:published_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.extractors:custom parse test: 1991-01-02T01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 01:01:00+01:00
DEBUG:htmldate.validators:date not valid: 1991-01-02 00:00:00
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:examining meta property: <meta data-rh="true" property="article:modified_time" content="1991-01-02T01:01:00+01:00">
DEBUG:htmldate.validators:date not valid: 1991-01-02
DEBUG:htmldate.core:analyzing (HTML): <footer class="sc-1lhe64-3 kPYMmr"><div class="sc-123ocby-3 fjTtGI"><div class="sc-aamjrj-0 sc-15kkm
DEBUG:htmldate.extractors:custom parse test: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:send to external parser: Les Echos1991Janvier 1991
DEBUG:htmldate.extractors:found partial date in URL: /1991/01//01
DEBUG:htmldate.validators:date not valid: 1991-01-01 00:00:00
DEBUG:htmldate.core:extensive search started
DEBUG:htmldate.extractors:custom parse test: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.extractors:send to external parser: Publié le 2 janv. 1991 à 1:01
DEBUG:htmldate.core:looking for copyright/footer information
DEBUG:htmldate.core:3 components
DEBUG:htmldate.validators:no potential year: 1991-01-02
DEBUG:htmldate.validators:no potential year: 1991-01-31
DEBUG:htmldate.core:firstselect: [('2022-07-26', 22), ('2022-07-25', 6), ('2020-01-29', 2), ('2022-06-28', 2)]
DEBUG:htmldate.core:bestones: [('2022-07-26', 22), ('2022-07-25', 6)]
DEBUG:htmldate.validators:date found for pattern "re.compile('\\D([0-9]{4}[/.-][0-9]{2}[/.-][0-9]{2})\\D')": 2022-07-26
'2022-07-26 00:00:00'
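
Debug output in the DEBUG:htmldate.core:... format shown above is what Python's standard logging module prints with its default configuration. A minimal sketch of how to switch it on before calling find_date(), assuming the loggers are all children of "htmldate" as the log lines suggest:

```python
import logging
import sys

# Send log records to stderr using the default "LEVEL:name:message" format.
logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)

# Set the level on the package logger explicitly, in case the root logger
# was already configured elsewhere and basicConfig() had no effect.
logger = logging.getLogger("htmldate")
logger.setLevel(logging.DEBUG)
```

Any find_date() call made afterwards in the same process should then report which elements and patterns it examines.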

@adbar
Owner Author

adbar commented Aug 4, 2022

@kinoute Thanks for pointing that out, it's a bug.

@dideler

dideler commented Aug 22, 2023

htmldate==1.2.3 used in https://github.com/ofou/graham-essays is incorrectly extracting dates. See output. The essays have MONTH YEAR below the title but that's not being picked up. Example: http://www.paulgraham.com/greatwork.html

In a fork I tried updating to the latest version and it has the same issue.

@adbar
Owner Author

adbar commented Aug 30, 2023

@dideler Thanks, the year is detected correctly, but not the month, which is contained in a <font> tag. I'll see what I can do.
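
As a stopgap for pages that only carry a MONTH YEAR line, the pattern can be matched by hand. This is a small standard-library sketch, not htmldate's approach; the function name and the sample markup are made up for illustration and only English month names are handled.

```python
import re

MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
MONTH_YEAR = re.compile(r"\b(" + "|".join(MONTHS) + r")\s+(\d{4})\b")


def find_month_year(html):
    """Return the first 'Month YYYY' found in the text as 'YYYY-MM', else None."""
    # Crude tag stripping, good enough for a short illustration.
    text = re.sub(r"<[^>]+>", " ", html)
    match = MONTH_YEAR.search(text)
    if match is None:
        return None
    name, year = match.groups()
    return f"{year}-{MONTHS.index(name) + 1:02d}"


# Hypothetical snippet: a date placed inside a <font> tag below the title.
snippet = '<font size="2" face="verdana">March 2021<br><br>Essay text.</font>'
print(find_month_year(snippet))
```

The result has month precision only, so a day would still have to be assumed by whoever consumes it.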

@stevesong

Thank you for this wonderful tool! It would be great to see this news source added.

Capacity Media e.g. https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity

    <div class="ArticlePage-datePublished">
            February 13, 2023 11:42 AM
    </div>
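
For what it's worth, the text in that element can be parsed directly with the standard library. A minimal sketch, assuming an English locale for the month name and AM/PM marker; the function name is hypothetical and the regular expression is tied to this one class name.

```python
import re
from datetime import datetime

# Capture the text inside the date element, ignoring surrounding whitespace.
BLOCK = re.compile(
    r'<div class="ArticlePage-datePublished">\s*(.*?)\s*</div>', re.DOTALL
)


def parse_published(html):
    """Return the publication datetime from the date element, or None."""
    match = BLOCK.search(html)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%B %d, %Y %I:%M %p")


sample = """
    <div class="ArticlePage-datePublished">
            February 13, 2023 11:42 AM
    </div>
"""
print(parse_published(sample))
```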

@adbar
Owner Author

adbar commented Jan 16, 2024

@stevesong It already works:

$ htmldate -u "https://web.archive.org/web/20240111084001/https://www.capacitymedia.com/article/2b9xz33jdzmp2r77gr280/news/tizeti-expansion-to-boost-nigerian-connectivity"
2023-02-13

@stevesong

stevesong commented Jan 16, 2024

Wow, thanks! I must have been using an older version. Does passing URLs through archive.org have a normalising effect on some websites? htmldate seems to work on the archive.org versions but not on the originals.

@adbar
Owner Author

adbar commented Jan 17, 2024

It's not supposed to normalize anything. I'm just using archived versions so that issues can be replicated at some point in the future.

@stevesong

Ok, understood, but there does appear to be something interesting happening there.

$ htmldate -u https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
# ERROR no valid result for url: https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/

$ htmldate -u https://web.archive.org/web/https://www.datacenterdynamics.com/en/news/africa-data-centres-breaks-ground-on-nairobi-expansion-in-kenya/
2023-01-20

@adbar
Owner Author

adbar commented Jan 17, 2024

I guess it's because the download fails: some websites restrict access for the download utility. See
https://trafilatura.readthedocs.io/en/latest/troubleshooting.html
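
One way to separate the download step from the extraction step is to fetch the HTML yourself and pass the string to find_date(), as in the snippet earlier in this thread. The sketch below uses urllib from the standard library; the User-Agent string and the helper names are hypothetical, and whether a given site accepts such a request is not something this example can guarantee.

```python
from urllib.request import Request, urlopen

# Hypothetical User-Agent value, chosen for illustration only.
USER_AGENT = "Mozilla/5.0 (compatible; date-check/0.1)"


def build_request(url):
    """Prepare a request that announces an explicit User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=30):
    """Download a page and decode it, replacing undecodable bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def date_from_url(url):
    """Fetch the page ourselves, then let htmldate work on the HTML string."""
    from htmldate import find_date  # third-party, imported only when needed

    return find_date(fetch_html(url))
```

If fetch_html() raises an HTTP error for the original URL but succeeds for the archived copy, the difference lies in the download rather than in the date extraction.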
