Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mojo::DOM doesn't handle nested descendant selectors correctly #2025

Open
mauke opened this issue Jan 23, 2023 · 5 comments
Open

Mojo::DOM doesn't handle nested descendant selectors correctly #2025

mauke opened this issue Jan 23, 2023 · 5 comments

Comments

@mauke
Copy link
Contributor

mauke commented Jan 23, 2023

  • Mojolicious version: 9.31
  • Perl version: v5.36.0
  • Operating system: Linux (Ubuntu 22.04)

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('<div> ' x 100 . '</div> ' x 100);

is_deeply
    $dom->find('#gobbledygook * * * *')->map(sub { $_->to_string })->to_array,
    [],
    'non-existent elements have no descendants';

done_testing;

Expected behavior

Test passes.

Since the document does not contain an element with id="gobbledygook", the search should fail immediately.

Actual behavior

Test hangs; doesn't finish.

@kraih
Copy link
Member

kraih commented Jan 29, 2023

Selectors match right to left, so the slow matching behaviour makes sense. While the performance could be better, i'm not gonna call this a bug.

@mauke
Copy link
Contributor Author

mauke commented Jan 29, 2023

I call it a bug for two reasons:

  1. It's counterintuitive. As a user, I would expect a selector that starts with #foo to only examine elements under id="foo" and do no work otherwise. In Mojo::DOM, the situation is reversed: If the ID is found, the code stops quickly; it only hangs if there is no such ID in the document.
  2. There is a potential for DoS in software like NewsExtractor: https://metacpan.org/release/GUGOD/NewsExtractor-v0.45.0/source/lib/NewsExtractor/GenericExtractor.pm#L110
        elsif ($guess = $dom->at('#details_block .left .date, .article_header > .author > span:last-child')) {
    If someone were to (accidentally or on purpose) feed that code a document structure like '<div class="left date">' x 1000 . '</div>' x 1000, it would make the code spin for a long time (at least several minutes), burning CPU.

So: If you take seemingly harmless code using a selector like #details_block .left .date, you can't run it on untrusted input because it could cause a denial of service. I feel that should be at least pointed out in the documentation.

@kraih
Copy link
Member

kraih commented Jan 29, 2023

Right to left matching is actually standard, and used just the same in browsers. If you get better results there it merely means that they have additional optimisations for the case.

@mauke
Copy link
Contributor Author

mauke commented Jan 29, 2023

I mean, I get better results both right-to-left and left-to-right: https://github.com/mauke/HTML-Blitz/blob/9f0496ac28260675f8bbaabb68ddad477c09231a/t/selector-combinators.t#L155-L169

(The "optimisation" here is not to use backtracking; cf. https://research.swtch.com/glob.)

@kraih
Copy link
Member

kraih commented Jan 29, 2023

PRs with backtracking optimisations would of course be very welcome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants