Content.Text empty despite response code OK and Content stream contains data #238

Open
seanarmstrong87 opened this issue Nov 28, 2022 · 0 comments

I am trying to crawl this page:

https://www.tzb-info.cz/kontakty

I pass it as validUri in the following code:

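        // Request the page directly, using the default crawl configuration.
        var pageRequester = new PageRequester(new CrawlConfiguration(), new WebContentExtractor());

        var crawledPage = await pageRequester.MakeRequestAsync(validUri).ConfigureAwait(false);

        // Log the requested URL and the numeric HTTP status code.
        Log.Logger.Information("{@Result}", new
        {
            url = crawledPage.Uri,
            status = Convert.ToInt32(crawledPage.HttpResponseMessage.StatusCode)
        });

        // Comes back empty for this page, even though the status is OK.
        return crawledPage.Content.Text;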

That website declares a less common charset, like this:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">

As a result, Content.Text is always empty even though the response code is successful.

If I try to read the response stream directly, I get this exception:

The character set provided in ContentType is invalid. Cannot read content as string using an invalid character set.

If I change the CharSet on the response manually, I can then read the stream:

        args.CrawledPage.HttpResponseMessage.Content.Headers.ContentType.CharSet = @"ISO-8859-1";

This is my workaround for now.
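
One caveat with relabeling the content as ISO-8859-1: bytes that differ between the two charsets would decode to the wrong characters (for example, 0xE8 is č in ISO-8859-2 but è in ISO-8859-1). A possible alternative is sketched below. It assumes the application runs on .NET Core or .NET 5+, where legacy code pages such as ISO-8859-2 are only available after registering CodePagesEncodingProvider; whether that is the actual cause of the empty Content.Text here is unconfirmed. Latin2Reader, Decode, and ReadAsync are hypothetical names used only for illustration and are not part of the crawler library.

```csharp
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Hypothetical helper, not part of the crawler library.
public static class Latin2Reader
{
    static Latin2Reader()
    {
        // Assumption: on .NET Core / .NET 5+ this registration is what makes
        // legacy code pages such as ISO-8859-2 resolvable by name.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Decodes raw response bytes using the given charset name.
    public static string Decode(byte[] body, string charset = "iso-8859-2")
    {
        return Encoding.GetEncoding(charset).GetString(body);
    }

    // Reads the body as bytes, bypassing the string reader that throws
    // on an unrecognized charset, and decodes it as ISO-8859-2.
    public static async Task<string> ReadAsync(HttpResponseMessage response)
    {
        byte[] body = await response.Content.ReadAsByteArrayAsync().ConfigureAwait(false);
        return Decode(body);
    }
}
```

With something like this, the text could be obtained via Latin2Reader.ReadAsync(crawledPage.HttpResponseMessage) instead of overriding the ContentType header. If the registration alone turns out to be enough, calling Encoding.RegisterProvider once at application startup may already let the original code work unchanged.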

Is it a bug that the "iso-8859-2" charset is not interpreted correctly, or am I missing some configuration or setup needed to handle this charset?
