Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

innerText implementation #1245

Open
vsemozhetbyt opened this issue Sep 25, 2015 · 59 comments · May be fixed by #3584
Open

innerText implementation #1245

vsemozhetbyt opened this issue Sep 25, 2015 · 59 comments · May be fixed by #3584
Labels
feature layout Blocked on implementing a layout engine

Comments

@vsemozhetbyt
Copy link
Contributor

jsdom is a great tool for web scraping. However the textContent is a very inconvenient way to get readable text for html2text conversion.

There is a wonderful article about usefulness of negligible innerText in many cases:

http://perfectionkills.com/the-poor-misunderstood-innerText/

The author suggests getSelection().toString() as a very slow workaround, but getSelection is not implemented in the jsdom yet.

Could you consider an implementing of the innerText in the jsdom? The author has done a great exploration about it, he has even added a simple spec at the end.

@vsemozhetbyt
Copy link
Contributor Author

And what a pity that rangy Selection and innerText library is not compatible with jsdom: timdown/rangy#348

@domenic
Copy link
Member

domenic commented Sep 25, 2015

So, innerText is not standard, and not implemented in at least one major engine (Firefox). Without a standard, I don't think we should implement it.

Edit by @TimothyGu (April 24, 2023): this comment was accurate when it was written (in 2015!) but is no longer accurate, since a spec has been created. We welcome any contribution to implement that feature in jsdom.

@Sebmaster
Copy link
Member

Looks like there's some movement in this whole thing with a draft spec here. See also all the references. There are no issues on the repo though, so I wonder how complete it already is / how quick progress will be.

@vsemozhetbyt
Copy link
Contributor Author

Firefox has implemented: https://bugzilla.mozilla.org/show_bug.cgi?id=264412

WHATWG semms to approve: whatwg/compat#5 (comment)

@inikulin
Copy link
Contributor

From the spec it's seems like we can't implement innerText properly without basic layout support.

@domenic
Copy link
Member

domenic commented Jan 25, 2016

Yeah, this is not really going to be implementable in jsdom anyway, without a lot of infrastructure work... nobody get their hopes up :(.

@vsemozhetbyt
Copy link
Contributor Author

As to layout support requirement: rocallahan/innerText-spec#2

r4j4h added a commit to r4j4h/jasmine-phantom-utils that referenced this issue Jun 17, 2016
…erText` usage to `textContent` based on [this discussion](jsdom/jsdom#1245). Added tests for many evaluators.
@domenic domenic added the feature label Jul 2, 2016
@vsemozhetbyt
Copy link
Contributor Author

Is there any plan to implement it because of WHATWG adoption?

@domenic
Copy link
Member

domenic commented Aug 27, 2016

Yeah... Although the spec requires a lot of stuff jsdom doesn't have, around CSS boxes :(. Not sure what to do.

@vsemozhetbyt
Copy link
Contributor Author

Is there any lib for this to plug along with jsdom?

@snuggs
Copy link
Contributor

snuggs commented Aug 29, 2016

@domenic care to drop some knowledge on why this is such an infrastructure overhaul? We thought the 800lb gorilla in the room would leave lo-key. But looks like it's not going anywhere. As you know have been wrapping my head around the innards of jsdom. Where would be a great place in the repo to start reviewing code to a jsdom newb?

Thanks in advance 🙏 /cc @vsemozhetbyt

@dmethvin
Copy link
Contributor

The primary issue is the fact that innerText leans on the layout engine for guidance, and jsdom has no layout engine. See https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute and
http://perfectionkills.com/the-poor-misunderstood-innerText/ . From the second link:

Notice how innerText almost precisely represents exactly how text appears on the page. textContent, on the other hand, does something strange — it ignores newlines created by
and around styled-as-block elements ( in this case). But it preserves spaces as they are defined in the markup.

@vsemozhetbyt
Copy link
Contributor Author

Still out of scope and no workaround?

@coreh
Copy link

coreh commented May 24, 2017

Apparently the spec says:

If this element is not being rendered, or if the user agent is a non-CSS user agent, [emphasis added] then return the same value as the textContent IDL attribute on this element.

I think a workaround would be then to simply return textContent.

@domenic
Copy link
Member

domenic commented May 25, 2017

We implement enough CSS that I don't think that applies. We just don't implement the layout parts...

@Suzii
Copy link

Suzii commented Jan 24, 2018

Hi guys, any news on this one?

@Bnaya
Copy link

Bnaya commented Jan 25, 2018

Just use headless chrome :)

truongsinh added a commit to truongsinh/Semantic-UI-React that referenced this issue Apr 10, 2018
@Zirro Zirro added the layout Blocked on implementing a layout engine label May 27, 2018
@Janpot
Copy link

Janpot commented Aug 5, 2018

@domenic from that spec that @coreh mentioned:
https://html.spec.whatwg.org/multipage/dom.html#the-innertext-idl-attribute

If this element is not being rendered, or if the user agent is a non-CSS user agent, then return the same value as the textContent IDL attribute on this element.

https://html.spec.whatwg.org/multipage/rendering.html#being-rendered

An element is being rendered if it has any associated CSS layout boxes, SVG layout boxes, or some equivalent in other styling languages.

If jsdom doesn't implement the layout parts, doesn't that mean "not being rendered" applies?

@andrewcartwright1
Copy link

andrewcartwright1 commented Nov 5, 2023

Not the nicest of way's to do it but I had some cases where sometimes HTMLElement would work and sometimes it wouldnt.. I couldnt figure out why, so i changed it to be even more hacky. I mocked it using jest spies.

innerText.ts

import sanitizeHtml from 'sanitize-html';

let spyGet;
let spySet;

export const createSpies = () => {
  Object.defineProperty(Object.prototype, 'innerText', {
    get: () => undefined,
    set: () => undefined,
    configurable: true
  });

  // eslint-disable-next-line @typescript-eslint/ban-types
  spyGet = jest.spyOn(Object.prototype, 'innerText' as keyof Object, 'get');
  spyGet.mockImplementation(function () {
    if (this.textContent === undefined) {
      return undefined;
    }

    return sanitizeHtml(this.textContent, {
      allowedTags: [],
      allowedAttributes: {}
    })
      .split('\\n')
      .filter((text: string) => text && !text.match(/^\\s+$/))
      .map((text: string) => text.trim())
      .join('\\n');
  });

  // eslint-disable-next-line @typescript-eslint/ban-types
  spySet = jest.spyOn(Object.prototype, 'innerText' as keyof Object, 'set');
  spySet.mockImplementation(function (value: unknown) {
    this.textContent = value;
  });
};

export const resetSpies = () => {
  spyGet?.mockRestore();
  spySet?.mockRestore();
};

Then you can call it using createSpies and resetSpies in your jest tests.

You can use it without sanitize-html, I've used that to strip out any html

@macgyver
Copy link

macgyver commented Nov 5, 2023 via email

@DanKaplanSES
Copy link
Contributor

DanKaplanSES commented Nov 30, 2023

Can this please be documented as a limitation in the README? Speaking of which, is there a link that lists all the JSDOM limitations? If so, can this link be added to the README? If it's already there, it seems like it should be found on the "Unimplemented parts" section.

EDIT: Here's my attempt at documenting this: #3631

DanKaplanSES added a commit to DanKaplanSES/jsdom that referenced this issue Nov 30, 2023
@DanKaplanSES
Copy link
Contributor

DanKaplanSES commented Nov 30, 2023

It could at least fail with an exception instead of quietly returning undefined.

@AntonKhorev that makes sense indeed, but it would be a breaking change. Some people may not have innerText explicitly in their tests, but dependencies may use it internally.

In my opinion, that's even scarier to think about. It's hard enough to troubleshoot when it's in your own project, imagine how hard it would be when you're degrees of separation away from the code that caused the problem.

I'm not arguing for a breaking change, but maybe:

  1. Document it
  2. Provide some configuration that warns developers when it's depended on
  3. Provide some configuration that throws an error when it's depended on
  4. Use "the next best thing" as a fake response, such as textContent.

IMO, any one of these would improve the current situation.

rrahir added a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

Task: 3754944
robodoo pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

closes #3894

Task: 3754944
Signed-off-by: Lucas Lefèvre (lul) <lul@odoo.com>
rrahir added a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

Task: 3754944
X-original-commit: a11b5dc
robodoo pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

closes #3915

Task: 3754944
X-original-commit: a11b5dc
Signed-off-by: Lucas Lefèvre (lul) <lul@odoo.com>
Signed-off-by: Rémi Rahir (rar) <rar@odoo.com>
rrahir added a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

Task: 3754944
X-original-commit: 91d7a19
robodoo pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

closes #3919

Task: 3754944
X-original-commit: 91d7a19
Signed-off-by: Lucas Lefèvre (lul) <lul@odoo.com>
Signed-off-by: Rémi Rahir (rar) <rar@odoo.com>
fw-bot pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

Task: 3754944
X-original-commit: ed7d3fa
fw-bot pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

Task: 3754944
X-original-commit: ed7d3fa
fw-bot pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

Task: 3754944
X-original-commit: ed7d3fa
robodoo pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

closes #3922

Task: 3754944
X-original-commit: ed7d3fa
Signed-off-by: Lucas Lefèvre (lul) <lul@odoo.com>
Signed-off-by: Rémi Rahir (rar) <rar@odoo.com>
robodoo pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

closes #3921

Task: 3754944
X-original-commit: ed7d3fa
Signed-off-by: Lucas Lefèvre (lul) <lul@odoo.com>
Signed-off-by: Rémi Rahir (rar) <rar@odoo.com>
robodoo pushed a commit to odoo/o-spreadsheet that referenced this issue Mar 25, 2024
The issue addressed in PR #3351 unfortunately used a
`contentEditable` value that is not compatible with Firefox [1]

This revision implements the first strategy explored by the
aforementioned PR by removing the full format from  the contenteditable
span when we stop the edition.

Testing required to define some properties as JSdom does not properly
support the implementation of innerText (see [2])

[1]: https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/contenteditable#browser_compatibility
[2]: jsdom/jsdom#1245

closes #3920

Task: 3754944
X-original-commit: ed7d3fa
Signed-off-by: Lucas Lefèvre (lul) <lul@odoo.com>
Signed-off-by: Rémi Rahir (rar) <rar@odoo.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature layout Blocked on implementing a layout engine
Projects
None yet
Development

Successfully merging a pull request may close this issue.