
Parsing large HTML files uses way too much memory #31

Open
mischov opened this issue Jan 25, 2018 · 3 comments

Comments

@mischov
Owner

mischov commented Jan 25, 2018

I have received reports that parsing large (~18MB) HTML files resulted in completely insane memory use (1GB+).

This needs to be confirmed, and if so, addressed.

@mischov
Owner Author

mischov commented Jan 27, 2018

I have at least partially confirmed high memory use for large HTML files.

Summary

The in-memory size (in machine words) of the result of parsing HTML can be significantly larger than the source HTML on disk, and the ratio grows with node density.

A 19MB HTML file is parsed by Meeseeks into an estimated 280MB document (excluding binary data size), while a 500KB HTML file parses to an estimated 3MB (excluding binary data size).

This increase is less significant when parsing into a tuple-tree: Html5ever (the Elixir library) parses the same files to tuple-trees of around 112MB and 1.3MB (excluding binary data size) respectively.

Data

For a 500KB (on disk) HTML file parsing to about 11,000 nodes, :erts_debug.flat_size/1 returns the following when used on the result of parsing:

  • Meeseeks (parse/1): ~375,000 words (3MB at 8 bytes per word on a 64-bit system, or 272 bytes/node)
  • Html5ever (flat_parse/1): ~350,000 words (2.8MB, or 254 bytes/node)
  • Html5ever (parse/1): ~160,000 words (1.3MB, or 116 bytes/node)
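For reference, the word-to-byte conversion used in those figures can be sketched like this (a sketch, not Meeseeks code — the module name is hypothetical, and the word and node counts are the measurements reported above):

```elixir
# Sketch: converting :erts_debug.flat_size/1 results (machine words)
# into the byte figures quoted above. On a 64-bit system a word is
# 8 bytes (:erlang.system_info(:wordsize)).
defmodule FlatSizeMath do
  @word_size 8

  def bytes(words), do: words * @word_size
  def bytes_per_node(words, nodes), do: div(bytes(words), nodes)
end

# The 500KB document: ~375,000 words across ~11,000 nodes.
FlatSizeMath.bytes(375_000)                  # => 3_000_000 (≈3MB)
FlatSizeMath.bytes_per_node(375_000, 11_000) # => 272
```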

This excludes most of the binary data (binaries larger than 64 bytes are stored off-heap and not counted), but already gives a sense of how much more memory a large HTML document uses once parsed.

Both implementations that use flat-maps to represent the HTML document weigh in at ~6x the original document size, and even the tuple-tree representation weighs in at ~2.5x.

So how does this scale up to even larger documents? To find out I created a file duplicating the largest portion of the original document (a big <table>) 100 times, leaving me with a 19MB (on disk) HTML document that parsed to around 1,030,000 nodes.

  • Meeseeks (parse/1): ~35,000,000 words (280MB, or 271 bytes/node)
  • Html5ever (flat_parse/1): ~30,000,000 words (240MB, or 233 bytes/node)
  • Html5ever (parse/1): ~14,000,000 words (112MB, or 109 bytes/node)

The bytes per node have scaled close to linearly, but given the density of nodes, the flat-map representations are ~14x and ~12x the original document size, while the tuple-tree representation is ~5.5x.

In both examples, the flat-map representations seem to be about 2.5x the size of the tuple-tree representation.

Memory Pressure

Sampling total memory allocation with :erlang.memory/0 before and after parsing reflects a difference in memory efficiency between tuple-trees (more efficient) and flat-maps similar to that suggested by word size, though the Html5ever flat-map seems particularly inefficient.

500KB File:

  • Meeseeks (parse/1): ~+8.5MB
  • Html5ever (flat_parse/1): ~+18.5MB
  • Html5ever (parse/1): ~+3MB

19MB File:

  • Meeseeks (parse/1): ~+775MB
  • Html5ever (flat_parse/1): ~+1300MB
  • Html5ever (parse/1): ~+225MB
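The sampling procedure can be sketched as follows (a hypothetical helper, not the exact code used; any parse call can be passed in, and the deltas still vary somewhat run to run):

```elixir
# Sketch: measure total allocated memory before and after running a
# function. Forcing a garbage collection first reduces noise from
# earlier allocations.
defmodule MemSample do
  def delta_mb(fun) do
    :erlang.garbage_collect()
    before = :erlang.memory(:total)
    result = fun.()
    after_ = :erlang.memory(:total)
    {result, (after_ - before) / 1_048_576}
  end
end

# Example (path and parser call are placeholders):
# {_doc, mb} = MemSample.delta_mb(fn -> Meeseeks.parse(File.read!("big.html")) end)
```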

Discussion

It is not unexpected that the flat-map representation would be bigger than the tuple-tree representation, since flat-maps both make explicit the relationship between nodes that a tuple-tree holds implicitly (more data), and represent nodes using maps instead of tuples (less memory efficient).

It is unfortunate, however, that parsing a 19MB (albeit node-dense) HTML file can yield ~300-800MB of memory usage.

Next Steps

I'm not sure yet. I'm open to suggestions.

@michalmuskala

Regarding the difference between maps and tuples one solution is to just wait. OTP 21 will include an optimisation that should make structs (and all other maps where keys are statically known at compile-time) have a comparable size to tuples. Tuples consume 1 + n words, where n is the number of elements. Maps consume (assuming it's freshly created each time) 4 + 2n words. With the optimisation, static maps will consume just 3 + n, so the difference compared to tuples is not that big. This should be achievable from NIFs today - create a "prototype" map and then call map_update to share some elements of the structure.
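As a worked example of those word costs (formulas taken from the comment above; the five-field node is just an illustration):

```elixir
# Word-cost formulas from the comment above, for a value with n fields.
defmodule WordCost do
  def tuple(n), do: 1 + n          # tuples: 1 + n words
  def map_fresh(n), do: 4 + 2 * n  # freshly-created map: 4 + 2n words
  def map_static(n), do: 3 + n     # OTP 21 static map: 3 + n words
end

WordCost.tuple(5)       # => 6
WordCost.map_fresh(5)   # => 14
WordCost.map_static(5)  # => 8
```

So for a node-like value with five fields, the OTP 21 static map costs 8 words instead of 14, against 6 for the tuple.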

It would be nice to have some API allowing updating multiple keys at once - I think this is something that should be consulted with the OTP team. This should significantly decrease the amount of garbage created in the process.

When it comes to optimising the size of the tree itself, I think the primary way would be through increasing sharing. There are often a lot of empty nodes (such as br tags) or nodes with common attribute values. I think it might be a good idea to keep a cache of empty nodes and a cache of common attributes. Values shared especially frequently would be things like class, or name and type for inputs, and boolean attributes. This would allow reusing the values instead of recreating them each time.
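A minimal sketch of the attribute-cache idea on the Elixir side (module name and cached entries are hypothetical — in Meeseeks the cache would more likely live in the NIF): common attribute tuples are kept as compile-time literals, so every node that looks one up reuses the same shared term instead of allocating a fresh copy.

```elixir
# Sketch: a cache of frequently shared attribute terms. Terms built
# from module attributes are compile-time literals, so repeated
# lookups return the same term rather than rebuilding it per node.
defmodule AttrCache do
  @common %{
    {"type", "text"} => {"type", "text"},
    {"type", "checkbox"} => {"type", "checkbox"},
    {"class", ""} => {"class", ""}
  }

  # Reuse a cached term when one exists; otherwise build it fresh.
  def attribute(name, value) do
    Map.get(@common, {name, value}, {name, value})
  end
end
```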

@mischov
Owner Author

mischov commented Feb 28, 2019

Two aspects of OTP 21 have combined to make the situation better, though still not ideal.

  1. As @michalmuskala alluded to, static maps are smaller in OTP 21: :erts_debug.flat_size for the 19MB HTML file above went down almost 20%, from ~35,000,000 words to ~29,000,000 words.

  2. OTP 21 introduced enif_make_map_from_arrays, which as of v0.11.0 I use to create all the structs and maps in one step rather than in many steps (each of which created its own intermediate version of the map). Combined with smaller static maps, this means the memory pressure from parsing the 19MB HTML file went down around 50%, from ~775MB to ~375MB.

I also gave Michał's term cache idea a shot, but my first naive solution didn't show any positive results. I'll probably explore it more in the future.

All in all, OTP 21 is a pretty big win for Meeseeks memory usage.
