Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add gcgarbage collector support for StringViewArray and BinaryViewArray #5513

Open
Tracked by #5374
alamb opened this issue Mar 14, 2024 · 2 comments · May be fixed by #5707
Open
Tracked by #5374

Add gcgarbage collector support for StringViewArray and BinaryViewArray #5513

alamb opened this issue Mar 14, 2024 · 2 comments · May be fixed by #5707
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@alamb
Copy link
Contributor

alamb commented Mar 14, 2024

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
This is part of the larger project to implement StringViewArray -- see #5374

In #5481 we added support for StringViewArray and ByteViewArray.

This ticket tracks adding a gc method to StringViewArray and ByteViewArray

After calling filter or take on a StringViewArray or ByteViewArray the backing variable length buffer may be much larger than necessary to store the results

So before an array may look like the following with significant "garbage" space

                                       ┌──────┐                 
                                       │......│                 
                                       │......│                 
┌────────────────────┐       ┌ ─ ─ ─ ▶ │Data1 │   Large buffer  
│       View 1       │─ ─ ─ ─          │......│  with data that 
├────────────────────┤                 │......│ is not referred 
│       View 2       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data2 │ to by View 1 or 
└────────────────────┘                 │......│      View 2     
                                       │......│                 
   2 views, refer to                   │......│                 
  small portions of a                  └──────┘                 
     large buffer                                               
                                                                
                                                                

After GC it should look like

┌────────────────────┐                 ┌─────┐    After gc, only 
│       View 1       │─ ─ ─ ─ ─ ─ ─ ─▶ │Data1│     data that is  
├────────────────────┤       ┌ ─ ─ ─ ▶ │Data2│    pointed to by  
│       View 2       │─ ─ ─ ─          └─────┘     the views is  
└────────────────────┘                                 left      
                                                                 
                                                                 
        2 views                                                  
                                                                 
                                                                 
                                                                 

Describe the solution you'd like
I would like to add a method called StringViewArray::gc (and ByteViewArray::gc) that will compact

I expect users of the arrow crates to invoke this function, not any of the arrow kernels themselves

Describe alternatives you've considered

We could also add the gc functionality as its own standalone kernel (e.g. kernels::gc) rather than a method on the array.

Additional context
This GC is what is described in https://pola.rs/posts/polars-string-type/

What I consider the biggest downside is that we have to do garbage collection. When we gather/filter from an array with allocated long strings, we might keep strings alive that are not used anymore. This requires us to use some heuristics to determine when we will do a garbage collection pass on the string column. And because they are heuristics, sometimes they will be unneeded.

@alamb
Copy link
Contributor Author

alamb commented May 6, 2024

BTW @RinChanNOWWW has implemented StringViewArray / BinaryViewArray --> StringArray / BinaryArray in #5704

I think the same basic pattern can be applied to implement gc (aka copy each view value into a destination buffer)

@alamb
Copy link
Contributor Author

alamb commented May 9, 2024

Another potential idea came up in discord today which was to also implement some way of "interning" strings (aka track which strings have been seen before and remove duplicates)

https://discord.com/channels/885562378132000778/885562378132000781/1238127788788285521

The arrow dictionary array builder does this already so we could borrow its implementation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant