Ability to handle large data files #989

Open · pripley123 opened this issue Feb 11, 2020 · 2 comments
Labels
est:Major (Major effort to implement) · est:Score=8 (Score for estimate of effort required, scale of 1 upwards) · f:Feature-request (This issue is a request for a new feature) · priority:Low

Comments

pripley123 commented Feb 11, 2020

As a data provider
I want the ability to add metadata for my large data files (10+ million rows of data)
so that I can create metadata for large datasets (currently Data Curator seems to slow down with more than 10K rows of data on my laptop)

A possible approach might be to take a sample of the data, which could be used with the "Guess" feature to generate initial metadata that the user could then modify as desired. Upon exporting a data package, it would be ideal for the full dataset to be included in the package.
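To make the sampling idea concrete, here is a minimal sketch, assuming a CSV source and a fixed sample size. It is not Data Curator's actual "Guess" implementation; the function name and the output shape (loosely modelled on Table Schema fields) are assumptions for illustration. Because only the first rows are read, the cost of guessing is independent of total file size:

```python
import csv

def guess_fields(path, sample_size=1000):
    """Infer a crude Table Schema-style field list from a row sample.

    Hypothetical sketch only, not Data Curator's real 'Guess' code.
    """
    def guess_type(values):
        # Try the narrowest type that every sampled value satisfies.
        for cast, name in ((int, "integer"), (float, "number")):
            try:
                for v in values:
                    cast(v)
                return name
            except ValueError:
                continue
        return "string"

    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # Read at most sample_size rows; the remaining millions of
        # rows are never touched during guessing.
        sample = [row for row, _ in zip(reader, range(sample_size))]

    columns = zip(header, zip(*sample))  # transpose rows into columns
    return [{"name": name, "type": guess_type(col)} for name, col in columns]
```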

Windows 10 (64-bit)
8 GB RAM
2.3 GHz CPU

@ghost added the est:Major, est:Score=8, and f:Feature-request labels on Apr 22, 2020

ghost commented Apr 22, 2020

Hi @pripley123
Yes, you're right: although we worked at the time on making things as efficient as possible for certain dataset sizes, there will be limitations with much larger datasets.
One idea I had is similar, I think, to what you're suggesting: tie the amount of data iterated/streamed to how much can actually be shown at any time, so that it is 'lazily loaded' as needed (see the sketch below). We could do the same for 'Guess', as you've suggested, using just a sample. I'm not yet sure how much effort something like this would involve, but I'll certainly check with our sponsor to see whether they have had a similar need with their typical datasets.
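A minimal sketch of that lazy-loading idea, again assuming a CSV source (the function and its parameters are hypothetical, not actual Data Curator code): the UI would only ever materialise the window of rows it can display.

```python
import csv
from itertools import islice

def load_window(path, offset, limit):
    """Return the header plus only rows [offset, offset + limit).

    Hypothetical sketch of windowed loading for a table view.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        # islice still scans the skipped rows; a real implementation
        # would likely index row byte offsets once so deep scrolls
        # become O(1) seeks instead of rescans.
        rows = list(islice(reader, offset, offset + limit))
    return header, rows

# e.g. fetch 50 rows starting at row 1,000,000 as the user scrolls
header, visible = load_window("data.csv", offset=1_000_000, limit=50)
```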
The amount of testing/benchmarking required for this alone, however, probably means that unless our sponsor also has this need, it will not make it into the upcoming release - apologies.

@ghost added the support label (This issue is a candidate to complete under the support agreement) on Apr 23, 2020

ghost commented May 18, 2020

Sorry @pripley123, I don't think we're going to have the bandwidth this release to get this one in, given the work involved. It is an important issue though (especially as datasets keep getting larger), and it is something I've discussed with our sponsors as worth looking into for future development.

@ghost removed the support label on May 18, 2020