
Add database:import command for non-atomic import #5396

Merged
merged 18 commits into master from hsinpei/database-set-chunked
Mar 17, 2023

Conversation

tohhsinpei
Member

@tohhsinpei tohhsinpei commented Jan 5, 2023

Description

b/261669787
API proposal: go/firebase-database-import

Implementation:

  • Take JSON file input and stream top-level objects using JSONStream. Assume individual objects can fit in memory.
  • For each JSON object, split into 1 MB chunks (partially based on firebase-import). PUT chunks in parallel.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 7047897 to 3d1bd1a Compare January 5, 2023 21:11
@tohhsinpei tohhsinpei requested a review from fredzqm January 5, 2023 21:12
@tohhsinpei tohhsinpei changed the title Import large data file using chunked writes Add database:import command for non-atomic import Jan 10, 2023
Contributor

@fredzqm fredzqm left a comment


Love your implementation. Really concise and clean.
It's a good starting point for now.

Using PUT allows us to overwrite one path at a time.
However, consider a case with a large list of small elements where the total list exceeds 1 MB. This would create tons of splits.
If we use multi-path PATCH, we can batch many small paths together and have fewer tiny chunks.

This is an optional optimization.


const inputString =
  options.data ||
  (await utils.streamToString(infile ? fs.createReadStream(infile) : process.stdin));
Contributor

@fredzqm fredzqm Jan 11, 2023


This allocates memory for the entire file.

However, reading firebase-import, I think it also loads the whole file into a JSON object in memory.

https://github.com/FirebaseExtended/firebase-import/blob/66eac3f50f2e4035e79782c9124eba09649e93f3/src/firebase-import.js#L140

Just a note here. I actually don't know if there is a good way to parse JSON in a streaming way, since we don't know whether the JSON is valid until the whole file is parsed. 🤷

Member Author


For now I have JSONStream parse and stream the top-level objects. We assume that individual objects can fit in memory.

@fredzqm fredzqm assigned tohhsinpei and unassigned fredzqm Jan 11, 2023
@codecov-commenter

codecov-commenter commented Jan 12, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.13% 🎉

Comparison is base (d659b9d) 56.12% compared to head (477cd94) 56.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5396      +/-   ##
==========================================
+ Coverage   56.12%   56.25%   +0.13%     
==========================================
  Files         317      318       +1     
  Lines       21510    21577      +67     
  Branches     4391     4397       +6     
==========================================
+ Hits        12072    12139      +67     
  Misses       8376     8376              
  Partials     1062     1062              
Impacted Files Coverage Δ
src/database/import.ts 100.00% <100.00%> (ø)



@tohhsinpei tohhsinpei requested a review from lepatryk January 13, 2023 19:19
constructor(private dbUrl: URL, file: string, private chunkSize = MAX_CHUNK_SIZE) {
  let data;
  try {
    data = { json: JSON.parse(file), pathname: dbUrl.pathname };


I'd like this command to support JSON files of at least 100 GB, so parsing everything into memory at once isn't really practical. https://github.com/FirebaseExtended/firebase-import uses JSONStream. Can we use it here as well?

Member Author


Done. Although, note that

  1. firebase-import still loads the whole file into memory (not sure why); we won't do that here.
  2. JSONStream assumes that individual top-level objects fit into memory while the file as a whole is streamed. I think this is a reasonable, maybe even necessary, assumption.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 49723d4 to de2e28b Compare January 26, 2023 16:23
@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 46442b3 to 972302a Compare January 26, 2023 17:35
@tohhsinpei
Member Author

@fredzqm Holding off on multi-path PATCH for now. If we do it, we'll need to make sure that each batch of updates also stays under the request payload size limit, but that shouldn't be hard.

@tohhsinpei tohhsinpei assigned fredzqm and unassigned tohhsinpei Jan 26, 2023
@fredzqm
Contributor

fredzqm commented Jan 26, 2023

I would leave whether to use multi-path PATCH in the first version up to you. @tohhsinpei

Single-path PUT would deliver a lot of value already, so I'm happy to get that algorithm in and iterate later.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch 2 times, most recently from 3fb53d7 to bdfc5ee Compare March 1, 2023 16:57
@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from bdfc5ee to 714dbc3 Compare March 1, 2023 16:58
@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 4fb53fe to 90991b9 Compare March 1, 2023 19:01

export const command = new Command("database:import <path> [infile]")
.description("non-atomically import JSON file to the specified path")
.option("-f, --force", "pass this option to bypass confirmation prompt")
Contributor


This isn't in the API proposal.

I was thinking of:

  1. If --force-override is passed, then proceed.
  2. Otherwise, check if the import path is empty; if yes, then proceed.
  3. Otherwise, prompt the user to confirm importing to a non-empty path.

Sadly, it got lost in the discussion.

@lepatryk What do you think about the behavior when the imported path isn't empty?

Member Author


--force just bypasses the `You are about to import data to https://<instance>.firebaseio.com/. Are you sure? (y/N)` prompt. We still check whether the import path is empty and, if not, tell the user to delete the existing data first.

const getJoinedPath = this.joinPath.bind(this);

const readChunks = new stream.Transform({ objectMode: true });
readChunks._transform = function (chunk: { key: string; value: any }, _, done) {
Contributor


Very cool. I am reading its docs: https://nodejs.org/api/stream.html#implementing-a-readable-stream

Thinking about how to control the size of the chunk emitted.
Shall we set highWaterMark higher to have larger chunk size?

Member Author


Not sure. We're in objectMode, so highWaterMark defaults to 16 (objects, not bytes). It's hard to know how many objects we should buffer.

Contributor

@fredzqm fredzqm Mar 13, 2023


Can you experiment with a large file and log each request's size?

If it ends up sending lots of tiny requests, it would be very slow.
We want to make sure the batch size is reasonably large for this command to be efficient and useful.

@bkendall bkendall requested review from joehan and removed request for bkendall March 2, 2023 18:39
.before(requireDatabaseInstance)
.before(populateInstanceDetails)
.before(printNoticeIfEmulated, Emulators.DATABASE)
.action(async (path: string, infile, options) => {
Contributor


Note: We support a global flag --json, which makes a command output whatever is returned by the action function as JSON. To that end, it would be nice to give the function in action an explicit return type.

Right now, this is Promise, which is fine. However, think about whether there is anything this should return that would be useful for CI/automated use cases.

Member Author


Maybe the total number of bytes written. That would be a good follow-up. I'd prefer to get this PR in first, if that's OK.

@tohhsinpei tohhsinpei requested review from joehan and fredzqm March 13, 2023 20:17
@fredzqm
Contributor

fredzqm commented Mar 13, 2023

Impl LGTM. Curious about the actual performance of it.

Contributor

@joehan joehan left a comment


If possible, please minimize the use of any types in this code.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 477cd94 to 2650a89 Compare March 15, 2023 22:20
@tohhsinpei
Member Author

tohhsinpei commented Mar 16, 2023

Can you experiment with a large file and log each request's size?

If it ends up sending lots of tiny requests, it would be very slow.
We want to make sure the batch size is reasonably large for this command to be efficient and useful.

@fredzqm I just increased MAX_CHUNK_SIZE to 10 MB, so it will be able to write objects of up to 10 MB in one go. If the file is mostly small objects at the top level, this won't help, though.

@tohhsinpei tohhsinpei requested a review from joehan March 16, 2023 15:06
@fredzqm
Contributor

fredzqm commented Mar 16, 2023

Can you experiment with a large file and log each request's size?
If it ends up sending lots of tiny requests, it would be very slow.
We want to make sure the batch size is reasonably large for this command to be efficient and useful.

@fredzqm I just increased MAX_CHUNK_SIZE to 10 MB, so it will be able to write objects of up to 10 MB in one go. If the file is mostly small objects at the top level, this won't help, though.

Yeah, I know. A long list can't be chunked unless we use PATCH. It's OK for now.

@tohhsinpei tohhsinpei enabled auto-merge (squash) March 17, 2023 19:30
@tohhsinpei tohhsinpei merged commit d491071 into master Mar 17, 2023
@tohhsinpei tohhsinpei deleted the hsinpei/database-set-chunked branch March 17, 2023 19:51