
Add database:import command for non-atomic import #5396

Merged
merged 18 commits into master from hsinpei/database-set-chunked
Mar 17, 2023

Conversation

tohhsinpei
Member

@tohhsinpei tohhsinpei commented Jan 5, 2023

Description

b/261669787
API proposal: go/firebase-database-import

Implementation:

  • Take JSON file input and stream top-level objects using JSONStream. Assume individual objects can fit in memory.
  • For each JSON object, split into 1 MB chunks (partially based on firebase-import). PUT chunks in parallel.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 7047897 to 3d1bd1a Compare January 5, 2023 21:11
@tohhsinpei tohhsinpei requested a review from fredzqm January 5, 2023 21:12
@tohhsinpei tohhsinpei changed the title Import large data file using chunked writes Add database:import command for non-atomic import Jan 10, 2023
Contributor

@fredzqm fredzqm left a comment


Love your implementation. Really concise and clean.
It's a good starting point for now.

Using PUT allows us to overwrite one path at a time.
However, consider a case with a large list of small elements where the total list exceeds 1 MB. This would create tons of splits.
If we use multi-path PATCH, we can batch many small paths together and have fewer tiny chunks.

This is an optional optimization.


const inputString =
  options.data ||
  (await utils.streamToString(infile ? fs.createReadStream(infile) : process.stdin));
Contributor

@fredzqm fredzqm Jan 11, 2023


This allocates memory for the entire file.

However, reading firebase-import, I think it also loads the whole file into a JSON object in memory.

https://github.com/FirebaseExtended/firebase-import/blob/66eac3f50f2e4035e79782c9124eba09649e93f3/src/firebase-import.js#L140

Just a note here. I actually don't know if there is a good way to parse JSON in a streaming way, since we don't know whether the JSON is valid until the whole file is parsed. 🤷

Member Author


For now I have JSONStream parse and stream the top-level objects. We assume that individual objects can fit in memory.

@fredzqm fredzqm assigned tohhsinpei and unassigned fredzqm Jan 11, 2023
@codecov-commenter

codecov-commenter commented Jan 12, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.13% 🎉

Comparison is base (d659b9d) 56.12% compared to head (477cd94) 56.25%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5396      +/-   ##
==========================================
+ Coverage   56.12%   56.25%   +0.13%     
==========================================
  Files         317      318       +1     
  Lines       21510    21577      +67     
  Branches     4391     4397       +6     
==========================================
+ Hits        12072    12139      +67     
  Misses       8376     8376              
  Partials     1062     1062              
Impacted Files Coverage Δ
src/database/import.ts 100.00% <100.00%> (ø)



@tohhsinpei tohhsinpei requested a review from lepatryk January 13, 2023 19:19
constructor(private dbUrl: URL, file: string, private chunkSize = MAX_CHUNK_SIZE) {
  let data;
  try {
    data = { json: JSON.parse(file), pathname: dbUrl.pathname };


I'd like this command to support JSON files of at least 100 GB, so parsing everything into memory at once isn't really practical. https://github.com/FirebaseExtended/firebase-import uses JSONStream. Can we use it here as well?

Member Author


Done. Although, note that

  1. firebase-import still loads the whole file into memory (not sure why); we won't do that here.
  2. JSONStream assumes that individual top-level objects fit into memory while the file as a whole is streamed. I think this is a reasonable, maybe even necessary, assumption.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 49723d4 to de2e28b Compare January 26, 2023 16:23
@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 46442b3 to 972302a Compare January 26, 2023 17:35
@tohhsinpei
Member Author

@fredzqm Holding off on multi-path PATCH for now. If we do it, we'll need to make sure that each batch of updates also stays under the request payload size limit, but that shouldn't be hard.

@tohhsinpei tohhsinpei assigned fredzqm and unassigned tohhsinpei Jan 26, 2023
@fredzqm
Contributor

fredzqm commented Jan 26, 2023

I would leave whether to use multi-path PATCH in the first version up to you. @tohhsinpei

Single-path PUT would deliver a lot of value already, so I'm happy to get that algorithm in and iterate later.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch 2 times, most recently from 3fb53d7 to bdfc5ee Compare March 1, 2023 16:57
@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from bdfc5ee to 714dbc3 Compare March 1, 2023 16:58
@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 4fb53fe to 90991b9 Compare March 1, 2023 19:01

export const command = new Command("database:import <path> [infile]")
.description("non-atomically import JSON file to the specified path")
.option("-f, --force", "pass this option to bypass confirmation prompt")
Contributor


This isn't in the API proposal.

I was thinking of:

  1. If --force-override is passed, then proceed.
  2. Otherwise, check if the import path is empty; if yes, then proceed.
  3. Otherwise, prompt the user to confirm importing to a non-empty path.

Sadly, it got lost in the discussion.

@lepatryk What do you think about the behavior when the imported path isn't empty?

Member Author


--force just bypasses the `You are about to import data to https://<instance>.firebaseio.com/. Are you sure? (y/N)` prompt. We still check whether the import path is empty and, if not, tell the user to delete the existing data first.

const getJoinedPath = this.joinPath.bind(this);

const readChunks = new stream.Transform({ objectMode: true });
readChunks._transform = function (chunk: { key: string; value: any }, _, done) {
Contributor


Very cool. I am reading its docs: https://nodejs.org/api/stream.html#implementing-a-readable-stream

Thinking about how to control the size of the chunk emitted.
Shall we set highWaterMark higher to have larger chunk size?

Member Author


Not sure. We're in objectMode, so highWaterMark defaults to 16 (objects, not bytes). It's hard to know how many objects we should buffer.

Contributor

@fredzqm fredzqm Mar 13, 2023


Can you experiment with a large file and log each request's size?

If it ends up sending lots of tiny requests, it would be very slow.
We want to make sure the batch size is reasonably large for this command to be efficient and useful.

@bkendall bkendall requested review from joehan and removed request for bkendall March 2, 2023 18:39
.before(requireDatabaseInstance)
.before(populateInstanceDetails)
.before(printNoticeIfEmulated, Emulators.DATABASE)
.action(async (path: string, infile, options) => {
Contributor


Note: We support a global flag --json, which makes a command output whatever is returned by the action function as JSON. To that end, it would be nice to give the function in action an explicit return type.

Right now, this is Promise, which is fine. However, think about whether there is anything this should return that would be useful for CI/automated use cases.

Member Author


Maybe the total number of bytes written. That would be a good follow-up. I'd prefer to get this PR in first, if that's OK.

@tohhsinpei tohhsinpei requested review from joehan and fredzqm March 13, 2023 20:17
@fredzqm
Contributor

fredzqm commented Mar 13, 2023

Impl LGTM. Curious about the actual performance of it.

Contributor

@joehan joehan left a comment


If possible, please minimize the use of any types in this code.

@tohhsinpei tohhsinpei force-pushed the hsinpei/database-set-chunked branch from 477cd94 to 2650a89 Compare March 15, 2023 22:20
@tohhsinpei
Member Author

tohhsinpei commented Mar 16, 2023

Can you experiment with a large file and log each request's size?

If it ends up sending lots of tiny requests, it would be very slow.
We want to make sure the batch size is reasonably large for this command to be efficient and useful.

@fredzqm I just increased MAX_CHUNK_SIZE to 10 MB, so it will be able to write objects of up to 10 MB in one go. If the file is mostly small objects at the top level, this won't help, though.

@tohhsinpei tohhsinpei requested a review from joehan March 16, 2023 15:06
@fredzqm
Contributor

fredzqm commented Mar 16, 2023

Can you experiment with a large file and log each request's size?
If it ends up sending lots of tiny requests, it would be very slow.
We want to make sure the batch size is reasonably large for this command to be efficient and useful.

@fredzqm I just increased MAX_CHUNK_SIZE to 10 MB, so it will be able to write objects of up to 10 MB in one go. If the file is mostly small objects at the top level, this won't help, though.

Yeah, I know. A long list can't be chunked unless we use PATCH. It's OK for now.

@tohhsinpei tohhsinpei enabled auto-merge (squash) March 17, 2023 19:30
@tohhsinpei tohhsinpei merged commit d491071 into master Mar 17, 2023
@tohhsinpei tohhsinpei deleted the hsinpei/database-set-chunked branch March 17, 2023 19:51