This repository has been archived by the owner on Nov 11, 2022. It is now read-only.

Reading multiple files and need the names in processing #581

Open
parwaniravi248 opened this issue Jun 11, 2017 · 3 comments

Comments

@parwaniravi248

Hi,
I have a requirement where I am reading about 2,000 files and want to parse and validate them. During validation I need to know each file's name so that I can reject the corresponding files after validation.
I have created a list of file names and read them in a loop, but when I submit the job, the JSON job description exceeds the 10 MB limit and the submission fails. If I run the job with 800-1,000 files, it runs fine.

How would you suggest implementing this?

Do I have to split the file list and run the job multiple times with a smaller number of files each time?

Thanks
Ravi Parwani

@lukecwik
Contributor

lukecwik commented Jun 12, 2017

Until there is an implementation of SplittableDoFn that works with Dataflow, or Dataflow increases the maximum job description size, splitting the list of files and running multiple pipelines seems to be the simplest solution.

@parwaniravi248
Author

Thanks @lukecwik. Can we run the pipeline in a loop (say, 1K files per iteration) until all the files are consumed? Do you have a good example of running a pipeline in a loop? I appreciate your help and guidance.

@lukecwik
Contributor

int maxNumFiles = 1000;
List<String> files = ...;
for (int i = 0; i < files.size(); i += maxNumFiles) {
  buildAndRunPipeline(files.subList(i, Math.min(files.size(), i + maxNumFiles)));
}

void buildAndRunPipeline(List<String> files) {
  Pipeline p = ...;
  // build my pipeline over the smaller list of files

  // Either: launch the pipeline without waiting for it to finish,
  // effectively allowing you to run multiple pipelines in
  // parallel. You'll want to guard against running too
  // many pipelines because you may hit quota limits.
  p.run();

  // Or: launch and wait until each pipeline finishes before
  // launching the next one.
  // p.run().waitUntilFinish();
}
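For reference, the batching arithmetic in that loop can be exercised on its own, without Beam, to confirm that every file lands in exactly one batch. This is a self-contained sketch: the `batch` helper, the class name, and the `gs://my-bucket/...` paths are hypothetical, standing in for the pipeline-launching code above.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchFiles {
  // Split the file list into batches of at most maxNumFiles,
  // mirroring the subList loop from the snippet above.
  static List<List<String>> batch(List<String> files, int maxNumFiles) {
    List<List<String>> batches = new ArrayList<>();
    for (int i = 0; i < files.size(); i += maxNumFiles) {
      batches.add(files.subList(i, Math.min(files.size(), i + maxNumFiles)));
    }
    return batches;
  }

  public static void main(String[] args) {
    // Hypothetical file paths, just to drive the batching logic.
    List<String> files = new ArrayList<>();
    for (int i = 0; i < 2500; i++) {
      files.add("gs://my-bucket/file-" + i + ".csv");
    }
    List<List<String>> batches = batch(files, 1000);
    // 2,500 files in batches of 1,000 -> sizes 1000, 1000, 500
    System.out.println(batches.size());        // 3
    System.out.println(batches.get(2).size()); // 500
  }
}
```

Note that `subList` returns a view backed by the original list, which is fine here since each batch is consumed before the list changes.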
