Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MergeFiles improvements #1309

Open
adamdecaf opened this issue Oct 19, 2023 · 15 comments
Open

MergeFiles improvements #1309

adamdecaf opened this issue Oct 19, 2023 · 15 comments
Labels
enhancement go Pull requests that update Go code help wanted

Comments

@adamdecaf
Copy link
Member

adamdecaf commented Oct 19, 2023

The Moov ACH library has a few ways of merging ACH files together, which can be useful to minimize the number of files sent over the network and cost from vendors, Fed, or other providers.

The performance and utility of MergeFiles can be improved and this issue sets out to discuss a few options.

Previous issues: #1041, #1276, #1282

@adamdecaf adamdecaf added enhancement help wanted go Pull requests that update Go code labels Oct 19, 2023
@midepeter
Copy link

midepeter commented Jan 11, 2024

Hello @adamdecaf,

Just got to know about moov and its OSS community. I must say this is great and beautiful. I am golang engineer and recently looking into perfomance engineering. I will love to help take a look at issue and investigate and give a report on how it can be carried out if given the opportunity to.

Looking forward to hearing from you.

Thank you
Peter.

@adamdecaf
Copy link
Member Author

Sounds good! Let me know if you need anything or have questions. Do you have a place to start?

@midepeter
Copy link

@adamdecaf

Yes, I do. Will try to familiarize myself with the codebase and investigate properly. If i do have questions, i will relate them to you.

Thank you!

@midepeter
Copy link

@adamdecaf

  • Is it faster to progressively merge files (from disk/io.Reader, *ach.File instances, or another input?)

I have some questions regarding this statement. Do you mean I should verify which is more efficient between merging files directly from the disk as they are being read and merging the files after they have been read and converted to *ach.File instance before merging.

@adamdecaf
Copy link
Member Author

There are a few options available. We currently need the files in memory (as an []*File) which is inefficient because each file has to be read from disk (or created in memory) before calling MergeFiles. There's a use-case in achgateway where we are reading files from disk and merging them, so a function to merge them faster in that path would be desired.

There may be faster ways to merge []*File as well, which would be an improvement. We need to keep the existing MergeFiles function for backwards compatibility.

I've linked a few PR's in the description which add benchmarks or previous performance improvements.

@midepeter
Copy link

Yea! Thanks for that

@midepeter
Copy link

Hello @adamdecaf

Not sure I was able to pull this off. But i will love to learn more about this codebase in general and probably pick up smaller tasks so i can get used to it because I could not really understand some parts of the code itself.

The only recommedation i have got is why we are using []*file and not []file because the latter has more memory perks than the former.

Thank you. If there are any smaller tasks i could take on that could get me used to the codebase, I will be really happy to pick it up.

@adamdecaf
Copy link
Member Author

Why would []File have better memory perks than []*File? Passing around an array of pointers is lighter-weight than passing around the structs.

@midepeter
Copy link

I think that it is a general rule of thumb to try as much as possible to reduce the use of pointers in a program because it thereby reduces the amount of heap allocations done. It is not always so actually but it is so common in golang programs

-- compiler generates a lot of checks when using pointer which makes programs a bit slower
-- pointers often have poor locality of reference
-- it gives the garbage collector more work

Just some opinions regarding that

@adamdecaf
Copy link
Member Author

We'd need to compare benchmarks between MergeFiles with and without pointers. Typically pointers to larger structs are more efficient and faster. See https://medium.com/@meeusdylan/when-to-use-pointers-in-go-44c15fe04eac

@midepeter
Copy link

Yes

Hm, Looking at it now, trying generate becnchmark reports for both cases. Quite a deep change in to the codebase. Not sure if we are ready to risk that at this point or i can go ahead.

@adamdecaf
Copy link
Member Author

You can make whatever changes on a branch to compare pointers vs no pointers.

@midepeter
Copy link

Oh okay perfect

@midepeter
Copy link

midepeter commented Feb 20, 2024

  • Is it faster to progressively merge files (from disk/io.Reader, *ach.File instances, or another input?)

@adamdecaf

By the way just to note here that it has this very approach proven that this is will improve the memory performance but reduce the cpu time significantly which i believe is not what we desire.

here some resources I found regarding that

  1. https://gophers.slack.com/archives/C0VP8EF3R/p1708362344757979
  2. https://sourcegraph.com/blog/slow-to-simd [ In this article emphasizing on One possible mitigation would be to move the vectors to disk, but loading them from disk at search time can add [significant latency](https://colin-scott.github.io/personal_website/research/interactive_latency.html), especially on slow disks. Instead, we chose to compress our vectors with int8 quantization. ]

@adamdecaf
Copy link
Member Author

We have a big usecase in achgateway to merge files on disk. We are currently reading files in groups to merge which has much better memory pressure, but takes more cpu time. If you can create a better method of merging files on disk that would be very helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement go Pull requests that update Go code help wanted
Projects
None yet
Development

No branches or pull requests

2 participants