Add support for publication dates #169
I was just thinking through this the other day. I know @DuaneOBrien and I have talked about it in depth as well.
Spent a little time on this last night, and I think I want to put a more formal proposal together for this. It might require some changes to how data is stored / retrieved, but I think it's for the better. @mkw do you have a preference for how we do this? Would it be better to collaborate over a pull request or in something like a Google doc?
Actually... this might not be as bad as I was originally thinking. Logically, edges are already associated back to the source they came from. So a good starting point would be to set a timestamp on a source and resolve the associated source when the edges are pulled. Moving over to supporting a source for each version will be interesting. We really need to handle server-side filtering of data. Even if it's programmatic in Go at first, it'll be a huge step. I think there will be a big performance penalty, though, but we can mitigate that by making it an opt-in feature on the indexer. Thoughts?
OK. So I'm feeling more confident in this. I spent some time with the database. I always knew the queries could be improved; I just wasn't sure when the right time to do it was. Right now, the database uses a JOIN to grab the node and associated edge data. The problem is that it produces a lot of duplicate data. I was able to rewrite one of the queries using a nested select.
This conveniently lays out the data such that nodes and their associated edges are next to one another in the result set, and it should halve the amount of data being returned (even with the additional fetch for source information). I haven't tried doing this with the v1alpha schema, but I'm pretty sure the rewritten query won't map over as cleanly. The nice thing is, I'm pretty sure this will be the EOL for the v1alpha schema. I was able to move a good portion of the indexing side over to using the new manifest services (#176). All that's left is moving the existing APIs over to the v1beta storage layer.
I'm a little concerned about swapping over to the nested query. If performance gets rough, we can look at using a temp table to help improve things. My gut feeling is that it won't be too bad since we're hitting indexes and the entire dataset is more or less treated as key-value data.
Sorry -- I'm slow catching up. re: where to store the date -- it's fine to normalize. re: performance -- in general, I think you should leverage the fact that the data deps.cloud collects is effectively immutable once collected and start taking advantage of materialized views. All such deep joins can be eliminated at the cost of more storage. Materialized views are ideal for this because updates can be triggered manually in an "optimization" phase after initial indexing, minimizing total time.
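In PostgreSQL terms, the suggestion might look something like the following; every table and column name here is invented for illustration, not the actual schema:

```sql
-- Hypothetical names, for illustration only: pre-join nodes, edges,
-- and source publication dates so reads become a single scan.
CREATE MATERIALIZED VIEW node_edges AS
SELECT n.id AS node_id,
       e.target_id,
       s.publication_date
FROM nodes   n
JOIN edges   e ON e.source_node_id = n.id
JOIN sources s ON s.id = e.source_id;

-- Triggered manually during the post-indexing "optimization" phase:
REFRESH MATERIALIZED VIEW node_edges;
```

The trade-off is exactly as described: reads avoid the deep join, at the cost of storing the flattened result and refreshing it after each indexing pass.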
What kind of views are you imagining here? I don't mind trading off some latency for additional storage, but I'm just not convinced there's a whole lot of value in having a materialized view. For instance, one option would be to snapshot all the subgraphs, but that doesn't necessarily solve the problem.
OK. I ran a bunch of performance tests last night. The nested select causes the performance test to hang against SQLite. I managed to craft a JOIN equivalent which actually works pretty well and doesn't impact QPS too much (granted, SQLite has different performance characteristics than other databases). I'm going to work on putting a repo together with some more of this code so I can run this eval on a more regular basis against other databases too. I do agree, we could definitely be storing more. I don't think views are quite right? I almost want some kind of synthesized report store. I had something similar in the predecessor, but storage was client-side. I'll spend some more time thinking about this.
Sorry -- again super slow getting back. For materialized views, I was thinking that we could flatten things, but I believe that I was wrong. I think that what you got to is fine. In the long run, it might make sense to create in-memory indexes of the graph for searching for those that are willing to pay the RAM costs. But, that would require a lot more effort unless there is some OSS graph library that does so. |
OK. Time to follow up here.
It would be nice to be able to compute library "freshness" in terms of library age, not just version counts, using data in deps.cloud (see https://libyear.com for an example of this). To do this, publication dates would need to be added to artifact versions. Assorted thoughts: