Add support for publication dates #169

Open
mkw opened this issue Dec 2, 2020 · 10 comments

mkw commented Dec 2, 2020

It would be nice to be able to compute library "freshness" in terms of library age, not just version counts, using data in deps.cloud (see https://libyear.com for an example of this approach). To do this, publication dates would need to be added to artifact versions. Assorted thoughts (a rough sketch of the computation this enables follows the list):

  • Always and unconditionally UTC.
  • Prefer dates encoded in the scanned artifacts when present (rare, only published artifacts).
  • Otherwise, for artifact repositories, use the date of the artifact.
  • For source repository branches, use the last change date of the scanned branch.
  • For source repository tags, use the date of the tag.
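
For illustration only, a minimal sketch of the libyear-style computation this would enable, assuming artifact versions gain a UTC publication timestamp; the type and field names here are hypothetical, not part of the current deps.cloud API:

package freshness

import "time"

// ArtifactVersion is a hypothetical shape for a version record once a
// publication timestamp is attached (always and unconditionally UTC).
type ArtifactVersion struct {
    Version     string
    PublishedAt time.Time
}

// Libyear returns the freshness gap, in years, between the version a
// project currently depends on and the latest published version.
func Libyear(used, latest ArtifactVersion) float64 {
    age := latest.PublishedAt.Sub(used.PublishedAt)
    if age < 0 {
        age = 0
    }
    return age.Hours() / (24 * 365.25)
}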

mjpitz commented Dec 3, 2020

I was just thinking through this the other day. I know @DuaneOBrien and I have talked about it in depth as well.

mjpitz commented Dec 4, 2020

Spent a little time on this last night and I think I want to put a more formal proposal together for this. I think it might require some changes to how data is stored / retrieved, but I think it's for the better. @mkw do you have a preference for how we do this? Would it be better to collaborate over a pull request or in something like a Google doc?

mjpitz commented Dec 4, 2020

Actually... this might not be as bad as I was originally thinking. Logically, the data is laid out as follows:

k1  k2  k3  data
a   a       source{ ... }
a   b       manages{ ... }
b   b       module{ ... }
b   c   a   depends{ ... }
b   d   a   depends{ ... }
c   c       module{ ... }
d   d       module{ ... }

We're already associating the edge back to the source. So a good starting point would be to set a timestamp on a source and resolve the associated source when the edges are pulled.
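
For illustration, a minimal sketch of that starting point, assuming the row layout above and a timestamp stored on the source node; the type names and the pre-fetched source map are assumptions, not the actual storage types:

package storage

import "time"

// Row mirrors the logical layout above: node rows have k1 == k2, edge rows
// have k1 != k2, and k3 points back at the source the edge was indexed from.
type Row struct {
    K1, K2, K3 string
    Data       []byte
}

// Source is a hypothetical node payload carrying the publication or
// last-change timestamp (UTC).
type Source struct {
    Key       string
    Timestamp time.Time
}

// resolveEdgeTimestamps stamps each edge with the timestamp of the source
// it came from, using a map of sources fetched alongside the edges.
func resolveEdgeTimestamps(edges []Row, sources map[string]Source) map[string]time.Time {
    resolved := make(map[string]time.Time, len(edges))
    for _, edge := range edges {
        if src, ok := sources[edge.K3]; ok {
            resolved[edge.K1+"->"+edge.K2] = src.Timestamp
        }
    }
    return resolved
}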

Moving over to supporting a source for each version will be interesting. We really need to handle server-side filtering of data. Even if it's programmatic in Go at first, it'll be a huge step. I think there will be a big performance penalty, though; we can mitigate it by making this an opt-in feature on the indexer.

Thoughts?

mjpitz commented Dec 10, 2020

@mkw

OK. So I'm feeling more confident in this. I spent some time with the database. I always knew the queries could be improved; I just wasn't sure when the right time to do it was. Right now, the database uses a JOIN to grab the node and associated edge data. The problem is that it creates a lot of duplicate data. I was able to rewrite one of the queries using a UNION ALL (it worked with SQLite, so I think it'll be fine with the others, but we can always cross that bridge if needed):

SELECT * FROM graph_data WHERE k2 IN (
    SELECT DISTINCT(k2) AS keys
    FROM graph_data
    WHERE k1 IN ('k1')
      AND k1 != k2
      AND date_deleted IS NULL
)
AND date_deleted IS NULL

UNION ALL

SELECT * FROM graph_data WHERE k1 IN (
    SELECT DISTINCT(k3) AS keys
    FROM graph_data
    WHERE k1 IN ('k1')
      AND k1 != k2
      AND date_deleted IS NULL
)
AND k1 = k2
AND date_deleted IS NULL

ORDER BY k2;

This conveniently lays out the data such that nodes and their associated edges are next to one another in the result set, and it should halve the amount of data being returned (even with the additional fetch for source information). I haven't tried doing this with the v1alpha schema, but I'm pretty sure the rewritten query won't map over as cleanly. The nice thing is I'm pretty sure this will be the EOL for the v1alpha schema. I was able to move a good portion of the indexing side over to using the new manifest services (#176). All that's left is moving the existing APIs over to the v1beta storage layer.
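
As a rough, hypothetical illustration of how a caller could take advantage of that ordering (assuming the SELECT * above is narrowed to k1, k2, k3, data and the 'k1' literal becomes a bound parameter, which then appears twice), a single pass can pair each node row (k1 = k2) with the edge rows sharing its k2:

package storage

import "database/sql"

// graphRow holds the columns this sketch assumes the query returns.
type graphRow struct {
    K1, K2 string
    K3     sql.NullString
    Data   []byte
}

// fetchNeighborhood executes the rewritten UNION ALL query (with the key
// bound in both subqueries) and groups rows by k2, so each node row sits
// next to its associated edge rows. Illustrative only.
func fetchNeighborhood(db *sql.DB, query, key string) (map[string][]graphRow, error) {
    rows, err := db.Query(query, key, key)
    if err != nil {
        return nil, err
    }
    defer rows.Close()

    grouped := make(map[string][]graphRow)
    for rows.Next() {
        var r graphRow
        if err := rows.Scan(&r.K1, &r.K2, &r.K3, &r.Data); err != nil {
            return nil, err
        }
        grouped[r.K2] = append(grouped[r.K2], r)
    }
    return grouped, rows.Err()
}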

mjpitz commented Dec 10, 2020

I'm a little concerned about swapping over to the nested query. If performance gets rough, we can look at using a temp table to help improve things. My gut feeling is that it won't be too bad since we're hitting indexes and the entire dataset is more or less treated as key-value data.
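
If it does get rough, here is a minimal sketch of the temp-table idea, assuming the graph_data layout above; the table and column names other than graph_data are hypothetical, and this only shows the shape of the approach:

package storage

import "database/sql"

// materializeNeighborKeys writes the inner "distinct neighbor keys" select
// into a temporary table so the outer selects can join against it instead
// of re-running the nested subquery. Temporary tables are scoped to the
// connection/session in both SQLite and MySQL, so the subsequent reads must
// run on the same transaction.
func materializeNeighborKeys(tx *sql.Tx, key string) error {
    _, err := tx.Exec(`
        CREATE TEMPORARY TABLE neighbor_keys AS
        SELECT DISTINCT k2 AS k
        FROM graph_data
        WHERE k1 = ?
          AND k1 != k2
          AND date_deleted IS NULL`, key)
    return err
}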

mkw commented Dec 10, 2020

Sorry -- I'm slow catching up.

re: where to store the date -- it's fine to normalize.

re: performance -- In general, I think that you should leverage the fact that the data deps.cloud collects is effectively immutable once collected and start taking advantage of materialized views. All such deep joins can be eliminated at the cost of more storage. Materialized views are ideal for this because updates can be triggered manually in an "optimization" phase after initial indexing, minimizing total time.

mjpitz commented Dec 10, 2020

What kind of views are you imagining here? I don't mind trading off some latency for additional storage, but I'm just not convinced there's a whole lot of value in having a materialized view. For instance, one option would be to snapshot all the subgraphs, but that doesn't necessarily solve the problem.

mjpitz commented Dec 20, 2020

OK, I ran a bunch of performance tests last night. The nested select above causes the performance test to hang against SQLite. I managed to craft a JOIN equivalent that actually works pretty well and doesn't impact QPS too much (granted, SQLite has different performance characteristics than other databases).

I'm going to work on putting a repo together with some more of this code so I can run this evaluation on a more regular basis with other databases too.

I do agree that we could definitely be storing more. I don't think views are quite right, though; I almost want some kind of synthesized report store. I had something similar in the predecessor, but storage was client-side. I'll spend some more time thinking about this.

mkw commented Dec 30, 2020

Sorry -- again, super slow getting back. For materialized views, I was thinking we could flatten things, but I believe I was wrong; I think what you arrived at is fine. In the long run, it might make sense to create in-memory indexes of the graph for searching, for those willing to pay the RAM cost. But that would require a lot more effort unless there is an OSS graph library that already does it.

mjpitz commented Feb 20, 2021

OK. Time to follow up here.

  • I knocked out the last of the v1beta changes. After v0.3.0 (TBR), data will be written to the new v1beta store. The v1alpha store will still exist for read-only purposes, but the data there will go stale. (I'll detail this further in the v0.3.0 release notes).
  • New SQL queries sped up MySQL reads.
  • I looked around for a few OSS graph libraries. There were some out there at the time; the hard part was finding ones that were persistent. As soon as I started looking at persistence, I started down the graph database path (which I was trying to avoid for adoption reasons). Another option I had considered initially but ruled out was Redis (which arguably wouldn't be a bad option here); it has a graph module that we could explore a bit more.
  • We could also look into using something like mailgun's groupcache fork to cache data at the tracker nodes. It would reduce the number of trips to the database, and it works pretty well with proto (rough sketch below).
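
A hedged sketch of what that could look like, following the mailgun/groupcache v2 README; the group name, cache size, TTL, and fetch function are placeholders rather than anything that exists in deps.cloud today:

package cache

import (
    "context"
    "time"

    "github.com/mailgun/groupcache/v2"
)

// newGraphGroup wires a groupcache group in front of a database fetch.
// fetchFromDB is a placeholder for whatever the tracker uses to load a
// serialized (proto-encoded) graph item by key.
func newGraphGroup(fetchFromDB func(ctx context.Context, key string) ([]byte, error)) *groupcache.Group {
    return groupcache.NewGroup("graph-items", 64<<20, groupcache.GetterFunc(
        func(ctx context.Context, key string, dest groupcache.Sink) error {
            b, err := fetchFromDB(ctx, key)
            if err != nil {
                return err
            }
            // The mailgun fork accepts an expiration time on the sink setters.
            return dest.SetBytes(b, time.Now().Add(5*time.Minute))
        }))
}

On the read side, callers would pull through the same group with something like groupcache.AllocatingByteSliceSink, keeping the database as the fallback on cache misses.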
