
Article: options and tradeoffs around data import parameters #1176

Open · Tracked by #428 · May be fixed by #1715
lidel opened this issue Jun 10, 2022 · 9 comments
Labels: need/author-input (Needs input from the original author) · need/triage (Needs initial labeling and prioritization) · status/blocked (Unable to be worked further until needs are met)

Comments

lidel (Member) commented Jun 10, 2022

Most people are OK with whatever chunker and hash function are the current defaults in commands that import data into IPFS.
In the case of go-ipfs, these are ipfs add, ipfs dag put, and ipfs block put.

However, when doing ipfs add one can not only use a custom --chunker and --hash function, but also choose to produce a trickle DAG instead of the default balanced MerkleDAG by passing --trickle, enable or disable --raw-leaves, or even write your own software that chunks, hashes, and assembles a UnixFS DAG in novel ways.
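
For illustration, a minimal sketch of what these knobs look like on the command line (assuming Kubo/go-ipfs flag names; file names and parameter values are placeholders):

    # content-defined chunker, different hash, raw leaves, CIDv1
    $ ipfs add --chunker=rabin --hash=blake2b-256 --raw-leaves --cid-version=1 example.bin
    # trickle DAG layout instead of the default balanced layout
    $ ipfs add --trickle example.log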

One can go beyond that and import JSON data as dag-json or dag-cbor, creating data structures beyond regular files and directories.
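
For example, a rough sketch with ipfs dag put (flag names as in recent Kubo releases; the JSON payload and CID are placeholders):

    # store a small JSON document as dag-cbor instead of wrapping it in UnixFS
    $ echo '{"name": "example", "size": 42}' | ipfs dag put --input-codec dag-json --store-codec dag-cbor
    <cid>
    # retrieve it again as JSON
    $ ipfs dag get <cid>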

We need an article that explains:

  • what is the current default when importing files and why
    • chunker (why we use size-based, when to use rabin or buzhash)
    • hash (why we use sha2-256)
    • raw leaves (possible and default when cidv1 is used, but legacy implementations used cidv0 without this)
    • cid version
      • we should document cid v1 as the default, but note that legacy implementations may use v0
    • dag type (--trickle better suited for append-only data such as logs?)
  • what are the knobs one can change during import, and what is their impact/tradeoffs (see the example after this list)
  • things to hint at, but no need to go too deep
    • note that dag-pb alternatives exist, mention dag-json and dag-cbor, and hint at when using non-UnixFS DAGs makes sense
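
As a quick illustration of the impact (a sketch only, assuming current Kubo flags; the file name is a placeholder):

    # the same file imported with different parameters yields different root CIDs
    $ ipfs add --only-hash -q --chunker=size-262144 big-file.bin
    $ ipfs add --only-hash -q --chunker=buzhash --raw-leaves --cid-version=1 big-file.bin
    # two different CIDs for identical content, so peers must use the same
    # import parameters (or reuse the original CID) to arrive at matching roots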

Prior art:

lidel added the need/triage label Jun 10, 2022
RubenKelevra (Contributor) commented:

Hey @lidel, feel free to assign me. I've got time tomorrow for that. :)

lidel added the dif/hard (Having worked on the specific codebase is important), P2 (Medium: Good to have, but can wait until someone steps up), and effort/days (Estimated to take multiple days, but less than a week) labels, and removed the need/triage label, Jun 10, 2022
RubenKelevra (Contributor) commented:

@lidel wrote:

what is the current default when importing files and why

  • chunker (why we use size-based...)

I may need some input here. I actually can't think of a reasonable explanation why size-based is better than a rolling chunker.

Maybe someone like @Stebalien can chime in here and tell me why the decision was made to use a size-based chunker by default. :)

RubenKelevra (Contributor) commented:

  • dag type ( --trickle better suited for append-only data such as logs?)

Correct me if I'm wrong, but it's just a little less overhead for data that is read from front to back anyway. So any file type with random access will be slowed down.

Logs are not large enough to make any significant difference here, as you can easily fit a list of all chunks of a log in one block.

So while one may think of zip-like archives, ISO files, or videos, that's actually not the case either: zip files are random access, ISO files can be mounted without reading the whole image, and video streaming with seeking is pretty much the norm.

I also cannot think of a really good use case here – so I would flag it as a "stable, but experimental" option.
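
(For reference, a minimal sketch of how the two layouts could be compared, assuming Kubo's CLI; the file name is a placeholder:)

    # same file, balanced vs. trickle layout
    $ CID_BALANCED=$(ipfs add -q big.log)
    $ CID_TRICKLE=$(ipfs add -q --trickle big.log)
    # inspect total size and number of blocks for each layout
    $ ipfs dag stat "$CID_BALANCED"
    $ ipfs dag stat "$CID_TRICKLE"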

RubenKelevra (Contributor) commented Jun 14, 2022

  • hash (why we use sha2-256)

I feel like I may not be the right person to write this article after all :D I actually wrote a ticket to change this default – and I still think blake2b is the better default. :)

So I guess "standards"? Or "legacy stuff we don't dare to change"?

RubenKelevra (Contributor) commented:

So overall, just the "why?" and the rationale are the blocker for me writing it.

I'm of the opinion that these should be the standards – and I don't see a good reason to use anything else. :)

  • Rolling chunker aka buzhash
  • cidv1
  • raw-leaves
  • blake2b-256

And I use them everywhere.
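
For example, that recipe looks roughly like this on the command line (a sketch, assuming Kubo's flag names; the file name is a placeholder):

    $ ipfs add --chunker=buzhash --cid-version=1 --raw-leaves --hash=blake2b-256 myfile.bin
    # note: with --cid-version=1, Kubo already defaults to raw leaves, so --raw-leaves is just explicit here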

So @lidel, if you could just give some rationale for the whys (it doesn't even need to be full sentences), I'm happy to write it. Just stop me if it gets too detailed ;)

lidel (Member, Author) commented Jun 14, 2022

@RubenKelevra no need to write the whole thing, it is perfectly fine if you only write sections that you care about (even if it is only the chunker) and open a draft PR with that; we will fill the gaps :)

You are right, many choices like the default chunker are legacy decisions – just write that and note that different implementations of IPFS are free to choose different defaults (e.g. blake2b).

Totally – it will be useful to even give some "recipes" like the one you listed with blake2b and buzhash, and elaborate on why one would prefer that over the "safe"/legacy defaults. :)

RubenKelevra (Contributor) commented:

> @RubenKelevra no need to write the whole thing, it is perfectly fine if you only write sections that you care about (even if it is only the chunker) and open a draft PR with that; we will fill the gaps :)

Alright. :)

> You are right, many choices like the default chunker are legacy decisions – just write that and note that different implementations of IPFS are free to choose different defaults (e.g. blake2b).
>
> Totally – it will be useful to even give some "recipes" like the one you listed with blake2b and buzhash, and elaborate on why one would prefer that over the "safe"/legacy defaults. :)

Maybe we should just add a "--use-legacy-defaults" flag to the daemon (and as a global flag for all commands) to free us from the concern that people rely on the old defaults.

This would also free us up to change, for example, the long-discussed default ports, which we also don't dare to change for similar reasons. :)

This way we could document the "legacy defaults" once, explain why they were chosen, and then elaborate on why the new defaults are better.

I feel that would make more sense when reading – and also more sense when using ipfs.

ElPaisano removed the P2 (Medium: Good to have, but can wait until someone steps up) label Apr 4, 2023
ElPaisano (Contributor) commented:

@lidel, triaging old issues – would you say this is still relevant?

ElPaisano added the need/author-input, need/triage, and status/blocked labels, and removed the dif/hard and effort/days labels, Aug 22, 2023
lidel (Member, Author) commented Aug 22, 2023

@ElPaisano yes, I believe there is untapped potential in the IPFS ecosystem here, and having some introductory docs might empower people to innovate in this area.

There is a need for two articles (or one with two sections):

  • an introductory one that explains the defaults and knobs in software like Kubo and Helia
  • a DIY one on writing your own data onboarding tools that do custom chunking (good example in the specs here and JS code here)

The goal would be to convey that chunking details are a userland feature: anyone can use the default chunking or roll their own.

ElPaisano linked a pull request Sep 26, 2023 that will close this issue