Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring for image based gadgets #2512

Merged
merged 25 commits into from
Mar 27, 2024
Merged

Refactoring for image based gadgets #2512

merged 25 commits into from
Mar 27, 2024

Conversation

flyth
Copy link
Member

@flyth flyth commented Feb 16, 2024

In our teams' efforts to quickly move forward with the "image based gadgets" feature/architecture (which is really a great achievement!), many things have been built into an interface that was previously used for all other single gadgets - basically all of the functionality has landed in a gadget simply called "run". Yet, plans for the image based gadgets have outgrown the capabilities of that interface: in the future, those gadget images should for example be able to extract multiple event streams at the same time (which previously would have required two separate gadget instances) that then can be consumed in different ways.

Also, while we previously knew which gadget expected which kind of enrichment, it was easy to just actively extend our Go data structures and wire the gadgets up in a way that they could get the information needed. We use operators for that, which check for certain interface implementations on the gadgets itself and then activate themselves.
With the image based gadgets, however, we don't have those clear Go types anymore - instead, we try to mostly infer data layouts from BTF and then we manually leverage operators depending on the outcome of the inference and simply reserve space for all the information that could be added. This has added quite a lot of direct dependencies to the run gadget that are not necessarily required all the time.

This refactoring addresses a couple of things:

  • it makes IG more modular again
    • image based gadgets are evaluated by their layers (ebpf, wasm, ...) and respective operators will handle them
    • operators don't need to know about each other again
  • data will be sent/received through completely abstract DataSources instead that operators subscribe to to extend, modify or consume data
  • Fields inside DataSources can be addressed using names and tags
  • multiple DataSources per gadget are possible with the refactoring, which addresses handling multiple data streams emitted or polled from the gadget
  • using operators as sinks, they can consume data and forward it to for example external services or output them to the CLI or a prometheus backend
  • the underlying data can be efficiently serialized through protocol buffers (instead of JSON, which is used right now)
  • DataSources (or even their fields) can be addressed and explicitly requested; if a field is not requested by the user/client, operators handling it can choose to not fill or enrich it in the first place, thus saving CPU & bandwidth
  • it makes it much easier to provide new user-space helpers like converters
  • the new interfaces for the DataSources should be implemented for WASM at a later stage, to be able to provide the same level of access in user supplied code
  • it simplifies the Go API to call gadgets and makes it easier to phase out legacy code later on
  • prepares IG to be able to handle other planned features more easily
  • ...tba

This is still a draft and still needs a bit of polishing, especially around documentation & testing.

I will add more information and the open todo items over the next couple of days.

Data flow

sequenceDiagram
    participant client as Client
    participant grpc as gRPC Runtime
    participant lctx as Local GadgetContext
    participant cli as CLI Operator
    participant svc as gRPC Service
    participant oci as OCI Handler
    participant ctx as GadgetContext
    participant ebpf as eBPF Operator
    participant mgr as KubeManager
    participant fmt as Formatter
    box Client
        participant client
        participant grpc
        participant lctx
        participant cli
    end
    box Server
        participant svc
        participant oci
        participant ctx
        participant ebpf
        participant mgr
        participant fmt
    end
    client ->> grpc: Call run-experimental
    grpc ->> svc: RunOCIGadget(url)
    grpc ->> lctx: Create New
    svc ->> oci: RunOCIGadget(url)
    Note over oci: ensure gadget is available<br/>download if necessary<br/>look at layers and call operators
    oci ->> ctx: Create New
    Note over oci: find handler for layer (eBPF in this example)
    oci ->> ebpf: Instantiate()
    Note over ebpf: load image<br/>parse BTF<br/>populate everything<br/>create DataSources
    oci ->> mgr: Instantiate()
    Note over mgr: iterate over gadgetCtx().DataSources()<br/>if mntnsid found, add fields to that DataSource<br/>subscribe to that DataSource (unused though in GetGadgetInfo())
    Note over mgr: gets FieldAccessors and caches them
    mgr ->> ctx: add fields / subscriptions
    oci ->> fmt: Instantiate()
    Note over fmt: iterate over gadgetCtx().DataSources()<br/>if interesting fields found, adjust DataSource<br/>subscribe to that DataSource (unused)
    Note over fmt: gets FieldAccessors and caches them
    fmt ->> ctx: add fields / subscriptions
    Note over oci: serialize GadgetInfo from GadgetContext
    oci ->> svc: Send GadgetInfo
    svc ->> grpc: 
    grpc ->> lctx: Deserialize GadgetInfo
    Note over lctx: create DataSources
    grpc ->> cli: init
    Note over cli: creates Columns helpers from DataSources
    Note over cli: iterate over gadgetCtx().DataSources()<br/>subscribes
    cli ->> lctx: subscribe
    oci ->> ctx: subscribe to DataSources as sink (last in chain)
    oci ->> ebpf: Start()
    Note over ebpf: installs programs<br/>installs readers
    oci ->> mgr: Start()
    oci ->> fmt: Start()
    Note over ebpf: read data from buffer<br/>create new Data and use cached FieldAccessors to fill in data
    ebpf ->> ctx: EmitAndRelease() data
    ctx ->> mgr: forward Data to subscriber
    Note over mgr: uses cached FieldAccessors to read and write data
    ctx ->> fmt: forward Data to subscriber
    Note over fmt: uses cached FieldAccessors to read and write data
    ctx ->> oci: forward Data to subscriber
    Note over oci: serialize to protobuf
    oci ->> svc: forward GadgetEvent
    svc ->> grpc: 
    Note over grpc: deserialize from protobuf
    grpc ->> lctx: EmitAndRelease() data
    lctx ->> cli: forward to subscriber
    Note over cli: output

Todos

  • JSON output to include all fields (not a big fan as this introduces new exceptions)
  • remove viper (but might move that out of the PR if that's okay for now)
  • update docs / workflow chart

Changes to the run gadget implemented while this PR is/was open and should be addressed here or in a follow-up PR. Feel free to add.

@flyth flyth force-pushed the michael/refactoring-oci branch 3 times, most recently from 779539f to e95a194 Compare February 16, 2024 22:58
Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I have some comments on the first commit.
I will take a deeper look to get a broader understanding later, nonetheless:

  • a mermaid diagram would be welcomed to help understanding the links between the different structures.
  • this PR adds 6000 lines of code, I understand protobuf generated code accounts for a big part of it but I really want a follow-up PR removing plenty of code.
  • this PR has 24 commits, some of them add helpers which would have been already helpful and could have been merged before. I am wondering if you should not split it.

Best regards.

pkg/oci/oci.go Outdated Show resolved Hide resolved
pkg/datasource/accessors.go Outdated Show resolved Hide resolved
pkg/datasource/accessors.go Outdated Show resolved Hide resolved
pkg/datasource/accessors.go Outdated Show resolved Hide resolved
pkg/datasource/accessors.go Outdated Show resolved Hide resolved
pkg/datasource/columns.go Outdated Show resolved Hide resolved
pkg/datasource/data.go Outdated Show resolved Hide resolved
ds.fieldMap[f.Name] = (*field)(f)
}
ds.registerPool()
return ds, nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You never return an error, so drop this unused error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More tests could be added here that return errors when loading the untrusted "user" data here - I'd rather leave the error in to avoid refactoring the callers, but I'll add a TODO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid refactoring the callers

gopatch can handle this.
You are already refactoring, so let's merge correct code and only what we need.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid refactoring the callers

gopatch can handle this. You are already refactoring, so let's merge correct code and only what we need.

I don't agree with this. Why to have to change this in the future if this is planned to return errors later on? It's better to define the right function signature from the beginning.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was that I meant: you are already doing a refactoring here, so let's focus on refactor right now, instead of adding refactoring as future work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolving as the error is still present but never used!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding we agreed to keep it this way to not have to change the function signature later on and have to edit all the callers.

pkg/datasource/data.go Show resolved Hide resolved
pkg/datasource/data.go Show resolved Hide resolved
Copy link
Member

@alban alban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments.

I think part of this PR can be added progressively with separate PRs, such as:

  • changes to btfhelpers
  • changes on CString

There are probably others... then, this PR will be a bit smaller.

Then, there are big blocks which could be split, such as:

  1. Changes to the service grpc API
  2. Introduction of the Layer operators
  3. datasource.FieldAccessor

It seems they can be implemented, reviewed and merged one after another in this order.

cmd/common/oci.go Outdated Show resolved Hide resolved
cmd/common/oci.go Outdated Show resolved Hide resolved
Comment on lines 73 to 82
// Before running the gadget, we need to get the gadget info to perform
// different tasks like creating the parser and setting flags for the
// gadget's parameters.
actualArgs := cmd.Flags().Args()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but I think dynamic flags for gadget parameters are a mistake. I don't like that gadget parameters like --ignore-failed are mixed up with operators/localmanager flags like --host and with generic flags like --insecure.

$ sudo -E ./ig run trace_exec -h
...
  -h, --help                   help for run
      --host                   Show data from both the host and containers
      --ignore-failed          Ignore failed events (default true)
      --insecure               Allow connections to HTTP only registries

I would prefer if it was --param ignore-failed=true. It would make it a lot easier for the user to understand the category of the flag. And I suspect the implementation would be easier without the utils.ParseEarlyFlags.

Also, I'd like to have an inspect command that would show, among other things, the parameters:

$ sudo -E ./ig image inspect trace_exec

In this way, the --help command would never try to connect to internet.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with the --param issue and also with having an inspect call to show all available information (from the accumulated gadget metadata).

In this way, the --help command would never try to connect to internet.

Why would it change behavior here? (I mean: which requests would be saved here?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would it change behavior here? (I mean: which requests would be saved here?)

I don't understand the question.

What I meant is that when the user does not really understand how the CLI is supposed to be used and adds a --help, it does not print any help message and tries to download something from the internet. I find that confusing for users.

$ sudo -E ./ig run trace exec --help
INFO[0000] Experimental features enabled                
Error: getting gadget info: getting gadget image: pulling image "trace": downloading to local repository: failed to resolve ghcr.io/inspektor-gadget/gadget/trace:latest: ghcr.io/inspektor-gadget/gadget/trace:latest: not found

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay - so you mean the gadget doesn't need to be downloaded directly for that call with a --help at the end. I don't think that's possible, though - because operators also have their params and those can depend on the actual gadget - e.g. if the gadget has mountnsid, LocalManager or KubeManager needs to be loaded - and those also register params.
Or would you want those params also to move to the more generic solution?

Copy link
Member

@alban alban Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or would you want those params also to move to the more generic solution?

Yes, I would like that. Something like --operator-opt foo:bar, so the CLI flag --operator-opt is known without downloading anything. And the user could discover the possible operator params with the image inspect command.

Similarly, Docker has the flag --security-opt= which is also not fully described in --help because I think it is passed to some inner component:

  • --security-opt=seccomp:unconfined
  • --security-opt=no-new-privileges
  • --security-opt=label:disable

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, if we go that route, we could also use our internal conventions (e.g. --param operators.localmanager.foo=abc). I'm certainly not against it, but it would make things a bit less straightforward than it was before for users.

Another thing I like about it, is that it can differentiate between local and remote operators; the cli operator for example is always a local running one, while most of the others are run on the server side. So cli could still register params directly, while remote params could be used through the detour. Hmmm.....

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not related to this PR, but I think dynamic flags for gadget parameters are a mistake. I don't like that gadget parameters like --ignore-failed are mixed up with operators/localmanager flags like --host and with generic flags like --insecure.

I think this could be fixed by #935.

$ sudo -E ./ig run trace_exec -h
...
  -h, --help                   help for run
      --host                   Show data from both the host and containers
      --ignore-failed          Ignore failed events (default true)
      --insecure               Allow connections to HTTP only registries

I would prefer if it was --param ignore-failed=true. It would make it a lot easier for the user to understand the category of the flag.

What about parameters that are introduced by operators? (--host, --namespace, --pod, etc?) Should we use a --op-param prefix too?

Also, I'd like to have an inspect command that would show, among other things, the parameters:

Do we really want to introduce another command for it? Isn't it better to have this under a well-known flag / or command like --help?
I was thinking we could have in the metadata data a help section that is shown when the user run ig run foo --help, this will allow the gadget developer to write a documentation specific for their gadget.

What I meant is that when the user does not really understand how the CLI is supposed to be used and adds a --help, it does not print any help message and tries to download something from the internet. I find that confusing for users.

I'd say that specific case not well handled currently. We can improve this message by indicating the gadget isn't valid.

(I'd suggest to move this discussion to a different issue to help focusing the review on specific changes introduced by this PR)

cmd/common/oci.go Outdated Show resolved Hide resolved
cmd/common/oci.go Outdated Show resolved Hide resolved
pkg/btfhelpers/btfhelpers.go Outdated Show resolved Hide resolved
pkg/gadget-service/api/api.proto Outdated Show resolved Hide resolved
pkg/gadget-service/api/api.proto Show resolved Hide resolved
pkg/gadget-service/api/api.proto Outdated Show resolved Hide resolved
pkg/gadgets/helpers_all.go Outdated Show resolved Hide resolved
@flyth
Copy link
Member Author

flyth commented Feb 19, 2024

You're both right - I will try to move all helpers to another PR.

a mermaid diagram would be welcomed to help understanding the links between the different structures.

Yeah, totally agree and this will follow soon.

but I really want a follow-up PR removing plenty of code.

This PR doesn't remove the run gadget - when we switch to this, the run gadget and some of its dependencies can be removed.

@mauriciovasquezbernal
Copy link
Member

This PR doesn't remove the run gadget - when we switch to this, the run gadget and some of its dependencies can be removed.

When do you think we should remove the run gadget? IMO it should be done in this same PR.

Copy link
Member

@mauriciovasquezbernal mauriciovasquezbernal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some very initial comments. I still need to check most of it.

Just to be sure we're aligned, current status of this PR isn't working, right?
I tried it and all what I got was:

$ $ sudo -E ig run-experimental trace_open
INFO[0000] Experimental features enabled
WARN[0000] unknown type for variable "gadget_filter_by_mntns": Var:"gadget_filter_by_mntns"
WARN[0000] unknown type for variable "gadget_filter_by_mntns": Var:"gadget_filter_by_mntns"

pkg/operators/operators.go Outdated Show resolved Hide resolved
pkg/operators/ebpf/ebpf.go Outdated Show resolved Hide resolved
pkg/operators/ebpf/ebpf.go Outdated Show resolved Hide resolved
tags: nil,
}

i.vars[varName] = newVar

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't it be i.maps?

pkg/operators/ebpf/ebpf.go Outdated Show resolved Hide resolved
pkg/gadget-service/api/api.proto Outdated Show resolved Hide resolved
pkg/gadget-service/api/api.proto Outdated Show resolved Hide resolved
pkg/gadget-service/api/consts.go Outdated Show resolved Hide resolved
pkg/runtime/grpc/oci.go Outdated Show resolved Hide resolved
// },
{
prefixFunc: func(s string) (string, bool) {
// Exceptions for backwards-compatibility

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backwards-compatibility with what?

Copy link
Member Author

@flyth flyth Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gadget_filter_by_mntns is handled like a var in the operator, which is preferred (in here) to work with the prefix gadget_var_. In the end, it would be better to have the former be called gadget_var_filter_by_mntns to fit the concept. I'm open for discussions around that 😄

Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

Some comments, I am still reading it.

Best regards.

pkg/datasource/accessors.go Outdated Show resolved Hide resolved
pkg/datasource/data.go Show resolved Hide resolved
pkg/gadget-context/gadget-context.go Outdated Show resolved Hide resolved
pkg/gadget-context/gadget-context.go Show resolved Hide resolved
pkg/gadget-context/gadget-context.go Show resolved Hide resolved
pkg/operators/ebpf/struct.go Outdated Show resolved Hide resolved
pkg/operators/ebpf/struct.go Show resolved Hide resolved
pkg/operators/ebpf/struct.go Outdated Show resolved Hide resolved
pkg/operators/ebpf/struct.go Outdated Show resolved Hide resolved
pkg/operators/ebpf/attach.go Show resolved Hide resolved
Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

Here are some other comments.
I will take a deeper look later and I will particularly focus on getting a big picture understanding, so mermaid diagram would be welcomed.

With regard to the code itself, you can still factorize several parts, thus reducing how many lines you want to add.
Moreover, I would like to get an estimation on how many lines we will be able to remove once this PR would be merged.
Indeed, this is not the first time we refactor the code base and we did not remove all the code we should have removed for specific reasons.

Best regards.

pkg/operators/localmanager/wrappers.go Outdated Show resolved Hide resolved
pkg/operators/formatters/formatters.go Outdated Show resolved Hide resolved
pkg/operators/formatters/formatters.go Show resolved Hide resolved
pkg/runtime/local/local.go Outdated Show resolved Hide resolved
pkg/operators/kubemanager/kubemanager.go Outdated Show resolved Hide resolved
pkg/operators/kubemanager/wrappers.go Outdated Show resolved Hide resolved
@flyth
Copy link
Member Author

flyth commented Feb 20, 2024

Just to be sure we're aligned, current status of this PR isn't working, right?

No, it should work. I seem to have messed up the local output with recent changes, sorry - I'll look into it. ig --daemon + gadgetctl or using on kubernetes should work.

Comment on lines 532 to 565
if !host {
if l.manager.igManager == nil {
return fmt.Errorf("container-collection isn't available")
}

// Create mount namespace map to filter by containers
mountnsmap, err := l.manager.igManager.CreateMountNsMap(id.String(), containerSelector)
if err != nil {
return commonutils.WrapInErrManagerCreateMountNsMap(err)
}

log.Debugf("set mountnsmap for gadget")
gadgetCtx.SetVar(gadgets.MntNsFilterMapName, mountnsmap)

l.mountnsmap = mountnsmap
} else if l.manager.igManager == nil {
log.Warn("container-collection isn't available: container enrichment and filtering won't work")
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this logic really belongs to this operator. The mount ns map is something that can only be used from eBPF programs, so wouldn't it make more sense to move this logic into the eBPF operator?

Copy link
Member Author

@flyth flyth Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you would introduce a hard dependency on that from the eBPF operator - and we have two implementations (KubeManager + LocalManager). Also, we want to add the possibility to completely deactivate certain operators. I think having this contained in an operator helps keeping the code clean.

The mount ns map is something that can only be used from eBPF programs

It's probably going to be only eBPF accessing it in this specific case - but for other features, I think other operators might be interested in eBPF maps as well. Weird example: imagine an operator that exposes itself using a JSON/REST interface over HTTP that accepts data on a url like /api/:gadgetID/:mapName/set?key=abc&val=def where it would edit map contents that way.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you would introduce a hard dependency on that from the eBPF operator

What is the issue with this? The eBPF operator takes care of loading and attaching eBPF programs to the kernel, why can't it also take care of creating the filtering map that is used by those programs to discard events from the kernel?

I see that the {Local,Kube}Manager operators are (1) creating the ebpf map for filtering purposes (2) enriching data, do these two operations belong to the same operator? I know both features need the container collection, but other than that I'd say they're independent.

and we have two implementations (KubeManager + LocalManager).

Are those two operator really different? We're trying to close the gap between ig and ig-k8s (see #1403 & #2100), so probably those two operators are going to finish being a single one.

Also, we want to add the possibility to completely deactivate certain operators. I think having this contained in an operator helps keeping the code clean.

It's a good point. Two things:

  • Perhaps instead of disabling the whole operator it could be an option inside the ebpf operator?
  • I see a point on disabling some operators, but {Local,Kube}Manager are so important that I don't see a point in disabling them. Running the gadgets without providing any enrichment nor filtering capabilities?

It's probably going to be only eBPF accessing it in this specific case - but for other features, I think other operators might be interested in eBPF maps as well. Weird example: imagine an operator that exposes itself using a JSON/REST interface over HTTP that accepts data on a url like /api/:gadgetID/:mapName/set?key=abc&val=def where it would edit map contents that way.

How would this work? The ebpf operator puts all maps into the gadget context, and then this "ebpf-map-updater" uses this information to update them? Besides map, which other things should the ebpf operator expose and how will other operators understand this information is available on the context? In other words, what's the contract between different operators?

Copy link
Member Author

@flyth flyth Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then you would introduce a hard dependency on that from the eBPF operator

What is the issue with this? The eBPF operator takes care of loading and attaching eBPF programs to the kernel, why can't it also take care of creating the filtering map that is used by those programs to discard events from the kernel?

I see that the {Local,Kube}Manager operators are (1) creating the ebpf map for filtering purposes (2) enriching data, do these two operations belong to the same operator? I know both features need the container collection, but other than that I'd say they're independent.

Hmm, I think I see it this way: the eBPF operator (or other layer operators) provides the framework / tooling, while data operators provide the additional content / features. IMHO it's best to split that as much as possible. What advantages do you see other than removing this particular abstraction?

and we have two implementations (KubeManager + LocalManager).

Are those two operator really different? We're trying to close the gap between ig and ig-k8s (see #1403 & #2100), so probably those two operators are going to finish being a single one.

Yeah, it would be great if we could avoid duplicate code, but IMHO the combined operator should still be handled independently of the eBPF operator.

Also, we want to add the possibility to completely deactivate certain operators. I think having this contained in an operator helps keeping the code clean.

It's a good point. Two things:

  • Perhaps instead of disabling the whole operator it could be an option inside the ebpf operator?
  • I see a point on disabling some operators, but {Local,Kube}Manager are so important that I don't see a point in disabling them. Running the gadgets without providing any enrichment nor filtering capabilities?

If we can reduce direct dependencies, I think we should. This is especially the case for the Go API that we want to provide - if using IG as a lib, it would be better to import the least amount of code to achieve what you want. Importing for example the KubeManager brings a lot of dependencies on its own (basically the whole k8s ecosystem with tons of other packages) - if you only want to use IG as a tool to load eBPF programs and leverage the tooling / exporters etc, that might be overkill. I know the enrichments are probably our KSP right now, but that doesn't mean that the rest of IG isn't worth being used on its own 😉
I think the price of having things independent of each other after this refactoring is really low and that's why I'm all for going this way. 😄 But I'm happy to discuss further.

It's probably going to be only eBPF accessing it in this specific case - but for other features, I think other operators might be interested in eBPF maps as well. Weird example: imagine an operator that exposes itself using a JSON/REST interface over HTTP that accepts data on a url like /api/:gadgetID/:mapName/set?key=abc&val=def where it would edit map contents that way.

How would this work? The ebpf operator puts all maps into the gadget context, and then this "ebpf-map-updater" uses this information to update them? Besides map, which other things should the ebpf operator expose and how will other operators understand this information is available on the context? In other words, what's the contract between different operators?

OTTOMH, this could be achieved by exposing a map as a datasource through the eBPF Operator, tagging it using metadata with something like "exported" or "public" and then having the JSON/REST operator that would expose that datasource - this would mean we'd have to extend the DataSource to be able to fetch specific data, which we probably need sooner or later anyway. (A more hacky solution for a quick PoC could probably be done somehow using the Vars on the context)

@eiffel-fl
Copy link
Member

Thank you for the mermaid graph, can you please explain why the following:

iterate over gadgetCtx().DataSources()if mntnsid found, add fields to that DataSourcesubscribe to that DataSource (unused though in GetGadgetInfo())gets accessors and caches them

is shared by both KubeManager and Formatter?
Same for this:

uses accessors to read and write data

Shouldn't we only have component which takes care of this rather than two?

@flyth
Copy link
Member Author

flyth commented Feb 21, 2024

Thank you for the mermaid graph, can you please explain why the following:

iterate over gadgetCtx().DataSources()if mntnsid found, add fields to that DataSourcesubscribe to that DataSource (unused though in GetGadgetInfo())gets accessors and caches them

is shared by both KubeManager and Formatter? Same for this:

uses accessors to read and write data

Shouldn't we only have component which takes care of this rather than two?

Both, KubeManager & Formatter are operator examples in the flow and they just do what operators are intended to do: look at what is there and decide whether they can be helpful by adding fields. So they're iterating over DataSources to see what's there and getting accessors for existing fields and fields they add - which later on are used to read from the existing fields and writing to their newly added fields. And as they're both (KubeManager & Formatter) adding different features in isolation, they both need to do that (this is where the initialization phase and getting the accessors is the "expensive" part while using the accessors later on when the data is flowing is pretty cheap).

I hope that answers the question.

Copy link
Member

@mauriciovasquezbernal mauriciovasquezbernal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more few comments. I jumping through the code trying to understand the different pieces.

pkg/datasource/router.go Outdated Show resolved Hide resolved
pkg/operators/registry.go Outdated Show resolved Hide resolved
pkg/datasource/datasource.go Outdated Show resolved Hide resolved
pkg/datasource/data.go Show resolved Hide resolved
pkg/operators/ebpf/struct.go Outdated Show resolved Hide resolved
Comment on lines 36 to 66
func (f *field) ReflectType() reflect.Type {
switch Kind(f.Type) {
default:
return nil
case Int8:
return reflect.TypeOf(int8(0))
case Int16:
return reflect.TypeOf(int16(0))
case Int32:
return reflect.TypeOf(int32(0))
case Int64:
return reflect.TypeOf(int64(0))
case Uint8:
return reflect.TypeOf(uint8(0))
case Uint16:
return reflect.TypeOf(uint16(0))
case Uint32:
return reflect.TypeOf(uint32(0))
case Uint64:
return reflect.TypeOf(uint64(0))
case Float32:
return reflect.TypeOf(float32(0))
case Float64:
return reflect.TypeOf(float64(0))
case Bool:
return reflect.TypeOf(false)
case String:
return reflect.TypeOf("")
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Array and Slice aren't considered here? I tried to add them but ig exploded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they (and String) need special handling because it's no simple conversion - if there's no reflect type, a field will be handled as binary blob / CString for now. A gadget author should be able to specify in more detail in the metadata file how the field should be encoded/decoded (so that JSON would have hex/base64 data instead of a string representation in some cases, for example). That's not a part of the PR right now - do you think it should be?

pkg/datasource/columns.go Show resolved Hide resolved
Copy link
Member

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

Some comments with regard to some commits:

Best regards.

pkg/columns/columns.go Outdated Show resolved Hide resolved
pkg/columns/columninfo.go Outdated Show resolved Hide resolved
pkg/metadata/v1/metadata.go Outdated Show resolved Hide resolved
pkg/datasource/accessors.go Show resolved Hide resolved
pkg/datasource/columns.go Outdated Show resolved Hide resolved
pkg/datasource/data.go Show resolved Hide resolved
Comment on lines +294 to +361
return true
ds.lock.RLock()
defer ds.lock.RUnlock()
return ds.requestedFields[fieldName]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Euh... I am just surprised the compiler does not catch this 😮.
So, please remove this return true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, check this one!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still in progress, will resolve once the implementation is finalized.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is still not addressed!

pkg/datasource/data.go Outdated Show resolved Hide resolved
pkg/operators/oci/oci.go Outdated Show resolved Hide resolved
pkg/operators/oci/oci.go Outdated Show resolved Hide resolved
@eiffel-fl
Copy link
Member

Thank you for the mermaid graph, can you please explain why the following:

iterate over gadgetCtx().DataSources()if mntnsid found, add fields to that DataSourcesubscribe to that DataSource (unused though in GetGadgetInfo())gets accessors and caches them

is shared by both KubeManager and Formatter? Same for this:

uses accessors to read and write data

Shouldn't we only have component which takes care of this rather than two?

Both, KubeManager & Formatter are operator examples in the flow and they just do what operators are intended to do: look at what is there and decide whether they can be helpful by adding fields. So they're iterating over DataSources to see what's there and getting accessors for existing fields and fields they add - which later on are used to read from the existing fields and writing to their newly added fields. And as they're both (KubeManager & Formatter) adding different features in isolation, they both need to do that (this is where the initialization phase and getting the accessors is the "expensive" part while using the accessors later on when the data is flowing is pretty cheap).

I hope that answers the question.

This definitely makes sense! Thank you for the precision!

@flyth flyth force-pushed the michael/refactoring-oci branch 2 times, most recently from 32b0b7a to bc29d4f Compare February 22, 2024 12:50

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Perhaps we can move these changes to a new file so it's easier in the future to remove the things we don't need. (same for other places where it could apply)

pkg/runtime/local/local.go Outdated Show resolved Hide resolved
Comment on lines 40 to 86
func (o *OciHandler) ParamDescs() params.ParamDescs {
return params.ParamDescs{
// Hardcoded for now
{
Key: authfileParam,
Title: "Auth file",
DefaultValue: oci.DefaultAuthFile,
TypeHint: params.TypeString,
},
{
Key: validateMetadataParam,
Title: "Validate metadata",
Description: "Validate the gadget metadata before running the gadget",
DefaultValue: "true",
TypeHint: params.TypeBool,
},
{
Key: insecureParam,
Title: "Insecure connection",
Description: "Allow connections to HTTP only registries",
DefaultValue: "false",
TypeHint: params.TypeBool,
},
{
Key: pullParam,
Title: "Pull policy",
Description: "Specify when the gadget image should be pulled",
DefaultValue: oci.PullImageMissing,
PossibleValues: []string{
oci.PullImageAlways,
oci.PullImageMissing,
oci.PullImageNever,
},
TypeHint: params.TypeString,
},
{
Key: pullSecret,
Title: "Pull secret",
Description: "Secret to use when pulling the gadget image",
TypeHint: params.TypeString,
},
}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering how this will be handled. We can't rely on GetOCIGadgetInfo because these parameters are needed to pull the image, so I suppose somehow we'll have to hardcode them in the client so they're already available when calling GetOCIGadgetInfo.

Also, the commit message introducing this file says it's not an operator anymore, so how will be these parameters added to the CLI?

I gave it a try on 4f754cf, is this aligned with the ideas you had?

flyth and others added 15 commits March 27, 2024 19:33
Adds support to serialize Data units from DataSources to JSON.

Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
This implements the new gadget lifecycle used by the image based gadgets.

Signed-off-by: Michael Friese <mfriese@microsoft.com>
This adds a new generic handler for image based gadgets for all our cobra based
binaries.

Signed-off-by: Michael Friese <mfriese@microsoft.com>
This adds a generic operator that can have multiple callbacks installed without
building a dedicated operator. This is useful to for example only extract data
from the DataSources.

Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Mauricio Vásquez <mauriciov@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Signed-off-by: Michael Friese <mfriese@microsoft.com>
Copy link

@vfalico vfalico left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@flyth
Copy link
Member Author

flyth commented Mar 27, 2024

Thank you all so much for all the great and super in-depth reviews on all levels! I know this was again tough, but I'm so glad that we managed to get this through!

I'll compile a list of open issues and link them in a new issue so we can follow up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants