Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Any plans to implement telemetry when using Bamboo deliver functions? #599

Open
aleDsz opened this issue May 12, 2021 · 12 comments
Open

Any plans to implement telemetry when using Bamboo deliver functions? #599

aleDsz opened this issue May 12, 2021 · 12 comments

Comments

@aleDsz
Copy link

aleDsz commented May 12, 2021

Hi, my team was thinking about how we can implement instrumentations around Bamboo context and I thought about if we implement :telemetry app inside Bamboo, we can get all data from start, stop and exception events.

So, only to check y'all have plans to do it, we'll wait. If not, I'm going to think about contributing this repo.

@germsvel
Copy link
Collaborator

Hi @aleDsz I'm glad you asked this question! I've been thinking that telemetry would be a nice addition to Bamboo. But I haven't had the need for it yet in a production project, so I don't know which events would be helpful to create.

I'd like to start the conversation by trying to figure out which events we should send.

I thought about if we implement :telemetry app inside Bamboo, we can get all data from start, stop and exception events

I like that idea for start, stop, and exception events for each of the deliver functions. There are two cases we need to handle:

  • The deliver_now case seems somewhat straightforward since the email is sent synchronously. From my understanding of the telemetry library, we could use :telemetry.span/3 for that, and it would capture all those events for us.

  • But I am curious what you think we should do with deliver_later, since that is executed asynchronously.

    • Would we want the start to be when the email is enqueued for delivery and stop when the email is sent? Or,
    • Would start be when the email is about to be sent in the deliver later strategy (e.g. TaskSupervisorStrategy) and stop when the email is sent?

I'd love to hear your thoughts on what events would be good for deliver_later

And if you're interested, these are some other question I have in my mind related to this work:

  • What metadata should we send?
  • What should we name the events? (I'm thinking [:bamboo, :deliver_now, :start], [:bamboo, :deliver_now, :stop], etc.)
  • Are there other events that you think might be useful?
  • Should adapters trigger their own events in addition to or instead of Bamboo? (similar to how Ecto has adapter-specific events)
  • How should we provide a set of standard functions so that users who have custom delivery strategies can trigger the same telemetry events?

@aleDsz
Copy link
Author

aleDsz commented May 14, 2021

hi @germsvel.

I liked ur second option for deliver_later, it would be awesome to track all async delivery, so we can log, trace and audit everything.

I'm working in a company that uses Bamboo, and we use a lot of telemetry handlers to trace every action inside our system, as we can observe if:

  • there's something with a bad performance;
  • when we need to audit some of our actions, or
  • we need to trace an entire flow to get where's the problem when an incident occours

So, answering your questions:

  • What metadata should we send?

I would say, the: duration, from, to, subject, headers, body and attachments

  • What should we name the events? (I'm thinking [:bamboo, :deliver_now, :start], [:bamboo, :deliver_now, :stop], etc.)

I liked this, because we can have deliver_now and deliver_later

  • Are there other events that you think might be useful?

We connect to API's when using e.g. Sendgrid, right? Can we have events when using the API?

I'm not so sure if its a good idea, but we can check with another Bamboo users/Adapter maintainers.

  • How should we provide a set of standard functions so that users who have custom delivery strategies can trigger the same telemetry events?

I like the Middleware implementation in Tesla, you can implement everything you want, so you can implement the telemetry on start and on the end and pass data though the process.

Their telemetry middleware can do all telemetry events using the same behaviour of t:Tesla.Middleware

I'm not so sure how the custom delivery strategies works but, we have the same behaviour as the default delivery strategy, right? So we can track if from the same behaviour and send start before the custom delivery strategies and after.

OR

We can export those functions allowing the custom delivery strategies to send the telemetry as they want.

@gugahoa
Copy link

gugahoa commented May 20, 2021

Hi @germsvel!
I'm from the same team as @aleDsz, we're quite interested in this addition since we invest heavily into observability with OpenTelemetry—which is not related to :telemetry but it's easier to use otel if the library uses :telemetry.

But I am curious what you think we should do with deliver_later, since that is executed asynchronously.

deliver_later works similar to messaging, I would say. So we can adopt the semantic conventions that OpenTelemetry sets for messaging. It can be found here (I tried really hard to find an already rendered version of it, but didn't so I'm sorry for sending a wall of yml to you as reference). edit: found the markdown rendered one https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/messaging.md

Since it's a wall of yml, I'll try to surmise the approach: There would be a :"deliver_later.produce" and a :"deliver_later.consume". The produce would be sent when the user calls deliver_later, and the consume would be triggered by the delivery strategy when it in fact "consumes" the message.

What metadata should we send?

The whole email struct sounds nice. Also things that could have changed the email struct (such as Interceptors), and things that do side effect using the email struct (such as Adapters).

As for metadata in delivery_later, we should add the delivery later strategy module being used.

How should we provide a set of standard functions so that users who have custom delivery strategies can trigger the same telemetry events?

This is a hard question to answer, since the delivery strategy behavior gives users total control—enough so that users can choose not to send the telemetry events if they don't want to. I believe the best thing we could do, without breaking changes, is to document really well the telemetry events and when do we expect the user to call it.

@germsvel
Copy link
Collaborator

Thanks for all that input @gugahoa and @aleDsz.

Here's my understanding at the moment. And I'd love to hear if this is what you're thinking too.

Tracking two functions

We would track two functions, deliver_now and deliver_later.

  1. For deliver_now, we could use :telemetry.span/3 with [:bamboo, :deliver_now]. That means we would get the following events:
  • [:bamboo, :deliver_now, :start] (with system_time in metadata)
  • [:bamboo, :deliver_now, :stop] (with duration in metadata)
  • [:bamboo, :deliver_now, :exception] (with duration in metadata)
  1. For deliver_later, we have something slightly more complicated. We could use :telemetry.execute/3 to log these events:
  • [:bamboo, :deliver_later, :start] (with system_time in metadata)
  • [:bamboo, :deliver_later, :stop]
  • [:bamboo, :deliver_later, :exception]

Like Gustavo mentioned, the start event would be equivalent to the "produce", and that would happen in Bamboo's main library. The stop and exception events (Gustavo's "consume" events) would happen in the deliver-later strategy.

What about duration for deliver_later?

As far as I can tell, we can't create the duration metadata that a :telemetry.span/3 creates for the :stop and :exception events since they happen in different process from the :start event.

We could add a monotonic_time: System.mononic_time() in the metadata for :start, :stop, and :exception. That way, the consumer of the events can stitch together the duration by taking the difference. It's not ideal, but at least they have the data.

What do you think about that? Is there a better way to try to calculate duration for deliver_later? Or is it unnecessary?

What about deliver_now! and deliver_later!?

The deliver_x functions have their ! counterparts, which raise on errors instead of returning an error tuple. So I wanted to think about whether we should log those ! functions as separate namespaces or if that should only affect whether we log a :stop event or an :exception event.

For example, consider this scenario:

If someone sends an email via deliver_now!, and it fails to send (thus raising an error), we should log an :exception event. But that same email via deliver_now would return an {:error, error}, so we should log a :stop event with the error in the metadata.

The question is, does that mean we should create four different event namespaces?

  • [:bamoo, :deliver_now]
  • [:bamoo, :deliver_now!]
  • [:bamoo, :deliver_later]
  • [:bamoo, :deliver_later!]

My guess is no. I imagine we could combine deliver_now with deliver_now! and deliver_later with deliver_later!. That leaves us with two event namespaces:

  • [:bamoo, :deliver_now]
  • [:bamoo, :deliver_later]

If there's an error, and the user called deliver_now, we log [:bamoo, :deliver_now, :stop]. But if the user called deliver_now!, we log [:bamoo, :deliver_now, :exception].

Does that sound good to you two?

Creating a Bamboo.Telemetry with telemetry helpers

I think we should move all the telemetry logic behind a Bamboo.Telemetry module, which we can then use from the rest of the codebase.

That will also make it easy for people who have custom deliver-later strategies to use the same functions and trigger events as needed.

Metadata

The only question I still have about metadata is how much data to send.

I have a concern that if we send the whole email struct (or even the body), could leak sensitive information. For example:

  • What if it's an email with a token for a reset password? Or
  • What if it has account numbers?

I think we shouldn't leak that info, so I'd like to hear your thoughts on what we could do about that.

@maartenvanvliet
Copy link
Member

I'd also would love to see telemetry events in bamboo! Great work so far!

Regarding the leaking of sensitive information, I think handling this is ultimately the responsibility of the telemetry handlers subscribing to the events. It's unknown to Bamboo what the user will do with the metadata so it can send everything and leave it to the user. Most instrumentation libraries that hook into telemetry have their own ways to filter data.

Helpful guide for telemetry https://keathley.io/blog/telemetry-conventions.html

@gugahoa
Copy link

gugahoa commented May 23, 2021

Like Gustavo mentioned, the start event would be equivalent to the "produce", and that would happen in Bamboo's main library. The stop and exception events (Gustavo's "consume" events) would happen in the deliver-later strategy.

What I suggested is a bit different. We would have two :telemetry.span calls, one with the event being :"deliver_later.produce", emitted by Bamboo's main library, and another event being :"deliver_later.consume" which is the responsibility of the user to implement, if they have custom delivery strategies.

This way we guarantee that we have the start, stop and exception events at least for one of the "legs" in the deliver_later flow.

What about duration for deliver_later?

As far as I can tell, we can't create the duration metadata that a :telemetry.span/3 creates for the :stop and :exception events since they happen in different process from the :start event.

We could add a monotonic_time: System.mononic_time() in the metadata for :start, :stop, and :exception. That way, the consumer of the events can stitch together the duration by taking the difference. It's not ideal, but at least they have the data.

What do you think about that? Is there a better way to try to calculate duration for deliver_later? Or is it unnecessary?

Going with two events (thus two :telemetry.span calls) frees us from thinking too hard about this.

What about deliver_now! and deliver_later!?

The deliver_x functions have their ! counterparts, which raise on errors instead of returning an error tuple. So I wanted to think about whether we should log those ! functions as separate namespaces or if that should only affect whether we log a :stop event or an :exception event.

For example, consider this scenario:

If someone sends an email via deliver_now!, and it fails to send (thus raising an error), we should log an :exception event. But that same email via deliver_now would return an {:error, error}, so we should log a :stop event with the error in the metadata.

The question is, does that mean we should create four different event namespaces?

* `[:bamoo, :deliver_now]`

* `[:bamoo, :deliver_now!]`

* `[:bamoo, :deliver_later]`

* `[:bamoo, :deliver_later!]`

My guess is no. I imagine we could combine deliver_now with deliver_now! and deliver_later with deliver_later!. That leaves us with two event namespaces:

* `[:bamoo, :deliver_now]`

* `[:bamoo, :deliver_later]`

If there's an error, and the user called deliver_now, we log [:bamoo, :deliver_now, :stop]. But if the user called deliver_now!, we log [:bamoo, :deliver_now, :exception].

Does that sound good to you two?

It does 🔝

Creating a Bamboo.Telemetry with telemetry helpers

I think we should move all the telemetry logic behind a Bamboo.Telemetry module, which we can then use from the rest of the codebase.

That will also make it easy for people who have custom deliver-later strategies to use the same functions and trigger events as needed.

💯

Metadata

The only question I still have about metadata is how much data to send.

I have a concern that if we send the whole email struct (or even the body), could leak sensitive information. For example:

* What if it's an email with a token for a reset password? Or

* What if it has account numbers?

I think we shouldn't leak that info, so I'd like to hear your thoughts on what we could do about that.

I have the same opinion as @maartenvanvliet on this one

@germsvel
Copy link
Collaborator

Metadata

I think @maartenvanvliet made a good point, so I think we could proceed with passing as much data as possible.

Deliver later events

@gugahoa thank you for that clarification:

What I suggested is a bit different. We would have two :telemetry.span calls, one with the event being :"deliver_later.produce", emitted by Bamboo's main library, and another event being :"deliver_later.consume" which is the responsibility of the user to implement, if they have custom delivery strategies.

This way we guarantee that we have the start, stop and exception events at least for one of the "legs" in the deliver_later flow.

So, to confirm, these would be the initial Bamboo events:

# deliver_now
[:bamboo, :deliver_now, :start] 
[:bamboo, :deliver_now, :stop] 
[:bamboo, :deliver_now, :exception] 

#deliver_later produce
[:bamboo, :deliver_later, :produce, :start] 
[:bamboo, :deliver_later, :produce, :stop] 
[:bamboo, :deliver_later, :produce, :exception] 

#deliver_later consume
[:bamboo, :deliver_later, :consume :start] 
[:bamboo, :deliver_later, :consume, :stop] 
[:bamboo, :deliver_later, :consume, :exception] 

Do you know if produce and consume are standard telemetry terms, just like start, stop, and exception are, or are there better terms we could use for that?

If we think produce and consume are good terms (or if we find better ones), I believe we could start work on this issue. I don't have time this week to work on this, but I'd be happy to review a pull request with those changes.

@germsvel
Copy link
Collaborator

After a little more thought, I think we shouldn't use :deliver_later, :produce and :deliver_later, :consume.

I now think that it would be better to keep things as standard as possible, so instead we could have :deliver_later and :deliver_later_strategy.

That way, we get the following events:

# deliver_now
[:bamboo, :deliver_now, :start] 
[:bamboo, :deliver_now, :stop] 
[:bamboo, :deliver_now, :exception] 

#deliver_later
[:bamboo, :deliver_later, :start] 
[:bamboo, :deliver_later, :stop] 
[:bamboo, :deliver_later, :exception] 

#deliver_later_strategy (span when the deliver later strategy gets executed)
[:bamboo, :deliver_later_strategy, :start] 
[:bamboo, :deliver_later_strategy, :stop] 
[:bamboo, :deliver_later_strategy, :exception] 

I like that better. And that seems good enough to get started. If we discover more need for other telemetry events, we can add those when the need arises.

@gugahoa @maartenvanvliet @aleDsz any thoughts on that?

@aleDsz
Copy link
Author

aleDsz commented Jun 10, 2021

@germsvel I liked it

@hamzahejja
Copy link

@germsvel Hello there! I'd like to inquire if this is still open for contributions.

@germsvel
Copy link
Collaborator

Thanks for reaching out @hamzahejja. I'm unfortunately no longer Bamboo's maintainer (due to time constraints). But I still have access to make changes. And this telemetry functionality was one of the last things I really wanted to get in.

So, long story short, it's definitely open to contributions, and I would happily review a PR for this!

@hamzahejja
Copy link

hamzahejja commented Jan 3, 2022

@germsvel I'd love to add the functionality, but let me recap the things discussed above really quick:

  • We need to add a Telemetry Supervisor
  • We need to add a generic module that emits the events for deliver_now and deliver_later strategies.
  • We'll utilize the :telemetry.span/3 function to emit the start and stop of successful calls, and start and exception for when the strategies raise an error/exception.
  • We'll emit the following events:
# deliver_now
[:bamboo, :deliver_now, :start] 
[:bamboo, :deliver_now, :stop] 
[:bamboo, :deliver_now, :exception] 

#deliver_later
[:bamboo, :deliver_later, :start] 
[:bamboo, :deliver_later, :stop] 
[:bamboo, :deliver_later, :exception] 

#deliver_later_strategy (span when the deliver later strategy gets executed)
[:bamboo, :deliver_later_strategy, :start] 
[:bamboo, :deliver_later_strategy, :stop] 
[:bamboo, :deliver_later_strategy, :exception] 
  • There's just one question left in my head, what's our conclusion on the metadata accompanied by the event spanned ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants