Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Activity/OpenTelemetry support #1499

Closed
pengweiqhca opened this issue Jan 5, 2023 · 43 comments
Closed

Add Activity/OpenTelemetry support #1499

pengweiqhca opened this issue Jan 5, 2023 · 43 comments
Labels
enhancement New feature or request

Comments

@pengweiqhca
Copy link

Is it possible to wrap the calling server around the Activity? So that, I can get elapsed time per calling.

@jstedfast jstedfast added the wontfix This will not be worked on label Jan 5, 2023
@jstedfast
Copy link
Owner

You could subclass any class in MailKit to add whatever functionality you want.

I don't want to take a dependency on OpenTelemetry because most of my users won't want it.

@EugeneKrapivin
Copy link

@jstedfast it doesn't require any external dependency. In-order to implement tracing and metrics in dotnet you use constructs from dotnet, namely System.Diagnostics.Metrics and System.Diagnostics.ActivitySource.

@jstedfast
Copy link
Owner

@EugeneKrapivin right, but I'm talking about not wanting to add the OpenTelemetry NuGet dependency.

Maybe I'm just misunderstanding the feature request?

@EugeneKrapivin
Copy link

@jstedfast To get the library instrumented (tracing, metrics, logs) you don't have to use any external dependencies as the maintainer of this library. The users of your library, if they chose to export those traces, logs and metrics will use OpenTelemetry libraries.

I'm working on a small POC, and actually using MailKit (that's how I wound up here), so I have a handy example:

In a class called SmtpSenderService I have:

public SmtpSenderService(ILogger<SmtpSenderService> logger, IMeterFactory meterFactory)
{
    _logger = logger;

    _meterFactory = meterFactory;
    _meter = _meterFactory.Create("SMTPSender"); // remember this part

    _createClientHistogram = _meter.CreateHistogram<double>("smtp_create_client", "ms");
    _connectClientHistogram = _meter.CreateHistogram<double>("smtp_connect_client", "ms");
    _sendMessageHistogram = _meter.CreateHistogram<double>("smtp_send", "ms");
    _exceptionsCounter = _meter.CreateCounter<long>("smtp_exception", "exception");
}
private static ActivitySource source = new("SmptSenderService", "1.0.0"); // remember this part

ILogger, IMeterFactory, Meter and ActivitySource are part of

using System.Diagnostics;
using System.Diagnostics.Metrics;
using Microsoft.Extensions.Logging;

namespaces, which require no external dependencies.
In some SendMessage method I have the following code:

if (_client is null)
{                
    var startCreate = Stopwatch.GetTimestamp();
   
    _client ??= new SmtpClient();
    
    var startConnect = Stopwatch.GetTimestamp();
    
    using (var _ = source.StartActivity("Smtp.Connect", ActivityKind.Client))
        await _client.ConnectAsync(config.ServerUrl, int.Parse(config.ServerPort), false, cts.Token);
    
    using (var _ = source.StartActivity("Smtp.Authenticate", ActivityKind.Client))
        await _client.AuthenticateAsync(config.Username, config.Password, cts.Token);
    
    var elapsedConnect = Stopwatch.GetElapsedTime(startConnect);
    var elapsedCreate = Stopwatch.GetElapsedTime(startCreate);

    _createClientHistogram.Record(elapsedCreate.TotalMilliseconds);
    _connectClientHistogram.Record(elapsedConnect.TotalMilliseconds, new KeyValuePair<string, object?>("flow", "create"));
}

and

using var sendActivity = source.StartActivity("Smtp.Send", ActivityKind.Client);
try
{
    var r = await _client.SendAsync(message, cts.Token);
    var total = Stopwatch.GetElapsedTime(start);
    
    sendActivity?.AddTag("status", "success");
    sendActivity?.AddEvent(new ActivityEvent("email sent"));
    _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "success"));
}
catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.RecipientNotAccepted)
{
    _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
    var total = Stopwatch.GetElapsedTime(start);

    _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
    _exceptionsCounter.Add(1,
            new KeyValuePair<string, object?>("exception", ex.GetType().Name),
            new KeyValuePair<string, object?>("flow", "send"));

    sendActivity?.AddTag("status", "failed");
    sendActivity?.AddTag("exception", ex.GetType().Name);
    sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

    return new(false, $"failed with {ex.ErrorCode}");
}
// the catch clauses continue, just to make sure I catch all the exceptions possible and log them properly

As I've said before, none of this code requires any external dependencies. Having this code in MailKit could really help with observability and monitoring for mailkit users.

In a different place in my service (the host part) I have my dependencies on OpenTelemetry:

builder.Services
    .AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.AddSource("SmptSenderService");
    })
    .WithMetrics(metrics =>
    {
        metrics
            .AddMeter("SMTPSender");
    });

these lines make sure that when traces and metrics are exported, my metrics and traces are also exported.
Once exported I have my metrics:
image

and some fancy traces:
image

and because logs are also enriched when exported:
image

Having MailKit instrumentation out of the box would be really helpful in reducing boilerplate code for monitoring the system with traces and quality metrics.

@jstedfast
Copy link
Owner

Okay, how do you propose adding this to MailKit?

@EugeneKrapivin
Copy link

EugeneKrapivin commented Jan 3, 2024

Well I'd suggest:

  • Metrics - wrapping IO operations in timers (histograms of TimeSpan in my code examples)
    • adding exception/error counters with appropriate tagging
  • Tracing - wrapping IO operations with Activities and Activity Events.
  • Adding logs using standard logging APIs in dotnet, which ultimately get collected and exported (processes that should not concern you as a library maintainer)

This is probably a huge ask, and I think it should be opened for discussion to see if it's worth the investment (aka will users of the library use it).
For the purposes of my POC it was worth it because it gave me the data I needed. However, it wasn't a clean insert :) . There was a lot of copy-paste engineering on my part to cover all exceptions and operations (Create, Connect, NoOp, Send, Disconnect).

As a trend the dotnet community pushes toward adding instrumentation to libraries (especially IO bound libraries) so that distributed tracing (which is priceless in debugging microservices plane) could be done through OpenTelemetry which is the defacto standard for observability in virtually all languages.

just as an example, my PoC uses: ASP.net minimal APIs, KafkaFlow, Orleans, Azure Tables (for orleans storage) and MailKit.
Except MailKit all the others have OOTB Tracing instrumentation, passing the tracing context between ASP.net, kafkaflow, orleans and AzureTables SDK was effortless. All I had to do is add their names to the OpenTelemetry configuration.
After wrapping some MailKit calls in activities:

image

which is accompanied by a correlated log:

{
	"body": "failed sending an email with error SenderNotAccepted",
	"traceid": "b62dd4170b6c0b0da30a6dce0c0fcdf5",
	"spanid": "b37a86f71fc2c09f",
	"severity": "Error",
	"flags": 1,
	"attributes": {
		"ParentId": "e5e6b9e3c5dce273",
		"SpanId": "b37a86f71fc2c09f",
		"TraceId": "b62dd4170b6c0b0da30a6dce0c0fcdf5",
		"errorCode": "SenderNotAccepted",
		"exception.message": "Invalid sender user681@mailhog.local",
		"exception.stacktrace": "MailKit.Net.Smtp.SmtpCommandException: Invalid sender user681@mailhog.local\r\n   at MailKit.Net.Smtp.SmtpClient.SendAsync(FormatOptions options, MimeMessage message, MailboxAddress sender, IList`1 recipients, CancellationToken cancellationToken, ITransferProgress progress)\r\n   at MailKit.Net.Smtp.SmtpClient.SendAsync(FormatOptions options, MimeMessage message, MailboxAddress sender, IList`1 recipients, CancellationToken cancellationToken, ITransferProgress progress)\r\n   at Processor.Grains.SmtpSenderService.SendMessage(SmtpConfig config, EmailRequest emailRequest) in C:\\Users\\nocgod\\source\\repos\\EmailServicePoc\\Processor\\Processor.Grains\\SmtpSenderService.cs:line 138",
		"exception.type": "SmtpCommandException",
		"{OriginalFormat}": "failed sending an email with error {errorCode}"
	},
	"resources": {
		"service.instance.id": "243d4bd2-79e4-4ed1-a308-4b8d37bc7665",
		"service.name": "processor.host",
		"telemetry.sdk.language": "dotnet",
		"telemetry.sdk.name": "opentelemetry",
		"telemetry.sdk.version": "1.7.0"
	},
	"instrumentation_scope": {
		"name": "Processor.Grains.SmtpSenderService"
	}
}

@jstedfast
Copy link
Owner

jstedfast commented Jan 3, 2024

Well I'd suggest:

  • Metrics - wrapping IO operations in timers (histograms of TimeSpan in my code examples)
    • adding exception/error counters with appropriate tagging

Right, but how should MailKit provide this info to the app?

  • Tracing - wrapping IO operations with Activities and Activity Events.

Again, how should MailKit provide this info? via events? via some new IClientMetrics interface that can be set on any of the clients?

I'm not sure what would make the most sense.

Adding logs using standard logging APIs in dotnet, which ultimately get collected and exported (processes that should not concern you as a library maintainer)

Currently, MailKit does not depend on Microsoft.Extensions.Logging or Microsoft.Extensions.DependencyInjection, so this would be awkward.

I've been thinking about moving to DependencyInjection as an alternative to MimeKit.ParserOptions.Register() and MimeKit.Cryptography.CryptographyContext.Register() (especially with the request to make MimeKit depend on MimeKitLite rather than being more-or-less API duplicates of each other), but so far I haven't gotten around to that or even figured out exactly how I want to do that...

What do those other libraries you are using do? It sounds like they just assume you'll be logging telemetry/metrics/etc and only really leave up the configuration to the developers, but MailKit is in a bit of a different situation because it's not really designed specifically for ASP.NET apps where you can pretty much count on everyone wanting that kind of telemetry logging.

@EugeneKrapivin
Copy link

EugeneKrapivin commented Jan 3, 2024

Right, but how should MailKit provide this info to the app?

it should not concern the library. As long as you use System.Diagnostics.Metrics.Meter you and document properly the name of the meter, all that I need as an app developer is the name of the meter. Same goes for tracing, as long as you use activities and properly document the source name, that's enough. The following is code from my PoC:

builder.Services
    .AddOpenTelemetry()
    .WithMetrics(metrics =>
    {
        metrics
            .AddRuntimeInstrumentation()
            .AddMeter("Microsoft.Orleans")
            .AddBuiltInMeters();
    })
    .WithTracing(tracing =>
    {
        if (builder.Environment.IsDevelopment())
        {
            // We want to view all traces in development
            tracing.SetSampler(new AlwaysOnSampler());
        }

        tracing
            .AddSource("Azure.*")
            .AddSource("Microsoft.Orleans.Runtime")
            .AddSource("Microsoft.Orleans.Application")
            .AddSource(KafkaFlowInstrumentation.ActivitySourceName)
            .AddAspNetCoreInstrumentation()
            .AddGrpcClientInstrumentation()
            .AddHttpClientInstrumentation(o => {
                o.FilterHttpRequestMessage = (_) => Activity.Current?.Parent?.Source?.Name != "Azure.Core.Http";
            });
    });

Notice that Microsoft Orleans is added as a string (which they provide in their documentation) while KafkaFlow opted for releasing a little nuget with a public static readonly string to avoid magic strings. This setup code IS dependent on OpenTelemetry, however it should not concern you as a library developer.

Currently, MailKit does not depend on Microsoft.Extensions.Logging or Microsoft.Extensions.DependencyInjection, so this would be awkward.

you could expose a logging adapter and a logging abstraction like KafkaFlow does.

I've been thinking about moving to DependencyInjection as an alternative to MimeKit.ParserOptions.Register() and MimeKit.Cryptography.CryptographyContext.Register() (especially with the request to make MimeKit depend on MimeKitLite rather than being more-or-less API duplicates of each other), but so far I haven't gotten around to that or even figured out exactly how I want to do that...

I guess this will require a bigger refactoring that should probably be out-of-scope for the purposes of observability instrumentation.

What do those other libraries you are using do? It sounds like they just assume you'll be logging telemetry/metrics/etc and only really leave up the configuration to the developers, but MailKit is in a bit of a different situation because it's not really designed specifically for ASP.NET apps where you can pretty much count on everyone wanting that kind of telemetry logging.

for tracing and metrics they use standard dotnet APIs:

which are not dependant on ASP.NET specifically, as you can use them in a plain worker service without dependency on the Web SDKs.

The logging is a design decision you could make:

  • embrace the microsoft logging API
  • abstract logging and provide an adapter to microsoft API

@jstedfast jstedfast reopened this Jan 8, 2024
@jstedfast jstedfast added enhancement New feature or request and removed wontfix This will not be worked on labels Jan 8, 2024
@jstedfast
Copy link
Owner

@EugeneKrapivin can you help define which metrics (and other info) you want/need? I'll put this in the queue for a future v5.0 release but first I need to know what specific metrics developers care about having.

(Note: it seems like there is growing demand for a unification of MimeKit and MimeKitLite which is going to require me to work on a v5.0 sooner rather than later to split MimeKit into MimeKit.Core and MimeKit.Cryptography - this seems like the perfect opportunity to add in metrics & tracing data).

@EugeneKrapivin
Copy link

EugeneKrapivin commented Jan 24, 2024

@jstedfast sorry for the late reply.
I'd add:

  • Metrics:
    • histograms to measure any operation time in the library (like SMTP connect, send, no op, disconnect, etc').
    • counters (monotonically increasing counters) for errors and exceptions with tags/labels to make sure the metrics are dissectible for the user to setup alerts in case a high error rate is detected
    • I'd add gauges to hold count of the number of active clients/connects there are, it'd also be useful to monitor run-away situations with connections
  • Traces - any request should open an activity that would make it possible to see Mailkit IO interactions in a distributed trace (think about a flow of: ASP.Net minimal APIs -> KafkaFlow -> Microsoft Orleans grain -> [ Azure storage table SDK, SMTP client] like I've pasted earlier in this thread)
  • Logs - Use microsoft ILogger logging abstraction, Which is easily configurable to be exported open telemetry parameters using standard dotnet and open telemetry libraries (exporting is not a concern of mail kit).

I'm adding some code from my POC project in which I've added some metrics, traces and logs around some simple usage of the SMTP client. Please do not take this example as best practice =) I'd suggest taking a look at the best practices from msft documentation :)

// note that the SmtpClient is not thread safe (we shouldn't be using reentrant here)
[StatelessWorker]
public class SmtpSenderService : Grain, ISmptSenderService, IDisposable
{
    private bool _disposedValue;
    private SmtpClient _client;
    private readonly ILogger<SmtpSenderService> _logger;

    private readonly IMeterFactory _meterFactory;

    private Meter _meter;
    private Histogram<double> _createClientHistogram;
    private Histogram<double> _connectClientHistogram;
    private Histogram<double> _sendMessageHistogram;
    private Counter<long> _exceptionsCounter;

    public SmtpSenderService(ILogger<SmtpSenderService> logger, IMeterFactory meterFactory)
    {
        _logger = logger;

        _meterFactory = meterFactory;
        _meter = _meterFactory.Create("SMTPSender");

        _createClientHistogram = _meter.CreateHistogram<double>("smtp_create_client", "ms");
        _connectClientHistogram = _meter.CreateHistogram<double>("smtp_connect_client", "ms");
        _sendMessageHistogram = _meter.CreateHistogram<double>("smtp_send", "ms");
        _exceptionsCounter = _meter.CreateCounter<long>("smtp_exception", "exception");
    }
    private static ActivitySource source = new("SmptSenderService", "1.0.0");

    public async Task<SendResult> SendMessage(SmtpConfig config, EmailRequest emailRequest)
    {
        var cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
        try
        {
            if (_client is null)
            {                
                var startCreate = Stopwatch.GetTimestamp();
               
                _client ??= new SmtpClient();

                var startConnect = Stopwatch.GetTimestamp();
                
                using (var _ = source.StartActivity("Smtp.Connect", ActivityKind.Client))
                    await _client.ConnectAsync(config.ServerUrl, int.Parse(config.ServerPort), false, cts.Token);
                
                using (var _ = source.StartActivity("Smtp.Authenticate", ActivityKind.Client))
                    await _client.AuthenticateAsync(config.Username, config.Password, cts.Token);
                
                var elapsedConnect = Stopwatch.GetElapsedTime(startConnect);
                var elapsedCreate = Stopwatch.GetElapsedTime(startCreate);

                _createClientHistogram.Record(elapsedCreate.TotalMilliseconds);
                _connectClientHistogram.Record(elapsedConnect.TotalMilliseconds, new KeyValuePair<string, object?>("flow", "create"));
            }
            
            var reconnect = false;
            try
            {
                using var _ = source.StartActivity("Smtp.NoOp", ActivityKind.Client);
                await _client.NoOpAsync();
            }
            catch (Exception ex)
            {
                reconnect = true;
            }

            if (!_client.IsConnected || reconnect)
            {
                var startConnect = Stopwatch.GetTimestamp();
                
                using (var _ = source.StartActivity("Smtp.Connect", ActivityKind.Client))
                    await _client.ConnectAsync(config.ServerUrl, int.Parse(config.ServerPort), false, cts.Token);
                
                using (var _ = source.StartActivity("Smtp.Authenticate", ActivityKind.Client))
                    await _client.AuthenticateAsync(config.Username, config.Password, cts.Token);

                var elapsedConnect = Stopwatch.GetElapsedTime(startConnect);
                _connectClientHistogram.Record(elapsedConnect.TotalMilliseconds, new KeyValuePair<string, object?>("flow", "create"));
            }

            DelayDeactivation(TimeSpan.FromSeconds(30));
        }
        catch (AuthenticationException ex)
        {
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "connect"));

            _logger.LogError(ex, "failed authenticating with smtp server");
        }
        catch (Exception ex) when (ex is OperationCanceledException || ex is TimeoutException)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to a timeout", emailRequest.MessageId);
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "connect"));
            return new(false, "timeout");
        }
        catch (Exception ex)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to an exception", emailRequest.MessageId);
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "connect"));
            return new(false, "exception");
        }

        var message = CreateMimeMessage(config, emailRequest);

        var start = Stopwatch.GetTimestamp();
        using var sendActivity = source.StartActivity("Smtp.Send", ActivityKind.Client);
        try
        {
            var r = await _client.SendAsync(message, cts.Token);

            var total = Stopwatch.GetElapsedTime(start);
            
            sendActivity?.AddTag("status", "success");
            sendActivity?.AddEvent(new ActivityEvent("email sent"));
            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "success"));
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.RecipientNotAccepted)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.SenderNotAccepted)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.MessageNotAccepted)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpCommandException ex) when (ex.ErrorCode == SmtpErrorCode.UnexpectedStatusCode)
        {
            _logger.LogError(ex, "failed sending an email with error {errorCode}", ex.ErrorCode);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);
            sendActivity?.AddTag("smtp.error_code", ex.ErrorCode);

            return new(false, $"failed with {ex.ErrorCode}");
        }
        catch (SmtpProtocolException ex)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to smtp protocol exception", emailRequest.MessageId);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds, new KeyValuePair<string, object?>("status", "failed"));
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);

            return new(false, ex.Message);
        }
        catch (IOException ex)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to an IO exception", emailRequest.MessageId);
            var total = Stopwatch.GetElapsedTime(start);

            _sendMessageHistogram.Record(total.TotalMicroseconds);
            _exceptionsCounter.Add(1,
                new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);

            return new(false, ex.Message);
        }
        catch (Exception ex) when (ex is OperationCanceledException || ex is TimeoutException)
        {
            _logger.LogError(ex, "failed sending message {messageId} due to a timeout", emailRequest.MessageId);
            _exceptionsCounter.Add(1,
                    new KeyValuePair<string, object?>("exception", ex.GetType().Name),
                    new KeyValuePair<string, object?>("flow", "send"));

            sendActivity?.AddTag("status", "failed");
            sendActivity?.AddTag("exception", ex.GetType().Name);

            return new(false, "timeout");
        }
        finally
        {
            sendActivity?.Dispose(); // avoid nesting of Disconnect under the Send activity
            using var _ = source.StartActivity("Smtp.Disconnect", ActivityKind.Client);
            await _client.DisconnectAsync(true, cts.Token);
        }

        return new(true, "sent");
    }

    private static MimeMessage CreateMimeMessage(SmtpConfig config, EmailRequest emailRequest)
    {
        var message = new MimeMessage();
        message.From.Add(new MailboxAddress(config.FromName, config.From));
        message.To.Add(new MailboxAddress(emailRequest.RecipientName, emailRequest.Recipient));
        message.Subject = emailRequest.Subject;

        message.Body = new TextPart("plain")
        {
            Text = emailRequest.Body
        };
        return message;
    }

    protected virtual void Dispose(bool disposing)
    {
        if (!_disposedValue)
        {
            if (disposing)
            {
                _logger.LogWarning("disposing of connection");
                _client?.Dispose();
            }
            _disposedValue = true;
        }
    }

    public void Dispose()
    {
        // Do not change this code. Put cleanup code in 'Dispose(bool disposing)' method
        Dispose(disposing: true);
        GC.SuppressFinalize(this);
    }
}

@jstedfast
Copy link
Owner

@EugeneKrapivin Thanks for taking the time to respond and share your ideas. I'll definitely be trying to read up on Microsoft's guides on best practices.

@prashanth-nani
Copy link

Thanks, @jstedfast, for considering adding Telemetry support in Mailkit v5.0. All other libraries that we use to make external calls, such as HttpClient for HTTP calls, MassTransit for RabbitMQ, Stackexchange.Redis for Redis calls, and MysqlConnector for MySQL, have added support for telemetry, and we have started instrumenting them. It would be great if we could have this added in Mailkit.

@jstedfast
Copy link
Owner

Ah, cool, I should look at how HttpClient does stuff as that might make be a close match with MailKit's client APIs.

@jstedfast
Copy link
Owner

Some docs on dotnet's built-in networking (including HttpClient) metrics: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/built-in-metrics-system-net#metric-httpclientopen_connections

Also seems like this Meter stuff is only in net8.0?

@prashanth-nani
Copy link

No. We are running .NET 6 and using Meter for producing custom application metrics. What’s new in .NET 8 is the IMeterFactory interface.

@jstedfast
Copy link
Owner

A few things I'm not clear about:

  1. A Meter has a name and is used to create Counters and Histograms, which also have names. How do those names inter-relate? If I have a Meter with the name mailkit.net.smtp, does that in any way namespace the Counters and Histograms? Or would I want to name them mailkit.net.smtp.client.send.duration?
  2. Should these Meters, Counters, and Histograms be static on SmtpClient? Or should they be per-instance? If they are per-instance, how does that play out if you've got multiple client instances sending messages?

@jstedfast
Copy link
Owner

I would still like answers to my questions in the previous comment, but these are the metrics that I've come up with so far (mostly focused on SmtpClient at the moment since that's what you guys are focused on):

Socket Metrics

Metric: mailkit.net.socket.connect

Name Instrument Type Unit Description
mailkit.net.socket.connect Histogram ms The number of milliseconds taken for a socket to connect to a remote host.
Attribute Type Description Examples Presence
socket.connect.result string The connection result. succeeded, failed, cancelled Always
network.peer.address string Peer IP address of the socket connection. 10.5.3.2 Always
server.address string The host name that the socket is connecting to. smtp.gmail.com Always
server.port int The port that the socket is connecting to. 465 Always
error.type string The type of exception encountered. System.Net.Sockets.SocketException If an error occurred.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric measures the time it takes to connect a socket to a remote host.

SmtpClient Metrics

Metric: mailkit.net.smtp.client.connect.counter

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.counter Counter {connection} Number of outbound SMTP connections that are currently active.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always

This metric tracks the number of successful connections made by the SmtpClient.

Metric: mailkit.net.smtp.client.connect.failures

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.failures Counter {failure} The number of times a connection failed to be established to an SMTP server.
Attribute Type Description Examples Presence
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric counts the number of failed connection attempts made by the SmtpClient.

Metric: mailkit.net.smtp.client.session.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.session.duration Histogram s The duration of successfully established connections to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric tracks the session duration of each SmtpClient connection and records whether that session ended voluntarily or due to an error.

Metric: mailkit.net.smtp.client.send.messages

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.messages Counter {message} The number of messages successfully sent to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always

This metric tracks the number of messages successfully sent by an SmtpClient.

Metric: mailkit.net.smtp.client.send.failures

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.failures Counter {failure} The number of times that sending a message to an SMTP server failed.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric tracks the number of times an SmtpClient encountered an error while trying to send a message.

Metric: mailkit.net.smtp.client.send.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.duration Histogram ms The amount of time it takes to send a message to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 10.5.3.2 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connecting to. 465 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error string The socket error code. ConnectionRefused, TimedOut, ... If one was received.

This metric tracks the amount of time that it takes to send a message and records any error details if the message was not sent successfully.

@jstedfast
Copy link
Owner

First, I'd like to say that I'm not completely happy with the metric names I've come up with, especially the .counter and .failures metric names.

I'd also like to get some feedback on attribute names/values, especially the Type for both smtp.status_code and socket.error.

I started out with socket.error and mapped those SocketError enum values to string because the interger values are worthless to me. That said, the same is probably true for the SmtpStatusCode enum values that I decided to map to int for most of you guys.

Does it make sense for both of these to be int? In hindsight, I'm thinking it does.

The Socket Metrics are because the way I implemented it was to implement it at the MailKit.Net.SocketUtils-level rather than it the higher-level to avoid duplicating as much code. Also if a connect is aborted due to the SmtpClient.Timeout value or the CancellationToken, we don't actually get accurate information if I was to do it at the SmtpClient-level.

Thinking back on my questions from an earlier comment, I'm guessing all of my metrics should be "global" and not "instance", but I'm not 100% positive on that.

Then of course, if they are "global", then does it make sense to add attributes to determine which instance a metric was logged from? e.g. perhaps have a Guid (or int?) clientId?

@jstedfast
Copy link
Owner

Ok, so reading over https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md#naming-rules-for-counters-and-updowncounters

I guess I should probably make the following name changes:

  • mailkit.net.smtp.client.connect.counter -> mailkit.net.smtp.client.connect.count
  • mailkit.net.smtp.client.connect.failures -> mailkit.net.smtp.client.connect.faults ?
  • mailkit.net.smtp.client.send.messages -> mailkit.net.smtp.client.send.count
  • mailkit.net.smtp.client.send.failures -> mailkit.net.smtp.client.send.faults ?

(I think technically failures is ok, but maybe faults is better?)

I also wonder if maybe smtp.client.session.duration should just become smtp.client.duration?

@EugeneKrapivin
Copy link

EugeneKrapivin commented Feb 13, 2024

instead of using faults use a tag/label status=failed it will be easier to dissect the metrics using tags.
if you go back to my earlier example, you'd see how I used tags to add extra data on top of the metric.

mailkit.net.client.send.count {status=[success|failure], protocol=[smtp|pop3]}
mailkit.net.client.send.rate {status=[success|failure], protocol=[smtp|pop3]} 
mailkit.net.client.send_duration {status=[success|failure], protocol=[smtp|pop3], unit=ms}

just an idea.

PS: a bit of math :) a linearly increasing counter could give you a rate using a derivative operation. a rate meter could give you a count using an integral operation. sry for the math tidbits 🤣

@jstedfast
Copy link
Owner

Okay, so I'll drop the .failures.

That still doesn't answer the most important question which is should these instruments (Counters/Histograms) be per-instance or global?

@jstedfast
Copy link
Owner

Updated list of metrics for SmtpClient:

SmtpClient Metrics

Metric: mailkit.net.smtp.client.connect.count

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.count Counter {attempt} The number of times a client has attempted to connect to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. System.Net.Sockets.SocketException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the number of times an SmtpClient has attempted to connect to an SMTP server.

Metric: mailkit.net.smtp.client.connect.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.connect.duration Histogram ms The amount of time it takes for the client to connect to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. System.Net.Sockets.SocketException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the amount of time it takes an SmtpClient to connect to an SMTP server.

Metric: mailkit.net.smtp.client.connection.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.connection.duration Histogram s The duration of successfully established connections to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the connection duration of each SmtpClient connection and records any error details if the connection was terminated involuntarily.

Metric: mailkit.net.smtp.client.send.count

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.count Counter {message} The number of messages sent to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the number of messages sent by an SmtpClient and records any error details if the message was not sent successfully.

Metric: mailkit.net.smtp.client.send.duration

Name Instrument Type Unit Description
mailkit.net.smtp.client.send.duration Histogram ms The amount of time it takes to send a message to an SMTP server.
Attribute Type Description Examples Presence
network.peer.address string Peer IP address of the client connection. 142.251.167.109 When available
server.address string The host name that the client is connected to. smtp.gmail.com Always
server.port int The port that the client is connected to. 25, 465, 587 Always
url.scheme string The URL scheme of the protocol used. smtp or smtps Always
error.type string The type of exception encountered. MailKit.Net.Smtp.SmtpCommandException If an error occurred.
smtp.status_code int The SMTP status code returned by the server. 530, 550, 553, ... If one was received.
socket.error int The socket error code. 10054, 10060, 10061, ... If one was received.

This metric tracks the amount of time that it takes an SmtpClient to send a message and records any error details if the message was not sent successfully.

jstedfast added a commit that referenced this issue Feb 14, 2024
@EugeneKrapivin
Copy link

That still doesn't answer the most important question which is should these instruments (Counters/Histograms) be per-instance or global?

I'm sorry missed that question. I think you should share the Meters between instances, and it looks like the documentation says the APIs are thread-safe. I'd share the Meter (or use IMeterFactory which exists in net8 only😞), and create the instruments (counters, meters, histograms) in the constructor of the instance itself. But I'd be really happy to get some community input on this because I'm not sure whether this is the best practice.

The APIs to create instruments and record measurements are thread-safe. In .NET libraries, most instance methods require synchronization when invoked on the same object from multiple threads, but that's not needed in this case.

source: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation

I've seen your table, thats impressive :)

You should be very careful with tags with high cardinality: like IP and hostname.
These might cause strain on the metrics database which may lead to degraded performance on the DB. I've seen (and made) mistakes where runaways tags values would cause crappy performance in influxDB, I'd assume the same could happen in Prometheus as well.
source: https://learn.microsoft.com/en-us/dotnet/core/diagnostics/metrics-instrumentation#:~:text=Beware%20of%20having%20very%20large%20or%20unbounded%20combinations%20of%20tag%20values%20being%20recorded%20in%20practice.

@prashanth-nani
Copy link

I’m not entirely sure about the best course of action here. However, I'm leaning towards having more details in the traces rather than in the metrics. Personally, I'd rather avoid mixing external library logs with my app logs as it seems somewhat unconventional to me. Let’s consider adding metrics only when we feel they are essential for monitoring, as suggested by @EugeneKrapivin. If uncertain, let's refrain from adding it for now and revisit the decision when needed.

@EugeneKrapivin
Copy link

I assume we'd like data that will help us monitor and troubleshoot the service faster, preferably before issues on the client side begin to bubble :)

For example, I'm not sure I'd care at all about # of messages sent. Sure, it'd be a fun statistic I could use for bragging rights, but not much else. But you guys seem to care about this metric.

it is an integral of the rate metric, so as long as you report 1 (either count or rate) you could calculate the other simply

These are prometheus queries with grafana variables that I copied form a board we have to monitor number exceptions thrown by a service in production:

process_runtime_dotnet_exceptions_count{dc='$dc', env='$env',name='$service', version=~'$ver'}

image
the process_runtime_dotnet_exceptions_count is a dotnet metric reported by the runtime , as you can clearly see, it is a counter, linear counter.
with a simple rate function on this series we could get rate of exceptions
image

Even Send/SendAsync() duration is not hugely useful for me

metrics like these are actually useful for developers who use MailKit, they could report those metrics and build useful alerts based on this metric. If the duration of a send request is increasing overtime its a cause for alarm, seeing quantiles (p99.9) of the duration also helps find issues, especially spikes that get lost in the standard average/mean aggregation people look at.

Tracing and Logging would bring context into the picture, but again, what details do you guys care about?

it'd be really helpful to get outgoing client requests as activities (trace) because it would allow the user to trace a single request and whether or not the email was sent successfully. that's (distributed traces) actually something I'm using daily while researching production issues.

var start = Stopwatch.GetTimestamp();
using var sendActivity = source.StartActivity("Smtp.Send", ActivityKind.Client);
try
{
	var r = await _client.SendAsync(message, cts.Token);

	var total = Stopwatch.GetElapsedTime(start);
	
	sendActivity?.AddTag("status", "success");
	sendActivity?.AddEvent(new ActivityEvent("email sent"));
	_sendMessageHistogram.Record(total.TotalMilliseconds new KeyValuePair<string, object?>("status", "success"));
}

Especially since Traces and Logs are correlated using a RequestId parameter, meaning I could trace a request and find all logs written as a side-effect of the
call chain.

In general, metrics are only a partial view of the whole observability game which must include logs, traces and metrics :).

I say you could have only 2-3 metrics, with good tags just to get you up and running with RED metrics.
Get those logs out the door to the logging tool that the developers use and get those traces with dotnet Activity SDK. It doesn't have all to be at once, it could be gradual. I'd suggest sticking to:

instrument type metric name tags
counter mailkit.net.client.op.count client=[smtp|pop3|imap], status=[success|failed], operation=[send|connect|auth|noop]
histogram mailkit.net.client.op.duration client=[smtp|pop3|imap], status=[success|failed], unit=ms, operation=[send|connect|auth|noop]
histogram mailkit.net.client.send.size {client=[smtp|pop3|imap], status=[success|failed], unit=byte}

of course I do not know the internals of the library so if there are other metrics that you think that might be useful for the users of the library, you should add them. the metrics I suggested would allow a user to:

  • Rate (R of RED) - monitor success/failure rate of operations in the system, I could setup an alert if failed rate is getting high, I could setup an alert if rate (total) is gone under some water mark
  • Error (E of RED) - I could easily monitor failure rates for various operations (Send and Connect/Auth)
  • Duration (D of RED) - I could easily monitor duration of operations (like send)

of course all of those I could gather manually from outside the library by simply wrapping it with lots of code... but that's the added value you've mentioned :)

Regarding traces, just instrument the most used (in your opinion) client and see if requests or MR for other clients arrive =) if you build it, they will come 🤣

Regarding logs, I strongly disagree with @prashanth-nani about library logs in my product logs, I think that once I use the library in my code, it is my job to monitor, read and understand its behavior. This is why all of our logs are gathered into elasticsearch under a single request ID for each and every request (pricy endeavor for sure). these include all the logs that are written by a running process. If there is something I don't want I just filter it in the log4net/logging/logstash/flume forwarders and drop it before it gets stored. logs from MailKit is as important as your log in the service once you use MailKit in production.

@jstedfast I think you should be using the Microsoft logging abstraction, it'd allow the user to use any logger they want (all loggers implement the ILogger abstraction. In addition, those logs could be easily exported to OpenTelemetry collector easily in dotnet. Of course all you have to do is use the standard ILogger logging abstraction, all the exporting logic is not your job, just to be clear.

wow this got fairly long. sorry.

@prashanth-nani
Copy link

prashanth-nani commented Feb 18, 2024

@EugeneKrapivin, thanks for the detailed explanation! I fully agree with your points.

I stand corrected regarding my previous thoughts on library logs. Including them where ever necessary using ILogger is a great idea as it is configurable in the application.

On instrumentation, I align with your suggestions for the traces and RED metrics.

Additionally, message.size tag (and any other crucial information that could cause delays) in the trace activities, and exceptions as trace events would be valuable to correlate with longer duration traces.

Thanks again for your valuable insights!

@EugeneKrapivin
Copy link

Additionally, message.size tag (and any other crucial information that could cause delays) in the trace activities, and exceptions as trace events would be valuable to correlate with longer duration traces.

just be careful with tags with variadic values on metrics - it will explode the backend storage for sure :) it should be OK as an argument in Activity/Event trace

@jstedfast
Copy link
Owner

@EugeneKrapivin Thanks for your thoughts, as always. I really appreciate the guidance because I'm not well-versed in this stuff.

What you said makes sense and brings the bigger picture into better focus for me.

jstedfast added a commit that referenced this issue Feb 19, 2024
jstedfast added a commit that referenced this issue Feb 24, 2024
jstedfast added a commit that referenced this issue Feb 24, 2024
jstedfast added a commit that referenced this issue Feb 24, 2024
jstedfast added a commit that referenced this issue Feb 24, 2024
jstedfast added a commit that referenced this issue Feb 24, 2024
jstedfast added a commit that referenced this issue Feb 27, 2024
jstedfast added a commit that referenced this issue Feb 27, 2024
@jstedfast
Copy link
Owner

jstedfast commented Feb 27, 2024

Here's where I'm at with telemetry metrics: https://github.com/jstedfast/MailKit/blob/metrics/Telemetry.md

A lot of this is based on HttpClient and reading up on the OpenTelemetry specifications, conventions, etc.

I've also got Activities that span Connect/Async, Authenticate/Async and Send/Async.

@EugeneKrapivin
Copy link

@jstedfast would you mind opening a draft merge/pull request so we could productively leave comments?

@jstedfast
Copy link
Owner

Done: #1715

jstedfast added a commit that referenced this issue Mar 1, 2024
jstedfast added a commit that referenced this issue Mar 1, 2024
jstedfast added a commit that referenced this issue Mar 2, 2024
jstedfast added a commit that referenced this issue Mar 2, 2024
jstedfast added a commit that referenced this issue Mar 9, 2024
jstedfast added a commit that referenced this issue Mar 9, 2024
jstedfast added a commit that referenced this issue Apr 13, 2024
jstedfast added a commit that referenced this issue Apr 13, 2024
jstedfast added a commit that referenced this issue May 12, 2024
jstedfast added a commit that referenced this issue May 12, 2024
jstedfast added a commit that referenced this issue May 18, 2024
jstedfast added a commit that referenced this issue May 18, 2024
jstedfast added a commit that referenced this issue May 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants