
8.0.0-beta.6 massive performance degradation/connections dropped #11897

Open · 3 tasks done · gajus opened this issue May 4, 2024 · 9 comments

Comments

gajus (Contributor) commented May 4, 2024

Is there an existing issue for this?

How do you use Sentry?

Sentry Saas (sentry.io)

Which SDK are you using?

@sentry/node

SDK Version

8.0.0-beta.6

Framework Version

No response

Link to Sentry event

No response

SDK Setup

N/A

Steps to Reproduce

We upgraded from 8.0.0-alpha.9 to 8.0.0-beta.6.

Expected Result

Everything to work.

Actual Result

Experienced major performance degradation.

There are no errors per se; however, service performance has ground to a halt.

The most noticeable symptom was occasional errors coming from the pg module complaining about timeouts, e.g.

Error: timeout expired
    at Timeout._onTimeout (/srv/node_modules/.pnpm/pg@8.11.5/node_modules/pg/lib/client.js:106:28)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7)
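
For reference, this "timeout expired" error is pg's connection-establishment timeout. A minimal sketch of where that limit comes from (the pool setup and values here are illustrative, not our actual config):

import { Pool } from 'pg';

// pg raises "Error: timeout expired" (pg/lib/client.js) when a client cannot
// finish connecting within connectionTimeoutMillis.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  connectionTimeoutMillis: 10_000, // illustrative value
});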

Worth noting that we also rely heavily on Zod; I noticed it was mentioned in the upgrade changelog.

mydea (Member) commented May 6, 2024

Hey, thank you for reporting this!

zodIntegration has been added, but it is not enabled by default, so that's probably not related to the issue you are describing.

We did bump a bunch of underlying instrumentation, including @opentelemetry/instrumentation-pg 🤔 But I don't see anything in there that would explain why things would become slower. Just to clarify: you only bumped @sentry/node and no other packages?

One thing you could try is forcing it to use v0.38.0 of @opentelemetry/instrumentation-pg, which is what we used before. If you are using yarn, you can do this in your package.json like so:

{
  "resolutions": {
    "@sentry/node/@opentelemetry/instrumentation-pg": "^0.38.0"
  }
}
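
Since your stack trace shows .pnpm paths: if you are on pnpm, the rough equivalent would be the pnpm.overrides field in package.json (a sketch; I have not verified it against your setup):

{
  "pnpm": {
    "overrides": {
      "@opentelemetry/instrumentation-pg": "^0.38.0"
    }
  }
}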

I think this should let you test whether that has an impact on performance 🤔 But of course it could also be another instrumentation; it's hard to tell. I guess you don't have a public reproduction of this that we could take a look at?

gajus (Contributor, Author) commented May 9, 2024

Just in case: the same issue occurs with rc2. Our app grinds to a complete halt after we deploy it; average response time increases from 150ms to over 1500ms.

What's weird is that CPU usage remains about the same. I would have suspected that some instrumentation logic is introducing significant overhead, but that does not appear to be the case.

I have not had much luck replicating this in isolation, i.e. we are only seeing this when deployed to production, which would indicate that this issue surfaces under high concurrency.

Will continue to troubleshoot and report back if we can locate the root cause.

mydea (Member) commented May 13, 2024

Thanks for the added info!

Could you share your Sentry.init() code, and also maybe your app setup - e.g. what framework (express, fastify, ...) are you using, what middleware, other relevant config, ...?

gajus (Contributor, Author) commented May 15, 2024

Could you share your Sentry.init() code, and also maybe your app setup - e.g. what framework (express, fastify, ...) are you using, what middleware, other relevant config, ...?

import { RequestError } from '@contra/http-client';
import { init, rewriteFramesIntegration } from '@sentry/node';
import { nodeProfilingIntegration } from '@sentry/profiling-node';
import { type Integration } from '@sentry/types';

export { findActiveSpan } from './findActiveSpan.js';
export { getTraceAndBaggage } from './getTraceAndBaggage.js';
export { startSpan } from '@sentry/core';
export {
  type Breadcrumb,
  captureException,
  captureMessage,
  close,
  setupFastifyErrorHandler,
  setUser,
  withScope,
} from '@sentry/node';
export { type Extras } from '@sentry/types';

const {
  NODE_ENV,
  NODE_PROFILING,
  // eslint-disable-next-line n/no-process-env
} = process.env;

type ErrorDetail = {
  message: string;
  path: string[];
};

// Type Guard for ErrorDetail
const isErrorDetail = (error: unknown): error is ErrorDetail => {
  return (
    typeof error === 'object' &&
    error !== null &&
    'message' in error &&
    'path' in error &&
    typeof error.message === 'string' &&
    Array.isArray(error.path)
  );
};

// Extracts error message from JSON error string
const extractErrorMessage = (errorString: string): string | null => {
  try {
    const outerArray = JSON.parse(errorString);
    const parsed = outerArray[0];

    if (Array.isArray(parsed) && parsed.every(isErrorDetail)) {
      return parsed.map((error) => error.message).join(', ');
    } else {
      return parsed.message;
    }
  } catch {
    return null;
  }
};

// Extracts path from JSON error string and removes index of edges
const extractErrorPath = (errorString: string): string | null => {
  try {
    const outerArray = JSON.parse(errorString);
    const parsed = outerArray[0];

    if (Array.isArray(parsed) && parsed.every(isErrorDetail)) {
      return parsed
        .map((error) => error.path.join('.').replaceAll(/\d+/gu, ''))
        .filter(Boolean)
        .join(', ');
    } else {
      return parsed.path.join('.').replaceAll(/\d+/gu, '');
    }
  } catch {
    return null;
  }
};

// Extracts error details
const extractErrorDetails = (error: unknown): Record<string, string | null> => {
  let errorMessage: string | null = null;
  let errorPath: string | null = null;

  errorMessage = extractErrorMessage(typeof error === 'string' ? error : '');
  errorPath = extractErrorPath(typeof error === 'string' ? error : '');

  return { errorMessage, errorPath };
};

const formatFingerprint = (fingerprint: string[]): string[] => {
  return fingerprint.map((item) => {
    const { errorMessage, errorPath } = extractErrorDetails(item);
    if (errorMessage && errorPath) {
      return `{message: "${errorMessage}", path: "${errorPath}"}`;
    }

    return item;
  });
};

export type SentryTracing = {
  shutdown: () => Promise<void>;
};

export const createSentry = ({
  integrations,
  root,
}: {
  integrations?: (integrations: Integration[]) => Integration[];
  root: string;
}) => {
  init({
    beforeSend(event, hint) {
      let fingerprint =
        typeof hint.originalException === 'string'
          ? [hint.originalException]
          : hint.originalException instanceof Error
            ? [hint.originalException?.message]
            : undefined;

      if (!event.contexts) {
        event.contexts = {};
      }

      if (hint.originalException instanceof RequestError) {
        event.contexts.httpRequest = {
          method: hint.originalException.request?.options.method,
          // eslint-disable-next-line @typescript-eslint/no-explicit-any
          source: (hint.originalException as any)?.sourceTrace,
          url: hint.originalException.request?.requestUrl,
        };
        event.contexts.httpResponse = {
          body: hint.originalException?.response?.body ?? null,
          statusCode: hint.originalException?.response?.statusCode,
        };
        fingerprint = [
          hint.originalException.message,
          hint.originalException.request?.options.method ?? '',
          // @ts-expect-error TODO why origin does not exist?
          String(hint.originalException.request?.options.url?.origin),
        ];
      }

      if (fingerprint) {
        if (event.contexts.graphql) {
          fingerprint.push(event.contexts.graphql.path as string);
        } else if (event.contexts.temporal) {
          fingerprint.push(event.contexts.temporal.workflowType as string);
        }
      }

      if (fingerprint) {
        event.fingerprint = formatFingerprint(fingerprint);
      } else {
        event.fingerprint = fingerprint;
      }

      return event;
    },
    integrations: (defaultIntegrations) => {
      const customIntegrations = [
        ...defaultIntegrations,
        rewriteFramesIntegration({
          root,
        }),
      ];

      if (NODE_PROFILING === 'true') {
        customIntegrations.push(nodeProfilingIntegration());
      }

      if (integrations) {
        return integrations(customIntegrations);
      }

      return customIntegrations;
    },
    normalizeDepth: 5,
    spotlight: NODE_ENV === 'development',
    tracesSampleRate: 1,
  });

  return {
    shutdown: async () => {
      // This was made redundant by Sentry abstracting away opentelemetry
    },
  };
};
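
For context, this module is imported before anything else in the service; rough usage looks like this (the file name and root value here are illustrative):

// instrument.ts – loaded first, before the rest of the app
import { createSentry } from './sentry.js';

export const sentry = createSentry({
  root: process.cwd(),
});

// sentry.shutdown() is awaited during graceful shutdown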

gajus (Contributor, Author) commented May 15, 2024

Same issue with v8.

@mydea I managed to get a trace while we were testing v8.

Looks like there are massive gaps between spans.

https://contra.sentry.io/performance/trace/7564ad7a9f280bad7145d336820752d8/?fov=0%2C20355.000019073486

[Screenshot: trace waterfall showing large gaps between spans, captured 2024-05-15]

gajus (Contributor, Author) commented May 15, 2024

The resource consumption profile does not change dramatically.

This is a snapshot from before the deployment:

NAME                                                   CPU(cores)   MEMORY(bytes)
contra-api-684b67464f-6kdbx                            324m         1080Mi
contra-api-684b67464f-7w99p                            269m         1092Mi
contra-api-684b67464f-9dc9d                            121m         1018Mi
contra-api-684b67464f-9lrk9                            181m         1068Mi
contra-api-684b67464f-9m9sv                            198m         1238Mi
contra-api-684b67464f-drmtj                            205m         1263Mi
contra-api-684b67464f-gnxm7                            135m         1302Mi
contra-api-684b67464f-hz4wl                            236m         1226Mi
contra-api-684b67464f-jctc6                            140m         1352Mi
contra-api-684b67464f-jvscg                            117m         923Mi
contra-api-684b67464f-phnms                            193m         1048Mi
contra-api-684b67464f-q85h5                            172m         1003Mi
contra-api-684b67464f-q87h5                            88m          904Mi
contra-api-684b67464f-s6qwt                            115m         1277Mi
contra-api-684b67464f-x6rvs                            290m         1410Mi

This is after:

contra-api-75df777468-6qt2c                            119m         1524Mi
contra-api-75df777468-7sw7t                            119m         1143Mi
contra-api-75df777468-8nsmg                            253m         1310Mi
contra-api-75df777468-9r4x4                            221m         1462Mi
contra-api-75df777468-cgs9s                            227m         1535Mi
contra-api-75df777468-ctv2h                            195m         1216Mi
contra-api-75df777468-fh7d2                            230m         1323Mi
contra-api-75df777468-fkwcd                            261m         1585Mi
contra-api-75df777468-kf8vm                            288m         1390Mi
contra-api-75df777468-n9tlz                            250m         1451Mi
contra-api-75df777468-pc78z                            397m         1554Mi
contra-api-75df777468-rtvkz                            336m         1529Mi
contra-api-75df777468-v6t8l                            201m         1444Mi
contra-api-75df777468-wz984                            274m         1250Mi
contra-api-75df777468-xtmlm                            67m          1111Mi

Memory usage is definitely higher on average, but nothing that would make me wince.

gajus (Contributor, Author) commented May 15, 2024

Just like last time, we see a big uptick in timeouts while trying to establish pg connections.

Logs are flooded with:

Error: timeout expired
    at Timeout._onTimeout (/srv/node_modules/.pnpm/pg@8.11.5/node_modules/pg/lib/client.js:106:28)
    at listOnTimeout (node:internal/timers:573:17)
    at process.processTimers (node:internal/timers:514:7)

gajus (Contributor, Author) commented May 15, 2024

@mydea What would be the correct way to disable all auto-instrumentations and then re-add them one by one?

I know how to do this with OpenTelemetry directly, but not when it is abstracted away as it is with Sentry.

export OTEL_NODE_ENABLED_INSTRUMENTATIONS=""

doesn't work.

mydea (Member) commented May 16, 2024

@mydea What would be the correct way to disable all auto-instrumentations and then re-add them one by one?

I know how to do this with OpenTelemetry directly, but not when it is abstracted away as it is with Sentry.

export OTEL_NODE_ENABLED_INSTRUMENTATIONS=""

doesn't work.

You can set defaultIntegrations: false (see https://docs.sentry.io/platforms/javascript/guides/node/configuration/integrations/#modifying-default-integrations) and then add individual integrations back.
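
A rough sketch of that approach (the re-added integrations here are just examples):

Sentry.init({
  defaultIntegrations: false,
  integrations: [
    // re-add integrations one at a time to narrow down the culprit, e.g.:
    Sentry.httpIntegration(),
    Sentry.expressIntegration(),
  ],
});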

Alternatively, you can just filter out individual integrations like this:

Sentry.init({
  integrations: function(integrations) {
    return integrations.filter(integration => integration.name !== 'Postgres');
  }
});

(similar to what you are already doing in your init, but removing the ones you want to exclude).

It would be interesting to know whether this is related only to the pg instrumentation 🤔 At a glance, I don't see anything in the trace you sent that looks wrong 🤔
