TargetGroup sometimes does not attach to ApplicationLoadBalancer #1254

Closed
rpmccarter opened this issue Apr 3, 2024 · 3 comments
Labels
awaiting-feedback; impact/reliability (Something that feels unreliable or flaky); kind/bug (Some behavior is incorrect or out of spec)

Comments

@rpmccarter

What happened?

I was trying to create a single FargateService with two TargetGroups attached to an ApplicationLoadBalancer: one target group for HTTP requests and one for socket connections. When deployed, one target group simply doesn't attach to the load balancer. What's even more concerning is that when the exact same code is deployed to a second stack, it attaches just fine. I'm relatively new to Pulumi, so there might be something I'm missing, but I assumed identical code should result in identical resources.

I understand this might not be reproducible. I mostly want to flag that I'm seeing inconsistency between environments, and hopefully get some answers on how this is possible.

Example

Unfortunately, this is part of our private infra, so I can't share the entire deploy script, but I'll include as much relevant info as possible. Here is the code for the target groups and load balancer:

const serverTg = new aws.lb.TargetGroup(`leaves-server-tg-${stack}`, {
  vpcId: defaultVpc.vpcId,
  stickiness: {
    type: 'lb_cookie',
  },
  port,
  protocol: 'HTTP',
  targetType: 'ip',
  protocolVersion: 'HTTP1',
  healthCheck: {
    path: '/api',
    port: 'traffic-port',
    protocol: 'HTTP',
    matcher: '200',
    enabled: true,
    interval: 60,
    timeout: 30,
  },
});

const socketTg = new aws.lb.TargetGroup(`leaves-socket-tg-${stack}`, {
  vpcId: defaultVpc.vpcId,
  port: 5001,
  protocol: 'HTTP',
  stickiness: {
    type: 'lb_cookie',
  },
  targetType: 'ip',
  protocolVersion: 'HTTP1',
  healthCheck: {
    path: '/api',
    port: `${port}`,
    protocol: 'HTTP',
    matcher: '200',
    enabled: true,
    interval: 60,
    timeout: 30,
  },
});

const lb = new awsx.lb.ApplicationLoadBalancer(`leaves-lb-${stack}`, {
  listeners: [
    {
      port: 443,
      protocol: 'HTTPS',
      certificateArn: lb_cert.arn,
      defaultActions: [
        {
          type: 'forward',
          targetGroupArn: serverTg.arn,
        },
      ],
    },
    {
      port: 8443,
      protocol: 'HTTPS',
      certificateArn: lb_cert.arn,
      defaultActions: [
        {
          type: 'forward',
          targetGroupArn: socketTg.arn,
        },
      ],
    },
  ],
});

And here's the code for the target service:

new awsx.ecs.FargateService(`leaves-server-service-${stack}`, {
  networkConfiguration: {
    assignPublicIp: true,
    securityGroups: [serviceSg.id],
    subnets: defaultVpc.publicSubnetIds,
  },
  cluster: cluster.arn,
  desiredCount: 4,
  taskDefinitionArgs: {
    taskRole: {
      roleArn: role.arn,
    },
    container: {
      name: 'server',
      image: image.imageUri,
      command: ['infisical', 'run', `--env=${stack}`, '--', 'yarn', 'server'],
      cpu: 2 * 1024,
      memory: 4 * 1024,
      environment: serverEnvironment,
      essential: true,
      portMappings: [
        {
          targetGroup: serverTg,
          containerPort: port,
        },
        {
          targetGroup: socketTg,
          containerPort: 5001,
        },
      ],
      healthCheck: {
        command: ['CMD-SHELL', `curl -f http://localhost:${port}/api/ || exit 1`],
        interval: 30,
        timeout: 5,
        retries: 3,
      },
    },
  },
});
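
For readers comparing their own program against this report, a rough sanity check is to confirm that every target group used in the service's port mappings is also the forward target of some listener. The sketch below is only an illustration: the `ListenerSketch` and `PortMappingSketch` shapes and the `targetGroupsWithoutListener` helper are hypothetical simplifications, not awsx types, and the string labels stand in for the `serverTg` and `socketTg` resources above.

```typescript
// Hypothetical, simplified shapes: only the fields this check needs.
// These are NOT the real awsx argument types.
interface ListenerSketch {
  port: number;
  forwardsTo: string; // label of the target group in the default forward action
}

interface PortMappingSketch {
  targetGroup: string; // label of the target group the container port registers with
}

// Returns the target groups that the service uses but that no listener forwards to.
function targetGroupsWithoutListener(
  listeners: ListenerSketch[],
  portMappings: PortMappingSketch[],
): string[] {
  const forwarded = new Set(listeners.map((l) => l.forwardsTo));
  return portMappings
    .map((m) => m.targetGroup)
    .filter((tg) => !forwarded.has(tg));
}

// The wiring from the code above, reduced to labels.
const reportedListeners: ListenerSketch[] = [
  { port: 443, forwardsTo: 'serverTg' },
  { port: 8443, forwardsTo: 'socketTg' },
];
const reportedPortMappings: PortMappingSketch[] = [
  { targetGroup: 'serverTg' },
  { targetGroup: 'socketTg' },
];

const unwired = targetGroupsWithoutListener(reportedListeners, reportedPortMappings);
console.log(unwired);
```

For the wiring shown in this report the result is an empty list, which is consistent with the program itself referencing both target groups from a listener.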

Here are the target groups - the relevant ones are selected. Note that leaves-socket-tg-dev has no associated load balancer:

Screenshot 2024-04-03 at 4 07 27 PM
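
The same condition can also be checked without the console. The ELBv2 DescribeTargetGroups API (for example via `aws elbv2 describe-target-groups`) returns a `LoadBalancerArns` list for each target group, and an empty list means no load balancer routes to it. The sketch below filters such a response. The `findUnattachedTargetGroups` helper and the sample data are hypothetical, and the ARN is a placeholder rather than a value from this stack.

```typescript
// Subset of the per-target-group fields returned by the ELBv2
// DescribeTargetGroups API. Only the fields used here are modeled.
interface TargetGroupSummary {
  TargetGroupName: string;
  LoadBalancerArns?: string[];
}

// Returns the names of target groups that no load balancer routes to.
// A missing LoadBalancerArns field is treated the same as an empty list.
function findUnattachedTargetGroups(groups: TargetGroupSummary[]): string[] {
  return groups
    .filter((g) => (g.LoadBalancerArns ?? []).length === 0)
    .map((g) => g.TargetGroupName);
}

// Hypothetical sample data mirroring the state described above.
// The ARN is a placeholder, not a real resource.
const sampleGroups: TargetGroupSummary[] = [
  {
    TargetGroupName: 'leaves-server-tg-dev',
    LoadBalancerArns: ['arn:aws:elasticloadbalancing:placeholder'],
  },
  { TargetGroupName: 'leaves-socket-tg-dev', LoadBalancerArns: [] },
];

const unattached = findUnattachedTargetGroups(sampleGroups);
console.log(unattached);
```

Running a check like this against both stacks after each deploy would give a concrete record of which stack shows the missing attachment and when.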

Output of pulumi about

CLI          
Version      3.112.0
Go Version   go1.22.1
Go Compiler  gc

Plugins
NAME        VERSION
aws         6.28.2
awsx        2.5.0
cloudflare  5.22.0
docker      4.5.3
docker      3.6.1
nodejs      unknown
tls         5.0.1

Host     
OS       darwin
Version  14.4
Arch     arm64

This project is written in nodejs: executable='/Users/rpmccarter/.nvm/versions/node/v20.10.0/bin/node' version='v20.10.0'

Current Stack: Mintlify/leaves/dev

TYPE                                                      URN
[removed]

Found no pending operations associated with dev

Backend        
Name           pulumi.com
URL            https://app.pulumi.com/Mintlify
User           Mintlify
Organizations  Mintlify
Token type     personal

Dependencies:
NAME                VERSION
@pulumi/aws         6.28.2
@pulumi/awsx        2.5.0
@pulumi/cloudflare  5.22.0
@pulumi/pulumi      3.109.0
@pulumi/tls         5.0.1
@types/node         16.18.22
rimraf              5.0.5
typescript          5.3.3

Pulumi locates its logs in /var/folders/dn/z0by0dcj1gnbkjr6_t71hp_m0000gn/T/ by default

Additional context

No response

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

rpmccarter added the kind/bug and needs-triage labels on Apr 3, 2024
t0yv0 added the impact/reliability label and removed the needs-triage label on Apr 5, 2024
@t0yv0
Member

t0yv0 commented Apr 5, 2024

Thanks for reporting this, @rpmccarter. This sounds pretty concerning. To clarify: does the failed state happen sporadically or every single time? Are there no errors reported? Does the condition still not resolve after some time (for example, 5 minutes later)?

This will be difficult for our team to diagnose, so anything that helps narrow down the repro would be super helpful. If anyone else is running into this, please also let us know what you are observing.

@mjeffryes
Contributor

Is there any further context you can offer to help us reproduce this, @rpmccarter?

@rpmccarter
Author

Hey team, I'm fairly confident this is just a symptom of #1253. I'm now running into a very similar issue: a Cloudflare Record fails to be created because a field it depends on, lb.loadBalancer.dnsName, is missing. Closing this as a duplicate.

rpmccarter closed this as not planned on May 22, 2024