Configure monitoring for subaccount-sync [betaEnabled] app #691

Closed
5 tasks done
jaroslaw-pieszka opened this issue Apr 22, 2024 · 2 comments
Assignees
Labels
2024-Q2 Planned for Q2 2024 size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Comments

@jaroslaw-pieszka
Contributor

jaroslaw-pieszka commented Apr 22, 2024

The subaccount-sync app exposes metrics on a configurable port (default 8081).
We need these metrics to be presented in Plutono.

AC

  • configure the resources necessary to scrape the metrics
  • make them available in Plutono
  • create a dedicated dashboard to present the metrics
  • if any metrics are missing, create a new task
  • configure alerting for CIS requests and for a queue that stays non-empty for a longer period of time

Info

Metrics:

subaccount_sync_cis_requests{endpoint="accounts"} 1               
subaccount_sync_cis_requests{endpoint="events"} 2
# HELP subaccount_sync_in_memory_states Information about in-memory states.
# TYPE subaccount_sync_in_memory_states gauge  
subaccount_sync_in_memory_states{type=""} 2
subaccount_sync_in_memory_states{type="NOT_USED_FOR_PRODUCTION"} 8
subaccount_sync_in_memory_states{type="USED_FOR_PRODUCTION"} 1                                                         
subaccount_sync_in_memory_states{type="betaEnabled"} 2
subaccount_sync_in_memory_states{type="total"} 18       
# HELP subaccount_sync_informer Informer stats.          
# TYPE subaccount_sync_informer counter          
subaccount_sync_informer{event="add"} 15   
subaccount_sync_informer{event="update"} 684       
# HELP subaccount_sync_priority_queue_size Queue size.
# TYPE subaccount_sync_priority_queue_size gauge                           
subaccount_sync_priority_queue_size 0
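
A minimal sketch of a scrape resource for the metrics listed above, assuming the cluster runs the Prometheus Operator and that a Service in front of subaccount-sync exposes port 8081 under a named port. The resource name, the selector label, the port name `metrics`, the `/metrics` path, and the interval are all assumptions for illustration and are not taken from the actual chart.

```yaml
# Hypothetical ServiceMonitor; every name and label below is an assumption.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: subaccount-sync                          # assumed resource name
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: subaccount-sync    # assumed label on the Service
  endpoints:
    - port: metrics      # assumed name of the Service port that maps to 8081
      path: /metrics     # assumed metrics path
      interval: 30s      # assumed scrape interval
```

With a setup like this, the operator attaches its own `endpoint` label to each target. When label honoring is left off, the app's own `endpoint` label is stored as `exported_endpoint`, which is likely why the alert expressions later in this thread select on `exported_endpoint`.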
@jaroslaw-pieszka
Contributor Author

jaroslaw-pieszka commented May 14, 2024

Two alerts proposed:

  1. CIS requests success ratio lower than a threshold, evaluated separately for the accounts and events endpoints; the ratio could be calculated over a fixed window (e.g. 24h).
  2. No items extracted from a non-empty queue for a defined period (e.g. 1h).

@jaroslaw-pieszka
Contributor Author

jaroslaw-pieszka commented May 16, 2024

Alerts introduced:

      - alert: SubaccountSyncQueueSizeAlert
        annotations:
          description: Queue not empty too long - no extracts.
          summary: Queue not empty too long
        expr: subaccount_sync_priority_queue_size > 0 and dry_run == 0
        for: 1h
        labels:
          severity: warning
      - alert: SubaccountSyncCISAccountsServiceAlert
        annotations:
          description: Success ratio for accounts service too low.
          summary: Success ratio for accounts service too low
        expr: |-
          round(sum(increase_prometheus(subaccount_sync_cis_requests{exported_endpoint="accounts", status="success"}[1d])))
          /
            (round(sum(increase_prometheus(subaccount_sync_cis_requests{exported_endpoint="accounts", status="success"}[1d])))
            +
            round(sum(increase_prometheus(subaccount_sync_cis_requests{exported_endpoint="accounts", status="failure"}[1d])))
            )
          < 0.90
        for: 1d
        labels:
          severity: warning
      - alert: SubaccountSyncCISEventsServiceAlert
        annotations:
          description: Success ratio for events service too low.
          summary: Success ratio for events service too low
        expr: |-
          round(sum(increase_prometheus(subaccount_sync_cis_requests{exported_endpoint="events", status="success"}[1h])))
          /
            (round(sum(increase_prometheus(subaccount_sync_cis_requests{exported_endpoint="events", status="success"}[1h])))
            +
            round(sum(increase_prometheus(subaccount_sync_cis_requests{exported_endpoint="events", status="failure"}[1h])))
            )
          < 0.90
        for: 1h
        labels:
          severity: warning
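
A sketch of how the queue alert could be checked with `promtool test rules`, under a few assumptions. The rule file name `subaccount-sync-queue-alert.yaml` is hypothetical. It is assumed to be a standalone rule file (a top-level `groups:` list) containing only `SubaccountSyncQueueSizeAlert`, because `increase_prometheus` in the CIS alerts is a MetricsQL function that promtool would probably fail to parse. The input series uses the queue metric name from the Info section; it has to match whatever name the loaded rule's `expr` references.

```yaml
# Hypothetical promtool unit test; file name and sample values are assumptions.
rule_files:
  - subaccount-sync-queue-alert.yaml   # assumed file holding only the queue alert

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Queue holds 3 items for the whole test window.
      - series: 'subaccount_sync_priority_queue_size'
        values: '3+0x90'
      # dry_run stays 0, so the "and dry_run == 0" part matches.
      - series: 'dry_run'
        values: '0+0x90'
    alert_rule_test:
      # Still inside the 1h "for" window: the alert is pending, not firing.
      - eval_time: 30m
        alertname: SubaccountSyncQueueSizeAlert
      # Past the 1h "for" window: the alert is expected to fire.
      - eval_time: 61m
        alertname: SubaccountSyncQueueSizeAlert
        exp_alerts:
          - exp_labels:
              severity: warning
            exp_annotations:
              description: Queue not empty too long - no extracts.
              summary: Queue not empty too long
```

Running `promtool test rules` on this file checks both time points. Note that the two input series carry no extra labels; in a real deployment `and` only matches when both sides share the same label set, so `dry_run` and the queue metric would need identical target labels.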

@ralikio ralikio closed this as completed May 27, 2024