Update diagnostics service to version 1 #7059

Merged: 50 commits, May 20, 2024

@farost (Member) commented May 7, 2024

Usage and product changes

We introduce an updated version of diagnostics sent from a TypeDB server.

  1. config.yml gets a new deploymentID field in the diagnostics section. This field groups the data collected from multiple servers of a single TypeDB Cloud deployment.
  2. The updated diagnostics data contains more information about the server resources and details for each separate database. More details can be found in the examples below.
  3. For JSON reporting, we calculate diffs: the change in each counter between sinceTimestamp and the current timestamp. sinceTimestamp is the previous hour at which the data was due to be sent; for simplicity, it is updated even if sending the data failed. For Prometheus, we send raw counts, since Prometheus calculates diffs in its own queries and expects raw diagnostics from our side.
  4. For JSON monitoring, we show only the counters incrementing since the start of the server, just as in the Prometheus diagnostics data (also available through the monitoring page). As a result, its content differs from the reporting data.
  5. The schema and data diagnostics for each database are sent only by the replica that is primary at the moment of diagnostics collection. The connection peak values for a database are still reported by a non-primary replica if the database exists, or if transactions were established within the last hour before the database was deleted.
  6. If statistics reporting is turned off in the config, we send a minimal, safe part of the diagnostics data once to notify the server about the moment when diagnostics were turned off. No user data is shared in this snapshot (see examples below). This happens only if the server has been up for 1 hour (to prevent our CI tests from reporting data), and only if the server has not already successfully sent such a snapshot since statistics reporting was last turned off. If sending this snapshot fails, the server will try again after a restart (no extra logic here).
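
To make the difference between items 3 and 4 concrete, here is a minimal sketch of a single counter store that serves both views. The class and its methods are hypothetical (only the name takeSnapshot is borrowed from the diff in this PR); it is not the actual Metrics implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the actual TypeDB Metrics class.
class RequestCounters {
    // Raw counts since the server started.
    private final Map<String, AtomicLong> current = new ConcurrentHashMap<>();
    // Copy of the raw counts taken at the last reporting tick (sinceTimestamp).
    private final Map<String, Long> snapshot = new ConcurrentHashMap<>();

    // Called for every request of the given kind.
    void increment(String kind) {
        current.computeIfAbsent(kind, k -> new AtomicLong()).incrementAndGet();
    }

    // Monitoring and Prometheus view: the raw, ever-increasing count.
    long raw(String kind) {
        AtomicLong count = current.get(kind);
        return count == null ? 0 : count.get();
    }

    // Reporting view: the diff accumulated since the last snapshot.
    long sinceSnapshot(String kind) {
        return raw(kind) - snapshot.getOrDefault(kind, 0L);
    }

    // Called once per reporting period, whether or not the send succeeded.
    void takeSnapshot() {
        for (Map.Entry<String, AtomicLong> entry : current.entrySet()) {
            snapshot.put(entry.getKey(), entry.getValue().get());
        }
    }
}
```

With this shape, a scraper such as Prometheus only ever reads raw(...), while the hourly report reads sinceSnapshot(...) and then calls takeSnapshot().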

Example diagnostics data for Prometheus (http://localhost:4104/metrics?format=prometheus):

# distribution: TypeDB Core
# version: 2.28.0
# os: Mac OS X x86_64 14.2.1

# TYPE server_resources_count gauge
server_resources_count{kind="memoryUsedInBytes"} 68160245760
server_resources_count{kind="memoryAvailableInBytes"} 559230976
server_resources_count{kind="diskUsedInBytes"} 175619862528
server_resources_count{kind="diskAvailableInBytes"} 1819598303232

# TYPE typedb_schema_data_count gauge
typedb_schema_data_count{database="212487319", kind="typeCount"} 74
typedb_schema_data_count{database="212487319", kind="entityCount"} 2891
typedb_schema_data_count{database="212487319", kind="relationCount"} 2466
typedb_schema_data_count{database="212487319", kind="attributeCount"} 5832
typedb_schema_data_count{database="212487319", kind="hasCount"} 13325
typedb_schema_data_count{database="212487319", kind="roleCount"} 7984
typedb_schema_data_count{database="212487319", kind="storageInBytes"} 2164793
typedb_schema_data_count{database="212487319", kind="storageKeyCount"} 94028
typedb_schema_data_count{database="3717486", kind="typeCount"} 5
typedb_schema_data_count{database="3717486", kind="entityCount"} 0
typedb_schema_data_count{database="3717486", kind="relationCount"} 0
typedb_schema_data_count{database="3717486", kind="attributeCount"} 0
typedb_schema_data_count{database="3717486", kind="hasCount"} 0
typedb_schema_data_count{database="3717486", kind="roleCount"} 0
typedb_schema_data_count{database="3717486", kind="storageInBytes"} 0
typedb_schema_data_count{database="3717486", kind="storageKeyCount"} 0

# TYPE typedb_attempted_requests_total counter
typedb_attempted_requests_total{kind="CONNECTION_OPEN"} 4
typedb_attempted_requests_total{kind="DATABASES_ALL"} 4
typedb_attempted_requests_total{kind="DATABASES_GET"} 4
typedb_attempted_requests_total{kind="SERVERS_ALL"} 4
typedb_attempted_requests_total{database="212487319", kind="DATABASES_CONTAINS"} 2
typedb_attempted_requests_total{database="212487319", kind="SESSION_OPEN"} 2
typedb_attempted_requests_total{database="212487319", kind="TRANSACTION_EXECUTE"} 70
typedb_attempted_requests_total{database="212487319", kind="SESSION_CLOSE"} 1
typedb_attempted_requests_total{database="3717486", kind="DATABASES_CONTAINS"} 2
typedb_attempted_requests_total{database="3717486", kind="SESSION_OPEN"} 2
typedb_attempted_requests_total{database="3717486", kind="TRANSACTION_EXECUTE"} 54
typedb_attempted_requests_total{database="3717486", kind="SESSION_CLOSE"} 1

# TYPE typedb_successful_requests_total counter
typedb_successful_requests_total{kind="CONNECTION_OPEN"} 4
typedb_successful_requests_total{kind="DATABASES_ALL"} 4
typedb_successful_requests_total{kind="DATABASES_GET"} 4
typedb_successful_requests_total{kind="SERVERS_ALL"} 4
typedb_successful_requests_total{kind="USER_TOKEN"} 8
typedb_successful_requests_total{database="212487319", kind="DATABASES_CONTAINS"} 2
typedb_successful_requests_total{database="212487319", kind="SESSION_OPEN"} 2
typedb_successful_requests_total{database="212487319", kind="TRANSACTION_EXECUTE"} 67
typedb_successful_requests_total{database="212487319", kind="SESSION_CLOSE"} 1
typedb_successful_requests_total{database="3717486", kind="DATABASES_CONTAINS"} 2
typedb_successful_requests_total{database="3717486", kind="SESSION_OPEN"} 2
typedb_successful_requests_total{database="3717486", kind="TRANSACTION_EXECUTE"} 47
typedb_successful_requests_total{database="3717486", kind="SESSION_CLOSE"} 1

# TYPE typedb_error_total counter
typedb_error_total{database="3717486", code="TYR03"} 5
typedb_error_total{database="3717486", code="TXN08"} 2

Example diagnostics JSON data from monitoring (http://localhost:4104/metrics?format=JSON):

{
  "version": 1,
  "deploymentID": "HTAOYJNSRYY2WOUR",
  "serverID": "HTAOYJNSRYY2WOUR",
  "distribution": "TypeDB Core",
  "timestamp": "2024-05-14T09:50:46",
  "server": {
    "version": "2.28.0",
    "uptimeInSeconds": 134,
    "os": {
      "name": "Mac OS X",
      "arch": "x86_64",
      "version": "14.2.1"
    },
    "memoryUsedInBytes": 68151644160,
    "memoryAvailableInBytes": 567832576,
    "diskUsedInBytes": 175619862528,
    "diskAvailableInBytes": 1819598303232
  },
  "load": [
    {
      "database": "212487319",
      "schema": {
        "typeCount": 74
      },
      "data": {
        "entityCount": 2891,
        "relationCount": 2466,
        "attributeCount": 5832,
        "hasCount": 13325,
        "roleCount": 7984,
        "storageInBytes": 2164793,
        "storageKeyCount": 94028
      }
    },
    {
      "database": "3717486",
      "schema": {
        "typeCount": 5
      },
      "data": {
        "entityCount": 0,
        "relationCount": 0,
        "attributeCount": 0,
        "hasCount": 0,
        "roleCount": 0,
        "storageInBytes": 0,
        "storageKeyCount": 0
      }
    }
  ],
  "actions": [
    {
      "name": "CONNECTION_OPEN",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "DATABASES_ALL",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "DATABASES_GET",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "SERVERS_ALL",
      "attempted": 4,
      "successful": 4
    },
    {
      "name": "DATABASES_CONTAINS",
      "database": "212487319",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "SESSION_OPEN",
      "database": "212487319",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "TRANSACTION_EXECUTE",
      "database": "212487319",
      "attempted": 70,
      "successful": 67
    },
    {
      "name": "SESSION_CLOSE",
      "database": "212487319",
      "attempted": 1,
      "successful": 1
    },
    {
      "name": "DATABASES_CONTAINS",
      "database": "3717486",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "SESSION_OPEN",
      "database": "3717486",
      "attempted": 2,
      "successful": 2
    },
    {
      "name": "TRANSACTION_EXECUTE",
      "database": "3717486",
      "attempted": 54,
      "successful": 47
    },
    {
      "name": "SESSION_CLOSE",
      "database": "3717486",
      "attempted": 1,
      "successful": 1
    }
  ],
  "errors": [
    {
      "code": "TYR03",
      "database": "3717486",
      "count": 5
    },
    {
      "code": "TXN08",
      "database": "3717486",
      "count": 2
    }
  ]
}

Example of diagnostics JSON data sent when the reporting flag is turned on:

{
  "version":1,
  "deploymentID":"HTAOYJNSRYY2WOUR",
  "serverID":"HTAOYJNSRYY2WOUR",
  "distribution":"TypeDB Core",
  "timestamp":"2024-05-14T09:50:36",
  "periodInSeconds":3600,
  "enabled":true,
  "server":{
    "version":"2.28.0",
    "uptimeInSeconds":124,
    "os":{
      "name":"Mac OS X",
      "arch":"x86_64",
      "version":"14.2.1"
    },
    "memoryUsedInBytes":68097245184,
    "memoryAvailableInBytes":622231552,
    "diskUsedInBytes":175624044544,
    "diskAvailableInBytes":1819594121216
  },
  "load":[
    {
      "database":"212487319",
      "schema":{
        "typeCount":74
      },
      "data":{
        "entityCount":2868,
        "relationCount":2449,
        "attributeCount":5816,
        "hasCount":13247,
        "roleCount":7927,
        "storageInBytes":2164793,
        "storageKeyCount":93379
      },
      "connection":{
        "schemaTransactionPeakCount":0,
        "readTransactionPeakCount":1,
        "writeTransactionPeakCount":1
      }
    },
    {
      "database":"3717486",
      "schema":{
        "typeCount":5
      },
      "data":{
        "entityCount":0,
        "relationCount":0,
        "attributeCount":0,
        "hasCount":0,
        "roleCount":0,
        "storageInBytes":0,
        "storageKeyCount":0
      },
      "connection":{
        "schemaTransactionPeakCount":0,
        "readTransactionPeakCount":2,
        "writeTransactionPeakCount":1
      }
    }
  ],
  "actions":[
    {
      "name":"CONNECTION_OPEN",
      "successful":2,
      "failed":0
    },
    {
      "name":"DATABASES_ALL",
      "successful":2,
      "failed":0
    },
    {
      "name":"DATABASES_GET",
      "successful":2,
      "failed":0
    },
    {
      "name":"SERVERS_ALL",
      "successful":2,
      "failed":0
    },
    {
      "name":"DATABASES_CONTAINS",
      "database":"212487319",
      "successful":1,
      "failed":0
    },
    {
      "name":"SESSION_OPEN",
      "database":"212487319",
      "successful":1,
      "failed":0
    },
    {
      "name":"TRANSACTION_EXECUTE",
      "database":"212487319",
      "successful":32,
      "failed":2
    },
    {
      "name":"SESSION_CLOSE",
      "database":"212487319",
      "successful":1,
      "failed":0
    },
    {
      "name":"DATABASES_CONTAINS",
      "database":"3717486",
      "successful":1,
      "failed":0
    },
    {
      "name":"SESSION_OPEN",
      "database":"3717486",
      "successful":1,
      "failed":0
    },
    {
      "name":"TRANSACTION_EXECUTE",
      "database":"3717486",
      "successful":27,
      "failed":4
    },
    {
      "name":"SESSION_CLOSE",
      "database":"3717486",
      "successful":1,
      "failed":0
    }
  ],
  "errors":[
    {
      "code":"TYR03",
      "database":"3717486",
      "count":3
    },
    {
      "code":"TXN08",
      "database":"3717486",
      "count":1
    }
  ]
}
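
The PeakCount fields in the connection section above could be maintained along the following lines. This is a hedged sketch rather than the code from this PR: TransactionPeakCounter is a hypothetical name, and it assumes that each reporting period restarts the peak from the number of transactions still open, which the description does not spell out.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a per-database, per-transaction-type peak tracker.
class TransactionPeakCounter {
    private final AtomicLong open = new AtomicLong();
    private final AtomicLong peak = new AtomicLong();

    // Called when a transaction is opened: raise the peak if it was exceeded.
    void onOpen() {
        long nowOpen = open.incrementAndGet();
        peak.accumulateAndGet(nowOpen, Math::max);
    }

    // Called when a transaction is closed; the peak is left untouched.
    void onClose() {
        open.decrementAndGet();
    }

    // Returns the peak of the finished period and restarts the next period
    // from the currently open count (assumed behaviour). The read of `open`
    // and the write of `peak` are not one atomic step, which is acceptable
    // for a sketch but would need care in real code.
    long takePeakAndReset() {
        return peak.getAndSet(open.get());
    }
}
```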

Example of diagnostics JSON data sent once when the reporting flag is turned off:

{
  "version":1,
  "deploymentID":"HTAOYJNSRYY2WOUR",
  "serverID":"HTAOYJNSRYY2WOUR",
  "distribution":"TypeDB Core",
  "timestamp":"2024-05-14T10:03:53",
  "periodInSeconds":3600,
  "enabled":false,
  "server":{
    "version":"2.28.0"
  }
}
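
The conditions under which this one-off snapshot is sent (item 6 of the description) can be summarised as a small predicate. The class, method, and parameter names below are hypothetical and only restate the rule from the description; how the server actually tracks whether a snapshot was already sent is not shown in this PR text.

```java
import java.time.Duration;

// Hypothetical restatement of the rule for the "reporting disabled" snapshot.
class DisabledSnapshotPolicy {
    // The server must have been up this long, so that CI test runs do not report.
    static final Duration MIN_UPTIME = Duration.ofHours(1);

    static boolean shouldSend(boolean reportingEnabled,
                              Duration uptime,
                              boolean alreadySentSinceDisabled) {
        if (reportingEnabled) return false;                  // the regular hourly report applies instead
        if (uptime.compareTo(MIN_UPTIME) < 0) return false;  // server has not been up long enough yet
        return !alreadySentSinceDisabled;                    // send at most once per "turn off"
    }
}
```

For example, shouldSend(false, Duration.ofMinutes(90), false) is true, while the same call with an uptime of 10 minutes is false.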

Implementation

There is no major refactoring, as this is planned to be a cleaner feature in the upcoming 3.0.

@farost farost added this to the 2.29.0 milestone May 7, 2024
@farost farost requested a review from dmitrii-ubskii May 7, 2024 08:31
@farost farost requested a review from haikalpribadi as a code owner May 7, 2024 08:31
this.usage = new CurrentCounts();
this.userErrors = new UserErrorStatistics();
public void takeSnapshot() {
this.base.updateSinceTimestamp();
Member Author

In the end, I added sinceTimestamp to the JSON data, as it was strange to see diffs in the diagnostics on the monitoring page. It's not used anywhere in the external code, but potentially could be in the future.

Member

I thought the monitoring page was supposed to use the same data as Prometheus, with counters steadily ticking up?

Member Author

@farost farost May 13, 2024

I can do that, but I just didn't know about it. Let's confirm with anyone who might be interested in this page. It would make sense, I guess.

Member Author

Made it as you proposed. Will update the examples and description tomorrow morning.

server/TransactionService.java (outdated, resolved)
Member

@dmitrii-ubskii dmitrii-ubskii left a comment

Partial review

test/integration/server/ParametersTest.java (resolved)
test/integration/server/ParametersTest.java (outdated, resolved)
test/integration/server/ParametersTest.java (outdated, resolved)
Member

I thought this was going to be renamed to something like DiagnosticsStore?

Member Author

I decided not to rename anything before 3.0, as it could become messy with multiple changes. We are used to the current naming, and it will be easier to reimplement and rename it in 3.0 this way, but I'm open to renaming it if it won't disturb other engineers even more.

Member

I don't think anyone would care, hardly anyone would refer to this class outside of significant changes in the service layer.

Member Author

Makes sense. I'd like to leave it as Metrics for now: it's at least another anchor that shows up when you search for metrics, whereas a diagnostics store would be a new entity we'd introduce with this renaming. But I won't resist if you insist that Metrics is a bad name to leave in the codebase.

common/diagnostics/Metrics.java (outdated, resolved)
common/diagnostics/Metrics.java (outdated, resolved)
common/diagnostics/Metrics.java (outdated, resolved)
UUID sessionID = byteStringAsUUID(request.getSessionId());
SessionService sessionSvc = sessionServices.get(sessionID);
if (sessionSvc == null) throw TypeDBException.of(SESSION_NOT_FOUND, sessionID);
databaseName = sessionSvc.session().database().name();
Member

Not as careful here as in TransactionService, eh?

Member Author

Well, it's inside a try block, and it would be a valid throw if the request lacked the session, the database, or its name. That's what I thought.

Member

Hmm, but now if sessionSvc != null, but the database for some reason is null, sessionSvc.close() won't be called. I don't know if that's even possible, but it does change the behaviour in that case.

Member Author

You're right, I didn't pay enough attention to the business logic here, my bad. And having a throw because of this line after everything would be dumb as well. Will be careful here as well then...

Member

Let's actually add a databaseName() getter to the session service so that we can at least have one fewer null check in these places.

Member Author

Good idea, done.
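
A databaseName() getter of the kind proposed in this thread might look roughly like the sketch below. The Session and Database interfaces here are simplified stand-ins, and returning an Optional is an assumption made for illustration; the getter actually added in this PR may have a different signature.

```java
import java.util.Optional;

// Simplified, hypothetical stand-ins for the real session and database types.
interface Database {
    String name();
}

interface Session {
    Database database();
}

// Hypothetical sketch of the session service with a null-safe getter.
class SessionService {
    private final Session session;

    SessionService(Session session) {
        this.session = session;
    }

    Session session() {
        return session;
    }

    // Collapses the session -> database -> name chain into one call, so that
    // callers neither repeat the null checks nor throw before they reach close().
    Optional<String> databaseName() {
        return Optional.ofNullable(session)
                .map(Session::database)
                .map(Database::name);
    }
}
```

A caller collecting diagnostics could then use sessionSvc.databaseName().orElse(null) and still proceed to close the session when the database is missing.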

common/diagnostics/Diagnostics.java (outdated, resolved)
common/diagnostics/Diagnostics.java (outdated, resolved)
common/diagnostics/Metrics.java (outdated, resolved)
database/CoreDatabaseManager.java (resolved)
@dmitrii-ubskii dmitrii-ubskii self-assigned this May 13, 2024
@farost farost modified the milestones: 2.29.0, 2.28.1 May 14, 2024
Member

@dmitrii-ubskii dmitrii-ubskii left a comment

Left a couple comments to address, but otherwise LGTM!

common/diagnostics/Metrics.java (outdated, resolved)
common/diagnostics/Metrics.java (outdated, resolved)
@flyingsilverfin flyingsilverfin merged commit 7a3cf41 into vaticle:development May 20, 2024
10 checks passed