fix benchmark feature read-only apis #4675


@makungaj1 (Collaborator) commented May 10, 2024

Issue #, if available:

Description of changes:

The examples below exercise the read-only benchmark APIs on JumpStartModel: list_deployment_configs(), the deployment_config property, and display_benchmark_metrics().

from sagemaker.jumpstart.model import JumpStartModel

jumpstart_model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b")
jumpstart_model.list_deployment_configs()

[{'DeploymentConfigName': 'lmi-accelerated',
  'DeploymentArgs': {'ImageUri': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-8b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.8',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.g5.2xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 98304,
    'NumberOfAcceleratorDevicesRequired': 4},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': {'ml.g5.2xlarge': [{'name': 'Latency',
     'value': '12',
     'unit': 'ms/token',
     'concurrency': '1'},
    {'name': 'Throughput',
     'value': '213',
     'unit': 'tokens/sec',
     'concurrency': '1'},
    {'name': 'Latency', 'value': '12', 'unit': 'ms/token', 'concurrency': '2'},
    {'name': 'Throughput',
     'value': '213',
     'unit': 'tokens/sec',
     'concurrency': '2'},
    {'name': 'Instance Rate',
     'value': '1.515',
     'unit': 'USD/Hrs',
     'concurrency': None}]}},
 {'DeploymentConfigName': 'lmi',
  'DeploymentArgs': {'ImageUri': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-8b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.8',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.g5.2xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 98304,
    'NumberOfAcceleratorDevicesRequired': 4},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': {'ml.g5.2xlarge': [{'name': 'Latency',
     'value': '36',
     'unit': 'ms/token',
     'concurrency': '1'},
    {'name': 'Throughput',
     'value': '390',
     'unit': 'tokens/sec',
     'concurrency': '1'},
    {'name': 'Latency', 'value': '36', 'unit': 'ms/token', 'concurrency': '2'},
    {'name': 'Throughput',
     'value': '390',
     'unit': 'tokens/sec',
     'concurrency': '2'},
    {'name': 'Instance Rate',
     'value': '1.515',
     'unit': 'USD/Hrs',
     'concurrency': None}]}},
 {'DeploymentConfigName': 'lmi-trtllm',
  'DeploymentArgs': {'ImageUri': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-8b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.8',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.g5.2xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 98304,
    'NumberOfAcceleratorDevicesRequired': 4},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': None},
 {'DeploymentConfigName': 'tgi',
  'DeploymentArgs': {'ImageUri': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121',
   'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-8b/artifacts/inference-prepack/v1.0.0/',
     'S3DataType': 'S3Prefix',
     'CompressionType': 'None'}},
   'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
    'ENDPOINT_SERVER_TIMEOUT': '3600',
    'MODEL_CACHE_ROOT': '/opt/ml/model',
    'SAGEMAKER_ENV': '1',
    'HF_MODEL_ID': '/opt/ml/model',
    'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
    'OPTION_MODEL_ID': '/opt/ml/model',
    'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
    'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
    'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
    'OPTION_ROLLING_BATCH': 'lmi-dist',
    'OPTION_GPU_MEMORY_UTILIZATION': '0.8',
    'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
   'InstanceType': 'ml.g5.2xlarge',
   'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 98304,
    'NumberOfAcceleratorDevicesRequired': 4},
   'ModelDataDownloadTimeout': 1200,
   'ContainerStartupHealthCheckTimeout': 1200},
  'AccelerationConfigs': None,
  'BenchmarkMetrics': None}]
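
For readers scanning the output above, here is a minimal sketch (illustrative only, not part of this PR's code) that summarizes the returned configs using just the dictionary keys shown:

# Illustrative only: summarize the configs returned by list_deployment_configs().
for config in jumpstart_model.list_deployment_configs():
    name = config["DeploymentConfigName"]
    instance_type = config["DeploymentArgs"]["InstanceType"]
    has_metrics = config["BenchmarkMetrics"] is not None
    print(f"{name}: default instance {instance_type}, "
          f"{'benchmark metrics available' if has_metrics else 'no benchmark metrics'}")
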
jumpstart_model.deployment_config

{'DeploymentConfigName': 'lmi-accelerated',
 'DeploymentArgs': {'ImageUri': '763104351884.dkr.ecr.us-west-2.amazonaws.com/djl-inference:0.27.0-deepspeed0.12.6-cu121',
  'ModelData': {'S3DataSource': {'S3Uri': 's3://jumpstart-private-cache-prod-us-west-2/meta-textgeneration/meta-textgeneration-llama-3-8b/artifacts/inference-prepack/v1.0.0/',
    'S3DataType': 'S3Prefix',
    'CompressionType': 'None'}},
  'Environment': {'SAGEMAKER_PROGRAM': 'inference.py',
   'ENDPOINT_SERVER_TIMEOUT': '3600',
   'MODEL_CACHE_ROOT': '/opt/ml/model',
   'SAGEMAKER_ENV': '1',
   'HF_MODEL_ID': '/opt/ml/model',
   'SERVING_LOAD_MODELS': 'test::MPI=/opt/ml/model',
   'OPTION_MODEL_ID': '/opt/ml/model',
   'OPTION_SPECULATIVE_DRAFT_MODEL': 'sagemaker',
   'OPTION_TENSOR_PARALLEL_DEGREE': 'max',
   'OPTION_MAX_ROLLING_BATCH_SIZE': '64',
   'OPTION_ROLLING_BATCH': 'lmi-dist',
   'OPTION_GPU_MEMORY_UTILIZATION': '0.8',
   'SAGEMAKER_MODEL_SERVER_WORKERS': '1'},
  'InstanceType': 'ml.g5.2xlarge',
  'ComputeResourceRequirements': {'MinMemoryRequiredInMb': 98304,
   'NumberOfAcceleratorDevicesRequired': 4},
  'ModelDataDownloadTimeout': 1200,
  'ContainerStartupHealthCheckTimeout': 1200},
 'AccelerationConfigs': None,
 'BenchmarkMetrics': {'ml.g5.2xlarge': [{'name': 'Latency',
    'value': '12',
    'unit': 'ms/token',
    'concurrency': '1'},
   {'name': 'Throughput',
    'value': '213',
    'unit': 'tokens/sec',
    'concurrency': '1'},
   {'name': 'Latency', 'value': '12', 'unit': 'ms/token', 'concurrency': '2'},
   {'name': 'Throughput',
    'value': '213',
    'unit': 'tokens/sec',
    'concurrency': '2'},
   {'name': 'Instance Rate',
    'value': '1.515',
    'unit': 'USD/Hrs',
    'concurrency': None}]}}
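
Similarly, a small hedged sketch (not from this change) showing how the benchmark entries for the selected config's instance type can be read straight out of the dictionary above:

# Illustrative only: print the benchmark entries for the selected config's instance type.
selected = jumpstart_model.deployment_config
instance_type = selected["DeploymentArgs"]["InstanceType"]
for metric in selected["BenchmarkMetrics"][instance_type]:
    concurrency = metric["concurrency"] or "n/a"
    print(f"{metric['name']}: {metric['value']} {metric['unit']} (concurrency {concurrency})")
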
jumpstart_model.display_benchmark_metrics()

| Instance Type           |   Concurrent Users | Config Name     |   Latency for each user (TTFT in ms) |   Throughput per user (token/seconds) |   Instance Rate (USD/Hrs) |
|:------------------------|-------------------:|:----------------|-------------------------------------:|--------------------------------------:|--------------------------:|
| ml.g5.2xlarge (Default) |                  1 | lmi-accelerated |                                   12 |                                   213 |                     1.515 |
| ml.g5.2xlarge           |                  2 | lmi-accelerated |                                   12 |                                   213 |                     1.515 |
| ml.g5.2xlarge           |                  1 | lmi             |                                   36 |                                   390 |                     1.515 |
| ml.g5.2xlarge           |                  2 | lmi             |                                   36 |                                   390 |                     1.515 |
| ml.g5.12xlarge          |                  1 | lmi             |                                   14 |                                   651 |                     7.09  |
| ml.g5.12xlarge          |                  2 | lmi             |                                   14 |                                   651 |                     7.09  |
| ml.p4d.24xlarge         |                  1 | lmi             |                                    7 |                                  2274 |                    38.67  |
| ml.p4d.24xlarge         |                  2 | lmi             |                                    7 |                                  2274 |                    38.67  |
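
As an illustration of how these numbers might be consumed downstream (a sketch under an assumed, hypothetical latency budget; nothing in this PR does this), one could pick the cheapest benchmarked option that stays under a latency target:

# Illustrative only; LATENCY_BUDGET is a hypothetical threshold, not from the PR.
LATENCY_BUDGET = 20.0  # ms/token

candidates = []
for config in jumpstart_model.list_deployment_configs():
    for instance_type, metrics in (config["BenchmarkMetrics"] or {}).items():
        latency = next((float(m["value"]) for m in metrics if m["name"] == "Latency"), None)
        rate = next((float(m["value"]) for m in metrics if m["name"] == "Instance Rate"), None)
        if latency is not None and rate is not None and latency <= LATENCY_BUDGET:
            candidates.append((rate, config["DeploymentConfigName"], instance_type))

if candidates:
    rate, name, instance_type = min(candidates)
    print(f"Cheapest benchmarked option within budget: {name} on {instance_type} at {rate} USD/hr")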

Testing done:

Merge Checklist

Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.

General

  • [x] I have read the CONTRIBUTING doc
  • [x] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
  • [x] I used the commit message format described in CONTRIBUTING
  • [x] I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
  • [x] I have updated any necessary documentation, including READMEs and API docs (if appropriate)

Tests

  • [x] I have added tests that prove my fix is effective or that my feature works (if appropriate)
  • [x] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
  • [x] I have checked that my tests are not configured for a specific region or account (if appropriate)
  • [x] I have used unique_name_from_base to create resource names in integ tests (if appropriate)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@makungaj1 marked this pull request as ready for review May 10, 2024 21:52
@makungaj1 requested a review from a team as a code owner May 10, 2024 21:52
@makungaj1 requested review from ptkab and removed request for a team May 10, 2024 21:52
Review comment on this excerpt from the changed code:

        str: Normalized metric column name.
    """
    if "latency" in name.lower():
        name = "Latency for each user (TTFT in ms)"
Reviewer: could we use the metric unit from metadata directly?

makungaj1 (Collaborator, Author) replied: We can, but only if the metadata is updated to the desired format. Here is what the metadata currently contains:

{
    "name": "latency",
    "value": "36",
    "unit": "ms/token",
    "concurrency": "2"
},

@liujiaorr merged commit 149edb7 into aws:master-benchmark-feature May 22, 2024
11 checks passed
@makungaj1 deleted the master-benchmark-feature-concurrency branch May 23, 2024 03:41