I cannot make it work on GPU for training #5776

Open
dtamienER opened this issue May 3, 2024 · 3 comments
dtamienER commented May 3, 2024

Description of the issue

I cannot run any experiment on GPU.

I have tried a Tesla P4, a P100, and a GTX 1060; I can only make it work on CPU.

I have tried many configs, setting useActiveGpu to True or False, trialGpuNumber to 1, and gpuIndices to '0' (see the sketch below), but it never completed a single architecture training.

I have tried both outside and inside a Docker container.
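
For reference, here is a minimal sketch (an illustration, not my exact experiment code) of the GPU-related settings I mean, written with the NNI Python experiment API; the YAML config.yml exposes the same trialGpuNumber / useActiveGpu / gpuIndices fields under trainingService.

# Hedged sketch: enabling GPU trials for the mnist-pytorch example.
# Paths and values mirror the config.yml I am modifying; adjust as needed.
from nni.experiment import Experiment

experiment = Experiment('local')
experiment.config.trial_command = 'python3 mnist.py'
experiment.config.trial_code_directory = 'nni/examples/trials/mnist-pytorch'
experiment.config.search_space_file = 'nni/examples/trials/mnist-pytorch/search_space.json'
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args['optimize_mode'] = 'maximize'
experiment.config.trial_concurrency = 1
experiment.config.trial_gpu_number = 1                     # one GPU per trial
experiment.config.training_service.use_active_gpu = True   # also tried False
experiment.config.training_service.gpu_indices = [0]       # also tried '0'
experiment.run(8081)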

Configuration

  • Experiment config: nni/examples/trials/mnist-pytorch/config.yml

Outside a Docker container

Environment

  • NNI version: 3.0
  • Training service: local
  • Client OS: Debian 10
  • Python version: 3.10.13
  • PyTorch/TensorFlow version: 2.3.0+cu121
  • Is conda/virtualenv/venv used?: yes

Log message

nnimanager.log

[2024-05-03 10:54:56] WARNING (pythonScript) Python command [nni.tools.nni_manager_scripts.collect_gpu_info] has stderr: Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 174, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/nni/tools/nni_manager_scripts/collect_gpu_info.py", line 34, in main
    print(json.dumps(data), flush=True)
  File "/opt/conda/lib/python3.10/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/opt/conda/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/conda/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/opt/conda/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type bytes is not JSON serializable
 
[2024-05-03 10:54:56] INFO (ShutdownManager) Initiate shutdown: training service initialize failed
[2024-05-03 10:54:56] ERROR (GpuInfoCollector) Failed to collect GPU info, collector output: 
[2024-05-03 10:54:56] ERROR (TrainingServiceCompat) Training srevice initialize failed: Error: TaskScheduler: Failed to collect GPU info
    at TaskScheduler.init (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler/scheduler.js:16:19)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async TaskSchedulerClient.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/task_scheduler_client.js:20:13)
    at async Promise.all (index 0)
    at async TrialKeeper.start (/opt/conda/lib/python3.10/site-packages/nni_node/common/trial_keeper/keeper.js:48:9)
    at async LocalTrainingServiceV3.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/local_v3/local.js:28:9)
    at async V3asV1.start (/opt/conda/lib/python3.10/site-packages/nni_node/training_service/v3/compat.js:235:29

Here, the GPU info cannot be retrieved: the collector script crashes while serializing its output.
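
For what it's worth, the TypeError itself is easy to reproduce outside NNI. Below is a minimal sketch, assuming some value in the dict that collect_gpu_info serializes is raw bytes (for example, a device name as returned by some pynvml versions); this is only my guess at the cause, not a confirmed fix, and the gpu_info dict here is hypothetical.

import json

# Hypothetical GPU record containing a bytes value instead of str.
gpu_info = {'index': 0, 'model': b'Tesla P4'}

try:
    print(json.dumps(gpu_info))
except TypeError as err:
    print(err)  # Object of type bytes is not JSON serializable

# Decoding the bytes first makes the same payload serializable.
clean = {k: v.decode() if isinstance(v, bytes) else v for k, v in gpu_info.items()}
print(json.dumps(clean))  # {"index": 0, "model": "Tesla P4"}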

experiment.log

[2024-05-03 13:52:31] INFO (nni.experiment) Starting web server...
[2024-05-03 13:52:32] INFO (nni.experiment) Setting up...
[2024-05-03 13:52:33] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8081 http://10.164.0.8:8081 http://172.17.0.1:8081
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 13:53:03] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 13:53:03] INFO (nni.experiment) Stopping NNI manager, if any...
[2024-05-03 13:53:23] ERROR (nni.experiment) HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
Traceback (most recent call last):
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 537, in _make_request
    response = conn.getresponse()
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connection.py", line 466, in getresponse
    httplib_response = super().getresponse()
  File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 1377, in getresponse
    response.begin()
  File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/opt/conda/envs/nni/lib/python3.9/http/client.py", line 281, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/opt/conda/envs/nni/lib/python3.9/socket.py", line 704, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 847, in urlopen
    retries = retries.increment(
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/util/util.py", line 39, in reraise
    raise value
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 793, in urlopen
    response = self._make_request(
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 539, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/urllib3/connectionpool.py", line 370, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/experiment.py", line 171, in _stop_nni_manager
    rest.delete(self.port, '/experiment', self.url_prefix)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 52, in delete
    request('delete', port, api, prefix=prefix)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/nni/experiment/rest.py", line 31, in request
    resp = requests.request(method, url, timeout=timeout)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/envs/nni/lib/python3.9/site-packages/requests/adapters.py", line 532, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='localhost', port=8081): Read timed out. (read timeout=20)
[2024-05-03 13:53:23] WARNING (nni.experiment) Cannot gracefully stop experiment, killing NNI process...

Here the client times out because no data can be retrieved from the NNI manager.

Inside a Docker container

Dockerfile

# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.

FROM nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04

ARG NNI_RELEASE

LABEL maintainer='Microsoft NNI Team<nni@microsoft.com>'

ENV DEBIAN_FRONTEND=noninteractive 

RUN apt-get -y update
RUN apt-get -y install \
    automake \
    build-essential \
    cmake \
    curl \
    git \
    openssh-server \
    python3 \
    python3-dev \
    python3-pip \
    sudo \
    unzip \
    wget \
    zip
RUN apt-get clean
RUN rm -rf /var/lib/apt/lists/*

RUN ln -s python3 /usr/bin/python

RUN python3 -m pip --no-cache-dir install pip==22.0.3 setuptools==60.9.1 wheel==0.37.1

RUN python3 -m pip --no-cache-dir install \
    lightgbm==3.3.2 \
    numpy==1.22.2 \
    pandas==1.4.1 \
    scikit-learn==1.0.2 \
    scipy==1.8.0

RUN python3 -m pip --no-cache-dir install \
    torch==1.10.2+cu113 \
    torchvision==0.11.3+cu113 \
    torchaudio==0.10.2+cu113 \
    -f https://download.pytorch.org/whl/cu113/torch_stable.html
RUN python3 -m pip --no-cache-dir install pytorch-lightning==1.6.1

RUN python3 -m pip --no-cache-dir install tensorflow==2.9.1

RUN python3 -m pip --no-cache-dir install azureml==0.2.7 azureml-sdk==1.38.0

# COPY dist/nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl .
# RUN python3 -m pip install nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl
# RUN rm nni-${NNI_RELEASE}-py3-none-manylinux1_x86_64.whl

ENV PATH=/root/.local/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/bin:/usr/bin:/usr/sbin

WORKDIR /root

RUN pip install nni
RUN git clone https://github.com/microsoft/nni.git

RUN apt-get -y update
RUN apt-get -y install nano

Log message

nnimanager.log

root@1b02414e6d3e:~/nni-experiments/_latest/log# cat nnimanager.log 
[2024-05-03 14:46:11] DEBUG (WsChannelServer.tuner) Start listening tuner/:channel
[2024-05-03 14:46:11] INFO (main) Start NNI manager
[2024-05-03 14:46:11] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-05-03 14:46:11] INFO (RestServer) REST server started.
[2024-05-03 14:46:11] DEBUG (SqlDB) Database directory: /root/nni-experiments/o21hdgqs/db
[2024-05-03 14:46:11] INFO (NNIDataStore) Datastore initialization done
[2024-05-03 14:46:11] DEBUG (main) start() returned.
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) POST: /experiment: body: {
  experimentType: 'hpo',
  searchSpaceFile: '/root/nni/examples/trials/mnist-pytorch/search_space.json',
  searchSpace: {
    batch_size: { _type: 'choice', _value: [Array] },
    hidden_size: { _type: 'choice', _value: [Array] },
    lr: { _type: 'choice', _value: [Array] },
    momentum: { _type: 'uniform', _value: [Array] }
  },
  trialCommand: 'python3 mnist.py',
  trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
  trialConcurrency: 1,
  trialGpuNumber: 1,
  useAnnotation: false,
  debug: false,
  logLevel: 'info',
  experimentWorkingDirectory: '/root/nni-experiments',
  tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } },
  trainingService: {
    platform: 'local',
    trialCommand: 'python3 mnist.py',
    trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
    trialGpuNumber: 1,
    debug: false,
    useActiveGpu: true,
    maxTrialNumberPerGpu: 1,
    reuseMode: false
  }
}
[2024-05-03 14:46:12] INFO (NNIManager) Starting experiment: o21hdgqs
[2024-05-03 14:46:12] INFO (NNIManager) Setup training service...
[2024-05-03 14:46:12] DEBUG (LocalV3.local) Training sevice config: {
  platform: 'local',
  trialCommand: 'python3 mnist.py',
  trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
  trialGpuNumber: 1,
  debug: false,
  useActiveGpu: true,
  maxTrialNumberPerGpu: 1,
  reuseMode: false
}
[2024-05-03 14:46:12] INFO (NNIManager) Setup tuner...
[2024-05-03 14:46:12] DEBUG (NNIManager) dispatcher command: /usr/bin/python3,-m,nni,--exp_params,eyJleHBlcmltZW50VHlwZSI6ImhwbyIsInNlYXJjaFNwYWNlRmlsZSI6Ii9yb290L25uaS9leGFtcGxlcy90cmlhbHMvbW5pc3QtcHl0b3JjaC9zZWFyY2hfc3BhY2UuanNvbiIsInRyaWFsQ29tbWFuZCI6InB5dGhvbjMgbW5pc3QucHkiLCJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvcm9vdC9ubmkvZXhhbXBsZXMvdHJpYWxzL21uaXN0LXB5dG9yY2giLCJ0cmlhbENvbmN1cnJlbmN5IjoxLCJ0cmlhbEdwdU51bWJlciI6MSwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvcm9vdC9ubmktZXhwZXJpbWVudHMiLCJ0dW5lciI6eyJuYW1lIjoiVFBFIiwiY2xhc3NBcmdzIjp7Im9wdGltaXplX21vZGUiOiJtYXhpbWl6ZSJ9fSwidHJhaW5pbmdTZXJ2aWNlIjp7InBsYXRmb3JtIjoibG9jYWwiLCJ0cmlhbENvbW1hbmQiOiJweXRob24zIG1uaXN0LnB5IiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL3Jvb3Qvbm5pL2V4YW1wbGVzL3RyaWFscy9tbmlzdC1weXRvcmNoIiwidHJpYWxHcHVOdW1iZXIiOjEsImRlYnVnIjpmYWxzZSwidXNlQWN0aXZlR3B1Ijp0cnVlLCJtYXhUcmlhbE51bWJlclBlckdwdSI6MSwicmV1c2VNb2RlIjpmYWxzZX19
[2024-05-03 14:46:12] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-05-03 14:46:12] DEBUG (tuner_command_channel) Waiting connection...
[2024-05-03 14:46:12] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:46:13] DEBUG (WsChannelServer.tuner) Incoming connection __default__
[2024-05-03 14:46:13] DEBUG (WsChannel.__default__) Epoch 0 start
[2024-05-03 14:46:13] INFO (NNIManager) Add event listeners
[2024-05-03 14:46:13] DEBUG (NNIManager) Send tuner command: INITIALIZE: [object Object]
[2024-05-03 14:46:13] INFO (LocalV3.local) Start
[2024-05-03 14:46:13] INFO (NNIManager) NNIManager received command from dispatcher: ID, 
[2024-05-03 14:46:13] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.001, "momentum": 0.47523697672790355}, "parameter_index": 0}
[2024-05-03 14:46:14] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 0,
  hyperParameters: {
    value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 32, "hidden_size": 128, "lr": 0.001, "momentum": 0.47523697672790355}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2024-05-03 14:46:15] INFO (GpuInfoCollector) Forced update: {
  gpuNumber: 1,
  driverVersion: '550.54.15',
  cudaVersion: 12040,
  gpus: [
    {
      index: 0,
      model: 'Tesla T4',
      cudaCores: 2560,
      gpuMemory: 16106127360,
      freeGpuMemory: 15642263552,
      gpuCoreUtilization: 0,
      gpuMemoryUtilization: 0
    }
  ],
  processes: [],
  success: true
}
[2024-05-03 14:46:17] INFO (LocalV3.local) Register directory trial_code = /root/nni/examples/trials/mnist-pytorch

experiment.log

root@1b02414e6d3e:~/nni-experiments/_latest/log# cat experiment.log 
[2024-05-03 14:46:11] INFO (nni.experiment) Creating experiment, Experiment ID: o21hdgqs
[2024-05-03 14:46:11] INFO (nni.experiment) Starting web server...
[2024-05-03 14:46:12] INFO (nni.experiment) Setting up...
[2024-05-03 14:46:12] INFO (nni.experiment) Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-05-03 14:46:42] INFO (nni.experiment) Stopping experiment, please wait...
[2024-05-03 14:46:42] INFO (nni.experiment) Saving experiment checkpoint...
[2024-05-03 14:46:42] INFO (nni.experiment) Stopping NNI manager, if any...

When I'm using CPU only, I get everything I would expect from a GPU run: the WebUI, the experiment trials, and so on.

root@6dcd2267cf44:~# nnictl create --config nni/examples/trials/mnist-pytorch/config.yml --foreground --debug
[2024-05-03 14:37:54] Creating experiment, Experiment ID: tcq192jf
[2024-05-03 14:37:54] Starting web server...
[2024-05-03 14:37:55] DEBUG (WsChannelServer.tuner) Start listening tuner/:channel
[2024-05-03 14:37:55] INFO (main) Start NNI manager
[2024-05-03 14:37:55] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-05-03 14:37:55] INFO (RestServer) REST server started.
[2024-05-03 14:37:55] DEBUG (SqlDB) Database directory: /root/nni-experiments/tcq192jf/db
[2024-05-03 14:37:55] INFO (NNIDataStore) Datastore initialization done
[2024-05-03 14:37:55] DEBUG (main) start() returned.
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:37:55] Setting up...
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) POST: /experiment: body: {
  experimentType: 'hpo',
  searchSpaceFile: '/root/nni/examples/trials/mnist-pytorch/search_space.json',
  searchSpace: {
    batch_size: { _type: 'choice', _value: [Array] },
    hidden_size: { _type: 'choice', _value: [Array] },
    lr: { _type: 'choice', _value: [Array] },
    momentum: { _type: 'uniform', _value: [Array] }
  },
  trialCommand: 'python3 mnist.py',
  trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
  trialConcurrency: 1,
  trialGpuNumber: 0,
  useAnnotation: false,
  debug: false,
  logLevel: 'info',
  experimentWorkingDirectory: '/root/nni-experiments',
  tuner: { name: 'TPE', classArgs: { optimize_mode: 'maximize' } },
  trainingService: {
    platform: 'local',
    trialCommand: 'python3 mnist.py',
    trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
    trialGpuNumber: 0,
    debug: false,
    maxTrialNumberPerGpu: 1,
    reuseMode: false
  }
}
[2024-05-03 14:37:55] INFO (NNIManager) Starting experiment: tcq192jf
[2024-05-03 14:37:55] INFO (NNIManager) Setup training service...
[2024-05-03 14:37:55] DEBUG (LocalV3.local) Training sevice config: {
  platform: 'local',
  trialCommand: 'python3 mnist.py',
  trialCodeDirectory: '/root/nni/examples/trials/mnist-pytorch',
  trialGpuNumber: 0,
  debug: false,
  maxTrialNumberPerGpu: 1,
  reuseMode: false
}
[2024-05-03 14:37:55] INFO (NNIManager) Setup tuner...
[2024-05-03 14:37:55] DEBUG (NNIManager) dispatcher command: /usr/bin/python3,-m,nni,--exp_params,eyJleHBlcmltZW50VHlwZSI6ImhwbyIsInNlYXJjaFNwYWNlRmlsZSI6Ii9yb290L25uaS9leGFtcGxlcy90cmlhbHMvbW5pc3QtcHl0b3JjaC9zZWFyY2hfc3BhY2UuanNvbiIsInRyaWFsQ29tbWFuZCI6InB5dGhvbjMgbW5pc3QucHkiLCJ0cmlhbENvZGVEaXJlY3RvcnkiOiIvcm9vdC9ubmkvZXhhbXBsZXMvdHJpYWxzL21uaXN0LXB5dG9yY2giLCJ0cmlhbENvbmN1cnJlbmN5IjoxLCJ0cmlhbEdwdU51bWJlciI6MCwidXNlQW5ub3RhdGlvbiI6ZmFsc2UsImRlYnVnIjpmYWxzZSwibG9nTGV2ZWwiOiJpbmZvIiwiZXhwZXJpbWVudFdvcmtpbmdEaXJlY3RvcnkiOiIvcm9vdC9ubmktZXhwZXJpbWVudHMiLCJ0dW5lciI6eyJuYW1lIjoiVFBFIiwiY2xhc3NBcmdzIjp7Im9wdGltaXplX21vZGUiOiJtYXhpbWl6ZSJ9fSwidHJhaW5pbmdTZXJ2aWNlIjp7InBsYXRmb3JtIjoibG9jYWwiLCJ0cmlhbENvbW1hbmQiOiJweXRob24zIG1uaXN0LnB5IiwidHJpYWxDb2RlRGlyZWN0b3J5IjoiL3Jvb3Qvbm5pL2V4YW1wbGVzL3RyaWFscy9tbmlzdC1weXRvcmNoIiwidHJpYWxHcHVOdW1iZXIiOjAsImRlYnVnIjpmYWxzZSwibWF4VHJpYWxOdW1iZXJQZXJHcHUiOjEsInJldXNlTW9kZSI6ZmFsc2V9fQ==
[2024-05-03 14:37:55] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-05-03 14:37:55] DEBUG (tuner_command_channel) Waiting connection...
[2024-05-03 14:37:55] Web portal URLs: http://127.0.0.1:8080 http://172.17.0.2:8080
[2024-05-03 14:37:55] DEBUG (NNIRestHandler) GET: /check-status: body: {}
[2024-05-03 14:37:57] DEBUG (WsChannelServer.tuner) Incoming connection __default__
[2024-05-03 14:37:57] DEBUG (WsChannel.__default__) Epoch 0 start
[2024-05-03 14:37:57] INFO (NNIManager) Add event listeners
[2024-05-03 14:37:57] DEBUG (NNIManager) Send tuner command: INITIALIZE: [object Object]
[2024-05-03 14:37:57] INFO (LocalV3.local) Start
[2024-05-03 14:37:57] INFO (NNIManager) NNIManager received command from dispatcher: ID, 
[2024-05-03 14:37:57] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}
[2024-05-03 14:37:57] INFO (NNIManager) submitTrialJob: form: {
  sequenceId: 0,
  hyperParameters: {
    value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}',
    index: 0
  },
  placementConstraint: { type: 'None', gpus: [] }
}
[2024-05-03 14:37:58] INFO (LocalV3.local) Register directory trial_code = /root/nni/examples/trials/mnist-pytorch
[2024-05-03 14:37:58] INFO (LocalV3.local) Created trial wcvTY
[2024-05-03 14:38:00] INFO (LocalV3.local) Trial parameter: wcvTY {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"batch_size": 128, "hidden_size": 1024, "lr": 0.001, "momentum": 0.6039114358987745}, "parameter_index": 0}
[2024-05-03 14:38:05] DEBUG (NNIRestHandler) GET: /check-status: body: {}
...

How to reproduce it?

If running from a Docker container:

docker build -t "nas-experiment" .
nvidia-docker run -it -p 8081:8081 nas-experiment

Then in both cases:

  1. Both outside and inside the Docker container, I modify /nni/examples/trials/mnist-pytorch/config.yml to run the trials on the GPU (the settings sketched in the description above).
  2. Then I run the following command so I can watch the logs directly:
nnictl create --config /nni/examples/trials/mnist-pytorch/config.yml --port 8081 --debug --foreground

As a result, the WebUI wouldn't start: it times out trying to retrieve data, because the experiment won't load on the GPU.
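
A quick sanity check, independent of NNI, can rule out the environment itself (a minimal sketch, assuming the trial uses PyTorch as in the mnist-pytorch example):

# Run inside the same environment/container that launches the trials.
import torch

print(torch.__version__, torch.version.cuda)  # e.g. 2.3.0+cu121 and 12.1
print(torch.cuda.is_available())              # should print True if CUDA is usable
print(torch.cuda.device_count())              # number of visible GPUs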

Notes

  • I am very available to answer questions and to receive help on this subject, as I currently work on NAS.
  • I'm going to look at what ArchAI is and how it differs from NNI until I can get GPU training working here.
  • I'm using GCP instances for this work.
dtamienER (Author) commented

I made it work using a devcontainer with NNI version 2.7.

Dockerfile

FROM msranni/nni:v2.7

RUN pip install matplotlib tensorflow_datasets dill

I'm still having problems with version 3.0.


Rajesh90123 commented May 22, 2024

I have a somewhat similar issue.

authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minmize
trial:
  command: python train.py
  codeDir: .

When I work in this fashion, the code runs on the CPU. But when I run it with the following config:

authorName: default
experimentName: hyperparam searching
trialConcurrency: 1
trainingServicePlatform: local
useAnnotation: false
searchSpacePath: searching_space.json
tuner:
  builtinTunerName: Random
  classArgs:
    optimize_mode: minmize
trial:
  command: python train.py
  codeDir: .
  gpuNum: 1
localConfig:
  useActiveGpu: false

It creates 800+ Python files and the link doesn't open anymore. It either crashes my PC (because of all those files) or the WebUI shows 0 running trials. Why?


msasen commented May 23, 2024

I am having the same problem as Rajesh90123.
