Dask Cluster stuck in the pending status and shutdown itself with Dask Gateway over the Slurm HPC Cluster #478

Open
menendes opened this issue Feb 15, 2022 · 4 comments

Comments

@menendes

What happened: When I try to create a cluster via Dask Gateway I get an error like the one shown below. Even when the cluster is created successfully, I think it gets stuck in the pending status and then shuts itself down automatically. When I use a Slurm command such as sbatch directly, I can verify that the job runs successfully on the Slurm cluster, but when I create the job via Dask Gateway it closes itself automatically after a few seconds.

from dask_gateway import Gateway
from dask_gateway import BasicAuth

# Simple username/password authentication against the gateway
auth = BasicAuth(username="dask", password="password")

gateway = Gateway("http://10.100.3.99:8000", auth=auth)

print(gateway.list_clusters())

# Request a new cluster; this is the step that ends up stuck in "pending"
cluster = gateway.new_cluster()
print(gateway.list_clusters())
gateway.close()
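
For completeness, a slightly expanded version of the same flow (a minimal sketch using only documented Gateway/GatewayCluster calls; the address and credentials are the ones above) that scales the cluster and attaches a client, which is roughly the point where the pending/shutdown behaviour shows up:

from dask_gateway import Gateway, BasicAuth

auth = BasicAuth(username="dask", password="password")
gateway = Gateway("http://10.100.3.99:8000", auth=auth)

# Request a cluster, ask for one worker, and connect a client.
# If the scheduler job fails on the Slurm side, the cluster drops
# out of list_clusters() again within a few seconds.
cluster = gateway.new_cluster()
cluster.scale(1)
client = cluster.get_client()

print(gateway.list_clusters())

client.close()
cluster.shutdown()
gateway.close()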

dask_gateway_config.py

c.DaskGateway.backend_class = (
    "dask_gateway_server.backends.jobqueue.slurm.SlurmBackend"
)

c.DaskGateway.authenticator_class = "dask_gateway_server.auth.SimpleAuthenticator"
c.SimpleAuthenticator.password = "password"
#c.SimpleAuthenticator.username = "dask"
c.DaskGateway.log_level = 'DEBUG'
#c.DaskGateway.show_config = True
c.SlurmClusterConfig.scheduler_cores = 1
c.SlurmClusterConfig.scheduler_memory = '500 M'
c.SlurmClusterConfig.staging_directory = '{home}/.dask-gateway/'
c.SlurmClusterConfig.worker_cores = 1
c.SlurmClusterConfig.worker_memory = '500 M'
c.SlurmBackend.backoff_base_delay = 0.1
c.SlurmBackend.backoff_max_delay = 300
#c.SlurmBackend.check_timeouts_period = 0.0
c.SlurmBackend.cluster_config_class = 'dask_gateway_server.backends.jobqueue.slurm.SlurmClusterConfig'
c.SlurmBackend.cluster_heartbeat_period = 15
c.SlurmBackend.cluster_start_timeout = 60
c.SlurmBackend.cluster_status_period = 30
c.SlurmBackend.dask_gateway_jobqueue_launcher = '/opt/dask-gateway/miniconda/bin/dask-gateway-jobqueue-launcher'

c.SlurmClusterConfig.adaptive_period = 3
c.SlurmClusterConfig.partition = 'computenodes'

scontrol show job output: scontrol_output (attachment)

Environment:

  • Dask version: 2022.1.1
  • Python version: 3.9.5
  • Operating System: Ubuntu 20.04.2 LTS
  • Slurm package: slurm-wlm 19.05.5-1 (focal, amd64)
  • Install method (conda, pip, source): Conda
@martindurant
Member

Can you get the log from the failed process? As far as I can tell, the printout only says that it terminated with a non-zero code.

@menendes
Author

Can you get the log from the failed process? As far as I can tell, the printout only says that it terminated with a non-zero code.

Hi Martin, do you mean the slurmctld or slurmd log? Where exactly can I view the job logs?

@martindurant
Member

I'm afraid I don't know where such a log would appear, perhaps your sysadmin would know.
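
For reference, Slurm itself records the stdout/stderr paths of each batch job in its job record. A minimal sketch (assuming the Slurm job ID is known and scontrol is on the PATH of the submit host) for reading those paths back:

import subprocess

def slurm_job_output_paths(job_id):
    """Return the (StdOut, StdErr) paths Slurm recorded for a batch job.

    Parses `scontrol show job <id>`, which prints whitespace-separated
    Key=Value pairs; this only works while Slurm still knows the job.
    """
    out = subprocess.run(
        ["scontrol", "show", "job", str(job_id)],
        capture_output=True, text=True, check=True,
    ).stdout
    fields = dict(
        token.split("=", 1) for token in out.split() if "=" in token
    )
    return fields.get("StdOut"), fields.get("StdErr")

# e.g. the batch job launched for the gateway scheduler
print(slurm_job_output_paths(67))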

@menendes
Author

menendes commented Feb 17, 2022

When I view the logs on the worker node I notice some errors. The relevant log lines are below.

Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: Launching batch job 67 for UID 1001
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherEnergy NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherProfile NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherInterconnect NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  AcctGatherFilesystem NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  switch NONE plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Job accounting gather LINUX plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  cont_id hasn't been set yet not running poll
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  laying out the 1 tasks on 1 hosts testslurmworker1 dist 2
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Message thread started pid = 41666
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: task affinity plugin loaded with CPU mask 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000>
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Checkpoint plugin loaded: checkpoint/none
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: Munge credential signature plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  job_container none plugin loaded
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: Could not open stdout file /home/dask/.dask-gateway/2428b456f82a44fdb3c8e57576662e8f/dask-scheduler-2428b456f82a44fdb3c8e57576662e8f.log: >
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: error: IO setup failed: No such file or directory
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  step_terminate_monitor_stop signaling condition
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: job 67 completed with slurm_rc = 0, job_rc = 256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: debug:  Message thread exited
Şub 17 09:03:40 testslurmworker1 slurmstepd[41666]: done with job
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  _rpc_terminate_job, uid = 64030
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  task_p_slurmd_release_resources: affinity jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  credential for job 67 revoked
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Waiting for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Finished wait for job 67's prolog to complete
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Calling /usr/sbin/slurmstepd spank epilog
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  Running spank/epilog for jobid [67] uid [1001]
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  spank: opening plugin stack /etc/slurm-llnl/plugstack.conf
Şub 17 09:03:40 testslurmworker1 spank-epilog[41673]: debug:  /etc/slurm-llnl/plugstack.conf: 1: include "/etc/slurm-llnl/plugstack.conf.d/*.conf"
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  completed epilog for jobid 67
Şub 17 09:03:40 testslurmworker1 slurmd-testslurmworker1[905]: debug:  Job 67: sent epilog complete msg: rc = 0
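
The two error: lines above suggest that slurmstepd could not create the scheduler's log file under /home/dask/.dask-gateway/ on the worker node, so the batch script exited with a non-zero status. A minimal sketch (hypothetical node list; assumes srun is available on the host where this runs) for checking that the configured staging_directory exists and is writable on each compute node:

import subprocess

# Hypothetical node list; substitute the real compute node hostnames.
WORKER_NODES = ["testslurmworker1"]

# Expansion of c.SlurmClusterConfig.staging_directory = '{home}/.dask-gateway/'
# for the account the jobs run under ("dask", uid 1001 in the logs above).
STAGING = "/home/dask/.dask-gateway"

for node in WORKER_NODES:
    # Run a trivial write test on the node via srun; a failure here would
    # reproduce the "No such file or directory" error from slurmstepd.
    probe = f"mkdir -p {STAGING} && touch {STAGING}/.write_test && echo OK"
    result = subprocess.run(
        ["srun", "--nodes=1", "--nodelist", node, "bash", "-c", probe],
        capture_output=True, text=True,
    )
    print(node, result.stdout.strip() or result.stderr.strip())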
