I'm running model training on a compute cluster where the compute nodes do not have access to the internet. Therefore, while the jobs are running, I'm calling wandb sync at regular intervals from a node with internet access that shares the same file system, so that I can follow the training in the UI. When I do this, however, the jobs that are running are listed as "Finished" in the UI throughout the whole run, which makes downstream evaluation problematic as it's set to only run eval on finished jobs.
My question is: is this the expected behavior, or am I doing something wrong?
Best, Lars
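For readers trying to reproduce the setup described above, the periodic sync from the internet-connected node could be sketched roughly as follows. This is only an illustration under assumptions: the helper names, the `runner` parameter, and the five-minute default interval are hypothetical and not part of wandb. The only real interface used is the `wandb sync <path>` CLI command.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, List, Optional


def build_sync_command(run_dir: Path) -> List[str]:
    """Return the CLI invocation that uploads one offline run directory."""
    return ["wandb", "sync", str(run_dir)]


def sync_periodically(
    run_dir: Path,
    interval_s: float = 300.0,  # assumed interval, tune to taste
    max_rounds: Optional[int] = None,
    runner: Callable[[List[str]], object] = subprocess.run,
) -> int:
    """Call `wandb sync` on run_dir every interval_s seconds.

    Stops after max_rounds calls, or loops until interrupted if
    max_rounds is None. Returns the number of sync calls made.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        runner(build_sync_command(run_dir))
        rounds += 1
        # Sleep only if another round will follow.
        if max_rounds is None or rounds < max_rounds:
            time.sleep(interval_s)
    return rounds
```

Calling `sync_periodically(Path("wandb/offline-run-..."))` on the login node would then mirror the workflow in the question; the `runner` argument exists only so the loop can be exercised without invoking the real CLI.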
Hi @ankile, this does seem like a bug on our end, where a live run is marked as finished as soon as it is synced. We fixed this issue a few years ago, so this looks like a regression. Here's the condition in our code which handles this. We'll investigate further internally to identify the root cause.
I was able to reproduce this with the following script as well:
import time

import numpy as np
import pandas as pd
import wandb

run = wandb.init(project="<project_name>")

# Log a random 3x3 table every 5 seconds so the run stays active
# long enough to call `wandb sync` on it while it is still running.
for i in range(0, 100):
    table = wandb.Table(
        data=pd.DataFrame(
            np.random.randint(0, 1001, size=(3, 3)),
            columns=['a', 'b', 'c'],
        ),
        columns=['a', 'b', 'c'],
    )
    wandb.log({"table": table})
    time.sleep(5)

wandb.finish()
On a side note, W&B recommends syncing an offline run only once it has finished (after run.finish() is called) to avoid running into error states. If you sync an active run, its config and some other files are not yet up to date at the point of syncing, so only partial data is uploaded. You should see a warning such as:
WARNING .wandb file is incomplete (record checksum is invalid, data may be corrupt), be sure to sync this run again once it's finished
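As a rough sketch of the "sync only after the run finishes" recommendation, one option is to have the training job drop a marker file once it is done and have the sync node upload only the runs that carry it. Everything here is an assumption for illustration: the `TRAINING_DONE` marker and the helper names are hypothetical, not wandb features. The sketch relies only on the `wandb sync <path>` command and on offline runs being stored in `offline-run-*` directories.

```python
import subprocess
from pathlib import Path
from typing import Callable, List

# Hypothetical marker file written by the training job; not a wandb feature.
MARKER_NAME = "TRAINING_DONE"


def mark_finished(run_dir: Path) -> None:
    """Called by the training job after wandb.finish() has returned."""
    (run_dir / MARKER_NAME).touch()


def finished_offline_runs(wandb_dir: Path) -> List[Path]:
    """Return offline run directories that contain the marker file."""
    return sorted(
        p
        for p in wandb_dir.glob("offline-run-*")
        if p.is_dir() and (p / MARKER_NAME).exists()
    )


def sync_finished_runs(
    wandb_dir: Path,
    runner: Callable[[List[str]], object] = subprocess.run,
) -> List[Path]:
    """Run `wandb sync` once for every run that is marked as finished."""
    synced = finished_offline_runs(wandb_dir)
    for run_dir in synced:
        runner(["wandb", "sync", str(run_dir)])
    return synced
```

The trade-off is that runs only appear in the UI after they complete, so this avoids the partial-upload warning but does not give live progress.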
Thank you so much for your response and this info!
Have you had a chance to look more into why this is happening and how to fix it @anmolmann? Alternatively, are there any escape hatches I could implement directly in the wandb code to fix it while we wait for a fix to be released?