
Training with PEFT + Accelerate randomly gets stuck with DeepSpeed after the first epoch #2724

vikram71198 opened this issue Apr 29, 2024 · 6 comments


vikram71198 commented Apr 29, 2024

Hi, I'm fine-tuning an LLM with soft prompt tuning (PEFT), using DeepSpeed implicitly through Accelerate via the deepspeed param in TrainingArguments.

Everything goes well until after the first epoch, when I get this relatively obscure message:

Invalidate trace cache @ step 0: expected module 0, but got module 456

After some investigation, I came across a comment from a DeepSpeed maintainer explaining what this message means.

So I ignored it, but it turns out training just gets stuck after the first epoch and does not proceed.
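For context, my understanding from that comment is that ZeRO-3 records the order in which modules fetch their parameters during the first step and replays that trace to prefetch them on later steps; if a later step executes modules in a different order, the cache is invalidated. A minimal sketch of that idea (illustrative only, not DeepSpeed's actual implementation; the module ids are made up):

```python
# Illustrative sketch of a prefetch "trace cache": record the module order
# seen on the first step, then validate later steps against it.
# This is NOT DeepSpeed's code, just the idea behind the warning.
class TraceCache:
    def __init__(self):
        self.trace = []        # module ids recorded on the first step
        self.recorded = False

    def step(self, module_ids):
        if not self.recorded:
            self.trace = list(module_ids)
            self.recorded = True
            return True
        if module_ids != self.trace:
            # Same shape as the warning in the logs above.
            print(f"Invalidate trace cache @ step 0: expected module "
                  f"{self.trace[0]}, but got module {module_ids[0]}")
            self.trace = list(module_ids)  # re-record from the new order
            return False
        return True

cache = TraceCache()
cache.step([0, 1, 2])          # first epoch: order is recorded
ok = cache.step([456, 1, 2])   # later epoch starts from a different module
```

So the message itself is expected to be benign: the cache is simply re-recorded. The hang is the actual problem.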

Similar issues have been raised here & here.

I'm not entirely sure whether this is a DeepSpeed or an Accelerate issue, but I'm leaning towards Accelerate.

I'm running all of my experiments on a Databricks cluster with a p4d.24xlarge instance, which has 8x 40 GB NVIDIA A100 GPUs.

These are my platform specs:

Libraries
absl-py==1.0.0
accelerate==0.29.3
aiohttp==3.9.1
aiosignal==1.3.1
anyio==3.5.0
appdirs==1.4.4
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
astor==0.8.1
asttokens==2.0.5
astunparse==1.6.3
async-timeout==4.0.3
attrs==22.1.0
audioread==3.0.1
azure-core==1.29.1
azure-cosmos==4.3.1
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backcall==0.2.0
bcrypt==3.2.0
beautifulsoup4==4.11.1
black==22.6.0
bleach==4.1.0
blinker==1.4
blis==0.7.11
boto3==1.24.28
botocore==1.27.96
cachetools==5.3.2
catalogue==2.0.10
category-encoders==2.6.3
certifi==2022.12.7
cffi==1.15.1
chardet==4.0.0
charset-normalizer==2.0.4
click==8.0.4
cloudpathlib==0.16.0
cloudpickle==2.0.0
cmake==3.28.1
cmdstanpy==1.2.0
comm==0.1.2
confection==0.1.4
configparser==5.2.0
contourpy==1.0.5
cryptography==39.0.1
cycler==0.11.0
cymem==2.0.8
Cython==0.29.32
dacite==1.8.1
databricks-automl-runtime==0.2.20
databricks-cli==0.18.0
databricks-feature-engineering==0.2.1
databricks-sdk==0.1.6
dataclasses-json==0.6.3
datasets==2.15.0
dbl-tempo==0.1.26
dbus-python==1.2.18
debugpy==1.6.7
decorator==5.1.1
deepspeed==0.14.1
defusedxml==0.7.1
dill==0.3.6
diskcache==5.6.3
distlib==0.3.7
distro==1.7.0
distro-info==1.1+ubuntu0.2
docstring-to-markdown==0.11
docstring_parser==0.16
einops==0.7.0
entrypoints==0.4
evaluate==0.4.1
executing==0.8.3
facets-overview==1.1.1
fastjsonschema==2.19.1
fasttext==0.9.2
filelock==3.9.0
flash-attn==2.5.0
Flask==2.2.5
flatbuffers==23.5.26
fonttools==4.25.0
frozenlist==1.4.1
fsspec==2023.6.0
future==0.18.3
gast==0.4.0
gensim==4.3.2
gitdb==4.0.11
GitPython==3.1.27
google-api-core==2.15.0
google-auth==2.21.0
google-auth-oauthlib==1.0.0
google-cloud-core==2.4.1
google-cloud-storage==2.11.0
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
greenlet==2.0.1
grpcio==1.48.2
grpcio-status==1.48.1
gunicorn==20.1.0
gviz-api==1.10.0
h5py==3.7.0
hf_transfer==0.1.6
hjson==3.1.0
holidays==0.38
horovod==0.28.1
htmlmin==0.1.12
httplib2==0.20.2
huggingface-hub==0.21.3
idna==3.4
ImageHash==4.3.1
imbalanced-learn==0.11.0
importlib-metadata==4.11.3
importlib-resources==6.1.1
ipykernel==6.25.0
ipython==8.14.0
ipython-genutils==0.2.0
ipywidgets==7.7.2
isodate==0.6.1
itsdangerous==2.0.1
jedi==0.18.1
jeepney==0.7.1
Jinja2==3.1.2
jmespath==0.10.0
joblib==1.2.0
joblibspark==0.5.1
jsonpatch==1.33
jsonpointer==2.4
jsonschema==4.17.3
jupyter-client==7.3.4
jupyter-server==1.23.4
jupyter_core==5.2.0
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.0
keras==2.14.0
keyring==23.5.0
kiwisolver==1.4.4
langchain==0.0.348
langchain-core==0.0.13
langcodes==3.3.0
langsmith==0.0.79
launchpadlib==1.10.16
lazr.restfulclient==0.14.4
lazr.uri==1.0.6
lazy_loader==0.3
libclang==15.0.6.1
librosa==0.10.1
lightgbm==4.1.0
lit==17.0.6
llvmlite==0.39.1
lxml==4.9.1
Mako==1.2.0
Markdown==3.4.1
markdown-it-py==3.0.0
MarkupSafe==2.1.1
marshmallow==3.20.2
matplotlib==3.7.0
matplotlib-inline==0.1.6
mccabe==0.7.0
mdurl==0.1.2
mistune==0.8.4
ml-dtypes==0.2.0
mlflow-skinny==2.9.2
more-itertools==8.10.0
mpmath==1.2.1
msgpack==1.0.7
multidict==6.0.4
multimethod==1.10
multiprocess==0.70.14
murmurhash==1.0.10
mypy-extensions==0.4.3
nbclassic==0.5.2
nbclient==0.5.13
nbconvert==6.5.4
nbformat==5.7.0
nest-asyncio==1.5.6
networkx==2.8.4
ninja==1.11.1.1
nltk==3.7
nodeenv==1.8.0
notebook==6.5.2
notebook_shim==0.2.2
numba==0.56.4
numpy==1.23.5
nvidia-cublas-cu11==11.11.3.6
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu11==8.7.0.84
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu11==10.9.0.58
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu11==10.3.0.86
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu11==11.7.5.86
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu11==2.19.3
nvidia-nccl-cu12==2.20.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu11==11.8.86
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.0
openai==0.28.1
opt-einsum==3.3.0
packaging==23.2
pandas==1.5.3
pandocfilters==1.5.0
paramiko==2.9.2
parso==0.8.3
pathspec==0.10.3
patsy==0.5.3
peft==0.10.0
petastorm==0.12.1
pexpect==4.8.0
phik==0.12.4
pickleshare==0.7.5
Pillow==9.4.0
platformdirs==2.5.2
plotly==5.9.0
pluggy==1.0.0
pmdarima==2.0.4
pooch==1.4.0
preshed==3.0.9
prompt-toolkit==3.0.36
prophet==1.1.5
protobuf==4.24.0
psutil==5.9.0
psycopg2==2.9.3
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==8.0.0
pyarrow-hotfix==0.5
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.11.1
pycparser==2.21
pydantic==1.10.6
pyflakes==3.1.0
Pygments==2.17.2
PyGObject==3.42.1
PyJWT==2.3.0
PyNaCl==1.5.0
pynvml==11.5.0
pyodbc==4.0.32
pyparsing==3.0.9
pyright==1.1.294
pyrsistent==0.18.0
pytesseract==0.3.10
python-apt==2.4.0+ubuntu3
python-dateutil==2.8.2
python-editor==1.0.4
python-lsp-jsonrpc==1.1.1
python-lsp-server==1.8.0
pytoolconfig==1.2.5
pytz==2022.7
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.0
regex==2022.7.9
requests==2.28.1
requests-oauthlib==1.3.1
responses==0.18.0
rich==13.7.1
rope==1.7.0
rsa==4.9
s3transfer==0.6.2
safetensors==0.4.1
scikit-learn==1.1.1
scipy==1.10.0
seaborn==0.12.2
SecretStorage==3.3.1
Send2Trash==1.8.0
sentence-transformers==2.2.2
sentencepiece==0.1.99
shap==0.44.0
shtab==1.7.1
simplejson==3.17.6
six==1.16.0
slicer==0.0.7
smart-open==5.2.1
smmap==5.0.0
sniffio==1.2.0
soundfile==0.12.1
soupsieve==2.3.2.post1
soxr==0.3.7
spacy==3.7.2
spacy-legacy==3.0.12
spacy-loggers==1.0.5
spark-tensorflow-distributor==1.0.0
SQLAlchemy==1.4.39
sqlparse==0.4.2
srsly==2.4.8
ssh-import-id==5.11
stack-data==0.2.0
stanio==0.3.0
statsmodels==0.13.5
sympy==1.11.1
tabulate==0.8.10
tangled-up-in-unicode==0.2.0
tenacity==8.1.0
tensorboard==2.14.1
tensorboard-data-server==0.7.2
tensorboard-plugin-profile==2.14.0
tensorflow==2.14.1
tensorflow-estimator==2.14.0
tensorflow-io-gcs-filesystem==0.35.0
termcolor==2.4.0
terminado==0.17.1
thinc==8.2.2
threadpoolctl==2.2.0
tiktoken==0.5.2
tinycss2==1.2.1
tokenize-rt==4.2.1
tokenizers==0.19.1
tomli==2.0.1
torch==2.2.2+cu118
torchaudio==2.2.2+cu118
torchvision==0.17.2+cu118
tornado==6.1
tqdm==4.64.1
traitlets==5.7.1
transformers==4.40.1
triton==2.2.0
trl==0.8.6
typeguard==2.13.3
typer==0.9.0
typing-inspect==0.9.0
typing_extensions==4.11.0
tyro==0.8.3
ujson==5.4.0
unattended-upgrades==0.1
urllib3==1.26.14
virtualenv==20.16.7
visions==0.7.5
wadllib==1.3.6
wasabi==1.1.2
wcwidth==0.2.5
weasel==0.3.4
webencodings==0.5.1
websocket-client==0.58.0
Werkzeug==2.2.2
whatthepatch==1.0.2
widgetsnbextension==3.6.1
wordcloud==1.9.3
wrapt==1.14.1
xgboost==1.7.6
xxhash==3.4.1
yapf==0.33.0
yarl==1.9.4
ydata-profiling==4.2.0
zipp==3.11.0


Here is a minimal reproducible example:

Repro
""" fine_tune_deepspeed.py"""

import os
import json
import random
import argparse
from pprint import pprint
from typing import Dict

# Set these before torch initializes CUDA so they take effect.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

import datasets
import accelerate
import flash_attn
import peft
import transformers
import trl
from transformers import BitsAndBytesConfig
from transformers import default_data_collator, get_linear_schedule_with_warmup
from peft import get_peft_config, get_peft_model, PromptTuningInit, PromptTuningConfig, TaskType, PeftType

print(f"Torch can see {torch.cuda.device_count()} GPUs")

num_virtual_tokens = 25
prompt_tuning_init_text = "Generate the reason the customer calls up the agent based on the transcript:"
random_seed = 42

model_name = "teknium/OpenHermes-2.5-Mistral-7B"
output_dir = "enter-your-output-dir-here"
text_column = "Transcript"
label_column = "RFC"

batch_size = 1
max_length = 4096
lr = 2e-4 * np.sqrt(8)
weight_decay = 0.2
adam_epsilon = 1e-8
num_epochs = 10

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)

if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id

tokenizer.use_default_system_prompt = False

def get_cluster_gpu_statistics():
    for i in range(8):
        torch.cuda.set_device(i)
        get_gpu_statistics()
    torch.cuda.set_device(0)

def empty_cluster_gpu_cache():
    for i in range(8):
        torch.cuda.set_device(i)
        torch.cuda.empty_cache()
    torch.cuda.set_device(0)

def get_gpu_statistics():
    available, total = torch.cuda.mem_get_info()
    print(f"Available VRAM: {available}, Total VRAM: {total}")

def df_to_transcript(df: pd.DataFrame) -> str:
    roles = df.Role.to_list()
    utterances = df.Transcript.to_list()
    return "\n".join(
        [r[0] + r[1:].lower() + ": " + t for r, t in zip(roles, utterances)]
    )

def get_prompt(transcript: str) -> str:
    prompt = """Transcript:

{transcript}

---

I want you to act as a transcript analysis expert. I have provided you with a transcript between agent & customer above and your goal is to summarize the reason why the customer calls up the agent. If there is no discernible reason, output "No reason identified".

Answer:"""

    return prompt.format(transcript = transcript)

def get_mistral_prompt(transcript: str, system_message : str = "") -> str:
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": get_prompt(transcript)}
    ]
    return tokenizer.apply_chat_template(messages, tokenize = False, add_generation_prompt = True)

def preprocess_function(examples):
    batch_size = len(examples[text_column])
    inputs = [get_mistral_prompt(x) for x in examples[text_column]]
    targets = [str(x) for x in examples[label_column]]
    model_inputs = tokenizer(inputs)
    labels = tokenizer(targets)
    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i] + [tokenizer.pad_token_id]
        model_inputs["input_ids"][i] = sample_input_ids + label_input_ids
        labels["input_ids"][i] = [-100] * len(sample_input_ids) + label_input_ids
        model_inputs["attention_mask"][i] = [1] * len(model_inputs["input_ids"][i])

    for i in range(batch_size):
        sample_input_ids = model_inputs["input_ids"][i]
        label_input_ids = labels["input_ids"][i]

        model_inputs["input_ids"][i] = [tokenizer.pad_token_id] * (
            max_length - len(sample_input_ids)
        ) + sample_input_ids
        model_inputs["attention_mask"][i] = [0] * (max_length - len(sample_input_ids)) + model_inputs["attention_mask"][i]
        labels["input_ids"][i] = [-100] * (max_length - len(sample_input_ids)) + label_input_ids

        model_inputs["input_ids"][i] = torch.tensor(model_inputs["input_ids"][i][:max_length])
        model_inputs["attention_mask"][i] = torch.tensor(model_inputs["attention_mask"][i][:max_length])
        labels["input_ids"][i] = torch.tensor(labels["input_ids"][i][:max_length])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

def get_dataset():
    
    transcripts = ["Agent: Hello, how are you doing today? Customer: I called up to cancel my insurance policy", "Agent: hello Customer: thank you"]

    rfcs = ["Customer called up to cancel their insurance policy", "No reason identified"]
    all_data = {"RFC": rfcs, "Transcript": transcripts}

    df = pd.DataFrame(all_data)

    from datasets import DatasetDict, Dataset

    rfc_dataset = DatasetDict()

    rfc_dataset["train"] = Dataset.from_pandas(df)

    formatted_dataset = rfc_dataset.map(
        preprocess_function,
        batched=True,
        num_proc=1,
        remove_columns=rfc_dataset["train"].column_names,
        load_from_cache_file=False,
        desc="Running tokenizer on dataset",
    )

    formatted_dataset = formatted_dataset.shuffle(seed = random_seed)

    train_dataset = formatted_dataset["train"]

    return train_dataset

def return_peft_model():

    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_name, device_map = None, torch_dtype = torch.bfloat16)

    peft_config = PromptTuningConfig(
        task_type = TaskType.CAUSAL_LM,
        prompt_tuning_init = PromptTuningInit.TEXT,
        num_virtual_tokens = num_virtual_tokens,
        prompt_tuning_init_text = prompt_tuning_init_text,
        tokenizer_name_or_path = model_name,
    )

    model = get_peft_model(model, peft_config)

    return model

def distributed_training_deepspeed(model, train_dataset, deepspeed_config):
    
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        per_device_train_batch_size = batch_size,
        gradient_accumulation_steps = 1,
        gradient_checkpointing = True,
        output_dir = output_dir,
        remove_unused_columns = False,
        num_train_epochs = num_epochs,
        lr_scheduler_type = "linear",
        warmup_steps = 200, 
        learning_rate = lr,
        weight_decay = 0.2, 
        adam_epsilon = adam_epsilon,
        logging_strategy = "epoch", 
        evaluation_strategy = "no", 
        save_strategy = "epoch",
        report_to = "none",
        deepspeed = deepspeed_config,
        bf16 = True,
    )

    from transformers import Trainer
    from transformers import DataCollatorWithPadding


    data_collator = DataCollatorWithPadding(tokenizer = tokenizer)

    trainer = Trainer(
        model = model,
        args = training_args,
        data_collator = data_collator,
        train_dataset = train_dataset
    )

    trainer.train()

if __name__ == "__main__":

    deepspeed_config = {
    "fp16": {"enabled": False},
    "bf16": {"enabled": True},
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto",
        },
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0,
            "warmup_max_lr": 2e-4 * np.sqrt(8),
            "warmup_num_steps": "auto",
        },
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": False, #backwards prefetching
        "contiguous_gradients": True,
        "sub_group_size": 1000000000.0,
        "reduce_bucket_size": 500000000.0,
        "stage3_prefetch_bucket_size": 500000000.0,
        "stage3_param_persistence_threshold": 100000.0,
        "stage3_max_live_parameters": 1000000000.0,
        "stage3_max_reuse_distance": 1000000000.0,
        "stage3_gather_16bit_weights_on_model_save": True,
        "offload_param": {
            "device": "cpu",
            "pin_memory": False,
        },
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "steps_per_print": 39,
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False,
    }

    train_dataset = get_dataset()
    model = return_peft_model()
    print(f"Maximum Sequence Length: {max_length}")
    distributed_training_deepspeed(model, train_dataset, deepspeed_config)
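To clarify what preprocess_function above produces: each example is left-padded to max_length with pad tokens, the prompt portion of the labels is masked with -100 so it is ignored by the loss, and the attention mask is zeroed over the padding. A toy version of that scheme, with made-up token ids and a small max_length:

```python
# Toy illustration of the padding/label-masking scheme in preprocess_function
# above, with hypothetical token ids and max_length=8.
PAD, IGNORE = 0, -100
max_length = 8

prompt_ids = [11, 12, 13]   # tokenized prompt (made-up ids)
target_ids = [21, 22, PAD]  # tokenized target + trailing pad token

# Concatenate prompt and target; mask the prompt in the labels.
input_ids = prompt_ids + target_ids
labels = [IGNORE] * len(prompt_ids) + target_ids
attention_mask = [1] * len(input_ids)

# Left-pad everything to max_length.
pad_len = max_length - len(input_ids)
input_ids = [PAD] * pad_len + input_ids
attention_mask = [0] * pad_len + attention_mask
labels = [IGNORE] * pad_len + labels

print(input_ids)       # [0, 0, 11, 12, 13, 21, 22, 0]
print(labels)          # [-100, -100, -100, -100, -100, 21, 22, 0]
print(attention_mask)  # [0, 0, 1, 1, 1, 1, 1, 1]
```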

And then in a separate notebook, I execute the following terminal command:

!deepspeed --num_nodes=1 --num_gpus=8 fine_tune_deepspeed.py
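One thing I did verify: DeepSpeed requires that train_batch_size equal train_micro_batch_size_per_gpu x gradient_accumulation_steps x world_size, and the config above satisfies that for the 8 GPUs I launch with:

```python
# Sanity check on the DeepSpeed batch-size invariant for the config above.
train_micro_batch_size_per_gpu = 1  # "train_micro_batch_size_per_gpu": 1
gradient_accumulation_steps = 1     # "gradient_accumulation_steps": 1
world_size = 8                      # --num_gpus=8

train_batch_size = (train_micro_batch_size_per_gpu
                    * gradient_accumulation_steps
                    * world_size)
print(train_batch_size)  # 8, matching "train_batch_size": 8
```

So the hang shouldn't be a batch-size mismatch.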

And this is the exact stacktrace I see:

Stacktrace
[2024-04-29 18:44:39,142] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-29 18:44:42,291] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-29 18:44:42,291] [INFO] [runner.py:568:main] cmd = /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None /Workspace/Users/{username}/Repro/fine_tune_deepspeed.py
[2024-04-29 18:44:45,369] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-29 18:44:48,388] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-04-29 18:44:48,388] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-04-29 18:44:48,388] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-04-29 18:44:48,388] [INFO] [launch.py:163:main] dist_world_size=8
[2024-04-29 18:44:48,388] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-04-29 18:44:48,388] [INFO] [launch.py:253:main] process 1486203 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=0']
[2024-04-29 18:44:48,389] [INFO] [launch.py:253:main] process 1486204 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=1']
[2024-04-29 18:44:48,389] [INFO] [launch.py:253:main] process 1486205 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=2']
[2024-04-29 18:44:48,390] [INFO] [launch.py:253:main] process 1486206 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=3']
[2024-04-29 18:44:48,390] [INFO] [launch.py:253:main] process 1486207 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=4']
[2024-04-29 18:44:48,391] [INFO] [launch.py:253:main] process 1486208 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=5']
[2024-04-29 18:44:48,391] [INFO] [launch.py:253:main] process 1486209 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=6']
[2024-04-29 18:44:48,392] [INFO] [launch.py:253:main] process 1486210 spawned with command: ['/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/bin/python', '-u', '/Workspace/Users/{username}/Repro/fine_tune_deepspeed.py', '--local_rank=7']
Torch can see 8 GPUs
2024-04-29 18:44:54.203995: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-29 18:44:54.204071: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-29 18:44:54.204109: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-29 18:44:54.211346: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 46.26 examples/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running tokenizer on dataset:   0%|                | 0/2 [00:00<?, ? examples/s]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running tokenizer on dataset:   0%|                | 0/2 [00:00<?, ? examples/s]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 37.59 examples/s]
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 34.83 examples/s]
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 26.38 examples/s]
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 25.84 examples/s]
Loading checkpoint shards:   0%|                          | 0/2 [00:00<?, ?it/s]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 34.57 examples/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 23.16 examples/s]
Running tokenizer on dataset: 100%|████████| 2/2 [00:00<00:00, 25.52 examples/s]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:05<00:00,  2.94s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Maximum Sequence Length: 4096
[2024-04-29 18:45:04,064] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
Loading checkpoint shards:  50%|█████████         | 1/2 [00:06<00:06,  6.24s/it] [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-29 18:45:09,821] [INFO] [comm.py:637:init_distributed] cdb=None
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:09<00:00,  4.51s/it]
Loading checkpoint shards:  50%|█████████         | 1/2 [00:05<00:05,  5.61s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards:  50%|█████████         | 1/2 [00:05<00:05,  5.21s/it]Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards:  50%|█████████         | 1/2 [00:05<00:05,  5.33s/it]Maximum Sequence Length: 4096
[2024-04-29 18:45:10,859] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:07<00:00,  3.51s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:07<00:00,  3.70s/it]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:07<00:00,  3.77s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Maximum Sequence Length: 4096
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:06<00:00,  3.47s/it]
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:06<00:00,  3.48s/it]
[2024-04-29 18:45:12,278] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:07<00:00,  3.52s/it]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Maximum Sequence Length: 4096
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Maximum Sequence Length: 4096
[2024-04-29 18:45:12,615] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-29 18:45:12,686] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Maximum Sequence Length: 4096
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
Maximum Sequence Length: 4096
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Maximum Sequence Length: 4096
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
[2024-04-29 18:45:12,811] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-29 18:45:12,848] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-29 18:45:12,884] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[2024-04-29 18:45:13,650] [INFO] [comm.py:637:init_distributed] cdb=None
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-29 18:45:14,632] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-29 18:45:14,632] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
 [WARNING]  using untested triton version (2.2.0), only 1.0.0 is known to be compatible
[2024-04-29 18:45:15,033] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-29 18:45:15,247] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-29 18:45:15,504] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-29 18:45:15,607] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-29 18:45:15,644] [INFO] [comm.py:637:init_distributed] cdb=None
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output multi_tensor_adam.cuda.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -std=c++17 -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include/TH -isystem /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DBF16_AVAILABLE -c /local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 25.933195114135742 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 25.94469428062439 seconds
Time to load fused_adam op: 25.944432735443115 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 25.945977687835693 seconds
Time to load fused_adam op: 25.9484806060791 seconds
Time to load fused_adam op: 25.946951866149902 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 25.959933519363403 seconds
Time to load fused_adam op: 25.959373712539673 seconds
Parameter Offload: Total persistent parameters: 266240 in 65 params
  0%|                                                    | 0/10 [00:00<?, ?it/s]/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-cc1daeae-7a4d-46de-9d96-72f119078c4f/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
{'loss': 4.8532, 'grad_norm': 20.21285371142944, 'learning_rate': 0.0, 'epoch': 1.0}
 10%|████▍                                       | 1/10 [00:04<00:39,  4.38s/it]Invalidate trace cache @ step 0: expected module 0, but got module 456

And at that point, training just gets stuck and does not proceed any further.

Would really appreciate help getting to the bottom of this @muellerzr @pacman100. Thanks.

@vikram71198
Author

@muellerzr @pacman100 @BenjaminBossan can you please help with this?

@matbee-eth

Would like an update

@vikram71198
Author

Yeah, don't expect this to be resolved. I see that a lot of issues here never even get addressed by the maintainers of this repo. We're on our own, really.

@BenjaminBossan
Member

I tried to reproduce the issue but couldn't run the script as-is because of memory constraints (I used 2 T4s). I therefore had to make some changes to the script (most notably using a smaller model, see below). For me, this passed successfully when running with deepspeed --num_nodes=1 --num_gpus=2 2724.py.

I'm not sure which of the changes (if any) causes this to pass for me but not for you. In general, I would recommend checking this DeepSpeed + PEFT guide, as it is known to work.

1d0
< 
8c7
< os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
---
> #os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1" # BB
11c10
< os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
---
> os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
27d25
< import flash_attn
41,42c39,41
< model_name = "teknium/OpenHermes-2.5-Mistral-7B"
< output_dir = "enter-your-output-dir-here"
---
> #model_name = "teknium/OpenHermes-2.5-Mistral-7B"
> model_name = "facebook/opt-125m" # BB
> output_dir = "/tmp/"
46,47c45,46
< batch_size = 1
< max_length = 4096
---
> batch_size = 2  # BB
> max_length = 32 # BB
63c62
<     for i in range(8):
---
>     for i in range(2):
69c68
<     for i in range(8):
---
>     for i in range(2):
137c136
<     
---
> 
170c169
<     model = AutoModelForCausalLM.from_pretrained(model_name, device_map = None, torch_dtype = torch.bfloat16)
---
>     model = AutoModelForCausalLM.from_pretrained(model_name, device_map = None, torch_dtype = torch.float16) # BB
185c184
<     
---
> 
190c189
<         gradient_accumulation_steps = 1,
---
>         gradient_accumulation_steps = 4, # BB
198c197
<         weight_decay = 0.2, 
---
>         weight_decay = 0.2,
200c199
<         logging_strategy = "epoch", 
---
>         logging_strategy = "epoch",
203,205c202,205
<         report_to = "none",
<         deepspeed = deepspeed_config,
<         bf16 = True,
---
>         report_to = "none", # BB
>         # deepspeed = deepspeed_config, # BB
>         bf16 = False, # BB
>         fp16 = True, # BB
225,268c225,226
<     deepspeed_config = {
<     "fp16": {"enabled": False},
<     "bf16": {"enabled": True},
<     "optimizer": {
<         "type": "AdamW",
<         "params": {
<             "lr": "auto",
<             "betas": "auto",
<             "eps": "auto",
<             "weight_decay": "auto",
<         },
<     },
<     "scheduler": {
<         "type": "WarmupLR",
<         "params": {
<             "warmup_min_lr": 0,
<             "warmup_max_lr": 2e-4 * np.sqrt(8),
<             "warmup_num_steps": "auto",
<         },
<     },
<     "zero_optimization": {
<         "stage": 3,
<         "overlap_comm": False, #backwards prefetching
<         "contiguous_gradients": True,
<         "sub_group_size": 1000000000.0,
<         "reduce_bucket_size": 500000000.0,
<         "stage3_prefetch_bucket_size": 500000000.0,
<         "stage3_param_persistence_threshold": 100000.0,
<         "stage3_max_live_parameters": 1000000000.0,
<         "stage3_max_reuse_distance": 1000000000.0,
<         "stage3_gather_16bit_weights_on_model_save": True,
<         "offload_param": {
<             "device": "cpu",
<             "pin_memory": False,
<         },
<     },
<     "gradient_accumulation_steps": 1,
<     "gradient_clipping": "auto",
<     "steps_per_print": 39,
<     "train_batch_size": 8,
<     "train_micro_batch_size_per_gpu": 1,
<     "wall_clock_breakdown": False,
<     }
< 
---
>     # not used BB
>     deepspeed_config = {}

accelerate env:

- `Accelerate` version: 0.30.1
- Platform: Linux-4.19.0-26-cloud-amd64-x86_64-with-glibc2.28
- `accelerate` bash location: /opt/conda/envs/env/bin/accelerate
- Python version: 3.11.8
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.2.1+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 51.10 GB
- GPU type: Tesla T4
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - mixed_precision: fp16
        - use_cpu: False
        - debug: False
        - num_processes: 2
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'gradient_accumulation_steps': 4, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero3_save_16bit_model': False, 'zero_stage': 3}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

PEFT: latest version from source (commit fb7f2796e5411ee86588447947d1fdd5b6395cad).

@uggggg

uggggg commented May 26, 2024

I also hit this issue, when the trainer saves the model after the first epoch.

@uggggg

uggggg commented May 26, 2024

[INFO|trainer.py:1715] 2024-05-26 14:55:04,755 >> Number of trainable parameters = 204,800
[INFO|trainer.py:1836] 2024-05-26 14:55:56,361 >> Epoch: [0],Step [1/5],Elapsed Time: 00:00:51,Estimated Remaining Time: 00:03:26,Speed: 51.59 steps/sec
[INFO|trainer.py:1836] 2024-05-26 14:56:01,107 >> Epoch: [0],Step [2/5],Elapsed Time: 00:00:56,Estimated Remaining Time: 00:01:24,Speed: 28.17 steps/sec
[INFO|trainer.py:1836] 2024-05-26 14:56:04,294 >> Epoch: [0],Step [3/5],Elapsed Time: 00:00:59,Estimated Remaining Time: 00:00:39,Speed: 19.84 steps/sec
[INFO|trainer.py:1836] 2024-05-26 14:56:06,884 >> Epoch: [0],Step [4/5],Elapsed Time: 00:01:02,Estimated Remaining Time: 00:00:15,Speed: 15.53 steps/sec
[INFO|trainer.py:2915] 2024-05-26 14:56:22,776 >> Saving model checkpoint to ../output/Qwen-1_8B/init_100_Please think step by step according to the question and answer it._Inprompt/tmp-checkpoint-5
Invalidate trace cache @ step 0: expected module 0, but got module 274
^C[2024-05-26 15:06:40,046] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 22523
Traceback (most recent call last):
  File "/opt/conda/bin/deepspeed", line 6, in <module>
    main()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/launcher/runner.py", line 586, in main
    result.wait()
  File "/opt/conda/lib/python3.10/subprocess.py", line 1209, in wait
    return self._wait(timeout=timeout)
  File "/opt/conda/lib/python3.10/subprocess.py", line 1959, in _wait
    (pid, sts) = self._try_wait(0)
  File "/opt/conda/lib/python3.10/subprocess.py", line 1917, in _try_wait
    (pid, sts) = os.waitpid(self.pid, wait_flags)
KeyboardInterrupt
