added hyperparameter advanced tutorial #69

Open: wants to merge 1 commit into base: main

Collaborator

@Priyansi Priyansi commented Nov 8, 2021

Fixes #29

Member

@trsvchn trsvchn left a comment

Thanks for the PR, @Priyansi!
I've added some suggestions related to coding style.

Comment on lines +132 to +138
" trainset = CIFAR10(\n",
" root=data_dir, train=True, download=True, transform=transform)\n",
"\n",
" testset = CIFAR10(\n",
" root=data_dir, train=False, download=True, transform=transform)\n",
"\n",
" return trainset, testset"
Suggested change
"    trainset = CIFAR10(\n",
"        root=data_dir, train=True, download=True, transform=transform\n",
"    )\n",
"    testset = CIFAR10(\n",
"        root=data_dir, train=False, download=True, transform=transform\n",
"    )\n",
"    return trainset, testset"

Comment on lines +150 to +164
" train_subset, val_subset = random_split(\n",
" trainset, [test_abs, len(trainset) - test_abs])\n",
"\n",
" trainloader = idist.auto_dataloader(\n",
" train_subset,\n",
" batch_size=int(config[\"batch_size\"]),\n",
" shuffle=True,\n",
" num_workers=8)\n",
" valloader = idist.auto_dataloader(\n",
" val_subset,\n",
" batch_size=int(config[\"batch_size\"]),\n",
" shuffle=True,\n",
" num_workers=8)\n",
" \n",
" return trainloader, valloader"
Suggested change
"    train_subset, val_subset = random_split(\n",
"        trainset, [test_abs, len(trainset) - test_abs]\n",
"    )\n",
"    trainloader = idist.auto_dataloader(\n",
"        train_subset,\n",
"        batch_size=int(config[\"batch_size\"]),\n",
"        shuffle=True,\n",
"        num_workers=8\n",
"    )\n",
"    valloader = idist.auto_dataloader(\n",
"        val_subset,\n",
"        batch_size=int(config[\"batch_size\"]),\n",
"        shuffle=True,\n",
"        num_workers=8\n",
"    )\n",
"    return trainloader, valloader"

Comment on lines +217 to +231
"def initialize(config, checkpoint_dir):\n",
" model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n",
"\n",
" device = idist.device()\n",
"\n",
" criterion = nn.CrossEntropyLoss()\n",
" optimizer = idist.auto_optim(optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9))\n",
"\n",
" if checkpoint_dir:\n",
" model_state, optimizer_state = torch.load(\n",
" os.path.join(checkpoint_dir, \"checkpoint\"))\n",
" model.load_state_dict(model_state)\n",
" optimizer.load_state_dict(optimizer_state)\n",
" \n",
" return model, device, criterion, optimizer"
Suggested change
"def initialize(config, checkpoint_dir):\n",
"    model = idist.auto_model(Net(config[\"l1\"], config[\"l2\"]))\n",
"\n",
"    device = idist.device()\n",
"\n",
"    criterion = nn.CrossEntropyLoss()\n",
"    optimizer = idist.auto_optim(\n",
"        optim.SGD(model.parameters(), lr=config[\"lr\"], momentum=0.9)\n",
"    )\n",
"\n",
"    if checkpoint_dir:\n",
"        model_state, optimizer_state = torch.load(\n",
"            os.path.join(checkpoint_dir, \"checkpoint\")\n",
"        )\n",
"        model.load_state_dict(model_state)\n",
"        optimizer.load_state_dict(optimizer_state)\n",
"\n",
"    return model, device, criterion, optimizer"

Comment on lines +265 to +293
"def train_cifar(config, data_dir=None, checkpoint_dir=None):\n",
" trainloader, valloader = get_train_val_loaders(config, data_dir)\n",
" model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n",
" \n",
" trainer = create_supervised_trainer(model, optimizer, criterion, device=device, non_blocking=True)\n",
" \n",
" avg_output = RunningAverage(output_transform=lambda x: x)\n",
" avg_output.attach(trainer, 'running_avg_loss')\n",
" \n",
" val_evaluator = create_supervised_evaluator(model, metrics={ \"accuracy\": Accuracy(), \"loss\": Loss(criterion)}, device=device, non_blocking=True)\n",
" \n",
" @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n",
" def log_training_loss(engine):\n",
" print(f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\")\n",
"\n",
"\n",
" @trainer.on(Events.EPOCH_COMPLETED)\n",
" def log_validation_results(trainer):\n",
" val_evaluator.run(valloader)\n",
" metrics = val_evaluator.state.metrics\n",
" print(f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\")\n",
"\n",
" with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n",
" path = os.path.join(checkpoint_dir, \"checkpoint\")\n",
" torch.save((model.state_dict(), optimizer.state_dict()), path)\n",
" \n",
" tune.report(loss=metrics['loss'], accuracy=metrics['accuracy']) \n",
"\n",
" trainer.run(trainloader, max_epochs=10) "
Suggested change
"def train_cifar(config, data_dir=None, checkpoint_dir=None):\n",
"    trainloader, valloader = get_train_val_loaders(config, data_dir)\n",
"    model, device, criterion, optimizer = initialize(config, checkpoint_dir)\n",
"\n",
"    trainer = create_supervised_trainer(\n",
"        model, optimizer, criterion, device=device, non_blocking=True\n",
"    )\n",
"\n",
"    avg_output = RunningAverage(output_transform=lambda x: x)\n",
"    avg_output.attach(trainer, \"running_avg_loss\")\n",
"\n",
"    val_evaluator = create_supervised_evaluator(\n",
"        model,\n",
"        metrics={\"accuracy\": Accuracy(), \"loss\": Loss(criterion)},\n",
"        device=device,\n",
"        non_blocking=True,\n",
"    )\n",
"\n",
"    @trainer.on(Events.ITERATION_COMPLETED(every=2000))\n",
"    def log_training_loss(engine):\n",
"        print(\n",
"            f\"Epoch[{engine.state.epoch}], Iter[{engine.state.iteration}] Loss: {engine.state.output:.2f} Running Avg Loss: {engine.state.metrics['running_avg_loss']:.2f}\"\n",
"        )\n",
"\n",
"    @trainer.on(Events.EPOCH_COMPLETED)\n",
"    def log_validation_results(trainer):\n",
"        val_evaluator.run(valloader)\n",
"        metrics = val_evaluator.state.metrics\n",
"        print(\n",
"            f\"Validation Results - Epoch[{trainer.state.epoch}] Avg accuracy: {metrics['accuracy']:.2f} Avg loss: {metrics['loss']:.2f}\"\n",
"        )\n",
"\n",
"        with tune.checkpoint_dir(trainer.state.epoch) as checkpoint_dir:\n",
"            path = os.path.join(checkpoint_dir, \"checkpoint\")\n",
"            torch.save((model.state_dict(), optimizer.state_dict()), path)\n",
"\n",
"        tune.report(loss=metrics[\"loss\"], accuracy=metrics[\"accuracy\"])\n",
"\n",
"    trainer.run(trainloader, max_epochs=10)"

Comment on lines +311 to +327
"def test_best_model(best_trial, data_dir=None):\n",
" _, testset = load_data(data_dir)\n",
" \n",
" best_trained_model = idist.auto_model(Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"]))\n",
" device = idist.device()\n",
"\n",
" best_checkpoint_dir = best_trial.checkpoint.value\n",
" model_state, optimizer_state = torch.load(os.path.join(\n",
" best_checkpoint_dir, \"checkpoint\"))\n",
" best_trained_model.load_state_dict(model_state)\n",
"\n",
" test_evaluator = create_supervised_evaluator(best_trained_model, metrics={\"Accuracy\": Accuracy()}, device=device, non_blocking=True)\n",
"\n",
" testloader = idist.auto_dataloader(testset, batch_size=4, shuffle=False, num_workers=2)\n",
"\n",
" test_evaluator.run(testloader)\n",
" print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")"
Suggested change
"def test_best_model(best_trial, data_dir=None):\n",
"    _, testset = load_data(data_dir)\n",
"\n",
"    best_trained_model = idist.auto_model(\n",
"        Net(best_trial.config[\"l1\"], best_trial.config[\"l2\"])\n",
"    )\n",
"    device = idist.device()\n",
"\n",
"    best_checkpoint_dir = best_trial.checkpoint.value\n",
"    model_state, optimizer_state = torch.load(\n",
"        os.path.join(best_checkpoint_dir, \"checkpoint\")\n",
"    )\n",
"    best_trained_model.load_state_dict(model_state)\n",
"\n",
"    test_evaluator = create_supervised_evaluator(\n",
"        best_trained_model,\n",
"        metrics={\"Accuracy\": Accuracy()},\n",
"        device=device,\n",
"        non_blocking=True,\n",
"    )\n",
"\n",
"    testloader = idist.auto_dataloader(\n",
"        testset, batch_size=4, shuffle=False, num_workers=2\n",
"    )\n",
"\n",
"    test_evaluator.run(testloader)\n",
"    print(f\"Best trial test set accuracy: {test_evaluator.state.metrics}\")"

Comment on lines +358 to +389
"def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n",
" data_dir = os.path.abspath(\"./data\")\n",
" load_data(data_dir)\n",
" \n",
" config = {\n",
" \"l1\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n",
" \"l2\": tune.sample_from(lambda _: 2**np.random.randint(2, 9)),\n",
" \"lr\": tune.loguniform(1e-4, 1e-1),\n",
" \"batch_size\": tune.choice([2, 4, 8, 16])\n",
" }\n",
" scheduler = ASHAScheduler(\n",
" metric=\"loss\",\n",
" mode=\"min\",\n",
" max_t=max_num_epochs,\n",
" grace_period=1,\n",
" reduction_factor=2)\n",
" reporter = CLIReporter(\n",
" metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n",
" result = tune.run(\n",
" partial(train_cifar, data_dir=data_dir),\n",
" resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n",
" config=config,\n",
" num_samples=num_samples,\n",
" scheduler=scheduler,\n",
" progress_reporter=reporter)\n",
"\n",
" best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n",
" print(f\"Best trial config: {best_trial.config}\")\n",
" print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n",
" print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n",
" \n",
" test_best_model(best_trial, data_dir)"
Suggested change
"def main(num_samples=10, max_num_epochs=10, gpus_per_trial=1):\n",
"    data_dir = os.path.abspath(\"./data\")\n",
"    load_data(data_dir)\n",
"\n",
"    config = {\n",
"        \"l1\": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),\n",
"        \"l2\": tune.sample_from(lambda _: 2 ** np.random.randint(2, 9)),\n",
"        \"lr\": tune.loguniform(1e-4, 1e-1),\n",
"        \"batch_size\": tune.choice([2, 4, 8, 16]),\n",
"    }\n",
"    scheduler = ASHAScheduler(\n",
"        metric=\"loss\",\n",
"        mode=\"min\",\n",
"        max_t=max_num_epochs,\n",
"        grace_period=1,\n",
"        reduction_factor=2,\n",
"    )\n",
"    reporter = CLIReporter(metric_columns=[\"loss\", \"accuracy\", \"training_iteration\"])\n",
"    result = tune.run(\n",
"        partial(train_cifar, data_dir=data_dir),\n",
"        resources_per_trial={\"cpu\": 2, \"gpu\": gpus_per_trial},\n",
"        config=config,\n",
"        num_samples=num_samples,\n",
"        scheduler=scheduler,\n",
"        progress_reporter=reporter,\n",
"    )\n",
"\n",
"    best_trial = result.get_best_trial(\"loss\", \"min\", \"last\")\n",
"    print(f\"Best trial config: {best_trial.config}\")\n",
"    print(f\"Best trial final validation loss: {best_trial.last_result['loss']}\")\n",
"    print(f\"Best trial final validation accuracy: {best_trial.last_result['accuracy']}\")\n",
"\n",
"    test_best_model(best_trial, data_dir)"

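For anyone reading the `config` dict in `main`, the search space can be read as follows: `l1` and `l2` are powers of two from 4 to 256 (the upper bound of `np.random.randint(2, 9)` is exclusive), `lr` is drawn log-uniformly between 1e-4 and 1e-1, and `batch_size` is one of four values. Below is a rough standard-library sketch of a single draw. It assumes nothing about Ray Tune's internals, and `sample_config` is a hypothetical helper that is not part of this PR:

```python
import math
import random


def sample_config(rng: random.Random) -> dict:
    """Draw one hyperparameter combination (hypothetical helper, illustration only)."""
    return {
        # randrange(2, 9) mirrors np.random.randint(2, 9): exponents 2..8, i.e. 4..256
        "l1": 2 ** rng.randrange(2, 9),
        "l2": 2 ** rng.randrange(2, 9),
        # log-uniform: sample uniformly in log space, then exponentiate
        "lr": math.exp(rng.uniform(math.log(1e-4), math.log(1e-1))),
        "batch_size": rng.choice([2, 4, 8, 16]),
    }


if __name__ == "__main__":
    # Seeded generator so repeated runs draw the same configuration
    print(sample_config(random.Random(0)))
```

Sampling the learning rate in log space spreads draws evenly across orders of magnitude, which is usually what is wanted for a range spanning 1e-4 to 1e-1.
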
trsvchn commented Nov 24, 2021

What about adding a few summarizing sentences about the best trial and how to interpret the results?
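
One possible angle for such a summary: `result.get_best_trial("loss", "min", "last")` compares trials by the loss each one reported last, so a trial that the scheduler stopped early is judged by its final reported epoch. A toy sketch of that selection rule in plain Python, using made-up numbers and a hypothetical `best_by_last_loss` helper (it does not call Ray Tune):

```python
def best_by_last_loss(histories: dict) -> str:
    """Return the name of the trial whose most recently reported loss is lowest.

    `histories` maps a trial name to the list of losses it reported, one per
    epoch. Hypothetical helper with made-up data, for illustration only.
    """
    return min(histories, key=lambda name: histories[name][-1])


histories = {
    "trial_a": [2.30, 2.10],              # stopped early after 2 epochs
    "trial_b": [2.20, 1.60, 1.30, 1.45],  # best loss mid-run, worse at the end
    "trial_c": [2.25, 1.70, 1.40, 1.35],
}
print(best_by_last_loss(histories))  # trial_c
```

Here `trial_b` reached the lowest loss overall (1.30) but is not selected, because only the last report counts under this rule. That distinction may be worth a sentence in the tutorial.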

"For every trial, Ray Tune will randomly sample a combination of parameters from these search spaces. It will then train a number of models in parallel and find the best performing one among these. \n",
"We also use the `ASHAScheduler()` which is one of the trial schedulers that aggressively terminate low-performing trials.\n",
"Apart from that, we leverage the `CLIReporter()` to prettify our outputs.\n",
"And then, we wrap `train_cifar` in functools.partial and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n",
nit

Suggested change
"And then, we wrap `train_cifar` in `functools.partial` and pass it to `tune.run` along with other resources like the CPUs and GPUs available to use, the configurable parameters, the number of trials, scheduler and reporter.\n",

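For context on the `functools.partial` sentence: `partial` pre-binds `data_dir`, so the resulting callable only needs the arguments supplied later (here, `config`). A minimal sketch with a stand-in function; `train_stub` and the path are hypothetical and are not the tutorial's `train_cifar`:

```python
from functools import partial


def train_stub(config, data_dir=None, checkpoint_dir=None):
    """Stand-in for a trainable; returns its arguments so the binding is visible."""
    return config, data_dir, checkpoint_dir


# Bind data_dir up front; config is supplied by the caller later.
trainable = partial(train_stub, data_dir="/tmp/data")

print(trainable({"lr": 0.01}))
```

The bound keyword stays overridable at call time, so `trainable(cfg, data_dir="other")` would still work if a different directory were needed.
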
"id": "vJgTaKWU8Doq"
},
"source": [
"In this tutorial, we will see how [Ray Tune](https://docs.ray.io/en/stable/tune.html) can be used with Ignite for hyperparameter tuning. We will also compare it with other frameworks like [Optuna](https://optuna.org/) and [Ax](https://ax.dev/) for hyperparameter optimization.\n",
Are we going to add comparisons with Ax and Optuna?

Successfully merging this pull request may close these issues.

Advanced tutorial: Hyperparameter tuning
2 participants