AI Benchmarks

This section provides the information and steps that we used to run some of the benchmarks on AiMOS.

Common AI Datasets (CADS) Repository

This repository contains a set of open-source AI datasets that are made available to AiMOS users for easy access during benchmarking and experimentation.

To get access to the repository, place a request in your respective #aimos-xxx Slack channel. You can use the same channel to request that additional datasets be added to the repository.

The CADS repository is in directory: /gpfs/u/locker/200/CADS/datasets

Common AI Benchmarks (CABS) Repository

This repository contains the sample scripts, experiment setup, datasets, and other materials that were used to run common benchmarks such as ResNet-50, MobileNetV2, and Fairseq.

See the repository at: https://github.com/IBM-AI-Hardware-Center/AiMOS-CABS

Monitoring Tools

Tensorboard

Tensorboard is a data visualization toolkit for machine learning experimentation. For more information, see https://www.tensorflow.org/tensorboard/get_started.

Before Tensorboard can visualize anything, your training script has to collect the data you want to plot. For PyTorch, you can use the integrated Tensorboard support to collect it. For documentation, see https://pytorch.org/docs/stable/tensorboard.html. For a tutorial, see https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

Code Snippets to collect the data

Define the directory to store the collected data.

parser.add_argument('--log-dir', default='./logs',
                 help='tensorboard log directory')

Import the SummaryWriter class and create the log writer.

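try:
  from torch.utils.tensorboard import SummaryWriter
  # Only the process with global rank 0 writes events; all other ranks get None.
  log_writer = SummaryWriter(args.log_dir) if args.global_rank == 0 else None
except ImportError:
  # Tensorboard is not installed: disable logging instead of failing.
  log_writer = None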

In the train loop, collect the data:

if log_writer:
  log_writer.add_scalar('train/loss', losses.avg, epoch)
  log_writer.add_scalar('train/accuracy', top1.avg, epoch)

In the validation loop, collect the data:

if log_writer:
  log_writer.add_scalar('val/loss', losses.avg, epoch)
  log_writer.add_scalar('val/accuracy', top1.avg, epoch)

You should see the collected data in the ./logs directory as defined above.

events.out.tfevents.1600877997.dcs049.ccni.rpi.edu.134021.0
events.out.tfevents.1600903856.dcs003.ccni.rpi.edu.42871.0
events.out.tfevents.1600904287.dcs026.ccni.rpi.edu.23978.0
events.out.tfevents.1600907411.dcs009.ccni.rpi.edu.16523.0
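The snippets above can be combined into a few small helpers. This is only a sketch under assumptions: the function names build_parser, make_log_writer and log_epoch are hypothetical, and the --global-rank flag is a stand-in for however your launcher provides the rank (the snippets above use args.global_rank without showing where it comes from).

```python
import argparse


def build_parser():
    """Command-line options used by the logging helpers below."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--log-dir', default='./logs',
                        help='tensorboard log directory')
    # Hypothetical flag: replace with the rank your launcher provides.
    parser.add_argument('--global-rank', type=int, default=0,
                        help='global rank of this process')
    return parser


def make_log_writer(log_dir, global_rank):
    """Return a SummaryWriter on rank 0, otherwise None.

    Also returns None when tensorboard support cannot be imported,
    so training can continue without logging.
    """
    if global_rank != 0:
        return None
    try:
        from torch.utils.tensorboard import SummaryWriter
    except ImportError:
        return None
    return SummaryWriter(log_dir)


def log_epoch(log_writer, split, loss, accuracy, epoch):
    """Write loss and accuracy for one epoch under '<split>/...' tags."""
    if log_writer is None:
        return
    log_writer.add_scalar(f'{split}/loss', loss, epoch)
    log_writer.add_scalar(f'{split}/accuracy', accuracy, epoch)
```

In a training script you would call make_log_writer(args.log_dir, args.global_rank) once, then log_epoch(log_writer, 'train', losses.avg, top1.avg, epoch) at the end of each training epoch and the same call with 'val' after validation.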

Visualize the data

Install the tensorboard package if it is not already installed.

conda install tensorboard

After tensorboard is installed, start it and point it at your logs. To plot multiple experiments together, use --logdir_spec, for example:

tensorboard --logdir_spec="1node":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_1,"2nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2,"4nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_4,"8nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_8,"16nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_16 --host "0.0.0.0" --port 6006

Or, if you only have a single experiment, use --logdir, for example:

tensorboard --logdir /gpfs/u/home/BMHR/BMHRkmkh/scratch/logs/test_mobnetv2_2  --host "0.0.0.0" --port 6006
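The name:path,name:path value passed to --logdir_spec is easy to mistype when there are many runs. As a minimal sketch, the value can be generated from a dictionary instead. The helper names build_logdir_spec and tensorboard_command and the /tmp paths in the usage note are hypothetical; only the tensorboard flags themselves come from the commands above.

```python
def build_logdir_spec(runs):
    """Join {run_name: log_dir} into the 'name:path,name:path' form."""
    for name, path in runs.items():
        # Commas separate entries and the first colon separates name from
        # path, so these characters would make the spec ambiguous.
        if ',' in name or ':' in name or ',' in path:
            raise ValueError(f'unsupported character in {name!r}: {path!r}')
    return ','.join(f'{name}:{path}' for name, path in runs.items())


def tensorboard_command(runs, host='0.0.0.0', port=6006):
    """Build the argument list for a multi-experiment tensorboard run."""
    return ['tensorboard',
            '--logdir_spec', build_logdir_spec(runs),
            '--host', host,
            '--port', str(port)]
```

For example, tensorboard_command({'1node': '/tmp/logs/run_1', '2nodes': '/tmp/logs/run_2'}) returns a list that can be passed to subprocess.run on a node where tensorboard is installed.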

Now you can use ssh tunneling to display the graphs on your desktop. For example:

[id@kvt-rhel ~]$ ssh -L6006:dcsfen01:6006 <your-id>@blp01.ccni.rpi.edu

Now point your browser to localhost:6006. For example:

http://localhost:6006

nvidia-smi

You can use the nvidia-smi command to collect GPU data such as GPU utilization, memory utilization, power draw, fan speed, etc.

For the list of valid properties to query with the --query-gpu= switch, run nvidia-smi --help-query-gpu or see https://briot-jerome.developpez.com/fichiers/blog/nvidia-smi/list.txt

For example:

nvidia-smi --query-gpu=timestamp,gpu_uuid,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw,fan.speed --format=csv -l 10 |tee nvidia_2nodes.txt

Visualize the data

You can download a sample Jupyter notebook, "plot_nvidia-smi.ipynb", from https://github.com/IBM-AI-Hardware-Center/AiMOS and modify it as needed to plot the data.
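If you only need a quick summary rather than plots, the CSV file written by the nvidia-smi command above can also be parsed with the Python standard library. The following is a sketch, not the notebook's method. It assumes the file starts with a header line, that values carry unit suffixes such as "98 %" or "250.00 W", and that unavailable values appear as "[N/A]". The function names and the SAMPLE text are made up for illustration, and the column names may differ on your system, so check the header of your own file.

```python
import csv
import io
from collections import defaultdict

# Illustrative sample only; real files have more columns and real UUIDs.
SAMPLE = """\
timestamp, uuid, utilization.gpu [%], power.draw [W]
2020/09/23 14:00:00.000, GPU-aaaa, 90 %, 250.00 W
2020/09/23 14:00:00.000, GPU-bbbb, 50 %, 150.00 W
2020/09/23 14:00:10.000, GPU-aaaa, 100 %, 270.00 W
2020/09/23 14:00:10.000, GPU-bbbb, 70 %, [N/A]
"""


def parse_nvidia_smi_csv(text):
    """Parse nvidia-smi CSV text into a list of {column: value} dicts."""
    header = None
    rows = []
    for fields in csv.reader(io.StringIO(text), skipinitialspace=True):
        fields = [field.strip() for field in fields]
        if not any(fields):
            continue  # skip blank lines
        if header is None:
            header = fields
            continue
        if fields == header:
            continue  # skip a repeated header, if any
        rows.append(dict(zip(header, fields)))
    return rows


def to_number(value):
    """Convert '98 %' or '250.00 W' to a float; return None if not numeric."""
    parts = value.split()
    try:
        return float(parts[0]) if parts else None
    except ValueError:
        return None


def mean_by_gpu(rows, column, id_column):
    """Average a numeric column per GPU, ignoring non-numeric samples."""
    samples = defaultdict(list)
    for row in rows:
        number = to_number(row[column])
        if number is not None:
            samples[row[id_column]].append(number)
    return {gpu: sum(vals) / len(vals) for gpu, vals in samples.items()}


rows = parse_nvidia_smi_csv(SAMPLE)
print(mean_by_gpu(rows, 'utilization.gpu [%]', 'uuid'))
```

To use it on real data, read the file produced above, for example open('nvidia_2nodes.txt').read(), and pass the column names that appear in its header line.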