AI Benchmarks
=============

This section provides the information and steps that we used to run some of the benchmarks on AiMOS.

Common AI Datasets (CADS) Repository
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This repository contains a set of open-source AI datasets made available to AiMOS users for easy access when benchmarking and experimenting. If you are interested in access to the repository, place a request in your respective #aimos-xxx Slack channel. Requests for additional datasets to be added to the repository can be placed in the same channel.

The CADS repository is in directory: **/gpfs/u/locker/200/CADS/datasets**

.. figure:: ai-datasets.png

Common AI Benchmarks (CABS) Repository
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This repository contains the sample scripts, experiment setups, datasets, and other material used to run common benchmarks such as ResNet-50, MobileNetV2, and Fairseq.

See the repository at: https://github.com/IBM-AI-Hardware-Center/AiMOS-CABS

Monitoring Tools
^^^^^^^^^^^^^^^^

Tensorboard
+++++++++++

Tensorboard is a data visualization toolkit for machine learning experimentation. For more information, see https://www.tensorflow.org/tensorboard/get_started.

To visualize data with Tensorboard, you first need to collect the data you want to visualize. For PyTorch, you can use the integrated Tensorboard support to collect it. For documentation, see https://pytorch.org/docs/stable/tensorboard.html. For a tutorial, see https://pytorch.org/tutorials/recipes/recipes/tensorboard_with_pytorch.html

Code Snippets to collect the data
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Define the directory to store the collected data.

.. code:: python

    parser.add_argument('--log-dir', default='./logs',
                        help='tensorboard log directory')

Import the SummaryWriter class and set the log writer.

.. code:: python

    try:
        from torch.utils.tensorboard import SummaryWriter
        log_writer = SummaryWriter(args.log_dir) if args.global_rank == 0 else None
    except ImportError:
        log_writer = None

In the training loop, collect the data:

.. code:: python

    if log_writer:
        log_writer.add_scalar('train/loss', losses.avg, epoch)
        log_writer.add_scalar('train/accuracy', top1.avg, epoch)

In the validation loop, collect the data:

.. code:: python

    if log_writer:
        log_writer.add_scalar('val/loss', losses.avg, epoch)
        log_writer.add_scalar('val/accuracy', top1.avg, epoch)

You should see the collected data in the ./logs directory defined above, for example:

.. code:: bash

    events.out.tfevents.1600877997.dcs049.ccni.rpi.edu.134021.0
    events.out.tfevents.1600903856.dcs003.ccni.rpi.edu.42871.0
    events.out.tfevents.1600904287.dcs026.ccni.rpi.edu.23978.0
    events.out.tfevents.1600907411.dcs009.ccni.rpi.edu.16523.0
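For reference, the snippets above can be combined into a small self-contained script. The following is a minimal sketch only, not the actual benchmark code: the metric values are random placeholders, a fixed number of epochs is assumed, and the global-rank check is reduced to the single-process case.

.. code:: python

    # Minimal sketch: write placeholder train/val metrics to ./logs for Tensorboard.
    # The metric values below are random stand-ins, not real benchmark results.
    import argparse
    import random

    from torch.utils.tensorboard import SummaryWriter

    parser = argparse.ArgumentParser()
    parser.add_argument('--log-dir', default='./logs',
                        help='tensorboard log directory')
    args = parser.parse_args()

    global_rank = 0  # single-process assumption; use the real rank under torch.distributed
    log_writer = SummaryWriter(args.log_dir) if global_rank == 0 else None

    for epoch in range(5):
        # In the real benchmark these values come from the training and validation loops.
        train_loss, train_acc = random.random(), random.random()
        val_loss, val_acc = random.random(), random.random()

        if log_writer:
            log_writer.add_scalar('train/loss', train_loss, epoch)
            log_writer.add_scalar('train/accuracy', train_acc, epoch)
            log_writer.add_scalar('val/loss', val_loss, epoch)
            log_writer.add_scalar('val/accuracy', val_acc, epoch)

    if log_writer:
        log_writer.close()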
Visualize the data
%%%%%%%%%%%%%%%%%%

Install the tensorboard package if it is not already installed.

.. code:: bash

    conda install tensorboard

After tensorboard is installed, start it. To plot multiple experiments together, use --logdir_spec, for example:

.. code:: bash

    tensorboard --logdir_spec="1node":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_1,"2nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2,"4nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_4,"8nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_8,"16nodes":/gpfs/u/home/BMHR/BMHRkmkh/scratch/mob2/logs/test_mobnetv2_16 --host "0.0.0.0" --port 6006

If you only have a single experiment, use --logdir, for example:

.. code:: bash

    tensorboard --logdir /gpfs/u/home/BMHR/BMHRkmkh/scratch/logs/test_mobnetv2_2 --host "0.0.0.0" --port 6006

Now you can use ssh tunneling to display the graphs on your desktop. For example:

.. code:: bash

    [id@kvt-rhel ~]$ ssh -L6006:dcsfen01:6006 <username>@blp01.ccni.rpi.edu

Then point your browser to localhost:6006. For example:

.. code:: bash

    http://localhost:6006

.. figure:: tensorboard.png

nvidia-smi
++++++++++

You can use the nvidia-smi command to collect GPU data such as GPU utilization, memory utilization, power draw, fan speed, etc. For the list of valid properties to query with the "--query-gpu=" switch, see https://briot-jerome.developpez.com/fichiers/blog/nvidia-smi/list.txt

For example:

.. code:: bash

    nvidia-smi --query-gpu=timestamp,gpu_uuid,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw,fan.speed --format=csv -l 10 | tee nvidia_2nodes.txt

Visualize the data
%%%%%%%%%%%%%%%%%%

You can download a sample Jupyter notebook "plot_nvidia-smi.ipynb" from https://github.com/IBM-AI-Hardware-Center/AiMOS and modify it accordingly to plot the data.
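The CSV written by the nvidia-smi command above can also be plotted directly, without the notebook. The snippet below is an illustrative sketch only (it is not the contents of plot_nvidia-smi.ipynb): it assumes the nvidia_2nodes.txt file produced by the earlier example, the column headers that nvidia-smi emits with --format=csv (for example "utilization.gpu [%]"), and that pandas and matplotlib are available.

.. code:: python

    # Sketch: plot GPU utilization over time from the nvidia-smi CSV log above.
    # Column names assume nvidia-smi's --format=csv headers, e.g. "utilization.gpu [%]".
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('nvidia_2nodes.txt', skipinitialspace=True)

    # Values such as "87 %" are strings; strip the unit suffix and convert to numbers.
    util = df['utilization.gpu [%]'].str.rstrip(' %').astype(float)
    time = pd.to_datetime(df['timestamp'])

    # Each sample interval produces one row per GPU, so plot points rather than a line.
    plt.plot(time, util, marker='.', linestyle='none')
    plt.xlabel('time')
    plt.ylabel('GPU utilization (%)')
    plt.title('GPU utilization reported by nvidia-smi')
    plt.tight_layout()
    plt.savefig('gpu_utilization.png')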