
# Performance Data

## Overview

This document presents training and inference performance, as well as accuracy results, for several popular AI workloads with Intel® Extension for TensorFlow* benchmarked on Intel GPUs. You can easily reproduce these results by following the guidelines in examples.

## Models

The following tables link to the original code repository and the step-by-step guide for running each model on Intel GPUs.

### Training Workloads

| Model | Original Model Repo | ITEX Step-by-Step Guide |
| ----- | ------------------- | ----------------------- |
| ResNet50v1.5 | TensorFlow-Models/ResNet50v1.5 | ResNet50 train on Intel GPU |
| BERT-Large | DeepLearningExamples/BERT | Accelerate BERT-Large Pretraining on Intel GPU |
| Mask-RCNN | DeepLearningExamples/Mask-RCNN | Accelerate Mask R-CNN Training on Intel GPU |
| 3D-UNet | DeepLearningExamples/3D-UNet | Accelerate 3D-UNet Training for medical image segmentation on Intel GPU |

### Inference Workloads

| Model | Original Model Repo | ITEX Step-by-Step Guide |
| ----- | ------------------- | ----------------------- |
| ResNet50v1.5 | Intel-Reference-Models/ResNet50v1.5 | ResNet50v1.5 Model Inference with Intel® Extension for TensorFlow* |
| EfficientNet-B0 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
| EfficientNet-B3 | Keras-Applications/EfficientNet | Use the same code and instructions as in the original model repo |
| Mask-RCNN | DeepLearningExamples/Mask-RCNN | Use the same code and instructions as in the original model repo |
| Stable Diffusion v1-4 | KerasCV/Stable-Diffusion | Stable Diffusion Inference for Text2Image on Intel GPU |

## Training Accuracy Results

### Training Accuracy on 1-node of 4x Intel Data Center GPU Max 1550

The following table shows the BERT-Large performance, training loss, and time-to-train (TTT) results for both the pre-training and fine-tuning phases on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU).

| | Pre-training Phase1 | Pre-training Phase2 | Fine-Tuning |
| --- | --- | --- | --- |
| Dataset | Wikipedia and BookCorpus | Wikipedia and BookCorpus | SQuAD 1.1 |
| Maximum Sequence Length | 128 | 512 | 384 |
| Data Type | BF16 | BF16 | BF16 |
| Throughput (sequences/sec) | 3265.35 | 699.25 | 523.55 |
| Time to Train (hours) | 39.32 | 20.40 | 0.67 |
| Loss | 1.6047 | 1.3870 | 0.6867 |

## Training Performance Results

### Training Performance on 1-node of 4x Intel Data Center GPU Max 1550

The following tables show the performance numbers for several popular training workloads on 1-node of 4x Intel® Data Center GPU Max 1550 (600W OAM, 2-stack for each GPU). For these workloads, we enable and benchmark both FP32 training and BF16 automatic mixed precision (AMP) training with 1-Stack of 1x Max 1550, 2-Stack of 1x Max 1550 as well as 4x Max 1550 (with 8 Stacks in total), to showcase the performance boost and scalability with Intel® Extension for TensorFlow* and Intel® Optimization for Horovod*.

Note: For each workload below, the 1x Max 1550 w/ 1-Stack result is the minimum of the per-stack results measured on the two stacks of a single GPU, with two instances launched simultaneously and each stack executing the workload independently, without distributed training.
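The derived columns in the tables below (AMP speedup and weak scaling) follow directly from the raw throughputs. A minimal sketch of those two ratios, using the ResNet50v1-5 BF16/TF32 numbers from the table as inputs:

```python
# Sketch: how the derived columns in the training tables are computed
# from raw throughputs (inputs taken from the ResNet50v1-5 table).

def amp_speedup(bf16_tput: float, tf32_tput: float) -> float:
    """Throughput gain of BF16 AMP over TF32 on the same configuration."""
    return bf16_tput / tf32_tput

def weak_scaling(tput: float, baseline_tput: float) -> float:
    """Throughput relative to the 1-rank (1-Stack) baseline at fixed per-rank batch size."""
    return tput / baseline_tput

# 1x Max 1550 w/ 1-Stack: TF32 918.96, BF16 1766.53 images/sec
print(f"AMP speedup, 1-Stack:  {amp_speedup(1766.53, 918.96):.2f}x")   # ~1.92x
# 1x Max 1550 w/ 2-Stack: BF16 3461.86 vs 1-Stack BF16 1766.53
print(f"Weak scaling, 2-Stack: {weak_scaling(3461.86, 1766.53):.2f}")  # ~1.96
# 4x Max 1550 (8 ranks): BF16 12278.32
print(f"Weak scaling, 4x GPUs: {weak_scaling(12278.32, 1766.53):.2f}") # ~6.95
```

Weak scaling is computed per data type, so the BF16 columns are normalized against the BF16 1-Stack baseline, never against TF32.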

#### ResNet50v1-5 Training Performance Results

| GPUs | Ranks | Local Batch Size (FP32, BF16) | Training Steps | Throughput w/ TF32 (images/sec) | Throughput w/ BF16 (images/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 256, 512 | 5000 | 918.96 | 1766.53 | 1.92x | 1.00 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 256, 512 | 5000 | 1762.76 | 3461.86 | 1.96x | 1.92 | 1.96 |
| 4x Max 1550 | 8 | 256, 256 | 5000 | NA | 12278.32 | NA | NA | 6.95 |

#### BERT-Large Phase2 Training Performance Results

| GPUs | Ranks | Local Batch Size x Accumulation Steps | Training Steps | Throughput w/ TF32 (sequences/sec) | Throughput w/ BF16 (sequences/sec) | Throughput Speedup w/ AMP | Weak Scaling w/ TF32 | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 32 x 30 | 20 | 36.22 | 93.22 | 2.57x | 1.00 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 32 x 30 | 20 | 74.40 | 182.57 | 2.45x | 2.05 | 1.96 |
| 4x Max 1550 | 8 | 32 x 30 | 20 | NA | 692.11 | NA | NA | 7.42 |

#### Mask-RCNN Training Performance Results

| GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (images/sec) | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 4 | 20 | 29.03 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 4 | 20 | 55.51 | 1.91 |

#### Medical Image 3D U-Net Training Performance Results

| GPUs | Ranks | Local Batch Size | Training Steps | Throughput w/ BF16 (samples/sec) | Weak Scaling w/ BF16 |
| --- | --- | --- | --- | --- | --- |
| 1x Max 1550 w/ 1-Stack | 1 | 1 | 1000 | 12.81 | 1.00 |
| 1x Max 1550 w/ 2-Stack | 2 | 1 | 1000 | 23.56 | 1.84 |
| 4x Max 1550 | 8 | 1 | 1000 | 87.07 | 6.80 |

## Inference Performance Results

### Inference Performance on 1x Intel Data Center GPU Flex 170

The following tables show the performance numbers for several popular inference workloads on 1x Intel® Data Center GPU Flex 170 (150W PCIe, 1-stack for each GPU).

Note: Inference in online mode runs the workload with a batch size of 1, while batch mode uses a larger batch size.
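Throughput in the tables below is total images processed per second of wall-clock time, which is why batch mode reports much higher numbers than online mode even though per-request latency is lower online. A minimal sketch of the calculation (the elapsed times here are hypothetical, chosen only to illustrate the arithmetic, not measured values):

```python
# Sketch: deriving inference throughput (images/sec) from a timed run.
# batch_size * steps = total images; divide by wall-clock seconds.

def throughput(batch_size: int, steps: int, elapsed_sec: float) -> float:
    """Total images processed divided by wall-clock time."""
    return batch_size * steps / elapsed_sec

# Online mode: batch size 1, e.g. 5000 steps in 11.5 s (hypothetical timing)
print(f"online: {throughput(1, 5000, 11.5):.1f} images/sec")
# Batch mode: batch size 1024, e.g. 5000 steps in 520.2 s (hypothetical timing)
print(f"batch:  {throughput(1024, 5000, 520.2):.1f} images/sec")
```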

#### ResNet50v1-5 Inference Performance Results

| GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Dummy | 224x224 | Online | 1 | INT8 | 5000 | 435.01 |
| 1x Flex 170 | Dummy | 224x224 | Batch | 1024 | INT8 | 5000 | 9842.75 |

#### EfficientNet-B0 Inference Performance Results

| GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Dummy | 224x224 | Batch | 64 | FP16 (AMP) | 50 | 3007.60 |
| 1x Flex 170 | Dummy | 224x224 | Batch | 128 | FP16 (AMP) | 50 | 3587.29 |

#### EfficientNet-B3 Inference Performance Results

| GPUs | Dataset | Image Size | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Dummy | 300x300 | Batch | 64 | FP16 (AMP) | 50 | 928.56 |
| 1x Flex 170 | Dummy | 300x300 | Batch | 128 | FP16 (AMP) | 50 | 968.83 |

#### Mask-RCNN Inference Performance Results

| GPUs | Dataset | Mode | Batch Size | Data Type | Inference Steps | Throughput (images/sec) |
| --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | COCO 2017 | Online | 1 | FP16 (AMP) | 5000 | 19.38 |
| 1x Flex 170 | COCO 2017 | Batch | 16 | FP16 (AMP) | 312 | 43.02 |

#### Stable Diffusion v1-4 Inference Performance Results

| GPUs | Dataset | Output Image Size | Mode | Batch Size | Data Type | Diffusion Steps | Throughput (iterations/sec) | Throughput Speedup w/ FP16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP32 | 50 | 2.91 | 1.00x |
| 1x Flex 170 | Text Prompt | 512x512 | Online | 1 | FP16 (pure) | 50 | 6.53 | 2.24x |
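Because Stable Diffusion throughput is reported in diffusion iterations per second rather than images per second, per-image latency follows from the step count. A small sketch using the numbers from the table above:

```python
# Sketch: converting diffusion-iteration throughput into approximate
# per-image latency (inputs from the Stable Diffusion v1-4 table).

def seconds_per_image(diffusion_steps: int, iters_per_sec: float) -> float:
    """Time for one image = denoising steps / iteration rate.
    Ignores one-off costs such as text encoding and VAE decoding."""
    return diffusion_steps / iters_per_sec

print(f"FP32: {seconds_per_image(50, 2.91):.1f} s/image")  # ~17.2 s
print(f"FP16: {seconds_per_image(50, 6.53):.1f} s/image")  # ~7.7 s
```

The 2.24x iteration-rate speedup therefore translates directly into a 2.24x reduction in the denoising portion of end-to-end image latency.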

## Configuration

### Software Configuration

#### Software Configuration for Intel Max 1550 GPU

| Software Component | Version |
| --- | --- |
| GPU Driver | 736.25 |
| Intel® oneAPI Base Toolkit | 2024.0 |
| TensorFlow | v2.14.0 |
| Intel® Extension for TensorFlow* | v2.14.0.1 |
| Intel® Optimization for Horovod* | v0.28.1.2 |

#### Software Configuration for Intel Flex 170 GPU

| Software Component | Version |
| --- | --- |
| GPU Driver | 736.25 |
| Intel® oneAPI Base Toolkit | 2024.0 |
| TensorFlow | v2.14.0 |
| Intel® Extension for TensorFlow* | v2.14.0.1 |

### Hardware Configuration

#### Hardware Configuration for Intel Max 1550 GPU

| Hardware Component | Configuration |
| --- | --- |
| GPU System | 4x Intel® Data Center GPU Max 1550 |
| Number of Nodes | 1 |
| Xe®-Cores per GPU | 128 in total (2-Stack) |
| Memory Size per GPU | 128 GB HBM2e in total (2-Stack) |
| TDP per GPU | 600W |
| GPU ECC Setting | OFF |
| Server Board | Intel® Denali Pass D50DNP1SBB |
| OS | SUSE Linux Enterprise Server 15 SP4 |
| Kernel | 5.14.21-150400.24.69-default |
| CPU Model | Intel® Xeon® Platinum 8480+ @ 2.00 GHz |
| Number of Sockets | 2 |
| CPU Cores per Socket | 56 |
| Hyper Threading | ON |
| Turbo Boost | ON |
| Automatic NUMA Balancing | Enabled |
| CPU Frequency Governor | Performance |
| TDP per CPU | 350W |
| Installed Memory | 1024GB (16x64GB 4800 MT/s DDR5) |
| NIC | 1x Intel® Ethernet Controller X710 for 10GBASE-T |
| Storage | 1x WD® WD_BLACK SN850X 2TB NVMe SSD |

#### Hardware Configuration for Intel Flex 170 GPU

| Hardware Component | Configuration |
| --- | --- |
| GPU System | 1x Intel® Data Center GPU Flex 170 |
| Number of Nodes | 1 |
| Xe®-Cores per GPU | 32 |
| Memory Size per GPU | 16 GB GDDR6 |
| TDP per GPU | 150W |
| GPU ECC Setting | ON |
| Server Board | Intel® Whitley |
| OS | Ubuntu 22.04.3 LTS |
| Kernel | 5.15.0-57-generic |
| CPU Model | Intel® Xeon® Gold 6336Y CPU @ 2.40GHz |
| Number of Sockets | 2 |
| CPU Cores per Socket | 24 |
| Hyper Threading | ON |
| Turbo Boost | ON |
| Automatic NUMA Balancing | Enabled |
| CPU Frequency Governor | Performance |
| TDP per CPU | 185W |
| Installed Memory | 128GB (8x16GB 3200 MT/s DDR4) |
| NIC | 2x Intel® Ethernet Controller X710 for 10GBASE-T, 1x Intel® 82574L Gigabit Ethernet Controller |
| Storage | 1x Intel® SSDSC2KG960G8, 1x Samsung® 870 EVO 1TB SSD |

## Additional Performance Data for Intel AI Data Center Products

You can find the latest performance data for other Intel® AI Data Center Products, such as 3rd, 4th, and 5th Gen Intel® Xeon® Scalable processors, via Performance Data for Intel® AI Data Center Products.
