Intel® Extension for TensorFlow* provides graph optimization that fuses specified operator patterns into a single new operator for better performance.
The basic list of supported fusions is shown below. These fusions require the input and output to have the same data type.
| Pattern | Operator number |
| --- | --- |
| (Equal, NotEqual, GreaterEqual, Greater, LessEqual, Less) + Cast | 2 |
| L2Loss + AddN | 2 |
| BatchMatMul + Mul | 2 |
| Mul + AddN + TrainingOp | 3 |
| Conv + Bias | 2 |
| Conv + Bias + (Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| MatMul + Bias | 2 |
| MatMul + Bias + (Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 3 |
| FusedBatchNorm + Relu | 2 |
| FusedBatchNormGrad + ReluGrad | 2 |
| Conv + Bias + Add | 3 |
| Conv + Bias + Add + (Relu, Relu6, Elu, LeakyRelu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul + Bias + Add | 3 |
| MatMul + Bias + Add + (Relu, Relu6, Elu, Gelu_erf, Gelu_tanh, Tanh, Sigmoid) | 4 |
| MatMul + BiasAddGrad | 2 |
| ConvGradFilter + BiasAddGrad | 2 |
| Pad + Conv | 2 |
| BatchMatMul with variable post-op | 2+ |
| Swish | 2 |
| LayerNorm | 3+ |
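For intuition, a fused pattern such as MatMul + Bias + Relu computes exactly the same result as running the three operators back to back; the benefit is a single kernel launch and fewer intermediate tensors. The following NumPy sketch is purely illustrative (it is not the extension's oneDNN kernel) and only demonstrates the numerical equivalence:

```python
import numpy as np

def matmul_bias_relu_unfused(x, w, b):
    # Three separate ops, as they appear in the unoptimized graph.
    y = x @ w                 # MatMul
    y = y + b                 # BiasAdd
    return np.maximum(y, 0.0) # Relu

def matmul_bias_relu_fused(x, w, b):
    # One combined expression standing in for the fused kernel:
    # in a real fused implementation the intermediates never
    # materialize as separate tensor buffers.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8)).astype(np.float32)
w = rng.standard_normal((8, 3)).astype(np.float32)
b = rng.standard_normal(3).astype(np.float32)

assert np.allclose(matmul_bias_relu_unfused(x, w, b),
                   matmul_bias_relu_fused(x, w, b))
```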
Because stock TensorFlow only supports input and output of the same data type, inserting a Cast node during BF16 inference and training may break an existing fusion pattern and hurt performance.
Intel® Extension for TensorFlow* provides mixed data type fusion, which removes these additional data type conversions at the graph level.
The supported mixed data type fusions are listed below, using MatMul as the example.
| Pattern | Fused operator | Input data type | Output data type | oneDNN FP32 math mode |
| --- | --- | --- | --- | --- |
| MatMul + Cast | AccMatMul | BF16 | FP32 | N/A |
| FusedMatMul + Cast | FusedAccMatMul | BF16 | FP32 | N/A |
| AccMatMul + any MatMul fusion | FusedAccMatMul | BF16 | FP32 | N/A |
| Cast + MatMul + Cast | AccMatMul | FP32 | FP32 | BF16 |
| Cast + FusedMatMul + Cast | FusedAccMatMul | FP32 | FP32 | BF16 |
The `Cast + (Fused)MatMul + Cast` pattern is covered by the pattern matcher; the rest are covered by remapper fusion. The new kernels (`AccMatMul` and `FusedAccMatMul(WithSum)`) are implemented as an extension of the original `MatMul` with the following new attributes:

- `Tout`: Output data type ∈ {`float32`}.
- `Tpost`: Post-op data type ∈ {`bfloat16`, `float32`}.
- `is_bf16_math_mode`: A Boolean indicating whether to use oneDNN `BF16` math mode when both input and output are `FP32`.
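The semantics of `AccMatMul` can be sketched numerically: inputs are carried at BF16 precision, but accumulation and output stay in FP32 (`Tout = float32`), so no trailing Cast node is needed. The sketch below simulates BF16 by truncating the low 16 mantissa bits of a float32 (real hardware rounds to nearest rather than truncating); it is an illustration of the data-type contract, not the extension's kernel:

```python
import numpy as np

def to_bf16(x):
    # Simulate bfloat16 by zeroing the low 16 bits of the float32
    # representation (round-toward-zero; hardware rounds to nearest).
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (u & np.uint32(0xFFFF0000)).view(np.float32)

def acc_matmul(a, b):
    # Sketch of AccMatMul semantics: BF16 inputs, FP32 accumulation,
    # FP32 output -- the result never drops back to BF16.
    return to_bf16(a).astype(np.float32) @ to_bf16(b).astype(np.float32)

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 16)).astype(np.float32)
b = rng.standard_normal((16, 4)).astype(np.float32)

out = acc_matmul(a, b)
assert out.dtype == np.float32
# The FP32 result tracks the full-precision product to BF16 accuracy.
assert np.max(np.abs(out - a @ b)) < 1.0
```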
Because the channels_first format is not supported by stock TensorFlow on CPU, it inserts Transpose nodes before and after Conv3D/MaxPool3D nodes. This problem does not exist on GPU devices. To avoid unnecessary layout transformations when running on a GPU, Intel® Extension for TensorFlow* adds a separate layout optimizer.
| Pattern | Fused operator | Conv data format (before optimization) | Conv data format (after optimization) |
| --- | --- | --- | --- |
| Transpose + Conv3D + Transpose | Conv3D | NDHWC | NCDHW |
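The two Transpose nodes around Conv3D are exact inverses of each other, which is why the layout optimizer can drop them and run the convolution directly in NCDHW on GPU. A small NumPy sketch of the axis permutations involved (illustrative only):

```python
import numpy as np

# NCDHW tensor: batch, channels, depth, height, width.
x_ncdhw = np.zeros((2, 3, 8, 8, 8), dtype=np.float32)

to_ndhwc = (0, 2, 3, 4, 1)  # NCDHW -> NDHWC (pre-Conv3D Transpose)
to_ncdhw = (0, 4, 1, 2, 3)  # NDHWC -> NCDHW (post-Conv3D Transpose)

x_ndhwc = np.transpose(x_ncdhw, to_ndhwc)
assert x_ndhwc.shape == (2, 8, 8, 8, 3)

# The two permutations cancel, so eliding both and keeping Conv3D
# in NCDHW preserves the computation.
x_back = np.transpose(x_ndhwc, to_ncdhw)
assert x_back.shape == x_ncdhw.shape
```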