Intel® Extension for TensorFlow* CPU lets you choose between the OpenMP thread pool (default) and the Eigen thread pool by setting the environment variable `ITEX_OMP_THREADPOOL=1` or `ITEX_OMP_THREADPOOL=0`, respectively. This gives you the flexibility to select the more efficient thread pool for your workload and hardware configuration. If setting `inter_op_parallelism_threads=1` causes a large performance drop for your workload, we recommend using the Eigen thread pool: running independent operations concurrently can be more efficient for cheap ops that cannot fully utilize the hardware compute units on their own.
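For example, here is a minimal sketch of exercising that comparison from Python, assuming `ITEX_OMP_THREADPOOL` is read when the extension is loaded at import time (the threading settings must be applied before any ops run):

```python
# Minimal sketch: select the thread pool and inter-op parallelism before
# TensorFlow initializes. Run in a fresh process.
import os
os.environ["ITEX_OMP_THREADPOOL"] = "0"   # "1" = OpenMP (default), "0" = Eigen

import tensorflow as tf

# Try 1 here to check whether serializing independent ops hurts your workload;
# if it does, the Eigen thread pool (ITEX_OMP_THREADPOOL=0) is likely the better choice.
tf.config.threading.set_inter_op_parallelism_threads(1)

print(tf.reduce_sum(tf.random.uniform((1024, 1024))).numpy())
```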
The OpenMP thread pool is the default in Intel® Extension for TensorFlow* CPU. It provides lower scheduling overhead, better data locality, and better cache usage. Configure the number of OpenMP threads via the `OMP_NUM_THREADS` environment variable. Because OpenMP uses a fork-join model while TensorFlow also parallelizes between independent operations, you must set the configuration correctly to avoid thread conflicts: make sure the total number of OpenMP threads forked across `inter_op_parallelism_threads` is less than the number of available CPU cores. For example, by default Intel® Extension for TensorFlow* sets the number of threads used by independent non-blocking operations to 1; in that case, set `OMP_NUM_THREADS` to the number of available cores, together with `KMP_BLOCKTIME=1` and `KMP_AFFINITY=granularity=fine,compact,1,0`.
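As a sketch of that configuration on a hypothetical 56-core socket (the core count, pinning range, and `your_script.py` are placeholders to adapt to your machine and workload):

```bash
# Hypothetical 56-core socket: with inter_op_parallelism_threads=1 (the default
# here), at most 1 x OMP_NUM_THREADS = 56 OpenMP threads are forked, which does
# not exceed the 56 cores pinned by numactl.
export OMP_NUM_THREADS=56
export KMP_BLOCKTIME=1
export KMP_AFFINITY=granularity=fine,compact,1,0
numactl -C 0-55 -l python your_script.py
```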
For workloads with large inter-op concurrency, the OpenMP thread pool may not supply sufficient parallelism between operations. In that case, switch to the non-blocking thread pool provided by Eigen, which is the default in stock TensorFlow. As in TensorFlow, `inter_op_parallelism_threads` is 0 by default, meaning independent operations are parallelized as much as possible. The work-stealing queue in the Eigen thread pool allows better dynamic load balancing, giving better performance and scaling with larger `inter_op_parallelism_threads` values. No other configuration is needed when using the Eigen thread pool.
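For intuition, here is a toy sketch (not from the guide) of the kind of workload this describes: many cheap, independent matmuls whose only route to full hardware utilization is concurrent dispatch across ops. Timing it with `ITEX_OMP_THREADPOOL=0` versus `1` gives a rough feel for which pool suits such shapes.

```python
# Toy workload: 256 independent 64x64 matmuls. Each op is too cheap to saturate
# the cores on its own, so concurrent dispatch across ops is what matters.
import time
import tensorflow as tf

@tf.function
def many_small_matmuls(xs):
    # The ops are independent, so the inter-op scheduler may run them concurrently.
    return [tf.matmul(x, x) for x in xs]

xs = [tf.random.uniform((64, 64)) for _ in range(256)]
many_small_matmuls(xs)                # warm-up / tracing
start = time.time()
for _ in range(100):
    many_small_matmuls(xs)
print("elapsed (s):", time.time() - start)
```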
Here we show two examples using different thread pools on an Intel® Xeon® Platinum 8480+ system. The first example is modified from a Keras example:
```python
import tensorflow as tf
import time
from tensorflow import keras
from tensorflow.keras import layers

"""## The Antirectifier layer"""


class Antirectifier(layers.Layer):
    def __init__(self, initializer="he_normal", **kwargs):
        super().__init__(**kwargs)
        self.initializer = keras.initializers.get(initializer)

    def build(self, input_shape):
        output_dim = input_shape[-1]
        self.kernel = self.add_weight(
            shape=(output_dim * 2, output_dim),
            initializer=self.initializer,
            name="kernel",
            trainable=True,
        )

    def call(self, inputs):
        inputs -= tf.reduce_mean(inputs, axis=-1, keepdims=True)
        pos = tf.nn.relu(inputs)
        neg = tf.nn.relu(-inputs)
        concatenated = tf.concat([pos, neg], axis=-1)
        mixed = tf.matmul(concatenated, self.kernel)
        return mixed

    def get_config(self):
        # Implement get_config to enable serialization. This is optional.
        base_config = super().get_config()
        config = {"initializer": keras.initializers.serialize(self.initializer)}
        return dict(list(base_config.items()) + list(config.items()))


"""## Let's test-drive it on MNIST"""

# Training parameters
batch_size = 64
num_classes = 10
epochs = 20

# The data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784)
x_test = x_test.reshape(-1, 784)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_train /= 255
x_test /= 255
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# Build the model
model = keras.Sequential(
    [
        keras.Input(shape=(784,)),
        layers.Dense(256),
        Antirectifier(),
        layers.Dense(256),
        Antirectifier(),
        layers.Dropout(0.5),
        layers.Dense(10),
    ]
)

# Compile the model
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.RMSprop(),
    metrics=[keras.metrics.SparseCategoricalAccuracy()],
)

# Train the model
# warmup
model.fit(x_train, y_train, batch_size=batch_size, epochs=1,
          steps_per_epoch=10, validation_split=0.15)

# Train the model
start = time.time()
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
          validation_split=0.15)
end = time.time()
print("train time(s): ", end - start)

# Test the model
# warmup
model.evaluate(x_test, y_test)
start = time.time()
model.evaluate(x_test, y_test)
end = time.time()
print("evaluation time(s): ", end - start)
```
For the Eigen thread pool, run with:

```bash
ITEX_OMP_THREADPOOL=0 numactl -C 0-55 -l python antirectifier.py
```

This run takes about 60 seconds to train and 0.26 seconds to evaluate. For the OpenMP thread pool, run with:

```bash
OMP_NUM_THREADS=56 KMP_BLOCKTIME=1 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 numactl -C 0-55 -l python antirectifier.py
```

This run takes about 85 seconds to train and 0.36 seconds to evaluate. The Eigen thread pool is about 30% faster on this small model.

The second example benchmarks inception_v4. Before running the code below, download the trained model:
```bash
wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/inceptionv4_fp32_pretrained_model.pb
```
```python
import tensorflow as tf
from tensorflow.core.framework.graph_pb2 import GraphDef
import time
import json
import os


def load_pb(pb_file):
    with open(pb_file, 'rb') as f:
        gd = GraphDef()
        gd.ParseFromString(f.read())
    return gd


def get_concrete_function(graph_def, inputs, outputs, print_graph=False):
    def imports_graph_def():
        tf.compat.v1.import_graph_def(graph_def, name="")

    wrap_function = tf.compat.v1.wrap_function(imports_graph_def, [])
    graph = wrap_function.graph

    return wrap_function.prune(
        tf.nest.map_structure(graph.as_graph_element, inputs),
        tf.nest.map_structure(graph.as_graph_element, outputs))


def save_json_data(logs, filename="train.json"):
    with open(filename, "w") as f:
        json.dump(logs, f)


def run_infer(concrete_function, shape):
    total_times = 50
    warm = int(0.2 * total_times)
    res = []
    for i in range(total_times):
        input_x = tf.random.uniform(shape, minval=0, maxval=1.0, dtype=tf.float32)
        bt = time.time()
        y = concrete_function(input=input_x)
        delta_time = time.time() - bt
        print('Iteration %d: %.3f sec' % (i, delta_time))
        if i >= warm:
            res.append(delta_time)
    latency = sum(res) / len(res)
    return latency


def do_benchmark(pb_file, inputs, outputs, base_shape):
    concrete_function = get_concrete_function(
        graph_def=load_pb(pb_file),
        inputs=inputs,
        outputs=outputs,
        print_graph=True)

    bs = 1
    base_shape.insert(0, bs)
    shape = tuple(base_shape)
    latency = run_infer(concrete_function, shape)
    res = {'latency': latency}
    print("Benchmark is done!")
    print("Benchmark res {}".format(res))
    print("Finished")
    return res


def benchmark():
    pb_file = "inceptionv4_fp32_pretrained_model.pb"
    inputs = ['input:0']
    outputs = ['InceptionV4/Logits/Predictions:0']
    base_shape = [299, 299, 3]
    return do_benchmark(pb_file, inputs, outputs, base_shape)


if __name__ == "__main__":
    benchmark()
```
For the Eigen thread pool, run with:

```bash
ITEX_OMP_THREADPOOL=0 numactl -C 0-3 -l python benchmark.py
```

The latency is about 0.08 second/image. For the OpenMP thread pool, run with:

```bash
OMP_NUM_THREADS=4 KMP_BLOCKTIME=1 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 numactl -C 0-3 -l python benchmark.py
```

The latency is about 0.04 second/image. The OpenMP thread pool is about 2x faster than the Eigen thread pool on this model.