
# Selecting Thread Pool in Intel® Extension for TensorFlow* CPU [Experimental]

Intel® Extension for TensorFlow* CPU lets you choose between the OpenMP thread pool (the default) and the Eigen thread pool by setting the environment variable `ITEX_OMP_THREADPOOL` to `1` or `0`, respectively. This gives you the flexibility to select the more efficient thread pool for your workload and hardware configuration. If setting `inter_op_parallelism_threads=1` causes a large performance drop for your workload, we recommend the Eigen thread pool: running independent operations concurrently can be more efficient for cheap ops that cannot fully utilize the hardware compute units on their own.
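A minimal sketch of selecting the thread pool from inside a script. This assumes `ITEX_OMP_THREADPOOL` is read when the library is loaded, so it must be set before TensorFlow is imported; setting it in the shell before launching Python works equally well:

```python
import os

# Assumption: ITEX reads ITEX_OMP_THREADPOOL at load time, so set it
# before the first TensorFlow import. "1" = OpenMP (default), "0" = Eigen.
os.environ["ITEX_OMP_THREADPOOL"] = "0"

import tensorflow as tf  # imported after the env var on purpose

# Sanity check: inspect the current inter-op parallelism setting.
print("Inter-op threads:", tf.config.threading.get_inter_op_parallelism_threads())
```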

## Using OpenMP Thread Pool

The OpenMP thread pool is the default in Intel® Extension for TensorFlow* CPU. It provides lower scheduling overhead, better data locality, and better cache usage. Configure the number of OpenMP threads via the `OMP_NUM_THREADS` environment variable. Because OpenMP uses a fork-join model while TensorFlow runs independent operations in parallel, you must configure the two carefully to avoid thread oversubscription: make sure the total number of OpenMP threads forked across the `inter_op_parallelism_threads` inter-op threads does not exceed the number of available CPU cores. For example, Intel® Extension for TensorFlow* sets the number of threads used for independent non-blocking operations to 1 by default; in that case, set `OMP_NUM_THREADS` to the number of available cores, along with `KMP_BLOCKTIME=1` and `KMP_AFFINITY=granularity=fine,compact,1,0`.
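A minimal sketch of this configuration, assuming a 56-core socket (adjust the thread count to your machine); the OpenMP variables must be set before TensorFlow is imported:

```python
import os

# Assumption: one 56-core socket; set OMP_NUM_THREADS to your core count.
os.environ["OMP_NUM_THREADS"] = "56"
os.environ["KMP_BLOCKTIME"] = "1"
os.environ["KMP_AFFINITY"] = "granularity=fine,compact,1,0"

import tensorflow as tf  # imported after the OpenMP variables are set

# Run one op at a time, so its OMP threads own all cores
# without oversubscription.
tf.config.threading.set_inter_op_parallelism_threads(1)
```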

## Using Eigen Thread Pool

For workloads with high inter-op concurrency, the OpenMP thread pool may not supply sufficient parallelism between operations. In this case, switch to the non-blocking thread pool provided by Eigen, which is the default in stock TensorFlow. As in TensorFlow, `inter_op_parallelism_threads` then defaults to `0`, which means the runtime parallelizes independent operations as much as possible. The work-stealing queue in the Eigen thread pool allows better dynamic load balancing, giving better performance and scaling as `inter_op_parallelism_threads` grows. No other configuration is needed when using the Eigen thread pool.
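If you do want to cap the inter-op parallelism rather than leave it at the default `0`, a sketch (the value `4` here is an arbitrary illustration, not a recommendation):

```python
import os

os.environ["ITEX_OMP_THREADPOOL"] = "0"  # select the Eigen thread pool

import tensorflow as tf

# 0 (the default) lets the runtime choose; a positive value caps how many
# independent ops may run concurrently. 4 is an arbitrary illustrative value.
tf.config.threading.set_inter_op_parallelism_threads(4)
print(tf.config.threading.get_inter_op_parallelism_threads())
```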

## Example

Here we show two examples using different thread pools on Intel® Xeon® Platinum 8480+ systems.

1. This example is modified from a Keras example.

   ```python
   import tensorflow as tf
   import time
   from tensorflow import keras
   from tensorflow.keras import layers

   """## The Antirectifier layer"""


   class Antirectifier(layers.Layer):
       def __init__(self, initializer="he_normal", **kwargs):
           super().__init__(**kwargs)
           self.initializer = keras.initializers.get(initializer)

       def build(self, input_shape):
           output_dim = input_shape[-1]
           self.kernel = self.add_weight(
               shape=(output_dim * 2, output_dim),
               initializer=self.initializer,
               name="kernel",
               trainable=True,
           )

       def call(self, inputs):
           inputs -= tf.reduce_mean(inputs, axis=-1, keepdims=True)
           pos = tf.nn.relu(inputs)
           neg = tf.nn.relu(-inputs)
           concatenated = tf.concat([pos, neg], axis=-1)
           mixed = tf.matmul(concatenated, self.kernel)
           return mixed

       def get_config(self):
           # Implement get_config to enable serialization. This is optional.
           base_config = super().get_config()
           config = {"initializer": keras.initializers.serialize(self.initializer)}
           return dict(list(base_config.items()) + list(config.items()))


   """## Let's test-drive it on MNIST"""

   # Training parameters
   batch_size = 64
   num_classes = 10
   epochs = 20

   # The data, split between train and test sets
   (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
   x_train = x_train.reshape(-1, 784)
   x_test = x_test.reshape(-1, 784)
   x_train = x_train.astype("float32")
   x_test = x_test.astype("float32")
   x_train /= 255
   x_test /= 255
   print(x_train.shape[0], "train samples")
   print(x_test.shape[0], "test samples")

   # Build the model
   model = keras.Sequential(
       [
           keras.Input(shape=(784,)),
           layers.Dense(256),
           Antirectifier(),
           layers.Dense(256),
           Antirectifier(),
           layers.Dropout(0.5),
           layers.Dense(10),
       ]
   )

   # Compile the model
   model.compile(
       loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
       optimizer=keras.optimizers.RMSprop(),
       metrics=[keras.metrics.SparseCategoricalAccuracy()],
   )

   # Train the model
   # warmup
   model.fit(x_train, y_train, batch_size=batch_size, epochs=1,
             steps_per_epoch=10, validation_split=0.15)

   # Train the model
   start = time.time()
   model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
             validation_split=0.15)
   end = time.time()
   print("train time(s): ", end - start)

   # Test the model
   # warmup
   model.evaluate(x_test, y_test)
   start = time.time()
   model.evaluate(x_test, y_test)
   end = time.time()
   print("evaluation time(s): ", end - start)
   ```

   For the Eigen thread pool, run `ITEX_OMP_THREADPOOL=0 numactl -C 0-55 -l python antirectifier.py`. This run takes about 60 seconds to train and 0.26 seconds to evaluate. For the OpenMP thread pool, run `OMP_NUM_THREADS=56 KMP_BLOCKTIME=1 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 numactl -C 0-55 -l python antirectifier.py`. This run takes about 85 seconds to train and 0.36 seconds to evaluate. The Eigen thread pool is about 30% faster on this small model.

2. Benchmark the `inception_v4` example. Get the trained model via `wget https://storage.googleapis.com/intel-optimized-tensorflow/models/v1_8/inceptionv4_fp32_pretrained_model.pb` before running the following code.

   ```python
   import tensorflow as tf
   from tensorflow.core.framework.graph_pb2 import GraphDef
   import time
   import json
   import os


   def load_pb(pb_file):
       with open(pb_file, 'rb') as f:
           gd = GraphDef()
           gd.ParseFromString(f.read())
       return gd


   def get_concrete_function(graph_def, inputs, outputs, print_graph=False):
       def imports_graph_def():
           tf.compat.v1.import_graph_def(graph_def, name="")

       wrap_function = tf.compat.v1.wrap_function(imports_graph_def, [])
       graph = wrap_function.graph

       return wrap_function.prune(
           tf.nest.map_structure(graph.as_graph_element, inputs),
           tf.nest.map_structure(graph.as_graph_element, outputs))


   def save_json_data(logs, filename="train.json"):
       with open(filename, "w") as f:
           json.dump(logs, f)


   def run_infer(concrete_function, shape):
       total_times = 50
       warm = int(0.2 * total_times)
       res = []
       for i in range(total_times):
           input_x = tf.random.uniform(shape, minval=0, maxval=1.0,
                                       dtype=tf.float32)
           bt = time.time()
           y = concrete_function(input=input_x)
           delta_time = time.time() - bt
           print('Iteration %d: %.3f sec' % (i, delta_time))
           if i >= warm:
               res.append(delta_time)
       latency = sum(res) / len(res)
       return latency


   def do_benchmark(pb_file, inputs, outputs, base_shape):
       concrete_function = get_concrete_function(
           graph_def=load_pb(pb_file),
           inputs=inputs,
           outputs=outputs,
           print_graph=True)

       bs = 1
       base_shape.insert(0, bs)
       shape = tuple(base_shape)
       latency = run_infer(concrete_function, shape)
       res = {'latency': latency}
       print("Benchmark is done!")
       print("Benchmark res {}".format(res))
       print("Finished")
       return res


   def benchmark():
       pb_file = "inceptionv4_fp32_pretrained_model.pb"
       inputs = ['input:0']
       outputs = ['InceptionV4/Logits/Predictions:0']
       base_shape = [299, 299, 3]
       return do_benchmark(pb_file, inputs, outputs, base_shape)


   if __name__ == "__main__":
       benchmark()
   ```

   For the Eigen thread pool, run `ITEX_OMP_THREADPOOL=0 numactl -C 0-3 -l python benchmark.py`. The latency is about 0.08 seconds per image. For the OpenMP thread pool, run `OMP_NUM_THREADS=4 KMP_BLOCKTIME=1 KMP_AFFINITY=granularity=fine,verbose,compact,1,0 numactl -C 0-3 -l python benchmark.py`. The latency is about 0.04 seconds per image. The OpenMP thread pool is about 2x faster than the Eigen thread pool on this model.
