Description
Bugzilla Link | 48486 |
Version | 11.0 |
OS | Linux |
CC | @alexey-bataev,@topperc,@LebedevRI,@RKSimon,@rscottmanley,@rotateright |
Extended Description
With llvm 11.0, the change to the heuristics and/or instruction costs used in SLPVectorize.cpp (opt) has caused a 30% regression in overall application performance with routine __nv_MorphologyPrimitive_F1L2849_2 in the attached morphology.ll, as measured on an Intel Skylake 40-core Xeon server.
With llvm 10.0, SLPVectorize promotes some of the loops from using xmm pd to ymm pd. Those same transformations do not happen with llvm 11.0.
Attached in SLPV.tar are:
morphology.ll (used as input for llvm opt releases 10 and 11)
morphology-10.llvm (output of opt using --opt-bisect-limit=778 - just after the SLP pass) - exactly:
lim=778
opt -O2 -mcpu=skylake-avx512 --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math --enable-no-signed-zeros-fp-math --opt-bisect-limit=${lim} ./obj/magick/morphology.ll -S -o ./obj/magick/morphology-10.llvm
morphology-11.llvm
morphology-10.s output from llc invoked with:
-mcpu=skylake-avx512 -O2 --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math --enable-no-signed-zeros-fp-math -fast-isel=0 -non-global-value-max-name-size=4294967295 -x86-cmov-converter=0 -filetype=obj
perf-10.lst and perf-11.lst: snapshots of perf report output for the most costly loop in routine __nv_MorphologyPrimitive_F1L2849_2
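
One rough way to confirm the xmm pd versus ymm pd difference at the IR level is to count the `double` vector types in the two post-SLP dumps. An xmm register holds 2 doubles and a ymm register holds 4, so fewer `<4 x double>` occurrences in morphology-11.llvm than in morphology-10.llvm would be consistent with the missing promotion. The script below is only a sketch: the helper name `count_double_vector_widths` is hypothetical, and it counts textual occurrences of the types rather than instructions, so treat the numbers as a hint and not as a measurement.

```python
import re
import sys
from collections import Counter

# Matches fixed-width LLVM IR vector-of-double types, e.g. "<4 x double>".
# Scalable vectors ("<vscale x 2 x double>") are intentionally not matched.
VEC_DOUBLE = re.compile(r"<(\d+) x double>")


def count_double_vector_widths(ir_text):
    """Map lane count -> number of textual occurrences in the IR text.

    Hypothetical helper for illustration; it counts type mentions,
    not instructions, so one instruction may contribute several hits.
    """
    return Counter(int(lanes) for lanes in VEC_DOUBLE.findall(ir_text))


if __name__ == "__main__":
    # Each argument is assumed to be a textual IR file produced by `opt -S`.
    for path in sys.argv[1:]:
        with open(path) as f:
            counts = count_double_vector_widths(f.read())
        print(path, dict(sorted(counts.items())))
```

Assuming the script is saved under a name such as count_vec.py (a placeholder), it could be run as `python count_vec.py morphology-10.llvm morphology-11.llvm` to print the per-width counts for both files side by side.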