
Performance regression in SLPVectorize between llvm 10.0 and 11.0 #47830

Open
@d-parks

Description

Bugzilla Link: 48486
Version: 11.0
OS: Linux
CC: @alexey-bataev, @topperc, @LebedevRI, @RKSimon, @rscottmanley, @rotateright

Extended Description

With llvm 11.0, changes to the heuristics and/or instruction costs used in SLPVectorize.cpp (opt) have caused a 30% regression in overall application performance, traced to routine __nv_MorphologyPrimitive_F1L2849_2 in the attached morphology.ll, as measured on an Intel Skylake 40-core Xeon server.

With llvm 10.0, SLPVectorize promotes some of the loops from 128-bit xmm packed-double (pd) instructions to 256-bit ymm pd instructions. Those same transformations do not happen with llvm 11.0.
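One way to quantify the xmm-vs-ymm difference in the attached assembly and perf listings is a rough count of packed-double instructions grouped by register width. This is a minimal sketch, assuming lowercase mnemonics where packed-double instructions end in `pd` (vaddpd, vmulpd, vmovupd, ...), in either AT&T or Intel syntax; `count_pd` is a hypothetical helper, and its grouping is a heuristic rather than an exact instruction census.

```shell
#!/bin/sh
# count_pd FILE  (hypothetical helper)
# Heuristic count of packed-double instructions (mnemonics ending in "pd")
# in an assembly or perf-annotate listing, grouped by the widest vector
# register named on the line. A line naming several widths, such as
# "vcvtps2pd %xmm0, %ymm1", is counted once under the widest register.
count_pd() {
    file=$1
    # Keep only lines containing a "...pd" mnemonic followed by whitespace.
    pd_lines=$(grep -E '[a-z0-9]pd[[:space:]]' "$file")
    # Widest first: zmm, then ymm (excluding zmm lines), then xmm.
    zmm=$(printf '%s\n' "$pd_lines" | grep -c 'zmm[0-9]')
    ymm=$(printf '%s\n' "$pd_lines" | grep -v 'zmm[0-9]' | grep -c 'ymm[0-9]')
    xmm=$(printf '%s\n' "$pd_lines" | grep -v 'zmm[0-9]' | grep -v 'ymm[0-9]' | grep -c 'xmm[0-9]')
    printf 'xmm_pd=%s ymm_pd=%s zmm_pd=%s\n' "$xmm" "$ymm" "$zmm"
}
```

For example, `count_pd morphology-10.s` or `count_pd perf-10.lst`, compared against the llvm 11.0 counterpart, should show a shift from ymm_pd toward xmm_pd if the regression is the one described above.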

Attached in SLPV.tar are:
morphology.ll: input used for opt from llvm releases 10 and 11
morphology-10.llvm: output of opt using --opt-bisect-limit=778 (i.e. stopping just after the SLP pass), produced exactly with:

lim=778
opt -O2 -mcpu=skylake-avx512 --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math --enable-no-signed-zeros-fp-math --opt-bisect-limit=${lim} ./obj/magick/morphology.ll -S -o ./obj/magick/morphology-10.llvm

morphology-11.llvm: the corresponding output from llvm 11 opt
morphology-10.s: output from llc, invoked with:
-mcpu=skylake-avx512 -O2 --enable-unsafe-fp-math --enable-no-nans-fp-math --enable-no-infs-fp-math --enable-no-signed-zeros-fp-math -fast-isel=0 -non-global-value-max-name-size=4294967295 -x86-cmov-converter=0 -filetype=obj

perf-10.lst and perf-11.lst: snapshots of perf report for the most costly loop in routine __nv_MorphologyPrimitive_F1L2849_2
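The bisect limit of 778 is specific to this module and this opt build, so the SLP pass will generally have a different index under another release. A hedged sketch for locating it: rerun the same opt command with --opt-bisect-limit=-1 (which, to my understanding, runs every pass and logs each one to stderr) and take the last SLP Vectorizer entry for the routine of interest. The log line shape `BISECT: running pass (N) <pass name> on function (<name>)` and the legacy pass name `SLP Vectorizer` are assumptions to verify against your opt version; `find_bisect_index` is a hypothetical helper.

```shell
#!/bin/sh
# find_bisect_index LOG FUNC [PASS_NAME]  (hypothetical helper)
# Prints the bisect index of the last pass whose name contains PASS_NAME
# (default "SLP Vectorizer") that ran on function FUNC, per an opt bisect log.
# Assumed log line shape (verify against your opt build):
#   BISECT: running pass (N) SLP Vectorizer on function (FUNC)
find_bisect_index() {
    log=$1
    func=$2
    pass=${3:-SLP Vectorizer}
    grep '^BISECT: running pass (' "$log" \
        | grep -F "$pass" \
        | grep -F "on function ($func)" \
        | sed -n 's/^BISECT: running pass (\([0-9][0-9]*\)).*/\1/p' \
        | tail -n 1
}
```

Usage: redirect stderr of the opt command into bisect.log, then run `find_bisect_index bisect.log __nv_MorphologyPrimitive_F1L2849_2` and pass the printed number as the limit for that opt build.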
