- Notifications
You must be signed in to change notification settings - Fork 13.3k
[LoongArch] Fix fp_to_uint/fp_to_sint conversion errors for lasx#137129
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-backend-loongarch Author: None (tangaac) ChangesPrvious Full diff: https://github.com/llvm/llvm-project/pull/137129.diff 3 Files Affected:
diff --git a/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td b/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td index e4268920e0b27..fe08c1050b4d7 100644 --- a/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td+++ b/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td@@ -1815,24 +1815,24 @@ def : Pat<(v4f32 (uint_to_fp v4i64:$vj)), // XVFTINTRZ_{W_S/L_D} def : Pat<(v8i32 (fp_to_sint v8f32:$vj)), (XVFTINTRZ_W_S v8f32:$vj)>; def : Pat<(v4i64 (fp_to_sint v4f64:$vj)), (XVFTINTRZ_L_D v4f64:$vj)>; -def : Pat<(v4i64 (fp_to_sint v4f32:$vj)),- (VEXT2XV_D_W (SUBREG_TO_REG (i64 0), (VFTINTRZ_W_S v4f32:$vj),- sub_128))>;-def : Pat<(v4i32 (fp_to_sint (v4f64 LASX256:$vj))),- (EXTRACT_SUBREG (XVFTINTRZ_W_S (XVFCVT_S_D (XVPERMI_D v4f64:$vj, 238),- v4f64:$vj)),- sub_128)>;+def : Pat<(v4i64(fp_to_sint v4f32:$vj)), (VEXT2XV_D_W(SUBREG_TO_REG(i64 0),+ (VFTINTRZ_W_S v4f32:$vj),+ sub_128))>;+def : Pat<(v4i32(fp_to_sint v4f64:$vj)),+ (EXTRACT_SUBREG(XVPICKEV_W(XVPERMI_D(XVFTINTRZ_L_D v4f64:$vj), 238),+ (XVFTINTRZ_L_D v4f64:$vj)),+ sub_128)>; // XVFTINTRZ_{W_SU/L_DU} def : Pat<(v8i32 (fp_to_uint v8f32:$vj)), (XVFTINTRZ_WU_S v8f32:$vj)>; def : Pat<(v4i64 (fp_to_uint v4f64:$vj)), (XVFTINTRZ_LU_D v4f64:$vj)>; -def : Pat<(v4i64 (fp_to_uint v4f32:$vj)),- (VEXT2XV_DU_WU (SUBREG_TO_REG (i64 0), (VFTINTRZ_WU_S v4f32:$vj),- sub_128))>;-def : Pat<(v4i32 (fp_to_uint (v4f64 LASX256:$vj))),- (EXTRACT_SUBREG (XVFTINTRZ_W_S (XVFCVT_S_D (XVPERMI_D v4f64:$vj, 238),- v4f64:$vj)),- sub_128)>;+def : Pat<(v4i64(fp_to_uint v4f32:$vj)), (VEXT2XV_DU_WU(SUBREG_TO_REG(i64 0),+ (VFTINTRZ_WU_S v4f32:$vj),+ sub_128))>;+def : Pat<(v4i32(fp_to_uint v4f64:$vj)),+ (EXTRACT_SUBREG(XVPICKEV_W(XVPERMI_D(XVFTINTRZ_LU_D v4f64:$vj), 238),+ (XVFTINTRZ_LU_D v4f64:$vj)),+ sub_128)>; // XVPERMI_Q foreach vt = [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64] in diff --git a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll index 0d9f57b57ffae..ed333c303879c 100644 --- a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll+++ b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll@@ -31,9 +31,9 @@ define void @fptosi_v4f64_v4i32(ptr %res, ptr %in){ ; CHECK-LABEL: fptosi_v4f64_v4i32: ; CHECK: # %bb.0: ; CHECK-NEXT: xvld $xr0, $a1, 0 +; CHECK-NEXT: xvftintrz.l.d $xr0, $xr0 ; CHECK-NEXT: xvpermi.d $xr1, $xr0, 238 -; CHECK-NEXT: xvfcvt.s.d $xr0, $xr1, $xr0-; CHECK-NEXT: xvftintrz.w.s $xr0, $xr0+; CHECK-NEXT: xvpickev.w $xr0, $xr1, $xr0 ; CHECK-NEXT: vst $vr0, $a0, 0 ; CHECK-NEXT: ret %v0 = load <4 x double>, ptr %in diff --git a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll index 27d70f33cd34e..9c499ba71d646 100644 --- a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll+++ b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll@@ -31,9 +31,9 @@ define void @fptoui_v4f64_v4i32(ptr %res, ptr %in){ ; CHECK-LABEL: fptoui_v4f64_v4i32: ; CHECK: # %bb.0: ; CHECK-NEXT: xvld $xr0, $a1, 0 +; CHECK-NEXT: xvftintrz.lu.d $xr0, $xr0 ; CHECK-NEXT: xvpermi.d $xr1, $xr0, 238 -; CHECK-NEXT: xvfcvt.s.d $xr0, $xr1, $xr0-; CHECK-NEXT: xvftintrz.w.s $xr0, $xr0+; CHECK-NEXT: xvpickev.w $xr0, $xr1, $xr0 ; CHECK-NEXT: vst $vr0, $a0, 0 ; CHECK-NEXT: ret %v0 = load <4 x double>, ptr %in |
; CHECK-NEXT: xvpermi.d $xr1, $xr0, 238 | ||
; CHECK-NEXT: xvfcvt.s.d $xr0, $xr1, $xr0 | ||
; CHECK-NEXT: xvftintrz.w.s $xr0, $xr0 | ||
; CHECK-NEXT: xvpickev.w $xr0, $xr1, $xr0 | ||
; CHECK-NEXT: vst $vr0, $a0, 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous impl appears to convert f64
to f32
before performing the integer conversion, which introduces a loss of precision at the f64 -> f32
step.
d: float64 s: float32 w: int32 *a1: d0, d1, d2, d3 xr0: d3, d2, d1, d0 // xvld $xr0, $a1, 0 xr1: d3, d2, d3, d2 // xvpermi.d $xr1, $xr0, 238 xr0: s3, s2, s3, s2, s3, s2, s1, s0 // xvfcvt.s.d $xr0, $xr1, $xr0 ^ +-- Loss of precision at this step xr0: w3, w2, w3, w2, w3, w2, w1, w0 // xvftintrz.w.s $xr0, $xr0 vr0: w3, w2, w1, w0 // vst $vr0, $a0, 0 *a0: w0, w1, w2, w3
In the updated version, it seems the f64
values are first converted to i64
and then truncated to u32
. This can produce incorrect results when the original f64
values exceed the range representable by (signed) i32
.
Would it make sense to go with a direct f64 -> i32
conversion instead?
*a1: d0, d1, d2, d3 xr0: d3, d2, d1, d0 // xvld $xr0, $a1, 0 xr1: d3, d2, d3, d2 // xvpermi.d $xr1, $xr0, 238 xr0: w3, w2, w3, w2, w3, w2, w1, w0 // xvftintrz.w.d $xr0, $xr1, $xr0 vr0: w3, w2, w1, w0 // vst $vr0, $a0, 0 *a0: w0, w1, w2, w3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3A6000 | latency | throughput |
---|---|---|
xvftintrz.w.d | 5 | 2 |
xvftintrz.l.d | 4 | 4 |
xvpickev.w | 1 | 4 |
xvftintrz.w.d
or xvftintrz.l.d + xvpickev.w
which one is faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to correct my earlier comment - it was inaccurate and has been deleted. :P
According to the LLVM IR fptosi
spec:
If the value cannot fit in ty2, the result is a poison value.
So unless overflow/invalid recording is required, using xvftintrz.l.d + xvpickev.w
is also valid, and this sequence may offer better throughput.
; CHECK-NEXT: xvpermi.d $xr1, $xr0, 238 | ||
; CHECK-NEXT: xvfcvt.s.d $xr0, $xr1, $xr0 | ||
; CHECK-NEXT: xvftintrz.w.s $xr0, $xr0 | ||
; CHECK-NEXT: xvpickev.w $xr0, $xr1, $xr0 | ||
; CHECK-NEXT: vst $vr0, $a0, 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
d: float64 l: int64 *a1: d0, d1, d2, d3 xr0: d3, d2, d1, d0 // xvld $xr0, $a1, 0 xr0: l3, l2, l1, l0 // xvftintrz.lu.d $xr0, $xr0 xr0: {l3h,l3l}, {l2h,l2l}, {l1h,l1l}, {l0h,l0l} xr1: l3, l2, l3, l2 // xvpermi.d $xr1, $xr0, 238 xr1: {l3h,l3l}, {l2h,l2l}, {l3h,l3l}, {l2h,l2l} xr0: l3l, l2l, l3l, l2l, l3l, l2l, l1l, l0l // xvpickev.w $xr0, $xr1, $xr0 vr0: l3l, l2l, l1l, l0l // vst $vr0, $a0, 0 *a0: l0l, l1l, l2l, l3l
Prvious
fp_to_uint/fp_to_sint
patterns forv4f64 -> v4i32
are wrong.Conversion error was triggered after pr #126456.