[LoongArch] Fix fp_to_uint/fp_to_sint conversion errors for lasx #137129


Open · wants to merge 1 commit into main
Conversation

tangaac
Contributor

Previous fp_to_uint/fp_to_sint patterns for v4f64 -> v4i32 were wrong.
The conversion error was triggered after PR #126456.

@llvmbot
Member

@llvm/pr-subscribers-backend-loongarch

Author: None (tangaac)

Changes

Previous fp_to_uint/fp_to_sint patterns for v4f64 -> v4i32 were wrong.
The conversion error was triggered after PR #126456.


Full diff: https://github.com/llvm/llvm-project/pull/137129.diff

3 Files Affected:

  • (modified) llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td (+14-14)
  • (modified) llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll (+2-2)
  • (modified) llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll (+2-2)
diff --git a/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td b/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
index e4268920e0b27..fe08c1050b4d7 100644
--- a/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
+++ b/llvm/lib/Target/LoongArch/LoongArchLASXInstrInfo.td
@@ -1815,24 +1815,24 @@ def : Pat<(v4f32 (uint_to_fp v4i64:$vj)),
 // XVFTINTRZ_{W_S/L_D}
 def : Pat<(v8i32 (fp_to_sint v8f32:$vj)), (XVFTINTRZ_W_S v8f32:$vj)>;
 def : Pat<(v4i64 (fp_to_sint v4f64:$vj)), (XVFTINTRZ_L_D v4f64:$vj)>;
-def : Pat<(v4i64 (fp_to_sint v4f32:$vj)),
-          (VEXT2XV_D_W (SUBREG_TO_REG (i64 0), (VFTINTRZ_W_S v4f32:$vj),
-                       sub_128))>;
-def : Pat<(v4i32 (fp_to_sint (v4f64 LASX256:$vj))),
-          (EXTRACT_SUBREG (XVFTINTRZ_W_S (XVFCVT_S_D (XVPERMI_D v4f64:$vj, 238),
-                                          v4f64:$vj)),
-                          sub_128)>;
+def : Pat<(v4i64(fp_to_sint v4f32:$vj)), (VEXT2XV_D_W(SUBREG_TO_REG(i64 0),
+                                          (VFTINTRZ_W_S v4f32:$vj),
+                                          sub_128))>;
+def : Pat<(v4i32(fp_to_sint v4f64:$vj)),
+          (EXTRACT_SUBREG(XVPICKEV_W(XVPERMI_D(XVFTINTRZ_L_D v4f64:$vj), 238),
+                          (XVFTINTRZ_L_D v4f64:$vj)),
+          sub_128)>;
 // XVFTINTRZ_{W_SU/L_DU}
 def : Pat<(v8i32 (fp_to_uint v8f32:$vj)), (XVFTINTRZ_WU_S v8f32:$vj)>;
 def : Pat<(v4i64 (fp_to_uint v4f64:$vj)), (XVFTINTRZ_LU_D v4f64:$vj)>;
-def : Pat<(v4i64 (fp_to_uint v4f32:$vj)),
-          (VEXT2XV_DU_WU (SUBREG_TO_REG (i64 0), (VFTINTRZ_WU_S v4f32:$vj),
-                         sub_128))>;
-def : Pat<(v4i32 (fp_to_uint (v4f64 LASX256:$vj))),
-          (EXTRACT_SUBREG (XVFTINTRZ_W_S (XVFCVT_S_D (XVPERMI_D v4f64:$vj, 238),
-                                          v4f64:$vj)),
-                          sub_128)>;
+def : Pat<(v4i64(fp_to_uint v4f32:$vj)), (VEXT2XV_DU_WU(SUBREG_TO_REG(i64 0),
+                                          (VFTINTRZ_WU_S v4f32:$vj),
+                                          sub_128))>;
+def : Pat<(v4i32(fp_to_uint v4f64:$vj)),
+          (EXTRACT_SUBREG(XVPICKEV_W(XVPERMI_D(XVFTINTRZ_LU_D v4f64:$vj), 238),
+                          (XVFTINTRZ_LU_D v4f64:$vj)),
+          sub_128)>;
 // XVPERMI_Q
 foreach vt = [v32i8, v16i16, v8i32, v4i64, v8f32, v4f64] in

diff --git a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll
index 0d9f57b57ffae..ed333c303879c 100644
--- a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll
+++ b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptosi.ll
@@ -31,9 +31,9 @@ define void @fptosi_v4f64_v4i32(ptr %res, ptr %in){
 ; CHECK-LABEL: fptosi_v4f64_v4i32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    xvld $xr0, $a1, 0
+; CHECK-NEXT:    xvftintrz.l.d $xr0, $xr0
 ; CHECK-NEXT:    xvpermi.d $xr1, $xr0, 238
-; CHECK-NEXT:    xvfcvt.s.d $xr0, $xr1, $xr0
-; CHECK-NEXT:    xvftintrz.w.s $xr0, $xr0
+; CHECK-NEXT:    xvpickev.w $xr0, $xr1, $xr0
 ; CHECK-NEXT:    vst $vr0, $a0, 0
 ; CHECK-NEXT:    ret
   %v0 = load <4 x double>, ptr %in

diff --git a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll
index 27d70f33cd34e..9c499ba71d646 100644
--- a/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll
+++ b/llvm/test/CodeGen/LoongArch/lasx/ir-instruction/fptoui.ll
@@ -31,9 +31,9 @@ define void @fptoui_v4f64_v4i32(ptr %res, ptr %in){
 ; CHECK-LABEL: fptoui_v4f64_v4i32:
 ; CHECK:       # %bb.0:
 ; CHECK-NEXT:    xvld $xr0, $a1, 0
+; CHECK-NEXT:    xvftintrz.lu.d $xr0, $xr0
 ; CHECK-NEXT:    xvpermi.d $xr1, $xr0, 238
-; CHECK-NEXT:    xvfcvt.s.d $xr0, $xr1, $xr0
-; CHECK-NEXT:    xvftintrz.w.s $xr0, $xr0
+; CHECK-NEXT:    xvpickev.w $xr0, $xr1, $xr0
 ; CHECK-NEXT:    vst $vr0, $a0, 0
 ; CHECK-NEXT:    ret
   %v0 = load <4 x double>, ptr %in
 ; CHECK-NEXT:    xvpermi.d $xr1, $xr0, 238
-; CHECK-NEXT:    xvfcvt.s.d $xr0, $xr1, $xr0
-; CHECK-NEXT:    xvftintrz.w.s $xr0, $xr0
+; CHECK-NEXT:    xvpickev.w $xr0, $xr1, $xr0
 ; CHECK-NEXT:    vst $vr0, $a0, 0
Member

@heiher · Apr 24, 2025


The previous impl appears to convert f64 to f32 before performing the integer conversion, which introduces a loss of precision at the f64 -> f32 step.

d: float64  s: float32  w: int32
*a1: d0, d1, d2, d3
xr0: d3, d2, d1, d0                            // xvld $xr0, $a1, 0
xr1: d3, d2, d3, d2                            // xvpermi.d $xr1, $xr0, 238
xr0: s3, s2, s3, s2, s3, s2, s1, s0            // xvfcvt.s.d $xr0, $xr1, $xr0
                                  ^
                                  +-- Loss of precision at this step
xr0: w3, w2, w3, w2, w3, w2, w1, w0            // xvftintrz.w.s $xr0, $xr0
vr0: w3, w2, w1, w0                            // vst $vr0, $a0, 0
*a0: w0, w1, w2, w3

In the updated version, it seems the f64 values are first converted to i64 and then truncated to u32. This can produce incorrect results when the original f64 values exceed the range representable by (signed) i32.

Would it make sense to go with a direct f64 -> i32 conversion instead?

*a1: d0, d1, d2, d3
xr0: d3, d2, d1, d0                            // xvld $xr0, $a1, 0
xr1: d3, d2, d3, d2                            // xvpermi.d $xr1, $xr0, 238
xr0: w3, w2, w3, w2, w3, w2, w1, w0            // xvftintrz.w.d $xr0, $xr1, $xr0
vr0: w3, w2, w1, w0                            // vst $vr0, $a0, 0
*a0: w0, w1, w2, w3
Contributor (Author)

@tangaac · Apr 25, 2025


3A6000        | latency | throughput
xvftintrz.w.d | 5       | 2
xvftintrz.l.d | 4       | 4
xvpickev.w    | 1       | 4

Which one is faster: xvftintrz.w.d alone, or xvftintrz.l.d + xvpickev.w?

Member


Just to correct my earlier comment - it was inaccurate and has been deleted. :P

According to the LLVM IR fptosi spec:

If the value cannot fit in ty2, the result is a poison value.

So unless overflow/invalid recording is required, using xvftintrz.l.d + xvpickev.w is also valid, and this sequence may offer better throughput.

 ; CHECK-NEXT:    xvpermi.d $xr1, $xr0, 238
-; CHECK-NEXT:    xvfcvt.s.d $xr0, $xr1, $xr0
-; CHECK-NEXT:    xvftintrz.w.s $xr0, $xr0
+; CHECK-NEXT:    xvpickev.w $xr0, $xr1, $xr0
 ; CHECK-NEXT:    vst $vr0, $a0, 0
Member

@heiher · Apr 24, 2025


LGTM.

d: float64  l: int64
*a1: d0, d1, d2, d3
xr0: d3, d2, d1, d0                            // xvld $xr0, $a1, 0
xr0: l3, l2, l1, l0                            // xvftintrz.lu.d $xr0, $xr0
   = {l3h,l3l}, {l2h,l2l}, {l1h,l1l}, {l0h,l0l}
xr1: l3, l2, l3, l2                            // xvpermi.d $xr1, $xr0, 238
   = {l3h,l3l}, {l2h,l2l}, {l3h,l3l}, {l2h,l2l}
xr0: l3l, l2l, l3l, l2l, l3l, l2l, l1l, l0l    // xvpickev.w $xr0, $xr1, $xr0
vr0: l3l, l2l, l1l, l0l                        // vst $vr0, $a0, 0
*a0: l0l, l1l, l2l, l3l