wolfCrypt implementations of LMS/HSS and XMSS/XMSS^MT signatures: build options and benchmarks (Intel x86)

At wolfSSL we’re excited about stateful hash-based signature schemes and CNSA 2.0, and we recently held a webinar on the subject. As you may recall, we previously added initial support for LMS/HSS and XMSS/XMSS^MT through external integration with the hash-sigs and xmss-reference implementations.

Recently, however, we completed our own wolfCrypt implementations of these algorithms, and we would like to share benchmark results and some of the available build options. In general, the wolfCrypt implementations of these signature schemes are faster, with more options for tuning build size and performance.

With that said, we’ll review some of the more relevant build options and benchmarking data for LMS/HSS and XMSS/XMSS^MT. These benchmarks were obtained on a Fedora 38 workstation with an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. Only a single core was used. For all tests, wolfSSL was built with --enable-intelasm to take advantage of assembly speedups. Note: LMS/HSS and XMSS/XMSS^MT support a very wide range of parameters. For the sake of conciseness, only a targeted range is benchmarked here.

LMS build options and benchmarking

The five main defines that customize the wolfCrypt LMS/HSS build are the following:

  • WOLFSSL_LMS_LARGE_CACHES
  • WOLFSSL_WC_LMS_SMALL
  • WOLFSSL_LMS_MAX_LEVELS=N
  • WOLFSSL_LMS_MAX_HEIGHT=H
  • WOLFSSL_LMS_VERIFY_ONLY

The define WOLFSSL_LMS_LARGE_CACHES caches more of the authentication path in memory, speeding up signing operations for larger height trees.

The define WOLFSSL_WC_LMS_SMALL reduces code size and memory use overall, with the tradeoff of much slower signing operations. However, the performance impact on verification is negligible.

The defines WOLFSSL_LMS_MAX_LEVELS and WOLFSSL_LMS_MAX_HEIGHT set compile-time limits on the size of the LMS/HSS hypertree, and mainly reduce code footprint without impacting performance. These can be used to slim the build size if you are only interested in a specific range of parameter sets. More specifically, WOLFSSL_LMS_MAX_LEVELS sets the max allowed levels in HSS (the number of tree levels in the hypertree), while WOLFSSL_LMS_MAX_HEIGHT sets the max allowed height per tree for both LMS and HSS.
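To make the effect of these two limits concrete, here is a minimal sketch in plain C. It assumes the limits behave as described above (a parameter set is usable only if its level count and per-tree height are within the compiled-in maximums) and uses the fact that an HSS key with L levels of height-H trees can produce 2^(L*H) signatures. The DEMO_ macros and demo_ functions are hypothetical names for illustration and are not part of the wolfCrypt API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for WOLFSSL_LMS_MAX_LEVELS and
 * WOLFSSL_LMS_MAX_HEIGHT; real values come from your build configuration. */
#define DEMO_LMS_MAX_LEVELS 2
#define DEMO_LMS_MAX_HEIGHT 10

/* True if an HSS parameter set (levels, per-tree height) fits the limits. */
static bool demo_lms_params_fit(int levels, int height)
{
    return levels >= 1 && levels <= DEMO_LMS_MAX_LEVELS &&
           height >= 1 && height <= DEMO_LMS_MAX_HEIGHT;
}

/* Total signatures available from one HSS key: 2^(levels * height).
 * Returns 0 if the inputs are invalid or the count does not fit in 64 bits. */
static uint64_t demo_hss_signature_count(int levels, int height)
{
    int bits = levels * height;

    if (levels < 1 || height < 1 || bits >= 64)
        return 0;
    return (uint64_t)1 << bits;
}
```

With these example limits, an L2_H10 parameter set fits and provides 2^20 signatures, while an L3_H5 parameter set would be rejected because it needs three levels.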

The define WOLFSSL_LMS_VERIFY_ONLY restricts the build to a smaller verify-only subset (LMS API and data structures needed for keygen/signing are omitted). This does not impact verify performance, and is intended for embedded targets that need verify-only functionality (e.g. wolfBoot). WOLFSSL_LMS_VERIFY_ONLY can be combined with WOLFSSL_WC_LMS_SMALL, WOLFSSL_LMS_MAX_LEVELS, and WOLFSSL_LMS_MAX_HEIGHT for further footprint reduction.
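As an illustration of combining these options, a verify-only configuration might look like the following user_settings.h fragment. This is a sketch under stated assumptions: WOLFSSL_HAVE_LMS and WOLFSSL_WC_LMS are assumed here to be the defines that enable LMS/HSS and select the wolfCrypt implementation, and the limit values are examples chosen to match the L2_H10 parameter sets benchmarked below. Check them against your wolfSSL version before relying on them.

```c
/* user_settings.h fragment: verify-only LMS/HSS with a reduced footprint. */
#define WOLFSSL_HAVE_LMS           /* assumption: enables LMS/HSS support */
#define WOLFSSL_WC_LMS             /* assumption: selects the wolfCrypt implementation */
#define WOLFSSL_LMS_VERIFY_ONLY    /* omit keygen/signing API and data structures */
#define WOLFSSL_WC_LMS_SMALL       /* smaller code size and memory use */
#define WOLFSSL_LMS_MAX_LEVELS 2   /* example limit: at most 2 HSS levels */
#define WOLFSSL_LMS_MAX_HEIGHT 10  /* example limit: trees up to height 10 */
```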

In Table 1 we show benchmarking results (obtained with ./wolfcrypt/benchmark/benchmark -lms_hss) for these different build options, with the external LMS/HSS implementation provided for comparison.

In general we see the default wolfCrypt LMS/HSS performance (wc_lms) is much faster than the external integration (ext_lms) for all categories of operation (keygen, signing, verifying). The WOLFSSL_LMS_LARGE_CACHES (wc_lms large) option speeds up signing operations for larger height trees, but otherwise does not impact performance. The small variations in verify speed across wc_lms, wc_lms large, and wc_lms small are likely just system noise and do not represent a systematic trend. The WOLFSSL_WC_LMS_SMALL option (wc_lms small) significantly reduces signing speed, but leaves verification speed basically unchanged, making this option attractive for verify-only applications in embedded systems.


Table 1: Comparison of wolfCrypt LMS/HSS (wc_lms), wolfCrypt LMS/HSS with WOLFSSL_LMS_LARGE_CACHES (wc_lms large), wolfCrypt LMS/HSS with WOLFSSL_WC_LMS_SMALL (wc_lms small), and the external integration implementation (ext_lms). All values in units of ops/sec.

Parameters, op        wc_lms  wc_lms large  wc_lms small   ext_lms
L2_H10_W2 keygen       6.482         6.494        12.828     1.330
L2_H10_W2 sign      4437.469      5521.796         6.526   786.083
L2_H10_W2 verify   13954.450     14087.794     13874.450  4789.383
L2_H10_W4 keygen       3.567         3.592         6.954     0.764
L2_H10_W4 sign      2452.361      3052.326         3.562   443.225
L2_H10_W4 verify    6482.891      6707.271      6962.215  2281.440
L3_H5_W4 keygen       70.926        73.673       227.376    17.467
L3_H5_W4 sign       4660.370      4669.019        74.653   820.640
L3_H5_W4 verify     4632.118      4670.963      4790.742  1756.355
L3_H5_W8 keygen        9.395         9.413        29.041     2.265
L3_H5_W8 sign        609.408       605.199         9.542   106.059
L3_H5_W8 verify      561.759       554.635       573.341   214.093
L3_H10_W4 keygen       2.384         2.368         7.128     0.569
L3_H10_W4 sign      2459.698      3067.848         2.376   444.601
L3_H10_W4 verify    4895.203      4345.130      4793.853  1618.676
L4_H5_W8 keygen        7.045         7.017        29.258     1.770
L4_H5_W8 sign        608.915       607.318         7.168   106.881
L4_H5_W8 verify      446.384       443.804       438.542   145.672

Graph 1: Signing speeds for wolfCrypt LMS/HSS (wc_lms), wolfCrypt LMS/HSS with WOLFSSL_LMS_LARGE_CACHES (wc_lms large), and the external integration implementation (ext_lms). All values in units of ops/sec.
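To put a number on "much faster", the ratio of two throughput figures can be computed directly from Table 1. The short sketch below does this for the L2_H10_W2 signing and verification rows. The demo_ names are illustrative only, and the constants are copied from the table.

```c
/* Ratio of two throughputs in ops/sec.
 * Returns 0.0 if the baseline is not positive. */
static double demo_speedup(double ops_new, double ops_baseline)
{
    return ops_baseline > 0.0 ? ops_new / ops_baseline : 0.0;
}

/* L2_H10_W2 rows from Table 1 (ops/sec). */
static const double demo_wc_lms_sign    = 4437.469;
static const double demo_ext_lms_sign   = 786.083;
static const double demo_wc_lms_verify  = 13954.450;
static const double demo_ext_lms_verify = 4789.383;
```

Calling demo_speedup(demo_wc_lms_sign, demo_ext_lms_sign) gives roughly 5.6, and demo_speedup(demo_wc_lms_verify, demo_ext_lms_verify) gives roughly 2.9, so for this parameter set wc_lms signs about 5.6x faster and verifies about 2.9x faster than ext_lms.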

XMSS build options and benchmarking

Three important defines that customize the wc_xmss build are:

  • WOLFSSL_WC_XMSS_SMALL
  • WOLFSSL_XMSS_MAX_HEIGHT=N
  • WOLFSSL_XMSS_VERIFY_ONLY

The define WOLFSSL_WC_XMSS_SMALL reduces code size and memory use overall, with the tradeoff of much slower signing operations and roughly 25-35% slower verification (see Table 2).

The define WOLFSSL_XMSS_MAX_HEIGHT=N sets compile time limits on the max height of the hypertree, and mainly reduces code size without impacting performance.
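When choosing a height limit, it helps to read the height out of the parameter name. In names such as XMSS-SHA2_10_256 and XMSSMT-SHA2_20/2_256 (used in Table 2), the first number is the total height h of the hypertree and, for XMSS^MT, the number after the slash is the layer count d, so each individual tree has height h/d. The sketch below parses these two name forms with standard C only. The demo_ names are hypothetical and not part of the wolfCrypt API.

```c
#include <stdio.h>

/* Parsed fields of an XMSS / XMSS^MT SHA-2 parameter name. */
typedef struct {
    int total_height; /* h: height of the whole hypertree */
    int layers;       /* d: number of tree layers (1 for plain XMSS) */
    int tree_height;  /* h / d: height of each individual tree */
} demo_xmss_params;

/* Parse names like "XMSS-SHA2_10_256" or "XMSSMT-SHA2_20/2_256".
 * Returns 0 on success, -1 if the name is not in one of these two forms. */
static int demo_xmss_parse(const char *name, demo_xmss_params *out)
{
    int h = 0, d = 1, n = 0;

    /* Try the multi-tree form first, then fall back to single-tree XMSS. */
    if (sscanf(name, "XMSSMT-SHA2_%d/%d_%d", &h, &d, &n) != 3) {
        d = 1;
        if (sscanf(name, "XMSS-SHA2_%d_%d", &h, &n) != 2)
            return -1;
    }
    if (h <= 0 || d <= 0 || n <= 0 || h % d != 0)
        return -1;

    out->total_height = h;
    out->layers = d;
    out->tree_height = h / d;
    return 0;
}
```

For example, XMSSMT-SHA2_20/4_256 parses to a total height of 20 split into four layers of height-5 trees, while XMSSMT-SHA2_20/2_256 uses two layers of height-10 trees. The smaller trees may be one reason the 20/4 keygen figures in Table 2 are higher than the 20/2 figures.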

The define WOLFSSL_XMSS_VERIFY_ONLY restricts the build to a smaller verify-only subset, and can be combined with WOLFSSL_WC_XMSS_SMALL and WOLFSSL_XMSS_MAX_HEIGHT for further size reduction. It does not impact verify performance.

In Table 2 we show benchmarking results for XMSS/XMSS^MT for these options (obtained with ./wolfcrypt/benchmark/benchmark -xmss_xmssmt_sha256), with the external XMSS/XMSS^MT implementation for comparison. The default wolfCrypt XMSS/XMSS^MT (wc_xmss) is faster than the external integration (ext_xmss) for all operations. The gap between wc_xmss and ext_xmss is smaller than the gap between wc_lms and ext_lms, though, because ext_xmss can benefit from assembly speedups whereas ext_lms cannot. Similar to LMS, the WOLFSSL_WC_XMSS_SMALL option (wc_xmss small) significantly reduces signing performance, but verify speeds remain fast, making this a good option for embedded verify-only targets.

Table 2: Comparison of wolfCrypt XMSS/XMSS^MT (wc_xmss), wolfCrypt XMSS/XMSS^MT with WOLFSSL_WC_XMSS_SMALL (wc_xmss small), and the external integration implementation (ext_xmss). All values in units of ops/sec.

Parameters, op                   wc_xmss  wc_xmss small  ext_xmss
XMSS-SHA2_10_256 keygen            1.587          1.079     0.943
XMSS-SHA2_10_256 sign            363.693          1.106   226.782
XMSS-SHA2_10_256 verify         3050.276       2044.995  1892.234
XMSSMT-SHA2_20/2_256 keygen        0.808          1.100     0.472
XMSSMT-SHA2_20/2_256 sign        298.138          0.551   191.214
XMSSMT-SHA2_20/2_256 verify     1307.295        982.836   852.348
XMSSMT-SHA2_20/4_256 keygen        9.880         35.274     7.309
XMSSMT-SHA2_20/4_256 sign        390.942          8.681   290.516
XMSSMT-SHA2_20/4_256 verify      729.433        517.298   443.444
XMSSMT-SHA2_40/4_256 keygen        0.406          1.107     0.237
XMSSMT-SHA2_40/4_256 sign        294.738          0.276   161.656
XMSSMT-SHA2_40/4_256 verify      750.591        487.257   424.986
XMSSMT-SHA2_40/8_256 keygen        5.604         35.318     3.755
XMSSMT-SHA2_40/8_256 sign        469.764          4.374   293.184
XMSSMT-SHA2_40/8_256 verify      361.289        262.160   225.254
XMSSMT-SHA2_60/6_256 keygen        0.266          1.099     0.159
XMSSMT-SHA2_60/6_256 sign        280.160          0.185   144.637
XMSSMT-SHA2_60/6_256 verify      521.610        352.718   295.882
XMSSMT-SHA2_60/12_256 keygen       4.143         35.280     2.505
XMSSMT-SHA2_60/12_256 sign       514.658          2.910   292.371
XMSSMT-SHA2_60/12_256 verify     247.682        170.459   152.471

Graph 2: Verify speeds for wolfCrypt XMSS/XMSS^MT (wc_xmss), wolfCrypt XMSS/XMSS^MT with WOLFSSL_WC_XMSS_SMALL (wc_xmss small), and the external integration implementation (ext_xmss). All values in units of ops/sec.

Conclusions

In general, our wolfCrypt implementations of LMS/HSS and XMSS/XMSS^MT are significantly faster than the external reference implementations. In the tables above, speedups range from roughly 1.3x to nearly 2x for XMSS/XMSS^MT and from roughly 2.6x to more than 5x for LMS/HSS, depending on the combination of operation, algorithm, and parameters.

The small footprint build shows fast verification speeds for all parameters, making it an attractive choice for embedded verify-only applications (e.g. wolfBoot).

Overall, our LMS/HSS implementation is faster than XMSS/XMSS^MT (at least on x86), which is consistent with what is known about these two methods. However, which of the two is more appropriate for your use case will ultimately depend on other factors as well, such as signature size, target environment, and parameters used.

If you’re interested in learning more about our post-quantum work, or want to learn more about stateful hash-based signature schemes, contact us at wolfSSL by emailing facts@wolfSSL.com or calling us at +1 425 245 8247 to reach out to your regional wolfSSL business director.
