
This is my C++ implementation of "Computing π(x): the combinatorial method" by Tomás Oliveira e Silva. The math involved is elementary number theory, but fiddly and not the focus here. I have implemented all the optimizations that the paper describes in detail.

My main concern is the structure of the program. It started off with X, Z, C, alpha cbrt X, etc. all being global variables, since they're needed by all the algorithms and contained within the main file. Since I am relatively new to C++, I decided to try refactoring related variables together into a struct/class; I also thought it would make unit testing easier. However, I am not that happy with the structure (aside from PhiBlock): I can't decide between making more little structs to hold the variables only used by each algorithm, and putting everything in just this one struct.
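For illustration, the "little structs" option would look roughly like this (a sketch with hypothetical names, grouping the per-b state that only one algorithm touches; this is not code I have committed to):

struct S1State
{
    vector<int64_t> m;   // S1's decreasing m, one entry per b
};

struct S2State
{
    vector<int64_t> d2;  // S2's decreasing d, one entry per b
    vector<int64_t> t;   // per-b leaf type / step tracker
    vector<int64_t> S2;  // per-b partial sums
};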

I am also not happy with Primecount's constructor, as it seems like being forced to have a "default constructor" for everything just makes the code more disorganized.

#include <vector>
#include <iostream>
#include <cmath>
#include <cassert>

#include "fenwick_tree.hpp"

using namespace std;

// signum: returns -1, 0, or 1
int sgn(int64_t x)
{
    return (x > 0) - (x < 0);
}

// ceil(x/y) for positive integers
int64_t ceil_div(int64_t x, int64_t y)
{
    return x / y + (x % y > 0);
}

// only ints
int64_t cube(int64_t n)
{
    return n * n * n;
}

// Represents Bk = [zk1, zk) and functions to compute phi(y,b)
// The physical block size is half the logical size by only storing odd values
// For example, [51, 101) would map to ind [0, 25) via y -> (y-zk1)/2
struct PhiBlock
{
    int64_t bsize;              // (logical) block size
    int64_t zk1;                // z_{k-1}, block lower bound (inclusive)
    int64_t zk;                 // z_k, block upper bound (exclusive)
    vector<bool> ind;           // 0/1 to track [pmin(y) > pb]
    fenwick_tree phi_sum;       // data structure for efficient partial sums
    vector<int64_t> phi_save;   // phi_save(k,b) = phi(zk1-1,b) from prev block
                                // b is only explicitly used here

    // init block at k=1
    // TODO: phi_sum gets overwritten by first new_block anyway?
    PhiBlock(int64_t bsize, int64_t a) :
        bsize(bsize),
        zk1(1),
        zk(bsize + 1),
        ind(bsize/2, 1),
        phi_sum(ind),
        phi_save(a + 1, 0)
    {
        assert(bsize % 2 == 0); // requires even size
    }

    void new_block(int64_t k)
    {
        zk1 = bsize * (k-1) + 1;
        zk  = bsize * k + 1;
        ind.assign(bsize/2, 1);
        phi_sum = fenwick_tree(ind);
        // does not reset phi_save!
        //cout << "block [" << zk1 << "," << zk << ")\n";
    }

    // phi(y,b) compute
    int64_t sum_to(int64_t y, int64_t b) const
    {
        assert(y >= zk1);
        assert(y < zk);
        return phi_save[b] + phi_sum.sum_to((y - zk1)/2);
    }

    // sieve out p_b for this block
    void sieve_out(int64_t pb)
    {
        assert(pb > 2);

        int64_t jstart = pb * ceil_div(zk1, pb);
        if (jstart % 2 == 0)
            jstart += pb; // ensure odd

        for (int64_t j = jstart; j < zk; j += 2*pb)
        {
            if (ind[(j-zk1)/2])   // not marked yet
            {
                phi_sum.add_to((j-zk1)/2, -1);
                ind[(j-zk1)/2] = 0;
            }
        }
    }

    void update_save(int64_t b)
    {
        phi_save[b] += phi_sum.sum_to(bsize/2-1);
    }
};

// Phi2 saved variables
struct Phi2
{
    int64_t u;          // largest p_b not considered yet
    int64_t v;          // count number of primes up to sqrt(x)
    int64_t w;          // track first integer represented in aux
    vector<bool> aux;   // auxiliary sieve to track primes found up to sqrt(x)
    bool done;          // flag for not accidentally running when terminated
};

struct Primecount
{
    // tuning parameters
    int64_t ALPHA;      // tuning parameter
    int64_t C = 8;      // precompute phi_c parameter
    int64_t BS;         // sieve block size

    // global constants
    int64_t X;          // Main value to compute pi(X) for
    int64_t Q;          // phi_c table size, shouldn't be too large
    int64_t Z;          // X^(2/3) / alpha (approx)
    int64_t ISQRTX;     // floor(sqrt(X))
    int64_t IACBRTX;    // floor(alpha cbrt(X))
    int64_t a;          // pi(alpha cbrt(X))
    int64_t astar;      // p_a*^2 = alpha cbrt(X), i.e. a* = pi(sqrt(alpha cbrt X))

    // precomputed tables
    vector<int64_t> MU_PMIN;     // mu(n) pmin(n) for [1,acbrtx]
    vector<int64_t> PRIMES;      // primes <= acbrtx
    vector<int64_t> PRIME_COUNT; // pi(x) over [1,acbrtx]
    vector<int32_t> PHI_C;       // phi(x,c) over [1,Q]

    Primecount(int64_t x, int64_t alpha, int64_t bs) :
        ALPHA(alpha),
        BS(bs),
        X(x),
        // Q after sieve
        Z(cbrt(X) * cbrt(X) / ALPHA), // approx
        ISQRTX(sqrt(X)),
        IACBRTX(ALPHA * cbrt(X))
    {
        // check alpha isn't set too large
        assert(ALPHA <= pow(X, 1/6.));

        // ensure floating point truncated values are exact floors
        assert(ISQRTX*ISQRTX <= X && (ISQRTX+1)*(ISQRTX+1) > X);
        assert(cube(IACBRTX) <= cube(ALPHA) * X &&
               cube(IACBRTX+1) > cube(ALPHA) * X);

        cout << "Computing for X = " << X << endl;
        cout << "Z = " << Z << endl;
        cout << "IACBRTX = " << IACBRTX << endl;
        cout << "ISQRTX = " << ISQRTX << endl;

        // precompute PRIMES, PRIME_COUNT, MU_PMIN
        sieve_mu_prime();
        pre_phi_c();

        a = PRIME_COUNT[IACBRTX];
        cout << "a = " << a << endl;

        assert(PRIMES.size() > (size_t)a + 1); // need p_{a+1}

        astar = 1;
        while(PRIMES[astar+1] * PRIMES[astar+1] <= IACBRTX)
            ++astar;

        cout << "a* = " << astar << endl;
    }

    // precompute PRIMES, PRIME_COUNT, MU_PMIN with a standard sieve to acbrtx
    void sieve_mu_prime(void)
    {
        // Since p_{a+1} may be needed in S2, we introduce fudge factor
        // and hope it's less than the prime gap
        int64_t SIEVE_SIZE = IACBRTX + 200;
        MU_PMIN.assign(SIEVE_SIZE+1, 1);    // init to 1s
        MU_PMIN[1] = 1000;                  // define pmin(1) = +inf
        PRIMES.push_back(1);                // p0 = 1 by convention
        PRIME_COUNT.resize(SIEVE_SIZE+1);   // init values don't matter here

        int64_t i, j;
        int64_t prime_counter = 0;

        // sieve of Eratosthenes, modification to keep track of mu sign and pmin
        for (j = 2; j <= SIEVE_SIZE; ++j)
        {
            if (MU_PMIN[j] == 1)   // unmarked, so it is prime
            {
                for (i = j; i <= SIEVE_SIZE; i += j)
                {
                    MU_PMIN[i] = (MU_PMIN[i] == 1) ? -j : -MU_PMIN[i];
                }
            }
        }

        // complete MU_PMIN, compute PRIMES and PRIME_COUNT
        for (j = 2; j <= SIEVE_SIZE; ++j)
        {
            if (MU_PMIN[j] == -j)   // prime
            {
                PRIMES.push_back(j);
                ++prime_counter;

                // mark multiples of p^2 as 0 for mu
                for (i = j*j; i <= SIEVE_SIZE; i += j*j)
                    MU_PMIN[i] = 0;
            }
            PRIME_COUNT[j] = prime_counter;
        }
    }

    // Precompute phi(x,c)
    void pre_phi_c(void)
    {
        // compute Q as product of first C primes
        Q = 1;
        for (int64_t i = 1; i <= C; ++i)
            Q *= PRIMES[i];

        PHI_C.resize(Q+1, 1); // index up to Q inclusive
        PHI_C[0] = 0;

        for (int64_t i = 1; i <= C; ++i)
        {
            int64_t p = PRIMES[i]; // ith prime, mark multiples as 0
            for (int64_t j = p; j <= Q; j += p)
                PHI_C[j] = 0;
        }

        // accumulate
        for (int64_t i = 1; i <= Q; ++i)
            PHI_C[i] += PHI_C[i-1];
    }

    // contribution of ordinary leaves to phi(x,a)
    int64_t S0_compute(void)
    {
        int64_t S0 = 0;
        for (int64_t n = 1; n <= IACBRTX; ++n)
        {
            if (abs(MU_PMIN[n]) > PRIMES[C])
            {
                int64_t y = X / n;
                // use precomputed PHI_C table
                int64_t phi_yc = (y / Q) * PHI_C[Q] + PHI_C[y % Q];
                S0 += sgn(MU_PMIN[n]) * phi_yc;
            }
        }
        return S0;
    }

    // Algorithm 1
    int64_t S1_iter(int64_t b, const PhiBlock& phi_block, vector<int64_t>& m)
    {
        int64_t S1 = 0;
        int64_t pb1 = PRIMES[b+1];

        // m decreasing
        for (; m[b] * pb1 > IACBRTX; --m[b])
        {
            int64_t y = X / (m[b] * pb1);

            assert(y >= phi_block.zk1);
            if (y >= phi_block.zk) break;

            if (abs(MU_PMIN[m[b]]) > pb1)
            {
                S1 -= sgn(MU_PMIN[m[b]]) * phi_block.sum_to(y, b);
            }
        }
        return S1;
    }

    // Algorithm 2 hell
    void S2_iter(int64_t b,
                 const PhiBlock& phi_block,
                 vector<int64_t>& d2,
                 vector<int64_t>& t,
                 vector<int64_t>& S2)
    {
        int64_t pb1 = PRIMES[b+1];

        while (d2[b] > b + 1) // step 2, main loop
        {
            int64_t y = X / (pb1 * PRIMES[d2[b]]);

            if (t[b] == 2) // hard leaves
            {
                if (y >= phi_block.zk) // step 7
                {
                    break; // pause until next block
                }
                else // step 8 (contribution using phi_block)
                {
                    S2[b] += phi_block.sum_to(y, b);
                    --d2[b]; // repeat loop
                }
            }
            else // t = 0 or 1, easy leaves
            {
                if (y >= IACBRTX)
                {
                    t[b] = 2;
                    // since t = 2 is set and d2 didn't change, the new loop
                    // will go to step 7
                }
                else // step 3/5
                {
                    int64_t l = PRIME_COUNT[y] - b + 1;

                    if (t[b] == 0) // step 3
                    {
                        // d' + 1 is the smallest d for which (12) holds:
                        // phi(x / (pb1*pd), b) = pi(x / (pb1*pd)) - b + 1
                        int64_t d_ = PRIME_COUNT[X / (pb1 * PRIMES[b+l])];

                        // step 4
                        if ((PRIMES[d_+1]*PRIMES[d_+1] <= X / pb1) || (d_ <= b))
                        {
                            t[b] = 1; // goto step 6
                        }
                        else // step 5, clustered easy leaves
                        {
                            S2[b] += l * (d2[b] - d_);
                            d2[b] = d_;
                        }
                    }

                    if (t[b] == 1) // sparse easy leaves
                    {
                        // step 6
                        S2[b] += l;
                        --d2[b];
                    }
                }
            }
        }
        return; // terminate
    }

    // Algorithm 3: computation of phi2(x,a)
    void P2_iter(const PhiBlock& phi_block, Phi2& P, int64_t& P2)
    {
        // step 3 loop (u decrement steps moved here)
        for (; P.u > IACBRTX; --P.u)
        {
            if (P.u < P.w)
            {
                // new aux sieve [w,u] of size IACBRTX+1
                P.w = max((int64_t)2, P.u - IACBRTX);
                P.aux.assign(P.u - P.w + 1, true);

                for (int64_t i = 1; ; ++i)
                {
                    int64_t p = PRIMES[i];
                    if (p*p > P.u) break;

                    // only need to sieve values starting at p*p within [w,u]
                    for (int64_t j = max(p*p, p*ceil_div(P.w,p)); j <= P.u; j += p)
                        P.aux[j-P.w] = false;
                }
            }

            // check u to track largest pb not considered yet
            if (P.aux[P.u-P.w]) // prime
            {
                int64_t y = X / P.u;
                if (y >= phi_block.zk)
                    return; // finish this block

                // phi(y,a)
                int64_t phi = phi_block.sum_to(y, a);
                P2 += phi + a - 1;
                ++P.v; // count new prime
            }
        }

        // step 3 terminate
        P2 -= P.v * (P.v - 1) / 2;
        P.done = true;
        return;
    }

    int64_t primecount();
};

int64_t Primecount::primecount(void)
{
    // Sum accumulators
    int64_t S1 = 0;
    vector<int64_t> S2(a-1);
    vector<int64_t> t(a-1);
    int64_t P2 = a * (a-1) / 2;

    PhiBlock phi_block(BS, a);

    vector<int64_t> m(astar, IACBRTX); // S1 decreasing m
    vector<int64_t> d2(a-1);           // S2 decreasing d

    Phi2 phi2 =
    {
        .u = ISQRTX,
        .v = a,
        .w = ISQRTX + 1,
        .aux = vector<bool>(),
        .done = false
    };

    // Ordinary leaves
    int64_t S0 = S0_compute();
    cout << "S0 = " << S0 << "\n";

    // Init S2 vars
    for (int64_t b = astar; b < a - 1; ++b)
    {
        int64_t pb1 = PRIMES[b+1];

        int64_t tb;
        if (X <= cube(pb1))       tb = b + 2;
        else if (pb1*pb1 <= Z)    tb = a + 1;
        else                      tb = PRIME_COUNT[X / (pb1*pb1)] + 1;

        d2[b] = tb - 1;
        S2[b] = a - d2[b];
        t[b] = 0;
    }

    // Main segmented sieve: For each interval Bk = [z_{k-1}, z_k)
    for (int64_t k = 1; ; ++k)
    {
        // init new block
        phi_block.new_block(k);
        if (phi_block.zk1 > Z) break;

        // For each b...
        // start at 2 to sieve out odd primes
        assert(C >= 2);
        assert(C < astar);

        for (int64_t b = 2; b <= a; ++b)
        {
            //cout << "b " << b << endl;
            int64_t pb = PRIMES[b];

            // sieve out p_b for this block (including <= C)
            phi_block.sieve_out(pb);

            // S1 leaves, b in [C, astar)
            if (C <= b && b < astar)
                S1 += S1_iter(b, phi_block, m);

            // S2 leaves, b in [astar, a-1)
            else if (astar <= b && b < a - 1)
                S2_iter(b, phi_block, d2, t, S2);

            // phi2, after sieved out first a primes
            else if (b == a && !phi2.done)
                P2_iter(phi_block, phi2, P2);

            // for next block k
            phi_block.update_save(b);
        }
    }

    int64_t S2_total = 0;
    for (auto x : S2)
        S2_total += x;

    // accumulate final results
    cout << "S1 = " << S1 << endl;
    cout << "S2 = " << S2_total << endl;
    cout << "P2 = " << P2 << endl;

    return S0 + S1 + S2_total + a - 1 - P2;
}

int main(int argc, char* argv[])
{
    if (!(argc == 2 || argc == 4))
    {
        cerr << "Usage: ./primecount X [BLOCKSIZE ALPHA]\n";
        return 1;
    }

    // setup global constants
    // TODO: dynamically adjust constants
    // block size should be O(x^1/3)
    // alpha = O(log^3 x)
    int64_t X = atof(argv[1]);  // read float like 1e12 from command line
    int64_t bs = 1 << 16;
    int64_t alpha = 6;

    if (argc == 4) // override defaults
    {
        bs = atof(argv[2]);
        alpha = atoi(argv[3]);
    }

    Primecount P(X, alpha, bs);

    cout << P.primecount() << endl;
}

fenwick_tree.hpp, which has been specialized for this problem by taking in a bool vector to save memory:

// Credit: cp-algorithms (Jakob Kobler), e-maxx.ru (Maxim Ivanov)
#pragma once
#include <vector>
#include <cstddef>
#include <cstdint>

struct fenwick_tree
{
    size_t len;             // 0-based len
    std::vector<int64_t> t; // 1-based tree, indexes [1:len]

    fenwick_tree(std::vector<bool> const &a) : len(a.size()), t(len + 1, 0)
    {
        for (size_t i = 0; i < len; ++i)
            add_to(i, a[i]);
    }

    // 0-based input
    int64_t sum_to(size_t r) const
    {
        int64_t s = 0;
        for (++r; r > 0; r -= r & -r)
            s += t[r];
        return s;
    }

    // 0-based input
    void add_to(size_t i, int64_t delta)
    {
        for (++i; i <= len; i += i & -i)
            t[i] += delta;
    }
};

Some of my design decisions:

  • I used namespace std because the file is self-contained and there are no naming conflicts AFAIK.
  • Phi2 got its own struct because it maintains 5 variables (u, v, w, aux, done), but I couldn't decide if the algorithm for it should go in Primecount or in the Phi2 struct (a sketch of the latter option follows this list).
  • The Primecount constructor has a non-obvious order: sieve_mu_prime() generates 3 vectors at once, and a can only be calculated after those. Because sieve_mu_prime() returns void, that ordering dependency isn't obvious.
  • I have no unit testing framework, which would've made debugging a lot easier.
  • It's hard to test small values of X like 1e4 in this whole program, because the C parameter is not scaled based on the input. Some of the asserts may be unnecessary.
  • Maybe different levels of detail in print statements can be turned into logging. But anything more than a little printing should be disabled in "full-speed" mode.
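To illustrate the Phi2 option mentioned above, the iteration could become a member function that is handed the constants it needs (just a sketch; the signature and parameter bundling are hypothetical):

struct Phi2
{
    int64_t u, v, w;
    vector<bool> aux;
    bool done;

    // Same body as the current P2_iter, but Phi2 owns its own stepping;
    // X, IACBRTX, a and PRIMES are passed in (or bundled into a constants struct).
    void iterate(const PhiBlock& phi_block,
                 int64_t X, int64_t IACBRTX, int64_t a,
                 const vector<int64_t>& PRIMES,
                 int64_t& P2);
};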

I appreciate any pointers and references (no pun intended) for the design of this kind of very niche, specific mathematical code.

Update: I found Alg 2 can be simplified a little to not use continue, repeated step 6, or an early return. This is not the focus of this question.

Update 2: I just found out the code is bugged for 1e15. That's a real bummer. Maybe some test cases on smaller functions will fix this.

Bounty update: due to Stack Exchange's crappy design, I am forced to award a bounty to an existing answer even if no new answers were written, or else it will be auto-awarded to the highest-voted answer (in this case, neither of the two current answers fully answers my main concerns). https://meta.stackexchange.com/questions/166172/explicit-do-not-award-bounty-button I will not be using the bounty feature in the future to try to get more answers. But in the spirit of PPCG I am offering an "indefinite bounty": anyone who answers well about the structure of the program and C++ OOP will be awarded at least a 100 rep bounty by me.

  • While the changes to the code do not appear to invalidate the answer, please refrain from updating the code after an answer is received. Please see What should I do when someone answers my question?
    Commented Jan 8 at 17:57
  • @Fe2O3 that was the other formula I found for ceiling division. The people on that SO question said the current version may actually be faster because x / y and x % y are calculated in one instruction. Actually I think everywhere I use ceil_div(x, y), I immediately multiply by the divisor, as I use it as a round-up-to-nearest-multiple function. Regardless, S2's algorithm and PhiBlock are the bulk of computation time. Any significant improvements have to start there.
    – qwr
    Commented Jan 10 at 4:49
  • I don't think using unsigned would prevent overflow. If the bug with X = 1e15, the largest input I have tested, is an overflow, I think it is either in a multiplication or a mistake in applying integer operations to real-number math.
    – qwr
    Commented Jan 10 at 4:53
  • @Fe2O3 x86, which is the only relevant architecture to actually run this on (except for maybe ARM but I don't care about those)
    – qwr
    Commented Jan 13 at 3:20
  • Disregard my previous comment. It applies for any relevant architecture to run numerical computing on, x86 and ARM included. ARM requires 2 instructions but apparently runs just as fast as x86. stackoverflow.com/questions/35351470/…
    – qwr
    Commented Jan 13 at 3:29

3 Answers


Try to make variables const with the const specifier

There are several variables whose values are never changed. To reduce the cognitive load on the reader, mark them const (e.g. the PhiBlock constructor PhiBlock(const int64_t bsize, const int64_t a), the new_block function void new_block(const int64_t k), the sum_to function int64_t sum_to(const int64_t y, const int64_t b) const, the sieve_out function void sieve_out(const int64_t pb), etc.).

Prefer '\n' to std::endl.

std::endl forces a flush after placing the new line on the stream. It is very rare that you actually want to force a flush.
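For example, the progress output in the Primecount constructor could be written as (illustrative):

cout << "Z = " << Z << '\n';  // newline without forcing a flush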

Don't using namespace std;.

using namespace std pollutes the global namespace and can lead to conflicts with other names. See Why is "using namespace std" considered bad practice?

  • Ok but you did not address my main concern. I already made a note about namespace std.
    – qwr
    Commented Jan 8 at 14:38
  • For my debug printing I actually do want to force a flush. But I think if I output to cerr it will do that automatically.
    – qwr
    Commented Jan 8 at 14:42

Important reference

The reference, "Computing π(x): the combinatorial method" by Tomás Oliveira e Silva, and its link deserve to also be in the code.
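For example, a short comment at the top of the main file would do (illustrative; fill in the actual link):

// pi(x) via the combinatorial method, following:
// Tomás Oliveira e Silva, "Computing pi(x): the combinatorial method"
// <link to the paper / project README>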

Separate prime functions

Consider making the code that forms primes a separate unit. That separation would make this application easier to understand and would allow the prime code to be re-used in other applications.
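A minimal sketch of what that separation might look like (names are illustrative, not taken from the posted code):

// prime_tables.hpp (hypothetical separate unit)
#pragma once
#include <cstdint>
#include <vector>

struct PrimeTables
{
    std::vector<int64_t> mu_pmin;      // mu(n) * pmin(n)
    std::vector<int64_t> primes;       // primes up to limit
    std::vector<int64_t> prime_count;  // pi(n) for n in [1, limit]
};

// One sieve fills all three tables; the Primecount constructor would then
// consume the returned struct instead of filling its own members as a side effect.
PrimeTables sieve_prime_tables(int64_t limit);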


Some minor things:

Consider large values

Algorithms deserve overflow consideration, if not now at least for the future.

Example:

// May overflow?
// if (p*p > P.u) break;

// Does not overflow
// Takes advantage that p >= 2 is expected.
if (p > P.u/p) break;

Loss of precision

double can usually encode every integer value up to 2^53. Above that, int64_t X = atof(argv[1]) risks rounding.

Consider strtold() or the like.
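A sketch of that, using a hypothetical parse_x() helper with an explicit range check before truncating:

#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: accepts "1e15"-style input; long double keeps more
// integer precision than double, and the range check guards the conversion.
int64_t parse_x(const char *s)
{
    long double v = strtold(s, nullptr);
    if (v < 1.0L || v > 9.0e18L)   // stay within int64_t
    {
        fprintf(stderr, "X out of range\n");
        exit(EXIT_FAILURE);
    }
    return (int64_t)v;
}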

16-bit int?

Say the code was ported to an embedded processor where 16-bit int still occurs.

int64_t bs = 1 << 16;           // ---> UB

// Somehow, ensure computation is done with `int64_t` math.
int64_t bs = (int64_t)1 << 16;  // Does not overflow.

BTW, why naked magic number 16?
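One possibility (a sketch only; the name is mine) that addresses both points:

// Names the constant and forces the shift to be done in 64-bit math.
constexpr int64_t DEFAULT_BLOCK_SIZE = int64_t{1} << 16;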

Types

Code uses int64_t where size_t is expected, as in update_save(int64_t b) { phi_save[b] .... It is unclear if primes larger than uint32_t are used.

To assess this well would take more review than I am able to give now.

Formatting

Code looks well formatted: good.

Yet code like the below hints that it is done manually: bad.

if (X <= cube(pb1))       tb = b + 2;
else if (pb1*pb1 <= Z)    tb = a + 1;
else                      tb = PRIME_COUNT[X / (pb1*pb1)] + 1;

In a business setting, save time and use auto formatting.

  • Thanks for the review. To respond: 1. I should've mentioned the paper is linked in the project README. The sieve that precomputes is inherently tied to PRIMES, PRIME_COUNT, and MU_PMIN all at once, which for this purpose is only used for this task. So I'm not sure I should separate it out from the constructor. Maybe with new C++ features I can return these from my void function as const. 2. Is p*p > u equivalent to p > u/p for all positive integers? I think it is in this case, as p*p > u implies p > u/p > floor(u/p), and p*p < u implies p <= u/p, which for integer p implies p <= floor(u/p).
    – qwr
    Commented Jan 8 at 22:18
  • I tried to avoid integer division due to this floor behavior. 3. You are right about the double. I did not consider running my code past 9e15, but it is optimized to the point where that is feasible.
    – qwr
    Commented Jan 8 at 22:18
  • 4. I am not particularly concerned about embedded processors, but you're right in that I should be more careful with my constants. 1 << 16 is a constant I experimentally determined works pretty well, but I should eventually set this dynamically by default. I don't think any prime greater than sqrt(x) is considered, so all primes should fit into an int32_t. But no value involved anywhere exceeds an int64_t. 5. That specific part of code is formatted that way because of the way the equation is presented in the paper as cases (with a big brace). Although I will admit it does look pretty weird.
    – qwr
    Commented Jan 8 at 22:19
  • The floor division thing made me doubt myself so I put it on SO stackoverflow.com/questions/79340918/…
    – qwr
    Commented Jan 8 at 23:57

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call the constructor every time in PhiBlock (no performance improvement, but better to be explicit about it) AND use linear-time instead of O(n log n) construction (maybe 10% speedup actually); a sketch of the linear-time build follows this list.
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I ended up with the former even though __int128 is exact integer math and probably faster (the 128-bit result can be computed in a single mul instruction); a sketch of one such rewrite also follows this list.
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
  • Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members. Move Phi2 info back into Primecount
  • Add defaults for block size (2^20) and alpha (scales with log^3 x) for my machine
  • Fix constants to work with small inputs
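A sketch of the linear-time build, assuming the same fenwick_tree layout as above (rebuild is a hypothetical name that PhiBlock::new_block could call instead of constructing a new tree each block):

// O(n): each position pushes its partial sum to its parent exactly once.
void rebuild(std::vector<bool> const &a)
{
    len = a.size();
    t.assign(len + 1, 0);
    for (size_t i = 1; i <= len; ++i)
    {
        t[i] += a[i-1];                 // value at 1-based position i
        size_t parent = i + (i & -i);   // next node that covers position i
        if (parent <= len)
            t[parent] += t[i];
    }
}

And one way the cube comparison can be rewritten (a sketch, not necessarily the exact change I made): for pb1 > 0, X <= pb1^3 is equivalent to ceil(X/pb1) <= pb1^2, and pb1^2 stays within int64_t here because pb1 <= alpha * cbrt(X) <= sqrt(X).

// Equivalent to X <= cube(pb1) without forming the cube.
bool le_cube(int64_t X, int64_t pb1)
{
    return ceil_div(X, pb1) <= pb1 * pb1;
}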

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, and it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div could be optimized, but I believe none of its calls are in the hot loop.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

The paper also mentions dynamically sizing intervals and potentially replacing the Fenwick tree with a linear array for smaller y. Apparently the tree is terrible for cache locality. I don't plan on implementing these.

  • Primes savings detail akin to your last paragraph: for every 30, there are at most 8 primes, thus 1 byte per 30. Yes it involves bit packing and /30, %30. (2*3*5 = 30)
    – chux
    Commented Jan 10 at 2:43
  • @chux the bit packing works for ind only. The Fenwick tree still has to store 8 int64s. Kim Walisch's primecount says he used something involving POPCNT instead of the Fenwick tree but I didn't read into it. Maybe he is just doing the prefix sum every time.
    – qwr
    Commented Jan 10 at 2:53
  • Actually the Fenwick tree only stores sums up to phi(zk1-1,b). So if the block size is < 2^32, maybe I can get away with int32s.
    – qwr
    Commented Jan 10 at 2:55
  • To emphasize the separate prime functions idea: since prime generation, access, testing, etc. is such a well-studied and optimized problem, having primes untangled from the rest of the task allows one to more easily import existing optimized prime code. Good luck.
    – chux
    Commented Jan 10 at 3:04
  • @chux I understand your point, but really the primes are generated in a few lines to a table and take almost no time at all to generate. The bulk of the work is computing phi(y,b) in S2. Given that, I do want to add some small test cases for mu_pmin_sieve.
    – qwr
    Commented Jan 10 at 3:14
