I just want to multiply two big numbers modulo a big number
by math_explorer, Mar 28, 2016, 5:15 AM
without overflowing or writing my own big integer structure or wasting too many processor cycles, because you never know; the Time Limit Exceeded penalty is unforgiving. They're not that large. They'll fit into a long long just fine. How hard could it be?
This is simple and works with $m < 2^{62}$, but it's kind of slow. (We are assuming in all the functions below that the caller has already modded $a$ and $b$, so $0 \leq a, b < m$. And that long longs are 64 bits, yes. You might be able to get an extra bit by making everything unsigned and being really careful? But let's not. Also, I am avoiding the remainder operation everywhere, because it's quite slow; since we know tight bounds on each of the arithmetic operations we do, we can get by with subtracting $m$ as we find it necessary. Testing on my machine says this, using subtractions instead of modding, is a 9x speedup, which is a big deal if this is in an inner loop!)

    long long mul1(long long a, long long b, long long m) {
        long long res = 0;
        while (b) {
            if (b & 1) {
                res += a;                // res, a < m, so res + a < 2m
                if (res >= m) res -= m;
            }
            b >>= 1;
            a <<= 1;                     // a < m < 2^62, so a << 1 < 2^63
            if (a >= m) a -= m;
        }
        return res;
    }
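(To see the subtract-instead-of-mod trick in isolation: the helper below is my sketch, not part of the original code. Since the caller already reduced everything mod $m$, a sum of two residues lands in $[0, 2m)$, and one conditional subtraction does the job of the % operator.)

    // Hypothetical helper (for illustration only): modular addition using a
    // conditional subtraction instead of %. Assumes 0 <= a, b < m and that
    // 2m fits in a long long, i.e. m < 2^62.
    long long addmod(long long a, long long b, long long m) {
        long long s = a + b;   // s < 2m, so no overflow
        if (s >= m) s -= m;
        return s;
    }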
While carelessly Googling, I found some references to an algorithm called Schrage multiplication that seemed to do the same thing much more carefully. But references were scattered and incomplete and paid much less attention to performance than I cared about. Finally I reconstructed roughly what it looks like:

    long long schrage(long long a, long long b, long long m) {
        if (a == 0) { return 0; }
        long long quot = m / a, rem = m % a;  // m == a * quot + rem
        long long result = a * (b % quot) - rem * (b / quot);
        while (result < 0) { result += m; }
        return result;
    }
You can work out the modular arithmetic yourself and discover that, somewhat amazingly,

\[ \mathtt{result} \equiv ab \bmod{m}. \]
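(If you'd rather not work it out yourself: I believe the intended computation is to substitute $b \bmod \mathtt{quot} = b - \mathtt{quot}\lfloor b/\mathtt{quot}\rfloor$ and $\mathtt{rem} = m - a\cdot\mathtt{quot}$, after which everything collapses:

\[ a(b \bmod \mathtt{quot}) - \mathtt{rem}\left\lfloor\frac{b}{\mathtt{quot}}\right\rfloor = ab - a\cdot\mathtt{quot}\left\lfloor\frac{b}{\mathtt{quot}}\right\rfloor - (m - a\cdot\mathtt{quot})\left\lfloor\frac{b}{\mathtt{quot}}\right\rfloor = ab - m\left\lfloor\frac{b}{\mathtt{quot}}\right\rfloor \equiv ab \bmod{m}. \])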
But note that this does not work with arbitrary $a$ in range. As usually written, Schrage's algorithm seems to require that $a \leq \sqrt{m}$. However, we can loosen that a little; the key is really just that the computation $\mathtt{rem} \cdot \lfloor b/\mathtt{quot} \rfloor$ should not overflow, and this is roughly true if $a^2 \leq 2m$. This means that to get a function that works for general $a$, we can split $a$ into two and use this as a subroutine.
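(Concretely, writing $r$ for half of $m$'s bit length, so that $2^r$ is the crude square root we compute below, the split I mean is

\[ ab \equiv 2^r\left(\left\lfloor\frac{a}{2^r}\right\rfloor b \bmod m\right) + (a \bmod 2^r)\, b \pmod{m}, \]

where each of the three multiplications is small enough for schrage. The notation is mine; the code is mul2 below.)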
Also, we're aggressively avoiding the remainder operation again. Under the original Schrage requirements, the while can be replaced with an if, but our loosening will require it, and using the remainder operation loses our speed margin over mul1. Optimization is hard.
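(For reference, here is my sketch of the classical variant, not from the post: when $a \leq \sqrt{m}$ we get $\mathtt{rem} < \mathtt{quot}$, so $\mathtt{rem}\cdot\lfloor b/\mathtt{quot}\rfloor \leq b < m$, the result lands in $(-m, m)$, and a single if suffices.)

    // Classical Schrage (my reconstruction, for comparison): requires
    // a <= sqrt(m), which gives rem < quot, so result is in (-m, m) and
    // one conditional addition replaces the while loop.
    long long schrage_classic(long long a, long long b, long long m) {
        if (a == 0) { return 0; }
        long long quot = m / a, rem = m % a;
        long long result = a * (b % quot) - rem * (b / quot);
        if (result < 0) { result += m; }
        return result;
    }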
It's hard to loosen it in a more precise and useful way, though. I juggled expressions for quite a while but couldn't find a good intermediate point to abstract out, so instead, I'll just proclaim that this works, again as long as $m < 2^{62}$:

    long long mul2(long long a, long long b, long long m) {
        int rtsize = (64 - __builtin_clzll(m)) / 2;  // half of m's bit length
        long long rt = 1LL << rtsize;                // crude square root of m
        return (schrage(rt, schrage(a >> rtsize, b, m), m)
              + schrage(a & (rt - 1), b, m)) % m;
    }
This quickly and crudely computes an approximate square root $\mathtt{rt}$ of $m$. It satisfies $m/2 < \mathtt{rt}^2 \leq 2m$. (Check this. I really hope I got it right.) As a result, in all three calls to schrage(...), $a^2 \leq 2m$.
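(Since we're asked to check it, here is a quick randomized test. The harness is mine, not the post's, and assumes GCC/Clang for __builtin_clzll:)

    #include <cstdio>
    #include <cstdlib>

    // Randomized check (not from the original post) that the crude square
    // root satisfies m/2 < rt^2 <= 2m, i.e. m < 2*rt*rt and rt*rt <= 2*m.
    int main() {
        srand(12345);
        for (int i = 0; i < 10000000; i++) {
            long long m = (((long long)rand() << 31) | rand()) % ((1LL << 62) - 1) + 1;
            int rtsize = (64 - __builtin_clzll(m)) / 2;
            long long rt = 1LL << rtsize;
            // rt*rt <= 2m < 2^63 and 2*m < 2^63, so neither side overflows
            if (!(m / 2 < rt * rt && rt * rt <= 2 * m)) {
                printf("counterexample: m = %lld\n", m);
                return 1;
            }
        }
        puts("ok");
        return 0;
    }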
With this, we can work out that in schrage, $a \cdot (b \bmod \mathtt{quot}) < a \cdot \mathtt{quot} \leq m$ and

\[ \mathtt{rem}\cdot\left\lfloor\frac{b}{\mathtt{quot}}\right\rfloor < (a-1)\cdot\frac{b}{m/a - 1} = (a-1)\cdot\frac{ab}{m - a} = \frac{b(a^2 - a)}{m - a} \leq \frac{b(2m-a)}{m-a} < 2b, \]

so $\mathtt{rem}\cdot\lfloor b/\mathtt{quot}\rfloor$ is a safe multiplication that doesn't overflow (and $\mathtt{result} > -2b > -2m$, so the while loop adds $m$ back at most twice), and we're done!
So, mul2 is faster. A little. Okay, it's just a 1.5x speedup compared to mul1, and I'm quite a bit more unsure about whether it always works, since proofs have bugs.
But now we have mul3, the enigma bequeathed from code of a different generation that prompted this entire odyssey. It doesn't quite reach the effective range of either of the functions above — it seems to work up to $m \leq 2^{61}$ or thereabouts — but that's often enough, and blazing phoenixes, it's fast. It's, like, another 4.5x factor over mul2. How does it work? Why does it work?? I'm pretty sure it relies on undefined behavior???? And yet...
    long long mul3(long long a, long long b, long long m) {
        // a*b overflows, but on typical two's-complement machines it wraps
        // mod 2^64; (long long)(a/(long double)m*b + 1e-3) approximates
        // floor(a*b/m), and the difference of the two wrapped quantities
        // recovers a*b mod m (give or take one m, fixed by the +m and the
        // final %m). Signed overflow is undefined behavior, though.
        return (a*b - (long long)(a/(long double)m*b + 1e-3)*m + m)%m;
    }
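(For what it's worth, we can at least test it. A final sketch, mine and not the post's: cross-check mul3 against the slow-but-trustworthy mul1 on random inputs, assuming the functions above are pasted into the same file.)

    #include <cstdio>
    #include <cstdlib>

    // Randomized cross-check (not from the original post): compare mul3
    // against mul1 on random inputs with m below 2^61.
    int main() {
        srand(54321);
        for (int i = 0; i < 1000000; i++) {
            long long m = (((long long)rand() << 31) | rand()) % ((1LL << 61) - 1) + 1;
            long long a = (((long long)rand() << 31) | rand()) % m;
            long long b = (((long long)rand() << 31) | rand()) % m;
            if (mul1(a, b, m) != mul3(a, b, m)) {
                printf("mismatch: a=%lld b=%lld m=%lld\n", a, b, m);
                return 1;
            }
        }
        puts("ok");
        return 0;
    }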