Okay, first let's establish a baseline. I'm on a 32-bit system now, so the allocation figures and also the timings are somewhat different than on a 64-bit system, but the trends are mostly the same. To see what we deal with, let's not compile for profiling, but just plain with optimisations (-O2
).
I have created a repository with all versions of the code I ran. I took the liberty and removed all superfluous seq
s, and slightly changed applyLoop
, to
applyLoop i v | i == n = v | otherwise = applyLoop (i + 1) $ f i v
which changes the result if apply
is initially called with a zero argument (the new one immediately returns the original argument, while the old looped until i
reaches -1
after wrap-around of the counter), but for positive arguments the semantics remain unchanged. For very small positive n
, the one extra comparison the new version does might make a measurable difference if f
is inlined and trivial, but for nontrivial n
or f
, the extra comparison is negligible, and the new version avoids code duplication that happens with the old.
That gives
$ ./test3 +RTS -s array (0,1) [(0,0),(1,1)] 560,049,112 bytes allocated in the heap 32,160 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1083 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 1.35s ( 1.35s elapsed) GC time 0.02s ( 0.02s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.37s ( 1.37s elapsed) %GC time 1.3% (1.3% elapsed) Alloc rate 415,944,041 bytes per MUT second Productivity 98.6% of total user, 98.4% of total elapsed
in contrast to
$ ./test4 +RTS -s (0,1) 47,492 bytes allocated in the heap 1,756 bytes copied during GC 42,632 bytes maximum residency (1 sample(s)) 18,808 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.04s ( 0.04s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.04s ( 0.04s elapsed) %GC time 0.5% (0.6% elapsed) Alloc rate 1,182,455 bytes per MUT second Productivity 98.9% of total user, 103.1% of total elapsed
from test4
. The difference is enormous. Well, let's look at the core to find out why, and how we can reduce it.
First, the core for using test4
:
Rec { Main.main_$s$wapplyLoop [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Word# -> GHC.Prim.Word# -> (# GHC.Word.Word32, GHC.Word.Word32 #) [GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL] Main.main_$s$wapplyLoop = \ (sc :: GHC.Prim.Int#) (sc1 :: GHC.Prim.Word#) (sc2 :: GHC.Prim.Word#) -> case sc of wild { __DEFAULT -> Main.main_$s$wapplyLoop (GHC.Prim.+# wild 1) (GHC.Prim.plusWord# sc2 (__word 1)) (GHC.Prim.minusWord# sc1 (__word 1)); 10000000 -> (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #) } end Rec }
You can't beat that without short-cutting. You get a tight loop using only unboxed values in registers, it does no allocation. But note that seq
ing x
and y
in test4
is necessary here to aid the strictness analyser. Without it, you get a loop using boxed Word32
s building up huge thunks. The seq
s that originally were in apply
resp. in the function (\_ x -> x `seq` testN x)
passed to it make no difference whatsoever.
So, we have a lean and mean unboxed loop to aim for, and a sluggish alternative allocating tons. What is allocating so much, and taking so much time there?
The applyLoop
worker in this case begins with
case GHC.ST.runSTRep @ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32) (\ (@ s) (s :: GHC.Prim.State# s) -> let { n# [Dmd=Just L] :: GHC.Prim.Int# [LclId, Str=DmdType] n# = GHC.Prim.sizeofByteArray# ww4 } in case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) -> case {__pkg_ccall array-0.4.0.1 memcpy forall s. GHC.Prim.MutableByteArray# s -> GHC.Prim.ByteArray# -> GHC.Prim.Word# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}
allocating a new ByteArray#
- which is pretty fast, GHC's runtime is very good at such things, but nevertheless, 10 million allocations sum up to non-negligible time - and copying the contents of the old array over. The __pkg_ccall
is a quite fast C call, but it is considerably slower than a Haskell call, and calling over to C for copying 8 bytes is wasteful (the call takes much longer than the copying). I think - though I can't verify at the moment - that on a 64-bit system that is done without calling out to C, which would be much faster. Replacing the thaw
with a manual copying loop for an unsafeNewArray_
of the appropriate size (manual_copy.hs) brings down the time to 0.65 seconds - but increases the allocation figures:
$ ./manual_copy +RTS -s array (0,1) [(0,0),(1,1)] 760,049,112 bytes allocated in the heap 41,424 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1469 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.62s ( 0.63s elapsed) GC time 0.02s ( 0.02s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.64s ( 0.65s elapsed) %GC time 3.3% (3.2% elapsed) Alloc rate 1,226,997,169 bytes per MUT second Productivity 96.6% of total user, 95.3% of total elapsed
Note that it would be different for a large array, then calling to memcpy
of the C standard library would pay off.
Using unsafeThaw
here (unsafeThaw.hs) avoids allocating new ByteArray#
s (the underlying raw arrays that carry the payload of UArray
s) each iteration, and the involved copying, so it unsurprisingly speeds up the program, and reduces allocation:
$ ./unsafeThaw +RTS -s array (0,1) [(0,0),(1,1)] 400,049,112 bytes allocated in the heap 27,596 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 764 colls, 0 par 0.01s 0.01s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.28s ( 0.28s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.29s ( 0.29s elapsed) %GC time 4.4% (4.3% elapsed) Alloc rate 1,448,146,787 bytes per MUT second Productivity 95.5% of total user, 95.6% of total elapsed
So the speedup is about 4.5× compared to the original - not bad, but still a far cry from the test4
time - and the allocation is reduced by 160 million bytes (10 million times 8 bytes for the two Word32
plus 8 for bookkeeping [size] of the ByteArray#
), but it is still huge.
That is because
of _ { (# ipv4, ipv5 #) -> (# ipv4, Data.Array.Base.UArray @ GHC.Types.Int @ GHC.Word.Word32 ww1 ww2 ww3 ipv5 #) }
in each iteration it creates a new UArray Int Word32
. It allocates a new UArray
constructor, and pointers to the components, lower bound, upper bound, number of elements and the ByteArray#
. We can avoid that by returning the argument array - after we sneakily changed the contents of its ByteArray#
(unsafeThaw2.hs):
test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = runST (change arr) where change a' = do a <- Data.Array.Unsafe.unsafeThaw a' :: ST s (STUArray s Int Word32) x0 <- unsafeRead a 0 x1 <- unsafeRead a 1 unsafeWrite a 0 (x1 + 1) unsafeWrite a 1 (x0 - 1) return a'
which results in
$ ./unsafeThaw2 +RTS -s array (0,1) [(0,0),(1,1)] 80,049,100 bytes allocated in the heap 6,828 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 151 colls, 0 par 0.00s 0.00s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.17s ( 0.17s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.17s ( 0.17s elapsed) %GC time 1.7% (1.7% elapsed) Alloc rate 480,150,792 bytes per MUT second Productivity 98.1% of total user, 98.0% of total elapsed
another nearly twofold speedup and allocation figures reduced by 80%.
It takes so much longer and allocates so much more than the tight loop mostly because each iteration costs a GHC.ST.runSTRep
, which is NOINLINE
and inhibits a worker-wrapper transform of test3
that would allow cheaper access to the ByteArray#
.
You can get much better performance if you change the type of apply
to be monadic, and use a monadic version of test3
, e.g. (cleanST.hs)
import Data.Word import Data.Array.Unboxed import Data.Array.ST import Data.Array.Base import Control.Monad.ST test0 :: STUArray s Int Word32 -> ST s (STUArray s Int Word32) test0 arr = do x0 <- unsafeRead arr 0 x1 <- unsafeRead arr 1 unsafeWrite arr 0 (x1+1) unsafeWrite arr 1 (x0 - 1) return arr applyM :: Monad m => Int -> (Int -> a -> m a) -> a -> m a applyM n f v0 = do let loop i v | i < n = do v' <- f i v loop (i+1) v' | otherwise = return v loop 0 v0 main :: IO () main = let arr = listArray (0,1) [0,1] :: UArray Int Word32 in print $ runSTUArray $ do marr <- thaw arr applyM 10000000 (\_ a -> test0 a) marr
that comes close to the tight loop:
$ ./cleanST +RTS -s array (0,1) [(0,0),(1,1)] 49,148 bytes allocated in the heap 1,756 bytes copied during GC 42,632 bytes maximum residency (1 sample(s)) 18,808 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.05s ( 0.06s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.06s ( 0.06s elapsed) %GC time 0.4% (0.4% elapsed) Alloc rate 905,470 bytes per MUT second Productivity 99.2% of total user, 97.3% of total elapsed
but is of course still slower because the array reads and writes go to the L1 cache, while test4
operated completely in registers.
If you don't want to change the type of apply
, I know no clean way of coming close to the performance of the tight loop.
If you are prepared to get your hands really dirty, you can use low-level primitives (low_level.hs),
{-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-} import Data.Word import Data.Array.Unboxed import Data.Array.Base import GHC.Base test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = change arr where swp mbarr s = case readWord32Array# mbarr 0# s of (# s1, w0 #) -> case readWord32Array# mbarr 1# s1 of (# s2, w1 #) -> case writeWord32Array# mbarr 0# (w1 `plusWord#` 1##) s2 of s3 -> writeWord32Array# mbarr 1# (w0 `minusWord#` 1##) s3 change arr@(UArray _ _ _ ba) = case swp (unsafeCoerce# ba) realWorld# of !s -> arr
which gives more or less the same figures as cleanST
,
$ ./low_level +RTS -s array (0,1) [(0,0),(1,1)] 49,112 bytes allocated in the heap 1,756 bytes copied during GC 42,632 bytes maximum residency (1 sample(s)) 18,808 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.05s ( 0.06s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.06s ( 0.06s elapsed) %GC time 0.4% (0.4% elapsed) Alloc rate 928,001 bytes per MUT second Productivity 99.0% of total user, 95.1% of total elapsed
and more-or-less equivalent core (cleanST
's is cleaner), but is really as dirty as it looks. I cannot recommend going down that route except in extreme circumstances.
Note: new (better) timings, the machine was busier when the original runs were made
applyLoop
is not strict in either of its accumulators. You might just be getting lucky with the strictness analyzer when usingtest4
. Also,-O2
is the highest optimization level, althoughghc
seems to silently accept any optimization level.\$\endgroup\$where res = i `seq` v `seq` f i v
makeapplyLoop
strict (I know I haveseq
s all over the place, I still have little intuition about strictness analyzer abilities)? It does not seem to have any effect on the performance.\$\endgroup\$i `seq` v `seq` applyLoop (i + 1) res
, but that doesn't help either. I'll post more details in an answer.\$\endgroup\$