Optimizing unboxed array operations in Haskell

Question

Consider this simplified code which successively permutes array elements:

import Data.Word import Data.Bits import Data.Array.Unboxed import Data.Array.Base import Data.Array.ST test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = arr `seq` runSTUArray (change arr) where change a' = do a <- thaw a' x0 <- unsafeRead a 0 x1 <- unsafeRead a 1 unsafeWrite a 0 (x1 + 1) unsafeWrite a 1 (x0 - 1) return a test4 :: (Word32, Word32) -> (Word32, Word32) test4 (x, y) = x `seq` y `seq` (y + 1, x - 1) apply :: Int -> (Int -> a -> a) -> a -> a apply n f v0 = n `seq` v0 `seq` applyLoop 0 v0 where applyLoop i v | i == n - 1 = res | otherwise = applyLoop (i + 1) res where res = f i v main :: IO () main = let arr = listArray (0, 1) [0, 1] :: UArray Int Word32 in print $ apply 10000000 (\_ x -> x `seq` test3 x) arr

On my machine, with ghc 7.4.2 and -O3 flag the version with test4 (plain tuples) shows ~10 times better performance then the version with UArray (also, according to the profiler, UArray version has total allocation of ~1.5 Gb). Are there any other optimizations I could do? I'd prefer to use the array version, because it will be more convenient in the real application.

applyLoop is not strict in either of its accumulators. You might just be getting lucky with the strictness analyzer when using test4. Also, -O2 is the highest optimization level, although ghc seems to silently accept any optimization level. — Gabriella Gonzalez, CommentedApr 8, 2013 at 1:43
Will writing where res = i `seq` v `seq` f i v make applyLoop strict (I know I have seqs all over the place, I still have little intuition about strictness analyzer abilities)? It does not seem to have any effect on the performance. — fjarri, CommentedApr 8, 2013 at 2:57
Actually, what I had in mind was something like i `seq` v `seq` applyLoop (i + 1) res, but that doesn't help either. I'll post more details in an answer. — Gabriella Gonzalez, CommentedApr 8, 2013 at 3:16

Daniel Fischer · Accepted Answer · 2013-04-10 21:07:27Z

Okay, first let's establish a baseline. I'm on a 32-bit system now, so the allocation figures and also the timings are somewhat different than on a 64-bit system, but the trends are mostly the same. To see what we deal with, let's not compile for profiling, but just plain with optimisations (-O2).

I have created a repository with all versions of the code I ran. I took the liberty and removed all superfluous seqs, and slightly changed applyLoop, to

applyLoop i v | i == n = v | otherwise = applyLoop (i + 1) $ f i v

which changes the result if apply is initially called with a zero argument (the new one immediately returns the original argument, while the old looped until i reaches -1 after wrap-around of the counter), but for positive arguments the semantics remain unchanged. For very small positive n, the one extra comparison the new version does might make a measurable difference if f is inlined and trivial, but for nontrivial n or f, the extra comparison is negligible, and the new version avoids code duplication that happens with the old.

That gives

$ ./test3 +RTS -s array (0,1) [(0,0),(1,1)] 560,049,112 bytes allocated in the heap 32,160 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1083 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 1.35s ( 1.35s elapsed) GC time 0.02s ( 0.02s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 1.37s ( 1.37s elapsed) %GC time 1.3% (1.3% elapsed) Alloc rate 415,944,041 bytes per MUT second Productivity 98.6% of total user, 98.4% of total elapsed

in contrast to

$ ./test4 +RTS -s (0,1) 47,492 bytes allocated in the heap 1,756 bytes copied during GC 42,632 bytes maximum residency (1 sample(s)) 18,808 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.04s ( 0.04s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.04s ( 0.04s elapsed) %GC time 0.5% (0.6% elapsed) Alloc rate 1,182,455 bytes per MUT second Productivity 98.9% of total user, 103.1% of total elapsed

from test4. The difference is enormous. Well, let's look at the core to find out why, and how we can reduce it.

First, the core for using test4:

Rec { Main.main_$s$wapplyLoop [Occ=LoopBreaker] :: GHC.Prim.Int# -> GHC.Prim.Word# -> GHC.Prim.Word# -> (# GHC.Word.Word32, GHC.Word.Word32 #) [GblId, Arity=3, Caf=NoCafRefs, Str=DmdType LLL] Main.main_$s$wapplyLoop = \ (sc :: GHC.Prim.Int#) (sc1 :: GHC.Prim.Word#) (sc2 :: GHC.Prim.Word#) -> case sc of wild { __DEFAULT -> Main.main_$s$wapplyLoop (GHC.Prim.+# wild 1) (GHC.Prim.plusWord# sc2 (__word 1)) (GHC.Prim.minusWord# sc1 (__word 1)); 10000000 -> (# GHC.Word.W32# sc1, GHC.Word.W32# sc2 #) } end Rec }

You can't beat that without short-cutting. You get a tight loop using only unboxed values in registers, it does no allocation. But note that seqing x and y in test4 is necessary here to aid the strictness analyser. Without it, you get a loop using boxed Word32s building up huge thunks. The seqs that originally were in apply resp. in the function (\_ x -> x `seq` testN x) passed to it make no difference whatsoever.

So, we have a lean and mean unboxed loop to aim for, and a sluggish alternative allocating tons. What is allocating so much, and taking so much time there?

The applyLoop worker in this case begins with

case GHC.ST.runSTRep @ (Data.Array.Base.UArray GHC.Types.Int GHC.Word.Word32) (\ (@ s) (s :: GHC.Prim.State# s) -> let { n# [Dmd=Just L] :: GHC.Prim.Int# [LclId, Str=DmdType] n# = GHC.Prim.sizeofByteArray# ww4 } in case GHC.Prim.newByteArray# @ s n# s of _ { (# ipv, ipv1 #) -> case {__pkg_ccall array-0.4.0.1 memcpy forall s. GHC.Prim.MutableByteArray# s -> GHC.Prim.ByteArray# -> GHC.Prim.Word# -> GHC.Prim.State# GHC.Prim.RealWorld -> (# GHC.Prim.State# GHC.Prim.RealWorld, GHC.Prim.Addr# #)}

allocating a new ByteArray# - which is pretty fast, GHC's runtime is very good at such things, but nevertheless, 10 million allocations sum up to non-negligible time - and copying the contents of the old array over. The __pkg_ccall is a quite fast C call, but it is considerably slower than a Haskell call, and calling over to C for copying 8 bytes is wasteful (the call takes much longer than the copying). I think - though I can't verify at the moment - that on a 64-bit system that is done without calling out to C, which would be much faster. Replacing the thaw with a manual copying loop for an unsafeNewArray_ of the appropriate size (manual_copy.hs) brings down the time to 0.65 seconds - but increases the allocation figures:

$ ./manual_copy +RTS -s array (0,1) [(0,0),(1,1)] 760,049,112 bytes allocated in the heap 41,424 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 1469 colls, 0 par 0.02s 0.02s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.62s ( 0.63s elapsed) GC time 0.02s ( 0.02s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.64s ( 0.65s elapsed) %GC time 3.3% (3.2% elapsed) Alloc rate 1,226,997,169 bytes per MUT second Productivity 96.6% of total user, 95.3% of total elapsed

Note that it would be different for a large array, then calling to memcpy of the C standard library would pay off.

Using unsafeThaw here (unsafeThaw.hs) avoids allocating new ByteArray#s (the underlying raw arrays that carry the payload of UArrays) each iteration, and the involved copying, so it unsurprisingly speeds up the program, and reduces allocation:

$ ./unsafeThaw +RTS -s array (0,1) [(0,0),(1,1)] 400,049,112 bytes allocated in the heap 27,596 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 764 colls, 0 par 0.01s 0.01s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.28s ( 0.28s elapsed) GC time 0.01s ( 0.01s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.29s ( 0.29s elapsed) %GC time 4.4% (4.3% elapsed) Alloc rate 1,448,146,787 bytes per MUT second Productivity 95.5% of total user, 95.6% of total elapsed

So the speedup is about 4.5× compared to the original - not bad, but still a far cry from the test4 time - and the allocation is reduced by 160 million bytes (10 million times 8 bytes for the two Word32 plus 8 for bookkeeping [size] of the ByteArray#), but it is still huge.

That is because

of _ { (# ipv4, ipv5 #) -> (# ipv4, Data.Array.Base.UArray @ GHC.Types.Int @ GHC.Word.Word32 ww1 ww2 ww3 ipv5 #) }

in each iteration it creates a new UArray Int Word32. It allocates a new UArray constructor, and pointers to the components, lower bound, upper bound, number of elements and the ByteArray#. We can avoid that by returning the argument array - after we sneakily changed the contents of its ByteArray# (unsafeThaw2.hs):

test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = runST (change arr) where change a' = do a <- Data.Array.Unsafe.unsafeThaw a' :: ST s (STUArray s Int Word32) x0 <- unsafeRead a 0 x1 <- unsafeRead a 1 unsafeWrite a 0 (x1 + 1) unsafeWrite a 1 (x0 - 1) return a'

which results in

$ ./unsafeThaw2 +RTS -s array (0,1) [(0,0),(1,1)] 80,049,100 bytes allocated in the heap 6,828 bytes copied during GC 42,632 bytes maximum residency (2 sample(s)) 22,904 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 151 colls, 0 par 0.00s 0.00s 0.0000s 0.0001s Gen 1 2 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s INIT time 0.00s ( 0.00s elapsed) MUT time 0.17s ( 0.17s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.17s ( 0.17s elapsed) %GC time 1.7% (1.7% elapsed) Alloc rate 480,150,792 bytes per MUT second Productivity 98.1% of total user, 98.0% of total elapsed

another nearly twofold speedup and allocation figures reduced by 80%.

It takes so much longer and allocates so much more than the tight loop mostly because each iteration costs a GHC.ST.runSTRep, which is NOINLINE and inhibits a worker-wrapper transform of test3 that would allow cheaper access to the ByteArray#.

You can get much better performance if you change the type of apply to be monadic, and use a monadic version of test3, e.g. (cleanST.hs)

import Data.Word import Data.Array.Unboxed import Data.Array.ST import Data.Array.Base import Control.Monad.ST test0 :: STUArray s Int Word32 -> ST s (STUArray s Int Word32) test0 arr = do x0 <- unsafeRead arr 0 x1 <- unsafeRead arr 1 unsafeWrite arr 0 (x1+1) unsafeWrite arr 1 (x0 - 1) return arr applyM :: Monad m => Int -> (Int -> a -> m a) -> a -> m a applyM n f v0 = do let loop i v | i < n = do v' <- f i v loop (i+1) v' | otherwise = return v loop 0 v0 main :: IO () main = let arr = listArray (0,1) [0,1] :: UArray Int Word32 in print $ runSTUArray $ do marr <- thaw arr applyM 10000000 (\_ a -> test0 a) marr

that comes close to the tight loop:

$ ./cleanST +RTS -s array (0,1) [(0,0),(1,1)] 49,148 bytes allocated in the heap 1,756 bytes copied during GC 42,632 bytes maximum residency (1 sample(s)) 18,808 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.05s ( 0.06s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.06s ( 0.06s elapsed) %GC time 0.4% (0.4% elapsed) Alloc rate 905,470 bytes per MUT second Productivity 99.2% of total user, 97.3% of total elapsed

but is of course still slower because the array reads and writes go to the L1 cache, while test4 operated completely in registers.

If you don't want to change the type of apply, I know no clean way of coming close to the performance of the tight loop.

If you are prepared to get your hands really dirty, you can use low-level primitives (low_level.hs),

{-# LANGUAGE MagicHash, UnboxedTuples, BangPatterns #-} import Data.Word import Data.Array.Unboxed import Data.Array.Base import GHC.Base test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = change arr where swp mbarr s = case readWord32Array# mbarr 0# s of (# s1, w0 #) -> case readWord32Array# mbarr 1# s1 of (# s2, w1 #) -> case writeWord32Array# mbarr 0# (w1 `plusWord#` 1##) s2 of s3 -> writeWord32Array# mbarr 1# (w0 `minusWord#` 1##) s3 change arr@(UArray _ _ _ ba) = case swp (unsafeCoerce# ba) realWorld# of !s -> arr

which gives more or less the same figures as cleanST,

$ ./low_level +RTS -s array (0,1) [(0,0),(1,1)] 49,112 bytes allocated in the heap 1,756 bytes copied during GC 42,632 bytes maximum residency (1 sample(s)) 18,808 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0002s 0.0002s INIT time 0.00s ( 0.00s elapsed) MUT time 0.05s ( 0.06s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.06s ( 0.06s elapsed) %GC time 0.4% (0.4% elapsed) Alloc rate 928,001 bytes per MUT second Productivity 99.0% of total user, 95.1% of total elapsed

and more-or-less equivalent core (cleanST's is cleaner), but is really as dirty as it looks. I cannot recommend going down that route except in extreme circumstances.

Note: new (better) timings, the machine was busier when the original runs were made

I'm having trouble following your steps, Daniel. If you could hpaste each code variant used on each step, it would probably help a great deal. If you still have them. :) If not, no matter. — Will Ness, CommentedApr 10, 2013 at 19:03
Urk. I'll try to reconstruct as good as possible. Will take a bit. — Daniel Fischer, CommentedApr 10, 2013 at 19:06
and you say you changed it to ByteArray#, but the test3 that follows is tagged with test3 :: UArray Int Word32 -> UArray Int Word32... ? — Will Ness, CommentedApr 10, 2013 at 19:23
so is it correct to say that while the original function did 10 million thaws and freezes, your monadic version does just one thaw (and doesn't even bother with unsafeThaw) and then does a 10 million loop inside the ST monad? It does make perfect sense for it to perform that much better! :) — Will Ness, CommentedApr 10, 2013 at 19:38
@Will here is a repo with as far as I remember all versions. I'll reply to the following comments in a moment. — Daniel Fischer, CommentedApr 10, 2013 at 20:11

Gabriella Gonzalez · Accepted Answer · 2013-04-08 03:45:16Z

So I profiled your code using the -prof -auto-all -caf-all options at first and then ran it with ./arr +RTS -p, which generates the arr.prof file. It gave me the following measurements:

 Sun Apr 7 20:16 2013 Time and Allocation Profiling Report (Final) arr +RTS -p -RTS total time = 10.42 secs (10419 ticks @ 1000 us, 1 processor) total alloc = 14,400,052,608 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc apply.applyLoop Main 32.9 0.0 test3 Main 30.7 55.6 test3.change Main 21.3 44.4 main.\ Main 7.7 0.0 apply.applyLoop.res Main 7.5 0.0

All the allocation is occurring in test3. I thought it was because you keep thawing array, which makes a new copy, but that was not the case. Replacing thaw with unsafeThaw and runSTArray with the unsafeFreeze equivalent did nothing:

{-# LANGUAGE RankNTypes #-} import Data.Word import Data.Bits import Data.Array.Unboxed import Data.Array.Base hiding (unsafeThaw, unsafeFreeze) import Data.Array.ST import Control.Monad.ST (ST, runST) test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = runST $ change arr where change a' = do a <- unsafeThaw a' :: forall s . ST s (STUArray s Int Word32) x0 <- unsafeRead a 0 x1 <- unsafeRead a 1 unsafeWrite a 0 (x1 + 1) unsafeWrite a 1 (x0 - 1) unsafeFreeze a test4 :: (Word32, Word32) -> (Word32, Word32) test4 (x, y) = x `seq` y `seq` (y + 1, x - 1) apply :: Int -> (Int -> a -> a) -> a -> a apply n f v0 = n `seq` v0 `seq` applyLoop 0 v0 where applyLoop i v | i == n - 1 = res | otherwise = (applyLoop $! i + 1) $! res where res = f i v main :: IO () main = let arr = listArray (0, 1) [0, 1] :: UArray Int Word32 in print $ apply 100000000 (\_ x -> x `seq` test3 x) arr

So I added more fine grained cost centers to zoom in on which part was the trouble-maker:

{-# LANGUAGE RankNTypes #-} import Data.Word import Data.Bits import Data.Array.Unboxed import Data.Array.Base hiding (unsafeThaw, unsafeFreeze) import Data.Array.ST import Control.Monad.ST (ST, runST) test3 :: UArray Int Word32 -> UArray Int Word32 test3 arr = {-# SCC "runST" #-} runST $ {-# SCC "change" #-} change arr where change a' = do a <- {-# SCC "unsafeThaw" #-} unsafeThaw a' :: forall s . ST s (STUArray s Int Word32) x0 <- {-# SCC "unsafeRead1" #-} unsafeRead a 0 x1 <- {-# SCC "unsafeRead2" #-} unsafeRead a 1 {-# SCC "unsafeWrite1" #-} unsafeWrite a 0 (x1 + 1) {-# SCC "unsafeWrite2" #-} unsafeWrite a 1 (x0 - 1) {-# SCC "unsafeFreeze" #-} unsafeFreeze a test4 :: (Word32, Word32) -> (Word32, Word32) test4 (x, y) = x `seq` y `seq` (y + 1, x - 1) apply :: Int -> (Int -> a -> a) -> a -> a apply n f v0 = n `seq` v0 `seq` applyLoop 0 v0 where applyLoop i v | i == n - 1 = res | otherwise = (applyLoop $! i + 1) $! res where res = f i v main :: IO () main = let arr = listArray (0, 1) [0, 1] :: UArray Int Word32 in print $ apply 100000000 (\_ x -> x `seq` test3 x) arr

... but that just gave even more confusing results. It says that change is the bottle-neck, but as far as I can tell, change is just the bind for ST, which should be just as fast as the IO monad's bind (which is fast).

 Sun Apr 7 20:39 2013 Time and Allocation Profiling Report (Final) arr +RTS -p -RTS total time = 29.24 secs (29237 ticks @ 1000 us, 1 processor) total alloc = 26,400,052,608 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc test3.change Main 45.4 15.2 apply.applyLoop Main 12.1 0.0 change Main 6.8 15.2 unsafeFreeze Main 6.3 24.2 unsafeThaw Main 6.2 24.2 unsafeRead1 Main 5.0 12.1 runST Main 4.9 0.0 unsafeWrite2 Main 4.3 9.1 apply.applyLoop.res Main 2.9 0.0 test3 Main 2.7 0.0 main.\ Main 2.7 0.0

So I'm sort of at a loss at this point for how to optimize it further, but perhaps somebody else can build off of this.

Stack Exchange Network

Optimizing unboxed array operations in Haskell

2 Answers 2

Linked

Hot Network Questions

Optimizing unboxed array operations in Haskell

2 Answers 2

Linked

Related

Hot Network Questions