
I've written an interpreter for a simple assembly-like language and it's performing slower than I would like.

It's split into 3 files: the Parser that converts the source to a vector of ints, the VM that actually runs the bytecode, and Tests that has a bubble sort written in the language.

It sorts 100 numbers in about 6 seconds in GHCi. The profiler doesn't tell me much, except that most of the time is spent inside the step function, as expected.

The Parser file isn't really relevant here because it's only run once, so it doesn't affect performance.

Another thing to note is that it takes around 250,000 ticks (instructions executed) to do the sort, so I'm pretty sure it could be much faster than 6 seconds.

Is there anything obvious that I could improve?

Parser

module Parser where

import Data.Vector (Vector, fromList)
import Data.Char (toUpper)
import Data.List (sort)

type ByteCode = [Int]

data OpCode = Push | Pop | Add | Sub | Mult | Div | Store | Load
            | Jmp | Cmp | Not | Br | Dup | Inc | Dec | Swp
    deriving (Enum, Read, Show, Ord, Eq)

arity :: Vector Int
arity = (fromList . map snd . sort) $
    zip [Push, Store, Load] [1, 1..] ++ zip [Pop, Add, Sub, Mult, Div] [0, 0..]

charIsNumeric :: Char -> Bool
charIsNumeric c = '0' <= c && '9' >= c

stringIsNumeric :: String -> Bool
stringIsNumeric ('-' : s) = all charIsNumeric s
stringIsNumeric s         = all charIsNumeric s

capitalize :: String -> String
capitalize []       = []
capitalize (x : xs) = toUpper x : xs

wordToByteCode :: String -> Int
wordToByteCode str = if stringIsNumeric str then read str else fromEnum opCodeEnum
    where
        opCodeEnum :: OpCode
        opCodeEnum = read $ capitalize str

stringToByteCode :: String -> ByteCode
stringToByteCode = map wordToByteCode . words

sourceToByteCode :: String -> ByteCode
sourceToByteCode = map wordToByteCode . concatMap words . lines

VM

module VM where

import Parser (ByteCode, OpCode(..), arity)
import qualified Data.IntMap as IM
import Data.Vector (Vector, (!))
import qualified Data.Vector as Vector
import Data.List (intercalate)
import Utility

data VM = VM
    { byteCode       :: Vector Int
    , programCounter :: Int
    , stack          :: [Int]
    , memory         :: IM.IntMap Int
    } deriving (Show)

fromCode :: ByteCode -> VM
fromCode code = VM
    { byteCode       = Vector.fromList code
    , programCounter = 0
    , stack          = []
    , memory         = IM.empty
    }

step :: VM -> VM
step vm = next
    where
        bc = byteCode vm
        pc = programCounter vm
        st = stack vm
        mm = memory vm
        inst = toEnum $ bc ! pc
        pop1 = tail st
        pop2 = tail pop1
        top1 = head st
        top2 = head pop1
        nextPc = pc + 1
        next = case inst of
            Pop   -> vm { stack = pop1, programCounter = nextPc }
            Push  -> vm { stack = bc ! nextPc : st, programCounter = pc + 2 }
            Add   -> vm { stack = (top1 + top2) : pop2, programCounter = nextPc }
            Sub   -> vm { stack = (top2 - top1) : pop2, programCounter = nextPc }
            Mult  -> vm { stack = (top1 * top2) : pop2, programCounter = nextPc }
            Div   -> vm { stack = (top2 `div` top1) : pop2, programCounter = nextPc }
            Store -> vm { stack = pop2, programCounter = nextPc, memory = IM.insert top1 top2 mm }
            Load  -> vm { stack = mm IM.! top1 : pop1, programCounter = nextPc }
            Jmp   -> vm { stack = pop1, programCounter = top1 }
            Cmp   -> vm { stack = signum (top2 - top1) : pop2, programCounter = nextPc }
            Not   -> vm { stack = (if top1 > 0 then -1 else 1) : pop1, programCounter = nextPc }
            Br    -> vm { stack = pop2, programCounter = if top2 > 0 then top1 else nextPc }
            Dup   -> vm { stack = top1 : st, programCounter = nextPc }
            Inc   -> vm { stack = (top1 + 1) : pop1, programCounter = nextPc }
            Dec   -> vm { stack = (top1 - 1) : pop1, programCounter = nextPc }
            Swp   -> vm { stack = top2 : top1 : pop2, programCounter = nextPc }

endState :: VM -> Bool
endState vm = programCounter vm == Vector.length (byteCode vm)

run :: VM -> VM
run = until endState step

runCount :: VM -> (Int, VM)
runCount = untilCount endState step

debug :: (VM -> String) -> VM -> (VM, [String])
debug watch vm =
    if endState vm
        then (vm, [])
        else (nextVm, watch vm : logs)
    where
        (nextVm, logs) = debug watch (step vm)

instructionLogger :: VM -> String
instructionLogger vm = show (toEnum $ byteCode vm ! programCounter vm :: OpCode)

watch :: Int -> VM -> String
watch n vm = case IM.lookup n (memory vm) of
    Nothing -> "undefined"
    Just a  -> show a

composeLoggers :: [VM -> String] -> VM -> String
composeLoggers loggers vm = (intercalate " " . map ($ vm)) loggers

printDebug :: (VM -> String) -> VM -> IO ()
printDebug f v = putStr $ unlines $ snd $ debug f v

Tests

module Tests where

import qualified Parser as Parser
import qualified VM as VM
import qualified Data.IntMap as IM

bubble = unlines
    [ "push 0", "push 1000", "store"
    , "push 0", "push 1001", "store"
    , "push 1000", "load", "load"
    , "push 1001", "load", "load"
    , "cmp", "push 38", "br"
    , "push 1000", "load", "load"
    , "push 1001", "load", "load"
    , "push 1000", "load", "store"
    , "push 1001", "load", "store"
    , "push 1001", "load", "inc", "dup"
    , "push 1001", "store"
    , "push 100", "cmp", "not"
    , "push 10", "br"
    , "push 0", "push 1001", "store"
    , "push 1000", "load", "inc", "dup"
    , "push 1000", "store"
    , "push 100", "cmp", "not"
    , "push 10", "br"
    ]

vm = VM.fromCode $ Parser.sourceToByteCode bubble

vmWithData = vm { VM.memory = IM.fromList $ zip [0..100] [100, 99..0] }

main = print $ VM.run $ vmWithData

dbg = VM.printDebug
    (VM.composeLoggers [VM.instructionLogger, VM.watch 101, show . VM.programCounter])
    vmWithData
  • I don't see anything obviously bad, except that you're benchmarking ghci's interpreter. It's not designed to be efficient. Try compiling with optimizations.
    – Carl
    Commented Jul 22, 2014 at 5:25
  • I did, to profile it. It's obviously much faster but it still hangs at around 500, taking around 4 seconds.
    – Darwin
    Commented Jul 22, 2014 at 8:13
  • I'll take a closer look, then.
    – Carl
    Commented Jul 22, 2014 at 15:05
  • I can't actually compile and test because the Utility module is absent.
    – Carl
    Commented Jul 22, 2014 at 15:12
  • Ok, Utility was only used for untilCount, which I reimplemented. Compiled with -O2, this runs basically instantly.
    – Carl
    Commented Jul 22, 2014 at 15:58

1 Answer


Based on my investigations, I'm going to say that your problem is exactly what I said in the comments: performance testing with ghci.

I modified VM.hs a bit, to get it to build:

{-# LANGUAGE BangPatterns #-}
module VM where

import Parser (ByteCode, OpCode(..), arity)
import qualified Data.IntMap as IM
import Data.Vector (Vector, (!))
import qualified Data.Vector as Vector
import Data.List (intercalate)

data VM = VM
    { byteCode       :: Vector Int
    , programCounter :: Int
    , stack          :: [Int]
    , memory         :: IM.IntMap Int
    } deriving (Show)

fromCode :: ByteCode -> VM
fromCode code = VM
    { byteCode       = Vector.fromList code
    , programCounter = 0
    , stack          = []
    , memory         = IM.empty
    }

step :: VM -> VM
step vm = next
    where
        bc = byteCode vm
        pc = programCounter vm
        st = stack vm
        mm = memory vm
        inst = toEnum $ bc ! pc
        pop1 = tail st
        pop2 = tail pop1
        top1 = head st
        top2 = head pop1
        nextPc = pc + 1
        next = case inst of
            Pop   -> vm { stack = pop1, programCounter = nextPc }
            Push  -> vm { stack = bc ! nextPc : st, programCounter = pc + 2 }
            Add   -> vm { stack = (top1 + top2) : pop2, programCounter = nextPc }
            Sub   -> vm { stack = (top2 - top1) : pop2, programCounter = nextPc }
            Mult  -> vm { stack = (top1 * top2) : pop2, programCounter = nextPc }
            Div   -> vm { stack = (top2 `div` top1) : pop2, programCounter = nextPc }
            Store -> vm { stack = pop2, programCounter = nextPc, memory = IM.insert top1 top2 mm }
            Load  -> vm { stack = mm IM.! top1 : pop1, programCounter = nextPc }
            Jmp   -> vm { stack = pop1, programCounter = top1 }
            Cmp   -> vm { stack = signum (top2 - top1) : pop2, programCounter = nextPc }
            Not   -> vm { stack = (if top1 > 0 then -1 else 1) : pop1, programCounter = nextPc }
            Br    -> vm { stack = pop2, programCounter = if top2 > 0 then top1 else nextPc }
            Dup   -> vm { stack = top1 : st, programCounter = nextPc }
            Inc   -> vm { stack = (top1 + 1) : pop1, programCounter = nextPc }
            Dec   -> vm { stack = (top1 - 1) : pop1, programCounter = nextPc }
            Swp   -> vm { stack = top2 : top1 : pop2, programCounter = nextPc }

endState :: VM -> Bool
endState vm = programCounter vm == Vector.length (byteCode vm)

run :: VM -> VM
run = until endState step

runCount :: VM -> (Int, VM)
runCount = untilCount endState step
    where
        untilCount f g = go 0
            where
                go !n x
                    | f x       = (n, x)
                    | otherwise = go (n + 1) (g x)

debug :: (VM -> String) -> VM -> (VM, [String])
debug watch vm =
    if endState vm
        then (vm, [])
        else (nextVm, watch vm : logs)
    where
        (nextVm, logs) = debug watch (step vm)

instructionLogger :: VM -> String
instructionLogger vm = show (toEnum $ byteCode vm ! programCounter vm :: OpCode)

watch :: Int -> VM -> String
watch n vm = case IM.lookup n (memory vm) of
    Nothing -> "undefined"
    Just a  -> show a

composeLoggers :: [VM -> String] -> VM -> String
composeLoggers loggers vm = (intercalate " " . map ($ vm)) loggers

printDebug :: (VM -> String) -> VM -> IO ()
printDebug f v = putStr $ unlines $ snd $ debug f v

My changes were:

  1. Enable the BangPatterns extension to make it easier to efficiently write untilCount
  2. Remove the import of Utility.
  3. Add untilCount into runCount (a standalone sketch of it follows this list).
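
For context, here is the same counting loop as a standalone definition (it already appears inline in runCount above); the bang pattern keeps the counter evaluated on every iteration, so the loop doesn't accumulate a chain of (n + 1) thunks.

{-# LANGUAGE BangPatterns #-}

-- Standalone version of the helper that runCount uses above.
-- Without the bang on n, each iteration would wrap the counter in
-- another unevaluated (+ 1) thunk instead of keeping a plain Int.
untilCount :: (a -> Bool) -> (a -> a) -> a -> (Int, a)
untilCount p f = go 0
    where
        go !n x
            | p x       = (n, x)
            | otherwise = go (n + 1) (f x)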

I also changed Tests to use runCount just to be sure I was getting the same operation count as you.
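
(I didn't paste the modified Tests.hs; here's a minimal sketch of what that swap might look like, assuming the rest of the file stays exactly as posted above.)

-- Hypothetical replacement for Tests.main: print the step count along
-- with the final VM state by using runCount instead of run.
main :: IO ()
main = print $ VM.runCount vmWithData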

After those changes, this is a sample session:

carl@debian:~/hask/codereview/stackint$ ghc -O2 -main-is Tests Tests.hs
[1 of 3] Compiling Parser           ( Parser.hs, Parser.o )
[2 of 3] Compiling VM               ( VM.hs, VM.o )
[3 of 3] Compiling Tests            ( Tests.hs, Tests.o )
Linking Tests ...
carl@debian:~/hask/codereview/stackint$ time ./Tests
(267252,VM {byteCode = fromList [0,0,0,1000,6,0,0,0,1001,6,0,1000,7,7,0,1001,7,7,9,0,38,11,0,1000,7,7,0,1001,7,7,0,1000,7,6,0,1001,7,6,0,1001,7,13,12,0,1001,6,0,100,9,10,0,10,11,0,0,0,1001,6,0,1000,7,13,12,0,1000,6,0,100,9,10,0,10,11], programCounter = 73, stack = [], memory = fromList [(0,0),(1,1),(2,2),(3,3),(4,4),(5,5),(6,6),(7,7),(8,8),(9,9),(10,10),(11,11),(12,12),(13,13),(14,14),(15,15),(16,16),(17,17),(18,18),(19,19),(20,20),(21,21),(22,22),(23,23),(24,24),(25,25),(26,26),(27,27),(28,28),(29,29),(30,30),(31,31),(32,32),(33,33),(34,34),(35,35),(36,36),(37,37),(38,38),(39,39),(40,40),(41,41),(42,42),(43,43),(44,44),(45,45),(46,46),(47,47),(48,48),(49,49),(50,50),(51,51),(52,52),(53,53),(54,54),(55,55),(56,56),(57,57),(58,58),(59,59),(60,60),(61,61),(62,62),(63,63),(64,64),(65,65),(66,66),(67,67),(68,68),(69,69),(70,70),(71,71),(72,72),(73,73),(74,74),(75,75),(76,76),(77,77),(78,78),(79,79),(80,80),(81,81),(82,82),(83,83),(84,84),(85,85),(86,86),(87,87),(88,88),(89,89),(90,90),(91,91),(92,92),(93,93),(94,94),(95,95),(96,96),(97,97),(98,98),(99,99),(100,100),(1000,101),(1001,0)]})

real    0m0.155s
user    0m0.012s
sys     0m0.116s

Nearly all of that time is sys time, which usually means doing IO. Let me do some proper benchmarking. Proper benchmarking in Haskell involves using the criterion package. I added a new file to contain the criterion code, Main.hs:

import Criterion.Main
import qualified VM
import qualified Tests

main :: IO ()
main = defaultMain [bench "sort" $ whnf (fst . VM.runCount) Tests.vmWithData]

As a quick explanation of Criterion in general - benchmarking in a lazy language can be tricky. Criterion provides tools to let you make sure you're doing it right. I used the whnf function to benchmark counting the number of steps the program runs. Since it's impossible to determine how many steps it runs without actually running it, that ensures that the benchmarking isn't fooled by laziness. And here's another terminal log:

carl@debian:~/hask/codereview/stackint$ ghc -O2 Main.hs
[3 of 4] Compiling Tests            ( Tests.hs, Tests.o ) [flags changed]
[4 of 4] Compiling Main             ( Main.hs, Main.o )
Linking Main ...
carl@debian:~/hask/codereview/stackint$ ./Main
warming up
estimating clock resolution...
mean is 21.01945 us (40001 iterations)
found 2284 outliers among 39999 samples (5.7%)
  678 (1.7%) low severe
  1345 (3.4%) high severe
estimating cost of a clock call...
mean is 16.59960 us (6 iterations)

benchmarking sort
collecting 100 samples, 1 iterations each, in estimated 6.015491 s
mean: 52.26690 ms, lb 50.39930 ms, ub 57.08317 ms, ci 0.950
std dev: 14.30133 ms, lb 6.851806 ms, ub 29.74848 ms, ci 0.950
found 7 outliers among 100 samples (7.0%)
  3 (3.0%) high mild
  4 (4.0%) high severe
variance introduced by outliers: 96.804%
variance is severely inflated by outliers

Criterion gives you a bunch of statistical analysis of its results. It tells me, among other things, that benchmarking in a VirtualBox VM introduces a lot of jitter. That's what all the stuff about variance and outliers is about. However, if you look at the absolute timings, that doesn't matter too much. Even with the inflated variance, the timing ranges from about 50ms to 57ms. In other words, your code is pretty darn fast already.

But if you're going to benchmark, do it properly.


Now, it is possible to improve upon this code a bit. It suffers from some minor excessive laziness.

  1. Change the import of IntMap to Data.IntMap.Strict. This will keep unevaluated expressions from building up in the values of the IntMap (see the one-line sketch after this list).
  2. Add strictness annotations to the fields that would benefit from it in the VM record.
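
For the first change, the only edit is the import line; a minimal sketch, assuming the IM qualifier stays the same so the rest of VM.hs compiles unchanged:

-- Before: the lazy API can leave unevaluated expressions as map values.
-- import qualified Data.IntMap as IM

-- After: the strict API evaluates values as they are inserted.
import qualified Data.IntMap.Strict as IM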

With those two changes, I cut the time spent in the criterion benchmark in half. Here's what I settled on for the definition of VM:

data VM = VM
    { byteCode       :: Vector Int
    , programCounter :: !Int
    , stack          :: [Int]
    , memory         :: !(IM.IntMap Int)
    } deriving (Show)

Note that this is a valid data definition without any extensions. It's basic haskell that putting a ! on a field in a data declaration marks that field as strict. More precisely, it means "when the constructor of this type is evaluated, also evaluate this field to whnf".
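
As a minimal illustration of that rule (throwaway types for this example, not part of the VM code): forcing a constructor with a strict field also forces the field, while a lazy field is left as a thunk.

data LazyBox   = LazyBox   Int
data StrictBox = StrictBox !Int

-- Evaluating the constructor to WHNF leaves a lazy field untouched...
lazyOk :: ()
lazyOk = LazyBox undefined `seq` ()        -- fine, () is returned

-- ...but a strict field is evaluated along with the constructor.
strictBoom :: ()
strictBoom = StrictBox undefined `seq` ()  -- throws once strictBoom is forced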

The byteCode field never changes during execution, so it's not necessary to mark it as strict. Once it's evaluated, it stays evaluated. The stack field is a recursive data type that is always built directly from constructors. It doesn't help anything to make it strict; it's always already in whnf.

Of the two fields that strictness annotations do help on, I was very surprised that it's actually the programCounter field that gets a huge benefit from becoming strict. In retrospect, that's probably because I'm working with GHC 7.8, which automatically unboxes strict "small" fields in data types, and Int is small. The auto unboxing significantly reduces the amount of pointer chasing during the loop, so it does make sense that it would improve things.

To get that same improvement on older versions of GHC, you would have to define the data type as:

data VM = VM
    { byteCode       :: Vector Int
    , programCounter :: {-# UNPACK #-} !Int
    , stack          :: [Int]
    , memory         :: !(IM.IntMap Int)
    } deriving (Show)

The UNPACK pragma indicates to ghc that it should unbox the next field in a data declaration, assuming it is strict. (If it isn't strict, unboxing it would change the semantics, so the pragma is ignored.)

  • Thank you so much for taking the time to review my code.
    – Darwin
    Commented Jul 22, 2014 at 17:00
