Does compilation that produces an interim bytecode (like with Java), rather than going "all the way" to machine code, generally involve less complexity (and thus likely take less time)?
5 Answers
Yes, compiling to Java bytecode is easier than compiling to machine code. This is partly because there is only one format to target (as Mandrill mentions, though this only reduces compiler complexity, not compile time), and partly because the JVM is a much simpler machine, and more convenient to program, than real CPUs: since it was designed in tandem with the Java language, most Java operations map to exactly one bytecode operation in a very simple way.

Another very important reason is that practically no optimization takes place. Almost all efficiency concerns are left to the JIT compiler (or to the JVM as a whole), so the entire middle end of a normal compiler disappears. The compiler can basically walk through the AST once and generate ready-made bytecode sequences for each node. There is some "administrative overhead" of generating method tables, constant pools, etc., but that's nothing compared to the complexities of, say, LLVM.
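To make the "exactly one bytecode operation" point concrete, here is a hypothetical example (the class name is invented for illustration). The comment shows the kind of instruction sequence `javap -c` prints for such a method: each source-level operation becomes one bytecode instruction.

```java
// A minimal illustration of how directly Java maps to bytecode.
// For add() below, javap -c prints essentially:
//   iload_0   // push a
//   iload_1   // push b
//   iadd      // add the two stack operands
//   ireturn   // return the result
public class AddDemo {
    static int add(int a, int b) {
        return a + b;
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3)); // prints 5
    }
}
```

Compile it with `javac AddDemo.java` and run `javap -c AddDemo` to see the actual listing on your JDK.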
- You wrote "...middle end of...". Did you mean "...middle to end of..."? Or maybe "...middle part of..."? – Commented Jun 14, 2015 at 0:55
- @Julian "middle end" is a real term, coined in analogy with "front end" and "back end" with no regard for semantics :) – user7043 Commented Jun 14, 2015 at 1:13
A compiler is simply a program that takes human-readable1 text files and translates them into binary instructions for a machine. If you take a step back and think about your question from this theoretical perspective, the complexity is roughly the same. However, on a more practical level, byte code compilers are simpler.
What broad steps have to happen to compile a program?
- Scanning, parsing and validating source code.
- Converting source to an abstract syntax tree.
- Optional: process and improve the AST if the language specification allows it (e.g. removing dead code, reordering operations, other optimizations)
- Converting the AST to some form that a machine understands.
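The last step above can be sketched with a toy compiler. Everything here (the node types, the instruction names) is invented for illustration; real compilers emit a real instruction set, but the shape of the code generator, a post-order walk that emits a short sequence per node, is the same.

```java
import java.util.ArrayList;
import java.util.List;

// A toy sketch of step 4: walking an AST and emitting
// stack-machine instructions, one small sequence per node.
public class TinyCompiler {
    interface Node { void emit(List<String> out); }

    // A numeric literal compiles to a single push.
    record Num(int value) implements Node {
        public void emit(List<String> out) { out.add("PUSH " + value); }
    }

    // An addition compiles its operands first, then the operator.
    record Add(Node left, Node right) implements Node {
        public void emit(List<String> out) {
            left.emit(out);
            right.emit(out);
            out.add("ADD");
        }
    }

    public static void main(String[] args) {
        // AST for the expression 1 + (2 + 3)
        Node ast = new Add(new Num(1), new Add(new Num(2), new Num(3)));
        List<String> code = new ArrayList<>();
        ast.emit(code);          // a single pass over the AST
        code.forEach(System.out::println);
    }
}
```

Running it prints the five instructions PUSH 1, PUSH 2, PUSH 3, ADD, ADD: a direct, optimization-free translation, which is roughly what a bytecode compiler does.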
There are only two real differences between the two.
In general, a program with multiple compilation units requires linking when compiling to machine code and generally does not with byte code. One could split hairs about whether linking is part of compiling in the context of this question. If so, byte code compilation would be slightly simpler. However, the complexity of linking is made up for at run-time when many linking concerns are handled by the VM (see my note below).
Byte code compilers tend not to optimize as much because the VM can do this better on the fly (JIT compilers are a fairly standard addition to VMs nowadays).
From this I conclude that byte code compilers can omit the complexity of most optimizations and all of linking, deferring both of these to the VM runtime. Byte code compilers are simpler in practice because they shovel many complexities onto the VM that machine code compilers take on themselves.
1Not counting esoteric languages
- Ignoring optimizations and such is silly. These "optional steps" make up a great deal of the code base, complexity, and compile time of most compilers. – user7043 Commented Jun 14, 2015 at 0:34
- In practice, that is correct. I was waxing academic here; I updated my answer. – user22815 Commented Jun 14, 2015 at 1:46
- Is there any language specification that actually forbids optimisations? I understand that some languages make it difficult, but not allow any to begin with? – Davidmh Commented Jun 15, 2015 at 10:10
- @Davidmh I am not aware of any specs that forbid them. My understanding is that most say the compiler is allowed to optimize, but do not go into any detail. Each implementation is different because many optimizations rely on details of the CPU, OS, and the target architecture in general. For this reason a byte code compiler would be less likely to optimize and would instead punt that to the VM, which knows the underlying architecture. – user22815 Commented Jun 15, 2015 at 14:14
I would say that this simplifies the compiler design, since the compilation is always from Java to generic virtual machine code. That also means you only need to compile the code once and it will run on any platform (instead of having to compile on each machine). I am not so sure the compilation time will be lower, since you can consider the virtual machine to be just another standardized target machine.
On the other hand, each machine has to have the Java Virtual Machine loaded so it can interpret the "byte code" (the virtual machine code resulting from compiling the Java code), translate it to the actual machine code, and run it.
IMO this is good for very big programs but bad for small ones, because the virtual machine adds memory overhead.
- I see. So you think the complexity of mapping the bytecode to the standard machine (i.e. the JVM) would match that of mapping the source code to a physical machine, leaving no reason to think bytecode would result in shorter compilation time? – Commented Jun 14, 2015 at 0:00
- That is not what I said. I said mapping Java code to byte code (which is the Virtual Machine assembler) would match that of mapping the source code (Java) to the physical machine code. – Mandrill Commented Jun 14, 2015 at 0:05
The complexity of compilation depends largely on the semantic gap between the source language and the target language and the level of optimization you want to apply while bridging this gap.
For example, compiling Java source code to JVM byte code is relatively straightforward, since there is a core subset of Java that maps pretty much directly to a subset of JVM byte code. There are some differences: Java has loops but no GOTO, the JVM has GOTO but no loops; Java has generics, the JVM doesn't. But those can be easily dealt with (the transformation from loops to conditional jumps is trivial, type erasure slightly less so, but still manageable). There are other differences, but they are less severe.
Compiling Ruby source code to JVM byte code is a lot more involved (especially before invokedynamic and MethodHandles were introduced in Java 7, or more precisely in the 3rd edition of the JVM spec). In Ruby, methods can be replaced at runtime. On the JVM, the smallest unit of code that can be replaced at runtime is a class, so Ruby methods have to be compiled not to JVM methods but to JVM classes. Ruby method dispatch doesn't match JVM method dispatch, and before invokedynamic there was no way to inject your own method dispatching mechanism into the JVM. Ruby has continuations and coroutines, but the JVM lacks the facilities to implement those. (The JVM's GOTO is restricted to jump targets within the same method.) The only JVM control flow primitives powerful enough to implement continuations are exceptions, and to implement coroutines, threads; both are extremely heavyweight, whereas the whole purpose of coroutines is to be very lightweight.
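For a flavor of the Java 7 machinery mentioned above, here is a small sketch of the java.lang.invoke API: looking a method up at runtime and calling it through a MethodHandle, which is the building block that invokedynamic call sites bind to. The specific method looked up (String.concat) is just a convenient example.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// A minimal sketch of the java.lang.invoke API: resolve a method
// at runtime and invoke it through a MethodHandle. Language
// runtimes like JRuby use this machinery (via invokedynamic) to
// plug their own dispatch rules into the JVM.
public class HandleDemo {
    public static void main(String[] args) throws Throwable {
        // Handle of type (String, String) -> String for String.concat.
        MethodHandle concat = MethodHandles.lookup().findVirtual(
            String.class, "concat",
            MethodType.methodType(String.class, String.class));
        // Equivalent to "foo".concat("bar"), but resolved dynamically.
        System.out.println((String) concat.invokeExact("foo", "bar"));
    }
}
```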
OTOH, compiling Ruby source code to Rubinius byte code or YARV byte code is again trivial, since both of those are explicitly designed as a compilation target for Ruby (although Rubinius has also been used for other languages such as CoffeeScript, and most famously Fancy).
Likewise, compiling x86 native code to JVM byte code is not straightforward; again, there is a pretty large semantic gap.
Haskell is another good example: there are several high-performance, industrial-strength, production-ready Haskell compilers that produce native x86 machine code, but to this date there is no working compiler targeting either the JVM or the CLI, because the semantic gap is so big that it is very complex to bridge. So this is an example where compilation to native machine code is actually less complex than compiling to JVM or CIL byte code. This is because native machine code has much lower-level primitives (GOTO, pointers, …) that can more easily be "coerced" into doing what you want than higher-level primitives such as method calls or exceptions.
So, one could say that the higher level the target language is, the more closely it has to match the semantics of the source language in order to reduce the complexity of the compiler.
In practice, most JVMs today are very complex pieces of software that do JIT compilation (so the bytecode is dynamically translated to machine code by the JVM).
So while the compilation from Java source code (or Clojure source code) to JVM byte code is indeed simpler, the JVM itself performs a complex translation to machine code.
The fact that this JIT translation inside the JVM is dynamic lets the JVM focus on the most relevant parts of the bytecode. Practically speaking, most JVMs spend the most optimization effort on the hottest parts of the bytecode (e.g. the most frequently called methods, or the most frequently executed basic blocks).
I'm not sure that the combined complexity of the JVM plus the Java-to-bytecode compiler is significantly less than the complexity of an ahead-of-time compiler.
Notice also that most traditional compilers (like GCC or Clang/LLVM) transform the input C (or C++, or Ada, ...) source code into an internal representation (Gimple for GCC, LLVM IR for Clang) which is quite similar to some bytecode. Then they transform that internal representation into object code: first optimizing it into itself (most GCC optimization passes take Gimple as input and produce Gimple as output), then emitting assembler or machine code from it.
BTW, with the recent GCC (notably libgccjit) and LLVM infrastructure, you can use them to compile some other (or your own) language into their internal Gimple or LLVM representations, and then benefit from the many optimization abilities of the middle-end and back-end parts of these compilers.