How to make a fast dynamic language interpreter

250 points by pizlonator 5 days ago|59 comments

•

pansa2 5 days ago

In a similar vein, see this page about the performance of the interpreter for the dynamic language Wren: https://wren.io/performance.html

Unlike the Zef article, which describes implementation techniques, the Wren page also shows ways in which language design can contribute to performance.

In particular, Wren gives up dynamic object shapes, which enables copy-down inheritance and substantially simplifies (and hence accelerates) method lookup. Personally I think that’s a good trade-off - how often have you really needed to add a method to a class after construction?

•

versteegen 5 days ago

Yes, language design is a hugely important determinant of interpreter or JIT speed. There are many highly optimised VMs for dynamic languages but LuaJIT is king because Lua is such a small and suitable language, and although it does have a couple difficult to optimise features, they are few enough that you can expend the effort. It's nothing like Python. It's not much of an exaggeration to say Python is designed to minimise the possibility of a fast JIT, with compounding layers of dynamism. After years of work, the CPython 3.15 JIT finally managed ~5% faster than the stock interpreter on x86_64.

•

pjmlp 4 days ago

CPython current state is more a reflection of resources spent, than what is possible.

See experience with Smalltalk and Self, where everything is dynamic dispatch, everything is an object, in a live image that can be monkey patched at any given second.

PyPy and GraalPy, and the oldie IronPython, are much better experiences than where CPython currently stands on.

•

dec0dedab0de 4 days ago

The problem is that AI has been dominating the conversation for so many years, and they'll get more improvements from removing the GIL than they would from adopting the PyPy JIT.

The JIT would help everyone else more than removing the GIL, I wish PyPy became the reference implementation during 2.7

•

pjmlp 4 days ago

Actually because AI has been driving the conversation that CPython JIT efforts are finally happening and being upstreamed.

It is also because of AI, that Intel, AMD and NVidia are now getting serious about Python GPU JITs, that allow writing kernels in a Python subset.

To the point that I bet Mojo will be too late to matter.

•

dontlaugh 4 days ago

Python is worse, but not by all that much. After all, PyPy has been several times faster for many years.

•

vlovich123 4 days ago

That is an incorrect analysis. CPython is difficult to JIT because of the lack of thought to the native bindings / extensions, not because of the language itself (as others point out PyPy was way faster long ago)

•

versteegen 4 days ago

You're correct. I neglected that; extension API compatibility is a big (the most important?) difference between PyPy and CPython's JIT. Amongst language features that affect optimisation potential, an extension API can be the worst.

Edit: I think what you're alluding to is that tracing JITs can overcome a lot of dynamic language features which make things hopeless for method JITs. Where LuaJIT really shines vs PyPy is outside of JITed loops. (Also memory and compile overheads). I realise this is a bit of a motte and bailey.

•

psychoslave 5 days ago

That’s basically what is done all the time in languages where monkey patching is accepted as idiomatic, notably Ruby. Ruby is not known for its speed-first mindset though.

On the other side, having a type holding a closed set of applicable functions is somehow questioning.

There are languages out there that allows to define arbitrary functions and then use them as a methods with dot notation on any variable matching the type of the first argument, including Nim (with macros), Scala (with implicit classes and type classes), Kotlin (with extension functions) and Rust (with traits).

•

pjmlp 4 days ago

It is getting better, now that they finally got the Smalltalk lessons from 1984.

"Efficient implementation of the smalltalk-80 system"

https://dl.acm.org/doi/10.1145/800017.800542

•

IshKebab 4 days ago

> Ruby is not known for its speed-first mindset though.

Or its maintainability, and this is one of the big reasons why. Methods and variables are dynamically generated at runtime which makes it impossible to even grep for them. If you have a large Ruby codebase (say Gitlab or Asciidoctor), it can be almost impossible to trace through code unless you are familiar with the entire codebase.

Their "answer" is that you run the code and use the debugger, but that's clearly ridiculous.

So I would say dynamically defined classes is not only bad for performance; it's just bad in general.

•

psychoslave 4 days ago

That's yet an other topic, as monkey patching can definitely be explicit in ruby. The dynamically generated things at runtime are generally through the catch all method missing facility that can be overwritten. This can also be done in, say, PHP. It just that the community is less fond of it. Not sure about what most popular ahead of time oriented languages expose as facility in this area, obviously one can always even decide to generate automodifying executable. There is nothing special about ruby when it comes to go into forbidden realms, except maybe it doesn't come to much in your way when you try to express something, even if that is not the most maintenance friendly path.

•

igouy 4 days ago

> Ruby is not known for its speed-first mindset though.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

•

naasking 4 days ago

> In particular, Wren gives up dynamic object shapes, which enables copy-down inheritance and substantially simplifies (and hence accelerates) method lookup.

A general rule of thumb is that if you can assign an expression a static type, then you can compile it fairly efficiently. Complex dynamic languages obviously actively fight this in numerous ways, and so end up being difficult to optimize. Seems obvious in retrospect.

•

jiusanzhou 5 days ago

The jump from change #5 to #6 (inline caches + hidden-class object model) doing the bulk of the work here really tracks with how V8/JSC got fast historically — dynamic dispatch on property access is where naive interpreters die, and everything else is kind of rounding error by comparison. Nice that it's laid out so you can see the contribution of each step in isolation; most perf writeups just show the final number.

•

jimmypk 4 days ago

@jiusanzhou The interesting implementation detail in change #6 is how the inline caching is done in an AST-walking interpreter specifically. In bytecode interpreters, IC rewriting is natural — the "cache site" is a stable byte offset in the bytecode stream you can patch. Here, the cache site is an AST node, so @pizlonator uses placement new to construct a specialized AST node on top of the generic one in-place (via constructCache<>). It's self-modifying code at the AST level.

The tradeoff is that this requires mutable AST nodes, which conflicts with the immutable-AST assumption most compilers rely on (e.g., for sharing subtrees or parallelizing compilation). For a single-threaded interpreter it works cleanly, but it'd be a problem if you wanted to JIT-compile from the same AST on a background thread while the interpreter is mutating nodes.

•

Someone 4 days ago

I agree, but there’s a tiny caveat that this is for one specific benchmark that, I think, doesn’t reflect most real-world code.

I’m basing that on the 1.6% improvement they got on speeding up sqrt. That surprised me, because, to get such an improvement, the benchmark must spend over 1.6% of its time in there, to start with.

Looking in the git repo, it seems that did happen in the nbody simulation (https://github.com/pizlonator/zef/blob/master/ScriptBench/nb...).

•

pizlonator 4 days ago

Before that specialization, sqrt calls were hilariously slow - so even calling it sparingly could significantly impact performance.

Basically the flow was:

- check if we’re calling a method of an object

- nope, ok, so cascade through 10+ symbol comparisons

- sqrt was towards the bottom of the cascade

•

grg0 5 days ago

Interesting, thanks for sharing. It is a topic I'd like to explore in detail at some point.

I also like how, according to Github, the repo is 99.7% HTML and 0.3% C++. A testament to the interpreter's size, I guess?

•

pizlonator 5 days ago

I committed the statically generated site, which is wastefully large because how I generate the code browsers

But yeah the interpreter is very small

•

vascocosta 4 days ago

Really interesting read, especially after I released the initial version of my own interpreter which is also an AST-walking interpreter. My main goal was to understand at a basic level what it takes to build an interpreted programming language.

I didn't want any optimisation complexities and just focused on being able to understand my own Rust code. I was surprised by the performance I got simply by using my favourite language and as a bonus, since Rust takes care of all the ownership and lifetimes, I don't need a garbage collector. For sure, right now I'm being super conservative and rely on cloning stuff to avoid lifetime hell in stuff like closures, but the speed and memory profile is still very decent.

For anyone interested in a simple to understand tree-walking interpreter in Rust, which is heavily based in expressive enums where code is data, here's my interpreter:

https://gluonscript.org/

•

lambertsimnel 3 days ago

GluconScript looks cool, but this sounds too good to be true:

> as a bonus, since Rust takes care of all the ownership > and lifetimes, I don't need a garbage collector.

I can imagine GluconScript's memory handling comes at a cost, even if the tradeoff of using a borrow checker is well worth it. Was that your experience?

Relatedly, since you commented there has been submission about garbage collectors in Rust ("Garbage Collection Without Unsafe Code"):

https://news.ycombinator.com/item?id=47821853

•

vascocosta 3 days ago

GluonScript really doesn't need a garbage collector, because at the end of the day when a script is executing, it is just safe Rust code evaluating expressions. So it's just a Rust program running really, with its automatic static memory management guarantees.

As far as I researched, only closures could generate dangling references and therefore need memory cleanup, but only if I allowed closures to access their environment (variables and functions) by reference / mutable reference.

To avoid this and simplify both my code as well as the mental model for the users of GluonScript, as of now, closures capture their environment by cloning it immutably. There's an increased memory usage with all the copying of the environment but there are never references to something that isn't being used anymore and therefore no need for a GC. At the end of the day all values captured by closures are owned Rust values that are dropped by Rust when no longer in scope.

So this can lead to high memory usage in hot loops but it can't lead to memory leaks.

•

injidup 5 days ago

What is this YOLO-c++ compiler that is referenced in the article? Google searches turn up nothing and chatgpt seems not to know it either.

•

electroly 5 days ago

The author of Fil-C, who is also the author of this language, uses "Yolo-C/C++" to mean regular C/C++ without Fil-C.

•

tikotus 4 days ago

This is very interesting and well done.

I've gone through something similar, but for a more functional language (a Scheme). It's interesting how here the biggest wins are from optimizing the objects, while the biggest wins in my case were optimizing closures. The optimizations were very similar.

"Three implementation models for scheme" gives all the answers to make a fast enough scheme, though it has something of a compilation step, so it's not interpreting the original AST.

https://www.cs.unm.edu/~williams/cs491/three-imp.pdf

•

valorzard 5 days ago

Do you think this exercise has taught you anything that could make fil c itself better?

•

pizlonator 4 days ago

Yeah I really need to have a better fix for how I handle unions.

And the fact that having outline calls to methods of value objects is so expensive

•

achierius 4 days ago

> And the fact that having outline calls to methods of value objects is so expensive

Is this tied to unions? Or otherwise, when does this happen? I don't see the connection w/ invisicaps or &c

•

pizlonator 2 days ago

In Fil-C, currently, all stack allocations that “escape” need to be allocated in the heap.

“Escape” is defined very loosely; it currently means: some function other than the one that owns the stack allocation needs a pointer to that allocation.

For example even if you could prove that `bar(Value* p)` never stashes p anywhere, the fil-C compiler will currently heap allocate that value anytime bar is called. The one exception is if bar had already been inlined, and so from the FilPizlonator’s perspective there isn’t even a call.

This is clearly dumb and fixable. It’s dumb because lots of functions aren’t worth inlining but their body is analyzable. Slow paths are like that. It’s fixable because those slow paths - and lots of code like them - takes ptrs as arguments and then obviously just uses them for loads and stores but doesn’t escape them any further.

You’ll sometimes hear me say that Fil-C is nowhere near as optimal as it could be. This is just one example of that

•

boulos 5 days ago

How's your experience with Fil-C been? Is it materially useful to you in practice?

•

pizlonator 5 days ago

I’m biased since I’m the Fil.

It was materially useful in this project.

- Caught multiple memory safety issues in a nice deterministic way, so designing the object model was easier than it would have been otherwise.

- C++ with accurate GC is a really great programming model. I feel like it speeds me up by 1.5x relative to normal C++, and maybe like 1.2x relative to other GC’d languages (because C++’s APIs are so rich and the lambdas/templates and class system is so mature).

But I’m biased in multiple ways

- I made Fil-C++

- I’ve been programming in C++ for like 35ish years now

•

thesz 2 days ago

If I may (and correctly understand what is going inside Fil-C), it would be not so hard to add support for software transactional memory by adding some library calls.

This will greatly reduce coordination bugs in parallel programs and may even speed things up.

•

HarHarVeryFunny 4 days ago

Are you using malloc + GC in preference to smart pointers, and if so why? I thought Fil-C was just C not C++?

It doesn't seem like that is necessarily a performance win, especially since you could always use a smart pointer's raw pointer (preferably const) in a performance critical path.

•

vlovich123 5 days ago

I’m curious. Given the overheads of Fil-C++, does it actually make sense to use it for greenfield projects? I like that Fil-C fills a gap in securing old legacy codebases, I’m just not sure I understand it for greenfield projects like this other than you happen to know C++ really well.

•

pizlonator 5 days ago

It made sense because I was able to move very quickly, and once perf became a problem I could move to Yolo-C++ without a full rewrite.

> happen to know C++ really well

That’s my bias yeah. But C++ is good for more than just perf. If you need access to low level APIs, or libraries that happen to be exposed as C/C++ API, or you need good support for dynamic linking and separate compilation - then C++ (or C) are a great choice

•

vlovich123 5 days ago

Hmmm… I did about 20+ years of C++ coding and since I’ve been doing Rust I haven’t seen any of these issues. It has trivial integrations with c/c++ libraries (often with wrappers already written), often better native libraries to substitute those c++ deps wholesale, and separate compilation out of the box. It has dynamic linking if you really need it via the C ABI or even rlib although I’ll grants the latter is not as mature.

The syntax and ownership rules can take some getting used to but after doing it I start to wonder how I ever enjoyed the masochism of the rule of 5 magic incantation that no one else ever followed and writing the class definition twice. + the language gaining complexity constantly without ever paying back tech debt or solving real problems.

•

tiffanyh 5 days ago

I see Lua was included, wish LuaJIT was as well.

•

pizlonator 5 days ago

I bet LuaJIT crushes Zef! Or rather, I would hope that it does, given how much more engineering went into it

There are many runtimes that I could have included but didn’t.

Also, it’s quite impressive how much faster PUC Lua is than QuickJS and Python

•

raincole 5 days ago

Because QuickJS is really slow. Don't be fooled by the name. It's almost an order of magnitude slower than node/v8.

(I suppose the quick in QuickJS means "quick for a pure interpreter without JIT compilation or something...)

•

pizlonator 5 days ago

based on this data, it’s probably slower than JSC’s or V8’s interpreter

So like that’s wild

•

zephen 5 days ago

> it’s quite impressive how much faster PUC Lua is than QuickJS and Python

Python's execution time is mostly spent looking up stuff. I don't think lua is quite as dynamic.

•

pizlonator 5 days ago

Lua is way more dynamic

•

mikemike 4 days ago

To illustrate this, here's the contorted Lua code from https://news.ycombinator.com/item?id=11327201

    local t = setmetatable({}, {
      __index = pcall, __newindex = rawset,
      __call = function(t, i) t[i] = 42 end,
    })
    for i=1,100 do assert(t[i] == true and rawget(t, i) == 42) end

Arguably this exercises only the slow paths of the VM.

A more nuanced take is that Lua has many happy fast paths, whereas Python has some unfortunate semantic baggage that complicates those. Another key issue is the over-reliance on C modules with bindings that expose way too many internals.

•

zephen 4 days ago

> A more nuanced take is that Lua has many happy fast paths, whereas Python has some unfortunate semantic baggage that complicates those.

This is a good way to describe it. Most of the semantic baggage doesn't make some speed improvements, up to and including JITing, impossible, but it certainly complicates them.

And of course, any semantic baggage will be useful to someone.

https://xkcd.com/1172/

•

zephen 5 days ago

I suppose it depends on where you are looking for dynamicity. In some ways, lua is much more laissez faire of course.

But in Python, everything is an object, which is why, as I said, it spends much of its time looking things up. And things like bindings for closures are late, so that's more lookups as well.

In lua, many things aren't objects, and, for example, you can add two numbers without looking anything up. Another issue, of course, when you do that, is that you could conceivably overflow an integer, but that can't happen in Python either.

The Python interpreter has some fast paths for specific object types, but it is really limited in the optimizations it can do, because there simply aren't any unboxed types.

•

psychoslave 5 days ago

Nop, Python is not full object. Not even Ruby is fully object, try `if.class` for example. Self, Smalltalk, Lisp, and Io are fully object in that sense. But none as far as I know can handle something like `(.class`.

•

zelphirkalt 4 days ago

Aren't you mixing up syntax and the concepts it expresses? Why would (.class have to be a thing? Is space dot class a thing? I don't think this makes sense and it doesn't inform about languages "being fully object". Such syntax is merely for producing an AST and that alone doesn't mean "object" or "not object". It could just as well be all kinds of different things, or functions, or stack pushes and pops or something.

•

gdwatson 4 days ago

I think the idea is that SmallTalk replaced conditional syntax with methods on booleans. You could call `ifTrue:` on a boolean, passing it a code block; a true boolean would execute the block, and a false boolean would not. (There was also an `ifFalse:` method.)

This feels more like a party trick than anything. But it does represent a deep commitment to founding the whole language on object orientation, even when it seems silly to folks like me.

•

true_religion 4 days ago

Linguistically, it meant your control structures looked the same as native language control structures so there was never any dividing line visually between your code and the system.

It also made it really easy to ingest code, and do meta programming.

•

psychoslave 4 days ago

>Why would (.class have to be a thing?

It doesn’t have to in the absolute. It just that if some speech seel that a programing language is completely object oriented, it’s fun to check to which point it actually is.

There are many valid reasons why one would not to do that, of course. But if it’s marketed as if implicitly one could expect it should, it seems fair to debunk the myth that it’s actually a fully object language.

>Is space dot class a thing?

Could be, though generally spaces are not considered like terms – but Whitespace shows it’s just about what is conventionally retained.

So, supposing that ` .class` and `.class` express the same value, the most obvious convention that would come to my mind then would be to consider that it’s applied to the implicit narrower "context object" in the current lexical scope.

Raku evaluate `.WHAT` and `(.WHAT)` both as `(Any)` for giving a concrete example of related choice of convention.

>Such syntax is merely for producing an AST and that alone doesn't mean "object" or "not object".

Precisely, if the language is not providing complete reflection facility on every meaningful terms, including syncategorematic ones, then it’s not fully object. Once again, being almost fully object is fine, but it’s not being fully object.

https://en.wikipedia.org/wiki/Syncategorematic_term

•

zephen 5 days ago

You obviously realize that different languages have different syntactic requirements, yet you are willing to cut one language a break when its minimal syntactical elements aren't objects, and refuse to cut other languages a break because they have a few more syntactical elements?

•

pizlonator 4 days ago

I think you’re describing deficiencies in the Python impl not anything about the language

•

zephen 4 days ago

> I think you’re describing deficiencies in the Python impl not anything about the language

To some extent, sure. And, looking at your implementation of your language, something like the optimizations on passing small numbers of parameters could probably help Python out. It spends an inordinate amount of time packing and unpacking parameter tuples.

But, for example, you can easily create a subclass of an integer and alter a small portion of its behavior, without having to code every single operation, which I don't think you can do in lua.

So, the dynamicity I'm describing is what the language has to do (more work at runtime) to support its own semantics.

Don't get me wrong. There are certainly opportunities to make Python go faster, and the core team is working on some of them (for example, one optimization is similar to your creation of additional subtree nodes for attribute lookup for known cases, but in bytecode instead), but I also think that the semantics of Python make large classes of optimization more difficult than for other languages.

For a major example of this kind of dynamicity, lua doesn't chain metatables when looking up metamethods, but Python will look stuff up in as many tables as you have subclasses, and has the complexity of dealing with MRO. That's not something that couldn't be JITed, but the edge cases of what you need to update if someone decides to add or modify a method in a superclass get pretty hairy pretty quickly.

Whereas, in lua, if you want to modify a metamethod and have it affect a particular object, yes, absolutely, you can do that, but it is up to you to modify the direct metatable of the object, rather than some ancestor, because lua is not going to dynamically follow the chain of references on every lookup.

And, back to to the parameter optimization case, I haven't thought that much about it, but there are a lot of Python edge cases in parameter passing, that might make that difficult.

And, of course, the use of ref counting instead of mark/sweep has a cost, but people don't like, e.g., PyPy, because their __del__ methods aren't guaranteed to be called immediately when the object goes out of scope. Lua is more like PyPy in this respect.

So Python has a lot of legacy decisions that make optimization harder.

Then things that try to be called Python often take shortcuts that make things faster, but don't get any traction, because they aren't 100% compatible.

So cPython is a Schelling point with semantics that are more complicated than some other language Schelling points, with enough momentum that it becomes difficult for other Python implementations to keep up with the standard, while simultaneously having enough inertia to keep people engaged in using it even though the optimizations are coming slowly.

I think the sPy language (discussed here a few weeks ago) has the right idea. "Hey, we're not Python, but if you like Python you might like us." Things that claim to be Python but faster either wither on the vine because of incompatibilities with cPython, or quickly decide they aren't really Python after they've lost their, ahem, Mojo, or both.

(The primary exception to this is microPython, which has a strong following because it literally can go where no other Python can go.)

•

tnelsond4 4 days ago

I use the bounds checker in TCC to check for memory errors in C, should I switch to Fil-C instead to debug my code? Obviously yolo-C is my target.

•

_alphageek 4 days ago

Good writeup. The Arguments arc (#7→#13) hits close — did basically the same dance for an async step evaluator in Rust a while back. Went all in on Cow<'_, Input> assuming borrow-in-the-common-case would earn its keep. Microbenches looked great. Real workload: the Cow discriminant plus lifetime gunk bled into every combinator past the first await, inlining fell off a cliff, the whole point of Cow evaporated. Ripped it out for NoInput / OneInput<T> / MultiInput(Vec<T>) at the evaluator boundary — same split as your ZeroArguments / OneArgument / TwoArguments, just arrived at the ugly way. One thing I keep wondering: have you stacked arity specialization with type specialization on the native path? Binary<add, int, int> style, drops the isInt probe altogether. Guessing the code size math didn't work out, or ICs are already soaking up whatever's hot on the object side so the native fast paths don't matter much. Which one?

•

catlifeonmars 4 days ago

Do you run an optimization pass on the AST between parsing and evaluation?

•

HarHarVeryFunny 4 days ago

For an interpreter / AST executor, I think a big win would be efficient parsing in the first place, in particular using a precedence parser for expressions vs recursive descent, which would avoid the need to optimize the AST to remove the 1:1 "unit productions" in the grammar.

•

pizlonator 4 days ago

I run a “resolve” pass.

That’s where for example getter inference happens.