Std::vector v = script.getIntVector ('somearray') Another useful thing is getting list of table keys and calling Lua functions from C. Part 2 will cover those two features. You can also call Lua functions from C. And C functions from Lua. This will probably be in Part 3.
Product & Tech
The Roblox engine is written in a combination of C++ and Lua, with the code that performs computationally intensive operations written in optimized C++, while game logic and scripts are written in Lua, for ease of development. For this model to be effective, it requires that the transitions between Lua and C++ be as fast as possible, as any time spent in this no man’s land is essentially just wasted milliseconds.
Over the past couple of months, we’ve been rolling out various improvements to this part of the system. One part specifically—the actual invocation of C++ methods from Lua—was particularly interesting, as it led to considerable speed improvements and required digging around in the guts of Lua to understand how things worked under the hood.
We ended up modifying the Lua VM itself, but before we get to that, we need to lay some groundwork.
Compilers, VM, and bytecode
When Lua source code is compiled, it’s compiled into Lua bytecode, which the Lua VM then runs. Lua bytecode has around 35 instructions in total, for things like reading/writing tables, calling functions, performing binary operations, jumps and conditionals, and so on. The Lua VM is register-based, as opposed to being stack-based like many other VMs, so part of what the compiler does when it generates bytecode is determine which registers each instruction should use.
Each instruction has the form “OP_CODE A B,” or “OP_CODE A B C,” where “OP_CODE” is the opcode (for example, CALL for calling a function) and A/B/C are the opcode arguments. The arguments (or registers) aren’t actual values. Instead, they are indices that point into one of two tables: the constant table (written Kst(..)) or the register table (written R(..)).
For a detailed description of Lua bytecode, see “A No-Frills introduction to Lua 5.1 VM Instructions.” It’s a lot more exciting than it sounds; I promise!
To give you a feel for what Lua bytecode looks like, we’re going to go over some simple programs first and then progress to some more relevant examples.
Using the Chunkspy utility, we can disassemble Lua bytecode into Lua assembly and get a listing of the code, as well as the constant table, so we can essentially see what bytecode gets generated for any given Lua source code.
Basic bytecode examples
A simple program like “x = 10” compiles into:
The first two lines show the constant table (with the string value “x” in slot 0 and the integer value 10 in slot 1), and the following two lines are the disassembled opcodes.
[Line 1] Looking up the LOADK opcode in “No Frills,” we see that it has the form “LOADK A Bx — R(A) := Kst(Bx).” So, LOADK has two arguments (registers A and B) and its operation is to assign the value found in the constant table at the slot given by the second register, Kst(Bx), to the register table in the slot given by the first argument, R(A). “Bx” just means that because the opcode only has two arguments, the B register is extended and assigned more bits.
[Line 2] SETGLOBAL has the form “SETGLOBAL A Bx — Gbl[Kst(Bx)] := R(A).” It assigns a value to the global table using the key given by the constant table at the slot of the second argument. Because the second argument is 0 and the value of the constant table at 0 is “x,” it writes something to the global table using the key “x.” Whatever is in the register table at the slot given by the first argument is what’s being written, which the previous instruction loaded with the value 10.
Let’s look at a slightly more complicated example, “x = 10; y = x.” I’ll leave the manual execution of the code as an exercise for the reader. 🙂
Bytecode for function calls
Let’s look at the code generated for “foo(10):”
To execute function calls, the function must be loaded into the first register and arguments into subsequent ones. The semantics for “CALL A B C” are such that A holds the function, B is the number of arguments (actually, it’s the number of arguments +1, due to the way “…” is implemented), and C is the number of return values (again, it’s the number of return values +1, to handle multiple return values).
We’re familiar with the first two lines; they load a value into register table slot 0 and the value 10 into register table slot 1. The third line is what performs the function call, using the value in register A (register table slot 0, which was loaded with “foo”), with B specifying the number of arguments, and C the number of return values (remember, both B and C should have 1 added to their values). Before the function can be called, the VM also verifies that the value in R(A) is in fact callable.
Interlude — Metafunctions, metatables, and userdata
Lua has a mechanism that allows users to extend the functionality of tables by associating a metatable with an existing table. The metatable contains fallback methods that are invoked if a certain method or operation can’t be performed on the main table (see https://www.lua.org/pil/13.html for a thorough description).
For our purposes, the most relevant entries in the metatable are the “__index” and “__call” fields. __index is used when looking up an element in a table, so the code “local x = my_table” would first call the __index method on my_table. If that failed, it would instead try to call __index on my_table’s metatable. __call is similarly used when you try to treat something as a function and call it “local x = foo(42),” for example
In order for Lua and C++ to interoperate, they need some way to be able to share functions and data. Lua facilitates this by providing a data type called UserData. UserData objects can be created in C++ land, and because they are native Lua data types, they can be adorned with metatables that allows Lua code to interact with them as if they were just regular Lua objects.
Member function calls
Okay, back to looking at some bytecode! This next example is a bit more interesting because it shows what happens when you have code like “foo:bar(10),” which is calling the bar method on the instance foo (an instance of the class Foo).
The new thing here is the self instruction [line 2], which we haven’t seen before. Self has the syntax “SELF A B C — R(A) := R(B)[RK(C)]; R(A+1) := R(B),” so let’s break this down. In the register table at slot R(A), it will place the result of looking up the table in register slot R(B) using the key in slot RK(C). It will also copy whatever was in slot R(B) into slot R(A+1), but more on this later. You might notice that the value of the C register is 257. This is valid because Lua is using RK(C) to look up the value, and RK will either use the register table or the constant table, depending on the value of 9:th bit. If it’s a 1, which it is in this case, then the constant table is used; otherwise, the lookup goes to the register table (after masking out the highest bit).
Line 3 places 10 in slot 2, and finally line 4 will execute the function call.
The SELF instruction serves two purposes. First, it looks for the “bar” method on the Foo class, and places it in R(A). Secondly, because foo is an instance method and we need the instance of the class that we’re invoking the method on when doing the call, it places this instance in R(A+1). If you’re familiar with classes in Python, then you might recognize the concept: methods are usually written as “def my_method(self, arg1, arg2..),” where self is the class instance.
We’ll need to dig into this a little deeper and take a look at what happens when the foo instance is a C++ object, represented in Lua as a UserData object.
The SELF call can be seen as a table lookup, i.e. Foo[“bar”] (capital Foo represents the class Foo, as opposed to foo, the instance), and we know that lookups will use the __index method. When the foo instance was created in C++ land, a metatable was associated with the instance, and the metatable had its __index field set to a piece of C++ code that will get called when __index is invoked.
Interlude — lua_State
When C/C++ gets called from Lua, the only data that gets passed is a lua_State object. This object contains everything related to the currently running Lua thread. The most important information in the state object is the Lua stack, which contains the function arguments (accessed via the lua_tointeger/tostring etc family of functions) and is also used to return values back to Lua.
In pseudo-C++, our __index function looks something like this:
Lots of internals are glossed over, but here’s the gist of it. Given that the UserData object passed as the first argument on the Lua stack, we’re able to find a descriptor that describes the actual C++ class, and via the descriptor we can see if this class has a method with the given name. If it does, a function pointer to a method invoker is pushed on the Lua stack, and we return success.
After this call, the Lua VM will place the rest of the arguments in the register table, and then call the function we returned from the metaIndex method, which will again call into C++, and land in the invoker function:
The methodInvoker also uses the ClassDescriptor, but this time it’s able to invoke the member function, and pop the correct arguments from the stack.
The home stretch!
Now that we can clearly see the two round trips from Lua to C++, we can try to figure out how to optimize this.
Our end goal is to do a single function call from Lua to C++ and have all the pieces we need on the Lua stack to be able to do method lookup and invocation at once. The problem seems to be that we have one register too few. When we call our combined lookup/invoker function, we want the Lua stack to look like [self, method name, arg1, arg2, …], but looking at SELF, we see that it uses its first slot for the result of looking up the method function and the second slot for storing the instance.
A key realization came when we looked at the way the __call metamethod worked. If an object has the __call metamethod, then before the _call function gets invoked, the object itself is pushed on the stack and all arguments are shifted up. By piggy-backing on this functionality, there was a way of getting “self” on the stack without having to explicitly store it in a register.
The second part involved getting the method name on the stack as well. For this, we had to be sneaky and alter the workings of the SELF opcode.
Remember that in the default case, SELF would try to find the member function and store this in R(A) together with the instance in R(A+1). We ended up skipping the lookup altogether and stored the actual object in R(A) and the method name in R(A+1).
If we now made sure that the object in R(A) had a __call metamethod, then we’d also end up pushing self on the stack. So, we’d have a stack that looked like [self, method name, args…] and did just a single call into C++. Perfect! Well, almost. 🙂
Before considering this done, we wanted to put some final polish on it. We didn’t want to overload the semantics of the __call metamethod, so instead we added a specific metamethod for this type of invocation—called __namecall—that was only available on UserData objects. We also modified the SELF opcode so it only uses the new semantics if the object has a __namecall metamethod.
The second thing we did was mostly make the new path and the old path able to share code easily. Instead of having the method-name as the second argument, we pushed it to the last argument. So, after it had been used to look up the method pointer, it could easily be popped off the stack and the stack looked like it did if the function had been invoked via the old path.
How much of an impact does this optimization have? Well, like most things in programming, the answer is “it depends.” For functions that are heavyweight—and you don’t call that often—you won’t see much of an improvement. But for smaller functions that you call often, the savings can be considerable.
People on the Developer Forum quickly noticed the appearance of this strange, new metamethod, and a table was presented that compared the speed of __namecall against both the old method of calling instance methods and against a workaround that developers had been using to optimize method calling:
The first loop uses the new __namecall code path, but because all the magic happens under the hood, there is no need for developers to change any existing code to benefit from the optimization.
The second loop emulates the old way of doing an instance method call; first doing a lookup to find the method and then invoking it.
And finally, the third loop shows a common optimization that developers were doing, where the method was first looked up, stored in a local variable, and then the variable was invoked.
The nice thing here is that it shows that with the __namecall optimization, it’s no longer necessary to explicitly cache instance functions, as it’s just as fast as the cached optimization, so the most straightforward code will also be the most performant.
Now that __namecall has been deployed, and we’re happy with the results we’re seeing, it’s time to turn our focus to memory usage, and see what we can do to improve the client in that area!
The FFI library allows calling external C functions andusing C data structures from pure Lua code.
The FFI library largely obviates the need to write tedious manualLua/C bindings in C. No need to learn a separate binding language— it parses plain C declarations! These can becut-n-pasted from C header files or reference manuals. It's up tothe task of binding large libraries without the need for dealing withfragile binding generators.
The FFI library is tightly integrated into LuaJIT (it's not availableas a separate module). The code generated by the JIT-compiler foraccesses to C data structures from Lua code is on par with thecode a C compiler would generate. Calls to C functions canbe inlined in JIT-compiled code, unlike calls to functions bound viathe classic Lua/C API.
This page gives a short introduction to the usage of the FFI library.Please use the FFI sub-topics in the navigation bar to learn more.
Motivating Example: Calling External C Functions
It's really easy to call an external C library function:
So, let's pick that apart:
① Load the FFI library.
② Add a C declarationfor the function. The part inside the double-brackets (in green) isjust standard C syntax.
③ Call the namedC function — Yes, it's that simple!
Actually, what goes on behind the scenes is far from simple:
Ok, so maybe the use of printf() wasn't such a spectacularexample. You could have done that with io.write() andstring.format(), too. But you get the idea ...
So here's something to pop up a message box on Windows:
Bing! Again, that was far too easy, no?
Compare this with the effort required to bind that function using theclassic Lua/C API: create an extra C file, add a C functionthat retrieves and checks the argument types passed from Lua and callsthe actual C function, add a list of module functions and theirnames, add a luaopen_* function and register all modulefunctions, compile and link it into a shared library (DLL), move it tothe proper path, add Lua code that loads the module aaaand ... finallycall the binding function. Phew!
Motivating Example: Using C Data Structures
The FFI library allows you to create and access C datastructures. Of course the main use for this is for interfacing withC functions. But they can be used stand-alone, too.
Lua is built upon high-level data types. They are flexible, extensibleand dynamic. That's why we all love Lua so much. Alas, this can beinefficient for certain tasks, where you'd really want a low-leveldata type. E.g. a large array of a fixed structure needs to beimplemented with a big table holding lots of tiny tables. This imposesboth a substantial memory overhead as well as a performance overhead.
Here's a sketch of a library that operates on color images plus asimple benchmark. First, the plain Lua version:
This creates a table with 160.000 pixels, each of which is a tableholding four number values in the range of 0-255. First an image witha green ramp is created (1D for simplicity), then the image isconverted to greyscale 1000 times. Yes, that's silly, but I was inneed of a simple example ...
And here's the FFI version. The modified parts have been marked inbold:
Ok, so that wasn't too difficult:
Lua Function Example
① First, load the FFIlibrary and declare the low-level data type. Here we choose astruct which holds four byte fields, one for each componentof a 4x8 bit RGBA pixel.
② Creating the datastructure with ffi.new() is straightforward — the'?' is a placeholder for the number of elements of avariable-length array.
Can Lua Call A C++ Function For A
③ C arrays arezero-based, so the indexes have to run from 0 ton-1. One might want to allocate one more element instead tosimplify converting legacy code.
④ Since ffi.new()zero-fills the array by default, we only need to set the green and thealpha fields.
⑤ The calls tomath.floor() can be omitted here, because floating-pointnumbers are already truncated towards zero when converting them to aninteger. This happens implicitly when the number is stored in thefields of each pixel.
Now let's have a look at the impact of the changes: first, memoryconsumption for the image is down from 22 Megabytes to640 Kilobytes (400*400*4 bytes). That's a factor of 35x less! So,yes, tables do have a noticeable overhead. BTW: The original programwould consume 40 Megabytes in plain Lua (on x64).
Next, performance: the pure Lua version runs in 9.57 seconds (52.9seconds with the Lua interpreter) and the FFI version runs in 0.48seconds on my machine (YMMV). That's a factor of 20x faster (110xfaster than the Lua interpreter).
The avid reader may notice that converting the pure Lua version overto use array indexes for the colors ( instead of.red,  instead of .green etc.) ought tobe more compact and faster. This is certainly true (by a factor of~1.7x). Switching to a struct-of-arrays would help, too.
However the resulting code would be less idiomatic and rathererror-prone. And it still doesn't get even close to the performance ofthe FFI version of the code. Also, high-level data structures cannotbe easily passed to other C functions, especially I/O functions,without undue conversion penalties.