Playing With The .NET JIT Part 2

posted in Washu's Journal
Published September 14, 2014
Previously I discussed various potential issues the x86 JIT has with inlining non-trivial methods and functions that take or return value types. In this entry I hope to cover some potential pitfalls facing would-be optimizers, along with some unexpected optimizations that do take place.

Optimizations That Aren't
It is not uncommon to see people advocating the use of unsafe code as a means of producing "optimized" code in the managed environment. The idea is a simple one: by getting down to the metal with pointers and all that fun stuff, you can somehow produce code that will be "optimized" in ways that typical managed code cannot be.
Unsafe code does not allow you to manipulate pointers to managed objects in whatever manner you please. Certain steps have to be taken to ensure that your operations are safe with regard to the managed heap. Just because your code is marked as "unsafe" doesn't mean that it is free to do what it wants. For example, you cannot assign a pointer the address of a managed object without first pinning the object. Pointers are not tracked by the GC, so should you obtain a pointer to an object and then attempt to use it after the object has been collected, you would be accessing a region of memory that now belongs to something else. Even if the object survives, the GC may shuffle it around on the heap when a collection runs; such a move invalidates your pointer, and since pointers are not tracked by the GC, the pointer would not be updated the way references are. Pinning solves both problems, and is why you are only allowed to take the address of an object that has been pinned: a pinned object cannot be moved nor collected by the GC until it is unpinned. Pinning is typically done through the fixed keyword in C# or the GCHandle structure.
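To make that concrete, here is a minimal sketch of both pinning mechanisms. The array and names are mine, not from the test code discussed below, and it must be compiled with /unsafe.

using System;
using System.Runtime.InteropServices;

class PinningExamples {
    static unsafe float SumWithFixed(float[] values) {
        float sum = 0;
        // fixed pins the array for the duration of the block, so the
        // GC cannot move it while we hold a raw pointer into it.
        fixed (float* p = values) {
            for (int i = 0; i < values.Length; ++i)
                sum += p[i];
        }
        return sum; // the array is unpinned once the block exits
    }

    static void PinWithGCHandle(float[] values) {
        // GCHandle pins the object until Free is called, which is
        // useful when the pointer must outlive a single scope.
        GCHandle handle = GCHandle.Alloc(values, GCHandleType.Pinned);
        try {
            IntPtr address = handle.AddrOfPinnedObject();
            Console.WriteLine("Pinned at 0x{0:X}", address.ToInt64());
        } finally {
            handle.Free(); // always unpin, or the GC can never compact around it
        }
    }
}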
Much like how a fixed object cannot be moved by the GC, a pointer to a fixed object cannot be reassigned. This makes it difficult to traverse primitive arrays, as you end up needing to create other temporary pointers or limiting the size of the fixed region to a small segment. Fixed objects and unsafe code also increase the overall size of the produced IL by a fairly significant margin. While an increase in the IL is not directly indicative of the size of the produced machine code, it does prevent the runtime from inlining such methods. As an example, the two following snippets reveal the difference between a safe inner product and an unsafe one; note that the unsafe version uses a fixed-size buffer.

public float Magnitude() {
    return (float)Math.Sqrt(X * X + Y * Y + Z * Z);
}

.method public hidebysig instance float32 Magnitude() cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 PerformanceTests.Vector3::X
    L_0006: ldarg.0
    L_0007: ldfld float32 PerformanceTests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 PerformanceTests.Vector3::Y
    L_0013: ldarg.0
    L_0014: ldfld float32 PerformanceTests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 PerformanceTests.Vector3::Z
    L_0021: ldarg.0
    L_0022: ldfld float32 PerformanceTests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: conv.r8
    L_002a: call float64 [mscorlib]System.Math::Sqrt(float64)
    L_002f: conv.r4
    L_0030: ret
}

public unsafe float Magnitude() {
    fixed (float* p = V) {
        return (float)Math.Sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
    }
}

.method public hidebysig instance float32 Magnitude() cil managed
{
    .maxstack 4
    .locals init (
        [0] float32& pinned singleRef1,
        [1] float32 single1)
    L_0000: ldarg.0
    L_0001: ldflda PerformanceTests.Unsafe.Vector3/<V>e__FixedBuffer0 PerformanceTests.Unsafe.Vector3::V
    L_0006: ldflda float32 PerformanceTests.Unsafe.Vector3/<V>e__FixedBuffer0::FixedElementField
    L_000b: stloc.0
    L_000c: ldloc.0
    L_000d: conv.i
    L_000e: ldind.r4
    L_000f: ldloc.0
    L_0010: conv.i
    L_0011: ldind.r4
    L_0012: mul
    L_0013: ldloc.0
    L_0014: conv.i
    L_0015: ldc.i4.4
    L_0016: conv.i
    L_0017: add
    L_0018: ldind.r4
    L_0019: ldloc.0
    L_001a: conv.i
    L_001b: ldc.i4.4
    L_001c: conv.i
    L_001d: add
    L_001e: ldind.r4
    L_001f: mul
    L_0020: add
    L_0021: ldloc.0
    L_0022: conv.i
    L_0023: ldc.i4.8
    L_0024: conv.i
    L_0025: add
    L_0026: ldind.r4
    L_0027: ldloc.0
    L_0028: conv.i
    L_0029: ldc.i4.8
    L_002a: conv.i
    L_002b: add
    L_002c: ldind.r4
    L_002d: mul
    L_002e: add
    L_002f: conv.r8
    L_0030: call float64 [mscorlib]System.Math::Sqrt(float64)
    L_0035: conv.r4
    L_0036: stloc.1
    L_0037: leave.s L_0039
    L_0039: ldloc.1
    L_003a: ret
}
Note that neither of these appears to be a candidate for inlining, both being well over the 32-byte IL limit. The produced IL, while not directly indicative of the assembly produced by the JIT compiler, does tend to give an overall idea of how much larger we should expect the method to be when reproduced in machine code. Fixed-size buffers have other issues that need addressing: you cannot access a fixed-size buffer outside of a fixed statement; they are an unsafe construct, and so you must mark the containing type unsafe; and they produce compiler-generated nested types that can throw off serialization and other reflection-based mechanisms.
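For illustration, here is a sketch of what the unsafe Vector3 declaration might look like. The article never shows it, so the exact layout is an assumption, but the namespace and buffer name are taken from the IL above.

using System;

namespace PerformanceTests.Unsafe {
    public unsafe struct Vector3 {
        // The fixed-size buffer; the compiler backs it with a nested
        // <V>e__FixedBuffer0 type, which is what shows up in the IL above.
        public fixed float V[3];

        public float Magnitude() {
            // Even the declaring type must pin before touching the buffer,
            // since the struct may be embedded in a movable heap object.
            fixed (float* p = V) {
                return (float)Math.Sqrt(p[0] * p[0] + p[1] * p[1] + p[2] * p[2]);
            }
        }
    }
}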
In the end, unsafe code does not increase performance, and the reliance upon platform structures to ensure safety, such as the fixed construct, introduces more problems than it solves. Furthermore, even the smallest method that might otherwise be inlined tends to bloat up to the point where inlining by the JIT is no longer possible.

Surprising Developments and JIT Optimizations
Previously I noted that the JIT compiler can only inline a method that is at most 32 bytes of IL in length. However, I wasn't completely honest with you: in some cases the JIT compiler will inline chunks of code that are longer than 32 bytes of IL. I have not dug in depth into the reasons for this, nor into exactly when these conditions arise, so this information is presented as an informal experimental result. When a function returns the result of an intrinsic operation, a condition may arise whereby the whole function is inlined. Two examples of this behavior follow; note that in both cases the function used is an intrinsic math function, and that neither is passed a value type by value (which would prevent inlining). The first is the Magnitude function, which we saw above. Calling it results in it being inlined and produces the following inlined assembly.

00220164 D945D4   fld   dword ptr [ebp-2Ch]
00220167 D8C8     fmul  st,st(0)
00220169 D945D8   fld   dword ptr [ebp-28h]
0022016C D8C8     fmul  st,st(0)
0022016E DEC1     faddp st(1),st
00220170 D945DC   fld   dword ptr [ebp-24h]
00220173 D8C8     fmul  st,st(0)
00220175 DEC1     faddp st(1),st
00220177 DD5D9C   fstp  qword ptr [ebp-64h]
0022017A DD459C   fld   qword ptr [ebp-64h]
0022017D D9FA     fsqrt
We note that this is the optimal form for the magnitude function, with a minimal number of memory reads and the majority of the work taking place on the FPU stack. Compare it to the unsafe version, shown next, which is not inlined at all (note the full prologue, epilogue, and ret), and you can clearly see how much worse unsafe code is.

007A0438 55       push ebp
007A0439 8BEC     mov  ebp,esp
007A043B 57       push edi
007A043C 56       push esi
007A043D 53       push ebx
007A043E 83EC10   sub  esp,10h
007A0441 33C0     xor  eax,eax
007A0443 8945F0   mov  dword ptr [ebp-10h],eax
007A0446 894DF0   mov  dword ptr [ebp-10h],ecx
007A0449 D901     fld  dword ptr [ecx]
007A044B 8BF1     mov  esi,ecx
007A044D D80E     fmul dword ptr [esi]
007A044F 8BF9     mov  edi,ecx
007A0451 D94704   fld  dword ptr [edi+4]
007A0454 8BD1     mov  edx,ecx
007A0456 D84A04   fmul dword ptr [edx+4]
007A0459 DEC1     faddp st(1),st
007A045B 8BC1     mov  eax,ecx
007A045D D94008   fld  dword ptr [eax+8]
007A0460 8BD8     mov  ebx,eax
007A0462 D84B08   fmul dword ptr [ebx+8]
007A0465 DEC1     faddp st(1),st
007A0467 DD5DE4   fstp qword ptr [ebp-1Ch]
007A046A DD45E4   fld  qword ptr [ebp-1Ch]
007A046D D9FA     fsqrt
007A046F D95DEC   fstp dword ptr [ebp-14h]
007A0472 D945EC   fld  dword ptr [ebp-14h]
007A0475 8D65F4   lea  esp,[ebp-0Ch]
007A0478 5B       pop  ebx
007A0479 5E       pop  esi
007A047A 5F       pop  edi
007A047B 5D       pop  ebp
007A047C C3       ret
Next up is a fairly ubiquitous utility function that obtains the angle between two unit-length vectors. Note that acos does not map directly to a machine instruction; nonetheless it is considered an intrinsic function. As we see below, this produces a nicely optimized set of instructions, with only a single call to a function (which computes the arccosine).

public static float AngleBetween(ref Vector3 lhs, ref Vector3 rhs) {
    return (float)Math.Acos(lhs.X * rhs.X + lhs.Y * rhs.Y + lhs.Z * rhs.Z);
}

.method public hidebysig static float32 AngleBetween(PerformanceTests.Vector3& lhs, PerformanceTests.Vector3& rhs) cil managed
{
    .maxstack 8
    L_0000: ldarg.0
    L_0001: ldfld float32 PerformanceTests.Vector3::X
    L_0006: ldarg.1
    L_0007: ldfld float32 PerformanceTests.Vector3::X
    L_000c: mul
    L_000d: ldarg.0
    L_000e: ldfld float32 PerformanceTests.Vector3::Y
    L_0013: ldarg.1
    L_0014: ldfld float32 PerformanceTests.Vector3::Y
    L_0019: mul
    L_001a: add
    L_001b: ldarg.0
    L_001c: ldfld float32 PerformanceTests.Vector3::Z
    L_0021: ldarg.1
    L_0022: ldfld float32 PerformanceTests.Vector3::Z
    L_0027: mul
    L_0028: add
    L_0029: conv.r8
    L_002a: call float64 [mscorlib]System.Math::Acos(float64)
    L_002f: conv.r4
    L_0030: ret
}

007A01D9 8D55D4       lea  edx,[ebp-2Ch]
007A01DC 8D4DC8       lea  ecx,[ebp-38h]
007A01DF D902         fld  dword ptr [edx]
007A01E1 D809         fmul dword ptr [ecx]
007A01E3 D94204       fld  dword ptr [edx+4]
007A01E6 D84904       fmul dword ptr [ecx+4]
007A01E9 DEC1         faddp st(1),st
007A01EB D94208       fld  dword ptr [edx+8]
007A01EE D84908       fmul dword ptr [ecx+8]
007A01F1 DEC1         faddp st(1),st
007A01F3 83EC08       sub  esp,8
007A01F6 DD1C24       fstp qword ptr [esp]
007A01F9 E868A5AF79   call 7A29A766 (System.Math.Acos(Double), mdToken: 06000b28)
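As an aside, for anyone who wants to reproduce these dumps, here is a minimal harness sketch. It is my own, not the article's test code, and it assumes Vector3 exposes public X, Y, and Z fields, as the IL above suggests. Run a release build without the debugger attached, then attach and break inside the loop to see the optimized JIT output.

using System;

namespace PerformanceTests {
    static class Harness {
        static void Main() {
            Vector3 v = new Vector3();
            v.X = 1; v.Y = 2; v.Z = 3;
            float total = 0;
            for (int i = 0; i < 1000000; ++i)
                total += v.Magnitude(); // the call site that gets inlined
            Console.WriteLine(total);   // keep the result live so the loop survives
        }
    }
}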
Finally, there is the issue of SIMD instruction sets. While the x86 JIT will not vectorize your math using SIMD instructions, it will use them for certain scalar operations. One common operation is the conversion of a floating point number to an integer. In .NET 2.0 the JIT optimizes this to use the SSE2 cvttsd2si instruction. For instance, the following snippet of code results in the assembly dump that follows.

int n = (int)r.NextDouble();

002A02FB 8BCB         mov   ecx,ebx
002A02FD 8B01         mov   eax,dword ptr [ecx]
002A02FF FF5048       call  dword ptr [eax+48h]
002A0302 DD5DA0       fstp  qword ptr [ebp-60h]
002A0305 F20F1045A0   movsd xmm0,mmword ptr [ebp-60h]
002A030A F20F2CF0     cvttsd2si esi,xmm0
While not quite as optimal as it could be if the JIT were using the full SSE2 instruction set, this minor optimization can go a long way.
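For context, here is a self-contained version of that snippet, assuming r is a System.Random (the virtual dispatch in the dump is consistent with a call to Random.NextDouble).

using System;

class ConvertTest {
    static void Main() {
        Random r = new Random();
        // The (int) cast is a truncating conversion; on SSE2-capable
        // hardware the x86 JIT emits cvttsd2si for it rather than
        // saving and restoring the x87 rounding control word.
        int n = (int)r.NextDouble();
        Console.WriteLine(n); // always 0 here, as NextDouble returns [0, 1)
    }
}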
So what is left to visit? Well, there's obviously the x64 platform, which is growing in popularity. The x64 platform presents new opportunities to explore, including certain guarantees and performance benefits that aren't available on the x86 platform, among them a whole new set of optimizations and instruction sets that the JIT can take advantage of. Finally, there is the case of calling into unmanaged code for highly performance-intensive operations: hand-optimized SIMD code, and the potential performance benefits, or hazards, that calling an unmanaged function can incur.
