July 24, 2016

C++/CLI type system as seen from .NET and CIL

For some time now, I have delved, out of necessity, into C++ and all its quirks and secrets. One thing that has always fascinated me in C++ is its marvelously complex type system. In C#, you mostly work with arrays and generics; pointers are possible but not recommended. C++ has loads of declarators: fixed-size arrays, pointers, and references, all of which can be combined with one another and with qualifiers, mostly const and volatile. These two keywords also exist in C#, but they have vastly different meanings.

You may or may not be surprised, but all these constructions are accurately representable in Common Intermediate Language (CIL for short), the language that separates all .NET languages from native code. Though C# doesn't expose the full potential of the CIL type system (the newest drafts do have quite appealing suggestions, like ref returns), the type system in CIL is as rich as C++'s, maybe even richer.

C++/CLI is a beautiful and well-designed combination of C++ and .NET, and it will serve as an example of how C++ types are translated into CIL. If you are not familiar with C++ at all, don't worry, I'll be easy on you. The following descriptions will be mainly from the C# perspective; after all, that's where I come from.

Let's start with an overview of what types we'll have to consider:

Managed classes

.NET classes are probably what you all already know. In C#, they are defined with "class Name". They are reference types, meaning their actual data is always stored on the heap and is only referred to via an opaque value (called a reference, internally a pointer to an object). C++/CLI defines them with "ref class Name" or "ref struct Name". Be aware that class and struct have a totally different meaning in C++ than in C#. A class has all its members private by default, whilst a struct has all its members public by default. There's no other considerable difference between the two keywords in C++.
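
For readers coming from C#, here is a minimal sketch in plain C++ (not C++/CLI) of that default-visibility difference. The names PointS, PointC and sum_points are made up for illustration.

```cpp
#include <cassert>

// 'struct': members are public unless stated otherwise.
struct PointS {
    int x = 1;
};

// 'class': members are private unless stated otherwise,
// so an accessor is needed to read x from outside.
class PointC {
    int x = 2;
public:
    int get_x() const { return x; }
};

// Reads both values the only way the language allows.
inline int sum_points() {
    PointS s;
    PointC c;
    // Writing c.x here would not compile; s.x is fine.
    return s.x + c.get_x();
}

// Small self-check of the sketch above.
inline void check_visibility() { assert(sum_points() == 3); }
```

Apart from the default access level (which also applies to base classes), both keywords declare the same kind of type.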

Managed structs

.NET value types can be located on the stack, and are usually used for simple values or aggregates of values. C# defines them with "struct Name". In C++/CLI, it's again either "value class Name" or "value struct Name".

Native types

A user-defined native type is created in C++/CLI via "struct Name" or "class Name", again the only difference being the default visibility of members. All native types are translated as value types in C++/CLI, because they have value semantics. Compiled native types have only their size specified in the CIL assembly, and all their methods are located in the global module type.

In the current version of C++/CLI, "mixed" types aren't supported. You cannot have a managed type containing a native type member, or vice versa. However, managed methods can take, return, and have locals of native types. The problem with mixed types is so-called "interior pointers". They point to the inside of an object, and objects on the managed heap are frequently moved around by the GC (garbage collector). A function holding a plain pointer to such a value (possibly as this) could find the value moved to another location in the middle of its execution. There are plans to unify the type system, but nothing has been released so far.
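
To get a feeling for why a moving heap and raw pointers don't mix, here is a loose analogy in plain C++. It is not how the GC works; it only assumes that a std::vector relocates its elements when it outgrows its capacity, much like the GC relocates objects. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if the first element ended up at a different address after the
// container grew, while an index (our stand-in for a tracked reference)
// still finds the same value.
inline bool address_changes_on_growth() {
    std::vector<int> heap;
    heap.push_back(7);
    const std::size_t index = 0;

    // Remember the address as an integer so a stale pointer is never used.
    const std::uintptr_t before = reinterpret_cast<std::uintptr_t>(&heap[index]);

    // Growing past the current capacity forces a reallocation,
    // which moves every element to a new buffer.
    const std::size_t old_capacity = heap.capacity();
    while (heap.size() <= old_capacity) heap.push_back(0);

    const std::uintptr_t after = reinterpret_cast<std::uintptr_t>(&heap[index]);
    return before != after && heap[index] == 7;
}

// Small self-check of the sketch above.
inline void check_relocation() { assert(address_changes_on_growth()); }
```

A raw pointer taken before the growth would now dangle, which is roughly the situation a native member function would be in if its object lived on the managed heap.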

Now that I have listed the basic types, let's look at how the compiler translates some more complicated ones.

No declarators

A native type without any declarators:

static std::exception TestFunc()

translates into the following line in CIL (got via ildasm):

.method private hidebysig static valuetype std.exception* modreq([mscorlib]System.Runtime.CompilerServices.IsUdtReturn)
        TestFunc(valuetype std.exception* A_0) cil managed


For those unfamiliar with CIL, .method starts a method definition, private means the same as in C#, and hidebysig specifies that the method hides base class members by name and signature. After "static" comes the return type:

valuetype std.exception* modreq([mscorlib]System.Runtime.CompilerServices.IsUdtReturn)

Yes, that's the whole return type. See what I meant by a rich type system? Anyway, valuetype std.exception is a type reference to the user-defined class (as I said, all native types are value types), * is a good old pointer to it, and now comes the fun. A "feature" completely missing in C# (at the moment; it might appear in C# 7 or later) is optional and required modifiers. You see, a custom attribute is a piece of metadata represented with a specific type, attached to a method, field, parameter etc. A modifier is a type attached to another type. That's the most important aspect of modifiers. An optional modifier (modopt) can, by definition, be safely ignored by any compiler that doesn't understand it. On the other hand, a required modifier (modreq) has to be understood by the compiler in order to use the type correctly. I assume that the C# compiler isn't aware of this modifier, and therefore shouldn't allow access to this method.

MSDN explains the modifying type IsUdtReturn quite well:
The IsUdtReturn modifier is used by the C++ compiler to mark return types of methods that have native C++ object return semantics. The managed debugger recognizes this modifier to correctly determine that the native calling convention is in use.

Another quite interesting thing about this method is the single parameter. I am not much into the internals of C++, but it seems from the method body that the actual object is initialized or copied right into the variable accessed through the pointer. This is not very surprising, though, given that you can even overload the assignment operator in C++, and it is possibly faster this way too.
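
My reading of that signature can be sketched in plain C++. This is only a model of the idea, not what the compiler actually emits; Widget and both functions are hypothetical stand-ins for std::exception and TestFunc.

```cpp
#include <cassert>
#include <new>

// Hypothetical native type standing in for std::exception.
struct Widget {
    int id;
    explicit Widget(int i) : id(i) {}
};

// What the programmer writes: return by value.
inline Widget make_widget() { return Widget(42); }

// Roughly what the emitted signature suggests: the caller supplies storage,
// the callee constructs the result in place and returns the same pointer.
inline Widget* make_widget_lowered(Widget* result) {
    return new (result) Widget(42);  // placement new: construction only, no allocation
}

// Plays the role of the caller: owns the buffer and destroys the object.
inline int demo_return_slot() {
    alignas(Widget) unsigned char storage[sizeof(Widget)];
    Widget* w = make_widget_lowered(reinterpret_cast<Widget*>(storage));
    const int id = w->id;
    w->~Widget();
    return id;
}

// Small self-check of the sketch above.
inline void check_return_slot() { assert(demo_return_slot() == make_widget().id); }
```

Constructing straight into caller-provided storage avoids a temporary and an extra copy, which fits the guess that this is the faster route.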

For comparison, let's see how a method looks if it returns a managed struct:

.method private hidebysig static valuetype [mscorlib]System.DateTime
        TestFunc() cil managed


As you can see, there is no difference between C# and C++/CLI output in this case, which is actually no surprise given it has to be compatible.

I used struct, but what about a regular class? C++/CLI can actually support value type semantics for reference types, so how would it look with a managed reference type?

.method private hidebysig static void modreq([mscorlib]System.Runtime.CompilerServices.IsUdtReturn)
        TestFunc(class [mscorlib]System.Exception& A_0) cil managed


It looks quite similar to the first method but this time the return type is void. It seems logical, because a pointer to a managed reference type is not valid. The parameter is also analogous to the first method, but it's a reference (ref in C#, & in CIL) this time (again because a pointer would be invalid).

As you can see, using a bare managed reference type as a return type is not a good example, so we'll stick to parameters from now on (there is little to no difference in the case of the other types with declarators). A parameter of the bare type System::Exception is translated as:

class [mscorlib]System.Exception modreq([mscorlib]System.Runtime.CompilerServices.IsByValue)

There is simply no way a reference type instance could be located on the stack in .NET, and the value behaviour is just simulated in C++/CLI. The CLR still sees just a reference being passed to the method.

MSDN on IsByValue:
The IsByValue class is used by the Microsoft C++ compiler to denote method parameters and return values whose semantics follow the C++ rules for objects passed by value

Pointers

There is nothing special to the translation of pointers:

valuetype std.exception*

valuetype [mscorlib]System.DateTime*


Both native and managed value types are translated without any special modifiers, being compatible with C#.

References

A reference in C++ is just a pointer under the hood, but it doesn't have to be dereferenced (the * operator) in order to access its value, because it is implicitly dereferenced.
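
A short sketch in plain C++ of what "implicitly dereferenced" means in practice (the function names are made up for illustration):

```cpp
#include <cassert>

// Pointer parameter: the callee has to dereference explicitly with *.
inline void set_via_pointer(int* p) { *p = 10; }

// Reference parameter: same effect, but the dereference is implicit.
inline void set_via_reference(int& r) { r = 20; }

// Taking the address of a reference yields the address of the referred object.
inline bool same_address(int& r, int* p) { return &r == p; }

// Small self-check of the sketch above.
inline void check_references() {
    int v = 0;
    set_via_pointer(&v);
    assert(v == 10);
    set_via_reference(v);
    assert(v == 20);
    assert(same_address(v, &v));
}
```

Both functions receive the location of the variable; only the syntax for reaching the value differs, which is the distinction the modifier below records.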

valuetype std.exception* modopt([mscorlib]System.Runtime.CompilerServices.IsImplicitlyDereferenced)

The same for managed value types. As you can see, it's modopt and not modreq this time, so C# should be able to use this method. There is no reason for it not to be able to.

MSDN on IsImplicitlyDereferenced will be quoted further below, because it doesn't describe the current usage context.

As a reference is just a pointer, it also cannot be used for managed reference types.

Handles

A handle is what C# guys would call a reference to an instance. In C++/CLI, the syntax is System::Object^ and it can be used only for managed types.

object

That's what ildasm displays. It could also be [mscorlib]System.Object, but this one is shorter in the assembly code. As you can see, a handle to a managed reference type is simply the reference type itself in both CIL and C#. But what about a value type?

class [mscorlib]System.ValueType modopt([mscorlib]System.DateTime) modopt([mscorlib]System.Runtime.CompilerServices.IsBoxed)

A "handle" to a value type is simply a boxed value. CIL cannot represent this directly, but since all value types inherit from ValueType, it is the best type for it. It is also modified with the actual expected type (not enforced by CLR or C# if you pass a different type, though) and IsBoxed.


MSDN on IsBoxed:
Indicates that the modified reference type is a boxed value type. This class cannot be inherited.
The Microsoft C++ compiler supports boxed value types directly in the language. Information about boxed value types is emitted into metadata as a custom modifier, where the modifier decorates a reference to the value type being boxed.

There cannot be a handle to anything else. A handle to a pointer would be somewhat logical and possible, but only if pointers could be boxed as such, which they can't (they are actually boxed to IntPtr).

Tracking references

A tracking reference is a reference which the GC knows about and updates whenever the referenced object moves. As you can probably see by now, it's a plain old ref in C#, albeit with a fancier name. The syntax in C++/CLI is System::DateTime%. Tracking references can be used with any other type, except with another reference (ECMA-335 for CLI explicitly states that a managed pointer [= ref] can point to unmanaged data without any problems, and the GC won't move it).

Native types and value types are translated the same:

valuetype [mscorlib]System.DateTime&

It is completely identical to ref DateTime in C#. It gets interesting for bare reference types (System::Object%), though:

object modreq([mscorlib]System.Runtime.CompilerServices.IsImplicitlyDereferenced)

The parameter cannot be a CIL reference (&), because assigning through it would replace the instance reference, and not the instance's data.

Now is a good time to quote MSDN:
The C++ compiler uses the IsImplicitlyDereferenced modifier to distinguish reference classes that are passed by managed reference from those passed by managed pointer. The IsImplicitlyDereferenced class and its partner, the IsExplicitlyDereferenced class, disambiguate reference parameters from pointer parameters.

The description is a bit cryptic, but it gets the point across. A "managed reference" is a tracking reference, and a "managed pointer" is a handle.

Arrays

In C++, an array is usually specified with a fixed size, but an "array of unknown bound" is also possible. If used as a field, it occupies space inside the object like a value type (similar to unsafe fixed arrays in C#). If used as a parameter, its type is adjusted to a pointer to the element type. As a local variable, it allocates space on the stack and decays to a pointer in most expressions.
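
The difference between an array parameter and a reference-to-array parameter can be checked in plain C++. This is a small sketch with made-up function names:

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Declared as an array, but the parameter type is adjusted to int*:
// the bound 10 is not part of the function's signature.
inline bool param_is_pointer(int arr[10]) {
    (void)arr;  // only the declared type matters here
    return std::is_same<decltype(arr), int*>::value;
}

// Declared as a reference to an array: the bound stays part of the type,
// so sizeof sees the whole array.
inline std::size_t size_seen_by_reference(int (&arr)[10]) {
    return sizeof(arr);
}

// Small self-check of the sketch above.
inline void check_arrays() {
    int data[10] = {};
    assert(param_is_pointer(data));
    assert(size_seen_by_reference(data) == 10 * sizeof(int));
}
```

Only the second form gives the compiler an array type to put into the signature.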

static void TestFunc(int (&arr)[10])

There is no difference in the values being passed around, but this time, the array type is preserved in the method signature... mostly:

valuetype '<CppImplementationDetails>'.$ArrayType$$$BY09H* modopt([mscorlib]System.Runtime.CompilerServices.IsImplicitlyDereferenced)

$ArrayType$$$BY09H is actually quite similar to mangled names for C++ functions. You can read the details here, but Y is an array prefix, 0 means 1 dimension, 9 is the size of the array (10 in fact, but encoded as 9, see the documentation), and H is signed int. typeid(int(&)[10]).raw_name() returns .$$BY09H. IsImplicitlyDereferenced is used again to distinguish a reference from a pointer.
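
To make the "10 encoded as 9" part concrete, here is a hedged sketch of the number encoding as I understand it from public descriptions of the Visual C++ name decoration. Treat the rules in the comments as assumptions; encode_number and int_array_suffix are hypothetical helpers, not part of any real API.

```cpp
#include <cassert>
#include <string>

// Assumed encoding of a non-negative number inside a decorated name:
//   1..10  -> a single digit '0'..'9' (the value minus one)
//   0      -> "A@"
//   others -> hex nibbles written as 'A'..'P', terminated by '@'
inline std::string encode_number(unsigned long long n) {
    if (n >= 1 && n <= 10) {
        return std::string(1, static_cast<char>('0' + (n - 1)));
    }
    if (n == 0) return "A@";
    std::string digits;
    while (n != 0) {
        digits.insert(digits.begin(), static_cast<char>('A' + (n & 0xF)));
        n >>= 4;
    }
    return digits + "@";
}

// Builds the part after "$$B" for a one-dimensional array of int:
// 'Y', the dimension count, the bound, then 'H' for int.
inline std::string int_array_suffix(unsigned long long bound) {
    return "Y" + encode_number(1) + encode_number(bound) + "H";
}

// Small self-check against the name shown in the article.
inline void check_mangling() { assert(int_array_suffix(10) == "Y09H"); }
```

Under these assumptions, int_array_suffix(10) reproduces the Y09H seen in the type name above.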

The actual array type is just a value type with a custom size:

.class private sequential ansi sealed beforefieldinit '<CppImplementationDetails>'.$ArrayType$$$BY09H
       extends [mscorlib]System.ValueType
{
  .pack 0
  .size 40
  .custom instance void [Microsoft.VisualC]Microsoft.VisualC.DebugInfoInPDBAttribute::.ctor() = ( 01 00 00 00 )
  .custom instance void [mscorlib]System.Runtime.CompilerServices.UnsafeValueTypeAttribute::.ctor() = ( 01 00 00 00 )
  .custom instance void [mscorlib]System.Runtime.CompilerServices.NativeCppClassAttribute::.ctor() = ( 01 00 00 00 )
  .custom instance void [Microsoft.VisualC]Microsoft.VisualC.MiscellaneousBitsAttribute::.ctor(int32) = ( 01 00 41 00 00 00 00 00 )
}


The size is sizeof(int)*10, i.e. 40. I haven't been able to find what the last bits mean.

Managed arrays in C++/CLI have a nice little template-like syntax:

array<type, dimensions>^

In principle, .NET arrays can also have index bounds specified for their dimensions. No .NET language known to me (except CIL) can take advantage of this.

Const and volatile qualifiers

modopt([mscorlib]System.Runtime.CompilerServices.IsConst)
modopt([mscorlib]System.Runtime.CompilerServices.IsVolatile)

Not much of a surprise, after all those modifiers we have seen. The modifier follows the type it modifies, like in C++ (unless the qualifier is written at the very left).
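
If the C++ placement rule is new to you, this plain C++ sketch shows which part of a pointer type a const applies to. The alias and function names are made up for illustration.

```cpp
#include <cassert>
#include <type_traits>

using PtrToConst = const int*;  // same as 'int const*': the int is const
using ConstPtr   = int* const;  // the pointer itself is const

// A leading const applies to the type on its right.
static_assert(std::is_same<PtrToConst, int const*>::value,
              "leading const binds to int");
static_assert(std::is_const<std::remove_pointer<PtrToConst>::type>::value,
              "pointee is const");
static_assert(!std::is_const<PtrToConst>::value,
              "pointer itself is not const");

// Otherwise const applies to whatever is on its left.
static_assert(std::is_const<ConstPtr>::value,
              "pointer itself is const");
static_assert(!std::is_const<std::remove_pointer<ConstPtr>::type>::value,
              "pointee is mutable");

// Runtime-visible summary of the checks above.
inline bool const_follows_its_type() {
    return std::is_const<ConstPtr>::value && !std::is_const<PtrToConst>::value;
}

inline void check_const() { assert(const_follows_its_type()); }
```

The "int const*" spelling is the one that lines up with CIL, where the modifier always comes after the type it modifies.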

Function pointers

Another lesser-known feature of CIL, especially because of its poor handling in .NET reflection, is function pointers. Normally, you would use a delegate if you wanted to capture a method into an object. However, there is something more raw than that. CIL allows you to specify a pointer to any method with any parameters you want. Let's look at how void(*)() (a pointer to a function with no parameters, returning nothing) is translated:

method unmanaged cdecl void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl) *()

method is the keyword here – it begins a function type signature, followed by the method's attributes, return type modified by CallConvCdecl, *, and its parameter list.
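
On the C++ side, the same void(*)() type looks like this. It is a minimal sketch; Callback, bump and call_twice are made-up names.

```cpp
#include <cassert>

namespace {
int g_calls = 0;             // counts how often bump() ran
void bump() { ++g_calls; }   // matches the signature void()
}

// The function pointer type from the article: no parameters, returns nothing.
using Callback = void (*)();

// Invokes the target directly through the pointer, with no delegate object.
inline void call_twice(Callback cb) {
    cb();
    cb();
}

// Small self-check of the sketch above.
inline void check_function_pointer() {
    const int before = g_calls;
    call_twice(&bump);
    assert(g_calls == before + 2);
}
```

Unlike a delegate, the pointer carries no target object and no invocation list; it is just the address of the code.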

Primitive types

In this implementation, long (long int) has the same size as a regular int, but the difference is still recorded in the signature:

int32 modopt([mscorlib]System.Runtime.CompilerServices.IsLong)

MSDN on IsLong:
The C++ standard indicates that a long value and an integer value are distinct types. However, they are both represented using ELEMENT_TYPE_I4 in an assembly. To distinguish a long from an integer in C++, the Microsoft C++ compiler adds the IsLong modifier to any instance of a long when the instance is emitted. This process is critically important for maintaining language-level type safety.

What the documentation forgets to mention is that a double can be long, too.

float64 modopt([mscorlib]System.Runtime.CompilerServices.IsLong)
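
That these really are distinct types at the language level, whatever their size, can be seen in plain C++. This is a small sketch and which() is a hypothetical helper.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Distinct types, even on platforms where the sizes happen to match.
static_assert(!std::is_same<int, long>::value,
              "int and long are distinct types");
static_assert(!std::is_same<double, long double>::value,
              "double and long double are distinct types");

// Overload resolution can only tell them apart because the types differ.
inline std::string which(int)         { return "int"; }
inline std::string which(long)        { return "long"; }
inline std::string which(double)      { return "double"; }
inline std::string which(long double) { return "long double"; }

// Small self-check of the sketch above.
inline void check_long() {
    assert(which(1) == "int");
    assert(which(1L) == "long");
    assert(which(1.0) == "double");
    assert(which(1.0L) == "long double");
}
```

Without a marker like IsLong, two such overloads would end up with identical CIL signatures.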

It should be noted that floating point computations in .NET may actually be carried out in the CPU's wider floating point type (called F in CLI, but this is a topic for another article).

Char can be signed, unsigned, or neither. The first is sbyte in C#, the second is byte, and the one without a sign specifier becomes:

int8 modopt([mscorlib]System.Runtime.CompilerServices.IsSignUnspecifiedByte)

MSDN on IsSignUnspecifiedByte:
Some programming languages, such as C++, recognize three distinct char values: signed char, unsigned char, and char. To distinguish the unmodified char type from the others, the Microsoft C++ compiler adds the IsSignUnspecifiedByte modifier to each char type emitted to an assembly.
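
The three char types can be told apart in plain C++ as well. This is a minimal sketch and overload_id is a made-up name.

```cpp
#include <cassert>
#include <type_traits>

// All three have the same size, yet none of them is the same type.
static_assert(!std::is_same<char, signed char>::value,
              "char is distinct from signed char");
static_assert(!std::is_same<char, unsigned char>::value,
              "char is distinct from unsigned char");
static_assert(sizeof(char) == sizeof(signed char) &&
              sizeof(char) == sizeof(unsigned char),
              "all three occupy one byte");

// One overload per type; plain char picks its own.
inline int overload_id(char)          { return 0; }
inline int overload_id(signed char)   { return 1; }
inline int overload_id(unsigned char) { return 2; }

// Small self-check of the sketch above.
inline void check_char() {
    assert(overload_id('a') == 0);
    assert(overload_id(static_cast<signed char>(1)) == 1);
    assert(overload_id(static_cast<unsigned char>(1)) == 2);
}
```

Whether plain char behaves as signed or unsigned is up to the implementation, but it stays a separate type either way.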

As you can see, there are many CIL features unknown to C#, but used heavily in C++/CLI. My best advice, if you want to find out more, is to compile some C++/CLI code yourself and browse the assembly with ildasm or ILSpy. Have fun!
