Wednesday, January 9, 2019

Variable Length Arrays (VLA) in C: what are they, really?

For reasons that are not entirely clear to me, every time the topic of C99 VLA pops up in a discussion, people start talking predominantly about the possibility of declaring run-time-sized arrays as local objects (i.e. creating them "on the stack"). This is rather surprising and misleading, since this facet of VLA functionality - support for local array declarations - happens to be a rather auxiliary, secondary capability provided by VLA. It does not really play any significant role in VLA functionality. Most of the time, the matter of local VLA declarations and their accompanying potential pitfalls is forced into the foreground by VLA critics, who use it as a "straw man" intended to derail the discussion and bog it down among barely relevant details.

The essence of VLA support in C is, first and foremost, a revolutionary qualitative extension of the language's concept of type. It involves the introduction of such fundamentally new kind of types as variably modified types. Virtually every important implementation detail associated with VLA is actually attached to its type, not to the VLA object per se. It is the very introduction of variably modified types into the language that makes up the bulk of the proverbial VLA cake, while the ability to declare objects of such types in local memory is nothing more than a insignificant and fairly inconsequential icing on that cake.

Consider this: every time one declares something like this in the code

/* Block scope */
int n = 10;
...
typedef int A[n];
size-related characteristics of the variably modified type A (e.g. the value of n) are finalized at the exact moment when the control passes over the above typedef-declaration. Any changes in the value of n made further down the line, below this declaration of A will not affect the size of A. Stop for a second and think about what it means. It means that the implementation is supposed to associate with A a hidden internal variable, which will store the size of the array type. This hidden internal variable is initialized from n when the control passes over the declaration of A.

This gives the above typedef-declaration a rather interesting and unusual property: it generates executable code (!). Moreover, it doesn't just generate executable code, it generates critically important executable code. For this reason the language imposes some unusual restrictions on such variably modified declarations: the language prohibits passing control into their scope from outside of their scope

/* Block scope */
int n = 10;
goto skip; /* Error: invalid goto */

typedef int A[n];
skip:;
Note once again that the above code does not define any VLA arrays. It simply declares an innocent alias for a variably modified type. Yet, it is illegal to jump over such typedef-declaration.

When one declares an actual VLA object, in addition to allocating the actual array memory the compiler also creates one or more hidden internal variables, which hold the size(s) of the array in question. The array itself is implemented under the hood as a regular pointer, which points to memory allocated through alloca-like mechanism

/* Block scope */
int n = 10, m = 20;
typedef int A[n][m];
A a;
a[5][6] = 42;

/* The above is translated into something like */

int n = 10, m = 20;
size_t _internal_A1 = n, _internal_A2 = m;
int *a = alloca(_internal_A1 * _internal_A2 * sizeof(int)); 
a[5 * _internal_A2 + 6] = 42;
(Of course, in addition to that the compiler will generate code to deallocate a at the end of the block).

But even in this example one has to understand that these hidden variables are associated not with the array itself, but rather with its variably modified type. Multiple VLA-array definitions and/or variably modified type declarations with identical run-time parameters might share the same hidden variables to store their sizes.

One important and remarkable consequence of this approach is as follows: the additional information about array size, associated with a VLA, is not built into the object representation of the VLA. It is actually stored besides the array, as "sidecar" data. This means that object representation of a (possibly multidimensional) VLA is fully compatible with object representation of an ordinary classic compile-time-sized array of the same dimensionality and the same sizes. For example

void foo(unsigned n, unsigned m, unsigned k, int a[n][m][k]) {}
void bar(int a[5][5][5]) {}

int main(void)
{
  unsigned n = 5;
  int vla_a[n][n][n];
  bar(a);

  int classic_a[5][6][7];
  foo(5, 6, 7, classic_a); 
}
Both function calls in the above code are perfectly valid and their behavior is fully defined by the language, despite the fact that we pass a VLA where a "classic" array is expected, and vice versa. Granted, the compiler cannot control the type compatibility in such calls (since at least one of the involved types is run-time-sized). However, if desired, the compiler (or the user) has everything necessary to perform the run-time check in debug version of code.

(Note: As usual, parameters of array type are always implicitly adjusted into parameters of pointer type. This applies to VLA parameter declarations exactly as it applies to "classic" array parameter declarations. This means that in the above example parameter a actually has type int (*)[m][k]. This type is unaffected by the value of n. I intentionally added a few extra dimensions to the array to maintain its dependence on run-time values.)

Compatibility between VLA and "classic" arrays as function parameters is also supported by the fact that the compiler does not have to accompany a variably modified parameter with any additional hidden information about its size. Instead, the language syntax forces the user to pass this extra information in the open. In the above example the user was forced to first include parameters n, m and k into function parameter list. Without declaring n, m and k first, the user would not have been able to declare a (see also the above note about n). These parameters, explicitly passed into the function by the user, will bring over the information about the actual sizes of a.

Declaration of VLA arrays may not include initializers, which also makes it impossible to use VLA in compound literals

int n = 10;
int a[n] = { 0 }; /* ERROR: no initializers allowed */
The rationale behind for this limitation, if I remember correctly, is that there's no good answer to the question of what to do with potential excessive initializers.

For another example, by taking advantage of VLA support we can write the following code

#include <stdio.h>
#include <stdlib.h>

void init(unsigned n, unsigned m, int a[n][m])
{
  for (unsigned i = 0; i < n; ++i)
    for (unsigned j = 0; j < m; ++j)
      a[i][j] = rand() % 100;
}

void display(unsigned n, unsigned m, int a[n][m])
{
  for (unsigned i = 0; i < n; ++i)
    for (unsigned j = 0; j < m; ++j)
      printf("%2d%s", a[i][j], j + 1 < m ? " " : "\n");
  printf("\n");
}

int main(void) 
{
  int a1[5][5] = { 42 }; 
  display(5, 5, a1);
  init(5, 5, a1);
  display(5, 5, a1);

  unsigned n = rand() % 10 + 5, m = rand() % 10 + 5;
  int (*a2)[n][m] = malloc(sizeof *a2);
  init(n, m, *a2);
  display(n, m, *a2);
  free(a2);
}
This code is intended to draw your attention to the following fact: this code makes heavy use of valuable properties of variably modified types. It is impossible to implement elegantly without VLA. Yet at the same time, not even a single VLA is created in local memory in the above program, meaning that this popular vector of VLA criticism is not applicable to this code at all.

Sometimes variably modified types can be used to build some rather confusing constructs of questionable practical value and non-obvious behavior. For example, the following function declarations are allowed

/* File scope */
int n = 100;
void foo(int a[n++]) {}

void bar(int m, int a[++n][++m]) {}

int hello(void) { return printf("Hello World\n"); }
void baz(int a[hello()]) {}
Expressions contained inside VLA declarations in the above parameter list declarations will be evaluated (together with all their side-effects) every time the corresponding function is called. Note that even though array parameters will be transformed into pointer parameters, all expressions present in the original array declaration will still be evaluated. E.g. every call to function baz will be accompanied with string "Hello World\n" being sent to standard output.