Fixing C arrays, four decades later

One of the greatest challenges in teaching C is the idiosyncratic rules that govern string and array variables. While teachers can delay this hurdle until the second or third week by focusing on programs that use printf with static strings, it can’t be put off much further. Any C program that takes command line arguments needs to grapple with how arrays, pointers, and strings relate. The first Google result for “C main” introduces main functions with a signature that highlights the pointer/array duality:

int main( int argc, const char* argv[] );

Explaining how to work with the elements of this signature—how to manipulate the string char* argv, the array char* argv[], or the equivalents char** argv and char argv[][]—requires teaching not only the runtime semantics of pointer access, but also the particularities of C array variables and the rules that govern their conversion to pointers, an entirely orthogonal concept which is slippery in its own right. When I finally came across an explanation of the rationale for C’s array semantics in a retrospective by Dennis Ritchie, I found that while C’s approach is tied up with the legacy of long-ago PDP-11 code, it also addresses an inherently tricky language design problem that has taken decades to unravel.¹

When Ritchie developed the pre-C variant NB, he allocated two separate areas in storage for an array declaration like char carray[10]: first, the backing array of ten char’s, and second a pointer to them, which the variable carray refers to. The problem with this scheme, it turned out, was that it couldn’t readily be applied to declaring the fields of structured types:

For example, the directory entries of early Unix systems might be described in C as
struct {
	int	inumber;
	char	name[14];
};
I wanted the structure not merely to characterize an abstract object but also to describe a collection of bits that might be read from a directory. Where could the compiler hide the pointer to name that the semantics demanded?

Constrained by the need to mostly preserve the semantics of existing programs, the solution Ritchie hit on was to manifest the pointer at the point of use: only one storage area, for the array content, would be allocated, but an array-type variable in an expression would be treated as a pointer to the first value in the array.

This rule leaves arrays variables, in some ways, as awkward second-class citizens. In proto-C, given arrays defined as char a[10], b[10] it was possible to write a = b to point a to the array defined for b. In the world after Ritchie’s retrofit, this is problematic: according to the rules, b should be treated as a pointer, but if a were treated the same way there would be no coherent way to define assignment to a struct member; in the expression s.a = b, there is no pointer in the struct to act as the target of the assignment. This restriction manifests in the C89 specification as a rather inconspicuous clause restricting the applicability of array variable expressions:

Except when it is the operand of the sizeof operator or the unary & operator, or is a character string literal used to initialize an array of character type … an lvalue that has type “array of type” is converted to an expression that has type “pointer to type” that points to the initial member of the array object and is not an lvalue.

This excerpt also highlights the other set of restrictions that came with Ritchie’s rule. For the sizeof operator, the array is not treated as a pointer, since that would leave no way to determine the size of the underlying storage. That means that when switching a variable from a statically sized array to an unsized pointer, the meaning of the sizeof operator changes, a trap that makes it easy to introduce bugs when refactoring. C compounds the problem by introducing the empty array notation, as in argv[][], as syntactic sugar for a pointer **argv, an affordance that Ritchie ultimately concluded “serves as much to confuse the learner as to alert the reader.”

It might seem that, like many of C’s quirks, the array semantics are just an unfortunate legacy. But I think Ritchie was right to insist that, relatively speaking, C’s semantics ultimately provide “a uniform and simple mechanism” for array manipulation. Inspection of how some more recent languages handle arrays reveals a fundamental impedance between modeling arrays as scalar pointers and as sized regions of memory.²

Go follows C in allowing for arrays to be declared on the stack or inline within a structure, without extra pointer indirection. Unlike C, it distinguishes clearly between array-typed variables and pointers to arrays, defining array assignment as a value copy which works for both variables and struct members (see below).

However, Go’s uniform treatment of arrays breaks down when moving from statically sized arrays to dynamically sized slices. While they borrow the C-style empty array syntax, Go treats slices as a fundamentally different entity. Assigning the value of one slice to another happens by reference. As with C’s sizeof operator, this is a common vector for bugs: a simple refactoring to remove a constant bound, if not done carefully, can completely change a program’s semantics.

package main

import "fmt"

type aggregate struct{ i [4]int }

func main() {
	var a1, a2 aggregate
	var t1, t2 [4]int
	var s1, s2 []int

	// structs: copy by value
	a1.i[0] = 5
	a2.i = a1.i
	a1.i[0] = 10

	fmt.Printf("%v; %v\n", a1, a2)
	// ↪ {[10 0 0 0]}; {[5 0 0 0]}

	// arrays: copy by value
	t1[0] = 5
	t2 = t1
	t1[0] = 10

	fmt.Printf("%v; %v\n", t1, t2)
	// ↪ [10 0 0 0]; [5 0 0 0]

	// slices: copy by reference
	s1 = t1[:]

	s1[0] = 15
	s2 = s1
	s1[0] = 20

	fmt.Printf("%v; %v\n", s1, s2)
	// ↪ [20 0 0 0]; [20 0 0 0]
}

There is an alternative, which, to my knowledge, is best exemplified by Rust. Like Go, Rust allows fixed-sized arrays to be stack allocated or directly embedded in an aggregate type. Fundamentally, it treats dynamically sized arrays—including subarrays of a fixed-size array—as ordinary values, with no automatic pointerization. It accomplishes this by restricting how those variables are actually used, forbidding values of types that do not have the Sized trait from being allocated on the stack. The result is that the alias-instead-of-copy behavior of slices ends up being explicit, and therefore also benefits directly from Rust’s checks against modification of aliased data:

#[derive(Default)]
struct Aggregate {
    i: [u8; 4],
}

fn main() {
    // structs: copy by value
    let mut a1: Aggregate = Default::default();
    let mut a2: Aggregate = Default::default();

    a1.i[0] = 5;
    a2.i = a1.i;
    a1.i[0] = 10;

    println!("{:?}; {:?}", a1.i, a2.i);
    // ↪ [10, 0, 0, 0]; [5, 0, 0, 0]

    // arrays: copy by value
    let mut t1: [u8; 4] = Default::default();
    let t2: [u8; 4];

    t1[0] = 5;
    t2 = t1;
    t1[0] = 10;

    println!("{:?}; {:?}", t1, t2);
    // ↪ [10, 0, 0, 0]; [5, 0, 0, 0]

    // slices: borrowed reference
    let mut s1: &mut[u8] = &mut a1.i;
    let s2: &mut[u8];

    s1[0] = 15;
    s2 = s1;
//  ------- borrow of `*s1` occurs here
    s1[0] = 20;
//  cannot use `*s1` because it was mutably borrowed

    println!("{:?}, {:?}", s1, s2);
}

Rust’s approach does come at the expense of an expansion of the language space; beside the restrictions on unsized type usage, both the the type system and the runtime representation of pointers had to be extended to accommodate these types, the former to handle values with attributes (array size or concrete type) left indeterminate and the latter to allow fat pointers to expose the missing information at runtime. However, it creates semantics that effectively generalize fixed size, dynamically sized, and embedded arrays, with the special-casing visible to the user mostly restricted to the hard constraints of the computer architecture.

After some reflection, I believe Rust’s approach is about the most general possible in a language that exposes array allocation on the stack. Ritchie sacrificed clarity, as well as generality of the assignment operator, for a backwards-compatible and uniform treatment of variable and struct fields. Go sacrifices a uniform treatment of arrays and slices in order to maintain primitives that can be efficiently manipulated. Rust restricts only the use, but not the expression, of patterns that can’t be easily accommodated by conventional stack allocation.

As an academic exercise, it is possible to imagine a fully general array semantics in a language, like Go, that does not have a strict stack/heap distinction. In Go, locally declared variables are already heap-allocated if they could be used by reference outside of the local stack frame. It’s possible to contemplate leveraging this to implement copy-based semantics for dynamically sized slices, with non-pointer assignments copying to the heap and users taking the address of a slice explicitly to assign by reference. The ABI already supports passing pointers for nominally non-pointer values in order to support closures; this could be extended to accommodate heap pointers and concrete sizes to return dynamically sized slices. However, given the impracticality of a language that prioritizes the (potentially much more) expensive access pattern by default, I think the design space Ritchie wrestled with has reached a stable point.

This is not to make any absolute claims to priority. Ritchie notes that languages contemporary to C, like Pascal and Algol 68, also had problematic array facilities, and no mainstream language in the interim that I am aware of has made the kind of design choices I describe here—not Ada or C++ or D, not reference-centric languages like Java and most scripting languages, and not the mostly immutable functional languages and Lisps. I’d be interested to hear about earlier precedents. ↩
For the purpose of this post, I’m going to ignore stack-allocated variable-sized arrays, although Ritchie alludes to them briefly and Rust is growing partial support to match C’s. Suffice it to say that because variably sized allocations can overflow the stack—with their safety depending on non-local factors like function parameters and the use of temporaries—they should be treated as a much more niche tool than array allocation on the heap or static stack allocation. I also consider the optimization complexity that arises from C’s propensity for pointer aliasing, another concern Ritchie highlights, to be more of a problem with pointers than with array semantics. ↩