Skip to main content

Unsigned Types

Most modern languages have stopped offering the use of unsigned types. However, in C and C++ we do have the choice. Why would you use unsigned types?

One benefit of using an unsigned type for variables whose domain does not include negative numbers is that we get to use that extra bit. So if we represent the number of non-zero entries in a sparse matrix as uint32_t instead of int32_t, we can have four billion non-zeros instead of just two billion. This use to be a pretty strong argument when most workstations/compute nodes could only operate on matrices with less than four billion non-zeros (~32GB).

Range

But now that high end machines have more than 128GB of memory, using a uint32_t for the number of non-zeros (or non-zero indexing), we need to use 64-bit integers for big matrices. The real benefit of using an unsigned type to refer to the number of non-zeros or indices, is in range checking.

We assume you're a good programmer and thus prefix array accesses with assertions. So when using a signed integer i:

  assert(i >= 0);
  assert(i < arrayLen);
  x = array[i];

Now, since we should never have a negative index, using an unsigned type seems reasonable. Furthermore, we only need to check the upper bound.

  assert(i < arrayLen);
  x = array[i];

Now, imagine we get lazy and we do not check the bounds.

If i was a signed integer and were to take on the value of -1, we would simply access the memory location one spot before the array begins. For most programs the location just before an allocated chunk of memory is accessible (either because its part of a previous heap or stack allocation, or because its used for bookkeeping by the memory allocator). However, if i were an unsigned integer and underflowed, instead of taking on the value of -1, it would have the value of 4,294,967,296. For most programs array[4294967296] will refer to a memory location that the program does not have access to. And, even though this is undefined behavior for C and C++, it is not undefined behavior on most operating systems when not run by a privilidged user. In this case it will reliably generate a segmentation fault or bus error, which is much easier to debug.

Traps

A common pitfall (and subsequently common complaint), is when using unsigned variables in decrementing loops.

uint32_t i;
for (i = arrayLen-1; i >= 0; --i) {
  ...
}

Here, the loop never exits, as instead of going to -1, it goes to 4,294,967,296, and continues decrementing forever. However, the solution for this is easy.

uint32_t i;
for (i = arrayLen; i > 0;) {
  --i;
  ...
}

While having the decrement operation outside of the for statement may seem a little unnatural, it makes the first two clauses more in line with an incrementing for loop. We initialize i to be equal to the length of the array, and we check against zero with > rather than >=.

The real issue to watch for when using unsigned types, is arithmetic. Both C and C++ use type promotion, and signed types get promoted to unsigned types. For example,

uint32_t x = 10;
int32_t y = 12;

if (x - y < 0) {
  ...
}

results in the if statement evaluating false. This can be easy to miss, especially if x and y are declared many lines earlier. Perhaps even worse though, is when promotion occurs during comparison.

uint32_t x = 10;
int32_t y = -2;

if (x > y) {
  ...
}

Here, y gets promoted to 4,294,967,295, and thus is considered greater than x.

So, how do you prevent these two issues? The easiest way is through compiler warnings (think -Wconversion and -Wsign-compare for gcc). The better, but harder way, is through semantics.

The Metaphyics of Quantity

The way to use semantics to protect against underflowing unsigned numbers, is to differentiate between quantities and deltas. A prime example of this are the types size_t and ptrdifft_t. When expressing the location of an item in an array, or the number of items in an array, size_t should be used. When expressing the offset of an item from a given location, or the distance between two items in an array, a ptrdiff_t should be used. Most programming languages require these semantics when working with times and dates (e.g., <ctime>'s time() returns a time_t, but difftime() must be used to compare times which returns a double or Python's datetime and timedelta).

The benefit here, is a variable's type is now being used to communicate what is happening in a program. Going back to the original topic of indexing sparse matrices, how do we avoid underflowing our index type? Because an index is really just an offset memory location, a better question might be:

Does it ever make sense to talk about a 'negative' memory location?

It doesn't, so why should your program calculate one?