How to talk about C

According to the Stack Overflow 2017 Survey, nearly 20% of programmers use C. There are many good C programmers whose knowledge of technical details and terminology lags well behind their ability to write good C code. This can lead people to underestimate the skill of these programmers, and in some cases can lead to very unpleasant bugs. So below is a basic set of things to know about C to sound like an expert.

The anatomy of a C program

Compilation units: The source files of your program once the pre-processor has inserted all header files referenced by #include (the C standard calls these translation units). We talk about what code is in a given compilation unit when discussing the visibility of static variables and functions, as well as what inlining optimizations the compiler can perform.

Header files: The .h files of your program. Header files will contain things like function prototypes (i.e., function declarations), globally accessible enum and struct definitions, as well as forward declarations of structs. It is important to avoid any unused includes in your header files, because anything included by a header file is also included by everything that includes that header file.

/* make sure we don't have multiple definitions in a compilation unit */
#ifndef MY_FILE_H
#define MY_FILE_H

/* forward declared struct -- can only be referenced/passed by pointer */
struct heap;

/* function prototypes */
struct heap * create_heap(
    int size);
#endif

Header files should always be wrapped in include guards: the #ifndef, #define, and #endif statements shown above. This prevents things from being defined multiple times within the same compilation unit. For example, if you include two header files that both make use of size_t, then both of these contain <stddef.h> somewhere in their include trees. The include guards prevent multiple definitions of size_t.
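
As a minimal sketch of what the guard does, the listing below pastes the same header body twice into one file, which is effectively what the pre-processor produces when two headers both include a third. The names POINT_H, struct point, and point_sum are hypothetical and chosen only for illustration.

```c
/* First inclusion: POINT_H is not yet defined, so the body is kept. */
#ifndef POINT_H
#define POINT_H
struct point {
  int x;
  int y;
};
#endif

/* Second inclusion: POINT_H is now defined, so the pre-processor skips
 * the body and struct point is not defined a second time. */
#ifndef POINT_H
#define POINT_H
struct point {
  int x;
  int y;
};
#endif

/* hypothetical helper showing that the single definition is usable */
int point_sum(
    struct point const p)
{
  return p.x + p.y;
}
```

Removing the guard lines from both copies should make the compiler report a redefinition of struct point.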

Header files should not be compiled directly. You will see header files included in source files with either #include <my_file.h> or #include "my_file.h". Generally the <> notation is used for system and external library headers (e.g., #include <stddef.h> and #include <omp.h>) and the "" notation is used for local headers (e.g., #include "my_file.h"). The two differ in where the pre-processor looks: with the "" notation, most compilers first search the directory of the including file and then fall back to the directories searched for the <> notation. The exact search order is implementation-defined.

Source files: The .c files of your program. This is where the bulk of the logic of your program should be.

#include "my_file.h"
#include <stdlib.h>

/* struct definition */
struct heap {
  int size;
  int maxsize; 
  int * data;
};

struct heap * create_heap(
    int const size)
{
  struct heap * h;
  h = malloc(sizeof(*h)); 
  h->size = 0;
  h->maxsize = size;
  h->data = malloc(sizeof(*(h->data))*size);
  return h;
}

A source file is what is given directly to the compiler:

gcc -c my_file.c

which when pre-processed pulls in the header files.

The phases of binary construction

  1. Pre-processing:

    (Can be achieved with gcc -E)

    The C pre-processor handles all of the lines of code prefixed by #, leaving regular C code behind. This includes directives like #include and #define and conditionals like #if and #ifndef. #pragma statements are generally passed through for the compiler to act on.

    Keep in mind that the replacement of #include with the contents of the file differs from the replacement of #define macros, as the contents of the included file are themselves processed by the pre-processor. Though generally inadvisable, this can be exploited to perform complex tasks at compile time that are otherwise not possible in pure C.

    struct heap;
    
    struct heap * create_heap(
        int size);
    void insert(
        struct heap * self,
        int element);
    
    /* <-- 1,227 lines from stdlib.h would be placed here */
    
    struct heap {
      int size;
      int maxsize;
      int * data;
    };
    
    struct heap * create_heap(
        int const size)
    {
      struct heap * h;
      h = malloc(sizeof(*h));
      h->size = 0;
      h->maxsize = size;
      h->data = malloc(sizeof(*(h->data))*size);
      return h;
    }
    

    Notice that comments are stripped out as part of the pre-processor phase.

  2. Compilation:

    (Can be achieved with gcc -S)

    Compilation is where the C code is turned into assembly code. This phase is important to programmers as it is where syntax errors will be reported, and where most optimizations will be performed. This includes things like function inlining, loop unrolling, and auto-vectorization.

      .file "my_file.c"
      .text
      .globl    create_heap
      .type create_heap, @function
    create_heap:
    .LFB5:
      .cfi_startproc
      pushq %rbp
      .cfi_def_cfa_offset 16
      .cfi_offset 6, -16
      movq  %rsp, %rbp
      .cfi_def_cfa_register 6
      subq  $32, %rsp
      movl  %edi, -20(%rbp)
      movl  $16, %edi
      call  malloc@PLT
      movq  %rax, -8(%rbp)
      movq  -8(%rbp), %rax
      movl  $0, (%rax)
      movq  -8(%rbp), %rax
      movl  -20(%rbp), %edx
      movl  %edx, 4(%rax)
      movl  -20(%rbp), %eax
      cltq
      salq  $2, %rax
      movq  %rax, %rdi
      call  malloc@PLT
      movq  %rax, %rdx
      movq  -8(%rbp), %rax
      movq  %rdx, 8(%rax)
      movq  -8(%rbp), %rax
      leave
      .cfi_def_cfa 7, 8
      ret
      .cfi_endproc
    .LFE5:
      .size create_heap, .-create_heap
      .ident    "GCC: (GNU) 7.2.0"
      .section  .note.GNU-stack,"",@progbits
    

    Notice that none of the unused declarations included from stdlib.h appear in the output: declarations on their own generate no assembly, and only the calls to malloc remain.

  3. Assembly:

    (Can be achieved with gcc -c)

    Generally, turning assembly code into machine code is not of much interest to programmers, as errors are rarely found in this phase (unless you are handwriting some assembly). This phase produces object files (.o files). When creating a library, object files are combined into a single file (a .a archive for static libraries, or a .so shared object for shared libraries on Linux).

                     U _GLOBAL_OFFSET_TABLE_
    0000000000000000 T create_heap
                     U malloc
    

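    When we use the tool nm to inspect the generated object file (e.g., nm my_file.o), we can see that our function create_heap is defined at offset 0 of the text section (the T), while symbols marked U, such as malloc, are undefined here and must be resolved at link time.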

  4. Linking:

    (Will happen by default with gcc)

    Linking is where object files are combined into a fully executable binary. This is the stage where you will see errors such as undefined reference, if you are missing a library or object file.

The chronology of language features

When writing C programs, it is important to know which features you are able to use on a given platform.

  • K&R C: In 1978, Brian Kernighan and Dennis Ritchie published The C Programming Language. While C existed prior to this book, the features described in it were often used as the base set of C you could rely on compilers supporting. Many things were missing that are now common in modern C code, such as function prototypes, void pointers, and C++ style comments //.

  • ANSI C: In 1989, the C standard was ratified by the American National Standards Institute. This is often referred to as C89 or, after its adoption by ISO in 1990, as C90. This added function prototypes, void pointers (which replaced using char * for general purpose memory locations), and extra pre-processor functionality.

  • C99: In 1999, the International Organization for Standardization ratified a new C standard. This included many new features, including the inline and restrict keywords, C++ style comments //, variable length arrays, and intermingled declarations and code (new variables did not have to be declared at the top of a function). Changes important to math intensive codes were also included, such as support for complex numbers, expanded library functions, and improved support for IEEE754 floating point numbers.

  • C11: This was the next version of ISO C, originally called C1x, before being ratified in 2011. C11 adds threading support via the threads.h header, atomic primitives, and type-generic expressions via the _Generic keyword.
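
One way to check which of these standards a compiler claims to support is the predefined macro __STDC_VERSION__, which is absent in C89/C90 and has the value 199901L for C99 and 201112L for C11. The following is a minimal sketch. The function c_standard_name and the macro TYPE_NAME are hypothetical names used only for illustration.

```c
/* Returns a short name for the C standard the compiler reports. */
char const * c_standard_name(void)
{
#if !defined(__STDC_VERSION__)
  return "C89/C90";
#elif __STDC_VERSION__ >= 201112L
  return "C11 or later";
#elif __STDC_VERSION__ >= 199901L
  return "C99";
#else
  return "C95";
#endif
}

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
/* _Generic (C11) selects one expression based on the type of (x).
 * Only defined when the compiler reports C11 support. */
#define TYPE_NAME(x) _Generic((x), \
    int: "int", \
    double: "double", \
    default: "other")
#endif
```

Guarding newer features this way lets the same source file still compile on an older compiler, at the cost of providing a fallback wherever TYPE_NAME would be used.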

Because C is used on an extremely wide range of platforms, and for a wide range of purposes, there are two types of C implementations. A hosted C implementation is one which provides the full standard library (e.g., rand(), qsort(), malloc(), etc.), and fully supports the language. A free-standing C implementation is one which supports most if not all of the language, but may provide only a small part of the standard library, as is common for embedded systems and operating system kernels.
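
Since C99, the predefined macro __STDC_HOSTED__ reports which kind of implementation is in use (1 for hosted, 0 for free-standing). A small sketch of checking it, where is_hosted is a hypothetical function name:

```c
/* Returns 1 on a hosted implementation and 0 otherwise (including on
 * compilers too old to define __STDC_HOSTED__). */
int is_hosted(void)
{
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 1
  return 1;
#else
  return 0;
#endif
}
```

In practice the same check is more often used at the pre-processor level, to decide whether headers such as <stdlib.h> can be included at all.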

Often neglected keywords

While every C programmer will know and use things like if, void, and struct, many large code bases have been written without the use of typedef, static, or const.

The static keyword

At file scope, this is used to restrict the visibility of a variable or function to the current compilation unit (this is called internal linkage). This allows defining constants and functions which should not be used outside of a .c file.

static void clear(
    int * const ptr,
    size_t const num)
{
  ...
}

A common practice is to place static functions above the non-static ones they are called by in a .c file, which means they do not need a function prototype.

This can also be used for constants, without worrying about other constants with the same name elsewhere in the program:

static size_t const BUFFER_SIZE = 1024;
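
Putting the two uses together, a source file might look like the sketch below. The body of clear, the smaller BUFFER_SIZE, and the function reset_buffer are assumptions made for illustration and are not taken from any particular code base.

```c
#include <stddef.h>

/* visible only inside this compilation unit */
static size_t const BUFFER_SIZE = 8;

/* static helper defined above its caller, so no prototype is needed */
static void clear(
    int * const ptr,
    size_t const num)
{
  size_t i;
  for (i = 0; i < num; ++i) {
    ptr[i] = 0;
  }
}

/* non-static: other compilation units may call this (hypothetical name).
 * The caller must supply at least BUFFER_SIZE ints. */
size_t reset_buffer(
    int * const buffer)
{
  clear(buffer, BUFFER_SIZE);
  return BUFFER_SIZE;
}
```

Another source file in the same program could define its own static clear or BUFFER_SIZE without causing a conflict at link time.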

The typedef keyword

This is used to create aliases to types. This can be useful to add semantics to your data types:

typedef float distance_type;

or to allow changing a data type in a single location of a program:

/* can be changed to double later if we need to */
typedef float real_type;

Some programmers find it useful in order to avoid prefixing struct variables with the struct keyword (this happens in C++ by default). Below, my_struct needs to be referenced as struct my_struct, but my_other_struct needs to be referenced as my_other_struct, without the struct keyword.

struct my_struct {
  int x;
};

typedef struct {
  int x;
} my_other_struct;

Less often, you will see someone allow for both:

typedef struct dual_struct {
  int x;
} dual_struct;

This can then be referenced as struct dual_struct or just dual_struct.
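
A short sketch showing that both spellings name the same type and can be mixed freely. The function sum_x is a hypothetical name for illustration.

```c
typedef struct dual_struct {
  int x;
} dual_struct;

/* the first parameter uses the struct tag, the second uses the typedef */
int sum_x(
    struct dual_struct const * const a,
    dual_struct const * const b)
{
  return a->x + b->x;
}
```

A variable declared as dual_struct can be passed for either parameter, as can one declared as struct dual_struct.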

The const keyword

This is the most underused keyword in the language. It is so underused, in fact, that many modern languages leave it out. Not only should you know what it does (beyond just that it disallows modification of a variable via compile-time enforcement), you should use it as much as possible! It can be post-fixed or pre-fixed on a variable's type:

const int x = 0;
const int * y = NULL;

is the same as

int const x = 0;
int const * y = NULL;

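Keep in mind that while the compiler will stop you from performing ++x, the value of x can still be changed by casting away the const (e.g., ++*((int*)&x)), or by accidentally writing to its memory location. Modifying an object that was defined const in this way is undefined behavior, so the compiler is free to assume it never happens.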

When using const with pointer types, it can refer to the pointer itself (e.g., int * const ptr, where the memory address, eight bytes on most 64-bit platforms, cannot change), or to the memory pointed to (e.g., int const * ptr or const int * ptr). When you declare the memory pointed to as const you are changing the type of the pointer. Thus trying to pass char const * into a function taking char * should produce a warning or error, depending on your compiler. However, the reverse is valid; you can pass a pointer to mutable memory into a function which takes a pointer to constant memory without performing a cast.
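
The two placements of const can be compared side by side. The sketch below uses hypothetical names (sum, const_pointer_demo, fixed, readonly) and leaves the lines that would not compile as comments.

```c
#include <stddef.h>

/* takes a pointer to constant memory: may read, but not write, values */
int sum(
    int const * const values,
    size_t const num)
{
  int total = 0;
  size_t i;
  for (i = 0; i < num; ++i) {
    total += values[i];
  }
  return total;
}

int const_pointer_demo(void)
{
  int data[3] = {1, 2, 3};
  int other = 10;

  int * const fixed = data;    /* the pointer is const, the ints are not */
  int const * readonly = data; /* the ints are const, the pointer is not */

  fixed[0] = 5;       /* allowed: writes through a const pointer */
  readonly = &other;  /* allowed: re-points a pointer to const memory */
  /* fixed = &other;   would not compile: the pointer itself is const */
  /* readonly[0] = 1;  would not compile: the target is const */

  /* data is int *, which converts to int const * without a cast */
  return sum(data, 3) + *readonly;
}
```

Reading the declaration from right to left helps here: int * const is a const pointer to int, while int const * is a pointer to const int.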

Finally, function declarations need not apply const to their parameters, and the function definitions can make those same parameters constant.

int foobar(
  int x,
  int const * y);

...

int foobar(
    int const x,
    int const * const y)
{
  ...
}

In this example, the function definition has to define the memory pointed to by y as const, but the pointer y itself can be declared const only in the function definition. Declaring the parameters themselves as const in the function declaration is ignored by the compiler. This is because the parameters are passed by value, and any modifications to x (e.g., x = 2) or y (e.g., y = &x), would be limited to the scope of foobar(). Modifications to the memory pointed at by y would be visible outside of foobar() (e.g., y[0] = 2), and thus the constness matters for what y points to.
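
A complete version of this pair that compiles might look like the following sketch. The function body is a placeholder chosen for illustration, since the original leaves it unspecified.

```c
/* declaration, as it might appear in a header file */
int foobar(
    int x,
    int const * y);

/* definition: x and the pointer y are additionally const here, which is
 * still compatible with the declaration above */
int foobar(
    int const x,
    int const * const y)
{
  /* placeholder body: x = 2 or y = &x would now be rejected here */
  return x + y[0];
}
```

Callers see only the declaration, so the extra const qualifiers act purely as a safeguard for the code inside the function body.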