Skip to main content

Pruning Inheritance Trees

dinasour national monument

This weekend I began tackling some technical debt in my WildRiver library for reading and writing matrices and graphs. When I started writing the library a few years ago I made the mistake of re-using lots of code via inheritance. It stemmed from a desire to have a couple of methods that a class for reading and writing a given file format would need to implement.

Sparse Matrices

|  1.0  0.0  0.0 -1.0 |
|                     |
|  0.0  2.0  0.0  0.0 |
|                     |
|  0.3  0.8  0.0  0.0 |
|                     |
|  0.0  0.0  0.0 -3.1 |

A sparse matrix is matrix where only non-zero values are stored. One of the best performing read-only data structures for storing sparse matrices in memory is Compressed Sparse Row. WildRiver uses this as the primary data-structure for matrices (and graphs) that it reads/writes. The first file format I supported in WildRiver was a .csr file, used frequently by my lab in graduate school. The file format is a text file with each line of the file representing a row in the matrix, with the column index of each non-zero followed by the value. The above matrix would be represented as:

0 1.0 3 -1.0
1 2.0
0 0.3 1 0.8
3 -3.1

The issue

At the start, the class hierarchy for the CSRFile class looked like:

CSRFile inheritance

Having methods exposed across multiple parents makes the over all design quite rigid.

Good refactoring starts at the root

Because our goal is to improve the quality of the CSRFile class, changing MatrixTextFile to use composition instead of inheritance isn't really fixing the issue. It is easy be tempted to say MatrixTextFile should be composed of a TextFile instead of inheriting from one. This would eliminate our multiple inheritance and reduce the size of our inheritance tree. However, this would lock us into inheriting from MatrixTextFile class, which inherits from MatrixFile, and implements several interfaces.

Instead, focus should be on the idea that CSRFile will implement interfaces and inherit no code. Inheritance muddies the line between interface and implementation. Not that inheritance is always bad, but it should only be used when interface and implementation should be coupled.

Our code re-use will come entirely from composition. This minimizes the rigidity of CSRFile. It also gives us a strong incentive, to minimize the responsibilities of the CSRFile class, as we'll have to provide some minimal code for each method, even if it just calls a method of a member object. Large inheritance trees make it far too easy to create classes with large numbers of public methods.

Good design starts with good interfaces

Well defined interfaces are the hallmark of good design. I want to keep the current method of having classes that implement the IMatrixWriter and IMatrixReader interfaces, and have them initialized by the MatrixFactor::make() function.

if (CSRFile::hasExtension(name)) {
  file.reset(new CSRFile(name));
} else if (MatrixMarketFile::hasExtension(name)) {

How code re-use is accomplished should be hidden from code which uses the classes.

To emulate the current structure where CSRFile implements the low-level methods and the actual read() and write() are implemented by a parent class, templates could be used (e.g., MatrixTextFile<CSRParser>). To use the same interface, a simple

typedef MatrixTextFile<CSRParser> CSRFile

would suffice. However, this makes CSRFile a rather rigid class, as adding a function, overriding a function, etc., would require inheriting from the template specialization:

class CSRFile :
    public MatrixTextFile<CSRParser>

Which doesn't seem like much of an improvement of the previous design. Plus, I prefer to avoid templates when their use doesn't have a strong justification.

Method Pruning

Whenever I find myself with time to tackle technical debt, I like to look at the public methods exposed by my classes, and apply Uncle Bob's rule of leaving code cleaner than you found it. If you have unused methods, remove them.

Doing some work with grep, I found the following methods could be removed from my IMatrixReader and IMatrixWriter interfaces:

  • firstRow(): Only called from within the implementing classes.

  • getNextRow(): Originally meant to allow processing matrices and graphs in an out-of-core fashion, but was never used.

  • getName(): Unused.

  • getFilename(): Only called from within the implementing classes.

Removing these methods leaves the IMatrixFile interface empty, aside from inheriting from IMatrixReader and IMatrixWriter. The only functions left in IMatrixReader are getInfo() and read(), and in IMatrixWriter are setInfo() and write(). These interfaces now do a good job of respecting the single responsibility principle, and CSRFile now only has interfaces as its parents:

class CSRFile :
  public IMatrixReader,
  public IMatrixWriter

A re-design without a re-write

I didn't want to re-write CSRFile, as it worked fine, and despite the design flaws, was well written. First things first, I changed TextFile from being a parent of CSRFile, to be a member of it. This preserved all of the code and function signatures, only changing how they were called.

The low level .csr specific functions were in CSRFile and the higher level functions were in MatrixTextFile. This made writing new text based matrix storage formats where the non-zero entries are stored in row-major order simple. To preserve this code and this availability of code re-use, I'm added CSREncoder and CSRDecoder classes, to encompass the high-level code for reading and writing matrices using a CSR data structure (rowptr, ind, and val).

The CSREncoder and CSREncoder classes take implementations of the IRowMatrixWriter and IRowMatrixReader respectively. CSRFile implements both of these, providing the writeHeader(), readHeader(), getNextRow(), and setNextRow() methods.

These changes resulted in the inheritance of CSRFile looking like:

reduced CSRFile

Now, while the interfaces are quite clean, having only two methods each (excluding the virtual destructors), the class itself has eight public methods (and one static function). The four methods for the IRowMatrixWriter and IRowMatrixReader interfaces are only exposed so that the member instances of CSREncoder and CSRDecoder can call them. Down the road, I may split those methods out into a private nested class of CSRFile, but the fine-grain interfaces of CSRFile mean the rest of the code is insulted from this.

Design is an iterative process...