Prefer List of Tuples over Tuple of Lists

Using a list of tuples instead of a tuple of lists is an easy way to help make illegal states unrepresentable without any custom types in your code.

We can do this in places where the multiple lists are required to be the same length, like a method taking in multiple lists or a structuring storing multiple lists. Let's take a look at an example of this:

def foo(y_pred: List[float], y_true: List[float]) -> float:
...

We need to check that the two inputs are of equal length and raise an exception if not:

def foo(y_pred: List[float], y_true: List[float]) -> float:
if len(y_pred) != len(y_true):
raise ValueError("Lists need to be of the same length")
...

While this is okay and we've handled our error case, it would be better if our function did not have to check this. Programmers are fallible, and it's all too easy to forget this check, especially if we come back to this code in the future and refactor (or if someone else comes across the code).

I argue that the following is better for many cases:

def foo(y_pred_true_pairs: List[Tuple[float, float]]) -> float:
...

While we gain a longer parameter name, we also eliminate the possibility of there being a different number of y_pred and y_true elements just by using a different data structure!

Of course, if an upstream process was producing two different sized lists, then changing our function signature doesn't magically fix that problem. It does push these concerns away from our function and closer to the source of any problems, which is a positive change.

It should not be our foo function's concern that the lengths of the two lists could be different. Any error handling that foo has to implement increases its complexity. Also, any caller of the original foo function should be made aware that it can raise an exception on some inputs, namely inputs where the two lists are of a different length. Put another way, the original foo is a partial function, not a total function.

The new foo pushes the concern upstream because of the following:

To call the foo function, the caller needs to convert the two lists (or whatever format the data is in) into a list of tuples. If the caller can do this successfully, then it will be able to call foo without having to worry about any exceptions. If the data is invalid, then the caller will not be able to convert the data.

If this is due to a bug in the caller's implementation, then great, we've exposed the bug very close to the source. If this is due to a bug upstream of the caller, then our caller has a couple of options on how to handle this:

  1. If the caller is able to gracefully handle this error and still compute the result, then it can.
  2. If the caller needs the result of foo to compute its result, then we are in the same situation as before, just one level up. We should consider having the caller take the data in the same format as foo in order to push this concern upwards once more.

The continual pressure that this applies is good because it pushes data-related concerns closer to the boundary of our system. By handling any invalid data at the boundary, we free up the rest of our internal code from these worries.

For more information about this, I highly recommend the excellent Parse, don't validate post.

And one last thing: While our foo function is better, we can further improve it (provided that we don't mutate y_pred_true_pairs) by mimicking immutability with our type hints. So with this change:

def foo(
y_pred_true_pairs: Sequence[Tuple[float, float]]
) -> float:
...

and a static analysis tool like mypy, we're able to get an error if we accidentally mutate it!

All of this is part of a trend about pushing potential runtime errors to compile/static analysis time. It's much easier to catch a bug during development with a compiler or linter than it is to push out a fix to customers in production. The small amount of additional development time will pay dividends over the long run.

So, next time you are in a situation where you have multiple lists that need to be the same length, consider requiring a list of tuples instead!