[ samkhn Home | Comments or bugs? Contact me ]


Abstraction, in moderation.

Published: January 13, 2023

    Abstractions are everywhere in software. Operating systems are an abstraction over hardware that makes computers secure, interactive, and accessible. Many tech companies sell abstractions as a key offering: Google Cloud and AWS provide cloud VMs as an abstraction over physical machines, Stripe provides an abstraction over financial infrastructure, Docker provides an abstraction over operating systems so that developers can deploy in more places, etc. Abstractions let organizations save costs by reusing other people's work, and they help form a lingua franca for discussing technical ideas. Despite all of this, I still think abstraction should be practiced in moderation.

    New programmers learn abstraction through function calls, OOP, and interfaces. Some students dive deeper into object-oriented programming and learn about the dangers of over-exposed APIs. They introduce copy/move constructors. They defend their data with get and set functions that prevent uncontrolled state changes. They may end up with something like this (they haven't learned about templates yet):

class CharVector {
public:
  CharVector() {...}
  CharVector(const char* str) {...}
  CharVector(CharVector&& other) {...}
  CharVector& operator=(CharVector&& other) {...}
  // Copying disallowed
  CharVector(const CharVector&) = delete;
  CharVector& operator=(const CharVector&) = delete;
  // Getter
  size_t Length() const { return len_; }
private:
  size_t resize();
  double load_factor_;
  size_t len_;
  char *memory_;
};

static const char* kMessage = "Hello";

int main(int argc, char **argv) {
  CharVector cv(kMessage);
  size_t length = cv.Length();
}

Now...is this better than just declaring
static const char* kMessage = "Hello";
static size_t kMessageLen = 5;
and then using kMessage and kMessageLen? Globals are dangerous: sure, but how is this program being used? How would the autograder that runs this software misuse it? This is a toy program used to practice and demonstrate conceptual understanding. My program is noticeably simpler if I just use the things that are deemed dangerous. When students and young programmers are told something is dangerous, many opt to never use it rather than learn the costs and benefits of the tool. The example above alludes to my original statement: abstraction (and frankly everything) should be used in moderation and in light of what needs to get done. For most of us, a program runs on a server or our own machine, on top of a processor, memory, and maybe some disk storage (on hardware, NOT in a browser, NOT in Linux/Windows). If we don't have the security or engineering requirements of a large company, let's not introduce patterns that work for them but won't work for us.
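
    For comparison, here is a minimal sketch of the flat version as a complete program (the assert is only there to use the globals):

#include <cassert>
#include <cstddef>

static const char* kMessage = "Hello";
static const size_t kMessageLen = 5;

int main() {
  // No constructors, no hidden allocation, no extra stack frames:
  // the data and its length are right there.
  assert(kMessage[kMessageLen] == '\0');
  return 0;
}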

    The example can feel a little elementary, so let me go through a more concrete one: Docker. Docker is an abstraction (Docker->OS) of an abstraction (OS->hw). Docker is built on namespaces and cgroups from the Linux kernel. Google engineers proposed control groups as a way for (groups of) processes to get a scoped set of resources from the kernel, preventing any single process from forcing other processes to OOM. As you may already be able to tell, Docker isn't really used to prevent one process from exhausting another's resources. Rather, Docker is used so that you can ship your program anywhere and not have to worry about dependencies. Instead of cloning a repo and then recreating the developer's environment manually, Docker (and Dockerfiles + Docker Hub) simplifies the process of spinning up a reproducible dev environment.
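
    To make the kernel mechanism concrete, here is a minimal sketch of poking the cgroup v2 filesystem interface directly (assuming cgroup v2 is mounted at /sys/fs/cgroup, that we have permission to write there, and with "demo" as a group name made up for illustration):

#include <fstream>
#include <sys/stat.h>
#include <unistd.h>

int main() {
  // Creating a directory under the cgroup v2 mount creates a new cgroup.
  mkdir("/sys/fs/cgroup/demo", 0755);

  // Cap the group's memory at 100 MiB. The kernel OOM-kills members that
  // exceed it, leaving processes outside the group untouched.
  std::ofstream("/sys/fs/cgroup/demo/memory.max") << "104857600\n";

  // Move this process into the group; its children inherit the limit.
  std::ofstream("/sys/fs/cgroup/demo/cgroup.procs") << getpid() << "\n";

  return 0;
}

Container runtimes do a much more careful version of this same bookkeeping, plus namespace setup to isolate what the process can see.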

The geometry of abstraction

    Like any conceptual tool, you should build abstractions when the benefit they bring outweighs the cost. Here is a (not very fleshed out) idea for how to reason about abstraction: (1) I think we should reason about software complexity by correlating it to size, and (2) software size can be understood in two dimensions: LOC and jumps/stack frames.

    Let's start with #2 (software size can be understood in two dimensions: LOC and jumps/stack frames) and assume #1 (for now). In the code example above, the class abstraction over the raw message pointer increased the line count more than it saved. Say line width is fixed at some value like 80 or 100 characters. This lets us reason about code complexity as correlating with LOC, which in turn corresponds to the area of a single file or stack frame. An abstraction may or may not reduce LOC, but it definitely increases height by introducing a new frame/scope. Each use of an abstraction aims to reduce length by increasing height.

void Foo()
-> void Bar()
---> void Baz()
-----> void Faz()
In the snippet, we have a height of four, while implementing Foo, Bar, Baz, and Faz has (presumably) reduced LOC elsewhere. From this, we can treat managing software complexity as a balancing act between length and height, where length should always be greater than height. I'm not sure what the ratio between length and height should be, though.
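
    As a toy illustration of the tradeoff (the names and numbers here are made up), the same computation can be written flat or factored:

// Flat: more length, height of one. Every step is visible at once.
int TotalFlat(int a, int b) {
  int sum = a + b;
  int doubled = sum * 2;
  return doubled + 1;
}

// Factored: shorter at each level, but a height of three.
// Reading it means mentally pushing and popping two extra frames.
int Add(int a, int b) { return a + b; }
int Double(int x) { return x * 2; }
int TotalFactored(int a, int b) { return Double(Add(a, b)) + 1; }

Here the factoring buys nothing: the helpers aren't reused, so we paid height without saving length. The trade only starts to pay off once many call sites share the same frames.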

    Let's go back to #1 (software complexity correlates with size). I think this is a fair assumption, or axiom. I want to define simple as "no longer reducible." Whenever a programmer debugs, they are running the program in their head. Each line the programmer reads updates the running state of the relevant objects. Each function call and jump introduces a new stack frame to mentally track. An abstraction can reduce the number of lines they need to reason about, but it introduces the cost of reasoning about a function call.

    So does this (likely incomplete) geometric interpretation of abstraction and complexity apply to Docker abstracting the OS and executable formats, and to operating systems abstracting hardware? Yes! The Dockerfile ends up playing a similar role to CMake and GNU configure scripts (they scan system dependencies and configure the compilable unit for the machine). Each Dockerfile gets parsed and "executed" line by line to change the running state. Each state change is incredibly powerful, loading entire OS images and installing packages/dependencies. Although I do think Docker is a band-aid for the problem of operating systems not working together (read: the Windows PE format), it provides a powerful and useful abstraction.
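
    A minimal Dockerfile sketch (the image, paths, and binary names are made up for illustration); each line is one state change that buys an enormous amount of length for a height of one:

# One line pulls in an entire OS image.
FROM ubuntu:22.04
# One line installs a compiler and everything it depends on.
RUN apt-get update && apt-get install -y g++
# Bring the repo into the image and build in a now-known environment.
COPY . /app
RUN g++ -o /app/demo /app/main.cc
CMD ["/app/demo"]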

    The geometric interpretation of software complexity lets a programmer reason about whether an abstraction brings a benefit using the height:length ratio. Clearly, this interpretation is incomplete, so if you (reader) have ways to extend it or interesting edge cases where it might fail, let me know (email at the top of the page).

    An interesting aside I noticed while writing this: more abstraction can improve profiling data but make debugging a little less clear. If you aren't using a profiler like Intel VTune or Linux perf, you can use a debugger as a sampling profiler by pausing the program at random points during execution (more likely than not, you will stop in the true bottleneck of your binary). With more abstraction/stack frames, this poor man's profiler lands closer to the true bottleneck, because the hot work shows up as its own named frame in the backtrace. However, with more abstractions you will have more stack frames to pop during debugging, which means a more complex program to reason about.
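
    A sketch of the technique with gdb (PID and ./mybinary are placeholders):

# Attach, grab one stack sample, detach; repeat a handful of times
# and see which frames keep showing up.
gdb -p PID -batch -ex "bt"

# Or interactively: run, hit Ctrl-C at a random moment, inspect the stack.
gdb ./mybinary
(gdb) run
^C
(gdb) bt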