Friday, May 29, 2009

Dynamic Linking and Distributed Computing Don't Mix



Dynamic linking is one of the more frustrating aspects of distributed computing in the real world. It's is the sort of technology that is meant to optimize the computer's happiness at the expense of the end user's sanity. Dynamic linking should really be avoided, except in a few very specific cases outlined below.

For those of you who don't remember, here is a brief primer on linking:

Back in the good old days, programmers would group commonly used functions (like printf and strlen) into a common module, otherwise known as a library. However, managing the library was difficult. If you simply compiled your library into the program, it would work, but your program would be full of unused code. The alternative was to cut-and-paste the needed routines into your program, but this was time consuming, and led to many copies of the code that were difficult to synchronize. Frustration was the result.

The solution to this is a tool known as a link editor or just linker. A linker looks at a program and a set of libraries, figures out all the pieces that are needed, and then constructs a complete executable program with only the routines that are actually needed. In the example below, suppose that main.o needs to use the functions printf.o and baz.o. The linker figures out that those reside in libc.a and libstrange.a, and puts the whole thing together in prog.exe. This program can be copied to any other machine, and will run correctly. This is now known as static linking.

As machines grew larger, and had ever more programs and libraries installed, someone clever observed an inefficiency. Nearly every program requires printf, so a copy of the printf code was present in nearly every single program, wasting space in both the filesystem and virtual memory. Further, if someone fixed a bug or security flaw in printf, it was necessary to recompile everything.

To address these problems, dynamic linking was invented. In this model, the linker does not copy routines into the executable, it simply makes a note that the program depends upon a certainly library. When the program is actually run, the loader binds the function calls in the program to the shared libraries on disk. Often, the executable program is very small, and simply consists of a few calls to a large number of libraries.


Now enter distributed systems. Suppose that you wish to take a program that you have written on one machine, and run it on another machine? If you have employed static linking, it's easy: you simply copy the program over, and run it. If you have used dynamic linking, it's a real pain: you must identify all of the libraries that the program depends upon, copy them over, set some obscure environment variables, and then run the program.

Ironically, dynamic linking is less efficient than static linking in several ways. First, it actually ends up using more disk space, virtual memory, and network traffic, because you have to copy over the entire libraries, not just the parts that your program needs. (Of course, you can break the dynamic library up into smaller libraries, but then you are just making it harder on the programmer and user to identify the right libraries.) Second, it makes program startup very slow, especially on a distributed filesystem, because the loader must search for every single library in the search path.

For a nice example of how this can make a simple program ridiculously complicated, try the following two commands on Linux: ldd /bin/ls and strace /bin/ls . The former shows the libraries required to run the ls command, and the latter shows the hundreds of system calls needed to just start the program. Of course, a few hundred system calls isn't much by itself, but when you think of hundreds of users sharing a common file server, and ever call to exec() results in this traffic, you can start to see why this might not be a good idea.

So, to sum up:

Static LinkingDynamic Linking
On A Single
Computer
Easy to use.
Wastes space.
Easy to use.
Saves space.
In a Distributed
System

Easy to use.
Saves space.

Hard to use.
Wastes space.


My advice? Always use static linking, unless you are 100% sure that every single computer on the planet has the libraries that you need. That means, link dynamically against the standard C and math libraries, maybe against pthreads and X11, and statically against everything else.



Appendix: How to control linking with gcc.


To link everything in your program statically, use the -static flag:
gcc -static main.o -lstrange -lc -lm -o prog.exe

To link some libraries statically and some dynamically, use -Xlinker -Bdynamic and -Xlinker -Bstatic to switch between modes:
gcc main.o -Xlinker -Bstatic -lstrange -Xlinker -Bdynamic -lc -lm -o prog.exe

To see what dynamic libraries your program depends upon, use the ldd command:
% ldd /bin/ls
libc.so.6 => /lib/tls/libc.so.6 (0x00a99000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00cf8000)
/lib/ld-linux.so.2 (0x00a7f000)

10 comments:

  1. And for the opposition: Ulrich Drepper

    ReplyDelete
  2. Isn't it interesting how the faster computers get, the more interested in efficiency we are? The PDP-11s that Unix was originally developed on were unthinkably slow by today's standards (KIPS!). Yet the graybeards wrote lots of shell scripts, used interpreted languages like awk, implemented fork without COW, and didn't have shared libraries.

    ReplyDelete
  3. Hey Doug, psilord here, I had written a similar post about dynamic libraries and how they aren't that great of an idea when you want to move binaries around.

    http://pages.cs.wisc.edu/~psilord/blog/2.html

    Other than a real benefit in the sharing of text segments between processes, the rest of the ideals of dynamic linking seems to be superstitiously believed.

    Of course, on non-linux OSes, backwards compatibility between dynamic library APIs most often works, but a lot of those OSes are dying off, and Linux absolutely doesn't care about backwards compatibility.

    ReplyDelete
  4. Suppose you have two styles of processor that your program may run upon. One of them has a hardware math error in it that the libm.so for that specific revision of the OS patches so the application never sees the error. Now, if you package up all of the libraries for your application and move them to the other processor, you'll get wrong answers. Think this'll never happen? It happened with the IRIX OS on the R Series of processors. That one took a while to track down....

    ReplyDelete
  5. On linux, shared libraries have an oddball scoping mechanism like foo@@GLIBC_2.0 which is the function foo that has a semantic signature matching the definition of it in GLIBC 2.0. Suppose GLIBC 2.1 defines a *different* foo with the same function signature but different internal behavior. Old programs use the old API, new programs use the new API, and you think things work. Now, how many revision of the old things are kept around, the answer apparenty is "until someone notices and removes it". Also, what happens when you take data from an old style function and pass it to a new style function, which can happen when one shared library invokes calls froma different and later compiled shared library? I'll leave the answer as an excersize to the reader...

    ReplyDelete
  6. Note that my advice is simply that dynamic linking should only be used with standard system libraries that can be found on any machine.

    The link above to Mr. Drepper's post mostly discusses issues relevant to libc. I have no argument there, since every machine has a libc.so, and every programs needs to use it.

    Another commenter above points out that the standard math library can work around architectural bugs. Again, every machine has a libm.so, so no argument there.

    The comment about symbol versioning is interesting. It seems to me that it is an unnecessary complication, since you could achieve the same effect with greater clarity by just using ordinary symbol names. If a function foo() needs to have its semantics drastically changed, then just define a new function foo_2(), and require the developer to take the positive step of selecting the new semantics.

    ReplyDelete
  7. Note that my advice is simply that dynamic linking should only be used with standard system libraries that can be found on any machineThis is too simplistic, because it means you need to know what the least common denominator of even your glibc runtimes is, and code to that. For example, if you use the regexec function from glibc, and dynamically link on RHEL 5, your code won't load on RHEL 4, even though RHEL 4's glibc has the regexec function! Static linking fixes that problem. And, good lucking finding this mentioned in the man pages.

    ReplyDelete
  8. The symbol versioning topic has a good explanation here:
    http://lists.debian.org/lsb-spec/1999/12/msg00017.html

    As for changing the names of the symbol, you misunderstand--from the point of view of the user, the two functions act identically, but the libc (or whatever) implements them differently with different internal state. The trouble comes in when you do something like opendir() against an old glibc version, then pass the handle to a differenlt compiled shared library which then uses a newer revision of readdir() on it. The readdir() will expect the handle to be a specific thing, and it may not be that at all.

    ReplyDelete
  9. Also, if you complain that readdir() and opendir() do not have this behaviour so it is a non-issue, I present openssl as an example. I've seen cases where an executable has a static version compiled into it, loads a library which wants a different revision, and then load a different library which wants yet a different revision. The global variables between all of the revisions of ssl are the same (the default link behaviour) so they each trash each other's spaces and handles and end up segfaulting. This is an example which is quite simple to see and has various permutations of it.

    ReplyDelete
  10. And, another thing (continuing the above post). on linux "static" linking means the resolver libraries are still dynamically loaded, so if you use LDAP and the administrator chose to use LDAP, so it loads ssl, then your statically linked ssl in your executable fights with it, and bam, segfault again.

    ReplyDelete