When seekdir() Won't Seek to the Right Position

The other day, I got an email from Edd, an OpenBSD user, claiming that Samba would crash when serving files off an MS-DOS filesystem. This was Samba built from sources and not the one from ports. Since I use myself Samba a lot and for a quite large user base, I got interested in the issue and started investigating it. What I found out in the end is a surprise and was not expected: A bug that has been there in all BSDs for almost all the time, since the 4.2BSD times or for roughly 25 years...

The Samba Directory Cache

The Problem With seekdir() In Samba's replacement code I found the following comment, stating the problem from the Samba point of view:
"This is needed because the existing directory handling in FreeBSD
 and OpenBSD (and possibly NetBSD) doesn't correctly handle unlink()
 on files in a directory where telldir() has been used. On a block
 boundary it will occasionally miss a file when seekdir() is used to
 return to a position previously recorded with telldir().

 This also fixes a severe performance and memory usage problem with
 telldir() on BSD systems. Each call to telldir() in BSD adds an
 entry to a linked list, and those entries are cleaned up on
 closedir(). This means with a large directory closedir() can take an
 arbitrary amount of time, causing network timeouts as millions of
 telldir() entries are freed"
Apparently there are two problems: seekdir() not returning to the position initially retrieved using telldir() and a performance problem. Before digging to much into source code, I decided to check the documentation of said functions, just to make sure, they were being used like they are meant to be used. The OpenBSD manual page is clear: The seekdir() function sets the position of the next readdir() operation on the named directory stream dirp. The new position reverts to the one associated with the directory stream when the telldir() operation was performed. Values returned by telldir() are good only for the lifetime of the DIR pointer, dirp, from which they are derived. If the directory is closed and then reopened, the telldir() value may be invalidated due to undetected directory compaction. If you don't close the directory stream, seekdir() will take you back to the position previsouly obtained using telldir(). If the system behaves differently, then it's either a bug or the documentation is wrong. If the system indeed does not take you back to the right position, why do we have these functions then in the first place? A look at the closedir() function in OpenBSD reveals that the statement made by the Samba people about performance in closedir() is wrong: In OpenBSD there is no linked list, but a smarter memory handling scheme implemented by Otto Moerbeek (otto@). FreeBSD, however, uses a linked list indeed, but they can switch to our code at any time. So the performance problem really is a non-issue.

Hunting the seekdir() Bug

As the original author of the *dir() library, you probably fixed one of my bugs :-) Prior to the *dir() commands, programs just opened, read, and interpreted directories directly. I had to update a shocking 22 programs (a large percentage of the programs available on UNIX at the time) to replace their direct interpretation of directories with the *dir() library calls. (Kirk McKusick; private communication)
What I needed is a test program to exercise these functions an eventually trigger the behaviour that prevented the Samba people from enabling a directory cache on BSD systems. I wrote a C program that approximately does the following steps:
  1. Create a directory and populate it with a certain number of files
  2. Iterate over the directory using readdir(), recording the position of the entry using telldir() before readdir(), and storing the obtained values in an array
  3. Delete a random number of files and also mark them as deleted in the in-memory array
  4. Iterate over the in-memory array, skipping the entries marked as deleted, and seekdir() to the others, doing a readdir() and compare the returned values with the in-memory copy
  5. Output a message when the in-memory copy and the value returned by readdir() are different
Jeremy told me that the problem occurs with large directories, so I began my tests with really large directories (up to 250'000 entries) and quite quickly I hit the issue. seekdir() won't return me to the recorded position. This kind of confirmed the problem the Samba people were seeing for more than three years. I tweaked the values of my test program, to see if I can find a pattern. But looking at thousands of directory positions, filenames, inode numbers where more likely to turn me mad than to spot the problem... I started lowering the numbers and to my surprise, I could trigger the problem with as little as 10 or 20 files and deleting just one of them. Suddenly, I had a case that shows the problem on every run, no more randomness: Create 28 files, delete file 25 and seekdir to file 26: You end up at file 27! Staring at the output of my program I suddenly saw the pattern as clear as can be: Creating the directory with 28 files had created a directory that spans more than one block on the disk (2 in this case). File 25 was the first entry of the second block. Obviously the problem occured when you delete the first entry in a block of a directory and then return to the recorded position of the second entry in the same block. This would actually get you one entry to far. By that time I had involved Otto to give me a hand and to confirm my findings. His first reaction: An interesting problem... ;) We investigated further and I began looking at the kernel code that removes a directory entry as well as the library code in libc that implements seekdir() and friends.

How the Kernel Removed Directory Entries

The kernel function to remove a directory entry, ufs_dirremove(), indeed treats entries that are located at the beginning of a block differently than others:
        if (dp->i_count == 0) {
                /*
                 * First entry in block: set d_ino to zero.
                 */
                ep->d_ino = 0;
        } else {
                /*
                 * Collapse new free space into previous entry.
                 */
                ep->d_reclen += dp->i_reclen;
        }
If it is the first entry in a block, the inode number is set to zero, thus invalidating the entry. For all other entries, the record length of the to-be-deleted record is added to the record length of the previous entry. Since the library uses the record length field of an entry to proceed to the next entry in readdir(), this effectively means that the removed entry, while it is still in the block on disc, will no longer be used.

How the C Library Accesses Directory Entries

The C library implements the opendir(), telldir(), readdir(), seekdir(), and closedir() functions. These functions were written in the 4.2BSD times so that UNIX programs don't need to handle directories by themselves.
/*
 * get next entry in a directory.
 */
int
_readdir_unlocked(DIR *dirp, struct dirent **result)
{
        struct dirent *dp;

        *result = NULL;
        for (;;) {
                if (dirp->dd_loc >= dirp->dd_size)
                        dirp->dd_loc = 0;
                if (dirp->dd_loc == 0) {
                        dirp->dd_size = getdirentries(dirp->dd_fd,
                            dirp->dd_buf, dirp->dd_len, &dirp->dd_seek);
                        if (dirp->dd_size == 0)
                                return (0);
                        if (dirp->dd_size < 0)
                                return (-1);
                }
                dp = (struct dirent *)(dirp->dd_buf + dirp->dd_loc);
                if ((long)dp & 03)      /* bogus pointer check */
                        return (-1);
                if (dp->d_reclen <= 0 ||
                    dp->d_reclen > dirp->dd_len + 1 - dirp->dd_loc)
                        return (-1);
                dirp->dd_loc += dp->d_reclen;
                 if (dp->d_ino == 0)
                        continue;
                *result = dp;
                return (0);
        }
}
At first sight, this code looks correct. It will skip deleted entries that have their inode number set to zero and it will use the record lenght otherwise. And since a directory traversal using readdir() works just fine, this is code effectively works. A close look at the seekdir() library implementation finally reveals the problem:
/*
 * seek to an entry in a directory.
 * Only values returned by "telldir" should be passed to seekdir.
 */
void
__seekdir(DIR *dirp, long loc)
{
        struct ddloc *lp;
        struct dirent *dp;

        if (loc < 0 || loc >= dirp->dd_td->td_loccnt)
                return;
        lp = &dirp->dd_td->td_locs[loc];
        dirp->dd_td->td_last = loc;
        if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek)
                return;
        (void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
        dirp->dd_seek = lp->loc_seek;
        dirp->dd_loc = 0;
        while (dirp->dd_loc < lp->loc_loc) {
                _readdir_unlocked(dirp, &dp);
                if (dp == NULL)
                        break;
        }
}
This code will not work as expected when seeking to the second entry of a block where the first has been deleted: seekdir() calls readdir() which happily skips the first entry (it has inode set to zero), and advance to the second entry. When the user now calls readdir() to read the directory entry to which he just seekdir()ed, he does not get the second entry but the third. Much to my surprise I not only found this problem in all other BSDs or BSD derived systems like Mac OS X, but also in very old BSD versions. I first checked 4.4BSD Lite 2, and Otto confirmed it is also in 4.2BSD. The bug has been around for roughly 25 years or more.

The Solution

The fix is surprisingly simple, not to say trivial: _readdir_unlocked() must not skip directory entries with inode set to zero when it is called from __seekdir(). q.e.d. (and sorry that it took us almost twenty-five years to fix it)
Many thanks to Jeremy Allison, Edd Barrett, Volker Lendecke, and especially to Otto Moerbeek for their help and suggestions.

Comments

Hard to find, simple to fix

"and sorry that it took us almost twenty-five years to fix it"

regarding the surprisingly simple solution, the time it takes to find a bug is usually in inverse proportional to the time it takes to fix it ;-)

Other BSDs

Interesting! Have you reported this bug to the other BSDs and Apple? Hopefully they'll want to fix it too.

Other BSDs fixed the bug

Yes, of course I reported this to FreeBSD, NetBSD, and DragonflyBSD and they all fixed it. In OpenBSD, it has also been committed to the patch branches. I don't know anyone in Apple, but they will eventually figure it out themselves.

Bug manifests differently on Mac OS X

Cool stuff. I wrote a program to reproduce the bug and tested it on FreeBSD and Mac OS X. Got some funny results on Mac OS X which I can't explain. I wrote it up here. Take a look, I'd be interested to hear what you think.
--
Ben Rosengart

The code is first in 4.1c

If anyone is interested in the original code, I've stuck it here (I have McKusick's CSRG archive CDs):

http://plover.net/~agarvin/seekdir.c.txt

Good Detective Work

I'm impressed!

Marc's report didn't mention it, but I imagine the bug's effect is cumulative. An in-service utility handling a case of 250,000 files hit with a number of deletions must have created a real mess as iterations progressed.

Marc seems nonplussed that the fix was 'surprisingly simple', but it's often like that– a flaw so tiny that it gets overlooked, often only a bit or two.