Copyright © 2008, 2009, 2010 Marc Balmer. All rights reserved.
"This is needed because the existing directory handling in FreeBSD and OpenBSD (and possibly NetBSD) doesn't correctly handle unlink() on files in a directory where telldir() has been used. On a block boundary it will occasionally miss a file when seekdir() is used to return to a position previously recorded with telldir(). This also fixes a severe performance and memory usage problem with telldir() on BSD systems. Each call to telldir() in BSD adds an entry to a linked list, and those entries are cleaned up on closedir(). This means with a large directory closedir() can take an arbitrary amount of time, causing network timeouts as millions of telldir() entries are freed"Apparently there are two problems: seekdir() not returning to the position initially retrieved using telldir() and a performance problem. Before digging to much into source code, I decided to check the documentation of said functions, just to make sure, they were being used like they are meant to be used. The OpenBSD manual page is clear: The seekdir() function sets the position of the next readdir() operation on the named directory stream dirp. The new position reverts to the one associated with the directory stream when the telldir() operation was performed. Values returned by telldir() are good only for the lifetime of the DIR pointer, dirp, from which they are derived. If the directory is closed and then reopened, the telldir() value may be invalidated due to undetected directory compaction. If you don't close the directory stream, seekdir() will take you back to the position previsouly obtained using telldir(). If the system behaves differently, then it's either a bug or the documentation is wrong. If the system indeed does not take you back to the right position, why do we have these functions then in the first place? A look at the closedir() function in OpenBSD reveals that the statement made by the Samba people about performance in closedir() is wrong: In OpenBSD there is no linked list, but a smarter memory handling scheme implemented by Otto Moerbeek (otto@). FreeBSD, however, uses a linked list indeed, but they can switch to our code at any time. So the performance problem really is a non-issue.
if (dp->i_count == 0) {
/*
* First entry in block: set d_ino to zero.
*/
ep->d_ino = 0;
} else {
/*
* Collapse new free space into previous entry.
*/
ep->d_reclen += dp->i_reclen;
}
If it is the first entry in a block, the inode number is set to zero, thus invalidating the entry. For all other entries, the record length of the to-be-deleted record is added to the record length of the previous entry.
Since the library uses the record length field of an entry to proceed to the next entry in readdir(), this effectively means that the removed entry, while it is still in the block on disc, will no longer be used.
/*
* get next entry in a directory.
*/
int
_readdir_unlocked(DIR *dirp, struct dirent **result)
{
struct dirent *dp;
*result = NULL;
for (;;) {
if (dirp->dd_loc >= dirp->dd_size)
dirp->dd_loc = 0;
if (dirp->dd_loc == 0) {
dirp->dd_size = getdirentries(dirp->dd_fd,
dirp->dd_buf, dirp->dd_len, &dirp->dd_seek);
if (dirp->dd_size == 0)
return (0);
if (dirp->dd_size < 0)
return (-1);
}
dp = (struct dirent *)(dirp->dd_buf + dirp->dd_loc);
if ((long)dp & 03) /* bogus pointer check */
return (-1);
if (dp->d_reclen <= 0 ||
dp->d_reclen > dirp->dd_len + 1 - dirp->dd_loc)
return (-1);
dirp->dd_loc += dp->d_reclen;
if (dp->d_ino == 0)
continue;
*result = dp;
return (0);
}
}
At first sight, this code looks correct. It will skip deleted entries that have their inode number set to zero and it will use the record lenght otherwise. And since a directory traversal using readdir() works just fine, this is code effectively works.
A close look at the seekdir() library implementation finally reveals the problem:
/*
* seek to an entry in a directory.
* Only values returned by "telldir" should be passed to seekdir.
*/
void
__seekdir(DIR *dirp, long loc)
{
struct ddloc *lp;
struct dirent *dp;
if (loc < 0 || loc >= dirp->dd_td->td_loccnt)
return;
lp = &dirp->dd_td->td_locs[loc];
dirp->dd_td->td_last = loc;
if (lp->loc_loc == dirp->dd_loc && lp->loc_seek == dirp->dd_seek)
return;
(void) lseek(dirp->dd_fd, (off_t)lp->loc_seek, SEEK_SET);
dirp->dd_seek = lp->loc_seek;
dirp->dd_loc = 0;
while (dirp->dd_loc < lp->loc_loc) {
_readdir_unlocked(dirp, &dp);
if (dp == NULL)
break;
}
}
This code will not work as expected when seeking to the second entry of a block where the first has been deleted: seekdir() calls readdir() which happily skips the first entry (it has inode set to zero), and advance to the second entry. When the user now calls readdir() to read the directory entry to which he just seekdir()ed, he does not get the second entry but the third.
Much to my surprise I not only found this problem in all other BSDs or BSD derived systems like Mac OS X, but also in very old BSD versions. I first checked 4.4BSD Lite 2, and Otto confirmed it is also in 4.2BSD. The bug has been around for roughly 25 years or more.
Comments
Hard to find, simple to fix
"and sorry that it took us almost twenty-five years to fix it"
regarding the surprisingly simple solution, the time it takes to find a bug is usually in inverse proportional to the time it takes to fix it ;-)
Other BSDs
Interesting! Have you reported this bug to the other BSDs and Apple? Hopefully they'll want to fix it too.
Other BSDs fixed the bug
Yes, of course I reported this to FreeBSD, NetBSD, and DragonflyBSD and they all fixed it. In OpenBSD, it has also been committed to the patch branches. I don't know anyone in Apple, but they will eventually figure it out themselves.
Bug manifests differently on Mac OS X
Cool stuff. I wrote a program to reproduce the bug and tested it on FreeBSD and Mac OS X. Got some funny results on Mac OS X which I can't explain. I wrote it up here. Take a look, I'd be interested to hear what you think.
--
Ben Rosengart
The code is first in 4.1c
If anyone is interested in the original code, I've stuck it here (I have McKusick's CSRG archive CDs):
http://plover.net/~agarvin/seekdir.c.txt
Good Detective Work
I'm impressed!
Marc's report didn't mention it, but I imagine the bug's effect is cumulative. An in-service utility handling a case of 250,000 files hit with a number of deletions must have created a real mess as iterations progressed.
Marc seems nonplussed that the fix was 'surprisingly simple', but it's often like that– a flaw so tiny that it gets overlooked, often only a bit or two.