From mboxrd@z Thu Jan  1 00:00:00 1970
From: wmorgan-sup@masanjin.net (William Morgan)
Date: Mon, 04 May 2009 09:10:23 -0700
Subject: [sup-talk] Possible problem with maildir ID generation
In-Reply-To: <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>
References: <a412e2a70904161505l3e1fdfc9l987013130b3f5ddd@mail.gmail.com>
	<1240320547-sup-6957@entry>
	<20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>
	<a412e2a70904281629n6171bbb6i4c6330e375f8c6ad@mail.gmail.com>
	<1241038730-sup-2939@entry>
	<20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>
Message-ID: <1241451919-sup-8970@entry>

Reformatted excerpts from Marc Hartstein's message of 2009-04-29:
> On Wed, Apr 29, 2009 at 03:31:52PM -0700, William Morgan wrote:
> > (All this rigamarole about ordinals and blah blah blah is necessary
> > because I don't want Sup to rescan the entire Maildir unless absolutely
> > necessary. One day I'll convert my mbox to a Maildir with 250k files in
> > it, and a rescan will kill me, especially at Ruby speed.)
> 
> How are we defining 'rescan' here?  I can think of a couple of possible
> meanings, and I'm not sure where you are:
> 
> 1. Open every file in the maildir and read from it, every poll.
> 2. Visit every filename in the maildir to consider whether it is a new
> file, every poll.
> 3. Something else I'm not thinking of this instant.

You're probably confused because what I said was wrong. Here's what I
should have said:

Rescanning the entire directory to check for new mail is unavoidable in
Maildir. What I *do* have some control over, i.e. can try to optimize,
is the following operation: given a file in the directory, does it
represent a new message? One way to do this would be to maintain a set
of all filenames representing read messages, and to check for existence
within the set, for every file. Sup doesn't do this; instead it defines
an ordering on filenames, such that newer filenames are "larger" than
older filenames under that ordering, and maintains a threshold
representing the dividing point between new and old. The idea is that
performing that comparison is quick, and that storing the set of read
messages is also cheap, because it reduces to that single threshold.
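
To make the two approaches concrete, here is a rough sketch. It is not
Sup's actual code: maildir_time, new_by_set and new_by_threshold are
hypothetical names, and it assumes the ordering is simply the leading
timestamp in the Maildir filename.

```ruby
require 'set'

# Hypothetical helper: pull the leading delivery timestamp out of a
# Maildir filename such as "1241451919.M2P2.host:2,S".
def maildir_time(filename)
  File.basename(filename)[/\A\d+/].to_i
end

# Approach 1: remember every filename already seen. Storage grows with
# the size of the mailbox. (Keyed on the full filename for simplicity.)
def new_by_set(filenames, seen)
  filenames.reject { |f| seen.include?(f) }
end

# Approach 2: remember one threshold and compare each filename to it.
def new_by_threshold(filenames, threshold)
  filenames.select { |f| maildir_time(f) > threshold }
end

files = [
  "1241451000.M1P1.host:2,S",
  "1241451919.M2P2.host:2,",
  "1241452000.M3P3.host:2,",
]
seen = Set.new(["1241451000.M1P1.host:2,S"])

p new_by_set(files, seen)
p new_by_threshold(files, 1241451000)
```

Both calls pick out the same two files here; the difference is only in
what has to be stored between polls.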

(And again, I care about the case where you have 500k messages, and so
doing set existence is painful. Although perhaps Bloom filters are the
solution... that would be interesting!)
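
For the curious, a Bloom filter along those lines might look roughly
like this. It is only a sketch, not anything in Sup: TinyBloom and its
default sizes are made up for illustration, and the bit positions come
from double hashing over a single MD5 digest.

```ruby
require 'digest'

# Minimal Bloom filter sketch (hypothetical; not part of Sup).
class TinyBloom
  def initialize(bits = 1 << 20, hashes = 4)
    @bits = bits
    @hashes = hashes
    @field = "\0" * (bits / 8)   # bit array packed into a byte string
  end

  def add(key)
    positions(key).each do |pos|
      @field.setbyte(pos / 8, @field.getbyte(pos / 8) | (1 << (pos % 8)))
    end
  end

  # false means "definitely never added"; true means "probably added".
  def include?(key)
    positions(key).all? do |pos|
      (@field.getbyte(pos / 8) & (1 << (pos % 8))) != 0
    end
  end

  private

  # Derive @hashes bit positions from one digest via double hashing.
  def positions(key)
    h1, h2 = Digest::MD5.digest(key).unpack("Q<Q<")
    (0...@hashes).map { |i| (h1 + i * h2) % @bits }
  end
end

seen = TinyBloom.new
seen.add("1241451000.M1P1.host")
p seen.include?("1241451000.M1P1.host")
```

The catch is the false positives: a filename the filter claims to have
seen might really be new, so a miss is trustworthy but a hit is not.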

> So the only way you're going to get a timestamp collision (on the
> filename timestamp, perhaps not on the actual ctime) is if you have
> multiple processes delivering mail simultaneously, or if you're
> synchronizing mail delivered on multiple hosts.  In the case of the
> latter, a timestamp-based heuristic for finding new messages isn't
> going to work.

I think the collisions we are seeing are due to timestamp granularity.
A little research suggests that e.g. ext2fs ctimes are 1-second
granular. It seems quite easy to get more than one email per second.

> Would it suffice to keep track of the filename of the most recent
> message added for each maildir source, and check everything with a
> time portion of the filename equal to or greater than that message for
> whether it needs to be added?  [although see above about whether that
> gains anything]

Yes, I think that's the solution. Basically switch from > to >= when
comparing timestamps, and realize that you're going to get some dupes.
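
A rough sketch of that change, under some assumptions: poll is a
hypothetical function (not Sup's), each entry is a [filename, timestamp]
pair, and the dupes are filtered by also remembering which filenames
were already added at exactly the threshold timestamp.

```ruby
require 'set'

# Hypothetical poll step. `threshold` is the newest timestamp seen so
# far; `seen_at_threshold` holds the filenames already added with
# exactly that timestamp.
# Returns [new filenames, new threshold, new seen_at_threshold].
def poll(entries, threshold, seen_at_threshold)
  # >= rather than >, so same-second arrivals are not skipped.
  candidates = entries.select { |_, time| time >= threshold }

  # Drop the dupes that >= lets back in.
  fresh = candidates.reject do |name, time|
    time == threshold && seen_at_threshold.include?(name)
  end

  newest = candidates.map { |_, time| time }.max || threshold
  seen = Set.new(candidates.select { |_, time| time == newest }
                           .map { |name, _| name })
  seen.merge(seen_at_threshold) if newest == threshold

  [fresh.map { |name, _| name }, newest, seen]
end

first = [["a", 100], ["b", 100]]
fresh, threshold, seen = poll(first, 0, Set.new)

# "c" arrives in the same second as "a" and "b"; a plain > would miss it.
later = first + [["c", 100], ["d", 101]]
fresh, threshold, seen = poll(later, threshold, seen)
p fresh
```

The extra state stays small: only the names sharing the newest
timestamp need to be kept, not the whole mailbox.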

> Having looked into this more closely, I'm starting to seriously
> reconsider whether maildir is really what I want to be using for storing
> my mail.  There are some weird timing issues.

Well, mbox has the advantage of being nice and linear so this problem
goes away, but is horribly broken in other ways (cf. the recent thread
about the "From problem"). IMAP is a ridiculous PITA to deal with (at
least as a client), so you're kinda screwed in every direction here. If
anything, Maildir is less broken than the alternatives.

Plus Maildir works really well with git. :)
-- 
William <wmorgan-sup at masanjin.net>