[sup-talk] Possible problem with maildir ID generation

Archive of RubyForge sup-talk mailing list

From: wmorgan-sup@masanjin.net (William Morgan)
Subject: [sup-talk] Possible problem with maildir ID generation
Date: Mon, 04 May 2009 09:10:23 -0700
Message-ID: <1241451919-sup-8970@entry>
In-Reply-To: <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>

Reformatted excerpts from Marc Hartstein's message of 2009-04-29:
> On Wed, Apr 29, 2009 at 03:31:52PM -0700, William Morgan wrote:
> > (All this rigamarole about ordinals and blah blah blah is necessary
> > because I don't want Sup to rescan the entire Maildir unless absolutely
> > necessary. One day I'll convert my mbox to a Maildir with 250k files in
> > it, and a rescan will kill me, especially at Ruby speed.)
> 
> How are we defining 'rescan' here?  I can think of a couple of possible
> meanings, and I'm not sure where you are:
> 
> 1. Open every file in the maildir and read from it, every poll.
> 2. Visit every filename in the maildir to consider whether it is a new
> file, every poll.
> 3. Something else I'm not thinking of this instant.

You're probably confused because what I said was wrong. Here's what I
should have said:

Rescanning the entire directory to check for new mail is unavoidable in
Maildir. What I *do* have some control over, i.e. can try to optimize,
is the following operation: given a file in the directory, does it
represent a new message? One way to do this would be to maintain a set
of all filenames representing read messages, and to check for existence
within the set, for every file. Sup doesn't do this; instead it defines
an ordering on filenames, such that newer filenames are "larger" than
older filenames under that ordering, and maintains a threshold
representing the dividing point between new and old. The idea is that
performing that comparison is quick, and that storing a single threshold
in place of the set of read messages is also cheap.
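
As a rough illustration of the two approaches described above, here is a
minimal sketch. It assumes filenames begin with an integer delivery time
in seconds (the usual Maildir convention) and that this prefix alone
defines the ordering; Sup's actual ordering may differ. The helper names
delivery_time, new_by_set and new_by_threshold are hypothetical.

```python
def delivery_time(filename: str) -> int:
    """Return the integer seconds prefix of a Maildir-style filename.

    Assumption: names look like "<seconds>.<unique>.<host>[:2,<flags>]".
    """
    return int(filename.split(".", 1)[0])


def new_by_set(filenames, seen):
    """Set-membership approach: state grows with the number of read messages."""
    return [f for f in filenames if f not in seen]


def new_by_threshold(filenames, threshold):
    """Threshold approach: state is one integer, the test is one comparison."""
    return [f for f in filenames if delivery_time(f) > threshold]


# Made-up directory listing, for illustration only.
listing = [
    "1241451900.M1P10.host:2,S",
    "1241451910.M2P11.host:2,S",
    "1241451919.M3P12.host",
]
seen = {listing[0], listing[1]}
threshold = 1241451910

print(new_by_set(listing, seen))
print(new_by_threshold(listing, threshold))
```

Both calls still visit every filename, which matches the point that the
directory scan itself is unavoidable; only the per-file test and the
stored state differ.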

(And again, I care about the case where you have 500k messages, and so
doing set existence is painful. Although perhaps Bloom filters are the
solution... that would be interesting!)
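
For reference, a Bloom filter over filenames could look roughly like the
sketch below. This is not something Sup does; the class, its parameters
and the double-hashing scheme are all assumptions chosen for brevity.
Note the trade-off: a Bloom filter never forgets an added name, but it
can report a never-seen name as present, which here would mean a new
message being skipped.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions from one SHA-256 digest
        # using double hashing: h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


# Hypothetical sizing: 8192 bits (1 KiB of state) and 5 hash positions.
seen = BloomFilter(num_bits=8192, num_hashes=5)
seen.add("1241451900.M1P10.host")
seen.add("1241451910.M2P11.host")

print("1241451900.M1P10.host" in seen)  # always True for an added name
```

The memory used is fixed by num_bits regardless of how many filenames
are added; the false-positive rate rises as the filter fills up.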

> So the only way you're going to get a timestamp collision (on the
> filename timestamp, perhaps not on the actual ctime) is if you have
> multiple processes delivering mail simultaneously, or if you're
> synchronizing mail delivered on multiple hosts.  In the case of the
> latter, a timestamp-based heuristic for finding new messages isn't
> going to work.

I think the collisions we are seeing are due to timestamp granularity.
A little research suggests that e.g. ext2fs ctimes are 1-second
granular. It seems quite easy to get more than one email per second.

> Would it suffice to keep track of the filename of the most recent
> message added for each maildir source, and check everything with a
> time portion of the filename equal to or greater than that message for
> whether it needs to be added?  [although see above about whether that
> gains anything]

Yes, I think that's the solution. Basically switch from > to >= when
comparing timestamps, and realize that you're going to get some dupes.
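
A minimal sketch of the >= comparison, with one possible way to filter
the resulting dupes: remember the names already seen at exactly the
threshold second. This is an assumption for illustration, not Sup's
implementation; the helpers delivery_time, unique_name and poll are
hypothetical, and filenames are assumed to start with an integer
seconds prefix.

```python
def delivery_time(filename: str) -> int:
    """Integer seconds prefix of a Maildir-style filename (assumed format)."""
    return int(filename.split(".", 1)[0])


def unique_name(filename: str) -> str:
    """Drop the ":2,<flags>" suffix, which changes when flags change."""
    return filename.split(":", 1)[0]


def poll(filenames, threshold, seen_at_threshold):
    """Return (fresh, new_threshold, new_seen_at_threshold).

    Uses >= so that files sharing the threshold second are re-examined,
    then drops the ones already handled during an earlier poll.
    """
    candidates = [f for f in filenames if delivery_time(f) >= threshold]
    fresh = [f for f in candidates if unique_name(f) not in seen_at_threshold]
    if candidates:
        threshold = max(delivery_time(f) for f in candidates)
        seen_at_threshold = {
            unique_name(f) for f in candidates if delivery_time(f) == threshold
        }
    return fresh, threshold, seen_at_threshold


# Two messages arrive within the same second, then two more later.
first = ["100.M1P1.host", "100.M2P1.host"]
fresh, threshold, seen = poll(first, 0, set())

second = first + ["100.M3P1.host", "101.M4P1.host"]
fresh2, threshold2, seen2 = poll(second, threshold, seen)
print(fresh2)
```

With a strict > comparison the second poll would miss "100.M3P1.host",
since it shares a timestamp with the stored threshold. The extra state
is bounded by the number of files delivered within a single second.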

> Having looked into this more closely, I'm starting to seriously
> reconsider whether maildir is really what I want to be using for storing
> my mail.  There are some weird timing issues.

Well, mbox has the advantage of being nice and linear so this problem
goes away, but is horribly broken in other ways (cf. the recent thread
about the "From problem"). IMAP is a ridiculous PITA to deal with (at
least as a client), so you're kinda screwed in every direction here. If
anything, Maildir is less broken than the other alternatives.

Plus Maildir works really well with git. :)
-- 
William <wmorgan-sup at masanjin.net>

Thread overview: 8+ messages
2009-04-16 22:05 Mark Alexander
2009-04-21 14:00 ` William Morgan
2009-04-21 15:33   ` Mark Alexander
     [not found]   ` <20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>
2009-04-28 23:29     ` Mark Alexander
2009-04-29 22:31       ` William Morgan
2009-04-29 22:39         ` William Morgan
     [not found]         ` <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>
2009-05-04 16:10           ` William Morgan [this message]
     [not found]             ` <20090504165224.GA15815@cabinet.hsd1.ma.comcast.net>
2009-05-04 17:24               ` Mark Alexander
