[sup-talk] Possible problem with maildir ID generation

Archive of RubyForge sup-talk mailing list

* [sup-talk] Possible problem with maildir ID generation
@ 2009-04-16 22:05 Mark Alexander
  2009-04-21 14:00 ` William Morgan
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Alexander @ 2009-04-16 22:05 UTC


I've been studying maildir.rb (and adding some debug code) while
trying to figure out my lost message problem.  I think there may be a
problem with the way the internal message IDs are generated.  The
make_id method glues together the file timestamp and size.  But I
think this could lead to an out-of-order problem in the @ids array.

Consider two messages that arrive in the same second, but the second
message is smaller than the first.  Because the message size makes up
the low seven (decimal) digits of the ID, the second message, even
though it arrived later, will have an ID that is less than the first
message.
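
A minimal sketch of the ordering described above.  This is not the
actual maildir.rb code: the make_id signature used here (taking a
timestamp and a size directly, rather than a file name) and the sample
values are assumptions made for illustration.

```ruby
# Hypothetical stand-in for maildir.rb's make_id: the timestamp (seconds)
# supplies the high digits, the size the low seven decimal digits.
def make_id(mtime, size)
  format("%d%07d", mtime, size % 10_000_000).to_i
end

first  = make_id(1_239_919_500, 48_213)  # arrives first, larger
second = make_id(1_239_919_500, 2_107)   # same second, arrives later, smaller

puts first           # 12399195000048213
puts second          # 12399195000002107
puts second < first  # true: the later message gets the smaller ID
```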

Then suppose that sup polls the maildir directory after the first
message arrives, but before the second message arrives, and sets the
cur_offset to the ID of the first message.  Then, the next time it
polls, it will see the second message, but because its ID is less than
that of the first message, it will appear before the first in the @ids
array after it is sorted.  So then the each method will skip the
second message, because cur_offset (the ID of the first message) will
be found in @ids after it.
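
The polling scenario can be simulated in a few lines.  This is only a
model of the behavior described above, not the real each method: the
helper unseen_after and the two sample IDs are hypothetical.

```ruby
# Hypothetical model of the polling loop: ids is the sorted @ids array,
# cur_offset is the ID of the last message already processed.
def unseen_after(ids, cur_offset)
  start = ids.index(cur_offset)
  start ? ids[(start + 1)..-1] : ids
end

first_id  = 12_399_195_000_048_213  # larger message, arrived first
second_id = 12_399_195_000_002_107  # smaller message, same second, later

# Poll 1: only the first message is present.
ids = [first_id].sort
cur_offset = ids.last

# Poll 2: the second message has arrived, but sorts before cur_offset.
ids = [first_id, second_id].sort
missed = unseen_after(ids, cur_offset)

p missed  # [] (the second message is never yielded)
```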

Does this scenario make sense?  I have seen what appears to be one
instance of this happening, though I'm still watching closely and
adding more debugging code to make sure that it explains all of the
lost messages.



* [sup-talk] Possible problem with maildir ID generation
  2009-04-16 22:05 [sup-talk] Possible problem with maildir ID generation Mark Alexander
@ 2009-04-21 14:00 ` William Morgan
  2009-04-21 15:33   ` Mark Alexander
       [not found]   ` <20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>
  0 siblings, 2 replies; 8+ messages in thread
From: William Morgan @ 2009-04-21 14:00 UTC


Reformatted excerpts from Mark Alexander's message of 2009-04-16:
> Consider two messages that arrive in the same second, but the second
> message is smaller than the first.  Because the message size makes up
> the low seven (decimal) digits of the ID, the second message, even
> though it arrived later, will have an ID that is less than the first
> message.

I think you could be right. Using the size as part of the ID was
supposed to differentiate messages with the same timestamp, but it would
result in exactly the behavior you describe when polling.

I think there's a much simpler scheme we can use that will also fix
this. I'll post a patch soon and we can see if it addresses the
problem.
-- 
William <wmorgan-sup at masanjin.net>



* [sup-talk] Possible problem with maildir ID generation
  2009-04-21 14:00 ` William Morgan
@ 2009-04-21 15:33   ` Mark Alexander
       [not found]   ` <20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>
  1 sibling, 0 replies; 8+ messages in thread
From: Mark Alexander @ 2009-04-21 15:33 UTC

On Tue, Apr 21, 2009 at 7:00 AM, William Morgan
<wmorgan-sup at masanjin.net> wrote:
> I think you could be right. Using the size as part of the ID was
> supposed to differentiate messages with the same timestamp, but it would
> result in exactly the behavior you describe when polling.
>
> I think there's a much simpler scheme we can use that will also fix
> this. I'll post a patch soon and we can see if it addresses the
> problem.

I'd be very interested in this patch.

In the meantime, I made some minor changes to maildir.rb, without
changing the ID scheme.  One problem was that every time a maildir was
polled, the most recent message (i.e., the one at cur_offset) would be
treated as a new message again.  I also changed last_offset to return
an ID that would be one second later than the last message seen.

These changes seem to have mostly fixed the lost message problem I was
having, though I'm not exactly sure why.  I've only had one lost
message over the last couple of days, instead of the expected 10 or
20.  I can't explain this one lost message, but I think it must be due
to a different problem, unrelated to maildir handling.  I was able to
get sup to see this message again by doing a 'touch' on both the
message itself and the containing maildir.

I doubt that my changes would fix the race condition I described
earlier, but I've avoided this problem by not running fetchmail
in the background while sup is running.

I'll send out my patch in a separate email.


[parent not found: <20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>]

* [sup-talk] Possible problem with maildir ID generation
       [not found]   ` <20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>
@ 2009-04-28 23:29     ` Mark Alexander
  2009-04-29 22:31       ` William Morgan
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Alexander @ 2009-04-28 23:29 UTC

On Tue, Apr 28, 2009 at 3:18 PM, Marc Hartstein
<marc.hartstein at alum.vassar.edu> wrote:
> Isn't part of the maildir scheme that the filenames are guaranteed to be
> unique?  It's been a while since I looked at this part of the sup
> source, but would it be possible to simply use the filename as the ID
> when working with maildir, rather than generating a new ID?  Or is there
> an additional constraint (like ordering?) that needs to be satisfied and
> isn't by maildir?

Maildir filenames are unique, but they would need to be ordered
by time, since sup depends on that ordering (look in maildir.rb for
where it uses sort).  I'm not sure if mail delivery programs
(I use procmail) guarantee that the filenames are ordered that way.

I will say that the patch I sent out for maildir.rb has made
my life a lot happier, but it's still not ideal because of
the race condition I mentioned.

William was talking about using some other scheme to generate IDs.
We should see what he has to say about this.


* [sup-talk] Possible problem with maildir ID generation
  2009-04-28 23:29     ` Mark Alexander
@ 2009-04-29 22:31       ` William Morgan
  2009-04-29 22:39         ` William Morgan
       [not found]         ` <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>
  0 siblings, 2 replies; 8+ messages in thread
From: William Morgan @ 2009-04-29 22:31 UTC


Reformatted excerpts from Mark Alexander's message of 2009-04-28:
> Maildir filenames are unique, but they would need to be ordered by
> time, since sup depends on that ordering (look in maildir.rb for where
> it uses sort).  I'm not sure if mail delivery programs (I use
> procmail) guarantee that the filenames are ordered that way.

That's correct; the name is not sufficient as an id because Sup needs a
single pointer into the Maildir as a marker for what it has already
processed, so we have to use something ordinal.  But it can't just be
any old ordinal, it has to be something that corresponds with the way
messages are written to the Maildir, in order to be able to divide newer
messages from older ones.

A timestamp is the obvious choice, but messages can have the same
timestamp, so then what do you do? The current approach is to sort by
another arbitrary field (in this case, message size), which gives a
unique ordering, but doesn't match up

(All this rigamarole about ordinals and blah blah blah is necessary
because I don't want Sup to rescan the entire Maildir unless absolutely
necessary. One day I'll convert my mbox to a Maildir with 250k files in
it, and a rescan will kill me, especially at Ruby speed.)

> I will say that the patch I sent out for maildir.rb has made my life a
> lot happier, but it's still not ideal because of the race condition I
> mentioned.
> 
> William was talking about using some other scheme to generate IDs.  We
> should see what he has to say about this.

Well I haven't quite started on it yet, but my plan is to:

a) Sort files by timestamp, and then by something else (maybe name), and
   use the position in that array instead of the timestamp. This doesn't
   solve anything, but it will make the ids prettier, and removes the
   hideous "%7d" thing.
b) When polling, if the current "offset" is N, return all messages that
   have a timestamp >= the Nth message. So this will mean that we'll
   rescan messages on occasion, but we shouldn't miss any.
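
A rough sketch of how (a) and (b) might fit together.  None of this is
Sup code: the Msg struct and the ordered and to_scan helpers are
hypothetical names, and timestamps are assumed to be whole seconds.

```ruby
# Hypothetical message record: delivery timestamp (seconds) and file name.
Msg = Struct.new(:mtime, :name)

# (a) Order by timestamp, then by name; a message's id is its position.
def ordered(msgs)
  msgs.sort_by { |m| [m.mtime, m.name] }
end

# (b) Given the offset N of the last message processed, return every
# message whose timestamp is >= that of the Nth message.  Messages that
# share that second come back again, so callers must tolerate duplicates.
def to_scan(msgs, offset)
  list = ordered(msgs)
  return list if offset.nil? || list[offset].nil?
  threshold = list[offset].mtime
  list.select { |m| m.mtime >= threshold }
end

seen   = [Msg.new(100, "b"), Msg.new(101, "m")]
offset = ordered(seen).size - 1  # offset 1, the message at t=101

# A new message lands in the same second but sorts earlier by name.
now = seen + [Msg.new(101, "a")]
p to_scan(now, offset).map(&:name)  # ["a", "m"]
```

Here "m" is rescanned (a duplicate), but "a" is not missed even though
it sorts ahead of the old offset.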

Any obvious flaws?
-- 
William <wmorgan-sup at masanjin.net>



* [sup-talk] Possible problem with maildir ID generation
  2009-04-29 22:31       ` William Morgan
@ 2009-04-29 22:39         ` William Morgan
       [not found]         ` <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>
  1 sibling, 0 replies; 8+ messages in thread
From: William Morgan @ 2009-04-29 22:39 UTC


Reformatted excerpts from William Morgan's message of 2009-04-29:
> A timestamp is the obvious choice, but messages can have the same
> timestamp, so then what do you do? The current approach is to sort by
> another arbitrary field (in this case, message size), which gives a
> unique ordering, but doesn't match up

Sigh.

Doesn't match up with what procmail, or whatever MDA, is giving us.
-- 
William <wmorgan-sup at masanjin.net>



[parent not found: <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>]

* [sup-talk] Possible problem with maildir ID generation
       [not found]         ` <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>
@ 2009-05-04 16:10           ` William Morgan
       [not found]             ` <20090504165224.GA15815@cabinet.hsd1.ma.comcast.net>
  0 siblings, 1 reply; 8+ messages in thread
From: William Morgan @ 2009-05-04 16:10 UTC

Reformatted excerpts from Marc Hartstein's message of 2009-04-29:
> On Wed, Apr 29, 2009 at 03:31:52PM -0700, William Morgan wrote:
> > (All this rigamarole about ordinals and blah blah blah is necessary
> > because I don't want Sup to rescan the entire Maildir unless absolutely
> > necessary. One day I'll convert my mbox to a Maildir with 250k files in
> > it, and a rescan will kill me, especially at Ruby speed.)
> 
> How are we defining 'rescan' here?  I can think of a couple of possible
> meanings, and I'm not sure where you are:
> 
> 1. Open every file in the maildir and read from it, every poll.
> 2. Visit every filename in the maildir to consider whether it is a new
> file, every poll.
> 3. Something else I'm not thinking of this instant.

You're probably confused because what I said was wrong. Here's what I
should have said:

Rescanning the entire directory to check for new mail is unavoidable in
Maildir. What I *do* have some control over, i.e. can try to optimize,
is the following operation: given a file in the directory, does it
represent a new message? One way to do this would be to maintain a set
of all filenames representing read messages, and to check for existence
within the set, for every file. Sup doesn't do this; instead it defines
an ordering on filenames, such that newer filenames are "larger" than
older filenames under that ordering, and maintains a threshold
representing the dividing point between new and old. The idea is that
performing that comparison is quick, and that storing the set of read
messages (as a single threshold) is also quick.
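
The two approaches can be contrasted in a small sketch.  The integer
ids below are hypothetical and are simply assumed to increase in
delivery order, which is exactly the property under discussion in this
thread.

```ruby
require "set"

# Hypothetical ids, assumed to increase in delivery order.
processed = [101, 102, 103]

# Option 1: a set of everything already read; state grows with the mailbox.
seen = Set.new(processed)
is_new_by_set = ->(id) { !seen.include?(id) }

# Option 2: one threshold under the ordering; state is a single integer.
threshold = processed.max
is_new_by_threshold = ->(id) { id > threshold }

[103, 104].each do |id|
  puts "#{id}: set=#{is_new_by_set.call(id)} threshold=#{is_new_by_threshold.call(id)}"
end
# 103: set=false threshold=false
# 104: set=true threshold=true
```

The threshold test only agrees with the set test as long as no new
message ever receives an id at or below the threshold.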

(And again, I care about the case where you have 500k messages, and so
doing set existence is painful. Although perhaps Bloom filters are the
solution... that would be interesting!)

> So the only way you're going to get a timestamp collision (on the
> filename timestamp, perhaps not on the actual ctime) is if you have
> multiple processes delivering mail simultaneously, or if you're
> synchronizing mail delivered on multiple hosts.  In the case of the
> latter, a timestamp-based heuristic for finding new messages isn't
> going to work.

I think the collisions we are seeing are due to timestamp granularity.
A little research suggests that e.g. ext2fs ctimes are 1-second
granular. It seems quite easy to get more than one email per second.

> Would it suffice to keep track of the filename of the most recent
> message added for each maildir source, and check everything with a
> time portion of the filename equal to or greater than that message for
> whether it needs to be added?  [although see above about whether that
> gains anything]

Yes, I think that's the solution. Basically switch from > to >= when
comparing timestamps, and realize that you're going to get some dupes.
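
As a sketch of that comparison, assuming file names that begin with the
delivery time in seconds followed by a dot (the sample names, and the
helpers delivery_time and candidates, are made up for illustration):

```ruby
# Time portion of a maildir file name: the digits before the first dot.
def delivery_time(filename)
  File.basename(filename)[/\A\d+/].to_i
end

# Everything delivered in the same second as last_seen, or later.
def candidates(filenames, last_seen)
  cutoff = delivery_time(last_seen)
  filenames.select { |f| delivery_time(f) >= cutoff }
end

files = [
  "1241452000.M1P10.host",
  "1241452001.M2P11.host",  # last message seen by the previous poll
  "1241452001.M3P12.host",  # delivered later in the same second
  "1241452002.M4P13.host",
]
p candidates(files, "1241452001.M2P11.host")
```

With >= the result includes the already-seen M2P11 file (a dupe to be
filtered out later) along with M3P12 and M4P13; with > the M3P12 file
would be skipped.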

> Having looked into this more closely, I'm starting to seriously
> reconsider whether maildir is really what I want to be using for storing
> my mail.  There are some weird timing issues.

Well, mbox has the advantage of being nice and linear so this problem
goes away, but is horribly broken in other ways (cf. the recent thread
about the "From problem"). IMAP is a ridiculous PITA to deal with (at
least as a client), so you're kinda screwed in every direction here. If
anything, Maildir is less broken than the other alternatives.

Plus Maildir works really well with git. :)
-- 
William <wmorgan-sup at masanjin.net>


[parent not found: <20090504165224.GA15815@cabinet.hsd1.ma.comcast.net>]

* [sup-talk] Possible problem with maildir ID generation
       [not found]             ` <20090504165224.GA15815@cabinet.hsd1.ma.comcast.net>
@ 2009-05-04 17:24               ` Mark Alexander
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Alexander @ 2009-05-04 17:24 UTC


On Mon, May 4, 2009 at 12:52 PM, Marc Hartstein
<marc.hartstein at alum.vassar.edu> wrote:
> On Mon, May 04, 2009 at 09:10:23AM -0700, William Morgan wrote:
>> I think the collisions we are seeing are due to timestamp granularity.
>> A little research suggests that e.g. ext2fs ctimes are 1-second
>> granular. It seems quite easy to get more than one email per second.
>
> It does, but as I mentioned somewhere in that email, the reference
> implementation of maildir (qmail) says that it will only deliver one
> message per second.  However, I definitely concede that relying on this
> behavior would be setting things up for something to go wrong down the
> line.

Just a data point: in my setup (postfix+procmail) I'm seeing multiple messages
being delivered to the same maildir in the same second.  That was what
motivated my patch.


