* [sup-talk] Possible problem with maildir ID generation
From: Mark Alexander @ 2009-04-16 22:05 UTC

I've been studying maildir.rb (and adding some debug code) while trying to figure out my lost message problem. I think there may be a problem with the way the internal message IDs are generated.

The make_id method glues together the file timestamp and size. But I think this could lead to an out-of-order problem in the @ids array. Consider two messages that arrive in the same second, but the second message is smaller than the first. Because the message size makes up the low seven (decimal) digits of the ID, the second message, even though it arrived later, will have an ID that is less than the first message.

Then suppose that sup polls the maildir directory after the first message arrives, but before the second message arrives, and sets the cur_offset to the ID of the first message. Then, the next time it polls, it will see the second message, but because its ID is less than that of the first message, it will appear before the first in the @ids array after it is sorted. So then the each method will skip the second message, because cur_offset (the ID of the first message) will be found in @ids after it.

Does this scenario make sense? I have seen what appears to be one instance of this happening, though I'm still watching closely and adding more debugging code to make sure that it explains all of the lost messages.

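The ordering problem described above can be reproduced in a few lines of Ruby. This is a minimal sketch, not the code from maildir.rb: the `make_id` below is a hypothetical stand-in that assumes the ID is the timestamp followed by the size in the low seven decimal digits, and `new_ids` is an assumed simplification of how polling skips everything up to `cur_offset`.

```ruby
# Hypothetical stand-in for maildir.rb's make_id: the timestamp supplies the
# high digits and the file size the low seven decimal digits.
def make_id(mtime, size)
  mtime.to_i * 10_000_000 + (size % 10_000_000)
end

# Assumed simplification of polling: after sorting, only IDs that come
# after cur_offset are treated as new.
def new_ids(ids, cur_offset)
  ids.sort.select { |id| id > cur_offset }
end

same_second = 1_239_919_500
first  = make_id(same_second, 5_000)  # arrives first, 5000 bytes
second = make_id(same_second, 1_200)  # arrives later in the same second, 1200 bytes

cur_offset = first                    # poll 1 saw only the first message
missed = new_ids([first, second], cur_offset)
puts "second sorts before first: #{second < first}"
puts "new messages found on poll 2: #{missed.length}"
```

Under these assumptions the later, smaller message gets the smaller ID, so the second poll reports nothing new and the message is lost.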
* [sup-talk] Possible problem with maildir ID generation
From: William Morgan @ 2009-04-21 14:00 UTC

Reformatted excerpts from Mark Alexander's message of 2009-04-16:
> Consider two messages that arrive in the same second, but the second
> message is smaller than the first. Because the message size makes up
> the low seven (decimal) digits of the ID, the second message, even
> though it arrived later, will have an ID that is less than the first
> message.

I think you could be right. Using the size as part of the ID was supposed to differentiate messages with the same timestamp, but it would result in exactly the behavior you describe when polling.

I think there's a much simpler scheme we can use that will also fix this. I'll post a patch soon and we can see if it addresses the problem.

--
William <wmorgan-sup at masanjin.net>

* [sup-talk] Possible problem with maildir ID generation
From: Mark Alexander @ 2009-04-21 15:33 UTC

On Tue, Apr 21, 2009 at 7:00 AM, William Morgan <wmorgan-sup at masanjin.net> wrote:
> I think you could be right. Using the size as part of the ID was
> supposed to differentiate messages with the same timestamp, but it would
> result in exactly the behavior you describe when polling.
>
> I think there's a much simpler scheme we can use that will also fix
> this. I'll post a patch soon and we can see if it addresses the
> problem.

I'd be very interested in this patch.

In the meantime, I made some minor changes to maildir.rb, without changing the ID scheme. One problem was every time a maildir was polled, the most recent message (i.e., the one at cur_offset) would be treated as a new message again. I also changed last_offset to return an ID that would be one second later than the last message seen.

These changes seem to have mostly fixed the lost message problem I was having, though I'm not exactly sure why. I've only had one lost message over the last couple of days, instead of the expected 10 or 20. I can't explain this one lost message, but I think it must be due to a different problem, unrelated to maildir handling. I was able to get sup to see this message again by doing a 'touch' on both the message itself and the containing maildir.

I doubt that my changes would fix the race condition I described earlier, but I've avoided this problem by not running fetchmail in the background while sup is running.

I'll send out my patch in a separate email.

[parent not found: <20090428191822.GB10581@cabinet.hsd1.ma.comcast.net>]
* [sup-talk] Possible problem with maildir ID generation
From: Mark Alexander @ 2009-04-28 23:29 UTC

On Tue, Apr 28, 2009 at 3:18 PM, Marc Hartstein <marc.hartstein at alum.vassar.edu> wrote:
> Isn't part of the maildir scheme that the filenames are guaranteed to be
> unique? It's been a while since I looked at this part of the sup
> source, but would it be possible to simply use the filename as the ID
> when working with maildir, rather than generating a new ID? Or is there
> an additional constraint (like ordering?) that needs to be satisfied and
> isn't by maildir?

Maildir filenames are unique, but they would need to be ordered by time, since sup depends on that ordering (look in maildir.rb for where it uses sort). I'm not sure if mail delivery programs (I use procmail) guarantee that the filenames are ordered that way.

I will say that the patch I sent out for maildir.rb has made my life a lot happier, but it's still not ideal because of the race condition I mentioned.

William was talking about using some other scheme to generate IDs. We should see what he has to say about this.

* [sup-talk] Possible problem with maildir ID generation
From: William Morgan @ 2009-04-29 22:31 UTC

Reformatted excerpts from Mark Alexander's message of 2009-04-28:
> Maildir filenames are unique, but they would need to be ordered by
> time, since sup depends on that ordering (look in maildir.rb for where
> it uses sort). I'm not sure if mail delivery programs (I use
> procmail) guarantee that the filenames are ordered that way.

That's correct; the name is not sufficient as an id because Sup needs a single pointer into the Maildir as a marker for what it has already processed, so we have to use something ordinal. But it can't just be any old ordinal, it has to be something that corresponds with the way messages are written to the Maildir, in order to be able to divide newer messages from older ones.

A timestamp is the obvious choice, but messages can have the same timestamp, so then what do you do? The current approach is to sort by another arbitrary field (in this case, message size), which gives a unique ordering, but doesn't match up

(All this rigamarole about ordinals and blah blah blah is necessary because I don't want Sup to rescan the entire Maildir unless absolutely necessary. One day I'll convert my mbox to a Maildir with 250k files in it, and a rescan will kill me, especially at Ruby speed.)

> I will say that the patch I sent out for maildir.rb has made my life a
> lot happier, but it's still not ideal because of the race condition I
> mentioned.
>
> William was talking about using some other scheme to generate IDs. We
> should see what he has to say about this.

Well I haven't quite started on it yet, but my plan is to:

a) Sort files by timestamp, and then by something else (maybe name), and use the position in that array instead of the timestamp. This doesn't solve anything, but it will make the ids prettier, and removes the hideous "%7d" thing.

b) When polling, if the current "offset" is N, return all messages that have a timestamp >= the Nth message. So this will mean that we'll rescan messages on occasion, but we shouldn't miss any.

Any obvious flaws?

--
William <wmorgan-sup at masanjin.net>

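The two-step plan above (position-based ids plus a >= comparison on timestamps) might look roughly like the following. This is only a sketch under stated assumptions: `Entry`, `ordered`, and `poll` are hypothetical names that do not come from maildir.rb, and name is used as the tie-breaker only because the message suggests it as one option.

```ruby
# Hypothetical record for one file in the maildir.
Entry = Struct.new(:name, :mtime)

# (a) Order by timestamp, breaking ties by name; an id is then simply the
# position of an entry in this array.
def ordered(entries)
  entries.sort_by { |e| [e.mtime, e.name] }
end

# (b) Given the current offset N, return every entry whose timestamp is
# >= the timestamp of the Nth entry. A nil offset means nothing has been
# seen yet, so everything is returned.
def poll(entries, offset)
  sorted = ordered(entries)
  return sorted if offset.nil? || sorted.empty?
  threshold = sorted[[offset, sorted.size - 1].min].mtime
  sorted.select { |e| e.mtime >= threshold }
end

entries = [Entry.new("m0", 90), Entry.new("m2", 100)]
offset = ordered(entries).size - 1    # poll 1: last seen position is 1 ("m2")

entries << Entry.new("m1", 100)       # arrives later, same second, sorts earlier
puts poll(entries, offset).map(&:name).inspect
```

In this sketch the second poll returns both "m1" and the already-seen "m2", which is the occasional rescan the plan accepts, so the caller has to tolerate or filter duplicates.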
* [sup-talk] Possible problem with maildir ID generation
From: William Morgan @ 2009-04-29 22:39 UTC

Reformatted excerpts from William Morgan's message of 2009-04-29:
> A timestamp is the obvious choice, but messages can have the same
> timestamp, so then what do you do? The current approach is to sort by
> another arbitrary field (in this case, message size), which gives a
> unique ordering, but doesn't match up

Sigh. Doesn't match up with what procmail, or whatever MUA, is giving us.

--
William <wmorgan-sup at masanjin.net>

[parent not found: <20090429233820.GA14143@cabinet.hsd1.ma.comcast.net>]
* [sup-talk] Possible problem with maildir ID generation
From: William Morgan @ 2009-05-04 16:10 UTC

Reformatted excerpts from Marc Hartstein's message of 2009-04-29:
> On Wed, Apr 29, 2009 at 03:31:52PM -0700, William Morgan wrote:
> > (All this rigamarole about ordinals and blah blah blah is necessary
> > because I don't want Sup to rescan the entire Maildir unless absolutely
> > necessary. One day I'll convert my mbox to a Maildir with 250k files in
> > it, and a rescan will kill me, especially at Ruby speed.)
>
> How are we defining 'rescan' here? I can think of a couple of possible
> meanings, and I'm not sure where you are:
>
> 1. Open every file in the maildir and read from it, every poll.
> 2. Visit every filename in the maildir to consider whether it is a new
> file, every poll.
> 3. Something else I'm not thinking of this instant.

You're probably confused because what I said was wrong. Here's what I should have said:

Rescanning the entire directory to check for new mail is unavoidable in Maildir. What I *do* have some control over, i.e. can try to optimize, is the following operation: given a file in the directory, does it represent a new message?

One way to do this would be to maintain a set of all filenames representing read messages, and to check for existence within the set, for every file. Sup doesn't do this; instead it defines an ordering on filenames, such that newer filenames are "larger" than older filenames under that ordering, and maintains a threshold representing the dividing point between new and old. The idea being that performing that comparison is quick, and that storing the set of read messages is also quick.

(And again, I care about the case where you have 500k messages, and so doing set existence is painful. Although perhaps Bloom filters are the solution... that would be interesting!)

> So the only way you're going to get a timestamp collision (on the
> filename timestamp, perhaps not on the actual ctime) is if you have
> multiple processes delivering mail simultaneously, or if you're
> synchronizing mail delivered on multiple hosts. In the case of the
> latter, a timestamp-based heuristic for finding new messages isn't
> going to work.

I think the collisions we are seeing are due to timestamp granularity. A little research suggests that e.g. ext2fs ctimes are 1-second granular. It seems quite easy to get more than one email per second.

> Would it suffice to keep track of the filename of the most recent
> message added for each maildir source, and check everything with a
> time portion of the filename equal to or greater than that message for
> whether it needs to be added? [although see above about whether that
> gains anything]

Yes, I think that's the solution. Basically switch from > to >= when comparing timestamps, and realize that you're going to get some dupes.

> Having looked into this more closely, I'm starting to seriously
> reconsider whether maildir is really what I want to be using for storing
> my mail. There are some weird timing issues.

Well, mbox has the advantage of being nice and linear so this problem goes away, but is horribly broken in other ways (c.f. the recent thread about the "From problem"). IMAP is a ridiculous PITA to deal with (at least as a client), so you're kinda screwed in every direction here. If anything, Maildir is less broken than the other alternatives.

Plus Maildir works really well with git. :)

--
William <wmorgan-sup at masanjin.net>

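The ">= plus dupes" idea agreed on above could be sketched as follows. The assumptions are loud ones: `time_portion` relies on the common maildir naming convention in which the text before the first "." is the delivery time in epoch seconds, and `seen_at_threshold` is a hypothetical small set (not something sup is known to keep) that remembers only the names already added for the threshold second, so the duplicates produced by >= can be dropped without storing every filename.

```ruby
require "set"

# Assumes the common maildir naming convention, where the text before the
# first "." is the delivery time in seconds since the epoch.
def time_portion(filename)
  File.basename(filename).split(".", 2).first.to_i
end

# Returns the filenames that should be treated as new: anything strictly
# after the threshold second, plus anything in the threshold second itself
# that is not in the (hypothetical) seen_at_threshold set.
def new_messages(filenames, threshold, seen_at_threshold = Set.new)
  filenames.select do |fn|
    t = time_portion(fn)
    t > threshold || (t == threshold && !seen_at_threshold.include?(fn))
  end
end

files = [
  "1240999999.9.host",
  "1241000000.1.host",   # already added on the previous poll
  "1241000000.2.host",   # delivered later in the same second
  "1241000005.3.host"
]
puts new_messages(files, 1_241_000_000, Set["1241000000.1.host"]).inspect
```

The design trade-off in this sketch is that only one second's worth of names is remembered, rather than a set covering the whole maildir.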
[parent not found: <20090504165224.GA15815@cabinet.hsd1.ma.comcast.net>]
* [sup-talk] Possible problem with maildir ID generation
From: Mark Alexander @ 2009-05-04 17:24 UTC

On Mon, May 4, 2009 at 12:52 PM, Marc Hartstein <marc.hartstein at alum.vassar.edu> wrote:
> On Mon, May 04, 2009 at 09:10:23AM -0700, William Morgan wrote:
>> I think the collisions we are seeing are due to timestamp granularity.
>> A little research suggests that e.g. ext2fs ctimes are 1-second
>> granular. It seems quite easy to get more than one email per second.
>
> It does, but as I mentioned somewhere in that email, the reference
> implementation of maildir (qmail) says that it will only deliver one
> message per second. However, I definitely concede that relying on this
> behavior would be setting things up for something to go wrong down the
> line.

Just a data point: in my setup (postfix+procmail) I'm seeing multiple messages being delivered to the same maildir in the same second. That was what motivated my patch.
