* [sup-talk] possible mbox "initial From" fix
@ 2009-04-29 18:02 William Morgan
2009-04-30 0:51 ` Bart Schaefer
0 siblings, 1 reply; 3+ messages in thread
From: William Morgan @ 2009-04-29 18:02 UTC
After a lot of toying with RubyMail, hoping I could get it to behave
well, I finally gave up and just tweaked the regexp that determines
whether a line is an mbox separator or not, and bypassed RubyMail mbox
splitting entirely. It might still be too lenient---I have it looking
for /^From \S+@\S+ /, so it's not even bothering to parse a date, etc.
I'm hoping to strike somewhat of a balance between strict and liberal.
So, please try it out and see if it solves your mbox problems.
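
To make the trade-off concrete, here is a small sketch of how that regexp
behaves on a few made-up lines. The constant name and the sample addresses
are illustrative only and are not taken from the sup source.

```ruby
# Hypothetical constant name; the pattern itself is the one quoted above.
MBOX_FROM_RE = /^From \S+@\S+ /

lines = [
  "From wmorgan@example.com Wed Apr 29 18:02:00 2009", # genuine separator
  "From alice@example.com I heard the server is down", # body text, still matches
  "From root Wed Apr 29 18:02:00 2009",                # local user without @, missed
  ">From bob@example.com Wed Apr 29 18:02:00 2009",    # escaped body line, rejected
]

# true where the line would be treated as a separator
results = lines.map { |line| !!(line =~ MBOX_FROM_RE) }
```

The second and third lines show the two failure modes: a false positive on
ordinary prose and a false negative on an address with no @ sign.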
--
William <wmorgan-sup at masanjin.net>
From: Bart Schaefer @ 2009-04-30 0:51 UTC
On Wed, Apr 29, 2009 at 11:02 AM, William Morgan
<wmorgan-sup at masanjin.net> wrote:
> After a lot of toying with RubyMail, hoping I could get it to behave
> well, I finally gave up and just tweaked the regexp that determines
> whether a line is an mbox separator or not, and bypassed RubyMail mbox
> splitting entirely. It might still be too lenient---I have it looking
> for /^From \S+@\S+ /, so it's not even bothering to parse a date, etc.
> I'm hoping to strike somewhat of a balance between strict and liberal.
Offering a geezer perspective on this, as one of the primary
programmers on an old Unix mail reader that originally didn't
understand any kind of folder *except* mbox ... the "match separator"
method that worked best for us back in the early '90s went like this:
(1) Check for "From " at start of line; if not found, return "not a separator".
(2) Skip 5 characters, then skip any whitespace.
(3) Remember this location.
(4) Skip everything that is NOT whitespace, ignoring syntax for now.
(5) Skip any whitespace.
(6) Attempt to parse a date string in any of a variety of formats.
(6a) If successful, return "is a separator"
(6b) If unsuccessful, rewind to the location saved at (3)
(7) Attempt to parse an email address in full RFC 822 (now RFC 5322)
syntax, including comments, phrases, etc.; if unsuccessful, return
"not a separator".
(8) Skip any whitespace following the parsed email address
(9) Attempt again to parse a date; if successful, return "is a separator"
(10) Return "not a separator"
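
The steps above could be sketched in Ruby roughly as follows. This is only
an illustration under stated assumptions: the two date formats are guesses
(the formats the original mail reader recognized are not listed in this
thread), ADDRESS_RE is a crude stand-in for a real RFC 5322 address parser,
and all names are hypothetical.

```ruby
require "time"

# Assumed format list; the thread does not say which formats were recognized.
DATE_FORMATS = [
  "%a %b %d %H:%M:%S %Y",     # asctime style: Wed Apr 29 18:02:00 2009
  "%a, %d %b %Y %H:%M:%S %z", # RFC 822 style: Wed, 29 Apr 2009 18:02:00 -0700
].freeze

# Steps (6) and (9): does str begin with a date in one of the assumed formats?
def date_at_start?(str)
  DATE_FORMATS.any? do |fmt|
    begin
      Time.strptime(str, fmt)
      true
    rescue ArgumentError
      false
    end
  end
end

# Step (7), crudely approximated. Real RFC 5322 parsing (nested comments,
# quoted pairs, groups) is far more involved. This only recognizes:
#   "Phrase" <addr>  |  <addr>  |  addr (comment)  |  addr
ADDRESS_RE = /\A(?:"[^"]*"\s*)?(?:<[^<>\s]+>|[^\s<>()]+)(?:\s*\([^()]*\))?/

def mbox_separator?(line)
  return false unless line.start_with?("From ")  # (1)
  rest = line[5..-1].lstrip                      # (2), (3): rest is the saved location
  after_token = rest.sub(/\A\S+/, "").lstrip     # (4), (5)
  return true if date_at_start?(after_token)     # (6), (6a)
  m = ADDRESS_RE.match(rest)                     # (6b) rewind, then (7)
  return false unless m
  date_at_start?(m.post_match.lstrip)            # (8), (9), (10)
end
```

For example, mbox_separator?("From root Wed Apr 29 18:02:00 2009") succeeds
at (6a), while a line such as
"From jdoe@example.com (John Doe) Wed Apr 29 18:02:00 2009" only succeeds
after the rewind, at (9).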
We found that in most cases this failed at (1) or succeeded very
quickly at (6a). Only obscure cases proceed to (7), but if you're
dealing with anything like old USENET news archives or folders written
by '80s-era mail clients you need either step (4) or step (7) to get
past the cruft.
Note that the key is finding "From ... DATE" rather than "From ADDRESS
..." if you really want to distinguish message separators from stuff
people type in a message body. I'm not sure you can do this with a
regular expression.
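
For a single fixed date layout a regexp can get part of the way, as in
this sketch. It assumes the asctime-style date only and a sender with no
whitespace, so it does not cover the variety of formats or the step (7)
cases described above. The constant names are made up for illustration.

```ruby
# Hypothetical names. Matches "From <token> <asctime-style date>" and nothing else.
DAY = "(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
MON = "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
FROM_DATE_RE = /\AFrom +\S+ +#{DAY} #{MON} [ \d]\d \d\d:\d\d:\d\d \d{4}/

samples = [
  "From root Wed Apr 29 18:02:00 2009",                          # matches, no @ needed
  "From alice@example.com I heard the server is down",           # rejected, no date
  'From "John Doe" <jdoe@example.com> Wed Apr 29 18:02:00 2009', # rejected, sender has spaces
]

matches = samples.map { |line| FROM_DATE_RE.match?(line) }
```

The last sample is the kind of line that needs an address parser rather
than a pattern.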
If you want details on the variety of date formats that we recognized,
let me know ...
From: William Morgan @ 2009-05-04 14:35 UTC
Reformatted excerpts from Bart Schaefer's message of 2009-04-29:
> We found that in most cases this failed at (1) or succeeded very
> quickly at (6a). Only obscure cases proceed to (7), but if you're
> dealing with anything like old USENET news archives or folders written
> by '80s-era mail clients you need either step (4) or step (7) to get
> past the cruft.
>
> Note that the key is finding "From ... DATE" rather than "From ADDRESS
> ..." if you really want to distinguish message separators from stuff
> people type in a message body. I'm not sure you can do this with a
> regular expression.
Thanks! This is really helpful. I am a little worried about the current
fix: an email address for a local user isn't required to contain an @
sign, so the regexp will produce false negatives, and there's also a
non-trivial potential for false positives.
If we went this route (which wouldn't require a big changeset), I may
punt on parsing the date myself and just rely on Time.parse. Speed
shouldn't really be affected (except in weird pathological cases) since
the date parsing will be a second step. I like it.
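
A rough sketch of that two-step idea, assuming the sender is a single
whitespace-free token. The method name is hypothetical. Time.parse is
fairly permissive, so this may still accept some body lines that happen to
contain date-like text.

```ruby
require "time"

# Hypothetical helper, not the code that went into sup.
def probably_mbox_separator?(line)
  # Step one: cheap prefix test, which rejects nearly every line.
  return false unless line.start_with?("From ")

  # Drop the sender token without inspecting it, so "root" is treated the
  # same as "user@host". Assumes the sender contains no whitespace.
  _sender, rest = line[5..-1].strip.split(/\s+/, 2)
  return false if rest.nil?

  # Step two: only now pay for date parsing.
  begin
    Time.parse(rest)
    true
  rescue ArgumentError
    false
  end
end
```

Because the second step runs only on lines that already start with
"From ", the cost of Time.parse is paid for very few lines in a typical
mbox file.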
--
William <wmorgan-sup at masanjin.net>