From mboxrd@z Thu Jan 1 00:00:00 1970
From: barton.schaefer@gmail.com (Bart Schaefer)
Date: Wed, 29 Apr 2009 17:51:15 -0700
Subject: [sup-talk] possible mbox "initial From" fix
In-Reply-To: <1241027916-sup-7642@entry>
References: <1241027916-sup-7642@entry>
Message-ID: <6bb609560904291751n3cc62f46g6a4edcba1f97d6f7@mail.gmail.com>

On Wed, Apr 29, 2009 at 11:02 AM, William Morgan wrote:
> After a lot of toying with RubyMail, hoping I could get it to behave
> well, I finally gave up and just tweaked the regexp that determines
> whether a line is an mbox separator or not, and bypassed RubyMail mbox
> splitting entirely. It might still be too lenient---I have it looking
> for /^From \S+@\S+ /, so it's not even bothering to parse a date, etc.
> I'm hoping to strike somewhat of a balance between strict and liberal.

Offering a geezer perspective on this, as one of the primary programmers
on an old Unix mail reader that originally didn't understand any kind of
folder *except* mbox ... the "match separator" method that worked best
for us back in the early '90s went like this:

(1) Check for "From " at start of line; if not found, return "not a
    separator".
(2) Skip 5 characters, then skip any whitespace.
(3) Remember this location.
(4) Skip everything that is NOT whitespace, ignoring syntax for now.
(5) Skip any whitespace.
(6) Attempt to parse a date string in any of a variety of formats.
    (6a) If successful, return "is a separator"
    (6b) If unsuccessful, rewind to the location saved at (3)
(7) Attempt to parse an email address in full RFC822 (now 5322) syntax,
    including comments, phrases, etc; if unsuccessful, return "not a
    separator"
(8) Skip any whitespace following the parsed email address
(9) Attempt again to parse a date; if successful, return "is a
    separator"
(10) Return "not a separator"

We found that in most cases this failed at (1) or succeeded very quickly
at (6a). Only obscure cases proceed to (7), but if you're dealing with
anything like old USENET news archives or folders written by '80s-era
mail clients you need either step (4) or step (7) to get past the cruft.

Note that the key is finding "From ... DATE" rather than "From ADDRESS
..." if you really want to distinguish message separators from stuff
people type in a message body. I'm not sure you can do this with a
regular expression.

If you want details on the variety of date formats that we recognized,
let me know ...
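
For anyone who wants to experiment, steps (1) through (10) above can be
sketched roughly as follows in Python. This is only an illustration under
several assumptions, not the original mail reader's code: the names
is_separator, parses_as_date and DATE_FORMATS are hypothetical, the three
date layouts are a stand-in for the much larger set mentioned above, and
the date is required to fill the rest of the line. Step (7) is also
simplified: rather than a real RFC 5322 parser, the sketch tries every
whitespace boundary and asks the standard library's lenient
email.utils.parseaddr whether the text before it yields an address
containing "@".

```python
import re
from datetime import datetime
from email.utils import parseaddr

# Hypothetical, deliberately short list of date layouts. The reader
# described above recognized many more; they are not listed here.
DATE_FORMATS = (
    "%a %b %d %H:%M:%S %Y",      # asctime style: Thu Jan  1 00:00:00 1970
    "%a %b %d %H:%M %Y",         # same, without seconds
    "%a, %d %b %Y %H:%M:%S %z",  # RFC 5322 style with a numeric zone
)


def parses_as_date(text):
    """Return True if all of `text` matches one of DATE_FORMATS."""
    text = " ".join(text.split())  # collapse runs of whitespace
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


def is_separator(line):
    """Decide whether `line` looks like an mbox "From " separator."""
    # (1) Must start with "From ".
    if not line.startswith("From "):
        return False
    # (2)-(3) Skip the 5 characters and any whitespace; `rest` is the
    # remembered location.
    rest = line[5:].strip()
    # (4)-(6) Skip one run of non-whitespace, whatever it is, then the
    # following whitespace, and try to read a date.
    parts = rest.split(None, 1)
    if len(parts) == 2 and parses_as_date(parts[1]):
        return True  # (6a)
    # (6b)-(9) Rewind and look for "ADDRESS DATE". Assumption: try each
    # whitespace gap as the end of the address instead of running a
    # full RFC 5322 parser.
    for gap in re.finditer(r"\s+", rest):
        candidate, tail = rest[:gap.start()], rest[gap.end():]
        _name, addr = parseaddr(candidate)
        if "@" in addr and parses_as_date(tail):
            return True
    # (10) Nothing matched.
    return False
```

With these assumptions, a UUCP-style line such as
"From cruft!path!user Wed Apr 29 17:51:15 2009" is accepted at (6a) even
though it contains no "@", while "From bob@example.com to alice" is
rejected because no date follows the address. That is the
"From ... DATE" point made above, and it is where a pattern like
/^From \S+@\S+ / would decide differently on both lines.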