Archive of RubyForge sup-talk mailing list
From: barton.schaefer@gmail.com (Bart Schaefer)
Subject: [sup-talk] possible mbox "initial From" fix
Date: Wed, 29 Apr 2009 17:51:15 -0700
Message-ID: <6bb609560904291751n3cc62f46g6a4edcba1f97d6f7@mail.gmail.com>
In-Reply-To: <1241027916-sup-7642@entry>

On Wed, Apr 29, 2009 at 11:02 AM, William Morgan
<wmorgan-sup at masanjin.net> wrote:
> After a lot of toying with RubyMail, hoping I could get it to behave
> well, I finally gave up and just tweaked the regexp that determines
> whether a line is an mbox separator or not, and bypassed RubyMail mbox
> splitting entirely. It might still be too lenient---I have it looking
> for /^From \S+@\S+ /, so it's not even bothering to parse a date, etc.
> I'm hoping to strike somewhat of a balance between strict and liberal.
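
[Editor's sketch, not part of the original message.] The leniency
mentioned in the quoted text can be seen by running the quoted pattern
against a few lines. The sample lines and the helper name
lenient_separator? are made up for illustration and are not taken from
sup or RubyMail.

```ruby
# The pattern quoted above: "From ", a token containing "@", then a space.
LENIENT_SEPARATOR = /^From \S+@\S+ /

# Hypothetical sample lines, invented for this illustration.
REAL_SEPARATOR = "From alice@example.com Wed Apr 29 11:02:00 2009"
BODY_SENTENCE  = "From bob@example.com I heard the server was down."
BANG_PATH_LINE = "From uunet!alice Wed Apr 29 11:02:00 2009"

# Hypothetical helper: true when the lenient pattern accepts the line.
def lenient_separator?(line)
  LENIENT_SEPARATOR.match?(line)
end

puts lenient_separator?(REAL_SEPARATOR)  # accepted, as intended
puts lenient_separator?(BODY_SENTENCE)   # also accepted, although it is body text
puts lenient_separator?(BANG_PATH_LINE)  # rejected, because the sender has no "@"
```

The pattern never looks at the date, so it cannot tell the first two
lines apart.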

Offering a geezer perspective on this, as one of the primary
programmers on an old Unix mail reader that originally didn't
understand any kind of folder *except* mbox ... the "match separator"
method that worked best for us back in the early '90s went like this:

(1) Check for "From " at start of line; if not found, return "not a separator".
(2) Skip 5 characters, then skip any whitespace.
(3) Remember this location.
(4) Skip everything that is NOT whitespace, ignoring syntax for now.
(5) Skip any whitespace.
(6) Attempt to parse a date string in any of a variety of formats.
(6a) If successful, return "is a separator".
(6b) If unsuccessful, rewind to the location saved at (3).
(7) Attempt to parse an email address in full RFC822 (now 5322)
syntax, including comments, phrases, etc.; if unsuccessful, return "not
a separator".
(8) Skip any whitespace following the parsed email address.
(9) Attempt again to parse a date; if successful, return "is a separator".
(10) Return "not a separator".
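
[Editor's sketch, not part of the original message.] The steps above
could be sketched in Ruby roughly as follows, using StringScanner from
the standard library. Two pieces are loud assumptions: DATE accepts only
an asctime-style date instead of the "variety of formats" mentioned in
step (6), and ADDRESS is a crude stand-in for a full RFC 5322 address
parser. The names DATE, ADDRESS and mbox_separator? are hypothetical.

```ruby
require "strscan"

DAY   = "(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)"
MONTH = "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

# Assumption: one asctime-style format only, e.g. "Wed Apr 29 17:51:15 2009",
# with optional seconds and an optional zone before the year.
DATE = /#{DAY} #{MONTH} [ \d]\d \d\d:\d\d(?::\d\d)?(?: [A-Z]{3}| [+-]\d{4})? \d{4}/

# Assumption: a simplified address, NOT full RFC 5322. Accepts
# "Phrase" <user@host>, <user@host> or user@host, plus an optional (comment).
ADDRESS = /(?:(?:"[^"]*"\s*)?<[^<>\s]+@[^<>\s]+>|[^\s<>()]+@[^\s<>()]+)(?:\s*\([^()]*\))?/

def mbox_separator?(line)
  s = StringScanner.new(line)
  return false unless s.scan(/From /)  # (1), and the 5 characters of (2)
  s.skip(/\s*/)                        # (2) skip any further whitespace
  mark = s.pos                         # (3) remember this location
  s.skip(/\S+/)                        # (4) skip one token, ignoring syntax
  s.skip(/\s+/)                        # (5)
  return true if s.scan(DATE)          # (6), (6a)
  s.pos = mark                         # (6b) rewind
  return false unless s.scan(ADDRESS)  # (7)
  s.skip(/\s+/)                        # (8)
  !s.scan(DATE).nil?                   # (9) true on a date, else (10) false
end

puts mbox_separator?("From bart@example.com Wed Apr 29 17:51:15 2009")
puts mbox_separator?("From bart@example.com (Bart Schaefer) Wed Apr 29 17:51:15 2009")
puts mbox_separator?("From bob@example.com I heard the server was down.")
```

The first line succeeds at (6a), the second only after the rewind and
the address parse at (7), and the third falls through to (10).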

We found that in most cases this failed at (1) or succeeded very
quickly at (6a).  Only obscure cases proceed to (7), but if you're
dealing with anything like old USENET news archives or folders written
by '80s-era mail clients you need either step (4) or step (7) to get
past the cruft.

Note that the key is finding "From ... DATE" rather than "From ADDRESS
..." if you really want to distinguish message separators from stuff
people type in a message body.  I'm not sure you can do this with a
regular expression.
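
[Editor's sketch, not part of the original message.] If only a single
date format has to be recognized, a regular expression can approximate
the "From ... DATE" idea. The pattern below assumes asctime-style dates
only and treats the sender as one opaque token; FROM_THEN_DATE and
date_keyed_separator? are hypothetical names.

```ruby
# Assumption: only dates shaped like "Wed Apr 29 17:51:15 2009" are accepted.
# The sender is any run of non-whitespace; its syntax is deliberately ignored.
FROM_THEN_DATE = /\AFrom +\S+ +(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) [ \d]\d \d\d:\d\d(?::\d\d)? \d{4}\s*\z/

# Hypothetical helper: true when the line is "From", one token, then a date.
def date_keyed_separator?(line)
  FROM_THEN_DATE.match?(line)
end

puts date_keyed_separator?("From uunet!bart Wed Apr 29 17:51:15 2009")
puts date_keyed_separator?("From bob@example.com I heard the server was down.")
puts date_keyed_separator?("From bart@example.com (Bart Schaefer) Wed Apr 29 17:51:15 2009")
```

This accepts the bang-path sender and rejects the body sentence, but it
also rejects the last line, where the address carries a comment. Handling
that case, and the other date formats, is where a single pattern stops
being practical.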

If you want details on the variety of date formats that we recognized,
let me know ...


Thread overview: 3+ messages
2009-04-29 18:02 William Morgan
2009-04-30  0:51 ` Bart Schaefer [this message]
2009-05-04 14:35   ` William Morgan
