From mboxrd@z Thu Jan 1 00:00:00 1970
From: wmorgan-sup@masanjin.net (William Morgan)
Date: Mon, 04 May 2009 07:35:20 -0700
Subject: [sup-talk] possible mbox "initial From" fix
In-Reply-To: <6bb609560904291751n3cc62f46g6a4edcba1f97d6f7@mail.gmail.com>
References: <1241027916-sup-7642@entry> <6bb609560904291751n3cc62f46g6a4edcba1f97d6f7@mail.gmail.com>
Message-ID: <1241447393-sup-7632@entry>

Reformatted excerpts from Bart Schaefer's message of 2009-04-29:
> We found that in most cases this failed at (1) or succeeded very
> quickly at (6a). Only obscure cases proceed to (7), but if you're
> dealing with anything like old USENET news archives or folders written
> by '80s-era mail clients you need either step (4) or step (7) to get
> past the cruft.
>
> Note that the key is finding "From ... DATE" rather than "From ADDRESS
> ..." if you really want to distinguish message separators from stuff
> people type in a message body. I'm not sure you can do this with a
> regular expression.

Thanks! This is really helpful. I am a little worried about the current
fix, since there's no real requirement that an email address have an @
sign in it for local users, and that will result in false negatives, and
there's a non-trivial potential for false positives.

If we went this route (which wouldn't require a big changeset), I may
punt on parsing the date myself and just rely on Time.parse. Speed
shouldn't really be affected (except in weird pathological cases) since
the date parsing will be a second step. I like it.

-- 
William
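For illustration only, here is a minimal sketch of the two-step idea discussed above: a cheap regular-expression check for a "From SENDER REST" line, followed by a date parse of REST using Ruby's Time.parse. The names FROM_LINE and mbox_separator? are hypothetical and this is not sup's actual code. It also assumes a Ruby version in which Time.parse raises ArgumentError when it finds no date or time information; Time.parse is lenient, so some false positives may remain.

```ruby
require 'time'

# Hypothetical pattern: "From ", a sender token (no @ required, so local
# users still match), then the rest of the line, which should be a date.
FROM_LINE = /\AFrom (\S+)\s+(.+)\z/

# Hypothetical helper: returns true if the line looks like an mbox
# message separator, i.e. "From SENDER DATE".
def mbox_separator?(line)
  m = FROM_LINE.match(line.chomp)
  return false unless m          # step 1: fast rejection by regex

  begin
    Time.parse(m[2])             # step 2: only now attempt the date parse
    true
  rescue ArgumentError
    false                        # no usable date, so treat as body text
  end
end
```

Because most body lines fail the regex in step 1, the comparatively slow date parse in step 2 runs only on the few lines that already look like separators.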