Re: [sup-talk] Indexing messages without ruby

From: Carl Worth <cworth@cworth.org>
To: sup-talk <sup-talk@rubyforge.org>
Subject: Re: [sup-talk] Indexing messages without ruby
Date: Mon, 19 Oct 2009 21:24:21 -0700
Message-ID: <1256009934-sup-9323@yoom.home.cworth.org>
In-Reply-To: <1255623468-sup-2284@yoom.home.cworth.org>

Excerpts from Carl Worth's message of Thu Oct 15 10:23:40 -0700 2009:
> As for performance, things look pretty good, but perhaps not as good
> as I had hoped.

I know William already said he's not all that concerned with the
performance of sup-sync since it's not a common operation, but me, I
can't stop working on the problem.

And I think that's justified, really. For one thing, the giant
sup-sync is one of the first things a new user has to do. And I think
that having to wait for an operation that's measured in hours before
being able to use the program at all can be very off-putting.

I think we could do better to give a good first impression.

> So this is preliminary, but it looks like notmuch gives a 5-10x
> performance improvement over sup, (but likely closer to the 5x end of
> that range unless you've got a very small index---at which point who
> cares how fast/slow things are?).

Those numbers were off. I now believe that my original code gained
only a 3x improvement by switching from ruby/rmail to C/GMime for mail
parsing. But I've done a little more coding since. Here are the
current results:

  For a benchmark of ~ 45000 messages, rate in messages/sec.:

  Rate    Commit ID       Significant change
  -----   ---------       ------------------
  41                      sup (with xapian, from next)
  120     5fbdbeb33       Switch from ruby to C (with GMime)
  538     9bc4253fa       Index headers only, not body
  1050    371091139       Use custom header parser, not GMime

  (Before each run the Linux disk cache was cleared with:
          sync; echo 3 > /proc/sys/vm/drop_caches
  )
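
The speedup factors discussed in this message can be checked against the table with a few lines of Python. This is only a sanity-check sketch: the rates are copied from the table, while the function and variable names are purely illustrative.

```python
# Rates in messages/sec., copied from the benchmark table above.
rates = [
    ("sup (with xapian, from next)", 41),
    ("Switch from ruby to C (with GMime)", 120),
    ("Index headers only, not body", 538),
    ("Use custom header parser, not GMime", 1050),
]


def speedups(rows):
    """Return (label, step_speedup, cumulative_speedup) for each row after the first."""
    base = rows[0][1]
    result = []
    for (_, prev_rate), (label, rate) in zip(rows, rows[1:]):
        result.append((label, rate / prev_rate, rate / base))
    return result


for label, step, total in speedups(rates):
    print(f"{label:38s} step {step:4.1f}x   total {total:5.1f}x")
```

The step factors come out close to the rounded 3x, 4x and 2x figures, and the cumulative factor for the last row is a little over 25x.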

So beyond the original 3x improvement, I gained a further 4x
improvement by simply doing less work. I'm now starting off by only
indexing message-id and thread-id data. That's obviously "cheating" in
terms of comparing performance, but I think it really makes sense to
do this.

The idea is that by just computing the thread-ids and indexing those
for a collection of email, that initial sup-sync could be performed
very quickly. Then, later, (as a background thread while sup is
running), the full-text indexing could be performed.
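
To make the two-phase idea concrete, here is a toy sketch in Python. It is not how sup or notmuch is implemented: the TwoPhaseIndex class, its in-memory dictionaries, and the message dictionaries with "id", "in_reply_to" and "body" keys are all assumptions made up for illustration, and the thread lookup assumes a parent is always seen before its replies.

```python
import queue
import threading


class TwoPhaseIndex:
    """Toy in-memory index: thread data first, full text later (hypothetical)."""

    def __init__(self):
        self.thread_of = {}    # message-id -> thread-id
        self.body_terms = {}   # message-id -> set of lowercased words
        self._pending = queue.Queue()

    def quick_sync(self, messages):
        """Phase 1: record only message-id -> thread-id and defer the bodies."""
        for msg in messages:
            # A reply joins its parent's thread; anything else starts a new one.
            parent = msg.get("in_reply_to")
            self.thread_of[msg["id"]] = self.thread_of.get(parent, msg["id"])
            self._pending.put(msg)

    def start_background_indexing(self):
        """Phase 2: index the deferred bodies on a worker thread."""
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()
        return worker

    def _drain(self):
        while True:
            try:
                msg = self._pending.get_nowait()
            except queue.Empty:
                return
            self.body_terms[msg["id"]] = set(msg["body"].lower().split())


index = TwoPhaseIndex()
index.quick_sync([
    {"id": "a@example", "in_reply_to": None, "body": "Indexing without ruby"},
    {"id": "b@example", "in_reply_to": "a@example", "body": "Looks fast"},
])
# Threads are usable right away; full-text terms arrive once the worker is done.
worker = index.start_background_indexing()
worker.join()
```

The point of the split is that thread_of is complete as soon as quick_sync returns, so a mail client could already show threads while the slower body indexing is still running.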

Finally, I gained another 2x improvement by not using GMime at all,
(which constructs a data structure for the entire message, even if I
only want a few headers), and instead just rolling a simple parser for
email headers. (Did you know you can hide nested parenthesized
comments all over the place in email headers? I didn't.)
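
As an illustration of the nested-comment wrinkle, here is a minimal Python sketch that strips such comments from a single, already-unfolded header value. It is not the parser from notmuch; strip_comments is a hypothetical helper that only handles comment nesting, backslash escapes, and double-quoted strings, (inside which parentheses are literal).

```python
def strip_comments(value):
    """Remove parenthesized comments, including nested ones, from a header value."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a "quoted string", where parentheses are literal
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and (depth > 0 or in_quotes):
            # Backslash escape: the next character is literal, so an escaped
            # ")" does not close a comment and an escaped '"' does not end a string.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            out.append(ch)
        i += 1
    return "".join(out)


print(strip_comments("Pete(A nice \\) chap) <pete(his account)@silly.test(his host)>"))
```

A depth counter is enough here because comments nest strictly; a regular expression that matches a single level of parentheses would cut a nested comment off at the first closing parenthesis.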

I'm quite happy with the final result, which is 25x faster than sup. I
can build a cold-cache index from my half-million message archive in
less than 10 minutes, (rather than 4 hours). And performance is fairly
IO-bound at this point, (in the 10-minute run, less than 7 minutes of
CPU time are used).
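
For a rough cross-check of those wall-clock figures, the benchmark rates can be extrapolated to the larger archive. This assumes the rates measured on ~45000 messages carry over unchanged to 500,000 messages, which is only an approximation; the names below are illustrative.

```python
ARCHIVE_MESSAGES = 500_000  # "half-million message archive", taken as a round number


def estimated_minutes(messages, rate_per_sec):
    """Wall-clock minutes needed to index `messages` at a constant rate."""
    return messages / rate_per_sec / 60


sup_minutes = estimated_minutes(ARCHIVE_MESSAGES, 41)
new_minutes = estimated_minutes(ARCHIVE_MESSAGES, 1050)
print(f"sup:           about {sup_minutes / 60:.1f} hours")
print(f"custom parser: about {new_minutes:.1f} minutes")

# Upper bound on CPU utilisation: under 7 CPU minutes in a 10-minute run.
print(f"CPU share of the run: under {7 / 10:.0%}")
```

The extrapolation lands at roughly three and a half hours versus about eight minutes, which is in line with the quoted "4 hours" and "less than 10 minutes".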

Anyway, there are some ideas to consider for sup.

If anyone wants to play with my code, it's here:

	git clone git://notmuch.org/notmuch

I won't bore the list with further developments in notmuch, if any,
unless it's on-topic, (such as someone trying to make sup work on top
of an index built by notmuch). And I'd be delighted to see that kind
of thing happen.

Happy hacking,

-Carl
