From: Carl Worth <cworth@cworth.org>
To: sup-talk <sup-talk@rubyforge.org>
Subject: Re: [sup-talk] Indexing messages without ruby
Date: Mon, 19 Oct 2009 21:24:21 -0700
Message-ID: <1256009934-sup-9323@yoom.home.cworth.org>
In-Reply-To: <1255623468-sup-2284@yoom.home.cworth.org>
Excerpts from Carl Worth's message of Thu Oct 15 10:23:40 -0700 2009:
> As for performance, things look pretty good, but perhaps not as good
> as I had hoped.
I know William already said he's not all that concerned with the
performance of sup-sync since it's not a common operation, but me, I
can't stop working on the problem.
And I think that's justified, really. For one thing, the giant
sup-sync is one of the first things a new user has to do. And I think
that having to wait for an operation that's measured in hours before
being able to use the program at all can be very off-putting.
I think we could do better to give a good first impression.
> So this is preliminary, but it looks like notmuch gives a 5-10x
> performance improvement over sup, (but likely closer to the 5x end of
> that range unless you've got a very small index---at which point who
> cares how fast/slow things are?).
Those numbers were off. I now believe that my original code gained
only a 3x improvement by switching from ruby/rmail to C/GMime for mail
parsing. But I've done a little more coding since. Here are the
current results:
For a benchmark of ~45000 messages, rate in messages/sec:

  Rate  Commit ID  Significant change
  ----  ---------  ------------------
    41  sup        (with xapian, from next)
   120  5fbdbeb33  Switch from ruby to C (with GMime)
   538  9bc4253fa  Index headers only, not body
  1050  371091139  Use custom header parser, not GMime
(Before each run the Linux disk cache was cleared with:

	sync; echo 3 > /proc/sys/vm/drop_caches
)
So beyond the original 3x improvement, I gained a further 4x
improvement by simply doing less work. I'm now starting off by only
indexing message-id and thread-id data. That's obviously "cheating" in
terms of comparing performance, but I think it really makes sense to
do this.
The idea is that by just computing the thread-ids and indexing those
for a collection of email, that initial sup-sync could be performed
very quickly. Then, later, (as a background thread while sup is
running), the full-text indexing could be performed.
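To make the two-phase idea concrete, here is a minimal sketch in Python. It is not how notmuch or sup is implemented; the class `TwoPhaseIndex`, its methods, and the `read_body` callback are all hypothetical names made up for illustration. Phase one records only message-id and thread-id, and phase two fills in body terms from a background thread.

```python
import queue
import threading


class TwoPhaseIndex:
    """Hypothetical index: phase 1 stores ids only, phase 2 adds body terms."""

    def __init__(self):
        self.thread_of = {}           # message-id -> thread-id (phase 1)
        self.terms = {}               # message-id -> set of body words (phase 2)
        self.pending = queue.Queue()  # messages still waiting for phase 2
        self._lock = threading.Lock()

    def add_headers(self, message_id, thread_id, path):
        """Phase 1: cheap, header-only indexing done during the initial sync."""
        with self._lock:
            self.thread_of[message_id] = thread_id
        self.pending.put((message_id, path))

    def index_bodies(self, read_body):
        """Phase 2: drain the queue; meant to run in a background thread."""
        while True:
            try:
                message_id, path = self.pending.get_nowait()
            except queue.Empty:
                return
            words = set(read_body(path).lower().split())
            with self._lock:
                self.terms[message_id] = words


if __name__ == "__main__":
    # Stand-in for reading message bodies from disk (assumption for the demo).
    bodies = {"a.eml": "Happy hacking", "b.eml": "Index headers only"}
    index = TwoPhaseIndex()
    index.add_headers("<1@example.org>", "thread-1", "a.eml")
    index.add_headers("<2@example.org>", "thread-1", "b.eml")
    worker = threading.Thread(target=index.index_bodies, args=(bodies.get,))
    worker.start()   # threads are already usable while bodies are indexed
    worker.join()
```

The point of the split is that `thread_of` is complete as soon as the header pass finishes, so the mail client could start up while `index_bodies` is still working through the queue.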
Finally, I gained a further 2x improvement by not using GMime at all,
(which constructs a data structure for the entire message, even if I
only want a few headers), and instead just rolling a simple parser for
email headers. (Did you know you can hide nested parenthesized
comments all over the place in email headers? I didn't.)
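For anyone curious what handling those comments involves, here is a small Python sketch. It is not the parser from notmuch (which is written in C); `strip_comments` is a hypothetical helper that only illustrates the nesting rule. Comments in parentheses may nest, a backslash escapes the next character, and parentheses inside a double-quoted string are literal text rather than comments.

```python
def strip_comments(value):
    """Remove (possibly nested) parenthesized comments from a header value."""
    out = []
    depth = 0          # current comment nesting level
    in_quotes = False  # inside a "quoted string", where parens are literal
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value):
            # Backslash escape: the next character is taken literally,
            # so keep the pair outside comments and drop it inside them.
            if depth == 0:
                out.append(value[i:i + 2])
            i += 2
            continue
        if depth == 0 and ch == '"':
            in_quotes = not in_quotes
            out.append(ch)
        elif not in_quotes and ch == "(":
            depth += 1                 # comment opens (or nests deeper)
        elif not in_quotes and ch == ")" and depth > 0:
            depth -= 1                 # innermost comment closes
        elif depth == 0:
            out.append(ch)             # ordinary text outside any comment
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_comments("<a(nested (comment) here)b@example.org>"))
```

Whitespace left behind by a removed comment is not collapsed here; a real parser would likely also unfold continuation lines before this step.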
I'm quite happy with the final result, which is 25x faster than sup. I
can build a cold-cache index from my half-million message archive in
less than 10 minutes, (rather than 4 hours). And performance is fairly
IO-bound at this point, (in the 10-minute run, less than 7 minutes of
CPU time is used).
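The multipliers quoted in this message can be checked against the rates in the benchmark table with a few lines of Python. The rates are copied from the table; the dictionary keys and function names are made up for this sketch, and taking "half-million" as exactly 500000 messages, indexed at the same rate as the ~45000-message benchmark, is an assumption.

```python
# Rates in messages/sec, copied from the benchmark table.
rates = {
    "sup": 41,
    "c_gmime": 120,
    "headers_only": 538,
    "custom_parser": 1050,
}


def speedup(new, old):
    """Ratio of two measured indexing rates."""
    return rates[new] / rates[old]


def minutes_to_index(n_messages, rate):
    """Projected wall-clock minutes, assuming the rate holds at this size."""
    return n_messages / rate / 60


if __name__ == "__main__":
    print(f"ruby -> C/GMime:       {speedup('c_gmime', 'sup'):.1f}x")
    print(f"full -> headers only:  {speedup('headers_only', 'c_gmime'):.1f}x")
    print(f"GMime -> custom parser:{speedup('custom_parser', 'headers_only'):.1f}x")
    print(f"overall vs sup:        {speedup('custom_parser', 'sup'):.1f}x")
    # Assumed archive size: 500000 messages.
    print(f"archive at 1050/s:     {minutes_to_index(500_000, 1050):.1f} min")
    print(f"archive at 41/s:       {minutes_to_index(500_000, 41) / 60:.1f} h")
```

Under those assumptions the projection comes out at roughly 8 minutes versus about 3.4 hours, in line with the "less than 10 minutes" and "4 hours" figures above.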
Anyway, there are some ideas to consider for sup.
If anyone wants to play with my code, it's here:
git clone git://notmuch.org/notmuch
I won't bore the list with further developments in notmuch, if any,
unless it's on-topic, (such as someone trying to make sup work on top
of an index built by notmuch). And I'd be delighted to see that kind
of thing happen.
Happy hacking,
-Carl