* [sup-talk] [PATCH 0/18] Xapian-based index
From: Rich Lane @ 2009-06-20 20:49 UTC

This patch series refactors the Index class to remove Ferret-isms and
support multiple index implementations. The included XapianIndex is a bit
faster at indexing messages and significantly faster when searching,
because it precomputes thread membership. It also works on Ruby 1.9.1. You
can enable the new index with the environment variable SUP_INDEX=xapian.

It's missing a couple of features, notably threading by subject. I'm sure
there are many more bugs left, so I'd appreciate any testing or review you
all can provide.

These patches depend on the two I posted June 16: 'cleanup interface' and
'consistent naming'.
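The SUP_INDEX switch mentioned above can be sketched as follows. This is a hypothetical illustration of that kind of environment-variable dispatch, not the actual Sup startup code; the empty FerretIndex/XapianIndex classes are stand-ins.

```ruby
# Hypothetical sketch of a SUP_INDEX-style dispatch -- not the actual Sup
# startup code. FerretIndex/XapianIndex here are empty stand-in classes.
class FerretIndex; end
class XapianIndex; end

def index_class env = ENV
  case env["SUP_INDEX"]
  when "xapian" then XapianIndex   # opt in to the new index
  else FerretIndex                 # Ferret remains the default
  end
end

puts index_class({})                          # default implementation
puts index_class("SUP_INDEX" => "xapian")     # Xapian implementation
```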
* [sup-talk] [PATCH 01/18] remove load_entry_for_id call in sup-recover-sources
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 bin/sup-recover-sources |   12 +++++-------
 lib/sup/index.rb        |    6 ++++++
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/bin/sup-recover-sources b/bin/sup-recover-sources
index d3b1424..6e3810c 100755
--- a/bin/sup-recover-sources
+++ b/bin/sup-recover-sources
@@ -69,15 +69,14 @@ ARGV.each do |fn|
       Redwood::MBox::Loader.new(fn, nil, !$opts[:unusual], $opts[:archive])
     end

-  source_ids = {}
+  source_ids = Hash.new 0
   count = 0
   source.each do |offset, labels|
     m = Redwood::Message.new :source => source, :source_info => offset
-    docid, entry = index.load_entry_for_id m.id
-    next unless entry
-    #puts "# #{source} #{offset} #{entry[:source_id]}"
-
-    source_ids[entry[:source_id]] = (source_ids[entry[:source_id]] || 0) + 1
+    m.load_from_source!
+    source_id = index.source_for_id m.id
+    next unless source_id
+    source_ids[source_id] += 1
     count += 1
     break if count == $opts[:scan_num]
   end
@@ -86,7 +85,6 @@ ARGV.each do |fn|
     id = source_ids.keys.first.to_i
     puts "assigned #{source} to #{source_ids.keys.first}"
     source.id = id
-    source.seek_to! source.total
     index.add_source source
   else
     puts ">> unable to determine #{source}: #{source_ids.inspect}"

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index d15e7bb..b5d0e5d 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -494,6 +494,12 @@ EOS
     @index_mutex.synchronize { @index.optimize }
   end

+  def source_for_id id
+    entry = @index[id]
+    return unless entry
+    entry[:source_id].to_i
+  end
+
   class ParseError < StandardError; end

   ## parse a query string from the user. returns a query object
--
1.6.0.4
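The counting change in sup-recover-sources above leans on `Hash.new 0`, whose default value of 0 lets each lookup be incremented directly instead of nil-guarding. A standalone illustration of the idiom:

```ruby
# Hash.new(0) gives missing keys a default of 0, replacing the old
# nil-guarded form: h[k] = (h[k] || 0) + 1.
source_ids = Hash.new 0
[7, 7, 3].each { |source_id| source_ids[source_id] += 1 }

p source_ids        # source 7 was seen twice, source 3 once
p source_ids[99]    # => 0 (missing keys count as zero)
```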
* [sup-talk] [PATCH 02/18] remove load_entry_for_id call in DraftManager.discard
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 lib/sup/draft.rb |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/lib/sup/draft.rb b/lib/sup/draft.rb
index 9127739..1233945 100644
--- a/lib/sup/draft.rb
+++ b/lib/sup/draft.rb
@@ -31,14 +31,9 @@ class DraftManager
   end

   def discard m
-    docid, entry = Index.load_entry_for_id m.id
-    unless entry
-      Redwood::log "can't find entry for draft: #{m.id.inspect}. You probably already discarded it."
-      return
-    end
-    raise ArgumentError, "not a draft: source id #{entry[:source_id].inspect}, should be #{DraftManager.source_id.inspect} for #{m.id.inspect} / docno #{docid}" unless entry[:source_id].to_i == DraftManager.source_id
+    raise ArgumentError, "not a draft: source id #{m.source.id.inspect}, should be #{DraftManager.source_id.inspect} for #{m.id.inspect}" unless m.source.id.to_i == DraftManager.source_id
     Index.delete m.id
-    File.delete @source.fn_for_offset(entry[:source_info])
+    File.delete @source.fn_for_offset(m.source_info)
     UpdateManager.relay self, :single_message_deleted, m
   end
 end
--
1.6.0.4
* [sup-talk] [PATCH 03/18] remove ferret entry from poll/sync interface
From: Rich Lane @ 2009-06-20 20:50 UTC

This leads to an extra index lookup in the sup-sync update path, but I
think it's worth it for the sake of API simplicity.
---
 bin/sup-sync       |    8 ++++----
 bin/sup-sync-back  |    6 +++---
 lib/sup/index.rb   |   18 ++++--------------
 lib/sup/message.rb |    6 ++++++
 lib/sup/poll.rb    |   33 ++++++++++++++-------------------
 lib/sup/sent.rb    |    2 +-
 6 files changed, 32 insertions(+), 41 deletions(-)

diff --git a/bin/sup-sync b/bin/sup-sync
index a759cbe..18a3cab 100755
--- a/bin/sup-sync
+++ b/bin/sup-sync
@@ -137,7 +137,7 @@ begin
     num_added = num_updated = num_scanned = num_restored = 0
     last_info_time = start_time = Time.now

-    Redwood::PollManager.add_messages_from source, :force_overwrite => true do |m, offset, entry|
+    Redwood::PollManager.add_messages_from source, :force_overwrite => true do |m_old, m, offset|
       num_scanned += 1
       seen[m.id] = true
@@ -153,10 +153,10 @@ begin
       ## skip if we're operating only on changed messages, the message
       ## is in the index, and it's unchanged from what the source is
       ## reporting.
-      next if target == :changed && entry && entry[:source_id].to_i == source.id && entry[:source_info].to_i == offset
+      next if target == :changed && m_old && m_old.source.id == source.id && m_old.source_info == offset

       ## get the state currently in the index
-      index_state = entry[:label].symbolistize if entry
+      index_state = m_old.labels.dup if m_old

       ## skip if we're operating on restored messages, and this one
       ## ain't.
@@ -196,7 +196,7 @@ begin
         puts "Adding message #{source}##{offset} from #{m.from} with state {#{m.labels * ', '}}" if opts[:verbose]
         num_added += 1
       else
-        puts "Updating message #{source}##{offset}, source #{entry[:source_id]} => #{source.id}, offset #{entry[:source_info]} => #{offset}, state {#{index_state * ', '}} => {#{m.labels * ', '}}" if opts[:verbose]
+        puts "Updating message #{source}##{offset}, source #{m_old.source.id} => #{source.id}, offset #{m_old.source_info} => #{offset}, state {#{index_state * ', '}} => {#{m.labels * ', '}}" if opts[:verbose]
         num_updated += 1
       end

diff --git a/bin/sup-sync-back b/bin/sup-sync-back
index 4f1387e..1c746d2 100755
--- a/bin/sup-sync-back
+++ b/bin/sup-sync-back
@@ -105,11 +105,11 @@ EOS
   num_dropped = num_moved = num_scanned = 0

   out_fp = Tempfile.new "sup-sync-back-#{source.id}"
-  Redwood::PollManager.add_messages_from source do |m, offset, entry|
+  Redwood::PollManager.add_messages_from source do |m_old, m, offset|
     num_scanned += 1
-    if entry
-      labels = entry[:label].symbolistize.to_boolean_h
+    if m_old
+      labels = m_old.labels

       if labels.member? :deleted
         if opts[:drop_deleted]

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index b5d0e5d..89795da 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -174,16 +174,10 @@ EOS
   ## Syncs the message to the index, replacing any previous version. adding
   ## either way. Index state will be determined by the message's #labels
   ## accessor.
-  ##
-  ## if need_load is false, docid and entry are assumed to be set to the
-  ## result of load_entry_for_id (which can be nil).
-  def sync_message m, need_load=true, docid=nil, entry=nil, opts={}
-    docid, entry = load_entry_for_id m.id if need_load
+  def sync_message m, opts={}
+    entry = @index[m.id]

     raise "no source info for message #{m.id}" unless m.source && m.source_info
-    @index_mutex.synchronize do
-      raise "trying to delete non-corresponding entry #{docid} with index message-id #{@index[docid][:message_id].inspect} and parameter message id #{m.id.inspect}" if docid && @index[docid][:message_id] != m.id
-    end

     source_id = if m.source.is_a? Integer
       m.source
@@ -256,13 +250,9 @@ EOS
     }

     @index_mutex.synchronize do
-      @index.delete docid if docid
+      @index.delete m.id
       @index.add_document d
     end
-
-    ## this hasn't been triggered in a long time.
-    ## docid, entry = load_entry_for_id m.id
-    ## raise "just added message #{m.id.inspect} but couldn't find it in a search" unless docid
   end

   def save_index fn=File.join(@dir, "ferret")
@@ -391,7 +381,7 @@ EOS
   ## builds a message object from a ferret result
   def build_message docid
     @index_mutex.synchronize do
-      doc = @index[docid]
+      doc = @index[docid] or return

       source = @source_mutex.synchronize { @sources[doc[:source_id].to_i] }
       raise "invalid source #{doc[:source_id]}" unless source

diff --git a/lib/sup/message.rb b/lib/sup/message.rb
index 8525fdf..b667cb3 100644
--- a/lib/sup/message.rb
+++ b/lib/sup/message.rb
@@ -288,6 +288,12 @@ EOS
      "Subject: #{@subj}"]
   end

+  def self.build_from_source source, source_info
+    m = Message.new :source => source, :source_info => source_info
+    m.load_from_source!
+    m
+  end
+
   private

   ## here's where we handle decoding mime attachments. unfortunately

diff --git a/lib/sup/poll.rb b/lib/sup/poll.rb
index 74f7d1c..bbad5f2 100644
--- a/lib/sup/poll.rb
+++ b/lib/sup/poll.rb
@@ -95,11 +95,11 @@ EOS
         num = 0
         numi = 0
-        add_messages_from source do |m, offset, entry|
+        add_messages_from source do |m_old, m, offset|
           ## always preserve the labels on disk.
-          m.labels = ((m.labels - [:unread, :inbox]) + entry[:label].symbolistize).uniq if entry
+          m.labels = ((m.labels - [:unread, :inbox]) + m_old.labels).uniq if m_old
           yield "Found message at #{offset} with labels {#{m.labels * ', '}}"
-          unless entry
+          unless m_old
             num += 1
             from_and_subj << [m.from && m.from.longname, m.subj] if m.has_label?(:inbox) && ([:spam, :deleted, :killed] & m.labels).empty?
@@ -138,29 +138,24 @@ EOS
     begin
       return if source.done? || source.has_errors?

-      source.each do |offset, labels|
+      source.each do |offset, default_labels|
         if source.has_errors?
           Redwood::log "error loading messages from #{source}: #{source.error.message}"
           return
         end

-        labels << :sent if source.uri.eql?(SentManager.source_uri)
-        labels.each { |l| LabelManager << l }
-        labels = labels + (source.archived? ? [] : [:inbox])
+        m_new = Message.build_from_source source, offset
+        m_old = Index.build_message m_new.id

-        m = Message.new :source => source, :source_info => offset, :labels => labels
-        m.load_from_source!
+        m_new.labels = default_labels + (source.archived? ? [] : [:inbox])
+        m_new.labels << :sent if source.uri.eql?(SentManager.source_uri)
+        m_new.labels.delete :unread if m_new.source_marked_read?
+        m_new.labels.each { |l| LabelManager << l }

-        if m.source_marked_read?
-          m.remove_label :unread
-          labels.delete :unread
-        end
-
-        docid, entry = Index.load_entry_for_id m.id
-        HookManager.run "before-add-message", :message => m
-        m = yield(m, offset, entry) or next if block_given?
-        times = Index.sync_message m, false, docid, entry, opts
-        UpdateManager.relay self, :added, m unless entry
+        HookManager.run "before-add-message", :message => m_new
+        m_ret = yield(m_old, m_new, offset) or next if block_given?
+        Index.sync_message m_ret, opts
+        UpdateManager.relay self, :added, m_ret unless m_old
       end
     rescue SourceError => e
       Redwood::log "problem getting messages from #{source}: #{e.message}"

diff --git a/lib/sup/sent.rb b/lib/sup/sent.rb
index e6ae856..b750d71 100644
--- a/lib/sup/sent.rb
+++ b/lib/sup/sent.rb
@@ -30,7 +30,7 @@ class SentManager
   def write_sent_message date, from_email, &block
     @source.store_message date, from_email, &block

-    PollManager.add_messages_from(@source) do |m, o, e|
+    PollManager.add_messages_from(@source) do |m_old, m, offset|
       m.remove_label :unread
       m
     end
--
1.6.0.4
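After this patch, the `add_messages_from` block receives `(m_old, m_new, offset)` and returns the message to sync, or nil to skip it. A minimal sketch of that contract, using a simplified stand-in for the real PollManager and plain strings in place of Message objects:

```ruby
# Simplified stand-in for PollManager.add_messages_from after this patch:
# the block gets (old_message, new_message, offset) and returns the message
# to sync, or nil/false to skip it entirely.
def add_messages_from entries
  synced = []
  entries.each do |m_old, m_new, offset|
    m = yield(m_old, m_new, offset) or next   # nil/false skips this message
    synced << m
  end
  synced
end

entries = [[nil, "new-a", 0], ["old-b", "new-b", 1]]
# keep only messages not already in the index (m_old is nil)
kept = add_messages_from(entries) { |m_old, m, offset| m_old ? nil : m }
p kept   # => ["new-a"]
```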
* [sup-talk] [PATCH 04/18] index: remove unused method load_entry_for_id
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 lib/sup/index.rb |   11 -----------
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index 89795da..64afbdd 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -411,17 +411,6 @@ EOS

   def delete id; @index_mutex.synchronize { @index.delete id } end

-  def load_entry_for_id mid
-    @index_mutex.synchronize do
-      results = @index.search Ferret::Search::TermQuery.new(:message_id, mid)
-      return if results.total_hits == 0
-      docid = results.hits[0].doc
-      entry = @index[docid]
-      entry_dup = entry.fields.inject({}) { |h, f| h[f] = entry[f]; h }
-      [docid, entry_dup]
-    end
-  end
-
   def load_contacts emails, h={}
     q = Ferret::Search::BooleanQuery.new true
     emails.each do |e|
--
1.6.0.4
* [sup-talk] [PATCH 05/18] switch DraftManager to use Message.build_from_source
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 lib/sup/draft.rb |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/lib/sup/draft.rb b/lib/sup/draft.rb
index 1233945..dd4574d 100644
--- a/lib/sup/draft.rb
+++ b/lib/sup/draft.rb
@@ -21,7 +21,8 @@ class DraftManager
     my_message = nil
     @source.each do |thisoffset, theselabels|
-      m = Message.new :source => @source, :source_info => thisoffset, :labels => theselabels
+      m = Message.build_from_source @source, thisoffset
+      m.labels = theselabels
       Index.sync_message m
       UpdateManager.relay self, :added, m
       my_message = m if thisoffset == offset
--
1.6.0.4
* [sup-talk] [PATCH 06/18] index: move has_any_from_source_with_label? to sup-sync-back
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 bin/sup-sync-back |    7 ++++++-
 lib/sup/index.rb  |    7 -------
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/bin/sup-sync-back b/bin/sup-sync-back
index 1c746d2..05b9e8c 100755
--- a/bin/sup-sync-back
+++ b/bin/sup-sync-back
@@ -4,6 +4,7 @@ require 'rubygems'
 require 'uri'
 require 'tempfile'
 require 'trollop'
+require 'enumerator'
 require "sup"

 ## save a message 'm' to an open file pointer 'fp'
@@ -14,6 +15,10 @@ def die msg
   $stderr.puts "Error: #{msg}"
   exit(-1)
 end
+def has_any_from_source_with_label? index, source, label
+  query = { :source_id => source.id, :label => label, :limit => 1 }
+  not Enumerable::Enumerator.new(index, :each_docid, query).map.empty?
+end

 opts = Trollop::options do
   version "sup-sync-back (sup #{Redwood::VERSION})"
@@ -96,7 +101,7 @@ EOS
 sources.each do |source|
   $stderr.puts "Scanning #{source}..."
-  unless ((opts[:drop_deleted] || opts[:move_deleted]) && index.has_any_from_source_with_label?(source, :deleted)) || ((opts[:drop_spam] || opts[:move_spam]) && index.has_any_from_source_with_label?(source, :spam))
+  unless ((opts[:drop_deleted] || opts[:move_deleted]) && has_any_from_source_with_label?(index, source, :deleted)) || ((opts[:drop_spam] || opts[:move_spam]) && has_any_from_source_with_label?(index, source, :spam))
     $stderr.puts "Nothing to do from this source; skipping"
     next
   end

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index 64afbdd..b9f4b36 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -450,13 +450,6 @@ EOS
     end
   end

-  def has_any_from_source_with_label? source, label
-    q = Ferret::Search::BooleanQuery.new
-    q.add_query Ferret::Search::TermQuery.new("source_id", source.id.to_s), :must
-    q.add_query Ferret::Search::TermQuery.new("label", label.to_s), :must
-    @index_mutex.synchronize { @index.search(q, :limit => 1).total_hits > 0 }
-  end
-
   def each_docid query={}
     ferret_query = build_ferret_query query
     results = @index_mutex.synchronize { @index.search ferret_query, :limit => (query[:limit] || :all) }
--
1.6.0.4
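The moved helper is really an existence check: "does at least one docid match this source/label?", with `:limit => 1` so the index only has to produce a single hit. A standalone sketch of that semantics, using a hypothetical FakeIndex in place of the real each_docid-providing Index:

```ruby
# FakeIndex is a stand-in for the real Index: each_docid yields matching
# document ids, honoring an optional :limit in the query.
class FakeIndex
  def initialize docids; @docids = docids end

  def each_docid query = {}
    limited = query[:limit] ? @docids.first(query[:limit]) : @docids
    limited.each { |d| yield d }
  end
end

# Existence check in the spirit of the moved helper: ask for at most one
# hit, and report whether anything came back.
def any_docid? index, query
  found = false
  index.each_docid(query.merge(:limit => 1)) { found = true }
  found
end

p any_docid?(FakeIndex.new([4, 5]), :label => :deleted)   # => true
p any_docid?(FakeIndex.new([]), :label => :spam)          # => false
```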
* [sup-talk] [PATCH 07/18] move source-related methods to SourceManager
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 bin/sup                 |   10 ++++----
 bin/sup-add             |   10 ++++----
 bin/sup-config          |   14 +++++-----
 bin/sup-recover-sources |    7 +++--
 bin/sup-sync            |    6 ++--
 bin/sup-sync-back       |    4 +-
 bin/sup-tweak-labels    |    4 +-
 lib/sup.rb              |    5 ++-
 lib/sup/index.rb        |   52 ++----------------------------------------
 lib/sup/poll.rb         |    2 +-
 lib/sup/source.rb       |   57 +++++++++++++++++++++++++++++++++++++++++++++++
 11 files changed, 92 insertions(+), 79 deletions(-)

diff --git a/bin/sup b/bin/sup
index 302ad7c..1febefd 100755
--- a/bin/sup
+++ b/bin/sup
@@ -160,17 +160,17 @@ begin
   Redwood::start
   Index.load

-  if(s = Index.source_for DraftManager.source_name)
+  if(s = Redwood::SourceManager.source_for DraftManager.source_name)
     DraftManager.source = s
   else
     Redwood::log "no draft source, auto-adding..."
-    Index.add_source DraftManager.new_source
+    Redwood::SourceManager.add_source DraftManager.new_source
   end

-  if(s = Index.source_for SentManager.source_uri)
+  if(s = Redwood::SourceManager.source_for SentManager.source_uri)
     SentManager.source = s
   else
-    Index.add_source SentManager.default_source
+    Redwood::SourceManager.add_source SentManager.default_source
   end

   HookManager.run "startup"
@@ -190,7 +190,7 @@ begin
   bm.draw_screen

-  Index.usual_sources.each do |s|
+  Redwood::SourceManager.usual_sources.each do |s|
     next unless s.respond_to? :connect
     reporting_thread("call #connect on #{s}") do
       begin

diff --git a/bin/sup-add b/bin/sup-add
index 50bbb29..c491ca7 100755
--- a/bin/sup-add
+++ b/bin/sup-add
@@ -82,12 +82,12 @@ index = Redwood::Index.new
 index.lock_or_die

 begin
-  index.load_sources
+  Redwood::SourceManager.load_sources

   ARGV.each do |uri|
     labels = $opts[:labels] ? $opts[:labels].split(/\s*,\s*/).uniq : []

-    if !$opts[:force_new] && index.source_for(uri)
+    if !$opts[:force_new] && Redwood::SourceManager.source_for(uri)
       say "Already know about #{uri}; skipping."
       next
     end
@@ -99,10 +99,10 @@ begin
       when "mbox+ssh"
         say "For SSH connections, if you will use public key authentication, you may leave the username and password blank."
         say ""
-        username, password = get_login_info uri, index.sources
+        username, password = get_login_info uri, Redwood::SourceManager.sources
         Redwood::MBox::SSHLoader.new uri, username, password, nil, !$opts[:unusual], $opts[:archive], nil, labels
       when "imap", "imaps"
-        username, password = get_login_info uri, index.sources
+        username, password = get_login_info uri, Redwood::SourceManager.sources
         Redwood::IMAP.new uri, username, password, nil, !$opts[:unusual], $opts[:archive], nil, labels
       when "maildir"
         Redwood::Maildir.new uri, nil, !$opts[:unusual], $opts[:archive], nil, labels
@@ -114,7 +114,7 @@ begin
       Trollop::die "Unknown source type #{parsed_uri.scheme.inspect}"
     end
     say "Adding #{source}..."
-    index.add_source source
+    Redwood::SourceManager.add_source source
   end
 ensure
   index.save

diff --git a/bin/sup-config b/bin/sup-config
index 398197f..9fcbee6 100755
--- a/bin/sup-config
+++ b/bin/sup-config
@@ -152,7 +152,7 @@ end
 $terminal.wrap_at = :auto

 Redwood::start
 index = Redwood::Index.new
-index.load_sources
+Redwood::SourceManager.load_sources

 say <<EOS
 Howdy neighbor! This here's sup-config, ready to help you jack in to
@@ -191,12 +191,12 @@ $config[:editor] = editor
 done = false
 until done
   say "\nNow, we'll tell Sup where to find all your email."
-  index.load_sources
+  Redwood::SourceManager.load_sources
   say "Current sources:"
-  if index.sources.empty?
+  if Redwood::SourceManager.sources.empty?
     say "  No sources!"
   else
-    index.sources.each { |s| puts "* #{s}" }
+    Redwood::SourceManager.sources.each { |s| puts "* #{s}" }
   end

   say "\n"
@@ -210,8 +210,8 @@ end
 say "\nSup needs to know where to store your sent messages."
 say "Only sources capable of storing mail will be listed.\n\n"

-index.load_sources
-if index.sources.empty?
+Redwood::SourceManager.load_sources
+if Redwood::SourceManager.sources.empty?
   say "\nUsing the default sup://sent, since you haven't configured other sources yet."
   $config[:sent_source] = 'sup://sent'
 else
@@ -222,7 +222,7 @@ else
   choose do |menu|
     menu.prompt = "Store my sent mail in? "

-    valid_sents = index.sources.each do |s|
+    valid_sents = Redwood::SourceManager.sources.each do |s|
       have_sup_sent = true if s.to_s.eql?('sup://sent')
       menu.choice(s.to_s) { $config[:sent_source] = s.to_s } if s.respond_to? :store_message

diff --git a/bin/sup-recover-sources b/bin/sup-recover-sources
index 6e3810c..db75b11 100755
--- a/bin/sup-recover-sources
+++ b/bin/sup-recover-sources
@@ -48,13 +48,14 @@ EOS
 end.parse(ARGV)

 require "sup"
+Redwood::start

 puts "loading index..."
 index = Redwood::Index.new
 index.load
 puts "loaded index of #{index.size} messages"

 ARGV.each do |fn|
-  next if index.source_for fn
+  next if Redwood::SourceManager.source_for fn

   ## TODO: merge this code with the same snippet in import
   source =
@@ -74,7 +75,7 @@ ARGV.each do |fn|
   source.each do |offset, labels|
     m = Redwood::Message.new :source => source, :source_info => offset
     m.load_from_source!
-    source_id = index.source_for_id m.id
+    source_id = Redwood::SourceManager.source_for_id m.id
     next unless source_id
     source_ids[source_id] += 1
     count += 1
@@ -85,7 +86,7 @@ ARGV.each do |fn|
     id = source_ids.keys.first.to_i
     puts "assigned #{source} to #{source_ids.keys.first}"
     source.id = id
-    index.add_source source
+    Redwood::SourceManager.add_source source
   else
     puts ">> unable to determine #{source}: #{source_ids.inspect}"
   end

diff --git a/bin/sup-sync b/bin/sup-sync
index 18a3cab..270524a 100755
--- a/bin/sup-sync
+++ b/bin/sup-sync
@@ -116,11 +116,11 @@ begin
   index.load

   sources = ARGV.map do |uri|
-    index.source_for uri or Trollop::die "Unknown source: #{uri}. Did you add it with sup-add first?"
+    Redwood::SourceManager.source_for uri or Trollop::die "Unknown source: #{uri}. Did you add it with sup-add first?"
   end

-  sources = index.usual_sources if sources.empty?
-  sources = index.sources if opts[:all_sources]
+  sources = Redwood::SourceManager.usual_sources if sources.empty?
+  sources = Redwood::SourceManager.sources if opts[:all_sources]

   unless target == :new
     if opts[:start_at]

diff --git a/bin/sup-sync-back b/bin/sup-sync-back
index 05b9e8c..679e03a 100755
--- a/bin/sup-sync-back
+++ b/bin/sup-sync-back
@@ -80,13 +80,13 @@ begin
   index.load

   sources = ARGV.map do |uri|
-    s = index.source_for(uri) or die "unknown source: #{uri}. Did you add it with sup-add first?"
+    s = Redwood::SourceManager.source_for(uri) or die "unknown source: #{uri}. Did you add it with sup-add first?"
     s.is_a?(Redwood::MBox::Loader) or die "#{uri} is not an mbox source."
     s
   end

   if sources.empty?
-    sources = index.usual_sources.select { |s| s.is_a? Redwood::MBox::Loader }
+    sources = Redwood::SourceManager.usual_sources.select { |s| s.is_a? Redwood::MBox::Loader }
   end

   unless sources.all? { |s| s.file_path.nil? } || File.executable?(dotlockfile) || opts[:dont_use_dotlockfile]

diff --git a/bin/sup-tweak-labels b/bin/sup-tweak-labels
index 6f603e2..95a3b03 100755
--- a/bin/sup-tweak-labels
+++ b/bin/sup-tweak-labels
@@ -66,10 +66,10 @@ begin

   source_ids = if opts[:all_sources]
-    index.sources
+    Redwood::SourceManager.sources
   else
     ARGV.map do |uri|
-      index.source_for uri or Trollop::die "Unknown source: #{uri}. Did you add it with sup-add first?"
+      Redwood::SourceManager.source_for uri or Trollop::die "Unknown source: #{uri}. Did you add it with sup-add first?"
     end
   end.map { |s| s.id }
   Trollop::die "nothing to do: no sources" if source_ids.empty?

diff --git a/lib/sup.rb b/lib/sup.rb
index 8373820..5689c2b 100644
--- a/lib/sup.rb
+++ b/lib/sup.rb
@@ -115,6 +115,7 @@ module Redwood
     Redwood::SuicideManager.new Redwood::SUICIDE_FN
     Redwood::CryptoManager.new
     Redwood::UndoManager.new
+    Redwood::SourceManager.new
   end

   def finish
@@ -130,7 +131,7 @@ module Redwood
   def report_broken_sources opts={}
     return unless BufferManager.instantiated?

-    broken_sources = Index.sources.select { |s| s.error.is_a? FatalSourceError }
+    broken_sources = SourceManager.sources.select { |s| s.error.is_a? FatalSourceError }
     unless broken_sources.empty?
       BufferManager.spawn_unless_exists("Broken source notification for #{broken_sources.join(',')}", opts) do
         TextMode.new(<<EOM)
@@ -147,7 +148,7 @@ EOM
       end
     end

-    desynced_sources = Index.sources.select { |s| s.error.is_a? OutOfSyncSourceError }
+    desynced_sources = SourceManager.sources.select { |s| s.error.is_a? OutOfSyncSourceError }
     unless desynced_sources.empty?
       BufferManager.spawn_unless_exists("Out-of-sync source notification for #{broken_sources.join(',')}", opts) do
         TextMode.new(<<EOM)

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index b9f4b36..7d6258d 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -26,11 +26,7 @@ class Index

   def initialize dir=BASE_DIR
     @index_mutex = Monitor.new
-
     @dir = dir
-    @sources = {}
-    @sources_dirty = false
-    @source_mutex = Monitor.new

     wsa = Ferret::Analysis::WhiteSpaceAnalyzer.new false
     sa = Ferret::Analysis::StandardAnalyzer.new [], true
@@ -112,36 +108,17 @@ EOS
   end

   def load
-    load_sources
+    SourceManager.load_sources
     load_index
   end

   def save
     Redwood::log "saving index and sources..."
     FileUtils.mkdir_p @dir unless File.exists? @dir
-    save_sources
+    SourceManager.save_sources
     save_index
   end

-  def add_source source
-    @source_mutex.synchronize do
-      raise "duplicate source!" if @sources.include? source
-      @sources_dirty = true
-      max = @sources.max_of { |id, s| s.is_a?(DraftLoader) || s.is_a?(SentLoader) ? 0 : id }
-      source.id ||= (max || 0) + 1
-      ##source.id += 1 while @sources.member? source.id
-      @sources[source.id] = source
-    end
-  end
-
-  def sources
-    ## favour the inbox by listing non-archived sources first
-    @source_mutex.synchronize { @sources.values }.sort_by { |s| s.id }.partition { |s| !s.archived? }.flatten
-  end
-
-  def source_for uri; sources.find { |s| s.is_source_for? uri }; end
-  def usual_sources; sources.find_all { |s| s.usual? }; end
-
   def load_index dir=File.join(@dir, "ferret")
     if File.exists? dir
       Redwood::log "loading index..."
@@ -383,7 +360,7 @@ EOS
     @index_mutex.synchronize do
       doc = @index[docid] or return

-      source = @source_mutex.synchronize { @sources[doc[:source_id].to_i] }
+      source = SourceManager[doc[:source_id].to_i]
       raise "invalid source #{doc[:source_id]}" unless source

       #puts "building message #{doc[:message_id]} (#{source}##{doc[:source_info]})"
@@ -442,14 +419,6 @@ EOS
     contacts.keys.compact
   end

-  def load_sources fn=Redwood::SOURCE_FN
-    source_array = (Redwood::load_yaml_obj(fn) || []).map { |o| Recoverable.new o }
-    @source_mutex.synchronize do
-      @sources = Hash[*(source_array).map { |s| [s.id, s] }.flatten]
-      @sources_dirty = false
-    end
-  end
-
   def each_docid query={}
     ferret_query = build_ferret_query query
     results = @index_mutex.synchronize { @index.search ferret_query, :limit => (query[:limit] || :all) }
@@ -604,21 +573,6 @@ private
     q.add_query Ferret::Search::TermQuery.new("source_id", query[:source_id]), :must if query[:source_id]
     q
   end
-
-  def save_sources fn=Redwood::SOURCE_FN
-    @source_mutex.synchronize do
-      if @sources_dirty || @sources.any? { |id, s| s.dirty? }
-        bakfn = fn + ".bak"
-        if File.exists? fn
-          File.chmod 0600, fn
-          FileUtils.mv fn, bakfn, :force => true unless File.exists?(bakfn) && File.size(fn) == 0
-        end
-        Redwood::save_yaml_obj sources.sort_by { |s| s.id.to_i }, fn, true
-        File.chmod 0600, fn
-      end
-      @sources_dirty = false
-    end
-  end
 end

 end

diff --git a/lib/sup/poll.rb b/lib/sup/poll.rb
index bbad5f2..c83290c 100644
--- a/lib/sup/poll.rb
+++ b/lib/sup/poll.rb
@@ -83,7 +83,7 @@ EOS
     from_and_subj_inbox = []

     @mutex.synchronize do
-      Index.usual_sources.each do |source|
+      SourceManager.usual_sources.each do |source|
 #        yield "source #{source} is done? #{source.done?} (cur_offset #{source.cur_offset} >= #{source.end_offset})"
         begin
           yield "Loading from #{source}... " unless source.done? || (source.respond_to?(:has_errors?) && source.has_errors?)

diff --git a/lib/sup/source.rb b/lib/sup/source.rb
index fb98dbc..1bb7797 100644
--- a/lib/sup/source.rb
+++ b/lib/sup/source.rb
@@ -155,4 +155,61 @@ protected
   end
 end

+class SourceManager
+  include Singleton
+
+  def initialize
+    @sources = {}
+    @sources_dirty = false
+    @source_mutex = Monitor.new
+    self.class.i_am_the_instance self
+  end
+
+  def [](id)
+    @source_mutex.synchronize { @sources[id] }
+  end
+
+  def add_source source
+    @source_mutex.synchronize do
+      raise "duplicate source!" if @sources.include? source
+      @sources_dirty = true
+      max = @sources.max_of { |id, s| s.is_a?(DraftLoader) || s.is_a?(SentLoader) ? 0 : id }
+      source.id ||= (max || 0) + 1
+      ##source.id += 1 while @sources.member? source.id
+      @sources[source.id] = source
+    end
+  end
+
+  def sources
+    ## favour the inbox by listing non-archived sources first
+    @source_mutex.synchronize { @sources.values }.sort_by { |s| s.id }.partition { |s| !s.archived? }.flatten
+  end
+
+  def source_for uri; sources.find { |s| s.is_source_for? uri }; end
+  def usual_sources; sources.find_all { |s| s.usual? }; end
+
+  def load_sources fn=Redwood::SOURCE_FN
+    source_array = (Redwood::load_yaml_obj(fn) || []).map { |o| Recoverable.new o }
+    @source_mutex.synchronize do
+      @sources = Hash[*(source_array).map { |s| [s.id, s] }.flatten]
+      @sources_dirty = false
+    end
+  end
+
+  def save_sources fn=Redwood::SOURCE_FN
+    @source_mutex.synchronize do
+      if @sources_dirty || @sources.any? { |id, s| s.dirty? }
+        bakfn = fn + ".bak"
+        if File.exists? fn
+          File.chmod 0600, fn
+          FileUtils.mv fn, bakfn, :force => true unless File.exists?(bakfn) && File.size(fn) == 0
+        end
+        Redwood::save_yaml_obj sources.sort_by { |s| s.id.to_i }, fn, true
+        File.chmod 0600, fn
+      end
+      @sources_dirty = false
+    end
+  end
+end
+
 end
--
1.6.0.4
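SourceManager#add_source carries over the id-assignment rule from Index: a new source gets the highest ordinary id plus one, with draft/sent loaders treated as id 0 so they never influence the numbering. A standalone sketch of that rule, where `(id, special?)` pairs stand in for real Source objects and `next_source_id` is a hypothetical helper name:

```ruby
# Sketch of the id-assignment rule in add_source: map draft/sent ("special")
# sources to 0, take the max of the rest, and add one. next_source_id is a
# hypothetical name for illustration; the real code inlines this logic.
def next_source_id sources
  max = sources.map { |id, special| special ? 0 : id }.max
  (max || 0) + 1
end

p next_source_id([])                           # => 1 (first real source)
p next_source_id([[1, false], [2, false]])     # => 3
p next_source_id([[9998, true], [1, false]])   # => 2 (special ids ignored)
```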
* [sup-talk] [PATCH 08/18] index: remove unused method fresh_thread_id
From: Rich Lane @ 2009-06-20 20:50 UTC

---
 lib/sup/index.rb |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index 7d6258d..e3f9e69 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -382,7 +382,6 @@ EOS
     end
   end

-  def fresh_thread_id; @next_thread_id += 1; end
   def wrap_subj subj; "__START_SUBJECT__ #{subj} __END_SUBJECT__"; end
   def unwrap_subj subj; subj =~ /__START_SUBJECT__ (.*?) __END_SUBJECT__/ && $1; end
--
1.6.0.4
* [sup-talk] [PATCH 09/18] index: revert overeager opts->query rename in each_message_in_thread_for
  2009-06-20 20:50 ` [sup-talk] [PATCH 08/18] index: remove unused method fresh_thread_id Rich Lane
@ 2009-06-20 20:50 ` Rich Lane
  2009-06-20 20:50   ` [sup-talk] [PATCH 10/18] index: make wrap_subj methods private Rich Lane
  0 siblings, 1 reply; 44+ messages in thread
From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw)

---
 lib/sup/index.rb |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index e3f9e69..080a4ec 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -280,7 +280,7 @@ EOS
   ## is found.
   SAME_SUBJECT_DATE_LIMIT = 7
   MAX_CLAUSES = 1000
-  def each_message_in_thread_for m, query={}
+  def each_message_in_thread_for m, opts={}
     #Redwood::log "Building thread for #{m.id}: #{m.subj}"
     messages = {}
     searched = {}
@@ -310,7 +310,7 @@ EOS
       pending = (pending + p1 + p2).uniq
     end
 
-    until pending.empty? || (query[:limit] && messages.size >= query[:limit])
+    until pending.empty? || (opts[:limit] && messages.size >= opts[:limit])
      q = Ferret::Search::BooleanQuery.new true
      # this disappeared in newer ferrets... wtf.
      # q.max_clause_count = 2048
@@ -329,8 +329,8 @@ EOS
       killed = false
       @index_mutex.synchronize do
         @index.search_each(q, :limit => :all) do |docid, score|
-          break if query[:limit] && messages.size >= query[:limit]
-          if @index[docid][:label].split(/\s+/).include?("killed") && query[:skip_killed]
+          break if opts[:limit] && messages.size >= opts[:limit]
+          if @index[docid][:label].split(/\s+/).include?("killed") && opts[:skip_killed]
             killed = true
             break
           end
-- 
1.6.0.4

^ permalink raw reply	[flat|nested] 44+ messages in thread
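[Editor's note: for readers following the loop this patch touches, here is a toy, in-memory version of what `each_message_in_thread_for` does against the index: each round "queries" for messages whose message-id or references overlap the pending set, then feeds newly seen references back into pending, honoring `opts[:limit]` the same way. `REFS` and `thread_for` are hypothetical sample data and a hypothetical helper, not Sup code.]

```ruby
# Hypothetical sample data: message-id => list of referenced message-ids.
REFS = {
  "a@x" => [],
  "b@x" => ["a@x"],
  "c@x" => ["b@x"],
  "d@x" => ["a@x"],
}

# Breadth-first thread assembly, mimicking each_message_in_thread_for's
# pending/searched bookkeeping and its opts[:limit] early exit.
def thread_for root, opts={}
  messages = {}
  searched = {}
  pending = [root]
  until pending.empty? || (opts[:limit] && messages.size >= opts[:limit])
    batch = pending.uniq
    pending = []
    batch.each { |id| searched[id] = true }
    REFS.each do |mid, refs|
      next if messages.member? mid
      # stands in for: TermQuery(:message_id, id) OR TermQuery(:refs, id)
      next unless batch.include?(mid) || (refs & batch).any?
      break if opts[:limit] && messages.size >= opts[:limit]
      messages[mid] = true
      # newly discovered references (parents) get queried next round
      pending += (refs + [mid]).reject { |id| searched[id] }
    end
  end
  messages.keys.sort
end
```

Starting from a mid-thread message like "c@x", the walk reaches ancestors through the references it accumulates and descendants through the refs-field match, so the whole four-message thread is recovered.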
* [sup-talk] [PATCH 10/18] index: make wrap_subj methods private
  2009-06-20 20:50 ` [sup-talk] [PATCH 09/18] index: revert overeager opts->query rename in each_message_in_thread_for Rich Lane
@ 2009-06-20 20:50 ` Rich Lane
  2009-06-20 20:50   ` [sup-talk] [PATCH 11/18] index: move Ferret-specific code to ferret_index.rb Rich Lane
  0 siblings, 1 reply; 44+ messages in thread
From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw)

---
 lib/sup/index.rb |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/sup/index.rb b/lib/sup/index.rb
index 080a4ec..5ddd6ee 100644
--- a/lib/sup/index.rb
+++ b/lib/sup/index.rb
@@ -382,9 +382,6 @@ EOS
     end
   end
 
-  def wrap_subj subj; "__START_SUBJECT__ #{subj} __END_SUBJECT__"; end
-  def unwrap_subj subj; subj =~ /__START_SUBJECT__ (.*?) __END_SUBJECT__/ && $1; end
-
   def delete id; @index_mutex.synchronize { @index.delete id } end
 
   def load_contacts emails, h={}
@@ -572,6 +569,9 @@ private
     q.add_query Ferret::Search::TermQuery.new("source_id", query[:source_id]), :must if query[:source_id]
     q
   end
+
+  def wrap_subj subj; "__START_SUBJECT__ #{subj} __END_SUBJECT__"; end
+  def unwrap_subj subj; subj =~ /__START_SUBJECT__ (.*?) __END_SUBJECT__/ && $1; end
 end
 
 end
-- 
1.6.0.4

^ permalink raw reply	[flat|nested] 44+ messages in thread
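[Editor's note: what moving the helpers below the `private` keyword buys, sketched with a hypothetical ToyIndex rather than the real Index class: methods defined after `private` stay callable from inside the class but disappear from its public surface, which keeps the subject-wrapping convention an implementation detail of each index backend.]

```ruby
# Hypothetical stand-in showing Ruby's private-method semantics.
class ToyIndex
  # public API: internally uses the private helper
  def subject_key subj
    wrap_subj subj
  end

private

  # below `private`: invisible to outside callers
  def wrap_subj subj; "__START_SUBJECT__ #{subj} __END_SUBJECT__"; end
end

idx = ToyIndex.new
```

Calling `idx.wrap_subj` with an explicit receiver raises NoMethodError, and `idx.respond_to?(:wrap_subj)` is false, so nothing outside the class can come to depend on the sentinel format.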
* [sup-talk] [PATCH 11/18] index: move Ferret-specific code to ferret_index.rb 2009-06-20 20:50 ` [sup-talk] [PATCH 10/18] index: make wrap_subj methods private Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 12/18] remove last external uses of ferret docid Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup/ferret_index.rb | 463 +++++++++++++++++++++++++++++++++++++++++++++++ lib/sup/index.rb | 453 +++++----------------------------------------- 2 files changed, 509 insertions(+), 407 deletions(-) create mode 100644 lib/sup/ferret_index.rb diff --git a/lib/sup/ferret_index.rb b/lib/sup/ferret_index.rb new file mode 100644 index 0000000..53c19e0 --- /dev/null +++ b/lib/sup/ferret_index.rb @@ -0,0 +1,463 @@ +require 'ferret' + +module Redwood + +class FerretIndex < BaseIndex + + def initialize dir=BASE_DIR + super + + @index_mutex = Monitor.new + wsa = Ferret::Analysis::WhiteSpaceAnalyzer.new false + sa = Ferret::Analysis::StandardAnalyzer.new [], true + @analyzer = Ferret::Analysis::PerFieldAnalyzer.new wsa + @analyzer[:body] = sa + @analyzer[:subject] = sa + @qparser ||= Ferret::QueryParser.new :default_field => :body, :analyzer => @analyzer, :or_default => false + end + + def load_index dir=File.join(@dir, "ferret") + if File.exists? dir + Redwood::log "loading index..." + @index_mutex.synchronize do + @index = Ferret::Index::Index.new(:path => dir, :analyzer => @analyzer, :id_field => 'message_id') + Redwood::log "loaded index of #{@index.size} messages" + end + else + Redwood::log "creating index..." 
+ @index_mutex.synchronize do + field_infos = Ferret::Index::FieldInfos.new :store => :yes + field_infos.add_field :message_id, :index => :untokenized + field_infos.add_field :source_id + field_infos.add_field :source_info + field_infos.add_field :date, :index => :untokenized + field_infos.add_field :body + field_infos.add_field :label + field_infos.add_field :attachments + field_infos.add_field :subject + field_infos.add_field :from + field_infos.add_field :to + field_infos.add_field :refs + field_infos.add_field :snippet, :index => :no, :term_vector => :no + field_infos.create_index dir + @index = Ferret::Index::Index.new(:path => dir, :analyzer => @analyzer, :id_field => 'message_id') + end + end + end + + def sync_message m, opts={} + entry = @index[m.id] + + raise "no source info for message #{m.id}" unless m.source && m.source_info + + source_id = if m.source.is_a? Integer + m.source + else + m.source.id or raise "unregistered source #{m.source} (id #{m.source.id.inspect})" + end + + snippet = if m.snippet_contains_encrypted_content? && $config[:discard_snippets_from_encrypted_messages] + "" + else + m.snippet + end + + ## write the new document to the index. if the entry already exists in the + ## index, reuse it (which avoids having to reload the entry from the source, + ## which can be quite expensive for e.g. large threads of IMAP actions.) + ## + ## exception: if the index entry belongs to an earlier version of the + ## message, use everything from the new message instead, but union the + ## flags. this allows messages sent to mailing lists to have their header + ## updated and to have flags set properly. + ## + ## minor hack: messages in sources with lower ids have priority over + ## messages in sources with higher ids. so messages in the inbox will + ## override everyone, and messages in the sent box will be overridden + ## by everyone else. 
+ ## + ## written in this manner to support previous versions of the index which + ## did not keep around the entry body. upgrading is thus seamless. + entry ||= {} + labels = m.labels.uniq # override because this is the new state, unless... + + ## if we are a later version of a message, ignore what's in the index, + ## but merge in the labels. + if entry[:source_id] && entry[:source_info] && entry[:label] && + ((entry[:source_id].to_i > source_id) || (entry[:source_info].to_i < m.source_info)) + labels = (entry[:label].symbolistize + m.labels).uniq + #Redwood::log "found updated version of message #{m.id}: #{m.subj}" + #Redwood::log "previous version was at #{entry[:source_id].inspect}:#{entry[:source_info].inspect}, this version at #{source_id.inspect}:#{m.source_info.inspect}" + #Redwood::log "merged labels are #{labels.inspect} (index #{entry[:label].inspect}, message #{m.labels.inspect})" + entry = {} + end + + ## if force_overwite is true, ignore what's in the index. this is used + ## primarily by sup-sync to force index updates. + entry = {} if opts[:force_overwrite] + + d = { + :message_id => m.id, + :source_id => source_id, + :source_info => m.source_info, + :date => (entry[:date] || m.date.to_indexable_s), + :body => (entry[:body] || m.indexable_content), + :snippet => snippet, # always override + :label => labels.uniq.join(" "), + :attachments => (entry[:attachments] || m.attachments.uniq.join(" ")), + + ## always override :from and :to. + ## older versions of Sup would often store the wrong thing in the index + ## (because they were canonicalizing email addresses, resulting in the + ## wrong name associated with each.) the correct address is read from + ## the original header when these messages are opened in thread-view-mode, + ## so this allows people to forcibly update the address in the index by + ## marking those threads for saving. + :from => (m.from ? 
m.from.indexable_content : ""), + :to => (m.to + m.cc + m.bcc).map { |x| x.indexable_content }.join(" "), + + :subject => (entry[:subject] || wrap_subj(Message.normalize_subj(m.subj))), + :refs => (entry[:refs] || (m.refs + m.replytos).uniq.join(" ")), + } + + @index_mutex.synchronize do + @index.delete m.id + @index.add_document d + end + end + + def save_index fn=File.join(@dir, "ferret") + # don't have to do anything, apparently + end + + def contains_id? id + @index_mutex.synchronize { @index.search(Ferret::Search::TermQuery.new(:message_id, id)).total_hits > 0 } + end + + def size + @index_mutex.synchronize { @index.size } + end + + EACH_BY_DATE_NUM = 100 + def each_id_by_date query={} + return if empty? # otherwise ferret barfs ###TODO: remove this once my ferret patch is accepted + ferret_query = build_ferret_query query + offset = 0 + while true + limit = (query[:limit])? [EACH_BY_DATE_NUM, query[:limit] - offset].min : EACH_BY_DATE_NUM + results = @index_mutex.synchronize { @index.search ferret_query, :sort => "date DESC", :limit => limit, :offset => offset } + Redwood::log "got #{results.total_hits} results for query (offset #{offset}) #{ferret_query.inspect}" + results.hits.each do |hit| + yield @index_mutex.synchronize { @index[hit.doc][:message_id] }, lambda { build_message hit.doc } + end + break if query[:limit] and offset >= query[:limit] - limit + break if offset >= results.total_hits - limit + offset += limit + end + end + + def num_results_for query={} + return 0 if empty? 
# otherwise ferret barfs ###TODO: remove this once my ferret patch is accepted + ferret_query = build_ferret_query query + @index_mutex.synchronize { @index.search(ferret_query, :limit => 1).total_hits } + end + + SAME_SUBJECT_DATE_LIMIT = 7 + MAX_CLAUSES = 1000 + def each_message_in_thread_for m, opts={} + #Redwood::log "Building thread for #{m.id}: #{m.subj}" + messages = {} + searched = {} + num_queries = 0 + + pending = [m.id] + if $config[:thread_by_subject] # do subject queries + date_min = m.date - (SAME_SUBJECT_DATE_LIMIT * 12 * 3600) + date_max = m.date + (SAME_SUBJECT_DATE_LIMIT * 12 * 3600) + + q = Ferret::Search::BooleanQuery.new true + sq = Ferret::Search::PhraseQuery.new(:subject) + wrap_subj(Message.normalize_subj(m.subj)).split.each do |t| + sq.add_term t + end + q.add_query sq, :must + q.add_query Ferret::Search::RangeQuery.new(:date, :>= => date_min.to_indexable_s, :<= => date_max.to_indexable_s), :must + + q = build_ferret_query :qobj => q + + p1 = @index_mutex.synchronize { @index.search(q).hits.map { |hit| @index[hit.doc][:message_id] } } + Redwood::log "found #{p1.size} results for subject query #{q}" + + p2 = @index_mutex.synchronize { @index.search(q.to_s, :limit => :all).hits.map { |hit| @index[hit.doc][:message_id] } } + Redwood::log "found #{p2.size} results in string form" + + pending = (pending + p1 + p2).uniq + end + + until pending.empty? || (opts[:limit] && messages.size >= opts[:limit]) + q = Ferret::Search::BooleanQuery.new true + # this disappeared in newer ferrets... wtf. + # q.max_clause_count = 2048 + + lim = [MAX_CLAUSES / 2, pending.length].min + pending[0 ... lim].each do |id| + searched[id] = true + q.add_query Ferret::Search::TermQuery.new(:message_id, id), :should + q.add_query Ferret::Search::TermQuery.new(:refs, id), :should + end + pending = pending[lim .. 
-1] + + q = build_ferret_query :qobj => q + + num_queries += 1 + killed = false + @index_mutex.synchronize do + @index.search_each(q, :limit => :all) do |docid, score| + break if opts[:limit] && messages.size >= opts[:limit] + if @index[docid][:label].split(/\s+/).include?("killed") && opts[:skip_killed] + killed = true + break + end + mid = @index[docid][:message_id] + unless messages.member?(mid) + #Redwood::log "got #{mid} as a child of #{id}" + messages[mid] ||= lambda { build_message docid } + refs = @index[docid][:refs].split + pending += refs.select { |id| !searched[id] } + end + end + end + end + + if killed + Redwood::log "thread for #{m.id} is killed, ignoring" + false + else + Redwood::log "ran #{num_queries} queries to build thread of #{messages.size} messages for #{m.id}: #{m.subj}" if num_queries > 0 + messages.each { |mid, builder| yield mid, builder } + true + end + end + + ## builds a message object from a ferret result + def build_message docid + @index_mutex.synchronize do + doc = @index[docid] or return + + source = SourceManager[doc[:source_id].to_i] + raise "invalid source #{doc[:source_id]}" unless source + + #puts "building message #{doc[:message_id]} (#{source}##{doc[:source_info]})" + + fake_header = { + "date" => Time.at(doc[:date].to_i), + "subject" => unwrap_subj(doc[:subject]), + "from" => doc[:from], + "to" => doc[:to].split.join(", "), # reformat + "message-id" => doc[:message_id], + "references" => doc[:refs].split.map { |x| "<#{x}>" }.join(" "), + } + + m = Message.new :source => source, :source_info => doc[:source_info].to_i, + :labels => doc[:label].symbolistize, + :snippet => doc[:snippet] + m.parse_header fake_header + m + end + end + + def delete id + @index_mutex.synchronize { @index.delete id } + end + + def load_contacts emails, h={} + q = Ferret::Search::BooleanQuery.new true + emails.each do |e| + qq = Ferret::Search::BooleanQuery.new true + qq.add_query Ferret::Search::TermQuery.new(:from, e), :should + qq.add_query 
Ferret::Search::TermQuery.new(:to, e), :should + q.add_query qq + end + q.add_query Ferret::Search::TermQuery.new(:label, "spam"), :must_not + + Redwood::log "contact search: #{q}" + contacts = {} + num = h[:num] || 20 + @index_mutex.synchronize do + @index.search_each q, :sort => "date DESC", :limit => :all do |docid, score| + break if contacts.size >= num + #Redwood::log "got message #{docid} to: #{@index[docid][:to].inspect} and from: #{@index[docid][:from].inspect}" + f = @index[docid][:from] + t = @index[docid][:to] + + if AccountManager.is_account_email? f + t.split(" ").each { |e| contacts[Person.from_address(e)] = true } + else + contacts[Person.from_address(f)] = true + end + end + end + + contacts.keys.compact + end + + def each_docid query={} + ferret_query = build_ferret_query query + results = @index_mutex.synchronize { @index.search ferret_query, :limit => (query[:limit] || :all) } + results.hits.map { |hit| yield hit.doc } + end + + def each_message query={} + each_docid query do |docid| + yield build_message(docid) + end + end + + def optimize + @index_mutex.synchronize { @index.optimize } + end + + def source_for_id id + entry = @index[id] + return unless entry + entry[:source_id].to_i + end + + class ParseError < StandardError; end + + ## parse a query string from the user. returns a query object + ## that can be passed to any index method with a 'query' + ## argument, as well as build_ferret_query. + ## + ## raises a ParseError if something went wrong. + def parse_query s + query = {} + + subs = s.gsub(/\b(to|from):(\S+)\b/) do + field, name = $1, $2 + if(p = ContactManager.contact_for(name)) + [field, p.email] + elsif name == "me" + [field, "(" + AccountManager.user_emails.join("||") + ")"] + else + [field, name] + end.join(":") + end + + ## if we see a label:deleted or a label:spam term anywhere in the query + ## string, we set the extra load_spam or load_deleted options to true. + ## bizarre? 
well, because the query allows arbitrary parenthesized boolean + ## expressions, without fully parsing the query, we can't tell whether + ## the user is explicitly directing us to search spam messages or not. + ## e.g. if the string is -(-(-(-(-label:spam)))), does the user want to + ## search spam messages or not? + ## + ## so, we rely on the fact that turning these extra options ON turns OFF + ## the adding of "-label:deleted" or "-label:spam" terms at the very + ## final stage of query processing. if the user wants to search spam + ## messages, not adding that is the right thing; if he doesn't want to + ## search spam messages, then not adding it won't have any effect. + query[:load_spam] = true if subs =~ /\blabel:spam\b/ + query[:load_deleted] = true if subs =~ /\blabel:deleted\b/ + + ## gmail style "is" operator + subs = subs.gsub(/\b(is|has):(\S+)\b/) do + field, label = $1, $2 + case label + when "read" + "-label:unread" + when "spam" + query[:load_spam] = true + "label:spam" + when "deleted" + query[:load_deleted] = true + "label:deleted" + else + "label:#{$2}" + end + end + + ## gmail style attachments "filename" and "filetype" searches + subs = subs.gsub(/\b(filename|filetype):(\((.+?)\)\B|(\S+)\b)/) do + field, name = $1, ($3 || $4) + case field + when "filename" + Redwood::log "filename - translated #{field}:#{name} to attachments:(#{name.downcase})" + "attachments:(#{name.downcase})" + when "filetype" + Redwood::log "filetype - translated #{field}:#{name} to attachments:(*.#{name.downcase})" + "attachments:(*.#{name.downcase})" + end + end + + if $have_chronic + subs = subs.gsub(/\b(before|on|in|during|after):(\((.+?)\)\B|(\S+)\b)/) do + field, datestr = $1, ($3 || $4) + realdate = Chronic.parse datestr, :guess => false, :context => :past + if realdate + case field + when "after" + Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate.end}" + "date:(>= #{sprintf "%012d", realdate.end.to_i})" + when "before" + Redwood::log "chronic: 
translated #{field}:#{datestr} to #{realdate.begin}" + "date:(<= #{sprintf "%012d", realdate.begin.to_i})" + else + Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate}" + "date:(<= #{sprintf "%012d", realdate.end.to_i}) date:(>= #{sprintf "%012d", realdate.begin.to_i})" + end + else + raise ParseError, "can't understand date #{datestr.inspect}" + end + end + end + + ## limit:42 restrict the search to 42 results + subs = subs.gsub(/\blimit:(\S+)\b/) do + lim = $1 + if lim =~ /^\d+$/ + query[:limit] = lim.to_i + '' + else + raise ParseError, "non-numeric limit #{lim.inspect}" + end + end + + begin + query[:qobj] = @qparser.parse(subs) + query[:text] = s + query + rescue Ferret::QueryParser::QueryParseException => e + raise ParseError, e.message + end + end + +private + + def build_ferret_query query + q = Ferret::Search::BooleanQuery.new + q.add_query query[:qobj], :must if query[:qobj] + labels = ([query[:label]] + (query[:labels] || [])).compact + labels.each { |t| q.add_query Ferret::Search::TermQuery.new("label", t.to_s), :must } + if query[:participants] + q2 = Ferret::Search::BooleanQuery.new + query[:participants].each do |p| + q2.add_query Ferret::Search::TermQuery.new("from", p.email), :should + q2.add_query Ferret::Search::TermQuery.new("to", p.email), :should + end + q.add_query q2, :must + end + + q.add_query Ferret::Search::TermQuery.new("label", "spam"), :must_not unless query[:load_spam] || labels.include?(:spam) + q.add_query Ferret::Search::TermQuery.new("label", "deleted"), :must_not unless query[:load_deleted] || labels.include?(:deleted) + q.add_query Ferret::Search::TermQuery.new("label", "killed"), :must_not if query[:skip_killed] + + q.add_query Ferret::Search::TermQuery.new("source_id", query[:source_id]), :must if query[:source_id] + q + end + + def wrap_subj subj; "__START_SUBJECT__ #{subj} __END_SUBJECT__"; end + def unwrap_subj subj; subj =~ /__START_SUBJECT__ (.*?) 
__END_SUBJECT__/ && $1; end +end + +end diff --git a/lib/sup/index.rb b/lib/sup/index.rb index 5ddd6ee..be0e870 100644 --- a/lib/sup/index.rb +++ b/lib/sup/index.rb @@ -1,7 +1,6 @@ -## the index structure for redwood. interacts with ferret. +## Index interface, subclassed by Ferret indexer. require 'fileutils' -require 'ferret' begin require 'chronic' @@ -13,7 +12,7 @@ end module Redwood -class Index +class BaseIndex class LockError < StandardError def initialize h @h = h @@ -25,17 +24,8 @@ class Index include Singleton def initialize dir=BASE_DIR - @index_mutex = Monitor.new @dir = dir - - wsa = Ferret::Analysis::WhiteSpaceAnalyzer.new false - sa = Ferret::Analysis::StandardAnalyzer.new [], true - @analyzer = Ferret::Analysis::PerFieldAnalyzer.new wsa - @analyzer[:body] = sa - @analyzer[:subject] = sa - @qparser ||= Ferret::QueryParser.new :default_field => :body, :analyzer => @analyzer, :or_default => false @lock = Lockfile.new lockfile, :retries => 0, :max_age => nil - self.class.i_am_the_instance self end @@ -119,155 +109,44 @@ EOS save_index end - def load_index dir=File.join(@dir, "ferret") - if File.exists? dir - Redwood::log "loading index..." - @index_mutex.synchronize do - @index = Ferret::Index::Index.new(:path => dir, :analyzer => @analyzer, :id_field => 'message_id') - Redwood::log "loaded index of #{@index.size} messages" - end - else - Redwood::log "creating index..." 
- @index_mutex.synchronize do - field_infos = Ferret::Index::FieldInfos.new :store => :yes - field_infos.add_field :message_id, :index => :untokenized - field_infos.add_field :source_id - field_infos.add_field :source_info - field_infos.add_field :date, :index => :untokenized - field_infos.add_field :body - field_infos.add_field :label - field_infos.add_field :attachments - field_infos.add_field :subject - field_infos.add_field :from - field_infos.add_field :to - field_infos.add_field :refs - field_infos.add_field :snippet, :index => :no, :term_vector => :no - field_infos.create_index dir - @index = Ferret::Index::Index.new(:path => dir, :analyzer => @analyzer, :id_field => 'message_id') - end - end + def load_index + unimplemented end ## Syncs the message to the index, replacing any previous version. adding ## either way. Index state will be determined by the message's #labels ## accessor. def sync_message m, opts={} - entry = @index[m.id] - - raise "no source info for message #{m.id}" unless m.source && m.source_info - - source_id = if m.source.is_a? Integer - m.source - else - m.source.id or raise "unregistered source #{m.source} (id #{m.source.id.inspect})" - end - - snippet = if m.snippet_contains_encrypted_content? && $config[:discard_snippets_from_encrypted_messages] - "" - else - m.snippet - end - - ## write the new document to the index. if the entry already exists in the - ## index, reuse it (which avoids having to reload the entry from the source, - ## which can be quite expensive for e.g. large threads of IMAP actions.) - ## - ## exception: if the index entry belongs to an earlier version of the - ## message, use everything from the new message instead, but union the - ## flags. this allows messages sent to mailing lists to have their header - ## updated and to have flags set properly. - ## - ## minor hack: messages in sources with lower ids have priority over - ## messages in sources with higher ids. 
so messages in the inbox will - ## override everyone, and messages in the sent box will be overridden - ## by everyone else. - ## - ## written in this manner to support previous versions of the index which - ## did not keep around the entry body. upgrading is thus seamless. - entry ||= {} - labels = m.labels.uniq # override because this is the new state, unless... - - ## if we are a later version of a message, ignore what's in the index, - ## but merge in the labels. - if entry[:source_id] && entry[:source_info] && entry[:label] && - ((entry[:source_id].to_i > source_id) || (entry[:source_info].to_i < m.source_info)) - labels = (entry[:label].symbolistize + m.labels).uniq - #Redwood::log "found updated version of message #{m.id}: #{m.subj}" - #Redwood::log "previous version was at #{entry[:source_id].inspect}:#{entry[:source_info].inspect}, this version at #{source_id.inspect}:#{m.source_info.inspect}" - #Redwood::log "merged labels are #{labels.inspect} (index #{entry[:label].inspect}, message #{m.labels.inspect})" - entry = {} - end - - ## if force_overwite is true, ignore what's in the index. this is used - ## primarily by sup-sync to force index updates. - entry = {} if opts[:force_overwrite] - - d = { - :message_id => m.id, - :source_id => source_id, - :source_info => m.source_info, - :date => (entry[:date] || m.date.to_indexable_s), - :body => (entry[:body] || m.indexable_content), - :snippet => snippet, # always override - :label => labels.uniq.join(" "), - :attachments => (entry[:attachments] || m.attachments.uniq.join(" ")), - - ## always override :from and :to. - ## older versions of Sup would often store the wrong thing in the index - ## (because they were canonicalizing email addresses, resulting in the - ## wrong name associated with each.) 
the correct address is read from - ## the original header when these messages are opened in thread-view-mode, - ## so this allows people to forcibly update the address in the index by - ## marking those threads for saving. - :from => (m.from ? m.from.indexable_content : ""), - :to => (m.to + m.cc + m.bcc).map { |x| x.indexable_content }.join(" "), - - :subject => (entry[:subject] || wrap_subj(Message.normalize_subj(m.subj))), - :refs => (entry[:refs] || (m.refs + m.replytos).uniq.join(" ")), - } - - @index_mutex.synchronize do - @index.delete m.id - @index.add_document d - end + unimplemented end - def save_index fn=File.join(@dir, "ferret") - # don't have to do anything, apparently + def save_index fn + unimplemented end def contains_id? id - @index_mutex.synchronize { @index.search(Ferret::Search::TermQuery.new(:message_id, id)).total_hits > 0 } + unimplemented end + def contains? m; contains_id? m.id end - def size; @index_mutex.synchronize { @index.size } end + + def size + unimplemented + end + def empty?; size == 0 end - ## you should probably not call this on a block that doesn't break + ## Yields a message-id and message-building lambda for each + ## message that matches the given query, in descending date order. + ## You should probably not call this on a block that doesn't break ## rather quickly because the results can be very large. - EACH_BY_DATE_NUM = 100 def each_id_by_date query={} - return if empty? # otherwise ferret barfs ###TODO: remove this once my ferret patch is accepted - ferret_query = build_ferret_query query - offset = 0 - while true - limit = (query[:limit])? 
[EACH_BY_DATE_NUM, query[:limit] - offset].min : EACH_BY_DATE_NUM - results = @index_mutex.synchronize { @index.search ferret_query, :sort => "date DESC", :limit => limit, :offset => offset } - Redwood::log "got #{results.total_hits} results for query (offset #{offset}) #{ferret_query.inspect}" - results.hits.each do |hit| - yield @index_mutex.synchronize { @index[hit.doc][:message_id] }, lambda { build_message hit.doc } - end - break if query[:limit] and offset >= query[:limit] - limit - break if offset >= results.total_hits - limit - offset += limit - end + unimplemented end + ## Return the number of matches for query in the index def num_results_for query={} - return 0 if empty? # otherwise ferret barfs ###TODO: remove this once my ferret patch is accepted - - ferret_query = build_ferret_query query - @index_mutex.synchronize { @index.search(ferret_query, :limit => 1).total_hits } + unimplemented end ## yield all messages in the thread containing 'm' by repeatedly @@ -278,300 +157,60 @@ EOS ## only two options, :limit and :skip_killed. if :skip_killed is ## true, stops loading any thread if a message with a :killed flag ## is found. 
- SAME_SUBJECT_DATE_LIMIT = 7 - MAX_CLAUSES = 1000 def each_message_in_thread_for m, opts={} - #Redwood::log "Building thread for #{m.id}: #{m.subj}" - messages = {} - searched = {} - num_queries = 0 - - pending = [m.id] - if $config[:thread_by_subject] # do subject queries - date_min = m.date - (SAME_SUBJECT_DATE_LIMIT * 12 * 3600) - date_max = m.date + (SAME_SUBJECT_DATE_LIMIT * 12 * 3600) - - q = Ferret::Search::BooleanQuery.new true - sq = Ferret::Search::PhraseQuery.new(:subject) - wrap_subj(Message.normalize_subj(m.subj)).split.each do |t| - sq.add_term t - end - q.add_query sq, :must - q.add_query Ferret::Search::RangeQuery.new(:date, :>= => date_min.to_indexable_s, :<= => date_max.to_indexable_s), :must - - q = build_ferret_query :qobj => q - - p1 = @index_mutex.synchronize { @index.search(q).hits.map { |hit| @index[hit.doc][:message_id] } } - Redwood::log "found #{p1.size} results for subject query #{q}" - - p2 = @index_mutex.synchronize { @index.search(q.to_s, :limit => :all).hits.map { |hit| @index[hit.doc][:message_id] } } - Redwood::log "found #{p2.size} results in string form" - - pending = (pending + p1 + p2).uniq - end - - until pending.empty? || (opts[:limit] && messages.size >= opts[:limit]) - q = Ferret::Search::BooleanQuery.new true - # this disappeared in newer ferrets... wtf. - # q.max_clause_count = 2048 - - lim = [MAX_CLAUSES / 2, pending.length].min - pending[0 ... lim].each do |id| - searched[id] = true - q.add_query Ferret::Search::TermQuery.new(:message_id, id), :should - q.add_query Ferret::Search::TermQuery.new(:refs, id), :should - end - pending = pending[lim .. 
-1] - - q = build_ferret_query :qobj => q - - num_queries += 1 - killed = false - @index_mutex.synchronize do - @index.search_each(q, :limit => :all) do |docid, score| - break if opts[:limit] && messages.size >= opts[:limit] - if @index[docid][:label].split(/\s+/).include?("killed") && opts[:skip_killed] - killed = true - break - end - mid = @index[docid][:message_id] - unless messages.member?(mid) - #Redwood::log "got #{mid} as a child of #{id}" - messages[mid] ||= lambda { build_message docid } - refs = @index[docid][:refs].split - pending += refs.select { |id| !searched[id] } - end - end - end - end - - if killed - Redwood::log "thread for #{m.id} is killed, ignoring" - false - else - Redwood::log "ran #{num_queries} queries to build thread of #{messages.size} messages for #{m.id}: #{m.subj}" if num_queries > 0 - messages.each { |mid, builder| yield mid, builder } - true - end + unimplemented end - ## builds a message object from a ferret result - def build_message docid - @index_mutex.synchronize do - doc = @index[docid] or return - - source = SourceManager[doc[:source_id].to_i] - raise "invalid source #{doc[:source_id]}" unless source - - #puts "building message #{doc[:message_id]} (#{source}##{doc[:source_info]})" - - fake_header = { - "date" => Time.at(doc[:date].to_i), - "subject" => unwrap_subj(doc[:subject]), - "from" => doc[:from], - "to" => doc[:to].split.join(", "), # reformat - "message-id" => doc[:message_id], - "references" => doc[:refs].split.map { |x| "<#{x}>" }.join(" "), - } - - m = Message.new :source => source, :source_info => doc[:source_info].to_i, - :labels => doc[:label].symbolistize, - :snippet => doc[:snippet] - m.parse_header fake_header - m - end + ## Load message with the given message-id from the index + def build_message id + unimplemented end - def delete id; @index_mutex.synchronize { @index.delete id } end - - def load_contacts emails, h={} - q = Ferret::Search::BooleanQuery.new true - emails.each do |e| - qq = 
Ferret::Search::BooleanQuery.new true - qq.add_query Ferret::Search::TermQuery.new(:from, e), :should - qq.add_query Ferret::Search::TermQuery.new(:to, e), :should - q.add_query qq - end - q.add_query Ferret::Search::TermQuery.new(:label, "spam"), :must_not - - Redwood::log "contact search: #{q}" - contacts = {} - num = h[:num] || 20 - @index_mutex.synchronize do - @index.search_each q, :sort => "date DESC", :limit => :all do |docid, score| - break if contacts.size >= num - #Redwood::log "got message #{docid} to: #{@index[docid][:to].inspect} and from: #{@index[docid][:from].inspect}" - f = @index[docid][:from] - t = @index[docid][:to] - - if AccountManager.is_account_email? f - t.split(" ").each { |e| contacts[Person.from_address(e)] = true } - else - contacts[Person.from_address(f)] = true - end - end - end + ## Delete message with the given message-id from the index + def delete id + unimplemented + end - contacts.keys.compact + ## Given an array of email addresses, return an array of Person objects that + ## have sent mail to or received mail from any of the given addresses. + def load_contacts email_addresses, h={} + unimplemented end + ## Yield each docid matching query def each_docid query={} - ferret_query = build_ferret_query query - results = @index_mutex.synchronize { @index.search ferret_query, :limit => (query[:limit] || :all) } - results.hits.map { |hit| yield hit.doc } + unimplemented end + ## Yield each message matching query def each_message query={} - each_docid query do |docid| - yield build_message(docid) - end + unimplemented end + ## Implementation-specific optimization step def optimize - @index_mutex.synchronize { @index.optimize } + unimplemented end + ## Return the id of the source the message with the given message-id + ## was synced from def source_for_id id - entry = @index[id] - return unless entry - entry[:source_id].to_i + unimplemented end class ParseError < StandardError; end ## parse a query string from the user.
returns a query object ## that can be passed to any index method with a 'query' - ## argument, as well as build_ferret_query. + ## argument. ## ## raises a ParseError if something went wrong. def parse_query s - query = {} - - subs = s.gsub(/\b(to|from):(\S+)\b/) do - field, name = $1, $2 - if(p = ContactManager.contact_for(name)) - [field, p.email] - elsif name == "me" - [field, "(" + AccountManager.user_emails.join("||") + ")"] - else - [field, name] - end.join(":") - end - - ## if we see a label:deleted or a label:spam term anywhere in the query - ## string, we set the extra load_spam or load_deleted options to true. - ## bizarre? well, because the query allows arbitrary parenthesized boolean - ## expressions, without fully parsing the query, we can't tell whether - ## the user is explicitly directing us to search spam messages or not. - ## e.g. if the string is -(-(-(-(-label:spam)))), does the user want to - ## search spam messages or not? - ## - ## so, we rely on the fact that turning these extra options ON turns OFF - ## the adding of "-label:deleted" or "-label:spam" terms at the very - ## final stage of query processing. if the user wants to search spam - ## messages, not adding that is the right thing; if he doesn't want to - ## search spam messages, then not adding it won't have any effect. 
- query[:load_spam] = true if subs =~ /\blabel:spam\b/ - query[:load_deleted] = true if subs =~ /\blabel:deleted\b/ - - ## gmail style "is" operator - subs = subs.gsub(/\b(is|has):(\S+)\b/) do - field, label = $1, $2 - case label - when "read" - "-label:unread" - when "spam" - query[:load_spam] = true - "label:spam" - when "deleted" - query[:load_deleted] = true - "label:deleted" - else - "label:#{$2}" - end - end - - ## gmail style attachments "filename" and "filetype" searches - subs = subs.gsub(/\b(filename|filetype):(\((.+?)\)\B|(\S+)\b)/) do - field, name = $1, ($3 || $4) - case field - when "filename" - Redwood::log "filename - translated #{field}:#{name} to attachments:(#{name.downcase})" - "attachments:(#{name.downcase})" - when "filetype" - Redwood::log "filetype - translated #{field}:#{name} to attachments:(*.#{name.downcase})" - "attachments:(*.#{name.downcase})" - end - end - - if $have_chronic - subs = subs.gsub(/\b(before|on|in|during|after):(\((.+?)\)\B|(\S+)\b)/) do - field, datestr = $1, ($3 || $4) - realdate = Chronic.parse datestr, :guess => false, :context => :past - if realdate - case field - when "after" - Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate.end}" - "date:(>= #{sprintf "%012d", realdate.end.to_i})" - when "before" - Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate.begin}" - "date:(<= #{sprintf "%012d", realdate.begin.to_i})" - else - Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate}" - "date:(<= #{sprintf "%012d", realdate.end.to_i}) date:(>= #{sprintf "%012d", realdate.begin.to_i})" - end - else - raise ParseError, "can't understand date #{datestr.inspect}" - end - end - end - - ## limit:42 restrict the search to 42 results - subs = subs.gsub(/\blimit:(\S+)\b/) do - lim = $1 - if lim =~ /^\d+$/ - query[:limit] = lim.to_i - '' - else - raise ParseError, "non-numeric limit #{lim.inspect}" - end - end - - begin - query[:qobj] = @qparser.parse(subs) - query[:text] = s - query - 
rescue Ferret::QueryParser::QueryParseException => e - raise ParseError, e.message - end - end - -private - - def build_ferret_query query - q = Ferret::Search::BooleanQuery.new - q.add_query query[:qobj], :must if query[:qobj] - labels = ([query[:label]] + (query[:labels] || [])).compact - labels.each { |t| q.add_query Ferret::Search::TermQuery.new("label", t.to_s), :must } - if query[:participants] - q2 = Ferret::Search::BooleanQuery.new - query[:participants].each do |p| - q2.add_query Ferret::Search::TermQuery.new("from", p.email), :should - q2.add_query Ferret::Search::TermQuery.new("to", p.email), :should - end - q.add_query q2, :must - end - - q.add_query Ferret::Search::TermQuery.new("label", "spam"), :must_not unless query[:load_spam] || labels.include?(:spam) - q.add_query Ferret::Search::TermQuery.new("label", "deleted"), :must_not unless query[:load_deleted] || labels.include?(:deleted) - q.add_query Ferret::Search::TermQuery.new("label", "killed"), :must_not if query[:skip_killed] - - q.add_query Ferret::Search::TermQuery.new("source_id", query[:source_id]), :must if query[:source_id] - q + unimplemented end - - def wrap_subj subj; "__START_SUBJECT__ #{subj} __END_SUBJECT__"; end - def unwrap_subj subj; subj =~ /__START_SUBJECT__ (.*?) __END_SUBJECT__/ && $1; end end end + +require 'lib/sup/ferret_index' +Redwood::Index = Redwood::FerretIndex -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
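The net effect of patch 11 is that index.rb becomes an abstract base class: every backend-specific method body is replaced by a call to an `unimplemented` helper, and a concrete implementation (FerretIndex here, XapianIndex later in the series) overrides the stubs. A minimal sketch of the pattern — the exact definition of `unimplemented` is an assumption based on the stubs above, and `HashIndex` is a toy backend invented for illustration:

```ruby
# Sketch: BaseIndex stubs raise until a backend overrides them.
class BaseIndex
  # Assumed shape of sup's "unimplemented" helper.
  def unimplemented
    raise NotImplementedError, "#{self.class} must override this method"
  end

  ## Load message with the given message-id from the index
  def build_message id
    unimplemented
  end

  ## Delete message with the given message-id from the index
  def delete id
    unimplemented
  end
end

# A toy backend overriding the stubs:
class HashIndex < BaseIndex
  def initialize
    @docs = {}
  end

  def sync_message id, doc
    @docs[id] = doc
  end

  def build_message id
    @docs[id]
  end

  def delete id
    @docs.delete id
  end
end
```

Calling a stub on the base class raises, so an incomplete backend fails loudly at the first unported method rather than silently misbehaving.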
* [sup-talk] [PATCH 12/18] remove last external uses of ferret docid 2009-06-20 20:50 ` [sup-talk] [PATCH 11/18] index: move Ferret-specific code to ferret_index.rb Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 13/18] add Message.indexable_{body, chunks, subject} Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- bin/sup-sync-back | 2 +- bin/sup-tweak-labels | 6 +++--- lib/sup/ferret_index.rb | 10 ++-------- lib/sup/index.rb | 12 +++++++----- 4 files changed, 13 insertions(+), 17 deletions(-) diff --git a/bin/sup-sync-back b/bin/sup-sync-back index 679e03a..8aa2039 100755 --- a/bin/sup-sync-back +++ b/bin/sup-sync-back @@ -17,7 +17,7 @@ def die msg end def has_any_from_source_with_label? index, source, label query = { :source_id => source.id, :label => label, :limit => 1 } - not Enumerable::Enumerator.new(index, :each_docid, query).map.empty? + not Enumerable::Enumerator.new(index, :each_id, query).map.empty? end opts = Trollop::options do diff --git a/bin/sup-tweak-labels b/bin/sup-tweak-labels index 95a3b03..a8115ea 100755 --- a/bin/sup-tweak-labels +++ b/bin/sup-tweak-labels @@ -83,14 +83,14 @@ begin query += ' ' + opts[:query] if opts[:query] parsed_query = index.parse_query query - docs = Enumerable::Enumerator.new(index, :each_docid, parsed_query).map - num_total = docs.size + ids = Enumerable::Enumerator.new(index, :each_id, parsed_query).map + num_total = ids.size $stderr.puts "Found #{num_total} documents across #{source_ids.length} sources. Scanning..." 
num_changed = num_scanned = 0 last_info_time = start_time = Time.now - docs.each do |id| + ids.each do |id| num_scanned += 1 m = index.build_message id diff --git a/lib/sup/ferret_index.rb b/lib/sup/ferret_index.rb index 53c19e0..a2c30ab 100644 --- a/lib/sup/ferret_index.rb +++ b/lib/sup/ferret_index.rb @@ -301,16 +301,10 @@ class FerretIndex < BaseIndex contacts.keys.compact end - def each_docid query={} + def each_id query={} ferret_query = build_ferret_query query results = @index_mutex.synchronize { @index.search ferret_query, :limit => (query[:limit] || :all) } - results.hits.map { |hit| yield hit.doc } - end - - def each_message query={} - each_docid query do |docid| - yield build_message(docid) - end + results.hits.map { |hit| yield @index[hit.doc][:message_id] } end def optimize diff --git a/lib/sup/index.rb b/lib/sup/index.rb index be0e870..45382f1 100644 --- a/lib/sup/index.rb +++ b/lib/sup/index.rb @@ -177,14 +177,16 @@ EOS unimplemented end - ## Yield each docid matching query - def each_docid query={} + ## Yield each message-id matching query + def each_id query={} unimplemented end - ## Yield each messages matching query - def each_message query={} - unimplemented + ## Yield each message matching query + def each_message query={}, &b + each_id query do |id| + yield build_message(id) + end end ## Implementation-specific optimization step -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
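The key move in patch 12 is the template method: once every backend yields message-ids rather than backend-specific docids, `each_message` no longer needs any backend knowledge and can live in the base class, written purely in terms of `each_id` and `build_message`. A sketch of the shape — `ArrayIndex` is a made-up toy backend for illustration:

```ruby
# Template-method sketch: the base class composes two backend hooks.
class BaseIndex
  def each_id query={}
    raise NotImplementedError
  end

  def build_message id
    raise NotImplementedError
  end

  ## Shared by every backend: stream messages by message-id.
  def each_message query={}
    each_id(query) { |id| yield build_message(id) }
  end
end

class ArrayIndex < BaseIndex
  def initialize entries
    @entries = entries # message-id => message, hypothetical storage
  end

  def each_id query={}
    @entries.each_key { |id| yield id }
  end

  def build_message id
    @entries[id]
  end
end
```

A new backend only has to implement the two primitives; iteration, and anything else later built on it, comes for free.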
* [sup-talk] [PATCH 13/18] add Message.indexable_{body, chunks, subject} 2009-06-20 20:50 ` [sup-talk] [PATCH 12/18] remove last external uses of ferret docid Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 14/18] index: choose index implementation with config entry or environment variable Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup/message.rb | 16 ++++++++++++++-- 1 files changed, 14 insertions(+), 2 deletions(-) diff --git a/lib/sup/message.rb b/lib/sup/message.rb index b667cb3..2999986 100644 --- a/lib/sup/message.rb +++ b/lib/sup/message.rb @@ -270,11 +270,23 @@ EOS to.map { |p| p.indexable_content }, cc.map { |p| p.indexable_content }, bcc.map { |p| p.indexable_content }, - chunks.select { |c| c.is_a? Chunk::Text }.map { |c| c.lines }, - Message.normalize_subj(subj), + indexable_chunks.map { |c| c.lines }, + indexable_subject, ].flatten.compact.join " " end + def indexable_body + indexable_chunks.map { |c| c.lines }.flatten.compact.join " " + end + + def indexable_chunks + chunks.select { |c| c.is_a? Chunk::Text } + end + + def indexable_subject + Message.normalize_subj(subj) + end + def quotable_body_lines chunks.find_all { |c| c.quotable? }.map { |c| c.lines }.flatten end -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 14/18] index: choose index implementation with config entry or environment variable 2009-06-20 20:50 ` [sup-talk] [PATCH 13/18] add Message.indexable_{body, chunks, subject} Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 15/18] index: add xapian implementation Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup.rb | 2 ++ lib/sup/index.rb | 10 ++++++++-- 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/lib/sup.rb b/lib/sup.rb index 5689c2b..54de73f 100644 --- a/lib/sup.rb +++ b/lib/sup.rb @@ -54,6 +54,8 @@ module Redwood YAML_DOMAIN = "masanjin.net" YAML_DATE = "2006-10-01" + DEFAULT_INDEX = 'ferret' + ## record exceptions thrown in threads nicely @exceptions = [] @exception_mutex = Mutex.new diff --git a/lib/sup/index.rb b/lib/sup/index.rb index 45382f1..df428f7 100644 --- a/lib/sup/index.rb +++ b/lib/sup/index.rb @@ -212,7 +212,13 @@ EOS end end +index_name = ENV['SUP_INDEX'] || $config[:index] || DEFAULT_INDEX +begin + require "sup/#{index_name}_index" +rescue LoadError + fail "invalid index name #{index_name.inspect}" end +Index = Redwood.const_get "#{index_name.capitalize}Index" +Redwood::log "using index #{Index.name}" -require 'lib/sup/ferret_index' -Redwood::Index = Redwood::FerretIndex +end -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
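The selection logic added in patch 14 relies on a naming convention: backend `foo` lives in `sup/foo_index` and defines `Redwood::FooIndex`. A condensed sketch of the same idea — `index_class_for` is a hypothetical helper name, not sup's actual code:

```ruby
# Sketch of the backend-selection convention: pick a name from the
# environment (falling back to a default), require the matching file,
# and resolve the class with const_get.
module Redwood
  DEFAULT_INDEX = 'ferret'

  def self.index_class_for name=nil
    name ||= ENV['SUP_INDEX'] || DEFAULT_INDEX
    require "sup/#{name}_index"
    const_get "#{name.capitalize}Index"
  rescue LoadError
    raise "invalid index name #{name.inspect}"
  end
end
```

The `rescue LoadError` turns a missing backend file into a readable error, which is what makes `SUP_INDEX=xapian` safe to mistype.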
* [sup-talk] [PATCH 15/18] index: add xapian implementation 2009-06-20 20:50 ` [sup-talk] [PATCH 14/18] index: choose index implementation with config entry or environment variable Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 16/18] fix String#ord monkeypatch Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup/poll.rb | 2 +- lib/sup/xapian_index.rb | 483 +++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 484 insertions(+), 1 deletions(-) create mode 100644 lib/sup/xapian_index.rb diff --git a/lib/sup/poll.rb b/lib/sup/poll.rb index c83290c..8a9d218 100644 --- a/lib/sup/poll.rb +++ b/lib/sup/poll.rb @@ -147,7 +147,7 @@ EOS m_new = Message.build_from_source source, offset m_old = Index.build_message m_new.id - m_new.labels = default_labels + (source.archived? ? [] : [:inbox]) + m_new.labels += default_labels + (source.archived? ? [] : [:inbox]) m_new.labels << :sent if source.uri.eql?(SentManager.source_uri) m_new.labels.delete :unread if m_new.source_marked_read? m_new.labels.each { |l| LabelManager << l } diff --git a/lib/sup/xapian_index.rb b/lib/sup/xapian_index.rb new file mode 100644 index 0000000..7faa64d --- /dev/null +++ b/lib/sup/xapian_index.rb @@ -0,0 +1,483 @@ +require 'xapian' +require 'gdbm' +require 'set' + +module Redwood + +# This index implementation uses Xapian for searching and GDBM for storage. It +# tends to be slightly faster than Ferret for indexing and significantly faster +# for searching due to precomputing thread membership. 
+class XapianIndex < BaseIndex + STEM_LANGUAGE = "english" + + def initialize dir=BASE_DIR + super + + @index_mutex = Monitor.new + + @entries = MarshalledGDBM.new File.join(dir, "entries.db") + @docids = MarshalledGDBM.new File.join(dir, "docids.db") + @thread_members = MarshalledGDBM.new File.join(dir, "thread_members.db") + @thread_ids = MarshalledGDBM.new File.join(dir, "thread_ids.db") + @assigned_docids = GDBM.new File.join(dir, "assigned_docids.db") + + @xapian = Xapian::WritableDatabase.new(File.join(dir, "xapian"), Xapian::DB_CREATE_OR_OPEN) + @term_generator = Xapian::TermGenerator.new() + @term_generator.stemmer = Xapian::Stem.new(STEM_LANGUAGE) + @enquire = Xapian::Enquire.new @xapian + @enquire.weighting_scheme = Xapian::BoolWeight.new + @enquire.docid_order = Xapian::Enquire::ASCENDING + end + + def load_index + end + + def save_index + end + + def optimize + end + + def size + synchronize { @xapian.doccount } + end + + def contains_id? id + synchronize { @entries.member? id } + end + + def source_for_id id + synchronize { @entries[id][:source_id] } + end + + def delete id + synchronize { @xapian.delete_document @docids[id] } + end + + def build_message id + entry = synchronize { @entries[id] } + return unless entry + + source = SourceManager[entry[:source_id]] + raise "invalid source #{entry[:source_id]}" unless source + + mk_addrs = lambda { |l| l.map { |e,n| "#{n} <#{e}>" } * ', ' } + mk_refs = lambda { |l| l.map { |r| "<#{r}>" } * ' ' } + fake_header = { + 'message-id' => entry[:message_id], + 'date' => Time.at(entry[:date]), + 'subject' => entry[:subject], + 'from' => mk_addrs[[entry[:from]]], + 'to' => mk_addrs[[entry[:to]]], + 'cc' => mk_addrs[[entry[:cc]]], + 'bcc' => mk_addrs[[entry[:bcc]]], + 'reply-tos' => mk_refs[entry[:replytos]], + 'references' => mk_refs[entry[:refs]], + } + + m = Message.new :source => source, :source_info => entry[:source_info], + :labels => entry[:labels], + :snippet => entry[:snippet] + m.parse_header fake_header + 
m + end + + def sync_message m, opts={} + entry = synchronize { @entries[m.id] } + snippet = m.snippet + entry ||= {} + labels = m.labels + entry = {} if opts[:force_overwrite] + + d = { + :message_id => m.id, + :source_id => m.source.id, + :source_info => m.source_info, + :date => (entry[:date] || m.date), + :snippet => snippet, + :labels => labels.uniq, + :from => (entry[:from] || [m.from.email, m.from.name]), + :to => (entry[:to] || m.to.map { |p| [p.email, p.name] }), + :cc => (entry[:cc] || m.cc.map { |p| [p.email, p.name] }), + :bcc => (entry[:bcc] || m.bcc.map { |p| [p.email, p.name] }), + :subject => m.subj, + :refs => (entry[:refs] || m.refs), + :replytos => (entry[:replytos] || m.replytos), + } + + m.labels.each { |l| LabelManager << l } + + synchronize do + index_message m, opts + union_threads([m.id] + m.refs + m.replytos) + @entries[m.id] = d + end + true + end + + def num_results_for query={} + xapian_query = build_xapian_query query + matchset = run_query xapian_query, 0, 0, 100 + matchset.matches_estimated + end + + EACH_ID_PAGE = 100 + def each_id query={} + offset = 0 + page = EACH_ID_PAGE + + xapian_query = build_xapian_query query + while true + ids = run_query_ids xapian_query, offset, (offset+page) + ids.each { |id| yield id } + break if ids.size < page + offset += page + end + end + + def each_id_by_date query={} + each_id(query) { |id| yield id, lambda { build_message id } } + end + + def each_message_in_thread_for m, opts={} + # TODO thread by subject + # TODO handle killed threads + ids = synchronize { @thread_members[@thread_ids[m.id]] } || [] + ids.select { |id| contains_id? 
id }.each { |id| yield id, lambda { build_message id } } + true + end + + def load_contacts emails, opts={} + contacts = Set.new + num = opts[:num] || 20 + each_id_by_date :participants => emails do |id,b| + break if contacts.size >= num + m = b.call + ([m.from]+m.to+m.cc+m.bcc).compact.each { |p| contacts << [p.name, p.email] } + end + contacts.to_a.compact.map { |n,e| Person.new n, e }[0...num] + end + + # TODO share code with the Ferret index + def parse_query s + query = {} + + subs = s.gsub(/\b(to|from):(\S+)\b/) do + field, name = $1, $2 + if(p = ContactManager.contact_for(name)) + [field, p.email] + elsif name == "me" + [field, "(" + AccountManager.user_emails.join("||") + ")"] + else + [field, name] + end.join(":") + end + + ## if we see a label:deleted or a label:spam term anywhere in the query + ## string, we set the extra load_spam or load_deleted options to true. + ## bizarre? well, because the query allows arbitrary parenthesized boolean + ## expressions, without fully parsing the query, we can't tell whether + ## the user is explicitly directing us to search spam messages or not. + ## e.g. if the string is -(-(-(-(-label:spam)))), does the user want to + ## search spam messages or not? + ## + ## so, we rely on the fact that turning these extra options ON turns OFF + ## the adding of "-label:deleted" or "-label:spam" terms at the very + ## final stage of query processing. if the user wants to search spam + ## messages, not adding that is the right thing; if he doesn't want to + ## search spam messages, then not adding it won't have any effect. 
+ query[:load_spam] = true if subs =~ /\blabel:spam\b/ + query[:load_deleted] = true if subs =~ /\blabel:deleted\b/ + + ## gmail style "is" operator + subs = subs.gsub(/\b(is|has):(\S+)\b/) do + field, label = $1, $2 + case label + when "read" + "-label:unread" + when "spam" + query[:load_spam] = true + "label:spam" + when "deleted" + query[:load_deleted] = true + "label:deleted" + else + "label:#{$2}" + end + end + + ## gmail style attachments "filename" and "filetype" searches + subs = subs.gsub(/\b(filename|filetype):(\((.+?)\)\B|(\S+)\b)/) do + field, name = $1, ($3 || $4) + case field + when "filename" + Redwood::log "filename - translated #{field}:#{name} to attachment:\"#{name.downcase}\"" + "attachment:\"#{name.downcase}\"" + when "filetype" + Redwood::log "filetype - translated #{field}:#{name} to attachment_extension:#{name.downcase}" + "attachment_extension:#{name.downcase}" + end + end + + if $have_chronic + lastdate = 2<<32 - 1 + firstdate = 0 + subs = subs.gsub(/\b(before|on|in|during|after):(\((.+?)\)\B|(\S+)\b)/) do + field, datestr = $1, ($3 || $4) + realdate = Chronic.parse datestr, :guess => false, :context => :past + if realdate + case field + when "after" + Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate.end}" + "date:#{realdate.end.to_i}..#{lastdate}" + when "before" + Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate.begin}" + "date:#{firstdate}..#{realdate.end.to_i}" + else + Redwood::log "chronic: translated #{field}:#{datestr} to #{realdate}" + "date:#{realdate.begin.to_i}..#{realdate.end.to_i}" + end + else + raise ParseError, "can't understand date #{datestr.inspect}" + end + end + end + + ## limit:42 restrict the search to 42 results + subs = subs.gsub(/\blimit:(\S+)\b/) do + lim = $1 + if lim =~ /^\d+$/ + query[:limit] = lim.to_i + '' + else + raise ParseError, "non-numeric limit #{lim.inspect}" + end + end + + qp = Xapian::QueryParser.new + qp.database = @xapian + qp.stemmer = 
Xapian::Stem.new(STEM_LANGUAGE) + qp.stemming_strategy = Xapian::QueryParser::STEM_SOME + qp.default_op = Xapian::Query::OP_AND + qp.add_valuerangeprocessor(Xapian::NumberValueRangeProcessor.new(DATE_VALUENO, 'date:', true)) + NORMAL_PREFIX.each { |k,v| qp.add_prefix k, v } + BOOLEAN_PREFIX.each { |k,v| qp.add_boolean_prefix k, v } + xapian_query = qp.parse_query(subs, Xapian::QueryParser::FLAG_PHRASE|Xapian::QueryParser::FLAG_BOOLEAN|Xapian::QueryParser::FLAG_LOVEHATE|Xapian::QueryParser::FLAG_WILDCARD, PREFIX['body']) + + raise ParseError if xapian_query.nil? or xapian_query.empty? + query[:qobj] = xapian_query + query[:text] = s + query + end + + private + + # Stemmed + NORMAL_PREFIX = { + 'subject' => 'S', + 'body' => 'B', + 'from_name' => 'FN', + 'to_name' => 'TN', + 'name' => 'N', + 'attachment' => 'A', + } + + # Unstemmed + BOOLEAN_PREFIX = { + 'type' => 'K', + 'from_email' => 'FE', + 'to_email' => 'TE', + 'email' => 'E', + 'date' => 'D', + 'label' => 'L', + 'source_id' => 'I', + 'attachment_extension' => 'O', + } + + PREFIX = NORMAL_PREFIX.merge BOOLEAN_PREFIX + + DATE_VALUENO = 0 + + # Xapian can very efficiently sort in ascending docid order. Sup always wants + # to sort by descending date, so this method maps between them. In order to + # handle multiple messages per second, we use a logistic curve centered + # around MIDDLE_DATE so that the slope (docid/s) is greatest in this time + # period. A docid collision is not an error - the code will pick the next + # smallest unused one. + DOCID_SCALE = 2.0**32 + TIME_SCALE = 2.0**27 + MIDDLE_DATE = Time.gm(2011) + def assign_docid m + t = (m.date.to_i - MIDDLE_DATE.to_i).to_f + docid = (DOCID_SCALE - DOCID_SCALE/(Math::E**(-(t/TIME_SCALE)) + 1)).to_i + begin + while @assigned_docids.member? 
[docid].pack("N") + docid -= 1 + end + rescue + end + @assigned_docids[[docid].pack("N")] = '' + docid + end + + def synchronize &b + @index_mutex.synchronize &b + end + + def run_query xapian_query, offset, limit, checkatleast=0 + synchronize do + @enquire.query = xapian_query + @enquire.mset(offset, limit-offset, checkatleast) + end + end + + def run_query_ids xapian_query, offset, limit + matchset = run_query xapian_query, offset, limit + matchset.matches.map { |r| r.document.data } + end + + Q = Xapian::Query + def build_xapian_query opts + labels = ([opts[:label]] + (opts[:labels] || [])).compact + neglabels = [:spam, :deleted, :killed].reject { |l| (labels.include? l) || opts.member?("load_#{l}".intern) } + pos_terms, neg_terms = [], [] + + pos_terms << mkterm(:type, 'mail') + pos_terms.concat(labels.map { |l| mkterm(:label,l) }) + pos_terms << opts[:qobj] if opts[:qobj] + pos_terms << mkterm(:source_id, opts[:source_id]) if opts[:source_id] + + if opts[:participants] + participant_terms = opts[:participants].map { |p| mkterm(:email,:any, (Redwood::Person === p) ? p.email : p) } + pos_terms << Q.new(Q::OP_OR, participant_terms) + end + + neg_terms.concat(neglabels.map { |l| mkterm(:label,l) }) + + pos_query = Q.new(Q::OP_AND, pos_terms) + neg_query = Q.new(Q::OP_OR, neg_terms) + + if neg_query.empty? 
+ pos_query + else + Q.new(Q::OP_AND_NOT, [pos_query, neg_query]) + end + end + + def index_message m, opts + terms = [] + text = [] + + subject_text = m.indexable_subject + body_text = m.indexable_body + + # Person names are indexed with several prefixes + person_termer = lambda do |d| + lambda do |p| + ["#{d}_name", "name", "body"].each do |x| + text << [p.name, PREFIX[x]] + end if p.name + [d, :any].each { |x| terms << mkterm(:email, x, p.email) } + end + end + + person_termer[:from][m.from] if m.from + (m.to+m.cc+m.bcc).each(&(person_termer[:to])) + + terms << mkterm(:date,m.date) if m.date + m.labels.each { |t| terms << mkterm(:label,t) } + terms << mkterm(:type, 'mail') + terms << mkterm(:source_id, m.source.id) + m.attachments.each do |a| + a =~ /\.(\w+)$/ or next + t = mkterm(:attachment_extension, $1) + terms << t + end + + # Full text search content + text << [subject_text, PREFIX['subject']] + text << [subject_text, PREFIX['body']] + text << [body_text, PREFIX['body']] + m.attachments.each { |a| text << [a, PREFIX['attachment']] } + + # Date value for range queries + date_value = Xapian.sortable_serialise(m.date.to_i) + + doc = Xapian::Document.new + docid = @docids[m.id] || assign_docid(m) + + @term_generator.document = doc + text.each { |text,prefix| @term_generator.index_text text, 1, prefix } + terms.each { |term| doc.add_term term } + doc.add_value DATE_VALUENO, date_value + doc.data = m.id + + @xapian.replace_document docid, doc + @docids[m.id] = docid + end + + # Construct a Xapian term + def mkterm type, *args + case type + when :label + PREFIX['label'] + args[0].to_s.downcase + when :type + PREFIX['type'] + args[0].to_s.downcase + when :date + PREFIX['date'] + args[0].getutc.strftime("%Y%m%d%H%M%S") + when :email + case args[0] + when :from then PREFIX['from_email'] + when :to then PREFIX['to_email'] + when :any then PREFIX['email'] + else raise "Invalid email term type #{args[0]}" + end + args[1].to_s.downcase + when :source_id + 
PREFIX['source_id'] + args[0].to_s.downcase + when :attachment_extension + PREFIX['attachment_extension'] + args[0].to_s.downcase + else + raise "Invalid term type #{type}" + end + end + + # Join all the given message-ids into a single thread + def union_threads ids + seen_threads = Set.new + related = Set.new + + # Get all the ids that will be in the new thread + ids.each do |id| + related << id + thread_id = @thread_ids[id] + if thread_id && !seen_threads.member?(thread_id) + thread_members = @thread_members[thread_id] + related.merge thread_members + seen_threads << thread_id + end + end + + # Pick a leader and move all the others to its thread + a = related.to_a + best, *rest = a.sort_by { |x| x.hash } + @thread_members[best] = a + @thread_ids[best] = best + rest.each do |x| + @thread_members.delete x + @thread_ids[x] = best + end + end +end + +end + +class MarshalledGDBM < GDBM + def []= k, v + super k, Marshal.dump(v) + end + + def [] k + v = super k + v ? Marshal.load(v) : nil + end +end -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
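The docid-assignment comment in the patch above rewards a closer look: Xapian streams results cheapest in ascending docid order, and sup always wants newest-first, so `assign_docid` maps dates through a *decreasing* logistic curve — newer messages get smaller docids, with the steepest slope (docids per second) around MIDDLE_DATE. A standalone illustration of the same formula, using the constants from the patch:

```ruby
# Date -> docid mapping from XapianIndex#assign_docid, in isolation.
DOCID_SCALE = 2.0**32
TIME_SCALE  = 2.0**27          # ~4.25 years; controls the curve's slope
MIDDLE_DATE = Time.gm(2011)    # center of the logistic curve

def docid_for date
  t = (date.to_i - MIDDLE_DATE.to_i).to_f
  # Decreasing logistic: t -> DOCID_SCALE * (1 - 1/(e^(-t/TIME_SCALE) + 1))
  (DOCID_SCALE - DOCID_SCALE / (Math::E**(-(t / TIME_SCALE)) + 1)).to_i
end

dates  = [Time.gm(1995), Time.gm(2005), Time.gm(2011), Time.gm(2020)]
docids = dates.map { |d| docid_for d }
```

Collisions (two messages in the same second) are handled in the patch by decrementing until an unused docid is found, which is why the comment calls a collision "not an error".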
* [sup-talk] [PATCH 16/18] fix String#ord monkeypatch 2009-06-20 20:50 ` [sup-talk] [PATCH 15/18] index: add xapian implementation Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 17/18] add limit argument to author_names_and_newness_for_thread Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup/util.rb | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/lib/sup/util.rb b/lib/sup/util.rb index 8f60cc4..0609908 100644 --- a/lib/sup/util.rb +++ b/lib/sup/util.rb @@ -282,7 +282,7 @@ class String gsub(/\t/, " ").gsub(/\r/, "") end - if not defined? ord + unless method_defined? :ord def ord self[0] end -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
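The one-line change in patch 16 fixes a subtle guard bug: inside `class String`, `defined? ord` asks whether the class object String itself responds to `ord` — it never does, so the guard always passed and the 1.8-era fallback clobbered Ruby 1.9's real `String#ord`. `method_defined? :ord` asks about *instance* methods, which is what was meant. A self-contained demonstration with a made-up class:

```ruby
class Widget
  def spin
    :real_spin
  end
end

class Widget
  # Broken guard: `spin` here is a method call on the class object
  # Widget, which defines no such method, so defined? is always nil
  # and the guard never prevents redefinition.
  guard_broken = !!defined?(spin)

  # Correct guard: asks whether instances define the method.
  guard_correct = method_defined?(:spin)

  GUARDS = [guard_broken, guard_correct]
end
```

With the broken guard, the monkeypatch body (`self[0]`) would have replaced 1.9's integer-returning `ord` with one returning a one-character string.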
* [sup-talk] [PATCH 17/18] add limit argument to author_names_and_newness_for_thread 2009-06-20 20:50 ` [sup-talk] [PATCH 16/18] fix String#ord monkeypatch Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 18/18] dont using SavingHash#[] for membership test Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup/modes/thread-index-mode.rb | 15 ++++++++++----- 1 files changed, 10 insertions(+), 5 deletions(-) diff --git a/lib/sup/modes/thread-index-mode.rb b/lib/sup/modes/thread-index-mode.rb index 0bd8110..b671119 100644 --- a/lib/sup/modes/thread-index-mode.rb +++ b/lib/sup/modes/thread-index-mode.rb @@ -1,3 +1,5 @@ +require 'set' + module Redwood ## subclasses should implement: @@ -757,10 +759,12 @@ protected def authors; map { |m, *o| m.from if m }.compact.uniq; end - def author_names_and_newness_for_thread t + def author_names_and_newness_for_thread t, limit=nil new = {} - authors = t.map do |m, *o| + authors = Set.new + t.each do |m, *o| next unless m + break if limit and authors.size >= limit name = if AccountManager.is_account?(m.from) @@ -772,12 +776,13 @@ protected end new[name] ||= m.has_label?(:unread) - name + authors << name end - authors.compact.uniq.map { |a| [a, new[a]] } + authors.to_a.map { |a| [a, new[a]] } end + AUTHOR_LIMIT = 5 def text_for_thread_at line t, size_widget = @mutex.synchronize { [@threads[line], @size_widgets[line]] } @@ -787,7 +792,7 @@ protected ## format the from column cur_width = 0 - ann = author_names_and_newness_for_thread t + ann = author_names_and_newness_for_thread t, AUTHOR_LIMIT from = [] ann.each_with_index do |(name, newness), i| break if cur_width >= from_width -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
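The Set-plus-break rewrite in patch 17 matters for huge threads: the old code mapped every message and uniq'd afterwards, while the new code stops scanning as soon as `limit` distinct author names have been collected. The core of the change, reduced to a toy function:

```ruby
require 'set'

# Collect at most `limit` distinct author names, stopping the scan
# early instead of mapping the whole thread and uniq-ing afterwards.
def first_authors froms, limit=nil
  authors = Set.new
  froms.each do |name|
    break if limit && authors.size >= limit
    authors << name
  end
  authors.to_a   # Ruby Sets preserve insertion order
end
```

For a thread-index line that only has room for a handful of names (AUTHOR_LIMIT = 5 in the patch), this bounds the work per displayed thread regardless of thread length.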
* [sup-talk] [PATCH 18/18] dont using SavingHash#[] for membership test 2009-06-20 20:50 ` [sup-talk] [PATCH 17/18] add limit argument to author_names_and_newness_for_thread Rich Lane @ 2009-06-20 20:50 ` Rich Lane 2009-06-22 14:46 ` Andrei Thorp 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-06-20 20:50 UTC (permalink / raw) --- lib/sup/thread.rb | 8 +++++--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/lib/sup/thread.rb b/lib/sup/thread.rb index 99f21dc..d395c35 100644 --- a/lib/sup/thread.rb +++ b/lib/sup/thread.rb @@ -310,13 +310,15 @@ class ThreadSet private :prune_thread_of def remove_id mid - return unless(c = @messages[mid]) + return unless @messages.member?(mid) + c = @messages[mid] remove_container c prune_thread_of c end def remove_thread_containing_id mid - c = @messages[mid] or return + return unless @messages.member?(mid) + c = @messages[mid] t = c.root.thread @threads.delete_if { |key, thread| t == thread } end @@ -355,7 +357,7 @@ class ThreadSet return if threads.size < 2 containers = threads.map do |t| - c = @messages[t.first.id] + c = @messages.member?(t.first.id) ? @messages[t.first.id] : nil raise "not in threadset: #{t.first.id}" unless c && c.message c end -- 1.6.0.4 ^ permalink raw reply [flat|nested] 44+ messages in thread
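For context on why `#[]` is the wrong membership test in patch 18: judging by its use here, SavingHash auto-vivifies a container on first read, like a Hash with a default block — so `@messages[mid]` both stores a junk entry and returns something truthy, meaning a `return unless` guard built on it could never fire. A demonstration with a plain default-block Hash (the SavingHash semantics are inferred, not quoted from sup):

```ruby
# Stand-in for SavingHash: reading a missing key allocates *and
# stores* a fresh container.
messages = Hash.new { |h, k| h[k] = { :id => k } }

# Using #[] as a membership test creates a garbage entry and is
# always truthy, so `return unless messages[mid]` never bails out:
probe = messages["no-such-id"]

# member?/key? consult only stored entries, never the default block:
missing = messages.member?("some-other-id")
```

Hence the patch's pattern: check `member?` first, and only then read the value through `#[]`.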
* [sup-talk] [PATCH 18/18] dont using SavingHash#[] for membership test 2009-06-20 20:50 ` [sup-talk] [PATCH 18/18] dont using SavingHash#[] for membership test Rich Lane @ 2009-06-22 14:46 ` Andrei Thorp 0 siblings, 0 replies; 44+ messages in thread From: Andrei Thorp @ 2009-06-22 14:46 UTC (permalink / raw) Wow, that's one heck of a set of patches... good work dude :) -AT -- Andrei Thorp, Developer: Xandros Corp. (http://www.xandros.com) Make it idiot-proof, and someone will breed a better idiot. -- Oliver Elphick ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-06-20 20:49 [sup-talk] [PATCH 0/18] Xapian-based index Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 01/18] remove load_entry_for_id call in sup-recover-sources Rich Lane @ 2009-06-24 16:30 ` William Morgan 2009-06-24 17:33 ` William Morgan 1 sibling, 1 reply; 44+ messages in thread From: William Morgan @ 2009-06-24 16:30 UTC (permalink / raw) Hi Rich, Reformatted excerpts from Rich Lane's message of 2009-06-20: > This patch series refactors the Index class to remove Ferret-isms and > support multiple index implementations. The included XapianIndex is a > bit faster at indexing messages and significantly faster when > searching because it precomputes thread membership. It also works on > Ruby 1.9.1. This is great. Really, really great. You've refactored a crufty interface that's been growing untamed over the past three years, you've gotten us away from the unmaintained scariness that is Ferret, you've fixed the largest source of interface slowness (thread recomputation), and you've enabled us to move to the beautiful, speedy, encoding-aware world of Ruby 1.9. Thank you for satisfying all of my Sup-related desires in one fell swoop. From my lofty throne, I commend thee. Once the bugs are ironed out, I would like to make this the default index format and eventually deprecate Ferret. In the mean time, I've placed your patches on a branch called xapian. If anyone wants to play with this, here's what you do: 1. install the ruby xapian library and the ruby gdbm library, if you don't have them. These are packaged by your distro, and are not gems. 2. git fetch 3. git checkout -b xapian origin/xapian 4. cp ~/.sup/sources.yaml /tmp # just in case 5. sup-dump > dumpfile 6. SUP_INDEX=xapian sup-sync --all --all-sources --restore dumpfile 7. SUP_INDEX=xapian bin/sup -o 8. Oooh, fast. This should not disturb your Ferret index, so you can switch back and forth between the two. (Message state, of course, is not shared.) 
However, adding new messages to one index will prevent them from being automatically added to the other, so I recommend running in Xapian mode with -o and not pressing 'P'. > It's missing a couple of features, notably threading by subject. FWIW, I've been thinking about deprecating that particular feature for quite some time. > I'm sure there are many more bugs left, so I'd appreciate any testing > or review you all can provide. sup-sync crashes for me fairly systematically with this error: ./lib/sup/xapian_index.rb:404:in `sortable_serialise': Expected argument 0 of type double, but got Fixnum 51767811298 (TypeError) in SWIG method 'Xapian::sortable_serialise' from ./lib/sup/xapian_index.rb:404:in `index_message' from ./lib/sup/xapian_index.rb:111:in `sync_message' from /usr/lib/ruby/1.8/monitor.rb:242:in `synchronize' from ./lib/sup/xapian_index.rb:324:in `synchronize' from ./lib/sup/xapian_index.rb:110:in `sync_message' from ./lib/sup/util.rb:519:in `send' from ./lib/sup/util.rb:519:in `method_missing' from ./lib/sup/poll.rb:157:in `add_messages_from' from ./lib/sup/source.rb:100:in `each' from ./lib/sup/util.rb:558:in `send' from ./lib/sup/util.rb:558:in `__pass' from ./lib/sup/util.rb:545:in `method_missing' from ./lib/sup/poll.rb:141:in `add_messages_from' from ./lib/sup/util.rb:519:in `send' from ./lib/sup/util.rb:519:in `method_missing' from bin/sup-sync:140 from bin/sup-sync:135:in `each' from bin/sup-sync:135 I haven't spent any time tracking it down. Other than that, so far so good. -- William <wmorgan-sup at masanjin.net> ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-06-24 16:30 ` [sup-talk] [PATCH 0/18] Xapian-based index William Morgan @ 2009-06-24 17:33 ` William Morgan 2009-06-26 2:00 ` Olly Betts 0 siblings, 1 reply; 44+ messages in thread From: William Morgan @ 2009-06-24 17:33 UTC (permalink / raw) Reformatted excerpts from William Morgan's message of 2009-06-24: > sup-sync crashes for me fairly systematically with this error: > > ./lib/sup/xapian_index.rb:404:in `sortable_serialise': Expected argument 0 of > type double, but got Fixnum 51767811298 (TypeError) This turns out to be due to dates being far in the future (e.g. on spam messages). I'm using the attached patch, which is pretty much a hack, to force them to be between 1969 and 2038. Better solutions welcome. (I haven't committed this.) -- William <wmorgan-sup at masanjin.net> -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-bugfix-dates-need-to-be-truncated-for-xapian-to-ind.patch Type: application/octet-stream Size: 2484 bytes Desc: not available URL: <http://rubyforge.org/pipermail/sup-talk/attachments/20090624/fae22ca3/attachment.obj> ^ permalink raw reply [flat|nested] 44+ messages in thread
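The attached patch itself was scrubbed from the archive, but the clamping described — forcing dates into a range that Xapian's sortable_serialise can digest — might look something like this minimal Ruby sketch. The bounds and the method name here are assumptions, not the actual committed patch:

```ruby
# Hypothetical sketch of the date-truncation hack described above.
# Clamp timestamps into the signed 32-bit time_t range so the value
# handed to Xapian::sortable_serialise is never absurdly large.
MIN_DATE = Time.at(0)         # the epoch: 1969 or 1970 depending on local timezone
MAX_DATE = Time.at(2**31 - 1) # 2038-01-19, the 32-bit rollover

def clamp_date date
  if date < MIN_DATE then MIN_DATE
  elsif date > MAX_DATE then MAX_DATE
  else date
  end
end

# The Fixnum from the traceback corresponds to a date around the year 3610,
# so it gets clamped down to MAX_DATE:
clamp_date Time.at(51767811298)
```
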
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-06-24 17:33 ` William Morgan @ 2009-06-26 2:00 ` Olly Betts 2009-06-26 13:49 ` William Morgan 0 siblings, 1 reply; 44+ messages in thread From: Olly Betts @ 2009-06-26 2:00 UTC (permalink / raw) William Morgan <wmorgan-sup at masanjin.net> writes: > Reformatted excerpts from William Morgan's message of 2009-06-24: > > sup-sync crashes for me fairly systematically with this error: > > > > ./lib/sup/xapian_index.rb:404:in `sortable_serialise': Expected argument 0 of > > type double, but got Fixnum 51767811298 (TypeError) > > This turns out to be due to dates being far in the future (e.g. on spam > messages). I'm using the attached patch, which is pretty much a hack, to > force them to be between 1969 and 2038. Better solutions welcome. (I > haven't committed this.) The error you get here is actually a bug in SWIG (http://www.swig.org/). Xapian uses SWIG to generate the wrappers for Ruby. The code SWIG currently uses to convert a parameter when a C/C++ function takes a double doesn't handle a "fixnum" which is larger than MAXINT. I've just applied a fix to SWIG SVN: http://swig.svn.sourceforge.net/viewvc/swig?view=rev&revision=11320 I'll make sure this fix makes it into the next Xapian release (which will be 1.0.14). Cheers, Olly ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-06-26 2:00 ` Olly Betts @ 2009-06-26 13:49 ` William Morgan 2009-07-17 23:42 ` Richard Heycock 2009-07-28 13:47 ` Olly Betts 0 siblings, 2 replies; 44+ messages in thread From: William Morgan @ 2009-06-26 13:49 UTC (permalink / raw) Reformatted excerpts from Olly Betts's message of 2009-06-25: > I'll make sure this fix makes it into the next Xapian release (which > will be 1.0.14). Awesome, thanks! Though even with SWIG fixed there will still be some tweaking necessary in Sup because the logistic function used for generating Xapian docids still has trouble with extreme dates. BTW, more kudos to Rich for somehow finding a way to use a logistic function in an email client. -- William <wmorgan-sup at masanjin.net> ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-06-26 13:49 ` William Morgan @ 2009-07-17 23:42 ` Richard Heycock 2009-07-23 10:23 ` Adeodato Simó 2009-07-28 13:47 ` Olly Betts 1 sibling, 1 reply; 44+ messages in thread From: Richard Heycock @ 2009-07-17 23:42 UTC (permalink / raw) Excerpts from William Morgan's message of Fri Jun 26 23:49:40 +1000 2009: > Reformatted excerpts from Olly Betts's message of 2009-06-25: > > I'll make sure this fix makes it into the next Xapian release (which > > will be 1.0.14). > > Awesome, thanks! > > Though even with SWIG fixed there will still be some tweaking necessary > in Sup because the logistic function used for generating Xapian docids > still has trouble with extreme dates. > > BTW, more kudos to Rich for somehow finding a way to use a logistic > function in an email client. I've been meaning to respond to this since the day it was posted. Rich Lane, thank you, thank you. Ferret was one of my biggest gripes about using sup. I've used it elsewhere and it's a shocker; I eventually migrated it all to Xapian, which has worked flawlessly since. At one stage I used to rebuild my ferret index almost on a weekly basis (I'm running debian unstable, which at the moment is really living up to its name), something I haven't had to do once since migrating to Xapian. I got it to work with 1.9 once, but there are some problems that I just haven't had the time to look into; I will do so and post any problems to the list. rgh ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-17 23:42 ` Richard Heycock @ 2009-07-23 10:23 ` Adeodato Simó 2009-07-25 4:53 ` Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Adeodato Simó @ 2009-07-23 10:23 UTC (permalink / raw) + Richard Heycock (Sat, 18 Jul 2009 09:42:07 +1000): > I've been meaning to respond to this since the day it was posted. Rich Lane, > thank you, thank you. Ferret was one of my biggest gripes about using sup. > I've used it elsewhere and it's a shocker; I eventually migrated it all > to Xapian, which has worked flawlessly since. At one stage I used to rebuild my ferret > index almost on a weekly basis (I'm running debian unstable, which at > the moment is really living up to its name), something I > haven't had to do once since migrating to Xapian. Yeah, thanks Rich! However, there seems to be something wrong with the parsing of contacts. After reindexing with Xapian, my contact list has entries like: <dato <dato at net.com.org.esadeodato <other <other at foo.ua.esfoo dato at net.com.org.esAdeodato Simo other2 at domain.netother2 surname2 Plus, neither '!label:inbox' nor '-label:inbox' works for me. From an inspection of the code, it doesn't look to me as if negated labels are being parsed. Any hints? -- - Are you sure we're good? - Always. -- Rory and Lorelai ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-23 10:23 ` Adeodato Simó @ 2009-07-25 4:53 ` Rich Lane 2009-07-25 9:21 ` Adeodato Simó 2009-07-27 15:46 ` William Morgan 0 siblings, 2 replies; 44+ messages in thread From: Rich Lane @ 2009-07-25 4:53 UTC (permalink / raw) > Yeah, thanks Rich! However, there seems to be something wrong with the > parsing of contacts. After reindexing with Xapian, my contact list has > entries like: > > <dato <dato at net.com.org.esadeodato > <other <other at foo.ua.esfoo > dato at net.com.org.esAdeodato Simo other2 at domain.netother2 surname2 Thanks for the bug report, I've posted a patch (fix-mk_addrs-args) to fix this. You shouldn't need to reindex after applying the patch. > Plus, neither '!label:inbox' nor '-label:inbox' works for me. From an > inspection of the code, it doesn't look to me as if negated labels > are being parsed. > > Any hints? You need to specify a non-negated term in the query. "type:mail -label:inbox" should work. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-25 4:53 ` Rich Lane @ 2009-07-25 9:21 ` Adeodato Simó 2009-07-25 19:59 ` Rich Lane 2009-07-27 15:46 ` William Morgan 1 sibling, 1 reply; 44+ messages in thread From: Adeodato Simó @ 2009-07-25 9:21 UTC (permalink / raw) + Rich Lane (Sat, 25 Jul 2009 06:53:07 +0200): > Thanks for the bug report, I've posted a patch (fix-mk_addrs-args) to > fix this. You shouldn't need to reindex after applying the patch. Great, thanks. The patch indeed fixes the issue. > > Plus, neither '!label:inbox' nor '-label:inbox' works for me. From an > > inspection of the code, it doesn't look to me as if negated labels > > are being parsed. > > Any hints? > You need to specify a non-negated term in the query. > "type:mail -label:inbox" should work. Oh, I see. Yes, that works, thanks. One extra issue I just noticed: after dumping with ferret, reloading into Xapian, and doing a dump again (with Xapian this time), all the messages tagged "deleted" or "spam" do not appear in the dump at all. Any ideas? -- - Are you sure we're good? - Always. -- Rory and Lorelai ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-25 9:21 ` Adeodato Simó @ 2009-07-25 19:59 ` Rich Lane 2009-07-25 23:28 ` Ingmar Vanhassel 2009-07-27 15:48 ` William Morgan 0 siblings, 2 replies; 44+ messages in thread From: Rich Lane @ 2009-07-25 19:59 UTC (permalink / raw) Excerpts from Adeodato Sim?'s message of Sat Jul 25 05:21:16 -0400 2009: > One extra issue I just noticed: after dumping with ferret, reloading > into Xapian, and doing a dump again (with Xapian this time), all the > messages tagged "deleted" or "spam" do not appear in the dump at all. > Any ideas? The patch "xapian: dont exclude spam..." should fix this. One issue I've noticed is that removing labels from messages doesn't always immediately work. For example, label-list-mode shows a label as having some unread messages even though all of them are actually read. This tends to happen only after sup's been running for a while and restarting sup fixes it. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-25 19:59 ` Rich Lane @ 2009-07-25 23:28 ` Ingmar Vanhassel 2009-07-27 15:48 ` William Morgan 1 sibling, 0 replies; 44+ messages in thread From: Ingmar Vanhassel @ 2009-07-25 23:28 UTC (permalink / raw) Excerpts from Rich Lane's message of Sat Jul 25 21:59:19 +0200 2009: > One issue I've noticed is that removing labels from messages doesn't > always immediately work. For example, label-list-mode shows a label as > having some unread messages even though all of them are actually read. > This tends to happen only after sup's been running for a while and > restarting sup fixes it. I was just about to report that. :) Besides that, the Xapian index works very nicely. So I'd be happy to see it in next when that last regression (as far as my testing showed) is fixed! -- Exherbo KDE, X.org maintainer ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-25 19:59 ` Rich Lane 2009-07-25 23:28 ` Ingmar Vanhassel @ 2009-07-27 15:48 ` William Morgan 2009-07-27 16:56 ` Ingmar Vanhassel 2009-07-27 17:06 ` Rich Lane 1 sibling, 2 replies; 44+ messages in thread From: William Morgan @ 2009-07-27 15:48 UTC (permalink / raw) Reformatted excerpts from Rich Lane's message of 2009-07-25: > One issue I've noticed is that removing labels from messages doesn't > always immediately work. Is this true even after you sync changes to the index? What about if you reload the label list buffer? ('@') -- William <wmorgan-sup at masanjin.net> ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-27 15:48 ` William Morgan @ 2009-07-27 16:56 ` Ingmar Vanhassel 2009-09-01 8:07 ` Ingmar Vanhassel 2009-07-27 17:06 ` Rich Lane 1 sibling, 1 reply; 44+ messages in thread From: Ingmar Vanhassel @ 2009-07-27 16:56 UTC (permalink / raw) Excerpts from William Morgan's message of Mon Jul 27 17:48:38 +0200 2009: > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > One issue I've noticed is that removing labels from messages doesn't > > always immediately work. > > Is this true even after you sync changes to the index? What about if you > reload the label list buffer? ('@') It's true in both cases. Even after a sync, 'U' still produces read messages (among unread), and a search for label:foo has threads without that label. If you quit sup & restart it things work as expected for a while. I've also noticed that sup takes a long time to quit with the xapian index. This delay happens after this message: [Mon Jul 27 16:56:01 +0000 2009] unlocking /home/ingmar/.sup/lock... -- Exherbo KDE, X.org maintainer ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-27 16:56 ` Ingmar Vanhassel @ 2009-09-01 8:07 ` Ingmar Vanhassel 2009-09-03 16:52 ` Rich Lane 0 siblings, 1 reply; 44+ messages in thread From: Ingmar Vanhassel @ 2009-09-01 8:07 UTC (permalink / raw) Excerpts from Ingmar Vanhassel's message of Mon Jul 27 18:56:28 +0200 2009: > Excerpts from William Morgan's message of Mon Jul 27 17:48:38 +0200 2009: > > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > > One issue I've noticed is that removing labels from messages doesn't > > > always immediately work. > > > > Is this true even after you sync changes to the index? What about if you > > reload the label list buffer? ('@') > > It's true in both cases. Even after a sync, 'U' still produces read > messages (among unread), and a search for label:foo has threads without > that label. If you quit sup & restart it things work as expected for a > while. I can still reproduce this for a more specific case, with xapian 1.0.15. Searching for is:unread (hit U), works as expected. When I filter that with threads having a second label (hit |, label:foo), then it shows threads with label:foo, but it loses the is:unread constraint. Same for immediately doing is:unread label:foo, which gives me unread threads, but not always with the foo label. -- Exherbo KDE, X.org maintainer ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-09-01 8:07 ` Ingmar Vanhassel @ 2009-09-03 16:52 ` Rich Lane 0 siblings, 0 replies; 44+ messages in thread From: Rich Lane @ 2009-09-03 16:52 UTC (permalink / raw) Excerpts from Ingmar Vanhassel's message of Tue Sep 01 04:07:27 -0400 2009: > Excerpts from Ingmar Vanhassel's message of Mon Jul 27 18:56:28 +0200 2009: > > Excerpts from William Morgan's message of Mon Jul 27 17:48:38 +0200 2009: > > > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > > > One issue I've noticed is that removing labels from messages doesn't > > > > always immediately work. > > > > > > Is this true even after you sync changes to the index? What about if you > > > reload the label list buffer? ('@') > > > > It's true in both cases. Even after a sync, 'U' still produces read > > messages (among unread), and a search for label:foo has threads without > > that label. If you quit sup & restart it things work as expected for a > > while. > > I can still reproduce this for a more specific case, with xapian 1.0.15. > > Searching for is:unread (hit U), works as expected. When I filter > that with threads having a second label (hit |, label:foo), then it > shows threads with label:foo, but it loses the is:unread constraint. > > Same for immediately doing is:unread label:foo, which gives me unread > threads, but not always with the foo label. I've reproduced this and it looks like a query parsing problem. Multiple terms on the same field are OR'd together instead of AND [1]. Adding an explicit AND works. I'll see if Xapian::QueryParser can be convinced to do what we want here. [1] http://trac.xapian.org/ticket/157 ^ permalink raw reply [flat|nested] 44+ messages in thread
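A toy illustration (plain Ruby over hash-based posting lists, not Xapian code) of the parsing problem described above, assuming the is: and label: terms end up on the same underlying field: OR'ing the posting lists admits threads matching either term, while AND'ing keeps only threads matching both.

```ruby
# Term -> list of matching document ids.
postings = {
  "label:unread" => [1, 2],
  "label:foo"    => [2, 3],
}

# Reported behaviour: same-field terms combined with OR,
# so the query matches documents carrying either label.
or_match = postings["label:unread"] | postings["label:foo"]

# Intended behaviour: explicit AND,
# matching only documents that carry both labels.
and_match = postings["label:unread"] & postings["label:foo"]
```
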
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-27 15:48 ` William Morgan 2009-07-27 16:56 ` Ingmar Vanhassel @ 2009-07-27 17:06 ` Rich Lane 2009-07-31 16:20 ` Rich Lane 1 sibling, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-07-27 17:06 UTC (permalink / raw) Excerpts from William Morgan's message of Mon Jul 27 11:48:38 -0400 2009: > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > One issue I've noticed is that removing labels from messages doesn't > > always immediately work. > > Is this true even after you sync changes to the index? What about if you > reload the label list buffer? ('@') Yes. This is looking like a Xapian bug - I've reproduced it without any Sup code. I'm working on a fix. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-27 17:06 ` Rich Lane @ 2009-07-31 16:20 ` Rich Lane 2009-08-12 13:05 ` Ingmar Vanhassel 0 siblings, 1 reply; 44+ messages in thread From: Rich Lane @ 2009-07-31 16:20 UTC (permalink / raw) Excerpts from Rich Lane's message of Mon Jul 27 13:06:34 -0400 2009: > Excerpts from William Morgan's message of Mon Jul 27 11:48:38 -0400 2009: > > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > > One issue I've noticed is that removing labels from messages doesn't > > > always immediately work. > > > > Is this true even after you sync changes to the index? What about if you > > reload the label list buffer? ('@') > > Yes. This is looking like a Xapian bug - I've reproduced it without any > Sup code. I'm working on a fix. I've fixed this, it should be released in Xapian 1.0.15. Or, grab Xapian SVN and you can try out the Chert backend too (XAPIAN_PREFER_CHERT=1). ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-31 16:20 ` Rich Lane @ 2009-08-12 13:05 ` Ingmar Vanhassel 2009-08-12 14:32 ` Nicolas Pouillard 2009-08-14 5:23 ` Rich Lane 0 siblings, 2 replies; 44+ messages in thread From: Ingmar Vanhassel @ 2009-08-12 13:05 UTC (permalink / raw) Excerpts from Rich Lane's message of Fri Jul 31 18:20:41 +0200 2009: > Excerpts from Rich Lane's message of Mon Jul 27 13:06:34 -0400 2009: > > Excerpts from William Morgan's message of Mon Jul 27 11:48:38 -0400 2009: > > > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > > > One issue I've noticed is that removing labels from messages doesn't > > > > always immediately work. > > > > > > Is this true even after you sync changes to the index? What about if you > > > reload the label list buffer? ('@') > > > > Yes. This is looking like a Xapian bug - I've reproduced it without any > > Sup code. I'm working on a fix. > > I've fixed this, it should be released in Xapian 1.0.15. Or, grab Xapian > SVN and you can try out the Chert backend too (XAPIAN_PREFER_CHERT=1). Could you point me to the SVN revision containing the fix? I'd like to backport the fix to my Xapian 1.0.14 packages, pending 1.0.15 release. Thanks! -- Exherbo KDE, X.org maintainer ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-08-12 13:05 ` Ingmar Vanhassel @ 2009-08-12 14:32 ` Nicolas Pouillard 0 siblings, 0 replies; 44+ messages in thread From: Nicolas Pouillard @ 2009-08-12 14:32 UTC (permalink / raw) Excerpts from Ingmar Vanhassel's message of Wed Aug 12 15:05:35 +0200 2009: > Excerpts from Rich Lane's message of Fri Jul 31 18:20:41 +0200 2009: > > Excerpts from Rich Lane's message of Mon Jul 27 13:06:34 -0400 2009: > > > Excerpts from William Morgan's message of Mon Jul 27 11:48:38 -0400 2009: > > > > Reformatted excerpts from Rich Lane's message of 2009-07-25: > > > > > One issue I've noticed is that removing labels from messages doesn't > > > > > always immediately work. > > > > > > > > Is this true even after you sync changes to the index? What about if you > > > > reload the label list buffer? ('@') > > > > > > Yes. This is looking like a Xapian bug - I've reproduced it without any > > > Sup code. I'm working on a fix. > > > > I've fixed this, it should be released in Xapian 1.0.15. Or, grab Xapian > > SVN and you can try out the Chert backend too (XAPIAN_PREFER_CHERT=1). BTW, has anyone successfully built Xapian under Mac OS X? -- Nicolas Pouillard http://nicolaspouillard.fr ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-08-12 13:05 ` Ingmar Vanhassel 2009-08-12 14:32 ` Nicolas Pouillard @ 2009-08-14 5:23 ` Rich Lane 1 sibling, 0 replies; 44+ messages in thread From: Rich Lane @ 2009-08-14 5:23 UTC (permalink / raw) Excerpts from Ingmar Vanhassel's message of Wed Aug 12 09:05:35 -0400 2009: > Could you point me to the SVN revision containing the fix? I'd like to > backport the fix to my Xapian 1.0.14 packages, pending 1.0.15 release. Revision 13219. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-25 4:53 ` Rich Lane 2009-07-25 9:21 ` Adeodato Simó @ 2009-07-27 15:46 ` William Morgan 2009-07-28 16:53 ` Olly Betts 1 sibling, 1 reply; 44+ messages in thread From: William Morgan @ 2009-07-27 15:46 UTC (permalink / raw) Reformatted excerpts from Rich Lane's message of 2009-07-24: > > Plus, neither '!label:inbox' nor '-label:inbox' works for me. From an > > inspection of the code, it doesn't look to me as if negated > > labels are being parsed. > > > > Any hints? > > You need to specify a non-negated term in the query. "type:mail > -label:inbox" should work. This is a typical restriction for inverted index-based search engines. You need to have at least one positive term or the computation is too expensive (it would have to iterate over every term ever seen). It's true of Ferret, Google, etc. -- William <wmorgan-sup at masanjin.net> ^ permalink raw reply [flat|nested] 44+ messages in thread
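A minimal sketch (a toy hash-based inverted index, not Ferret or Xapian code) of why this restriction exists: the index only maps terms to posting lists, so a lone negated term leaves nothing to iterate over, whereas a positive term supplies a candidate set to subtract from.

```ruby
# Term -> list of matching document ids (a toy inverted index).
index = {
  "type:mail"   => [1, 2, 3, 4],
  "label:inbox" => [2, 4],
}

# "type:mail -label:inbox": start from the positive term's posting list,
# then subtract the negated term's posting list. With no positive term
# there would be no candidate list to start from.
def filter_out index, positive, negative
  index.fetch(positive, []) - index.fetch(negative, [])
end

filter_out index, "type:mail", "label:inbox"  # documents 1 and 3
```
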
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-27 15:46 ` William Morgan @ 2009-07-28 16:53 ` Olly Betts 2009-07-28 17:01 ` William Morgan 0 siblings, 1 reply; 44+ messages in thread From: Olly Betts @ 2009-07-28 16:53 UTC (permalink / raw) William Morgan <wmorgan-sup at masanjin.net> writes: > Reformatted excerpts from Rich Lane's message of 2009-07-24: > > You need to specify a non-negated term in the query. "type:mail > > -label:inbox" should work. > > This is a typical restriction for inverted index-based search engines. > You need to have at least one positive term or the computation is too > expensive (it would have to iterate over every term ever seen.) It's > true of Ferret, Google, etc. Actually, Xapian supports this - Xapian.Query.new("") is a "magic" query which matches all documents. It doesn't need to iterate over every term, just all documents. But if you want the top ten documents without a particular filter, there's no relevance ranking, so it can stop after it has found ten matches, which should be pretty quick. This isn't currently supported by the QueryParser when using "-" on terms (the reasoning was that it was too easy to accidentally invoke when pasting text), but 'NOT label:inbox' will work if you enable it using QueryParser.FLAG_PURE_NOT. Cheers, Olly ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-28 16:53 ` Olly Betts @ 2009-07-28 17:01 ` William Morgan 0 siblings, 0 replies; 44+ messages in thread From: William Morgan @ 2009-07-28 17:01 UTC (permalink / raw) Reformatted excerpts from Olly Betts's message of 2009-07-28: > Actually, Xapian supports this - Xapian.Query.new("") is a "magic" > query which matches all documents. Yeah, I think Rich Lane just taught me how Ferret supports this too. -- William <wmorgan-sup at masanjin.net> ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-06-26 13:49 ` William Morgan 2009-07-17 23:42 ` Richard Heycock @ 2009-07-28 13:47 ` Olly Betts 2009-07-28 15:07 ` William Morgan 1 sibling, 1 reply; 44+ messages in thread From: Olly Betts @ 2009-07-28 13:47 UTC (permalink / raw) William Morgan <wmorgan-sup at masanjin.net> writes: > Reformatted excerpts from Olly Betts's message of 2009-06-25: > > I'll make sure this fix makes it into the next Xapian release (which > > will be 1.0.14). > > Awesome, thanks! Just to update, Xapian 1.0.14 was released last week with this fix. I tested with a distilled micro-testcase rather than sup and these patches, so if you still see problems please open a ticket on http://trac.xapian.org/ Cheers, Olly ^ permalink raw reply [flat|nested] 44+ messages in thread
* [sup-talk] [PATCH 0/18] Xapian-based index 2009-07-28 13:47 ` Olly Betts @ 2009-07-28 15:07 ` William Morgan 0 siblings, 0 replies; 44+ messages in thread From: William Morgan @ 2009-07-28 15:07 UTC (permalink / raw) Reformatted excerpts from Olly Betts's message of 2009-07-28: > Just to update, Xapian 1.0.14 was released last week with this fix. > > I tested with a distilled micro-testcase rather than sup and these patches, > so if you still see problems please open a ticket on http://trac.xapian.org/ Excellent. Thank you. -- William <wmorgan-sup at masanjin.net> ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2009-09-03 16:52 UTC | newest] Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-06-20 20:49 [sup-talk] [PATCH 0/18] Xapian-based index Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 01/18] remove load_entry_for_id call in sup-recover-sources Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 02/18] remove load_entry_for_id call in DraftManager.discard Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 03/18] remove ferret entry from poll/sync interface Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 04/18] index: remove unused method load_entry_for_id Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 05/18] switch DraftManager to use Message.build_from_source Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 06/18] index: move has_any_from_source_with_label? to sup-sync-back Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 07/18] move source-related methods to SourceManager Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 08/18] index: remove unused method fresh_thread_id Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 09/18] index: revert overeager opts->query rename in each_message_in_thread_for Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 10/18] index: make wrap_subj methods private Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 11/18] index: move Ferret-specific code to ferret_index.rb Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 12/18] remove last external uses of ferret docid Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 13/18] add Message.indexable_{body, chunks, subject} Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 14/18] index: choose index implementation with config entry or environment variable Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 15/18] index: add xapian implementation Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 16/18] fix String#ord monkeypatch Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 17/18] add limit argument to 
author_names_and_newness_for_thread Rich Lane 2009-06-20 20:50 ` [sup-talk] [PATCH 18/18] dont using SavingHash#[] for membership test Rich Lane 2009-06-22 14:46 ` Andrei Thorp 2009-06-24 16:30 ` [sup-talk] [PATCH 0/18] Xapian-based index William Morgan 2009-06-24 17:33 ` William Morgan 2009-06-26 2:00 ` Olly Betts 2009-06-26 13:49 ` William Morgan 2009-07-17 23:42 ` Richard Heycock 2009-07-23 10:23 ` Adeodato Simó 2009-07-25 4:53 ` Rich Lane 2009-07-25 9:21 ` Adeodato Simó 2009-07-25 19:59 ` Rich Lane 2009-07-25 23:28 ` Ingmar Vanhassel 2009-07-27 15:48 ` William Morgan 2009-07-27 16:56 ` Ingmar Vanhassel 2009-09-01 8:07 ` Ingmar Vanhassel 2009-09-03 16:52 ` Rich Lane 2009-07-27 17:06 ` Rich Lane 2009-07-31 16:20 ` Rich Lane 2009-08-12 13:05 ` Ingmar Vanhassel 2009-08-12 14:32 ` Nicolas Pouillard 2009-08-14 5:23 ` Rich Lane 2009-07-27 15:46 ` William Morgan 2009-07-28 16:53 ` Olly Betts 2009-07-28 17:01 ` William Morgan 2009-07-28 13:47 ` Olly Betts 2009-07-28 15:07 ` William Morgan
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox