Chasen is the worst of the tokenizers; it is pretty old. The best one is
MeCab, which is the fastest and is from the same author as Chasen. You can
see all the major Japanese tokenizers in action at this URL:
http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some text in the box
and press the button.

After some hacking I got a Heliotrope server that works perfectly with
Japanese text. All I did was follow your comments and apply the MeCab
tokenizer to the message body and query strings before passing them to
Whistlepig, or more specifically to Heliotrope::Index.

There is one problem I don't see how to handle... I receive email in
Japanese, but also in Chinese and Korean. I need a different tokenizer for
each language and I have no idea how to choose between them. Do email
messages contain a language header that would allow me to identify the
language and pass the text to the corresponding tokenizer?

regards,
Horacio

On Wed, May 4, 2011 at 7:26 AM, William Morgan wrote:
> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>> index = Index.new "index"            => #
>> entry1 = Entry.new                   => #
>> entry1.add_string "body", "研究会"   => #
>> docid1 = index.add_entry entry1      => 1
>> q1 = Query.new "body", "研究"        => body:"研究"
>> results1 = index.search q1           => []
>
> The problem here is tokenization. Whistlepig only provides a very simple
> tokenizer; namely, it looks for space-separated things [1]. So you have to
> space-separate your tokens in both the indexing and querying stages, e.g.:
>
>   entry1.add_string "body", "研 究 会" => #
>   docid1 = index.add_entry entry1      => 1
>   q1 = Query.new "body", "研 究"       => AND body:"研" body:"究"
>   q1 = Query.new "body", "\"研 究\""   => PHRASE body:"研" body:"究"
>   results1 = index.search q1           => [1]
>
> For Japanese, proper tokenization is tricky. You could simply
> space-separate every character and deal with the spurious matches across
> word boundaries. Or you could do it right by plugging in a proper
> tokenizer, e.g. something like
> http://www.chasen.org/~taku/software/TinySegmenter/.
>
> [1] It also strips any prefix or suffix characters that match [:punct:].
> This is all pretty ad hoc and undocumented. Providing a simpler
> whitespace-only tokenizer as an alternative is in the works.
> --
> William
> _______________________________________________
> Sup-devel mailing list
> Sup-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/sup-devel
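P.S. In case it is useful to anyone else, here is roughly what my MeCab
hookup looks like. This is only a sketch: it assumes the mecab-ruby binding
and the same Index/Entry/Query setup as in the session quoted above, and the
segment helper is my own wrapper, not part of Heliotrope.

  require 'MeCab'   # the mecab-ruby binding

  # -Owakati makes MeCab emit the input with a single space between
  # tokens, which is the form Whistlepig's tokenizer expects.
  TAGGER = MeCab::Tagger.new("-Owakati")

  def segment text
    TAGGER.parse(text).strip
  end

  index = Index.new "index"
  entry = Entry.new

  # Segment before indexing; the exact split depends on the dictionary,
  # e.g. "研究会" may come out as "研究 会".
  entry.add_string "body", segment("研究会")
  docid = index.add_entry entry                  # => 1

  # Segment the query string too, and quote it so it becomes a PHRASE
  # query and doesn't match across word boundaries.
  q = Query.new "body", "\"#{segment("研究")}\""
  results = index.search q                       # => [1]

The key point is that exactly the same segmentation has to be applied at
index time and at query time, otherwise the tokens won't line up.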