Chasen is the worst of the tokenizers; it is pretty old. The best one is
MeCab, which is the fastest and is from the same author as Chasen. You can
see all the major Japanese tokenizers in action at this URL:
http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some text in the box
and press the button.

After some hacking I got a Heliotrope server that works perfectly with
Japanese text. All I did was follow your comments and apply the MeCab
tokenizer to the message body and query strings before passing them to
Whistlepig, or more specifically to Heliotrope::Index.

There is one problem I don't see how to handle... I receive email in
Japanese, but also in Chinese and Korean. I need a different tokenizer for
each language and I have no idea how to choose between them. Do email
messages contain a language header that would allow me to identify the
language and pass the text to the corresponding tokenizer?

regards,
Horacio

On Wed, May 4, 2011 at 7:26 AM, William Morgan wrote:
> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>> index = Index.new "index"            => #
>> entry1 = Entry.new                   => #
>> entry1.add_string "body", "研究会"   => #
>> docid1 = index.add_entry entry1      => 1
>> q1 = Query.new "body", "研究"        => body:"研究"
>> results1 = index.search q1           => []
>
> The problem here is tokenization. Whistlepig only provides a very simple
> tokenizer; namely, it looks for space-separated things [1]. So you have to
> space-separate your tokens in both the indexing and querying stages, e.g.:
>
>   entry1.add_string "body", "研 究 会" => #
>   docid1 = index.add_entry entry1      => 1
>   q1 = Query.new "body", "研 究"       => AND body:"研" body:"究"
>   q1 = Query.new "body", "\"研 究\""   => PHRASE body:"研" body:"究"
>   results1 = index.search q1           => [1]
>
> For Japanese, proper tokenization is tricky. You could simply
> space-separate every character and deal with the spurious matches across
> word boundaries. Or you could do it right by plugging in a proper
> tokenizer, e.g. something like
> http://www.chasen.org/~taku/software/TinySegmenter/.
>
> [1] It also strips any prefix or suffix characters that match [:punct:].
> This is all pretty ad hoc and undocumented. Providing a simpler
> whitespace-only tokenizer as an alternative is in the works.
> --
> William
> _______________________________________________
> Sup-devel mailing list
> Sup-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/sup-devel
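P.S. In case it is useful to anyone else, here is roughly what my MeCab
hookup looks like. This is only a sketch: it assumes the mecab-ruby binding
and the same Index/Entry/Query setup as in the session quoted above, and the
segment helper is my own wrapper, not part of Heliotrope.

  require 'MeCab'   # the mecab-ruby binding

  # -Owakati makes MeCab emit the input with a single space between
  # tokens, which is the form Whistlepig's tokenizer expects.
  TAGGER = MeCab::Tagger.new("-Owakati")

  def segment text
    TAGGER.parse(text).strip
  end

  index = Index.new "index"
  entry = Entry.new

  # Segment before indexing; the exact split depends on the dictionary,
  # e.g. "研究会" may come out as "研究 会".
  entry.add_string "body", segment("研究会")
  docid = index.add_entry entry                  # => 1

  # Segment the query string too, and quote it so it becomes a PHRASE
  # query and doesn't match across word boundaries.
  q = Query.new "body", "\"#{segment("研究")}\""
  results = index.search q                       # => [1]

The key point is that exactly the same segmentation has to be applied at
index time and at query time, otherwise the tokens won't line up.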