From: William Morgan <wmorgan-sup@masanjin.net>
To: sup-devel <sup-devel@rubyforge.org>
Subject: Re: [sup-devel] Cannot query Japanese characters
Date: Tue, 03 May 2011 22:26:04 +0000
Message-ID: <1304460745-sup-6241@masanjin.net>
In-Reply-To: <BANLkTi=tSnbEijoEHG76Z5Fy9-3G4TPxVw@mail.gmail.com>
Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
> index = Index.new "index"          # => #<Whistlepig::Index:0x00000002093f60>
> entry1 = Entry.new                 # => #<Whistlepig::Entry:0x0000000207d328>
> entry1.add_string "body", "研究会"  # => #<Whistlepig::Entry:0x0000000207d328>
> docid1 = index.add_entry entry1    # => 1
> q1 = Query.new "body", "研究"       # => body:"研究"
> results1 = index.search q1         # => []
The problem here is tokenization. Whistlepig only provides a very simple
tokenizer: it splits text on whitespace [1]. So you have to space-separate
your tokens at both indexing and query time, e.g.:
  entry1.add_string "body", "研 究 会"   # => #<Whistlepig::Entry:0x90b873c>
  docid1 = index.add_entry entry1       # => 1
  q1 = Query.new "body", "研 究"         # => AND body:"研" body:"究"
  q1 = Query.new "body", "\"研 究\""     # => PHRASE body:"研" body:"究"
  results1 = index.search q1            # => [1]
For Japanese, proper tokenization is tricky. You could simply space-separate
every character and deal with the spurious matches across word boundaries.
Or you could do it right by plugging in a proper tokenizer, e.g. something
like http://www.chasen.org/~taku/software/TinySegmenter/.
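As a rough sketch of the first option (space-separating every character), a
small Ruby helper could be applied to text before it is indexed or queried.
The names space_separate and phrase_for are hypothetical and not part of
Whistlepig; only add_string and Query.new in the trailing comments come from
the session above.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of Whistlepig): drop existing whitespace and
# put a single space between every remaining character, so that a
# whitespace-based tokenizer sees each character as its own token.
def space_separate(text)
  text.each_char.reject { |c| c =~ /[[:space:]]/ }.join(" ")
end

# Hypothetical helper: wrap the space-separated text in double quotes so the
# query is treated as a phrase rather than an AND of single characters.
def phrase_for(text)
  "\"#{space_separate(text)}\""
end

# Intended use with the calls shown earlier in this message:
#   entry1.add_string "body", space_separate("研究会")
#   q1 = Query.new "body", phrase_for("研究")
puts space_separate("研究会")
puts phrase_for("研究")
```

Using the phrase form on the query side keeps the characters adjacent and in
order, which limits (but does not eliminate) spurious matches across word
boundaries.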
[1] It also strips any prefix or suffix characters that match [:punct:]. This
is all pretty ad hoc and undocumented. Providing a simpler whitespace-only
tokenizer as an alternative is in the works.
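For illustration only, the behaviour described in [1] can be approximated in
Ruby as below. This is a sketch of the description, not Whistlepig's actual
tokenizer; simple_tokenize is a hypothetical name, and Ruby's [[:punct:]]
class may not match exactly the same characters.

```ruby
# encoding: utf-8

# Hypothetical approximation of the described tokenizer: split on whitespace,
# then strip leading and trailing punctuation from each token.
def simple_tokenize(text)
  text.split
      .map { |tok| tok.gsub(/\A[[:punct:]]+|[[:punct:]]+\z/, "") }
      .reject(&:empty?)
end

p simple_tokenize("Hello, (world) foo-bar!")
p simple_tokenize("研究会")    # no whitespace, so this stays a single token
p simple_tokenize("研 究 会")  # three single-character tokens
```

This shows why the unspaced "研究会" is indexed as one token and cannot be
matched by a query for "研究".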
--
William <wmorgan-sup@masanjin.net>