From mboxrd@z Thu Jan 1 00:00:00 1970 Received: by 10.52.188.165 with SMTP id gb5cs68138vdc; Tue, 3 May 2011 22:06:19 -0700 (PDT) Received: by 10.224.40.211 with SMTP id l19mr689940qae.46.1304485578904; Tue, 03 May 2011 22:06:18 -0700 (PDT) Return-Path: Received: from rubyforge.org (rubyforge.org [205.234.109.19]) by mx.google.com with ESMTP id t16si944430qco.6.2011.05.03.22.06.18; Tue, 03 May 2011 22:06:18 -0700 (PDT) Received-SPF: pass (google.com: domain of sup-devel-bounces@rubyforge.org designates 205.234.109.19 as permitted sender) client-ip=205.234.109.19; Authentication-Results: mx.google.com; spf=pass (google.com: domain of sup-devel-bounces@rubyforge.org designates 205.234.109.19 as permitted sender) smtp.mail=sup-devel-bounces@rubyforge.org; dkim=neutral (body hash did not verify) header.i=@gmail.com Received: from rubyforge.org (rubyforge.org [127.0.0.1]) by rubyforge.org (Postfix) with ESMTP id 3992315B8035 for ; Wed, 4 May 2011 01:06:18 -0400 (EDT) Received: from mail-vx0-f178.google.com (mail-vx0-f178.google.com [209.85.220.178]) by rubyforge.org (Postfix) with ESMTP id BED021858374 for ; Tue, 3 May 2011 21:42:26 -0400 (EDT) Received: by vxc11 with SMTP id 11so907919vxc.23 for ; Tue, 03 May 2011 18:42:26 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=uBfHxnwPmcXrym2dRbf/BHzfl5dQPp4YIrkHlVFWzRw=; b=CJaVQWRQ2M3AUrjtExE0CDMXbZdHhCF8crw9szJP1husTyC4boRSE3eCO3OLAeD+su 4FvOwVS1AG+C9kpL9ajxTfUOHeJuIWN57zf38Kxjs+JcYfzBold0paoRLdxKwklrzDSm MFN/nm+vXwt9Bpyehcz2cZrFMowdurnZT7diE= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; b=Ew3foTtIgfsjZblZH2fCNUI55P4J6S5d3CVmQhnTIyqJvHsaBngUu3QvFkMhTj1Q/I pCkZ1nWpl62Kjv+ov/MGx7RQ7a5r8WyYCqeuFxxaSJYpMSNKs/L+t5PApgrlwfPIWR26 LF174YCpVPQ76e1mHsmVw0wDFb666MvKw1ECU= MIME-Version: 
1.0 Received: by 10.52.114.104 with SMTP id jf8mr648436vdb.193.1304473346002; Tue, 03 May 2011 18:42:26 -0700 (PDT) Received: by 10.52.107.2 with HTTP; Tue, 3 May 2011 18:42:25 -0700 (PDT) In-Reply-To: <1304460745-sup-6241@masanjin.net> References: <201104251023.19659.hsanson@gmail.com> <1303793294-sup-688@masanjin.net> <1304052708-sup-4240@masanjin.net> <1304460745-sup-6241@masanjin.net> Date: Wed, 4 May 2011 10:42:25 +0900 Message-ID: From: Horacio Sanson To: Sup developer discussion Content-Type: multipart/mixed; boundary=bcaec547c92d68528504a2695e00 Subject: Re: [sup-devel] Cannot query Japanese characters X-BeenThere: sup-devel@rubyforge.org X-Mailman-Version: 2.1.12 Precedence: list Reply-To: Sup developer discussion List-Id: Sup developer discussion List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: sup-devel-bounces@rubyforge.org Errors-To: sup-devel-bounces@rubyforge.org --bcaec547c92d68528504a2695e00 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Chasen is the worst tokenizer; it is pretty old. The best one is MeCab, which is faster and is by the same author as Chasen. You can see all the major Japanese tokenizers in action at this URL: http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some text in the box and press the button.

After some hacking I got a Heliotrope server that works perfectly with Japanese text. All I did was follow your comments and apply the MeCab tokenizer to the message body and query strings before passing them to Whistlepig, or more specifically to Heliotrope::Index.

There is one problem I don't see how to handle: I receive email in Japanese but also in Chinese and Korean. I need a different tokenizer for each one, and I have no idea how to handle this. Do email messages contain a language header that would allow me to identify the language and pass the text to the corresponding tokenizer?
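If there turns out to be no reliable header (RFC 3282 does define Content-Language, but I would not count on mail clients setting it), one fallback would be to guess from the characters themselves. A rough sketch, not tried inside Heliotrope; guess_cjk_language and the symbols it returns are names made up for illustration, and a Han-only text is reported as Chinese even though it could be kana-free Japanese:

```ruby
# Guess which CJK tokenizer a text should be routed to by counting
# characters per Unicode script. Hypothetical helper, illustration only.
def guess_cjk_language(text)
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  han    = text.scan(/\p{Han}/).size

  return :korean   if hangul > 0 && hangul >= kana
  return :japanese if kana > 0
  return :chinese  if han > 0   # Han only: may still be kana-free Japanese
  :other
end

p guess_cjk_language("研究会に行きます")  # => :japanese
p guess_cjk_language("안녕하세요")        # => :korean
p guess_cjk_language("我们研究")          # => :chinese
p guess_cjk_language("hello")             # => :other
```

The result could then be used to pick a tokenizer per message, with the query side falling back to whichever language the user searches in most.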
regards,
Horacio

On Wed, May 4, 2011 at 7:26 AM, William Morgan wrote:
> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>> index = Index.new "index" => #
>> entry1 = Entry.new => #
>> entry1.add_string "body", "研究会" => #
>> docid1 = index.add_entry entry1 => 1
>> q1 = Query.new "body", "研究" => body:"研究"
>> results1 = index.search q1 => []
>
> The problem here is tokenization. Whistlepig only provides a very simple
> tokenizer, namely, it looks for space-separated things [1]. So you have to
> space-separate your tokens in both the indexing and querying stages, e.g.:
>
>  entry1.add_string "body", "研 究 会" => #
>  docid1 = index.add_entry entry1      => 1
>  q1 = Query.new "body", "研 究"       => AND body:"研" body:"究"
>  q1 = Query.new "body", "\"研 究\""   => PHRASE body:"研" body:"究"
>  results1 = index.search q1           => [1]
>
> For Japanese, proper tokenization is tricky. You could simply space-separate
> every character and deal with the spurious matches across word boundaries.
> Or you could do it right by plugging in a proper tokenizer, e.g. something
> like http://www.chasen.org/~taku/software/TinySegmenter/.
>
> [1] It also strips any prefix or suffix characters that match [:punct:]. This
> is all pretty ad-hoc and undocumented. Providing simpler whitespace-only
> tokenizer as an alternative is in the works.
> -- > William > _______________________________________________ > Sup-devel mailing list > Sup-devel@rubyforge.org > http://rubyforge.org/mailman/listinfo/sup-devel > --bcaec547c92d68528504a2695e00 Content-Type: text/x-patch; charset=US-ASCII; name="0001-Fix-crash-for-non-ASCII-chars.patch" Content-Disposition: attachment; filename="0001-Fix-crash-for-non-ASCII-chars.patch" Content-Transfer-Encoding: base64 X-Attachment-Id: f_gn9lqdwv0 RnJvbSBmNDg0YjA5NTE4ZGI0N2EwNjY5MGUwOWE3MTBjZjZlODY2YzU1NjFiIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBIb3JhY2lvIFNhbnNvbiA8aHNhbnNvbkBnbWFpbC5jb20+CkRh dGU6IFdlZCwgNCBNYXkgMjAxMSAxMDozMToxMiArMDkwMApTdWJqZWN0OiBbUEFUQ0ggMS8yXSBG aXggY3Jhc2ggZm9yIG5vbiBBU0NJSSBjaGFycwoKLS0tCiBiaW4vaGVsaW90cm9wZS1zZXJ2ZXIg fCAgICAyICstCiAxIGZpbGVzIGNoYW5nZWQsIDEgaW5zZXJ0aW9ucygrKSwgMSBkZWxldGlvbnMo LSkKCmRpZmYgLS1naXQgYS9iaW4vaGVsaW90cm9wZS1zZXJ2ZXIgYi9iaW4vaGVsaW90cm9wZS1z ZXJ2ZXIKaW5kZXggNDc5M2FjMi4uZWQ5YzNiZSAxMDA2NDQKLS0tIGEvYmluL2hlbGlvdHJvcGUt c2VydmVyCisrKyBiL2Jpbi9oZWxpb3Ryb3BlLXNlcnZlcgpAQCAtMTUxLDcgKzE1MSw3IEBAIGNs YXNzIEhlbGlvdHJvcGVTZXJ2ZXIgPCBTaW5hdHJhOjpCYXNlCiAgICAgICBuYXYgKz0gIjwvZGl2 PiIKIAogICAgICAgaGVhZGVyKCJTZWFyY2g6ICN7cXVlcnkub3JpZ2luYWxfcXVlcnlfc30iLCBx dWVyeS5vcmlnaW5hbF9xdWVyeV9zKSArCi0gICAgICAgICI8ZGl2PlBhcnNlZCBxdWVyeTogI3tl c2NhcGVfaHRtbCBxdWVyeS5wYXJzZWRfcXVlcnlfc308L2Rpdj4iICsKKyAgICAgICAgIjxkaXY+ UGFyc2VkIHF1ZXJ5OiAje2VzY2FwZV9odG1sIHF1ZXJ5LnBhcnNlZF9xdWVyeV9zLmZvcmNlX2Vu Y29kaW5nKCdVVEYtOCcpfTwvZGl2PiIgKwogICAgICAgICAiPGRpdj5TZWFyY2ggdG9vayAje3Nw cmludGYgJyUuMmYnLCBpbmZvWzplbGFwc2VkXX1zIGFuZCAje2luZm9bOmNvbnRpbnVlZF0gPyAn d2FzJyA6ICd3YXMgTk9UJ30gY29udGludWVkPC9kaXY+IiArCiAgICAgICAgICIje25hdn08dGFi bGU+IiArCiAgICAgICAgIHJlc3VsdHMubWFwIHsgfHJ8IHRocmVhZGluZm9fdG9faHRtbCByIH0u am9pbiArCi0tIAoxLjcuNC4xCgo= --bcaec547c92d68528504a2695e00 Content-Type: text/x-patch; charset=US-ASCII; name="0002-Add-MeCab-japanese-text-analyzer.patch" Content-Disposition: attachment; filename="0002-Add-MeCab-japanese-text-analyzer.patch" 
Content-Transfer-Encoding: base64 X-Attachment-Id: f_gn9lqk0f1 RnJvbSA2NTk1YWYwYjU1ZDUyZDFmNjg1NjJmYmRkMGYxYjIzZGZlZTM0MDM5IE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBIb3JhY2lvIFNhbnNvbiA8aHNhbnNvbkBnbWFpbC5jb20+CkRh dGU6IFdlZCwgNCBNYXkgMjAxMSAxMDozNDo0OCArMDkwMApTdWJqZWN0OiBbUEFUQ0ggMi8yXSBB ZGQgTWVDYWIgamFwYW5lc2UgdGV4dCBhbmFseXplci4KCkphcGFuZXNlIHRleHQgaGFzIG5vIHdo aXRlIHNwYWNlIHNlcGFyYXRpb24gY2F1c2luZyB0aGUgV2hpc3RlbHBpZwp0b2tlbml6ZXIgdG8g ZmFpbC4gVGhpcyBwYXRjaCBwcm9jZXNzZXMgdGhlIGVtYWlsIGluZGV4YWJsZSB0ZXh0CmFuZCBz ZWFyY2ggcXVlcmllcyB3aXRoIE1lQ2FiIGJlZm9yZSBwYXNzaW5nIHRoZW0gdG8gV2hpc3RlbHBp Zy4KLS0tCiBiaW4vaGVsaW90cm9wZS1zZXJ2ZXIgICAgIHwgICAgMyArKy0KIGxpYi9oZWxpb3Ry b3BlL21lc3NhZ2UucmIgfCAgICA1ICsrKy0tCiAyIGZpbGVzIGNoYW5nZWQsIDUgaW5zZXJ0aW9u cygrKSwgMyBkZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9iaW4vaGVsaW90cm9wZS1zZXJ2ZXIg Yi9iaW4vaGVsaW90cm9wZS1zZXJ2ZXIKaW5kZXggZWQ5YzNiZS4uZjNiZDVkNCAxMDA2NDQKLS0t IGEvYmluL2hlbGlvdHJvcGUtc2VydmVyCisrKyBiL2Jpbi9oZWxpb3Ryb3BlLXNlcnZlcgpAQCAt NjcsNiArNjcsNyBAQCBjbGFzcyBIZWxpb3Ryb3BlU2VydmVyIDwgU2luYXRyYTo6QmFzZQogICAg IGVuZC50b19qc29uCiAgIGVuZAogCisgIHJlcXVpcmUgIk1lQ2FiIgogICBkZWYgZ2V0X3F1ZXJ5 X2Zyb21fcGFyYW1zCiAgICAgIyMgd29yayBhcm91bmQgYSByYWNrICg/KSBidWcgd2hlcmUgcXVv dGVzIGFyZSBvbWl0dGVkIGluIHF1ZXJpZXMgbGlrZSAiaGVsbG8gYm9iIgogICAgIHF1ZXJ5ID0g aWYgZW52WyJyYWNrLnJlcXVlc3QucXVlcnlfc3RyaW5nIl0gPX4gL1xicT0oLis/KSgmfCQpLwpA QCAtNzYsNyArNzcsNyBAQCBjbGFzcyBIZWxpb3Ryb3BlU2VydmVyIDwgU2luYXRyYTo6QmFzZQog ICAgIGVuZAogCiAgICAgcmFpc2UgUmVxdWVzdEVycm9yLCAibmVlZCBhIHF1ZXJ5IiB1bmxlc3Mg cXVlcnkKLSAgICBxdWVyeQorICAgIE1lQ2FiOjpUYWdnZXIubmV3KCItT3dha2F0aSIpLnBhcnNl KHF1ZXJ5KS5mb3JjZV9lbmNvZGluZygiVVRGLTgiKQogICBlbmQKIAogICBkZWYgZ2V0X3NlYXJj aF9yZXN1bHRzCmRpZmYgLS1naXQgYS9saWIvaGVsaW90cm9wZS9tZXNzYWdlLnJiIGIvbGliL2hl bGlvdHJvcGUvbWVzc2FnZS5yYgppbmRleCBiNDgzMjliLi5lNjFkOGJkIDEwMDY0NAotLS0gYS9s aWIvaGVsaW90cm9wZS9tZXNzYWdlLnJiCisrKyBiL2xpYi9oZWxpb3Ryb3BlL21lc3NhZ2UucmIK QEAgLTc2LDYgKzc2LDcgQEAgY2xhc3MgTWVzc2FnZQogICBkZWYgaW5kaXJlY3RfcmVjaXBpZW50 
czsgY2MgKyBiY2MgZW5kCiAgIGRlZiByZWNpcGllbnRzOyBkaXJlY3RfcmVjaXBpZW50cyArIGlu ZGlyZWN0X3JlY2lwaWVudHMgZW5kCiAKKyAgcmVxdWlyZSAiTWVDYWIiCiAgIGRlZiBpbmRleGFi bGVfdGV4dAogICAgIEBpbmRleGFibGVfdGV4dCB8fD0gYmVnaW4KICAgICAgIHYgPSAoW2Zyb20u aW5kZXhhYmxlX3RleHRdICsKQEAgLTkwLDggKzkxLDggQEAgY2xhc3MgTWVzc2FnZQogICAgICAg ICBlbmQKICAgICAgICkuZmxhdHRlbi5jb21wYWN0LmpvaW4oIiAiKQogCi0gICAgICB2LmdzdWIo L1xzK1tcV1xkX10rKFxzfCQpLywgIiAiKS4gIyBkcm9wIGZ1bm55IHRva2VucwotICAgICAgICBn c3ViKC9ccysvLCAiICIpCisgICAgICBNZUNhYjo6VGFnZ2VyLm5ldygiLU93YWthdGkiKS5wYXJz ZSh2KSAgICMgVG9rZW5pemUgSmFwYW5lc2UgVGV4dAorICAgICAgICAuZ3N1YigvXHMrLywgIiAi KQogICAgIGVuZAogICBlbmQKIAotLSAKMS43LjQuMQoK --bcaec547c92d68528504a2695e00 Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Sup-devel mailing list Sup-devel@rubyforge.org http://rubyforge.org/mailman/listinfo/sup-devel --bcaec547c92d68528504a2695e00--
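
The approach discussed in this thread, running the same tokenizer over both the indexed text and the query before handing them to a whitespace-based index, can be sketched in plain Ruby. This is only an illustration under stated assumptions: ToyIndex is a hypothetical in-memory stand-in for Whistlepig, and space_separate_cjk is the crude per-character fallback William mentions, standing in for a real segmenter such as MeCab.

```ruby
# Crude stand-in for a real segmenter: put a space around every CJK
# character and leave other text alone.
def space_separate_cjk(text)
  text.gsub(/[\p{Han}\p{Hiragana}\p{Katakana}]/) { |ch| " #{ch} " }
      .gsub(/\s+/, " ")
      .strip
end

# Toy whitespace-token index (NOT Whistlepig): token => list of doc ids.
class ToyIndex
  def initialize(tokenizer)
    @tokenizer = tokenizer
    @postings  = Hash.new { |h, k| h[k] = [] }
    @next_id   = 0
  end

  # Tokenize at indexing time and record each distinct token.
  def add(text)
    @next_id += 1
    @tokenizer.call(text).split(" ").uniq.each { |tok| @postings[tok] << @next_id }
    @next_id
  end

  # AND query: ids of documents containing every token of the query.
  # The query goes through the same tokenizer as the indexed text.
  def search(query)
    tokens = @tokenizer.call(query).split(" ").uniq
    return [] if tokens.empty?
    tokens.map { |tok| @postings.fetch(tok, []) }.reduce(:&)
  end
end

index = ToyIndex.new(method(:space_separate_cjk))
index.add("研究会")
p index.search("研究")   # => [1]
p index.search("会社")   # => []
```

Swapping in a real segmenter should only require passing a different callable to ToyIndex.new, as long as it returns space-separated tokens.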