From: William Morgan
To: sup-devel
Subject: Re: [sup-devel] Cannot query Japanese characters
Date: Wed, 04 May 2011 16:56:57 +0000
Message-Id: <1304527268-sup-7661@masanjin.net>
User-Agent: Sup/git

Hi Horacio,

Thanks for all your help so far.

Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments and apply the MeCab
> tokenizer to the message body and query strings before passing them to
> Whistlepig or, more specifically, to Heliotrope::Index.

Great!

> There is one problem I don't see how to handle... I do receive email
> in Japanese but also Chinese and Korean. I need a different tokenizer
> for each one and I have no idea how to handle this. Do email messages
> contain a language header that would allow me to identify the language
> and pass it to the corresponding tokenizer?

There's not a great way to do this in email. You can look at the
Content-Type header, which is sometimes present, and its charset
parameter will sometimes give you a clue (ISO-2022-JP or EUC-KR, for
example). But it's usually useless.

You can write some heuristics by hand, of course. Or you can try naive
Bayes, which performs pretty well on this type of task. It looks like
someone just started a Ruby project here: https://github.com/fela/rlid.
It seems to only have European languages so far, but you can probably
just dump in some CJK text and retrain.

As for your patches: I've applied a related patch to fix the encoding
issue with Query#parsed_query_s. Can you let me know if that works?

Rather than sticking MeCab directly in Heliotrope, I am going to make a
hook for users to plug in their own custom tokenization code like you're
doing.
--
William
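
For illustration, the hand-written heuristic mentioned above could look roughly like the sketch below. It is not the naive Bayes approach and not code from Sup, Heliotrope, or Whistlepig: the method name guess_cjk_language and the returned symbols are hypothetical, and it assumes the text has already been decoded to UTF-8. It only counts characters per Unicode script, so a short Japanese string written entirely in kanji would be misread as Chinese.

```ruby
# Hypothetical helper, not part of Sup, Heliotrope, or Whistlepig.
# Guesses which CJK tokenizer a UTF-8 string should be routed to by
# counting characters per Unicode script.
def guess_cjk_language(text)
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  han    = text.scan(/\p{Han}/).size

  # Kana is specific to Japanese, so it outweighs any shared Han characters.
  return :japanese if kana > 0 && kana >= hangul
  # Hangul is specific to Korean.
  return :korean if hangul > 0
  # Han characters with no kana or hangul: most likely Chinese, although an
  # all-kanji Japanese string would also end up here.
  return :chinese if han > 0

  :other
end
```

A caller could then pick a tokenizer based on the returned symbol, for example sending :japanese text to MeCab and leaving :other text to the default tokenizer.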