From mboxrd@z Thu Jan 1 00:00:00 1970
From: Horacio Sanson
To: Sup developer discussion
Subject: Re: [sup-devel] Cannot query Japanese characters
Date: Fri, 6 May 2011 12:30:26 +0900
In-Reply-To: <1304527268-sup-7661@masanjin.net>
References: <201104251023.19659.hsanson@gmail.com> <1303793294-sup-688@masanjin.net> <1304052708-sup-4240@masanjin.net> <1304460745-sup-6241@masanjin.net> <1304527268-sup-7661@masanjin.net>

Great, let me know when you have the modifications so I can stress test them.

regards,
Horacio

On Thu, May 5, 2011 at 1:56 AM, William Morgan wrote:
> Hi Horacio,
>
> Thanks for all your help so far.
>
> Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
>> After some hacking I got a Heliotrope server that works perfectly with
>> Japanese text. All I did was follow your comments and apply the MeCab
>> tokenizer to the message body and query strings before passing them to
>> Whistlepig or, more specifically, to Heliotrope::Index.
>
> Great!
>
>> There is one problem I don't see how to handle... I do receive email
>> in Japanese but also Chinese and Korean. I need a different
>> tokenizer for each one and I have no idea how to handle this. Do email
>> messages contain a language header that would allow me
>> to identify the language and pass it to the corresponding tokenizer?
>
> There's not a great way to do this in email.
> You can look at the Content-Type header, which is sometimes present,
> and that will sometimes give you a clue. But it's usually useless.
>
> You can write some heuristics by hand, of course. Or you can try naive
> Bayes, which performs pretty well on this type of task. It looks like
> someone just started a Ruby project here: https://github.com/fela/rlid.
> It seems to only have European languages so far, but you can probably
> just dump in some CJK text and retrain.
>
> As for your patches: I've applied a related patch to fix the encoding
> issue with Query#parsed_query_s. Can you let me know if that works?
>
> Rather than sticking MeCab directly in Heliotrope, I am going to make a
> hook for users to plug in their own custom tokenization code like you're
> doing.
> --
> William
> _______________________________________________
> Sup-devel mailing list
> Sup-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/sup-devel