Archive of RubyForge sup-devel mailing list
From: Horacio Sanson <hsanson@gmail.com>
To: Sup developer discussion <sup-devel@rubyforge.org>
Subject: Re: [sup-devel] Cannot query Japanese characters
Date: Fri, 6 May 2011 12:30:26 +0900
Message-ID: <BANLkTimr0u=6oB4uSK_6itUU2UMR-yJz+w@mail.gmail.com>
In-Reply-To: <1304527268-sup-7661@masanjin.net>

Great, let me know when you have the modifications so I can stress test them.

regards,
Horacio

On Thu, May 5, 2011 at 1:56 AM, William Morgan <wmorgan-sup@masanjin.net> wrote:
> Hi Horacio,
>
> Thanks for all your help so far.
>
> Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
>> After some hacking I got a Heliotrope server that works perfectly with
>> Japanese text. All I did was follow your comments
>> and apply the MeCab tokenizer to the message body and query strings
>> before passing them to Whistlepig or, more specifically,
>> to Heliotrope::Index.
>
> Great!
>
>> There is one problem I don't see how to handle... I do receive email
>> in Japanese but also Chinese and Korean. I need a different
>> tokenizer for each one and I have no idea how to handle this. Do email
>> messages contain a language header that would allow me
>> to identify the language and pass it to the corresponding tokenizer??
>
> There's not a great way to do this in email. You can look at the
> content-type headers, which are sometimes present, and that will
> sometimes give you a clue. But it's usually useless.
>
> You can write some heuristics by hand, of course. Or you can try naive
> bayes, which performs pretty well on this type of task. It looks like
> someone just started a ruby project here: https://github.com/fela/rlid.
> It seems to only have European languages so far, but you can probably
> just dump in some CJK text and retrain.
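
The hand-written heuristics mentioned above could start from Unicode script
counts. This is only a rough sketch, assuming the text has already been decoded
to valid UTF-8; the method name guess_cjk_language and the returned symbols are
made up for illustration and are not part of Sup, Heliotrope, or rlid. It treats
Han-only text as Chinese, which will misclassify Japanese messages written
entirely in kanji.

```ruby
# Guess which CJK tokenizer a message needs from the scripts it contains.
# Hypothetical helper: returns :ja, :ko, :zh, or :other.
# Assumes `text` is a valid UTF-8 String.
def guess_cjk_language(text)
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  han    = text.scan(/\p{Han}/).size

  # Kana only occurs in Japanese, so any meaningful amount wins.
  return :ja if kana > 0 && kana >= hangul
  # Hangul only occurs in Korean.
  return :ko if hangul > 0
  # Han characters without kana or hangul: assume Chinese (a guess).
  return :zh if han > 0

  :other
end
```

The result can then be used to pick a tokenizer, for example
guess_cjk_language(body) == :ja to route the body through MeCab. A trained
classifier such as naive Bayes should do better on short or mixed-language
messages.
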
>
> As for your patches: I've applied a related patch to fix the encoding
> issue with Query#parsed_query_s. Can you let me know if that works?
>
> Rather than sticking mecab directly in heliotrope, I am going to make a
> hook for users to plug in their own custom tokenization code like you're
> doing.
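
For readers wondering what such a hook might look like, here is a minimal
sketch. It is not Heliotrope's actual hook API, which is not shown in this
thread; the module name TokenizerHook and its methods register and apply are
hypothetical. The idea is that the same callable is applied to message bodies
and to query strings, and its tokens are rejoined with spaces so that a
whitespace-based indexer sees word boundaries.

```ruby
# Hypothetical tokenizer hook registry (names are illustrative only).
module TokenizerHook
  # Default behaviour: plain whitespace tokenization.
  DEFAULT = ->(text) { text.split }

  @tokenizer = DEFAULT

  class << self
    # Install a callable (or block) that maps a String to an Array of tokens.
    # Calling register with no argument restores the default.
    def register(callable = nil, &block)
      @tokenizer = callable || block || DEFAULT
    end

    # Tokenize `text` and rejoin the tokens with single spaces.
    def apply(text)
      @tokenizer.call(text).join(" ")
    end
  end
end
```

A user would register a MeCab-backed block once at startup, and the indexer
would call TokenizerHook.apply on each body and query. As a stand-in for a real
tokenizer, TokenizerHook.register { |t| t.chars } splits text into single
characters.
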
> --
> William <wmorgan-sup@masanjin.net>
> _______________________________________________
> Sup-devel mailing list
> Sup-devel@rubyforge.org
> http://rubyforge.org/mailman/listinfo/sup-devel
>



Thread overview: 15+ messages
2011-04-25  1:23 Horacio Sanson
2011-04-26  4:49 ` William Morgan
2011-04-29  4:52   ` William Morgan
2011-05-01 15:35     ` Horacio Sanson
2011-05-01 15:46       ` Horacio Sanson
2011-05-03 14:24         ` Horacio Sanson
2011-05-03 22:26           ` William Morgan
2011-05-04  1:42             ` Horacio Sanson
2011-05-04  2:03               ` Horacio Sanson
2011-05-04 16:56               ` William Morgan
2011-05-06  3:30                 ` Horacio Sanson [this message]
2011-06-08  5:21                   ` William Morgan
2011-06-09 13:48                     ` Horacio Sanson
2011-06-09 14:08                       ` Horacio Sanson
2011-06-09 22:46                       ` William Morgan
