Archive of RubyForge sup-devel mailing list
From: William Morgan <wmorgan-sup@masanjin.net>
To: sup-devel <sup-devel@rubyforge.org>
Subject: Re: [sup-devel] Cannot query Japanese characters
Date: Wed, 04 May 2011 16:56:57 +0000
Message-ID: <1304527268-sup-7661@masanjin.net>
In-Reply-To: <BANLkTikbENFqT2GsE5uWjqN_DTMq43FFkw@mail.gmail.com>

Hi Horacio,

Thanks for all your help so far.

Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments and apply the MeCab
> tokenizer to the message body and query strings before passing them to
> Whistlepig or, more specifically, to Heliotrope::Index.

Great!

> There is one problem I don't see how to handle... I do receive email
> in Japanese but also Chinese and Korean. I need a different
> tokenizer for each one and I have no idea how to handle this. Do email
> messages contain a language header that would allow me
> to identify the language and pass it to the corresponding tokenizer??

There's not a great way to do this in email. You can look at the
Content-Type header, which is sometimes present and will sometimes
give you a clue. But it's usually useless.
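
To make the header idea concrete, here is a rough sketch of the kind of
check I mean. The method name and the charset table are made up for
illustration and are not part of Heliotrope. It only helps when the
sender used a language-specific legacy charset; for UTF-8 mail, or mail
with no charset parameter at all, it returns nil and you are back to
guessing.

```ruby
# Hypothetical mapping from a MIME charset to a language guess.
# Only legacy, language-specific charsets carry any signal.
CHARSET_LANGUAGE_HINTS = {
  "iso-2022-jp" => :ja,
  "shift_jis"   => :ja,
  "euc-jp"      => :ja,
  "euc-kr"      => :ko,
  "iso-2022-kr" => :ko,
  "gb2312"      => :zh,
  "gbk"         => :zh,
  "gb18030"     => :zh,
  "big5"        => :zh,
}.freeze

# Takes the raw Content-Type header value (or nil if the header is absent).
# Returns a language symbol, or nil when the header gives no clue.
def language_hint_from_content_type(content_type)
  return nil if content_type.nil?
  match = content_type.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  return nil unless match
  CHARSET_LANGUAGE_HINTS[match[1].downcase]
end

language_hint_from_content_type('text/plain; charset="ISO-2022-JP"')
language_hint_from_content_type("text/plain; charset=UTF-8")
```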

You can write some heuristics by hand, of course. Or you can try naive
Bayes, which performs pretty well on this type of task. It looks like
someone just started a Ruby project here: https://github.com/fela/rlid.
It seems to only have European languages so far, but you can probably
just dump in some CJK text and retrain.
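
If you would rather roll your own, the naive Bayes approach is small
enough to sketch. The class below is my own illustration, not rlid's
code or API: it counts character bigrams per language, then scores new
text with add-one smoothing and picks the most likely language. The
class and method names are hypothetical, and the one-sentence training
strings are placeholders; a usable model needs far more text per
language.

```ruby
# Minimal character-bigram naive Bayes language guesser (illustrative only).
class BigramLanguageGuesser
  def initialize
    @counts = Hash.new { |h, lang| h[lang] = Hash.new(0) } # lang => bigram => count
    @totals = Hash.new(0)                                   # lang => bigrams seen
    @docs   = Hash.new(0)                                   # lang => training documents
    @vocab  = {}                                            # every bigram seen so far
  end

  # Add one training document for the given language.
  def train(lang, text)
    @docs[lang] += 1
    bigrams(text).each do |gram|
      @counts[lang][gram] += 1
      @totals[lang] += 1
      @vocab[gram] = true
    end
  end

  # Returns the language with the highest log-probability,
  # or nil if nothing has been trained yet.
  def guess(text)
    grams = bigrams(text)
    total_docs = @docs.values.sum.to_f
    @docs.keys.max_by do |lang|
      prior = Math.log(@docs[lang] / total_docs)
      grams.reduce(prior) do |score, gram|
        # Add-one (Laplace) smoothing, so an unseen bigram lowers the
        # score instead of ruling the language out entirely.
        score + Math.log((@counts[lang][gram] + 1.0) / (@totals[lang] + @vocab.size))
      end
    end
  end

  private

  # Overlapping two-character windows; no word segmentation is needed,
  # which is the point for CJK text.
  def bigrams(text)
    text.chars.each_cons(2).map(&:join)
  end
end

guesser = BigramLanguageGuesser.new
guesser.train(:ja, "これは日本語のテキストです")
guesser.train(:zh, "这是中文的文本")
guesser.train(:ko, "이것은 한국어 텍스트입니다")
guesser.guess("日本語のテキスト")
```

Once you have a guess, you can dispatch to the matching tokenizer and
fall back to a default when the guesser returns nil.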

As for your patches: I've applied a related patch to fix the encoding
issue with Query#parsed_query_s. Can you let me know if that works?

Rather than sticking MeCab directly in Heliotrope, I am going to make a
hook for users to plug in their own custom tokenization code like you're
doing.
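
The hook doesn't exist yet, so take the following only as a sketch of
the shape I have in mind. TokenizerHook, register and apply are
placeholder names, not Heliotrope's API. The idea is that whatever
callable you register gets run over both message bodies and query
strings, and its tokens are rejoined with spaces before indexing.

```ruby
# Hypothetical hook registry; none of these names exist in Heliotrope.
module TokenizerHook
  # Default behavior: plain whitespace splitting.
  DEFAULT = ->(text) { text.split(/\s+/).reject(&:empty?) }

  @tokenizer = DEFAULT

  class << self
    # Install a custom tokenizer: any callable (or block) that takes a
    # String and returns an Array of Strings. Calling with no argument
    # and no block restores the default.
    def register(callable = nil, &block)
      @tokenizer = callable || block || DEFAULT
    end

    # Tokenize, then rejoin with single spaces so that a
    # whitespace-based indexer sees one term per token.
    def apply(text)
      @tokenizer.call(text).join(" ")
    end
  end
end

# Stand-in for a real segmenter such as MeCab: one token per character.
TokenizerHook.register { |text| text.chars.reject { |c| c =~ /\s/ } }
TokenizerHook.apply("日本語 のメール")
TokenizerHook.register # back to the whitespace default
```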
-- 
William <wmorgan-sup@masanjin.net>
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel

