* [sup-devel] Cannot query Japanese characters
@ 2011-04-25 1:23 Horacio Sanson
2011-04-26 4:49 ` William Morgan
0 siblings, 1 reply; 15+ messages in thread
From: Horacio Sanson @ 2011-04-25 1:23 UTC (permalink / raw)
To: sup-devel
I like sup's idea and have high hopes for heliotrope, but unfortunately both
have problems dealing with my language: Japanese.
When I enter a search string like "subject: 手紙" I get the following crash:
127.0.0.1 - - [25/Apr/2011 10:17:17] "GET /search?q=%E6%89%8B%E7%B4%99
HTTP/1.1" 200 12306 0.0169
localhost.localdomain - - [25/Apr/2011:10:17:17 JST] "GET
/search?q=%E6%89%8B%E7%B4%99 HTTP/1.1" 200 12306
http://localhost:8042/search?q=%E6%89%8B%E7%B4%99 ->
/search?q=%E6%89%8B%E7%B4%99
127.0.0.1 - - [25/Apr/2011 10:17:17] "GET /favicon.ico HTTP/1.1" 404 441
0.0007
localhost.localdomain - - [25/Apr/2011:10:17:17 JST] "GET /favicon.ico
HTTP/1.1" 404 441
- -> /favicon.ico
HeliotropeServer::RequestError - can't parse query: parse error: line 1:
syntax error, unexpected $end, expecting WORD or '"' or '(':
bin/heliotrope-server:161:in `rescue in block in <class:HeliotropeServer>'
bin/heliotrope-server:138:in `block in <class:HeliotropeServer>'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:1165:in `call'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:1165:in `block in
compile!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:738:in
`instance_eval'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:738:in
`route_eval'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:722:in `block (2
levels) in route!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:772:in `block in
process_route'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:769:in `catch'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:769:in
`process_route'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:721:in `block in
route!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:720:in `each'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:720:in `route!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:857:in `dispatch!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:648:in `block in
call!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:822:in
`instance_eval'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:822:in `block in
invoke'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:822:in `catch'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:822:in `invoke'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:648:in `call!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/base.rb:633:in `call'
/var/lib/gems/1.9.1/gems/sinatra-1.2.3/lib/sinatra/showexceptions.rb:21:in
`call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/lint.rb:48:in `_call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/lint.rb:36:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/showexceptions.rb:24:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/commonlogger.rb:18:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/content_length.rb:13:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/handler/webrick.rb:52:in
`service'
/usr/lib/ruby/1.9.1/webrick/httpserver.rb:111:in `service'
/usr/lib/ruby/1.9.1/webrick/httpserver.rb:70:in `run'
/usr/lib/ruby/1.9.1/webrick/server.rb:183:in `block in start_thread'
127.0.0.1 - - [25/Apr/2011 10:17:28] "GET
/search?q=subject%3A+%E6%89%8B%E7%B4%99 HTTP/1.1" 500 92955 0.0266
localhost.localdomain - - [25/Apr/2011:10:17:28 JST] "GET
/search?q=subject%3A+%E6%89%8B%E7%B4%99 HTTP/1.1" 500 92955
http://localhost:8042/search?q=%E6%89%8B%E7%B4%99 ->
/search?q=subject%3A+%E6%89%8B%E7%B4%99
127.0.0.1 - - [25/Apr/2011 10:17:28] "GET /__sinatra__/500.png HTTP/1.1" 304 -
0.0006
localhost.localdomain - - [25/Apr/2011:10:17:28 JST] "GET /__sinatra__/500.png
HTTP/1.1" 304 0
http://localhost:8042/search?q=subject%3A+%E6%89%8B%E7%B4%99 ->
/__sinatra__/500.png
127.0.0.1 - - [25/Apr/2011 10:17:28] "GET /favicon.ico HTTP/1.1" 404 441
0.0008
localhost.localdomain - - [25/Apr/2011:10:17:28 JST] "GET /favicon.ico
HTTP/1.1" 404 441
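For anyone reading the log, the percent-encoded query strings can be decoded with Ruby's standard library. This is only a minimal sketch; the variable names are illustrative and it does not reproduce what heliotrope or rack do internally. Note that in the log the plain query returns 200 and only the "subject:" query returns 500.

```ruby
require "uri"

# Query strings copied from the access log above.
plain    = "%E6%89%8B%E7%B4%99"
prefixed = "subject%3A+%E6%89%8B%E7%B4%99"

# decode_www_form_component turns "+" into a space and %XX into bytes,
# tagging the result as UTF-8 by default.
decoded_plain    = URI.decode_www_form_component(plain)
decoded_prefixed = URI.decode_www_form_component(prefixed)

puts decoded_plain
puts decoded_prefixed
```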
I am running the latest heliotrope from git with ruby 1.9.2 from the default
Kubuntu 10.10 distribution.
--
regards,
Horacio Sanson
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel
* Re: [sup-devel] Cannot query Japanese characters
2011-04-25 1:23 [sup-devel] Cannot query Japanese characters Horacio Sanson
@ 2011-04-26 4:49 ` William Morgan
2011-04-29 4:52 ` William Morgan
0 siblings, 1 reply; 15+ messages in thread
From: William Morgan @ 2011-04-26 4:49 UTC (permalink / raw)
To: sup-devel
Reformatted excerpts from Horacio Sanson's message of 2011-04-25:
> I like sup's idea and have a lot of hope in heliotrope but unfortunately both
> have problems when dealing with my language: Japanese.
>
> When I put a search string like this "subject: 手紙" I get the following
> crash:
Thanks for the bug report on this one too. It's great to have someone testing
this stuff with non-ASCII code. This is a known bug in Whistlepig and I should
be releasing a fix soon.
--
William <wmorgan-sup@masanjin.net>
* Re: [sup-devel] Cannot query Japanese characters
2011-04-26 4:49 ` William Morgan
@ 2011-04-29 4:52 ` William Morgan
2011-05-01 15:35 ` Horacio Sanson
0 siblings, 1 reply; 15+ messages in thread
From: William Morgan @ 2011-04-29 4:52 UTC (permalink / raw)
To: sup-devel
Reformatted excerpts from William Morgan's message of 2011-04-26:
> Thanks for the bug report on this one too. It's great to have someone
> testing this stuff with non-ASCII code. This is a known bug in
> Whistlepig and I should be releasing a fix soon.
This is fixed in Whistlepig 0.6. Heliotrope should now be fine with
utf-8 input. I'm still working on this issue in turnsole.
Let me know if you have any more issues!
--
William <wmorgan-sup@masanjin.net>
* Re: [sup-devel] Cannot query Japanese characters
2011-04-29 4:52 ` William Morgan
@ 2011-05-01 15:35 ` Horacio Sanson
2011-05-01 15:46 ` Horacio Sanson
0 siblings, 1 reply; 15+ messages in thread
From: Horacio Sanson @ 2011-05-01 15:35 UTC (permalink / raw)
To: Sup developer discussion
I installed Whistlepig 0.6, but now I get a different error that looks
similar to the turnsole problem. The backtrace is below:
http://localhost:8042/search?q=primo -> /search?q=%7Einbox&start=0&num=20
127.0.0.1 - - [02/May/2011 00:31:58] "GET /favicon.ico HTTP/1.1" 404 447 0.0008
localhost - - [02/May/2011:00:31:58 JST] "GET /favicon.ico HTTP/1.1" 404 447
- -> /favicon.ico
search(body:"会", 0, 20) took 0.0ms
Encoding::CompatibilityError - incompatible character encodings: UTF-8
and ASCII-8BIT:
bin/heliotrope-server:154:in `block in <class:HeliotropeServer>'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:1152:in `call'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:1152:in
`block in compile!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:724:in
`instance_eval'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:724:in `route_eval'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:708:in
`block (2 levels) in route!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:758:in
`block in process_route'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:755:in `catch'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:755:in
`process_route'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:707:in
`block in route!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:706:in `each'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:706:in `route!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:843:in `dispatch!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:644:in
`block in call!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:808:in
`instance_eval'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:808:in
`block in invoke'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:808:in `catch'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:808:in `invoke'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:644:in `call!'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/base.rb:629:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/head.rb:9:in `call'
/var/lib/gems/1.9.1/gems/sinatra-1.2.5/lib/sinatra/showexceptions.rb:21:in
`call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/lint.rb:48:in `_call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/lint.rb:36:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/showexceptions.rb:24:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/commonlogger.rb:18:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/content_length.rb:13:in `call'
/var/lib/gems/1.9.1/gems/rack-1.2.2/lib/rack/handler/webrick.rb:52:in `service'
/usr/lib/ruby/1.9.1/webrick/httpserver.rb:111:in `service'
/usr/lib/ruby/1.9.1/webrick/httpserver.rb:70:in `run'
/usr/lib/ruby/1.9.1/webrick/server.rb:183:in `block in start_thread'
127.0.0.1 - - [02/May/2011 00:32:09] "GET /search?q=%E4%BC%9A
HTTP/1.1" 500 89861 0.0228
localhost - - [02/May/2011:00:32:09 JST] "GET /search?q=%E4%BC%9A
HTTP/1.1" 500 89861
http://localhost:8042/search?q=%7Einbox&start=0&num=20 -> /search?q=%E4%BC%9A
127.0.0.1 - - [02/May/2011 00:32:09] "GET /favicon.ico HTTP/1.1" 404 447 0.0009
localhost - - [02/May/2011:00:32:09 JST] "GET /favicon.ico HTTP/1.1" 404 447
- -> /favicon.ico
regards,
Horacio
* Re: [sup-devel] Cannot query Japanese characters
2011-05-01 15:35 ` Horacio Sanson
@ 2011-05-01 15:46 ` Horacio Sanson
2011-05-03 14:24 ` Horacio Sanson
0 siblings, 1 reply; 15+ messages in thread
From: Horacio Sanson @ 2011-05-01 15:46 UTC (permalink / raw)
To: Sup developer discussion
I also tried with ruby 1.8. Heliotrope does not crash, but searching for
any Japanese word returns no matches, even for search terms that I know
have matches.
By the way, the installation instructions should mention that for
ruby 1.8 we also need to install the json gem, or heliotrope won't
start.
regards,
Horacio
* Re: [sup-devel] Cannot query Japanese characters
2011-05-01 15:46 ` Horacio Sanson
@ 2011-05-03 14:24 ` Horacio Sanson
2011-05-03 22:26 ` William Morgan
0 siblings, 1 reply; 15+ messages in thread
From: Horacio Sanson @ 2011-05-03 14:24 UTC (permalink / raw)
To: Sup developer discussion
[-- Attachment #1: Type: text/plain, Size: 5730 bytes --]
I managed to stop the crash when searching for Japanese text by
forcing UTF-8 encoding on the parsed query string (see patch).
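A minimal sketch of this class of error, assuming (as the error message and the patch suggest) that the parsed query string comes back tagged ASCII-8BIT while the rest of the page is UTF-8. The strings are made up for illustration and are not taken from heliotrope's code:

```ruby
original = "Search: 会"        # UTF-8 text containing a non-ASCII character
parsed   = "body:\"会\"".b     # the same kind of bytes, but tagged ASCII-8BIT

# Concatenating two non-ASCII strings with different encodings fails.
raised = begin
  original + parsed
  false
rescue Encoding::CompatibilityError
  true
end
puts "concatenation raised: #{raised}"

# Re-tagging the bytes as UTF-8 (no transcoding happens) makes them compatible.
fixed = original + parsed.dup.force_encoding("UTF-8")
puts fixed
```

force_encoding only changes the encoding label, so it is safe only when the bytes really are UTF-8.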
But it seems that Whistlepig cannot speak Japanese. I tried the following
small test and, as you can see, I get no results:
> require 'rubygems' => true
> require 'whistlepig' => true
> include Whistlepig => Object
> index = Index.new "index" => #<Whistlepig::Index:0x00000002093f60>
> entry1 = Entry.new => #<Whistlepig::Entry:0x0000000207d328>
> entry1.add_string "body", "研究会" => #<Whistlepig::Entry:0x0000000207d328>
> docid1 = index.add_entry entry1 => 1
> q1 = Query.new "body", "研究" => body:"研究"
> results1 = index.search q1 => []
I will now dig into the Whistlepig source code to see if I can fix this, but
any pointers, directions, or tips on where to start looking would be
greatly appreciated.
[-- Attachment #2: 0001-Fix-crash-for-non-ASCII-chars.patch --]
[-- Type: text/x-patch, Size: 986 bytes --]
From 0881630c8b410b6f78df578bf686afacbb78ec64 Mon Sep 17 00:00:00 2001
From: Horacio Sanson <hsanson@gmail.com>
Date: Tue, 3 May 2011 23:18:22 +0900
Subject: [PATCH] Fix crash for non ASCII chars.
---
bin/heliotrope-server | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/bin/heliotrope-server b/bin/heliotrope-server
index 4793ac2..ed9c3be 100644
--- a/bin/heliotrope-server
+++ b/bin/heliotrope-server
@@ -151,7 +151,7 @@ class HeliotropeServer < Sinatra::Base
nav += "</div>"
header("Search: #{query.original_query_s}", query.original_query_s) +
- "<div>Parsed query: #{escape_html query.parsed_query_s}</div>" +
+ "<div>Parsed query: #{escape_html query.parsed_query_s.force_encoding('UTF-8')}</div>" +
"<div>Search took #{sprintf '%.2f', info[:elapsed]}s and #{info[:continued] ? 'was' : 'was NOT'} continued</div>" +
"#{nav}<table>" +
results.map { |r| threadinfo_to_html r }.join +
--
1.7.4.1
* Re: [sup-devel] Cannot query Japanese characters
2011-05-03 14:24 ` Horacio Sanson
@ 2011-05-03 22:26 ` William Morgan
2011-05-04 1:42 ` Horacio Sanson
0 siblings, 1 reply; 15+ messages in thread
From: William Morgan @ 2011-05-03 22:26 UTC (permalink / raw)
To: sup-devel
Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
> index = Index.new "index" => #<Whistlepig::Index:0x00000002093f60>
> entry1 = Entry.new => #<Whistlepig::Entry:0x0000000207d328>
> entry1.add_string "body", "研究会" => #<Whistlepig::Entry:0x0000000207d328>
> docid1 = index.add_entry entry1 => 1
> q1 = Query.new "body", "研究" => body:"研究"
> results1 = index.search q1 => []
The problem here is tokenization. Whistlepig only provides a very simple
tokenizer, namely, it looks for space-separated things [1]. So you have to
space-separate your tokens in both the indexing and querying stages, e.g.:
entry1.add_string "body", "研 究 会" => #<Whistlepig::Entry:0x90b873c>
docid1 = index.add_entry entry1 => 1
q1 = Query.new "body", "研 究" => AND body:"研" body:"究"
q1 = Query.new "body", "\"研 究\"" => PHRASE body:"研" body:"究"
results1 = index.search q1 => [1]
For Japanese, proper tokenization is tricky. You could simply space-separate
every character and deal with the spurious matches across word boundaries.
Or you could do it right by plugging in a proper tokenizer, e.g. something
like http://www.chasen.org/~taku/software/TinySegmenter/.
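The simpler of the two options, space-separating every character, could look roughly like the sketch below. The helper name is hypothetical and the set of Unicode scripts is an assumption; the same function would have to be applied to both the indexed text and the query.

```ruby
# Characters from these scripts are written without spaces between words.
CJK_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

# Surround each CJK character with spaces, then collapse runs of whitespace,
# so that a whitespace tokenizer sees every character as its own token.
def space_separate_cjk(text)
  text.gsub(CJK_CHAR) { |ch| " #{ch} " }.split.join(" ")
end

puts space_separate_cjk("研究会")
puts space_separate_cjk("sup 研究 list")
```

Non-CJK words are left intact, so mixed-language text still indexes its ASCII tokens as before.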
[1] It also strips any prefix or suffix characters that match [:punct:]. This
is all pretty ad-hoc and undocumented. Providing a simpler whitespace-only
tokenizer as an alternative is in the works.
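To make the behaviour described above concrete, here is a rough Ruby approximation. It is not Whistlepig's actual tokenizer (which lives in the C code); the function name is hypothetical and the rules are reduced to "split on whitespace, trim punctuation from both ends".

```ruby
# Split on whitespace, then strip leading and trailing punctuation
# from each piece and drop anything that ends up empty.
def simple_tokens(text)
  text.split
      .map { |t| t.gsub(/\A[[:punct:]]+|[[:punct:]]+\z/, "") }
      .reject(&:empty?)
end

simple_tokens("Hello, world! (test)")  # => ["Hello", "world", "test"]
simple_tokens("研究会")                 # => ["研究会"]
simple_tokens("研 究 会")               # => ["研", "究", "会"]

# An unspaced Japanese string is a single token, so a query for a
# substring of it never matches any indexed token.
puts simple_tokens("研究会").include?("研究")
```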
--
William <wmorgan-sup@masanjin.net>
* Re: [sup-devel] Cannot query Japanese characters
2011-05-03 22:26 ` William Morgan
@ 2011-05-04 1:42 ` Horacio Sanson
2011-05-04 2:03 ` Horacio Sanson
2011-05-04 16:56 ` William Morgan
0 siblings, 2 replies; 15+ messages in thread
From: Horacio Sanson @ 2011-05-04 1:42 UTC (permalink / raw)
To: Sup developer discussion
[-- Attachment #1: Type: text/plain, Size: 2702 bytes --]
Chasen is the worst tokenizer; it is pretty old. The best one is MeCab,
which is faster and by the same author as Chasen.
You can see all the major Japanese tokenizers in action at this URL:
http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some
text in the box and press the button.
After some hacking I got a Heliotrope server that works perfectly with
Japanese text. All I did was follow your comments
and apply the MeCab tokenizer to the message body and query strings
before passing them to Whistlepig or, more specifically,
to Heliotrope::Index.
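The idea, running the same segmenter over indexed text and over queries, can be sketched with a toy in-memory index. Everything here is hypothetical: TinyIndex is not Heliotrope::Index, and the hard-coded dictionary stands in for MeCab so the sketch runs without the binding. With the binding installed, the lambda would instead wrap the call used in the attached patch, MeCab::Tagger.new("-Owakati").parse(text).

```ruby
# Stand-in segmenter: inserts spaces around known words.
WORDS = ["研究", "会"]
SEGMENT = lambda do |text|
  text.gsub(Regexp.union(WORDS)) { |w| " #{w} " }.split.join(" ")
end

class TinyIndex
  def initialize(segmenter)
    @segmenter = segmenter
    @postings  = Hash.new { |h, k| h[k] = [] }
    @next_id   = 0
  end

  # Segment the text, then record the document id under each token.
  def add(text)
    id = (@next_id += 1)
    @segmenter.call(text).split.uniq.each { |tok| @postings[tok] << id }
    id
  end

  # Segment the query the same way and AND the tokens together.
  def search(query)
    toks = @segmenter.call(query).split
    return [] if toks.empty?
    toks.map { |t| @postings.fetch(t, []) }.reduce(:&)
  end
end

index = TinyIndex.new(SEGMENT)
index.add("研究会")
p index.search("研究")
```

If only one side is segmented, the tokens no longer line up and the search comes back empty, which is the behaviour seen in the earlier Whistlepig test.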
There is one problem I don't see how to handle: I receive email
in Japanese but also in Chinese and Korean. I need a different
tokenizer for each one and I have no idea how to handle this. Do email
messages contain a language header that would allow me
to identify the language and pass the text to the corresponding tokenizer?
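A Content-Language header does exist (RFC 3282), but it is optional, so one fallback is to guess from the Unicode scripts present in the text. The sketch below is only a heuristic under that assumption; the function name and labels are hypothetical, and Han-only text is ambiguous between Chinese and kana-free Japanese.

```ruby
# Guess which tokenizer to use from the scripts found in the text.
def guess_cjk_language(text)
  return :japanese if text =~ /[\p{Hiragana}\p{Katakana}]/  # kana is Japanese-only
  return :korean   if text =~ /\p{Hangul}/
  return :chinese  if text =~ /\p{Han}/                      # ambiguous, see above
  :other
end

puts guess_cjk_language("手紙を書く")
puts guess_cjk_language("안녕하세요")
puts guess_cjk_language("你好")
puts guess_cjk_language("hello")
```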
regards,
Horacio
[-- Attachment #2: 0001-Fix-crash-for-non-ASCII-chars.patch --]
[-- Type: text/x-patch, Size: 989 bytes --]
From f484b09518db47a06690e09a710cf6e866c5561b Mon Sep 17 00:00:00 2001
From: Horacio Sanson <hsanson@gmail.com>
Date: Wed, 4 May 2011 10:31:12 +0900
Subject: [PATCH 1/2] Fix crash for non ASCII chars
---
bin/heliotrope-server | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/bin/heliotrope-server b/bin/heliotrope-server
index 4793ac2..ed9c3be 100644
--- a/bin/heliotrope-server
+++ b/bin/heliotrope-server
@@ -151,7 +151,7 @@ class HeliotropeServer < Sinatra::Base
nav += "</div>"
header("Search: #{query.original_query_s}", query.original_query_s) +
- "<div>Parsed query: #{escape_html query.parsed_query_s}</div>" +
+ "<div>Parsed query: #{escape_html query.parsed_query_s.force_encoding('UTF-8')}</div>" +
"<div>Search took #{sprintf '%.2f', info[:elapsed]}s and #{info[:continued] ? 'was' : 'was NOT'} continued</div>" +
"#{nav}<table>" +
results.map { |r| threadinfo_to_html r }.join +
--
1.7.4.1
[-- Attachment #3: 0002-Add-MeCab-japanese-text-analyzer.patch --]
[-- Type: text/x-patch, Size: 1914 bytes --]
From 6595af0b55d52d1f68562fbdd0f1b23dfee34039 Mon Sep 17 00:00:00 2001
From: Horacio Sanson <hsanson@gmail.com>
Date: Wed, 4 May 2011 10:34:48 +0900
Subject: [PATCH 2/2] Add MeCab japanese text analyzer.
Japanese text has no white space separation causing the Whistelpig
tokenizer to fail. This patch processes the email indexable text
and search queries with MeCab before passing them to Whistelpig.
---
bin/heliotrope-server | 3 ++-
lib/heliotrope/message.rb | 5 +++--
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/bin/heliotrope-server b/bin/heliotrope-server
index ed9c3be..f3bd5d4 100644
--- a/bin/heliotrope-server
+++ b/bin/heliotrope-server
@@ -67,6 +67,7 @@ class HeliotropeServer < Sinatra::Base
end.to_json
end
+ require "MeCab"
def get_query_from_params
## work around a rack (?) bug where quotes are omitted in queries like "hello bob"
query = if env["rack.request.query_string"] =~ /\bq=(.+?)(&|$)/
@@ -76,7 +77,7 @@ class HeliotropeServer < Sinatra::Base
end
raise RequestError, "need a query" unless query
- query
+ MeCab::Tagger.new("-Owakati").parse(query).force_encoding("UTF-8")
end
def get_search_results
diff --git a/lib/heliotrope/message.rb b/lib/heliotrope/message.rb
index b48329b..e61d8bd 100644
--- a/lib/heliotrope/message.rb
+++ b/lib/heliotrope/message.rb
@@ -76,6 +76,7 @@ class Message
def indirect_recipients; cc + bcc end
def recipients; direct_recipients + indirect_recipients end
+ require "MeCab"
def indexable_text
@indexable_text ||= begin
v = ([from.indexable_text] +
@@ -90,8 +91,8 @@ class Message
end
).flatten.compact.join(" ")
- v.gsub(/\s+[\W\d_]+(\s|$)/, " "). # drop funny tokens
- gsub(/\s+/, " ")
+ MeCab::Tagger.new("-Owakati").parse(v) # Tokenize Japanese Text
+ .gsub(/\s+/, " ")
end
end
--
1.7.4.1
* Re: [sup-devel] Cannot query Japanese characters
2011-05-04 1:42 ` Horacio Sanson
@ 2011-05-04 2:03 ` Horacio Sanson
2011-05-04 16:56 ` William Morgan
1 sibling, 0 replies; 15+ messages in thread
From: Horacio Sanson @ 2011-05-04 2:03 UTC (permalink / raw)
To: Sup developer discussion
Forgot to mention you need the mecab ruby gem. In Ubuntu 10.04 this
gem is part of the distribution and can be installed with the command:
sudo apt-get install libmecab-ruby1.8 libmecab-ruby1.9.1 mecab-ipadic-utf8
regards
Horacio
On Wed, May 4, 2011 at 10:42 AM, Horacio Sanson <hsanson@gmail.com> wrote:
> Chasen is the worst tokenizer, is pretty old. The best one is MeCab
> that is the faster and from the same author of Chasen.
> You can see all major Japanese tokenizer in action at this URL:
> http://nomadscafe.jp/test/keitaiso/index.cgi. Just put some
> text in the box and press the button.
>
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments
> and apply the MeCab tokenizer to the message body and query strings
> before passing them to Whistlepig, or more specifically
> to Heliotrope::Index.
>
> There is one problem I don't see how to handle... I do receive email
> in Japanese but also Chinese and Korean. I need a different
> tokenizer for each one and I have no idea how to handle this. Do email
> messages contain a language header that would allow me
> to identify the language and pass it to the corresponding tokenizer??
>
>
> regards,
> Horacio
>
> On Wed, May 4, 2011 at 7:26 AM, William Morgan <wmorgan-sup@masanjin.net> wrote:
>> Reformatted excerpts from Horacio Sanson's message of 2011-05-03:
>>> index = Index.new "index" => #<Whistlepig::Index:0x00000002093f60>
>>> entry1 = Entry.new => #<Whistlepig::Entry:0x0000000207d328>
>>> entry1.add_string "body", "研究会" => #<Whistlepig::Entry:0x0000000207d328>
>>> docid1 = index.add_entry entry1 => 1
>>> q1 = Query.new "body", "研究" => body:"研究"
>>> results1 = index.search q1 => []
>>
>> The problem here is tokenization. Whistlepig only provides a very simple
>> tokenizer, namely, it looks for space-separated things [1]. So you have to
>> space-separate your tokens in both the indexing and querying stages, e.g.:
>>
>> entry1.add_string "body", "研 究 会" => #<Whistlepig::Entry:0x90b873c>
>> docid1 = index.add_entry entry1 => 1
>> q1 = Query.new "body", "研 究" => AND body:"研" body:"究"
>> q1 = Query.new "body", "\"研 究\"" => PHRASE body:"研" body:"究"
>> results1 = index.search q1 => [1]
>>
>> For Japanese, proper tokenization is tricky. You could simply space-separate
>> every character and deal with the spurious matches across word boundaries.
>> Or you could do it right by plugging in a proper tokenizer, e.g. something
>> like http://www.chasen.org/~taku/software/TinySegmenter/.
>>
>> [1] It also strips any prefix or suffix characters that match [:punct:]. This
>> is all pretty ad-hoc and undocumented. Providing a simpler whitespace-only
>> tokenizer as an alternative is in the works.
>> --
>> William <wmorgan-sup@masanjin.net>
>>
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [sup-devel] Cannot query Japanese characters
2011-05-04 1:42 ` Horacio Sanson
2011-05-04 2:03 ` Horacio Sanson
@ 2011-05-04 16:56 ` William Morgan
2011-05-06 3:30 ` Horacio Sanson
1 sibling, 1 reply; 15+ messages in thread
From: William Morgan @ 2011-05-04 16:56 UTC (permalink / raw)
To: sup-devel
Hi Horacio,
Thanks for all your help so far.
Reformatted excerpts from Horacio Sanson's message of 2011-05-04:
> After some hacking I got a Heliotrope server that works perfectly with
> Japanese text. All I did was follow your comments
> and apply the MeCab tokenizer to the message body and query strings
> before passing them to Whistlepig, or more specifically
> to Heliotrope::Index.
Great!
> There is one problem I don't see how to handle... I do receive email
> in Japanese but also Chinese and Korean. I need a different
> tokenizer for each one and I have no idea how to handle this. Do email
> messages contain a language header that would allow me
> to identify the language and pass it to the corresponding tokenizer??
There's not a great way to do this in email. You can look at the
content-type headers, which are sometimes present, and that will
sometimes give you a clue. But it's usually useless.
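
As a rough illustration of the content-type clue mentioned above, here is a minimal sketch that pulls the charset out of a Content-Type header value and maps it to a language hint. The helper name and the charset table are assumptions made for this example, not part of Heliotrope, and a UTF-8 message (the common case) yields no hint at all.

```ruby
# Hypothetical helper: guess a CJK language from a Content-Type header value.
# The charset table is an illustrative assumption, not an exhaustive list.
CHARSET_LANGUAGE_HINTS = {
  "iso-2022-jp" => :japanese,
  "shift_jis"   => :japanese,
  "euc-jp"      => :japanese,
  "euc-kr"      => :korean,
  "iso-2022-kr" => :korean,
  "gb2312"      => :chinese,
  "gbk"         => :chinese,
  "gb18030"     => :chinese,
  "big5"        => :chinese,
}.freeze

def language_hint_from_content_type(header)
  return nil if header.nil?
  # Match charset=value, with or without double quotes around the value.
  match = header.match(/charset\s*=\s*"?([^";\s]+)"?/i)
  return nil unless match
  # Unknown or language-neutral charsets (e.g. utf-8) give nil.
  CHARSET_LANGUAGE_HINTS[match[1].downcase]
end
```

A nil result means the header gave no usable clue, so the caller would have to fall back to looking at the text itself.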
You can write some heuristics by hand, of course. Or you can try naive
bayes, which performs pretty well on this type of task. It looks like
someone just started a ruby project here: https://github.com/fela/rlid.
It seems to only have European languages so far, but you can probably
just dump in some CJK text and retrain.
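
A hand-written heuristic of the kind suggested above could look at which Unicode scripts appear in the text. The sketch below is only one possible approach; the function name and the decision order are assumptions, and it is not a substitute for a trained classifier. In particular, text containing only Han characters is reported as Chinese even though Japanese can be written entirely in kanji.

```ruby
# Hypothetical heuristic: guess the language of a UTF-8 string from the
# Unicode scripts it contains. Han-only text is reported as :chinese, which
# is a guess, since Japanese text can also be written entirely in kanji.
def guess_cjk_language(text)
  kana   = text.scan(/[\p{Hiragana}\p{Katakana}]/).size
  hangul = text.scan(/\p{Hangul}/).size
  han    = text.scan(/\p{Han}/).size

  return :korean   if hangul > 0 && hangul >= kana
  return :japanese if kana > 0
  return :chinese  if han > 0
  :other
end
```

The result could then be used to pick a tokenizer per message, with :other falling through to plain whitespace tokenization.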
As for your patches: I've applied a related patch to fix the encoding
issue with Query#parsed_query_s. Can you let me know if that works?
Rather than sticking mecab directly in heliotrope, I am going to make a
hook for users to plug in their own custom tokenization code like you're
doing.
--
William <wmorgan-sup@masanjin.net>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [sup-devel] Cannot query Japanese characters
2011-05-04 16:56 ` William Morgan
@ 2011-05-06 3:30 ` Horacio Sanson
2011-06-08 5:21 ` William Morgan
0 siblings, 1 reply; 15+ messages in thread
From: Horacio Sanson @ 2011-05-06 3:30 UTC (permalink / raw)
To: Sup developer discussion
Great, let me know when you have the modifications so I can stress test them.
regards,
Horacio
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [sup-devel] Cannot query Japanese characters
2011-05-06 3:30 ` Horacio Sanson
@ 2011-06-08 5:21 ` William Morgan
2011-06-09 13:48 ` Horacio Sanson
0 siblings, 1 reply; 15+ messages in thread
From: William Morgan @ 2011-06-08 5:21 UTC (permalink / raw)
To: sup-devel
Reformatted excerpts from Horacio Sanson's message of 2011-05-06:
> Great, let me know when you have the modifications so I can stress
> test them.
In the most recent version of Heliotrope, there are two hooks you can
use to do this: transform-text and transform-query. To use them, place
your Ruby code in files called <store dir>/hooks/<hook-name>.rb.
For transform-text, place your code in <store dir>/hooks/transform-text.rb
This hook will be called on any text added to the index. The 'text'
variable will contain the text, and the hook should return (i.e. the
last command should evaluate to) the transformed text.
Example:
$ cat store/hooks/transform-text.rb
require 'MeCab'
MeCab::Tagger.new("-Owakati").parse(text).gsub(/\s+/, " ")
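
To make the "variable in scope, last expression is the result" contract concrete, here is a sketch of how a hook file could be evaluated. This is NOT Heliotrope's actual implementation; the run_hook method and its arguments are hypothetical, and the example hook only collapses whitespace so that it runs without MeCab installed.

```ruby
require "tmpdir"

# Hypothetical hook runner; NOT Heliotrope's actual implementation.
# It evaluates a hook file with one named local variable in scope and
# returns the value of the file's last expression.
def run_hook(path, variable_name, value)
  return value unless File.exist?(path)   # no hook installed: pass through
  scope = binding
  scope.local_variable_set(variable_name, value)
  eval(File.read(path), scope, path)
end

# Usage: write a tiny transform-text hook into a temporary store directory.
Dir.mktmpdir do |dir|
  hook = File.join(dir, "transform-text.rb")
  File.write(hook, 'text.gsub(/\s+/, " ")')
  run_hook(hook, :text, "a  b\n c")   # the hook's last expression is the result
end
```

Evaluating the file against a binding is what lets the hook refer to a bare `text` variable without defining a method.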
For transform-query, place your code in <store
dir>/hooks/transform-query.rb. This hook will be called on any query
before it is executed. The 'query' variable will contain the query
string, and the hook should return (i.e. the last command should
evaluate to) the transformed query.
Example:
$ cat store/hooks/transform-query.rb
require 'MeCab'
MeCab::Tagger.new("-Owakati").parse(query).gsub(/\s+/, " ")
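
If MeCab is not available, the cruder approach discussed earlier in this thread (space-separating every character) can be written in plain Ruby. The sketch below is an assumption-laden fallback, not a proper tokenizer: the method name is hypothetical, and splitting per character will produce spurious matches across word boundaries.

```ruby
# Hypothetical fallback tokenizer: put spaces around every Han, Hiragana,
# Katakana, or Hangul character so a whitespace tokenizer sees each one as
# its own token. Runs of other characters (e.g. Latin words) are left intact.
CJK_CHAR = /[\p{Han}\p{Hiragana}\p{Katakana}\p{Hangul}]/

def space_separate_cjk(text)
  text.gsub(CJK_CHAR) { |ch| " #{ch} " }.gsub(/\s+/, " ").strip
end
```

In a hook file the last line would then be `space_separate_cjk(text)` for transform-text or `space_separate_cjk(query)` for transform-query. The same transformation has to be applied on both sides, otherwise indexed tokens and query tokens will not line up.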
Let me know if you have any problems with these hooks!
--
William <wmorgan-sup@masanjin.net>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [sup-devel] Cannot query Japanese characters
2011-06-08 5:21 ` William Morgan
@ 2011-06-09 13:48 ` Horacio Sanson
2011-06-09 14:08 ` Horacio Sanson
2011-06-09 22:46 ` William Morgan
0 siblings, 2 replies; 15+ messages in thread
From: Horacio Sanson @ 2011-06-09 13:48 UTC (permalink / raw)
To: Sup developer discussion
[-- Attachment #1: Type: text/plain, Size: 1959 bytes --]
On Wed, Jun 8, 2011 at 2:21 PM, William Morgan <wmorgan-sup@masanjin.net> wrote:
> Reformatted excerpts from Horacio Sanson's message of 2011-05-06:
>> Great, let me know when you have the modifications so I can stress
>> test them.
>
> In the most recent version of Heliotrope, there are two hooks you can
> use to do this: transform-text and transform-query. To use them, place
> your Ruby code in files called <store dir>/hooks/<hook-name>.rb.
>
> For transform-text, place your code in <store dir>/hooks/transform-text.rb
> This hook will be called on any text added to the index. The 'text'
> variable will contain the text, and the hook should return (i.e. the
> last command should evaluate to) the transformed text.
>
> Example:
> $ cat store/hooks/transform-text.rb
> require 'MeCab'
> MeCab::Tagger.new("-Owakati").parse(text).gsub(/\s+/, " ")
>
> For transform-query, place your code in <store
> dir>/hooks/transform-query.rb. This hook will be called on any query
> before it is executed. The 'query' variable will contain the query
> string, and the hook should return (i.e. the last command should
> evaluate to) the transformed query.
>
> Example:
> $ cat store/hooks/transform-query.rb
> require 'MeCab'
> MeCab::Tagger.new("-Owakati").parse(query).gsub(/\s+/, " ")
>
> Let me know if you have any problems with these hooks!
Great, I am downloading my gmail accounts now (again). I can see you
have improved imap-dumper.rb to handle uidvalidity and uidnext;
that is also great. The git log says gmail labels are also copied
to heliotrope, but I don't see them in my index.
BTW there are two small bugs in imap-dumper.rb; see the attached patch
for details.
regards,
Horacio
[-- Attachment #2: 0001-Fix-imap-dumper.patch --]
[-- Type: text/x-patch, Size: 1034 bytes --]
From 4bf24f16612c954bbbdcdb9b48a70571c3bb1a4d Mon Sep 17 00:00:00 2001
From: Horacio Sanson <hsanson@gmail.com>
Date: Thu, 9 Jun 2011 22:39:39 +0900
Subject: [PATCH] Fix imap-dumper.
---
lib/heliotrope/imap-dumper.rb | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/lib/heliotrope/imap-dumper.rb b/lib/heliotrope/imap-dumper.rb
index 5a96960..e5dcc1b 100644
--- a/lib/heliotrope/imap-dumper.rb
+++ b/lib/heliotrope/imap-dumper.rb
@@ -3,7 +3,7 @@ require "net/imap"
require 'json'
module Heliotrope
-class ImapDumper
+class IMAPDumper
def initialize opts
@host = opts[:host] or raise ArgumentError, "need :host"
@username = opts[:username] or raise ArgumentError, "need :username"
@@ -11,7 +11,7 @@ class ImapDumper
@fn = opts[:fn] or raise ArgumentError, "need :fn"
@ssl = opts.member?(:ssl) ? opts[:ssl] : true
- @port = opts[:port] || (ssl ? 993 : 143)
+ @port = opts[:port] || (@ssl ? 993 : 143)
@folder = opts[:folder] || "inbox"
@msgs = []
--
1.7.4.1
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [sup-devel] Cannot query Japanese characters
2011-06-09 13:48 ` Horacio Sanson
@ 2011-06-09 14:08 ` Horacio Sanson
2011-06-09 22:46 ` William Morgan
1 sibling, 0 replies; 15+ messages in thread
From: Horacio Sanson @ 2011-06-09 14:08 UTC (permalink / raw)
To: Sup developer discussion
Unfortunately the gmail sync failed. The error is below:
requesting messages 40266..40315 from imap server
; gmail loving gave us 19 messages in 4.6s = a whopping 4.1m/s
scanned 748, indexed 747, skipped 0 bad and 1 seen messages in 400.4s = 1.9 m/s
; requesting messages 40316..40365 from imap server
; gmail loving gave us 8 messages in 2.2s = a whopping 3.7m/s
; requesting messages 40366..40415 from imap server
/media/DATA/Apps/heliotrope/lib/heliotrope/imap-dumper.rb:73:in
`next_message': undefined method `size' for nil:NilClass
(NoMethodError)
from bin/heliotrope-add:128:in `<main>'
For some reason "uid_fetch" is returning nil instead of an empty array
(contrary to what the Net::IMAP documentation says). I am not really sure
whether checking for nil and simply repeating the query would resolve the
problem, but it seems to be the best option.
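
The nil check described above could be sketched as follows. This is not the code in imap-dumper.rb; the fetch_batch name and the max_attempts parameter are hypothetical, and whether retrying actually helps against gmail is untested. It only assumes an object that responds to uid_fetch the way Net::IMAP does.

```ruby
# Hypothetical wrapper around a Net::IMAP-like object. It treats a nil
# result from uid_fetch as "try again", and gives up with an empty array
# after max_attempts so callers can always call #size on the result.
def fetch_batch(imap, uid_range, attrs, max_attempts: 3)
  max_attempts.times do
    messages = imap.uid_fetch(uid_range, attrs)
    return messages unless messages.nil?
  end
  []
end
```

Returning an empty array after the last attempt keeps the caller's `size` call from raising NoMethodError, at the cost of silently skipping that batch.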
regards
Horacio
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [sup-devel] Cannot query Japanese characters
2011-06-09 13:48 ` Horacio Sanson
2011-06-09 14:08 ` Horacio Sanson
@ 2011-06-09 22:46 ` William Morgan
1 sibling, 0 replies; 15+ messages in thread
From: William Morgan @ 2011-06-09 22:46 UTC (permalink / raw)
To: sup-devel
Reformatted excerpts from Horacio Sanson's message of 2011-06-09:
> Great, I am downloading my gmail accounts now (again). I can see you
> have improved imap-dumper.rb to handle uidvalidity and uidnext;
> that is also great. The git log says gmail labels are also copied
> to heliotrope, but I don't see them in my index.
I actually added a separate gmail importer that does that stuff,
borrowing from your gmail.rb. Try the -g option.
It's mostly duplicated with imap-dumper.rb so I'm trying to decide how
to best merge them.
> BTW there are two small bugs in imap-dumper.rb; see the attached patch
> for details.
Thank you!
--
William <wmorgan-sup@masanjin.net>
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2011-06-09 22:51 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
2011-04-25 1:23 [sup-devel] Cannot query Japanese characters Horacio Sanson
2011-04-26 4:49 ` William Morgan
2011-04-29 4:52 ` William Morgan
2011-05-01 15:35 ` Horacio Sanson
2011-05-01 15:46 ` Horacio Sanson
2011-05-03 14:24 ` Horacio Sanson
2011-05-03 22:26 ` William Morgan
2011-05-04 1:42 ` Horacio Sanson
2011-05-04 2:03 ` Horacio Sanson
2011-05-04 16:56 ` William Morgan
2011-05-06 3:30 ` Horacio Sanson
2011-06-08 5:21 ` William Morgan
2011-06-09 13:48 ` Horacio Sanson
2011-06-09 14:08 ` Horacio Sanson
2011-06-09 22:46 ` William Morgan