Message-ID: <5199CFA7.9020900@gaute.vetsj.com>
Date: Mon, 20 May 2013 09:24:23 +0200
From: Gaute Hope
To: Horacio Sanson
CC: Sup developer discussion
Subject: Re: [sup-devel] sup 0.13

Hi,

An issue regarding this was recently opened:
https://github.com/sup-heliotrope/sup/issues/60

Regards, Gaute

On 9 May 2013 03:39, Horacio Sanson wrote:
> UTF-8 handles most cases but I still have to deal with emails in
> ISO-2022-JP, Shift-JIS and EUC-JP. After some research it seems Xapian has
> no support for Asian languages. I will try to run some tests and open an
> issue if I cannot make it work.
>
> I can see in the sup configuration file that the stem language can be
> configured but there are no CJK stemmers for Xapian that I can find.
>
>
> On Thu, May 2, 2013 at 5:17 PM, Gaute Hope wrote:
>
>>
>>
>> On 30 April 2013 11:44, Horacio Sanson wrote:
>>> Great to see Sup getting back on track again.
>>>
>>> I submitted some patches for the Gmail dumper of Heliotrope some time ago
>>> but the lack of support for non-alphabetic languages (Japanese, Chinese)
>>> made it impossible for me to keep using heliotrope/turnsole.
>>>
>>> The main issue in supporting Japanese/Chinese with heliotrope was that
>>> whistlepig (indexer) lacked the ability to tokenize these languages. Also
>>> the half-baked UTF-8 support caused several issues with these languages.
>>>
>>> I would like to help in testing/implementing support for these languages,
>>> starting with Japanese, but I would require some guidance. First I would
>>> like to know if there is a way to configure the Xapian tokenizer
>>> (segmenter) within sup? Please consider that I am new to both sup and to
>>> Xapian.
>>
>> Hi Horacio,
>>
>> consider opening an issue at
>> https://github.com/sup-heliotrope/sup/issues to make sure this doesn't
>> disappear. Some changes will probably be made to the indexer when going
>> to Mail (from RMail), but I hope to be able to migrate the existing
>> index. Perhaps it's time to get it right for arbitrary languages as well.
>> I am unfamiliar with Japanese/Chinese - does UTF-8 cover the needs?
>>
>> Mail is better at handling UTF-8 and I think there was some fork that
>> had some extra support for Japanese.
>>
>> Regards, Gaute
>>
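
For anyone following the thread, the two problems raised above (mail arriving in legacy Japanese charsets, and an indexer that cannot segment text written without spaces) can be illustrated with a minimal sketch. It is written in Python purely for illustration; it is not code from sup, Heliotrope, whistlepig or Xapian, and the helper names decode_body and cjk_bigrams are hypothetical. Overlapping two-character terms are a common generic fallback when no real segmenter is available; whether it suits sup's index is an open question.

```python
# Illustrative sketch only. The helper names are hypothetical and nothing
# here is taken from sup, Heliotrope, whistlepig or Xapian.


def decode_body(raw, charset):
    """Decode a message body from its declared charset into Unicode text."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of
    # raising, so one malformed message does not abort indexing.
    return raw.decode(charset, errors="replace")


def cjk_bigrams(text):
    """Split text without word spaces into overlapping two-character terms."""
    if len(text) < 2:
        # A single character becomes its own term; empty input yields nothing.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


if __name__ == "__main__":
    sample = "日本語のメール"
    # The three legacy charsets mentioned in the thread, by Python codec name.
    for charset in ("iso2022_jp", "shift_jis", "euc_jp"):
        raw = sample.encode(charset)
        print(charset, decode_body(raw, charset) == sample)
    print(cjk_bigrams("日本語"))  # ['日本', '本語']
```

Once every body is decoded to Unicode at the boundary, the indexer only has to deal with one representation, and the segmentation question can be treated separately.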