From mboxrd@z Thu Jan 1 00:00:00 1970
Received: by 10.52.177.71 with SMTP id co7csp31227vdc; Wed, 8 May 2013 18:39:09 -0700 (PDT)
X-Received: by 10.60.16.69 with SMTP id e5mr3105027oed.46.1368063549366; Wed, 08 May 2013 18:39:09 -0700 (PDT)
Return-Path:
Received: from mail-oa0-f48.google.com (mail-oa0-f48.google.com [209.85.219.48]) by mx.google.com with ESMTPS id no5si509626obc.120.2013.05.08.18.39.09 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Wed, 08 May 2013 18:39:09 -0700 (PDT)
Received-SPF: pass (google.com: domain of hsanson@gmail.com designates 209.85.219.48 as permitted sender) client-ip=209.85.219.48;
Authentication-Results: mx.google.com; spf=pass (google.com: domain of hsanson@gmail.com designates 209.85.219.48 as permitted sender) smtp.mail=hsanson@gmail.com; dkim=pass header.i=@gmail.com
Received: by mail-oa0-f48.google.com with SMTP id i4so2838451oah.21 for ; Wed, 08 May 2013 18:39:09 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:x-received:in-reply-to:references:date:message-id
 :subject:from:to:cc:content-type;
 bh=7zOfsYE04QSSVGPiBouflEuhKcU4ymGFKwvetJ6/2h4=;
 b=kkIx7W27uZHqe7+zacHLCCsqFfwxfx4ySX+F3oMKfbQt/LM2L4TfLwQ3XNRwFjLLoQ
 +f1eYrhIO68cUIqdGfYaQUwro1c54aHn3ZwpvmJDXpwLjZZhV7SFRBWZrt4UeuPNyGCE
 Wh43uDImAJ27w385FHp3ipZ+oSW02WDIXCuWrDDt2dr2xNLeSLhFcvmY+RxljZwL9cuq
 UPxEIFrubYH2o0wcpFJ/5ku5F+yBfR13E3emGxdTMI8enWxX5+LbtP5nBxGCvccPgFXj
 j5MlTRnxGIwT4ewLxm6ytLxGbUxmlRF+q4azZQiPFDE5Qzj/geyRo75TK7Yr2riK2EgC
 nugQ==
MIME-Version: 1.0
X-Received: by 10.182.84.135 with SMTP id z7mr3169358oby.35.1368063548974; Wed, 08 May 2013 18:39:08 -0700 (PDT)
Received: by 10.182.123.2 with HTTP; Wed, 8 May 2013 18:39:08 -0700 (PDT)
In-Reply-To: <5182210E.4040100@gaute.vetsj.com>
References: <1367233230-sup-3593@tesla> <517E5BD8.6030803@gaute.vetsj.com> <5182210E.4040100@gaute.vetsj.com>
Date: Thu, 9 May 2013 10:39:08 +0900
Message-ID:
Subject: Re: [sup-devel] sup 0.13
From: Horacio Sanson
To: Gaute Hope
Cc: Sup developer discussion
Content-Type: multipart/alternative; boundary=089e013a2696dded8404dc3f1d8d

--089e013a2696dded8404dc3f1d8d
Content-Type: text/plain; charset=ISO-8859-1

UTF-8 handles most cases but I still have to deal with emails in ISO2022-JP, Shift-JIS and EUC-JP. After some research it seems Xapian has no support for Asian languages. I will try to make some tests and open an issue if I cannot make it work.

I can see in the sup configuration file that the stem language can be configured but there are no CJK stemmers for Xapian that I can find.


On Thu, May 2, 2013 at 5:17 PM, Gaute Hope wrote:

>
>
> On 30. april 2013 11:44, Horacio Sanson wrote:
> > Great to see Sup getting back on track again..
> >
> > I submitted some patches for the Gmail dumper of Heliotrope some time ago
> > but the lack of non alphabet languages (Japanese, Chinese) made it
> > impossible for me to keep using heliotrope/turnsole.
> >
> > The main issue to support Japanese/Chinese with heliotrope was that
> > whistlepig (indexer) lacked the ability to tokenize these languages. Also
> > the half baked UTF-8 support caused several issues with these languages.
> >
> > I would like to help in testing/implementing support for these languages,
> > starting with Japanese, but I would require some guidance. First I would
> > like to know if there is a way to configure the Xapian tokenizer
> > (segmenter) within sup? Please consider that I am new to both sup and to
> > Xapian.
>
> Hi Horacio,
>
> consider opening an issue at
> https://github.com/sup-heliotrope/sup/issues to make sure this doesn't
> disappear. Some changes will probably be made to the indexer when going
> to Mail (from RMail), but I hope to be able to migrate the existing
> index. Perhaps it's time to get it right for arbitrary languages as well.
> I am unfamiliar with Japanese/Chinese - does UTF-8 cover the needs?
>
> Mail is better at handling UTF-8 and I think there was some fork that
> had some extra support for Japanese.
>
> Regards, Gaute
>

--089e013a2696dded8404dc3f1d8d
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
UTF-8 handles most cases but I still have to deal with emails in ISO2022-JP, Shift-JIS and EUC-JP. After some research it seems Xapian has no support for Asian languages. I will try to make some tests and open an issue if I cannot make it work.

I can see in the sup configuration file that the stem language can be configured but there are no CJK stemmers for Xapian that I can find.
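
To make the two problems above concrete (legacy Japanese charsets, and an indexer that cannot split CJK text into words), here is a minimal Python sketch. It is not code from sup, Xapian or whistlepig. The function names `to_unicode`, `is_cjk`, `char_kind` and `index_terms` are hypothetical, the fallback policy is an assumption, and overlapping character bigrams are only a common stand-in for a real segmenter, not what Xapian does.

```python
from itertools import groupby


def to_unicode(raw, declared_charset):
    """Decode a message body using the charset named in its MIME headers.

    Assumed policy: if the charset is unknown or the bytes do not match it,
    fall back to UTF-8 with replacement characters instead of failing.
    """
    try:
        return raw.decode(declared_charset)
    except (LookupError, UnicodeDecodeError):
        return raw.decode("utf-8", errors="replace")


def is_cjk(ch):
    """True for Hiragana, Katakana and the basic CJK Unified Ideographs block."""
    cp = ord(ch)
    return 0x3040 <= cp <= 0x30FF or 0x4E00 <= cp <= 0x9FFF


def char_kind(ch):
    """Classify a character as CJK, ordinary word character, or separator."""
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def index_terms(text):
    """Split text into index terms.

    Runs of non-CJK letters and digits become one lowercased term each.
    Runs of CJK characters become overlapping two-character terms, because
    there are no spaces to split on.
    """
    terms = []
    for kind, group in groupby(text, key=char_kind):
        run = "".join(group)
        if kind == "word":
            terms.append(run.lower())
        elif kind == "cjk":
            if len(run) == 1:
                terms.append(run)
            else:
                terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


if __name__ == "__main__":
    # Simulate a body that arrived as ISO-2022-JP bytes.
    raw = "sup 日本語メール".encode("iso2022_jp")
    print(index_terms(to_unicode(raw, "iso2022_jp")))
```

Decoding by the declared charset rather than guessing matters because ISO-2022-JP is a 7-bit encoding, so its bytes would also decode as UTF-8 without error and silently produce garbage. The bigram step trades index size for recall and needs no dictionary, which is why it is a frequent fallback when no CJK segmenter is available.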


On Thu, May 2, 2013 at 5:17 PM, Gaute Hope <eg@gaute.vetsj.com> wrote:


On 30. april 2013 11:44, Horacio Sanson wrote:
> Great to see Sup getting back on track again..
>
> I submitted some patches for the Gmail dumper of Heliotrope some time ago
> but the lack of non alphabet languages (Japanese, Chinese) made it
> impossible for me to keep using heliotrope/turnsole.
>
> The main issue to support Japanese/Chinese with heliotrope was that
> whistlepig (indexer) lacked the ability to tokenize these languages. Also
> the half baked UTF-8 support caused several issues with these languages.
>
> I would like to help in testing/implementing support for these languages,
> starting with Japanese, but I would require some guidance. First I would
> like to know if there is a way to configure the Xapian tokenizer
> (segmenter) within sup? Please consider that I am new to both sup and to
> Xapian.

Hi Horacio,

consider opening an issue at
https://github.com/sup-heliotrope/sup/issues to make sure this doesn't
disappear. Some changes will probably be made to the indexer when going
to Mail (from RMail), but I hope to be able to migrate the existing
index. Perhaps it's time to get it right for arbitrary languages as well.
I am unfamiliar with Japanese/Chinese - does UTF-8 cover the needs?

Mail is better at handling UTF-8 and I think there was some fork that
had some extra support for Japanese.

Regards, Gaute

--089e013a2696dded8404dc3f1d8d--