[sup-devel] Arch utf8 vs UTF-8 fix and wide character support

Archive of RubyForge sup-devel mailing list

* [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
       [not found] <y2j6242182a1005061059w5e32fb54vd10ccfd7e4a1911e@mail.gmail.com>
@ 2010-05-06 18:02 ` Matti Eiden
  2010-05-07 16:46   ` Rich Lane
  0 siblings, 1 reply; 6+ messages in thread
From: Matti Eiden @ 2010-05-06 18:02 UTC (permalink / raw)
  To: sup-devel

Hey folks,

I've been experimenting with sup for the past few days, and of course,
I love it. At first I had some trouble getting Unicode display
working. This problem was already described in an old post on this
mailing list:

http://rubyforge.org/pipermail/sup-devel/2010-March/000522.html

So Arch Linux reports the encoding as utf8, but Iconv requires it to be
UTF-8. I would say this is a bug in Arch Linux for not following
standards, but anyway, I fixed it with this small modification to
sup.rb:

## determine encoding and character set
$encoding = Locale.current.charset
$encoding = "UTF-8" if $encoding == "utf8"
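
A slightly more general form of the same check, as a minimal sketch.
normalize_charset is a hypothetical helper, not something that exists
in sup, and the set of spellings it accepts is an assumption:

```ruby
# Hypothetical helper, not part of sup: map common spellings of UTF-8
# (utf8, utf-8, UTF8, utf_8) to the canonical name that Iconv accepts.
# Any other charset name is returned unchanged.
def normalize_charset(name)
  name.to_s =~ /\Autf[-_]?8\z/i ? "UTF-8" : name
end

puts normalize_charset("utf8")        # prints UTF-8
puts normalize_charset("ISO-8859-1")  # prints ISO-8859-1
```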

Then about wide character support. And I mean really wide, like CJK
characters. Scandinavian letters (ä, ö, å) and other accented European
characters work nicely, as those of us who are concerned probably
know. In UTF-8 these characters have a byte length of 2 and a Unicode
length of 1.

However, take the following two-character Korean word as an example
(the byte length of a single such character is 3 instead of 2!)

http://www.kotiposti.net/eiden/soulbound/hellovim.png (looking good in vim)
http://www.kotiposti.net/eiden/soulbound/hellosup.png (sup lost 2
characters (or bytes) from the line that has the Korean word)
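
The byte lengths mentioned here can be checked directly. A minimal
sketch, assuming Ruby 1.9 or later (where String#length counts
characters rather than bytes); describe_char is a hypothetical helper
and the two sample characters are chosen only for illustration:

```ruby
# Hypothetical helper: report the code point, the character count and
# the UTF-8 byte count of a one-character string.
def describe_char(ch)
  "U+%04X: %d character, %d bytes" % [ch.ord, ch.length, ch.bytesize]
end

puts describe_char("\u00E4")  # a with diaeresis; prints U+00E4: 1 character, 2 bytes
puts describe_char("\uD55C")  # Hangul syllable;  prints U+D55C: 1 character, 3 bytes
```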

It seems that for every Korean character with a byte length of 3, one
byte is lost from the end of the line. In the above example, two bytes
are missing in sup, as there are two Korean characters on the same
line.

If the line consists of a single Korean character, nothing appears in
sup (last byte out of three is missing?).
If the line consists of two Korean characters, the last character is
missing (last two bytes out of six are missing?).
etc.

Some sort of miscalculation somewhere is causing this, perhaps an
assumption that Unicode characters always have a byte length of 2? Can
anybody with Ruby skills take a look at this?

Thanks,
Matti
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel


* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-06 18:02 ` [sup-devel] Arch utf8 vs UTF-8 fix and wide character support Matti Eiden
@ 2010-05-07 16:46   ` Rich Lane
  2010-05-11 18:50     ` Matti Eiden
  0 siblings, 1 reply; 6+ messages in thread
From: Rich Lane @ 2010-05-07 16:46 UTC (permalink / raw)
  To: Matti Eiden; +Cc: sup-devel

Excerpts from Matti Eiden's message of 2010-05-06 14:02:46 -0400:
> Hey folks,
> 
> I've been experimenting with sup for the past few days, and of course,
> I love it. Firstly I had some trouble with getting unicode display
> going. This problem was already described in an old post on this
> mailing list:
> 
> http://rubyforge.org/pipermail/sup-devel/2010-March/000522.html
> 
> So Arch Linux defines encoding as utf8, but Iconv requires it to be
> UTF-8. I would say this is a bug in Arch Linux for not following
> standards, but anyway, I fixed it with the little modification to
> sup.rb:
> 
> ## determine encoding and character set
> $encoding = Locale.current.charset
> $encoding = "UTF-8" if $encoding == "utf8"

I've applied this fix, thanks.

> Then about wide character support. And I mean really wide. Like CJK
> characters. Scandics (ä,ö,å) and other European accent characters work
> nicely, as we all who are concerned probably know. These characters
> have a byte length of 2 and unicode length of 1.
> 
> However, take an example of the following two-character Korean word
> (byte length of such single character is 3 instead of 2!)
> 
> http://www.kotiposti.net/eiden/soulbound/hellovim.png (looking good in vim)
> http://www.kotiposti.net/eiden/soulbound/hellosup.png (sup lost 2
> characters (or bytes) from the line that has the Korean word)
> 
> It seems that for every Korean character with a byte length of 3, one
> byte is lost from the end of the line. In the above example, two bytes
> are missing in sup, as there are two Korean characters on the same
> line.
> 
> If the line consist of a single Korean character, nothing appears in
> sup (last byte out of three is missing?).
> If the line consist of two Korean characters, last character is
> missing (last two bytes out of six are missing?).
> etc.
> 
> Some sort of miscalculation somewhere is causing this, perhaps
> assuming that unicode characters always have a byte length of 2? Can
> anybody with Ruby skills take a look on this?

It's actually the multiple screen cells that cause problems, not
multiple bytes [1]. Sup currently thinks all characters are 1 cell wide.
The right thing is probably a C extension that uses wcswidth.

[1] http://mid.gmane.org/1264629880-sup-9232%40zyrg.net
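
To make the cell-versus-byte distinction concrete, here is a rough
pure-Ruby sketch (Ruby 1.9 or later). It is not wcswidth: char_cells
and cell_width are hypothetical names, the list of wide ranges is a
coarse assumption, and combining or control characters are not handled.

```ruby
# Approximate number of terminal cells one character occupies.
# Assumption: everything in these East Asian ranges is 2 cells wide,
# everything else is 1 cell wide.
def char_cells(ch)
  cp = ch.ord
  wide = [0x1100..0x115F, 0x2E80..0xA4CF, 0xAC00..0xD7A3,
          0xF900..0xFAFF, 0xFE30..0xFE4F, 0xFF00..0xFF60,
          0xFFE0..0xFFE6]
  wide.any? { |range| range.include?(cp) } ? 2 : 1
end

# Approximate display width of a whole string, in cells.
def cell_width(str)
  str.each_char.inject(0) { |sum, ch| sum + char_cells(ch) }
end

word = "\uD55C\uAE00"   # two Hangul syllables
puts word.bytesize      # prints 6 (bytes)
puts word.length        # prints 2 (characters)
puts cell_width(word)   # prints 4 (cells)
```

A line containing this word needs two more cells than its character
count suggests, which matches the two characters that go missing from
the end of the line in the report above.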

* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-07 16:46   ` Rich Lane
@ 2010-05-11 18:50     ` Matti Eiden
  2010-05-11 19:19       ` William Morgan
  0 siblings, 1 reply; 6+ messages in thread
From: Matti Eiden @ 2010-05-11 18:50 UTC (permalink / raw)
  To: sup-devel

>
> It's actually the multiple screen cells that causes problems, not
> multiple bytes [1]. Sup currently thinks all characters are 1 cell wide.
> The right thing is probably a C extension that uses wcswidth.
>
> [1] http://mid.gmane.org/1264629880-sup-9232%40zyrg.net
>

So okay, I accidentally sent my previous answer only to Rich. In that
mail I mentioned a Ruby library called terminfo (
http://www.a-k-r.org/ruby-terminfo/ ), which contains a wcswidth
function.

I downloaded terminfo and installed it, and edited lib/sup/util.rb slightly.
1) Added, of course, "require 'terminfo'" at the top.
2) Modified the display_length function (the "nasty multibyte hack") to
use TermInfo.wcswidth instead of the native "size".

Results: Everything seems to work now.

I don't know what other sup users think about adding a new dependency
(terminfo), as the current list of dependencies is already rather
long. Discuss. If somebody with some C skills knows how to move that
wcswidth function into ruby-ncursesw (Rich? :D), would that be a more
favourable option?

Here's the actual patch of what I did, to keep it clear (I notice there
has been an earlier utf8 patch here, for Ruby versions before 1.9.1. I
don't know how many of you still use an earlier Ruby version; of
course this would also solve the "nasty" utf scan patch.):

--- util-old.rb	2010-05-11 21:38:55.736596584 +0300
+++ util.rb	2010-05-11 21:36:12.653281044 +0300
@@ -3,6 +3,7 @@
 require 'mime/types'
 require 'pathname'
 require 'set'
+require 'terminfo'
 ## time for some monkeypatching!
 class Lockfile
   def gen_lock_id
@@ -183,7 +184,7 @@
     if RUBY_VERSION < '1.9.1' && ($encoding == "UTF-8" || $encoding == "utf8")
       scan(/./u).size
     else
-      size
+      TermInfo.wcswidth(self)
     end
   end
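
If the extra dependency is the main worry, one option is to treat
terminfo as optional. A minimal sketch, assuming TermInfo.wcswidth
behaves as it does in the patch above; $have_terminfo and display_cells
are hypothetical names, and the fallback still miscounts wide
characters exactly as the current code does:

```ruby
# Load ruby-terminfo if it is installed, otherwise remember that it
# is missing instead of failing at startup.
begin
  require 'terminfo'
  $have_terminfo = true
rescue LoadError
  $have_terminfo = false
end

# Hypothetical helper: display width via wcswidth when terminfo is
# available, plain character count otherwise.
def display_cells(str)
  $have_terminfo ? TermInfo.wcswidth(str) : str.size
end

puts display_cells("abc")  # prints 3 either way
```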

* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-11 18:50     ` Matti Eiden
@ 2010-05-11 19:19       ` William Morgan
  2010-05-11 21:51         ` Matti Eiden
  0 siblings, 1 reply; 6+ messages in thread
From: William Morgan @ 2010-05-11 19:19 UTC (permalink / raw)
  To: sup-devel

Reformatted excerpts from Matti Eiden's message of 2010-05-11:
> I don't know what is the opinion of other sup users, whether adding a
> new dependency (to terminfo) is desirable, as the current list of
> dependencies is already rather high.

I'm working on a gem that provides both wcswidth functionality and the
ability to slice a string at specific display widths, both of which, I
think, sup will need in order to handle wide characters completely.

You can try it with 'gem install console'. It's 1.9-only for now.
-- 
William <wmorgan-sup@masanjin.net>

* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-11 19:19       ` William Morgan
@ 2010-05-11 21:51         ` Matti Eiden
  2010-05-13 12:33           ` William Morgan
  0 siblings, 1 reply; 6+ messages in thread
From: Matti Eiden @ 2010-05-11 21:51 UTC (permalink / raw)
  To: Sup developer discussion

[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]

Oh right, splitting. Yes, that makes sense. I tried your
console/string and it seems good, except that display_slice ignores
the padding request? Or did I misunderstand this feature? I mean, it
slices the string exactly at the end offset, not at the nearest
space (" ")?

Well anyway, I showed it inside sup, seems to be working nicely.

Here's what I did to get it to work, if anybody's interested. I'm in a
rush to work, so there may be mistakes. I tried to check that
everything works.

Summary:
- buffer.rb is patched to slice all strings according to @width. This
fixes issues in inbox-mode when email subjects have wide characters.
The old "hacks" were removed.
- util.rb is patched to wrap using display_slice and then look for
the nearest space. If no space is found, it simply uses the original
output of display_slice. The display_length function now delegates to
display_width.

In quick tests, resizing the window with different kinds of test
emails, I see no lost characters or text corruption.


Nice, thanks.

[-- Attachment #2: console-sup-buffer.patch --]
[-- Type: text/x-patch, Size: 755 bytes --]

--- buffer-old.rb	2010-05-12 00:42:50.501278238 +0300
+++ buffer.rb	2010-05-12 00:42:37.711280439 +0300
@@ -1,5 +1,6 @@
 require 'etc'
 require 'thread'
+require 'console/string'
 
 begin
   require 'ncursesw'
@@ -129,10 +130,8 @@
     @w.attrset Colormap.color_for(opts[:color] || :none, opts[:highlight])
     s ||= ""
     maxl = @width - x # maximum display width width
-    stringl = maxl    # string "length"
-    ## the next horribleness is thanks to ruby's lack of widechar support
-    stringl += 1 while stringl < s.length && s[0 ... stringl].display_length < maxl
-    @w.mvaddstr y, x, s[0 ... stringl]
+    s = s.display_slice(0,maxl,"")
+    @w.mvaddstr y, x, s
     unless opts[:no_fill]
       l = s.display_length
       unless l >= maxl

[-- Attachment #3: console-sup-util.patch --]
[-- Type: text/x-patch, Size: 1450 bytes --]

--- util-old.rb	2010-05-11 21:38:55.736596584 +0300
+++ util.rb	2010-05-12 00:33:16.128001053 +0300
@@ -3,6 +3,7 @@
 require 'mime/types'
 require 'pathname'
 require 'set'
+require 'console/string'
 ## time for some monkeypatching!
 class Lockfile
   def gen_lock_id
@@ -177,16 +178,12 @@
 end
 
 class String
-  ## nasty multibyte hack for ruby 1.8. if it's utf-8, split into chars using
-  ## the utf8 regex and count those. otherwise, use the byte length.
+
   def display_length
-    if RUBY_VERSION < '1.9.1' && ($encoding == "UTF-8" || $encoding == "utf8")
-      scan(/./u).size
-    else
-      size
-    end
+    display_width
   end
 
+
   def camel_to_hyphy
     self.gsub(/([a-z])([A-Z0-9])/, '\1-\2').downcase
   end
@@ -270,14 +267,17 @@
   def wrap len
     ret = []
     s = self
-    while s.length > len
-      cut = s[0 ... len].rindex(/\s/)
-      if cut
-        ret << s[0 ... cut]
-        s = s[(cut + 1) .. -1]
+    while s.display_width > len
+      cut = s.display_slice(0,len," ")
+      # find the last space, since display slices it precisely
+      space = cut.rindex(/\s/)
+      space = cut.size unless space #No spaces?
+      cut = s[0 ... space]
+      ret << cut
+      if space != cut.size #+1 to kill the space in the beginning of next line
+        s = s[(cut.size + 1) .. -1]
       else
-        ret << s[0 ... len]
-        s = s[len .. -1]
+        s = s[cut.size .. -1]
       end
     end
     ret << s
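
One thing that may be worth double-checking in the wrap hunk above: cut
is reassigned to s[0 ... space] before the test "space != cut.size", so
under Ruby 1.9 that test looks like it can never be true, and the
separating space would stay at the start of the next line. Below is a
self-contained sketch of width-aware wrapping that keeps the shape of
the original loop. It does not use the console gem; char_cells,
cell_width and wrap_to_width are hypothetical names, and the wide
ranges are a coarse assumption rather than real wcswidth data.

```ruby
# Approximate cell width of one character (assumed wide ranges only).
def char_cells(ch)
  cp = ch.ord
  wide = [0x1100..0x115F, 0x2E80..0xA4CF, 0xAC00..0xD7A3,
          0xF900..0xFAFF, 0xFE30..0xFE4F, 0xFF00..0xFF60,
          0xFFE0..0xFFE6]
  wide.any? { |range| range.include?(cp) } ? 2 : 1
end

def cell_width(str)
  str.each_char.inject(0) { |sum, ch| sum + char_cells(ch) }
end

# Wrap str into lines of at most `width` cells, breaking at the last
# whitespace that fits, or mid-word when there is none.
def wrap_to_width(str, width)
  lines = []
  rest = str
  while cell_width(rest) > width
    fit = 0    # number of leading characters that fit
    used = 0   # cells those characters occupy
    rest.each_char do |ch|
      w = char_cells(ch)
      break if used + w > width
      used += w
      fit += 1
    end
    fit = 1 if fit.zero?          # always make progress
    head = rest[0...fit]
    space = head.rindex(/\s/)
    if space
      lines << rest[0...space]
      rest = rest[(space + 1)..-1]  # drop the separating space
    else
      lines << head
      rest = rest[fit..-1]
    end
  end
  lines << rest
end

p wrap_to_width("hello world", 7)  # prints ["hello", "world"]
```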


* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-11 21:51         ` Matti Eiden
@ 2010-05-13 12:33           ` William Morgan
  0 siblings, 0 replies; 6+ messages in thread
From: William Morgan @ 2010-05-13 12:33 UTC (permalink / raw)
  To: sup-devel

Reformatted excerpts from Matti Eiden's message of 2010-05-11:
> Oh right, splitting. Yes right, makes sense. I tried your
> console/string, seems good, except the display_split ignores the
> padding request? Or did I understand this feature wrongly? I mean, it
> slices the string exactly where the end offset is, not by the nearest
> " ", space?

The padding only comes into play when you ask it to slice a string in
the middle of a multi-column character. It will then pad the
left/right/both ends of the string to make the total display width of
the returned string what you asked for.
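
As an illustration of that padding behaviour, here is a simplified
sketch. It is not the console gem's implementation: slice_to_width and
char_cells are hypothetical names, it only slices from the start of the
string and pads on the right, and the wide ranges are a coarse
assumption.

```ruby
# Approximate cell width of one character (assumed wide ranges only).
def char_cells(ch)
  cp = ch.ord
  wide = [0x1100..0x115F, 0x2E80..0xA4CF, 0xAC00..0xD7A3,
          0xF900..0xFAFF, 0xFE30..0xFE4F, 0xFF00..0xFF60,
          0xFFE0..0xFFE6]
  wide.any? { |range| range.include?(cp) } ? 2 : 1
end

# Return the longest prefix of str that fits in `width` cells. If the
# cut falls in the middle of a wide character, fill the leftover
# cell(s) with `pad` so the result is exactly `width` cells wide.
def slice_to_width(str, width, pad = " ")
  kept = []
  used = 0
  str.each_char do |ch|
    w = char_cells(ch)
    break if used + w > width
    kept << ch
    used += w
  end
  out = kept.join
  cut_short = kept.length < str.length
  out += pad * (width - used) if cut_short && used < width
  out
end

p slice_to_width("abc", 2)         # prints "ab"
p slice_to_width("a\uD55C", 2)     # wide character straddles the edge: "a" plus one pad space
p slice_to_width("a\uD55C", 2, "") # empty pad: just "a"
```

With an empty pad string, as in the buffer.rb patch, the result can be
one cell narrower than requested whenever a wide character straddles
the boundary.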

> With quick testing for resizing the window with different kind of test
> emails, I see no lost characters or text corruption.

Awesome. I'll see what I can do about 1.8-compatibility.
-- 
William <wmorgan-sup@masanjin.net>
