* [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
       [not found] <y2j6242182a1005061059w5e32fb54vd10ccfd7e4a1911e@mail.gmail.com>
@ 2010-05-06 18:02 ` Matti Eiden
  2010-05-07 16:46   ` Rich Lane
From: Matti Eiden @ 2010-05-06 18:02 UTC
To: sup-devel

Hey folks,

I've been experimenting with sup for the past few days, and of course,
I love it. Firstly I had some trouble with getting unicode display
going. This problem was already described in an old post on this
mailing list:

http://rubyforge.org/pipermail/sup-devel/2010-March/000522.html

So Arch Linux defines encoding as utf8, but Iconv requires it to be
UTF-8. I would say this is a bug in Arch Linux for not following
standards, but anyway, I fixed it with the little modification to
sup.rb:

## determine encoding and character set
  $encoding = Locale.current.charset
  $encoding = "UTF-8" if $encoding == "utf8"

Then about wide character support. And I mean really wide. Like CJK
characters. Scandics (ä,ö,å) and other European accent characters work
nicely, as we all who are concerned probably know. These characters
have a byte length of 2 and unicode length of 1.

However, take an example of the following two-character Korean word
(byte length of such single character is 3 instead of 2!)

http://www.kotiposti.net/eiden/soulbound/hellovim.png (looking good in vim)
http://www.kotiposti.net/eiden/soulbound/hellosup.png (sup lost 2
characters (or bytes) from the line that has the Korean word)

It seems that for every Korean character with a byte length of 3, one
byte is lost from the end of the line. In the above example, two bytes
are missing in sup, as there are two Korean characters on the same
line.

If the line consists of a single Korean character, nothing appears in
sup (last byte out of three is missing?).
If the line consists of two Korean characters, last character is
missing (last two bytes out of six are missing?).
etc.

Some sort of miscalculation somewhere is causing this, perhaps
assuming that unicode characters always have a byte length of 2? Can
anybody with Ruby skills take a look at this?

Thanks,
Matti
_______________________________________________
Sup-devel mailing list
Sup-devel@rubyforge.org
http://rubyforge.org/mailman/listinfo/sup-devel
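
For reference, the charset workaround in the message above can be generalized slightly. The following is only a sketch, not what sup does: `normalize_charset` is a hypothetical helper name, and it assumes the only spellings that need mapping are the hyphen-less or differently cased variants of `utf8`.

```ruby
# Hypothetical helper (not part of sup): map the locale's spelling of the
# UTF-8 charset to the canonical "UTF-8" name that Iconv expects.
# Any other charset name is returned unchanged.
def normalize_charset(name)
  # Compare case-insensitively and ignore "-" / "_" so that "utf8",
  # "UTF8", "utf-8" and "utf_8" are all treated alike.
  key = name.to_s.downcase.delete("-_")
  key == "utf8" ? "UTF-8" : name
end

# In sup.rb this could stand in for the two-line fix, for example:
#   $encoding = normalize_charset(Locale.current.charset)
```

The design choice here is to normalize before comparing, so a future locale that reports `UTF8` or `utf_8` would not need another special case.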
* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-06 18:02 ` [sup-devel] Arch utf8 vs UTF-8 fix and wide character support Matti Eiden
@ 2010-05-07 16:46   ` Rich Lane
  2010-05-11 18:50     ` Matti Eiden
From: Rich Lane @ 2010-05-07 16:46 UTC
To: Matti Eiden; +Cc: sup-devel

Excerpts from Matti Eiden's message of 2010-05-06 14:02:46 -0400:
> Hey folks,
>
> I've been experimenting with sup for the past few days, and of course,
> I love it. Firstly I had some trouble with getting unicode display
> going. This problem was already described in an old post on this
> mailing list:
>
> http://rubyforge.org/pipermail/sup-devel/2010-March/000522.html
>
> So Arch Linux defines encoding as utf8, but Iconv requires it to be
> UTF-8. I would say this is a bug in Arch Linux for not following
> standards, but anyway, I fixed it with the little modification to
> sup.rb:
>
> ## determine encoding and character set
>   $encoding = Locale.current.charset
>   $encoding = "UTF-8" if $encoding == "utf8"

I've applied this fix, thanks.

> Then about wide character support. And I mean really wide. Like CJK
> characters. Scandics (ä,ö,å) and other European accent characters work
> nicely, as we all who are concerned probably know. These characters
> have a byte length of 2 and unicode length of 1.
>
> However, take an example of the following two-character Korean word
> (byte length of such single character is 3 instead of 2!)
>
> http://www.kotiposti.net/eiden/soulbound/hellovim.png (looking good in vim)
> http://www.kotiposti.net/eiden/soulbound/hellosup.png (sup lost 2
> characters (or bytes) from the line that has the Korean word)
>
> It seems that for every Korean character with a byte length of 3, one
> byte is lost from the end of the line. In the above example, two bytes
> are missing in sup, as there are two Korean characters on the same
> line.
>
> If the line consists of a single Korean character, nothing appears in
> sup (last byte out of three is missing?).
> If the line consists of two Korean characters, last character is
> missing (last two bytes out of six are missing?).
> etc.
>
> Some sort of miscalculation somewhere is causing this, perhaps
> assuming that unicode characters always have a byte length of 2? Can
> anybody with Ruby skills take a look at this?

It's actually the multiple screen cells that cause problems, not
multiple bytes [1]. Sup currently thinks all characters are 1 cell wide.
The right thing is probably a C extension that uses wcswidth.

[1] http://mid.gmane.org/1264629880-sup-9232%40zyrg.net
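
To make the distinction between bytes, characters, and screen cells concrete, here is a small sketch for Ruby 1.9 or later. It is not sup code: `WIDE_RANGES`, `char_cells`, and `measure` are hypothetical names, and the range table is a deliberately incomplete approximation of what wcswidth reports, chosen only for illustration.

```ruby
# Rough stand-in for wcwidth(3): 2 cells for a few common East Asian wide
# ranges, 1 cell for everything else. The table is an assumption made for
# illustration; a real implementation should call wcswidth.
WIDE_RANGES = [
  0x1100..0x115F,  # Hangul Jamo initial consonants
  0x3040..0x30FF,  # Hiragana and Katakana
  0x4E00..0x9FFF,  # CJK Unified Ideographs
  0xAC00..0xD7A3,  # Hangul syllables
  0xFF01..0xFF60   # Fullwidth forms
].freeze

# Approximate number of terminal cells one character occupies.
def char_cells(ch)
  cp = ch.ord
  WIDE_RANGES.any? { |r| r.cover?(cp) } ? 2 : 1
end

# Returns [byte length, character count, approximate cell count].
def measure(str)
  cells = str.each_char.inject(0) { |sum, ch| sum + char_cells(ch) }
  [str.bytesize, str.length, cells]
end

# Two accented Latin letters: 4 bytes, 2 characters, 2 cells.
#   measure("\u00E4\u00F6")  # => [4, 2, 2]
# Two Hangul syllables: 6 bytes, 2 characters, 4 cells.
#   measure("\uD55C\uAE00")  # => [6, 2, 4]
```

For the accented letters the character count and the cell count agree, which is why they display correctly when every character is assumed to be one cell wide. For the Hangul syllables the cell count is twice the character count, which matches the symptom of losing one cell per Korean character at the end of the line.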
* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-07 16:46 ` Rich Lane
@ 2010-05-11 18:50   ` Matti Eiden
  2010-05-11 19:19     ` William Morgan
From: Matti Eiden @ 2010-05-11 18:50 UTC
To: sup-devel

> It's actually the multiple screen cells that causes problems, not
> multiple bytes [1]. Sup currently thinks all characters are 1 cell wide.
> The right thing is probably a C extension that uses wcswidth.
>
> [1] http://mid.gmane.org/1264629880-sup-9232%40zyrg.net

So okay, I accidentally sent my previous answer only to Rich. In that
mail I mentioned a Ruby library called terminfo
( http://www.a-k-r.org/ruby-terminfo/ ), which contains a wcswidth
function. I downloaded terminfo and installed it, and edited
lib/sup/util.rb slightly.

1) Added "require 'terminfo'" at the top, of course.
2) Modified the display_length function (the "nasty multibyte hack") to
   use TermInfo.wcswidth instead of the native "size".

Results: Everything seems to work now.

I don't know what other sup users think about adding a new dependency
(on terminfo), as the current list of dependencies is already rather
long. Discuss. If somebody with some C skills knows how to move that
wcswidth function into ruby-ncursesw (Rich? :D), would that be a more
favourable option?

Here's the actual patch of what I did, to keep it clear. (I notice there
has been an earlier utf8 patch here, for Ruby versions before 1.9.1. I
don't know how many of you still use an earlier Ruby version; of course
this would also replace the "nasty" utf scan patch.)

--- util-old.rb 2010-05-11 21:38:55.736596584 +0300
+++ util.rb 2010-05-11 21:36:12.653281044 +0300
@@ -3,6 +3,7 @@
 require 'mime/types'
 require 'pathname'
 require 'set'
+require 'terminfo'
 ## time for some monkeypatching!
 class Lockfile
   def gen_lock_id
@@ -183,7 +184,7 @@
     if RUBY_VERSION < '1.9.1' && ($encoding == "UTF-8" || $encoding == "utf8")
       scan(/./u).size
     else
-      size
+      TermInfo.wcswidth(self)
     end
   end
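
On the dependency question, one option that is not proposed in the thread (so treat it as an assumption) would be to make terminfo optional and fall back to the character count when it is missing. In this sketch `HAVE_TERMINFO` and `cell_width` are hypothetical names; `TermInfo.wcswidth` is the call used in the patch above.

```ruby
# Try to load ruby-terminfo, but do not fail if it is not installed.
begin
  require 'terminfo'
  HAVE_TERMINFO = true
rescue LoadError
  HAVE_TERMINFO = false
end

# Display width of a string in terminal cells.
# With terminfo available this asks wcswidth; without it, it falls back to
# the character count, which is only correct for one-cell characters.
def cell_width(str)
  if HAVE_TERMINFO
    TermInfo.wcswidth(str)
  else
    str.size
  end
end
```

The trade-off is that users without terminfo keep the truncation behaviour described earlier in the thread for CJK text, while everyone else gets correct widths without a hard dependency.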
* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-11 18:50 ` Matti Eiden
@ 2010-05-11 19:19   ` William Morgan
  2010-05-11 21:51     ` Matti Eiden
From: William Morgan @ 2010-05-11 19:19 UTC
To: sup-devel

Reformatted excerpts from Matti Eiden's message of 2010-05-11:
> I don't know what is the opinion of other sup users, whether adding a
> new dependency (to terminfo) is desirable, as the current list of
> dependencies is already rather high.

I'm working on a gem that provides both wcswidth functionality and the
ability to slice a string at specific display widths, both of which, I
think, will be required for this to work completely. You can try it
with 'gem install console'. It's 1.9-only for now.
-- 
William <wmorgan-sup@masanjin.net>
* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-11 19:19 ` William Morgan
@ 2010-05-11 21:51   ` Matti Eiden
  2010-05-13 12:33     ` William Morgan
From: Matti Eiden @ 2010-05-11 21:51 UTC
To: Sup developer discussion

[-- Attachment #1: Type: text/plain, Size: 1024 bytes --]

Oh right, splitting. Yes right, makes sense. I tried your
console/string, seems good, except that display_slice ignores the
padding request? Or did I misunderstand this feature? I mean, it
slices the string exactly where the end offset is, not at the nearest
" ", space?

Well anyway, I showed it inside sup, seems to be working nicely. Here's
what I did to get it to work, if anybody's interested. I'm in a rush to
work, so there may be mistakes. I tried to check that everything works.

Summary:
- buffer.rb is patched to slice all strings according to @width; this
  fixes issues in inbox-mode when email subjects have wide characters.
  The old "hacks" were removed.
- util.rb is patched to wrap using display_slice and then look for the
  nearest space. If no space is found, it simply uses the original
  output of display_slice. The display_length function now defaults to
  display_width.

With quick testing for resizing the window with different kinds of test
emails, I see no lost characters or text corruption. Nice, thanks.

[-- Attachment #2: console-sup-buffer.patch --]
[-- Type: text/x-patch, Size: 755 bytes --]

--- buffer-old.rb 2010-05-12 00:42:50.501278238 +0300
+++ buffer.rb 2010-05-12 00:42:37.711280439 +0300
@@ -1,5 +1,6 @@
 require 'etc'
 require 'thread'
+require 'console/string'
 
 begin
   require 'ncursesw'
@@ -129,10 +130,8 @@
     @w.attrset Colormap.color_for(opts[:color] || :none, opts[:highlight])
     s ||= ""
     maxl = @width - x # maximum display width width
-    stringl = maxl    # string "length"
-    ## the next horribleness is thanks to ruby's lack of widechar support
-    stringl += 1 while stringl < s.length && s[0 ... stringl].display_length < maxl
-    @w.mvaddstr y, x, s[0 ... stringl]
+    s = s.display_slice(0,maxl,"")
+    @w.mvaddstr y, x, s
     unless opts[:no_fill]
       l = s.display_length
       unless l >= maxl

[-- Attachment #3: console-sup-util.patch --]
[-- Type: text/x-patch, Size: 1450 bytes --]

--- util-old.rb 2010-05-11 21:38:55.736596584 +0300
+++ util.rb 2010-05-12 00:33:16.128001053 +0300
@@ -3,6 +3,7 @@
 require 'mime/types'
 require 'pathname'
 require 'set'
+require 'console/string'
 ## time for some monkeypatching!
 class Lockfile
   def gen_lock_id
@@ -177,16 +178,12 @@
 end
 
 class String
-  ## nasty multibyte hack for ruby 1.8. if it's utf-8, split into chars using
-  ## the utf8 regex and count those. otherwise, use the byte length.
+
   def display_length
-    if RUBY_VERSION < '1.9.1' && ($encoding == "UTF-8" || $encoding == "utf8")
-      scan(/./u).size
-    else
-      size
-    end
+    display_width
   end
 
+
   def camel_to_hyphy
     self.gsub(/([a-z])([A-Z0-9])/, '\1-\2').downcase
   end
@@ -270,14 +267,17 @@
   def wrap len
     ret = []
     s = self
-    while s.length > len
-      cut = s[0 ... len].rindex(/\s/)
-      if cut
-        ret << s[0 ... cut]
-        s = s[(cut + 1) .. -1]
+    while s.display_width > len
+      cut = s.display_slice(0,len," ")
+      # find the last space, since display slices it precisely
+      space = cut.rindex(/\s/)
+      space = cut.size unless space #No spaces?
+      cut = s[0 ... space]
+      ret << cut
+      if space != cut.size #+1 to kill the space in the beginning of next line
+        s = s[(cut.size + 1) .. -1]
       else
-        ret << s[0 ... len]
-        s = s[len .. -1]
+        s = s[cut.size .. -1]
       end
     end
     ret << s
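
The wrapping idea in the util.rb patch (cut at a display width, then back up to the last space) can be sketched without the console gem. Reading the wrap hunk as posted, `cut` is reassigned to `s[0 ... space]` before the `space != cut.size` test, so that test appears never to be true and the separating space would carry over to the next line; the sketch below decides whether to skip the space before slicing. All names here (`WIDE`, `char_cells`, `cells`, `fit_count`, `wrap_cells`) are hypothetical, and the width table is an illustrative approximation rather than the gem's `display_width`.

```ruby
# Approximate wide-character ranges (an assumption, not a full table).
WIDE = [0x1100..0x115F, 0x3040..0x30FF, 0x4E00..0x9FFF,
        0xAC00..0xD7A3, 0xFF01..0xFF60].freeze

# Approximate number of terminal cells one character occupies.
def char_cells(ch)
  WIDE.any? { |r| r.cover?(ch.ord) } ? 2 : 1
end

# Approximate display width of a whole string.
def cells(str)
  str.each_char.inject(0) { |sum, ch| sum + char_cells(ch) }
end

# Number of leading characters of str that fit into len cells.
def fit_count(str, len)
  used = 0
  count = 0
  str.each_char do |ch|
    w = char_cells(ch)
    break if used + w > len
    used += w
    count += 1
  end
  count
end

# Wrap str into lines of at most len cells, preferring to break at spaces.
def wrap_cells(str, len)
  ret = []
  s = str
  while cells(s) > len
    n = fit_count(s, len)
    n = 1 if n.zero?  # a single character wider than len: still make progress
    # Break at the character right after the fitted prefix if it is
    # whitespace, otherwise at the last whitespace inside the prefix.
    space = (s[n, 1] =~ /\s/) ? n : s[0...n].rindex(/\s/)
    if space
      ret << s[0...space]
      s = s[(space + 1)..-1]  # drop the separating space
    else
      ret << s[0...n]         # no space found: hard break at the width
      s = s[n..-1]
    end
  end
  ret << s
end

# wrap_cells("hello world foo", 8)  # => ["hello", "world", "foo"]
```

Every loop iteration removes at least one character from `s`, so the loop terminates even for input with no spaces or with characters wider than `len`.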
* Re: [sup-devel] Arch utf8 vs UTF-8 fix and wide character support
  2010-05-11 21:51 ` Matti Eiden
@ 2010-05-13 12:33   ` William Morgan
From: William Morgan @ 2010-05-13 12:33 UTC
To: sup-devel

Reformatted excerpts from Matti Eiden's message of 2010-05-11:
> Oh right, splitting. Yes right, makes sense. I tried your
> console/string, seems good, except that display_slice ignores the
> padding request? Or did I misunderstand this feature? I mean, it
> slices the string exactly where the end offset is, not at the nearest
> " ", space?

The padding only comes into play when you ask it to slice a string in
the middle of a multi-column character. It will then pad the
left/right/both ends of the string to make the total display width of
the returned string what you asked for.

> With quick testing for resizing the window with different kind of test
> emails, I see no lost characters or text corruption.

Awesome. I'll see what I can do about 1.8-compatibility.
-- 
William <wmorgan-sup@masanjin.net>
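
The padding behaviour described above can be illustrated with a small sketch. This is not the console gem's implementation and its real signature is not shown in the thread: `slice_cells`, `char_cells`, and `WIDE` are hypothetical names, the slice always starts at the beginning of the string (so only the right end is ever padded), and the width table is an approximation.

```ruby
# Approximate wide-character ranges (an assumption, not a full table).
WIDE = [0x1100..0x115F, 0x3040..0x30FF, 0x4E00..0x9FFF,
        0xAC00..0xD7A3, 0xFF01..0xFF60].freeze

# Approximate number of terminal cells one character occupies.
def char_cells(ch)
  WIDE.any? { |r| r.cover?(ch.ord) } ? 2 : 1
end

# Take the leading part of str that fits into `width` cells. If the cut
# would land in the middle of a two-cell character, that character is
# dropped and the leftover cell is filled with `pad`.
def slice_cells(str, width, pad = " ")
  out = []
  used = 0
  str.each_char do |ch|
    w = char_cells(ch)
    if used + w > width
      # width - used is 1 when a wide character straddles the boundary
      # and 0 when the prefix already fills the width exactly.
      out << pad * (width - used)
      break
    end
    out << ch
    used += w
  end
  out.join
end

# slice_cells("abc", 2)  # => "ab"   (no padding needed)
```

With a one-cell letter followed by Hangul syllables and a width of 2, the first syllable would straddle the boundary, so the result is the letter plus one pad character. Passing an empty pad string, as the buffer.rb patch does, returns just the letter.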