* [sup-talk] [PATCH] Unwrap br0ken URLs.
From: Nicolas Pouillard @ 2008-02-19 10:17 UTC
---
lib/sup/message-chunks.rb | 2 +-
lib/sup/util.rb | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+), 1 deletions(-)
diff --git a/lib/sup/message-chunks.rb b/lib/sup/message-chunks.rb
index 0606395..98b829c 100644
--- a/lib/sup/message-chunks.rb
+++ b/lib/sup/message-chunks.rb
@@ -147,7 +147,7 @@ EOS
attr_reader :lines
def initialize lines
- @lines = lines.map { |l| l.chomp.wrap WRAP_LEN }.flatten # wrap
+ @lines = lines.unwrap_urls.map { |l| l.chomp.wrap WRAP_LEN }.flatten # wrap
## trim off all empty lines except one
@lines.pop while @lines.length > 1 && @lines[-1] =~ /^\s*$/ && @lines[-2] =~ /^\s*$/
diff --git a/lib/sup/util.rb b/lib/sup/util.rb
index ceaf0b8..99e73b4 100644
--- a/lib/sup/util.rb
+++ b/lib/sup/util.rb
@@ -401,6 +401,31 @@ class Array
   def last= e; self[-1] = e end
   def nonempty?; !empty? end
+
+  URL_CHAR = /[a-zA-Z0-9\-@;\/?:&=%$_.+!*\x27()~,#]/
+  URL_CHAR_LESS = /[a-zA-Z0-9\-@;\/?:&=%$_.+!*\x27()~]/
+  URL_RE = %r{(?:https?://|ftp://|news://|mailto:|file://)#{URL_CHAR}+}
+  TRAILING_URL_RE = /#{URL_RE}$/
+  LEADING_URL_RE = /^#{URL_RE}/
+  URL_PART_RE = /^#{URL_CHAR}+#{URL_CHAR_LESS}/
+
+  def unwrap_urls
+    res = []
+    len = size
+    i = 0
+    while i < len do
+      x = self[i]
+      y = self[i+1]
+      if y && x =~ TRAILING_URL_RE && y =~ URL_PART_RE && y !~ LEADING_URL_RE
+        res << x.chomp + y
+        i += 2
+      else
+        res << x
+        i += 1
+      end
+    end
+    res
+  end
 end
 class Time
--
1.5.3.1.109.gacd69
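
The joining loop in this patch can be tried outside Sup. What follows is a minimal standalone sketch, not part of the patch: it assumes a free function `unwrap_urls(lines)` (a name chosen for this illustration; the patch defines the logic as a method on Array) and reuses the patch's regexes unchanged. The sample input lines are invented.

```ruby
# Regexes copied verbatim from the patch.
URL_CHAR        = /[a-zA-Z0-9\-@;\/?:&=%$_.+!*\x27()~,#]/
URL_CHAR_LESS   = /[a-zA-Z0-9\-@;\/?:&=%$_.+!*\x27()~]/
URL_RE          = %r{(?:https?://|ftp://|news://|mailto:|file://)#{URL_CHAR}+}
TRAILING_URL_RE = /#{URL_RE}$/
LEADING_URL_RE  = /^#{URL_RE}/
URL_PART_RE     = /^#{URL_CHAR}+#{URL_CHAR_LESS}/

# Assumption: a free function standing in for the patch's Array#unwrap_urls.
# Joins line i with line i+1 when line i ends in a URL and line i+1 starts
# with at least two URL characters without itself starting a new URL.
def unwrap_urls(lines)
  res = []
  i = 0
  while i < lines.size
    x = lines[i]
    y = lines[i + 1]
    if y && x =~ TRAILING_URL_RE && y =~ URL_PART_RE && y !~ LEADING_URL_RE
      res << x.chomp + y   # glue the continuation onto the URL line
      i += 2
    else
      res << x
      i += 1
    end
  end
  res
end

# A URL wrapped across two lines is rejoined.
p unwrap_urls(["http://example.com/a/very/long/\n", "path/to/page.html\n", "Plain text.\n"])
# prints ["http://example.com/a/very/long/path/to/page.html\n", "Plain text.\n"]

# An ordinary word on the next line also satisfies URL_PART_RE, so it is joined too.
p unwrap_urls(["see http://example.com\n", "for details\n"])
# prints ["see http://example.comfor details\n"]
```

The second call shows one input where the heuristic joins a line that is not a URL continuation, because plain letters are valid URL characters.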
* [sup-talk] [PATCH] Unwrap br0ken URLs.
From: William Morgan @ 2008-02-28 17:29 UTC
I would love to have a feature like this in Sup. This patch is still
over-aggressive in places, though. What I would really like to see as a
starting point is a corpus of broken URL examples that we can build unit
tests from. Then we can tweak these regexes until we get something that
has both high precision and high recall.
Also, have you looked at URI.regexp? I think that can do a lot of the
dirty work.
--
William <wmorgan-sup at masanjin.net>
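
The corpus idea in this reply can be sketched as a small harness that scores a join rule by precision and recall. Everything below is an assumption made for illustration: the corpus entries are invented, `join?` is a toy rule (not the patch's heuristic), and the names `CORPUS`, `join?` and `precision_recall` are hypothetical. The URL pattern comes from Ruby's standard `uri` library through `URI::RFC2396_Parser#make_regexp`; this is the regexp-building call behind the `URI.regexp` mentioned above, which newer Ruby versions report as obsolete when warnings are enabled.

```ruby
require 'uri'

# Assumption: only these three schemes matter for the sketch.
URI_RE       = URI::RFC2396_Parser.new.make_regexp(%w[http https ftp])
TRAILING_URI = /#{URI_RE}\s*\z/   # line ends with a URL
LEADING_URI  = /\A#{URI_RE}/      # text starts with a URL

# Hypothetical corpus: [first line, second line, should the pair be joined?]
CORPUS = [
  ["http://example.com/a/very/long/\n", "path/to/page.html\n",      true],
  ["see http://example.com\n",          "for details\n",            false],
  ["http://example.com/one\n",          "http://example.com/two\n", false],
  ["see http://example.com\n",          "Thanks\n",                 false],
]

# Toy rule: join when the first line ends in a URL and the second line is a
# single whitespace-free token that does not itself start a URL.
def join?(first, second)
  token = second.strip
  TRAILING_URI.match?(first) &&
    !token.empty? &&
    !token.match?(/\s/) &&
    !LEADING_URI.match?(token)
end

# Precision: share of predicted joins that were correct.
# Recall: share of expected joins that were predicted.
def precision_recall(corpus)
  tp = fp = fn = 0
  corpus.each do |first, second, expected|
    predicted = join?(first, second)
    tp += 1 if predicted && expected
    fp += 1 if predicted && !expected
    fn += 1 if !predicted && expected
  end
  precision = (tp + fp).zero? ? 1.0 : tp.to_f / (tp + fp)
  recall    = (tp + fn).zero? ? 1.0 : tp.to_f / (tp + fn)
  [precision, recall]
end

precision, recall = precision_recall(CORPUS)
puts "precision=#{precision} recall=#{recall}"
# prints precision=0.5 recall=1.0
```

On this invented corpus the toy rule wrongly joins the "Thanks" line, which is what pulls precision down to 0.5. A real corpus of wrapped URLs collected from mail would replace `CORPUS`, and any candidate rule could be swapped in for `join?`.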