commit d3fbac1341399049762cea6ed788d4db231a85f6
parent 4204170b7a52847c6d14e799123dd9681f75e11c
Author: Dan Callaghan <djc@djc.id.au>
Date:   Sun, 12 Jul 2020 18:22:36 +1000

don't "fix" encoding of raw message/rfc822 parts

In the code for handling message/rfc822 MIME parts, message.rb line 498,
we were calling the #normalize_whitespace method on the body string
before it was decoded.

I'm not too sure if messing with whitespace is the right thing to do
there, but that aside, that method was then also calling #fix_encoding!
which would forcibly transcode the raw body to UTF-8. Instead, we want to
keep the body as ASCII-8BIT at that point, and let it be decoded using
all the normal message decoding mechanisms.
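
To illustrate why the early transcode is harmful, here is a minimal
sketch. It is not sup's #fix_encoding! implementation; the sample header
line and the assumption that the part's charset is ISO-8859-1 are
hypothetical, chosen only to show the general effect.

```ruby
# Hypothetical raw 8bit header line, held as ASCII-8BIT (binary) bytes.
# 0xE9 is "e with acute accent" in ISO-8859-1, but is not valid UTF-8 on its own.
raw = "Subject: spam \xE9 spam".b

# Treating the raw bytes as UTF-8 too early: the invalid byte can only be
# replaced, so the original character is lost for good.
forced = raw.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")

# Keeping the bytes untouched and decoding later, once the charset is known
# (assumed here to be ISO-8859-1), recovers the character.
decoded = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts forced   # the non-ASCII byte has become U+FFFD
puts decoded  # the non-ASCII byte has become U+00E9
```

Once the replacement has happened, no later decoding step can recover
the original byte, which is why the body should stay ASCII-8BIT until
the normal decoding path runs.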

The only other calls to #normalize_whitespace are in the UI, and in the
code path which handles body text of messages, message.rb line 592,
where the body text has already been decoded. So it seems like we can
safely make #normalize_whitespace just mess with whitespace and leave
the string encoding alone.
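
As a rough check of that claim, the sketch below copies the new method
body into a standalone function (the function form and the sample
strings are assumptions for illustration, not code from sup) and shows
that the two gsub calls leave the string's encoding as they found it.

```ruby
# Standalone copy of the whitespace-only logic, written as a plain function
# instead of a String method so it can run outside sup.
def normalize_whitespace(str)
  str.gsub(/\t/, "    ").gsub(/\r/, "")
end

# Undecoded bytes, as in the message/rfc822 code path.
binary_in  = "a\tb\r\n\xE9".b
binary_out = normalize_whitespace(binary_in)

# Already-decoded text, as in the body text code path.
utf8_in  = "x\ty\r\n"
utf8_out = normalize_whitespace(utf8_in)

puts binary_out.encoding  # same encoding as binary_in
puts utf8_out.encoding    # same encoding as utf8_in
```

Tabs are expanded and carriage returns dropped in both cases, while the
non-ASCII byte in the binary string passes through unchanged.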

Fixes #205.

Diffstat:

M  lib/sup/util.rb                                       |  1 -
A  test/fixtures/non-ascii-header-in-nested-message.eml  | 36 ++++++++++++++++++++++++++++++++++++
M  test/test_message.rb                                  | 23 +++++++++++++++++++++++

3 files changed, 59 insertions(+), 1 deletion(-)
diff --git a/lib/sup/util.rb b/lib/sup/util.rb
@@ -376,7 +376,6 @@ class String
   end
 
   def normalize_whitespace
-    fix_encoding!
     gsub(/\t/, "    ").gsub(/\r/, "")
   end
 
diff --git a/test/fixtures/non-ascii-header-in-nested-message.eml b/test/fixtures/non-ascii-header-in-nested-message.eml
@@ -0,0 +1,36 @@
+Return-Path: <spammer@example.com>
+From: SPAM � <spammer@example.com>
+To: <a@b.c>
+Subject: spam � spam
+MIME-Version: 1.0
+Content-Type: multipart/mixed; boundary="----------=_4F506AC2.EE281DC4"
+Message-Id: <20120302063755.0FE2122017@a.a.a.a>
+Date: Fri,  2 Mar 2012 07:37:55 +0100 (CET)
+
+This is a multi-part message in MIME format.
+
+------------=_4F506AC2.EE281DC4
+Content-Type: text/plain; charset=iso-8859-1
+Content-Disposition: inline
+Content-Transfer-Encoding: 8bit
+
+Spam detection software, running on the system "a.a.a.a.a.", has
+identified this incoming email as possible spam.  The original message
+has been attached to this so you can view it (if it isn't spam) or label
+similar future email.
+
+
+------------=_4F506AC2.EE281DC4
+Content-Type: message/rfc822; x-spam-type=original
+Content-Description: original message before SpamAssassin
+Content-Disposition: attachment
+Content-Transfer-Encoding: 8bit
+
+From: SPAM � <spammer@example.com>
+To: <a@b.c>
+Subject: spam � spam
+
+This is a spam.
+
+------------=_4F506AC2.EE281DC4--
+
diff --git a/test/test_message.rb b/test/test_message.rb
@@ -248,6 +248,29 @@ class TestMessage < Minitest::Test
     assert_equal("spam \ufffd spam", sup_message.subj)
   end
 
+  def test_nonascii_header_in_nested_message
+    source = DummySource.new("sup-test://test_nonascii_header_in_nested_message")
+    source.messages = [ fixture_path("non-ascii-header-in-nested-message.eml") ]
+    source_info = 0
+
+    sup_message = Message.build_from_source(source, source_info)
+    chunks = sup_message.load_from_source!
+
+    assert_equal(3, chunks.length)
+
+    assert(chunks[0].is_a? Redwood::Chunk::Text)
+
+    assert(chunks[1].is_a? Redwood::Chunk::EnclosedMessage)
+    ## TODO need to fix EnclosedMessage#lines
+    #assert_equal(4, chunks[1].lines.length)
+    #assert_equal("From: SPAM \ufffd <spammer@example.com>", chunks[1].lines[0])
+    #assert_equal("spam \ufffd spam", chunks[1].lines[3])
+
+    assert(chunks[2].is_a? Redwood::Chunk::Text)
+    assert_equal(1, chunks[2].lines.length)
+    assert_equal("This is a spam.", chunks[2].lines[0])
+  end
+
   def test_malicious_attachment_names
     source = DummySource.new("sup-test://test_blank_header_lines")
     source.messages = [ fixture_path('malicious-attachment-names.eml') ]
