update -eos flag docs and test

author Steve Hancock <perltidy@users.sourceforge.net>

Tue, 15 Mar 2022 19:07:14 +0000 (12:07 -0700)

committer Steve Hancock <perltidy@users.sourceforge.net>

Tue, 15 Mar 2022 19:07:14 +0000 (12:07 -0700)
author Steve Hancock <perltidy@users.sourceforge.net>
Tue, 15 Mar 2022 19:07:14 +0000 (12:07 -0700)
committer Steve Hancock <perltidy@users.sourceforge.net>
Tue, 15 Mar 2022 19:07:14 +0000 (12:07 -0700)
diff --git a/docs/eos_flag.md b/docs/eos_flag.md

new file mode 100644 (file)

index 0000000..5b66292
--- /dev/null
+++ b/docs/eos_flag.md
@@ -0,0 +1,173 @@
+# The --encode-output-strings Flag
+
+## What's this about?
+
+This is about making perltidy work better as a filter when called from other
+Perl scripts.  For example, in the following example a reference to a
+string `$source` is passed to perltidy and it stores the formatted result in
+a string named `$output`:
+
+```
+    my $output;
+    my $err = Perl::Tidy::perltidy(
+        source      => \$source,
+        destination => \$output,
+        argv        => $argv,
+    );
+```
+
+For a filtering operation we expect to be able to directly compare the source and output strings, like this:
+
+```
+    if ($output eq $source) {print "Your formatting is unchanged\n";}
+```
+
+The problem is that in versions of perltidy prior to 2022 there was a use case
+where this was not possible.  That case was when perltidy received an encoded
+source string and decoded it from a utf8 but did not re-encode the output.  So
+the source string was in a different storage mode than the output string, and
+a direct comparison was not meaningful.
+
+This problem is an unintentional result of the historical evolution of perltidy and needs to be fixed.
+
+## How will the problem be fixed?
+
+A fix is being phased in over a couple of steps. The first step was to
+introduce a new flag in in version 20220217.  The new flag is
+**--encode-output-strings**, or **-eos**.  When this is set, perltidy will fix
+the specific problem mentioned above by doing an encoding before returning.
+So perltidy will behave well as a filter when **-eos** is set.  In
+fact, a more intuitive name for the flag might have been **--be-a-nice-filter**. To
+illustrate using this flag in the above example, we could write
+
+```
+    my $output;
+
+    # Make perltidy versions after 2022 behave well as a filter
+    $argv .= " -eos" if ($Perl::Tidy::VERSION > 20220101);
+    my $err = Perl::Tidy::perltidy(
+        source      => \$source,
+        destination => \$output,
+        argv        => $argv,
+    );
+```
+
+With this modification we can make a meaningful direct comparison of `$source` and `$output`. The test on `$VERSION` allows this to work with older versions of perltidy (which would not recognize the flag -eos).  An update such as the above can be made right now to facilitate a smooth transition to the new default.
+
+In the second step, possibly later in 2022, the new **-eos** flag will become the default.
+
+## What can go wrong?
+
+The first step is safe because the default behavior is unchanged.  But the programmer has to set **-eos** for the corrected behavior to go into effect.
+
+The second step, in which **-eos** becomes the default, will have no effect on programs which do not require perltidy to decode strings, and it will make some programs start processing encoded strings correctly.  But there is also the possibility of  **double encoding** of the output, or in other words data corruption, in some cases.  This could happen if an existing program already has already worked around this issue by encoding the output that it receives back from perltidy.  It is important to check for this.
+
+To see how common this problem might be, all programs on CPAN which use Perl::Tidy as a filter were examined.  Of a total of 43 programs located, one was identified for which the change in default would definitely cause double encoding, and in a couple of other programs it might be possible possible but it was difficult to determine.  The majority of programs would either not be affected or would start working correctly when processing encoded files. Here is a slightly revised version of the code for the program which would have a problem with double encoding with the new default:
+
+```
+    my $output;
+    Perl::Tidy::perltidy(
+        source      => \$self->{data},
+        destination => \$output,
+        stderr      => \$perltidy_err,
+        errorfile   => \$perltidy_err,
+        logfile     => \$perltidy_log,
+        argv        => $perltidy_argv,
+    );
+
+    # convert source back to raw
+    encode_utf8 $output;
+```
+
+The problem is in the last line where encoding is done after the call to perltidy.  This
+encoding operation was added by the author to compensate for the lack of an
+encoding step with the old default behavior.  But if we run this code with
+**-eos**, which is the planned new default, encoding will also be done by perltidy before
+it returns, with the result that `$output` gets double encoding.  This must be avoided. Here
+is one way to modify the above code to avoid double encoding:
+
+```
+    my $has_eos_flag = $Perl::Tidy::VERSION > 20220101;
+    $perltidy_argv .= ' -eos' if $has_eos_flag;
+
+    Perl::Tidy::perltidy(
+        source      => \$self->{data},
+        destination => \$output,
+        stderr      => \$perltidy_err,
+        errorfile   => \$perltidy_err,
+        logfile     => \$perltidy_log,
+        argv        => $perltidy_argv,
+    );
+
+    # convert source back to raw if perltidy did not do it
+    encode_utf8($output) if ( !$has_eos_flag );
+```
+
+A related problem is if an update of Perl::Tidy is made without also updating
+a corrected version of a module such as the above.  To help reduce the chance
+that this will occur the Change Log for perltidy will contain a warning to be
+alert for the double encoding problem, and how to reset the default if
+necessary.  This is also the reason for waiting some time before the second step is made.
+
+If double encoding does appear to be occuring after the default change for some program which calls Perl::Tidy, then a quick emergency fix can be made by the program user by setting **-neos** to revert to the old default.  A better fix can eventually be made by the program author by removing the second encoding using a technique such as illustrated above.
+
+## Summary
+
+A new flag, **-eos**, has been added to cause Perl::Tidy to behave better as a
+filter when called from other Perl scripts.  This flag will eventually become
+the default setting.  Programs which use Perl::Tidy as a
+filter can be tested right now with the new **-eos** flag to be sure that double
+encoding is not possible when the default is changed.
+
+## Appendix, a little closer look
+
+A string of text (here, a Perl script) can be stored by Perl in one of two
+internal storage formats.  For simplicity let's call them 'B' mode (for
+'Byte') mode and 'C' mode (for 'Character').  The 'C' mode is needed for
+working with multi-byte characters.  Thinking of a Perl script as a single
+long string of text, we can look at the mode of the text of a source script as
+it is processed by perltidy at three well-defined points:
+
+    - when it enters as a source
+    - at the intermediate stage as it is processed
+    - when it is leaves to its destination
+
+Since 'C' mode only has meaning within Perl scripts, a rule is that
+outside of the realm of Perl the text must exist in 'B' mode.  So the source
+can only be in 'C' mode if it arrives by a call from another Perl program, and
+the destination can only be in 'C' mode if the destination is a Perl program.
+
+Let us make a list of all possible sets of modes to be sure that all cases are
+covered.  If each of the three states could be in 'B' or 'C' mode then we
+would have a total of 2 x 2 x 2 = 8 combinations of states.  Here is a list of
+them, with a note indicating which ones are possible, and when:
+
+    1 - B->B->B  always ok
+    2 - B->B->C  never (trailing B->C never done)
+    3 - B->C->B  ok if destination is a file or -eos is set     [NEW DEFAULT]
+    4 - B->C->C  ok if destination is a string and -neos is set [OLD DEFAULT]
+    5 - C->B->B  never (leading C->B never done)
+    6 - C->B->C  never (leading C->B and trailing B->C never done)
+    7 - C->C->B  only for string-to-file
+    8 - C->C->C  only for string-to-string
+
+So three of these cases (2, 5, and 6) cannot occur and the other five can
+occur.  Of these five possible cases, four are possible when the
+destination is a string:
+
+    1 - B->B->B  ok
+    3 - B->C->B  ok if -eos is set  [NEW DEFAULt]
+    4 - B->C->C  ok if -neos is set [OLD DEFAULT]
+    8 - C->C->C  ok 
+
+From this we can see that, if **-eos** is set, then the problematic case 4 will
+not occur and the starting and ending states have the same storage mode for
+all routes through perltidy.  This verfies that perltidy work well as a filter in
+all cases when **-eos** flag.
+
+It is worth noting that programs which decode text before calling perltidy pass by the C->C->C route and are not influenced by the -eos flag setting. This is a fairly common usage pattern.
+
+Also note that case 7, the C->C->B route, is an unusual but possible situation
+involving a source string being sent directly to a file.  It is the only
+situation in which perltidy does an encoding without having done a
+corresponding previous decoding.
diff --git a/lib/Perl/Tidy.pm b/lib/Perl/Tidy.pm

index ceb683b03bd89f67e61fd56a7ef5ac2986ab21b5..8eb5d0743b7de48d23e7463a404220dba02b31d9 100644 (file)
--- a/lib/Perl/Tidy.pm
+++ b/lib/Perl/Tidy.pm
@@ -394,6 +394,25 @@ sub Warn_msg { my $msg = shift; $fh_stderr->print($msg); return }
  # Output Warn message and bump Warn count
  sub Warn { my $msg = shift; $fh_stderr->print($msg); $Warn_count++; return }
  
+sub is_char_mode {
+
+    my ($string) = @_;
+
+    # Returns:
+    #   true  if $string is in Perl's internal character mode
+    #         (also called the 'upgraded form', or UTF8=1)
+    #   false if $string is in Perl's internal byte mode
+
+    # This function isolates the call to Perl's internal function
+    # utf8::is_utf8() which is true for strings represented in an 'upgraded
+    # form'. It is available after Perl version 5.8.
+    # See https://perldoc.perl.org/Encode.
+    # See also comments in Carp.pm and other modules using this function
+
+    return 1 if ( utf8::is_utf8($string) );
+    return;
+}
+
  sub perltidy {
  
      my %input_hash = @_;
@@ -420,24 +439,36 @@ sub perltidy {
      );
  
      # Status information which can be returned for diagnostic purposes.
-    # This is still under development and has not been documented.
+    # This is intended for testing and subject to change.
  
-    # file_count         => number of files processed in this call
+    # List of "key => value" hash entries:
+
+    # Some relevant user input parameters for convenience:
      # opt_format         => value of --format: 'tidy', 'html', or 'user'
      # opt_encoding       => value of -enc flag: 'utf8', 'none', or 'guess'
      # opt_encode_output  => value of -eos flag: 'eos' or 'neos'
      # opt_max_iterations => value of --iterations=n
-    #
-    # If there are multiple files, these values will be for the last file:
-    # input_name         => display name of the input stream
-    # output_name        => display name of the output stream
-    # string_mode_source => 'byte' or 'char' : in which of Perl's two string
-    #   modes is the source ( after reading from any file.  Will be 'char'
-    #   mode if we received a string with utf8::is_utf8() set ).
-    # string_mode_used   => 'byte' or 'char' : in which of Perl's two string
-    #                        modes was the text processed?
+
+    # file_count         => number of files processed in this call
+
+    # If multiple files are processed, then the following values will be for
+    # the last file only:
+
+    # input_name         => name of the input stream
+    # output_name        => name of the output stream
+
+    # The following two variables refer to Perl's two internal string modes,
+    # and have the values 0 for 'byte' mode and 1 for 'char' mode:
+    # char_mode_source   => true if source is in 'char' mode. Will be false
+    #      unless we received a source string ref with utf8::is_utf8() set.
+    # char_mode_used     => true if text processed by perltidy in 'char' mode.
+    #      Normally true for text identified as utf8, otherwise false.
+
+    # These variables tell what utf8 decoding/encoding was done:
      # input_decoded_as   => non-blank if perltidy decoded the source text
      # output_encoded_as  => non-blank if perltidy encoded before return
+
+    # These variables are related to iterations and convergence testing:
      # iteration_count    => number of iterations done
      #                       ( can be from 1 to opt_max_iterations )
      # converged          => true if stopped on convergence
@@ -453,15 +484,15 @@ sub perltidy {
          opt_encode_output  => "",
          opt_max_iterations => "",
  
-        input_name         => "",
-        output_name        => "",
-        string_mode_source => 0,
-        string_mode_used   => "",
-        input_decoded_as   => "",
-        output_encoded_as  => "",
-        iteration_count    => 0,
-        converged          => 0,
-        blinking           => 0,
+        input_name        => "",
+        output_name       => "",
+        char_mode_source  => 0,
+        char_mode_used    => 0,
+        input_decoded_as  => "",
+        output_encoded_as => "",
+        iteration_count   => 0,
+        converged         => 0,
+        blinking          => 0,
      };
  
      # Fix for issue git #57
@@ -1066,23 +1097,22 @@ EOM
          my $remove_terminal_newline =
            !$rOpts->{'add-terminal-newline'} && substr( $buf, -1, 1 ) !~ /\n/;
  
-        # Decode the input stream if necessary requested
+        # Decode the input stream if necessary or requested
          my $encoding_in              = "";
          my $rOpts_character_encoding = $rOpts->{'character-encoding'};
          my $encoding_log_message;
          my $decoded_input_as = "";
-        $rstatus->{'string_mode_source'} = 'byte';
-
-        # Case 1: If the UTF8 flag is set, then Perl is already in a
-        # character-oriented mode for this string rather than a byte-oriented
-        # mode.  This can happen for example if the caller has decoded the
-        # string before calling perltidy.  In any case, our only option is to
-        # ignore any encoding flag.  See https://perldoc.perl.org/Encode.
-        # Note: you have to do this test within 'if' parens as below.  You
-        # can NOT use: my $flag = utf8::is_utf8($buf) - gives wrong result.
-        if ( utf8::is_utf8($buf) ) {
+        $rstatus->{'char_mode_source'} = 0;
+
+        # Case 1: If Perl is already in a character-oriented mode for this
+        # string rather than a byte-oriented mode.  Normally, this happens if
+        # the caller has decoded a utf8 string before calling perltidy.  But it
+        # could also happen if the user has done some unusual manipulations of
+        # the source.  In any case, we will not attempt to decode it because
+        # that could result in an output string in a different mode.
+        if ( is_char_mode($buf) ) {
              $encoding_in = "utf8";
-            $rstatus->{'string_mode_source'} = 'char';
+            $rstatus->{'char_mode_source'} = 1;
          }
  
          # Case 2. No input stream encoding requested.  This is appropriate
@@ -1182,7 +1212,7 @@ EOM
  
          $rstatus->{'input_name'}       = $display_name;
          $rstatus->{'opt_encoding'}     = $rOpts_character_encoding;
-        $rstatus->{'string_mode_used'} = $encoding_in ? 'char' : 'byte';
+        $rstatus->{'char_mode_used'}   = $encoding_in ? 1 : 0;
          $rstatus->{'input_decoded_as'} = $decoded_input_as;
  
          # Define the function to determine the display width of character
@@ -2781,6 +2811,7 @@ sub generate_options {
        delete-old-newlines
        delete-semicolons
        extended-syntax
+      encode-output-strings
        function-paren-vertical-alignment
        fuzzy-line-length
        hanging-side-comments
diff --git a/t/testwide.t b/t/testwide.t

index 4a8a1ce5861c86e08199cfa3d9952342339e08a2..3395b722214f480bd20ac791eb9db382400d9f62 100755 (executable)
--- a/t/testwide.t
+++ b/t/testwide.t
@@ -3,10 +3,9 @@ use utf8;
  use Test;
  use Carp;
  use FindBin;
-BEGIN {unshift @INC, "./"}
-BEGIN {plan tests => 3}
-use Perl::Tidy; 
-
+BEGIN { unshift @INC, "./" }
+BEGIN { plan tests => 3 }
+use Perl::Tidy;
  
  my $source = <<'EOM';
  %pangrams=("Plain","ASCII",
@@ -15,7 +14,7 @@ my $source = <<'EOM';
  "Любя, съешь щипцы, — вздохнёт мэр, — кайф жгуч.","RU");
  EOM
  
-my $expected_output=<<'EOM';
+my $expected_output = <<'EOM';
  %pangrams = (
               "Plain",                                                  "ASCII",
               "Zwölf große Boxkämpfer jagen Vik quer über den Sylter.", "DE",
@@ -39,20 +38,30 @@ Perl::Tidy::perltidy(
      argv        => '-nsyn',
  );
  
-ok($output, $expected_output);
+ok( $output, $expected_output );
  
-# Since default encoding is 'guess' and the source is a file of utf8, perltidy
-# will guess utf8 and decode the input. But we are comparing to a string which
-# is in character mode, so we do not want perltidy to encode the output in this
-# case and therefore must set -neos.
  Perl::Tidy::perltidy(
      source      => $FindBin::Bin . '/testwide.pl.src',
      destination => \$output,
      perltidyrc  => \$perltidyrc,
-    argv        => '-nsyn -neos',
+    argv        => '-nsyn',
  );
  
-ok($output, $expected_output);
+# We have to be careful here ...  In this test we are comparing $output to a
+# source string which is in character mode (since it is in this file declared
+# with 'use utf8'). We need to compare strings which have the same storage
+# mode.
+
+# The internal storage mode of $output was character mode (decoded) for
+# vesions prior to 20220217.02, but is byte mode (encoded) for the latest
+# version of perltidy.
+
+# The following statement will decode $output if it is stored in byte mode,
+# and leave it unchanged (and return an error) otherwise.  So this will work
+# with all version of perltidy.  See https://perldoc.perl.org/utf8
+utf8::decode($output);
+
+ok( $output, $expected_output );
  
  # Test writing encoded output to stdout with the -st flag
  # References: RT #133166, RT #133171, git #35
@@ -72,7 +81,7 @@ do {
          source => \$source,
          ##destination => ... we are using -st, so no destination is specified
          perltidyrc => \$perltidyrc,
-        argv       => '-nsyn -st',  # added -st
+        argv       => '-nsyn -st',    # added -st
      );
      close STDOUT;
  
@@ -83,5 +92,5 @@ do {
      while ( my $line = <TMP> ) { $output .= $line }
  };
  
-ok($output, $expected_output);
+ok( $output, $expected_output );
author	Steve Hancock <perltidy@users.sourceforge.net>
	Tue, 15 Mar 2022 19:07:14 +0000 (12:07 -0700)
committer	Steve Hancock <perltidy@users.sourceforge.net>
	Tue, 15 Mar 2022 19:07:14 +0000 (12:07 -0700)
docs/eos_flag.md	[new file with mode: 0644]	patch \| blob
lib/Perl/Tidy.pm		patch \| blob \| history
t/testwide.t		patch \| blob \| history