@c -*- coding: utf-8; mode: texinfo; -*-
@node Regression tests
@chapter Regression tests

@menu
* Introduction to regression tests::
* Precompiled regression tests::
* Compiling regression tests::
* Regtest comparison::
* Pixel-based regtest comparison::
* Finding the cause of a regression::
* Memory and coverage tests::
* MusicXML tests::
* Grand Regression Test Checking::
@end menu


@node Introduction to regression tests
@section Introduction to regression tests

LilyPond has a complete suite of regression tests that are used
to ensure that changes to the code do not break existing behavior.
These regression tests comprise small LilyPond snippets that test
the functionality of each part of LilyPond.

Regression tests are added when new functionality is added to
LilyPond.
We do not yet have a policy on when it is appropriate to add or
modify a regtest when bugs are fixed.  Individual developers
should use their best judgement until this is clarified during the
@ref{Grand Organization Project (GOP)}.

The regression tests are compiled using special @code{make}
targets.  There are three primary uses for the regression
tests.  First, successful completion of the regression tests means
that LilyPond has been properly built.  Second, the output of the
regression tests can be manually checked to ensure that
the graphical output matches the description of the intended
output.  Third, the regression test output from two different
versions of LilyPond can be automatically compared to identify
any differences.  These differences should then be manually
checked to ensure that the differences are intended.

Regression tests (@qq{regtests}) are available in precompiled form
as part of the documentation.  Regtests can also be compiled
on any machine that has a properly configured LilyPond build
system.


@node Precompiled regression tests
@section Precompiled regression tests

@subheading Regression test output

As part of the release process, the regression tests are run
for every LilyPond release.  Full regression test output is
available for every stable version and the most recent development
version.

Regression test output is available in HTML and PDF format.  Links
to the regression test output are available at the developer's
resources page for the version of interest.

The latest stable version of the regtests is found at:

@example
@uref{http://lilypond.org/doc/stable/input/regression/collated-files.html}
@end example

The latest development version of the regtests is found at:

@example
@uref{http://lilypond.org/doc/latest/input/regression/collated-files.html}
@end example


@subheading Regression test comparison

Each time a new version is released, the regtests are
compiled and the output is automatically compared with the
output of the previous release.  The result of these
comparisons is archived online:

@example
@uref{http://lilypond.org/test/}
@end example

Checking these pages is a very important task for the LilyPond project.
You are invited to report anything that looks broken, or any case
where the output quality is not on par with the previous release,
as described in @rweb{Bug reports}.

@warning{The special regression test
@file{test-output-distance.ly} will always show up as a
regression.  This test changes each time it is run, and serves to
verify that the regression tests have, in fact, run.}


@subheading What to look for

The test comparison shows all of the changes that occurred between
the current release and the prior release.  Each test that has a
significant (noticeable) difference in output is displayed, with
the old version on the left and the new version on the right.

Some of the small changes can be ignored (slightly different slur
shapes, small variations in note spacing), but this is not always
the case: sometimes even the smallest change means that something
is wrong.  To help in distinguishing these cases, we use a bigger
staff size when small differences matter.

A staff size of 30 generally means @qq{pay extra attention to details}.
A staff size of 40 (twice the default size of 20) or more means
that the regtest @strong{is} about the details.

A staff size smaller than the default has no special meaning.

Regression tests whose output is the same for both versions are
not shown in the test comparison.

@itemize
@item
Images: green blurs in the new version show the approximate
location of elements in the old version.

There are often minor adjustments in spacing which do not indicate
any problem.

@item
Log files: show the difference in command-line output.

The main thing to examine is any change in page counts: if a
file used to fit on 1 page but now requires 4 or 5 pages,
something is suspicious!

@item
Profile files: apparently give some information about CPU usage.
@c TODO: document what the profile files are for.
If you get tons of changes in cell counts, this probably means that
you ran @code{make test-baseline} with a different number of CPU
threads than @code{make check}.  Try redoing the tests from scratch
with the same number of threads each time; see
@ref{Saving time with the -j option}.

@end itemize

@warning{
The automatic comparison of the regtests checks the LilyPond
bounding boxes.  This means that Ghostscript changes and changes
in lyrics or text are not detected.
}

@node Compiling regression tests
@section Compiling regression tests

Developers may wish to see the output of the complete regression
test suite for the current version of the source repository
between releases.  Current source code is available; see
@ref{Working with source code}.

For regression testing, @code{../configure} should be run with the
@code{--disable-optimising} option.  Then you will need
to build the LilyPond binary; see @ref{Compiling LilyPond}.

Uninstalling the previous LilyPond version is not necessary, nor is
running @code{make install}, since the tests will automatically be
compiled with the LilyPond binary you have just built in your source
directory.

From this point, the regtests are compiled with:

@example
make test
@end example

If you have a multi-core machine, you may want to use the @option{-j}
option and the @var{CPU_COUNT} variable, as
described in @ref{Saving time with CPU_COUNT}.
For a quad-core processor, the complete command would be:

@example
make -j5 CPU_COUNT=5 test
@end example
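The job count in the quad-core example above is the number of cores plus one.  As a minimal sketch, assuming GNU coreutils' @command{nproc} is available (it is not part of the LilyPond build system), the same value can be derived automatically.  The command is only printed; remove the leading @code{echo} to actually run the regtests:

@example
# Assumes GNU coreutils' nproc; falls back to 1 CPU if it is missing.
cpus=$(nproc 2>/dev/null || echo 1)
# One job more than the number of processing units, as in -j5 for 4 cores.
njobs=$((cpus + 1))
# Only prints the command; remove "echo" to run the regtests.
echo make -j$njobs CPU_COUNT=$njobs test
@end example

On a machine where @command{nproc} reports 4, this prints
@code{make -j5 CPU_COUNT=5 test}, the command shown above.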

The regtest output will then be available in
@file{input/regression/out-test}.
@file{input/regression/out-test/collated-examples.html}
contains a listing of all the regression tests that were run,
but none of the images are included.  Individual images are
also available in this directory.

The primary use of @samp{make@tie{}test} is to verify that the
regression tests all run without error.  The regression test
page that is part of the documentation is created only when the
documentation is built, as described in @ref{Generating documentation}.
Note that building the documentation requires more installed components
than building the source code, as described in
@ref{Requirements for building documentation}.


@node Regtest comparison
@section Regtest comparison

Before modified code is committed to @code{master} (via @code{staging}),
a regression test
comparison must be completed to ensure that the changes have
not caused problems with previously working code.  The comparison
is made automatically when the regression test suite is compiled
twice: once without your changes (the baseline) and once with them.

@enumerate

@item
Run @code{make} with the current git master, without any of your changes.

@item
Before making changes to the code, establish a baseline for the comparison by
going to the @file{$LILYPOND_GIT/build/} directory and running:

@example
make test-baseline
@end example

@item
Make your changes, or apply the patch(es) under consideration.

@item
Compile the source with @samp{make} as usual.

@item
Check for unintentional changes to the regtests:

@example
make check
@end example

After this has finished, a regression test comparison will be
available (relative to the current @file{build/} directory) at:

@example
out/test-results/index.html
@end example

For each regression test that differs between the baseline and the
changed code, a regression test entry will be displayed.  Ideally,
the only changes would be the changes that you were working on.
If regressions are introduced, they must be fixed before
committing the code.

@warning{
The special regression test @file{test-output-distance.ly} will always
show up as a regression.  This test changes each time it is run, and
serves to verify that the regression tests have, in fact, run.}

@item
If you are happy with the results, then stop now.

If you want to continue programming, then make any additional code
changes and continue with the next step.

@item
Compile the source with @samp{make} as usual.

@item
To re-check files that differed between the initial
@samp{make@tie{}test-baseline} and your post-changes
@samp{make@tie{}check}, run:

@example
make test-redo
@end example

This updates the regression list at @file{out/test-results/index.html}.
It does @emph{not} redo @file{test-output-distance.ly}.

@item
When all regressions have been resolved, the output list will be empty.

@item
Once all regressions have been resolved, a final check should be completed
by running:

@example
make test-clean
make check
@end example

This cleans the results of the previous @samp{make@tie{}check}, then does the
automatic regression comparison again.

@end enumerate

@advanced{
Once a test baseline has been established, there is no need to run it again
unless git master has changed.  In other words, if you work with several
branches and want to do regtest comparisons for all of them, you can run
@code{make test-baseline} on git master, check out one branch, run
@code{make} and @code{make check} on it, then switch to another branch and
run @code{make test-clean}, @code{make} and @code{make check} without running
@code{make test-baseline} again.}
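
The multi-branch workflow in the note above can be sketched as two shell functions.  This is a minimal sketch rather than part of the build system: the function names @code{regtest_baseline} and @code{regtest_check} and the value of @code{JOBS} are assumptions, and the functions expect to be called from @file{$LILYPOND_GIT/build/}.  Using one fixed @code{JOBS} value for the baseline and every check keeps the number of threads identical between runs, which, as mentioned in the description of the profile files, helps avoid spurious changes in cell counts.

@example
# Hypothetical helpers; call them from $LILYPOND_GIT/build/.
# Use the same job count for the baseline and for every check.
JOBS=5

# Build git master and record the baseline (needed only once,
# unless master changes).  Stops at the first failing command.
regtest_baseline() (
  git checkout master &&
  make -j$JOBS &&
  make -j$JOBS CPU_COUNT=$JOBS test-baseline
)

# Build the branch given as $1 and compare it with the baseline.
# Inspect out/test-results/index.html before checking the next
# branch, because the next run replaces the results.
regtest_check() (
  git checkout "$1" &&
  make test-clean &&
  make -j$JOBS &&
  make -j$JOBS CPU_COUNT=$JOBS check
)
@end example

For example, @code{regtest_baseline} followed by
@code{regtest_check my-feature} (a hypothetical branch name) runs the
steps listed above for one branch.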

@node Pixel-based regtest comparison
@section Pixel-based regtest comparison

As an alternative to the @code{make test} method for regtest checking (which
relies upon @code{.signature} files that are created by a LilyPond run and
describe the placement of grobs), there is a script that compares the output
of two LilyPond versions pixel-by-pixel.  To use it, start by checking out the
version of LilyPond you want to use as a baseline, and run @code{make}.  Then,
do the following:

@example
cd $LILYPOND_GIT/scripts/auxiliar/
./make-regtest-pngs.sh -j9 -o
@end example

The @code{-j9} option tells the script to use 9 CPUs to create the
images; change this to your own CPU count plus one.  The @code{-o} option
means that this is the @qq{old} version.  This will create images of all the
regtests in

@example
$LILYPOND_BUILD_DIR/out-png-check/old-regtest-results/
@end example

Now check out the version you want to compare with the baseline.  Run
@code{make} again to recreate the LilyPond binary.  Then, do the following:

@example
cd $LILYPOND_GIT/scripts/auxiliar/
./make-regtest-pngs.sh -j9 -n
@end example

The @code{-n} option tells the script to make a @qq{new} version of the
images.  They are created in

@example
$LILYPOND_BUILD_DIR/out-png-check/new-regtest-results/
@end example

Once the new images have been created, the script compares the old images with
the new ones pixel-by-pixel and prints a list of the differing images to the
terminal, together with a count of how many differences were found.  The
results of the checks are in

@example
$LILYPOND_BUILD_DIR/out-png-check/regtest-diffs/
@end example

To check for differences, browse that directory with an image
viewer.  Differences are shown in red.  Be aware that some images with complex
fonts or spacing annotations always display a few minor differences.  These can
safely be ignored.
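
The script already prints its own count of differing images.  To re-count them later without opening an image viewer, something like the following can be used.  It is a minimal sketch: the function name @code{count_png_diffs} is hypothetical, and it assumes that the difference images are PNG files and that @env{LILYPOND_BUILD_DIR} is set as in the commands above.

@example
# Hypothetical helper: count_png_diffs BUILD_DIR
# Prints the number of PNG files in BUILD_DIR/out-png-check/regtest-diffs;
# a missing directory counts as 0.
count_png_diffs() (
  diff_dir=$1/out-png-check/regtest-diffs
  echo $(( $(find "$diff_dir" -name '*.png' 2>/dev/null | wc -l) ))
)
count_png_diffs "$LILYPOND_BUILD_DIR"
@end example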


@node Finding the cause of a regression
@section Finding the cause of a regression

Git has special functionality to help track down the exact
commit that caused a problem.  See the git manual page for
@code{git bisect}.  This is a job that non-programmers can do,
although it requires familiarity with git, the ability to compile
LilyPond, and generally a fair amount of technical knowledge.  A
brief summary is given below, but you may need to consult other
documentation for in-depth explanations.

Even if you are not familiar with git or are not able to compile
LilyPond, you can still help to narrow down the cause of a
regression simply by downloading the binary releases of different
LilyPond versions and testing them for the regression.  Knowing
which version of LilyPond first exhibited the regression is
helpful to a developer, as it shortens the @code{git bisect}
procedure.

Once a problematic commit is identified, the programmers' job is
much easier.  In fact, for most regression bugs, the majority of
the time is spent simply finding the problematic commit.

More information is in @ref{Regression tests}.

@subheading git bisect setup

We need to set up the bisect for each problem we want to
investigate.

Suppose we have an input file which compiled in version 2.13.32,
but fails in version 2.13.38 and above.

@enumerate
@item
Begin the process:

@example
git bisect start
@end example

@item
Give it the earliest known bad tag:

@example
git bisect bad release/2.13.38-1
@end example

(You can list the available tags with @code{git tag}.)

@item
Give it the latest known good tag:

@example
git bisect good release/2.13.32-1
@end example

You should now see something like:
@example
Bisecting: 195 revisions left to test after this (roughly 8 steps)
[b17e2f3d7a5853a30f7d5a3cdc6b5079e77a3d2a] Web: Announcement
update for the new @qq{LilyPond Report}.
@end example

@end enumerate

@subheading Running git bisect

@enumerate

@item
Compile the source:

@example
make
@end example

@item
Test your input file:

@example
out/bin/lilypond test.ly
@end example

@item
Check the result:

@itemize
@item
Does it crash, or is the output bad?  If so:

@example
git bisect bad
@end example

@item
Does your input file produce good output?  If so:

@example
git bisect good
@end example

@end itemize

@item
Once the exact problem commit has been identified, git will inform
you with a message like:

@example
6d28aebbaaab1be9961a00bf15a1ef93acb91e30 is the first bad commit
@dots{}
@end example

If there is still a range of commits to test, git will automatically
select a new version for you to test.  Go back to step 1.

@end enumerate

@subheading Recommendation: use two terminal windows

@itemize
@item
One window is open to the @code{build/} directory, and alternates
between these commands:

@example
make
out/bin/lilypond test.ly
@end example

@item
One window is open to the top source directory, and alternates
between these commands:

@example
git bisect good
git bisect bad
@end example

@end itemize
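
The manual good/bad cycle can also be automated with @code{git bisect run}, which runs a command at each commit that git selects and interprets its exit status: 0 means good, 125 means that the commit should be skipped, and any other value from 1 to 127 means bad.  The following is a minimal sketch, not an official LilyPond script; the function name @code{lily_bisect_step} is hypothetical.  It only detects crashes and errors, so output that is wrong but does not cause an error still requires the manual procedure above.

@example
# Hypothetical helper for "git bisect run"; expects to be started
# from the top source directory, with the input file name (relative
# to build/) as its argument.
# Exit status: 0 = good, 1 = bad, 125 = skip this commit.
lily_bisect_step() (
  cd build || exit 125             # no build directory: cannot test
  make || exit 125                 # commit does not compile: skip it
  out/bin/lilypond "$1" || exit 1  # any failure, even a crash, is bad
)
@end example

To use it, save the function followed by the line
@code{lily_bisect_step "$1"} in a file outside the source tree, for
example @file{/tmp/bisect-step.sh}, complete the setup steps above,
and run @code{git bisect run sh /tmp/bisect-step.sh test.ly} from the
top source directory.  Mapping every failure to 1 matters because
@code{git bisect run} aborts on exit codes above 127, which is what
the shell reports when a program is killed by a signal.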


@node Memory and coverage tests
@section Memory and coverage tests

In addition to the graphical output of the regression tests, it is
possible to test memory usage and to determine how much of the source
code has been exercised by the tests.

@subheading Memory usage

For tracking memory usage as part of this test, you will need
a CVS version of GUILE with, in particular, the following patch applied:
@smallexample
@uref{http://www.lilypond.org/vc/old/gub.darcs/patches/guile-1.9-gcstats.patch}
@end smallexample

@subheading Code coverage

For checking the coverage of the test suite, do the following:

@example
./scripts/auxiliar/build-coverage.sh
@emph{# uncovered files, least covered first}
./scripts/auxiliar/coverage.py --summary out-cov/*.cc
@emph{# consecutive uncovered lines, longest first}
./scripts/auxiliar/coverage.py --uncovered out-cov/*.cc
@end example


@node MusicXML tests
@section MusicXML tests

LilyPond comes with a complete set of regtests for the
@uref{http://www.musicxml.org/,MusicXML} language.  Originally
developed to test @samp{musicxml2ly}, these regression tests
can be used to test any MusicXML implementation.

The MusicXML regression tests are found at
@file{input/regression/musicxml/}.

The output resulting from running these tests
through @samp{musicxml2ly} followed by @samp{lilypond} is
available in the LilyPond documentation:

@example
@uref{http://lilypond.org/doc/latest/input/regression/musicxml/collated-files}
@end example
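
To reproduce that output locally, each test can be converted with @samp{musicxml2ly} and then engraved with @samp{lilypond}.  The following is a minimal sketch, assuming both programs are on your @env{PATH} and that it is run from an empty scratch directory; the function name is hypothetical, and it only picks up files ending in @file{.xml}.

@example
# Hypothetical helper: engrave_musicxml_regtests SOURCE_DIR
# Converts each .xml regtest to NAME.ly in the current directory
# and then engraves it with lilypond.
engrave_musicxml_regtests() (
  for xml in "$1"/input/regression/musicxml/*.xml; do
    [ -e "$xml" ] || continue      # no match: the pattern stayed literal
    name=$(basename "$xml" .xml)
    musicxml2ly -o "$name.ly" "$xml" && lilypond "$name.ly"
  done
)
@end example

For example, @code{engrave_musicxml_regtests "$LILYPOND_GIT"} leaves a
@file{.ly} file and the engraved output for each test in the current
directory.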


@node Grand Regression Test Checking
@section Grand Regression Test Checking

@subheading What is this all about?

Regression tests (usually abbreviated @qq{regtests}) are a collection
of @file{.ly} files used to check whether LilyPond is working correctly.
For example, before version 2.15.12, breve noteheads had an incorrect width,
which resulted in collisions with other objects.  After the issue was fixed,
a small @file{.ly} file demonstrating the problem was added to the regression
tests as proof that the fix works.  If someone accidentally breaks the
breve width again, we will notice this in the output of that regression test.

@subheading How can I help?

We ask you to help us by checking one or two regtests from time to time.
You don't need programming skills to do this, not even LilyPond skills;
just basic music notation knowledge is enough.  Checking one regtest takes
less than a minute.  Simply go here:

@example
@uref{http://www.philholmes.net/lilypond/regtests/}
@end example

@subheading Some tips on checking regtests

@subsubheading Description text

The description should be clear even for a music beginner.
If any special terms are used in the description,
they should all be explained in our @rglosnamed{Top, Music Glossary}
or @rinternalsnamed{Top, Internals Reference}.
Vague descriptions (like @qq{behaves well} or @qq{looks reasonable})
should not be used.

@ignore
this may be useful for advanced regtest checking
@subsubheading Is regtest straightforward and systematic?

Unfortunately some regtests are written poorly.  A good regtest should be
straightforward: it should be obvious what it checks and how.  Also, it
usually shouldn't check everything at once.  For example, it's a bad idea to test
accidental placement by constructing one huge chord with many suspended notes
and loads of accidentals.  It's better to divide such a problem into a series
of clearly separated cases.
@end ignore