Include hunspell .bdic files for qtwebengine and maybe others

diff --git a/README b/README

index 8218f1ddb294de0a8e328146783e2d563f651366..9e559b5e1ea3c0db5225e8b17ade2804fa70d2f1 100644
--- a/README
+++ b/README
@@ -1,18 +1,22 @@
  Spell Checking Oriented Word Lists (SCOWL)
-Revision 6
-August 10, 2004
-by Kevin Atkinson
+Version 2020.12.07
+Mon Dec 7 20:14:35 2020 -0500 [5ef55f9]
+by Kevin Atkinson (kevina@gnu.org)
  
  The SCOWL is a collection of word lists split up in various sizes, and
  other categories, intended to be suitable for use in spell checkers.
  However, I am sure it will have numerous other uses as well.
  
-The latest version can be found at http://wordlist.sourceforge.net/
+The latest version can be found at http://wordlist.aspell.net/.
  
  The directory final/ contains the actual word lists broken up into
  various sizes and categories.  The r/ directory contains Readmes from
  the various sources used to create this package.
  
+The misc/ directory contains a small list of taboo words; see the README file
+for more info.  The speller/ directory contains scripts for creating
+spelling dictionaries for Aspell and Hunspell.
+
  The other directories contain the necessary information to recreate the
  word lists from the raw data.  Unless you are interested in improving the
  words lists you should not need to worry about what's here.  See the
@@ -21,11 +25,14 @@ there.
  
  Except for the special word lists the files follow the following
  naming convention:
-  <spelling category>-<classification>.<size>
+  <spelling category>-<sub-category>.<size>
  Where the spelling category is one of
-  english, american, british, british_z, canadian, 
-  variant_0, varaint_1, variant_2
-Classification is one of
+  english, american, british, british_z, canadian, australian
+  variant_1, variant_2, variant_3,
+  british_variant_1, british_variant_2, 
+  canadian_variant_1, canadian_variant_2,
+  australian_variant_1, australian_variant_2
+Sub-category is one of
    abbreviations, contractions, proper-names, upper, words
  And size is one of
    10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large), 
@@ -35,118 +42,305 @@ The special word lists follow are in the following format:
  Where description is one of:
    roman-numerals, hacker
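
The naming convention above can be checked mechanically. What follows is a minimal Python sketch, not part of SCOWL: the function name parse_list_name and the ListName tuple are hypothetical, and the category sets are copied from the new side of this hunk. It relies on the observation that spelling categories use underscores, so the first hyphen separates the spelling category from the sub-category (which may itself contain a hyphen, as in "proper-names"). Special word lists are deliberately rejected because their format is not shown here.

```python
from typing import NamedTuple

# Category names copied from the README text above.
SPELLING_CATEGORIES = {
    "english", "american", "british", "british_z", "canadian", "australian",
    "variant_1", "variant_2", "variant_3",
    "british_variant_1", "british_variant_2",
    "canadian_variant_1", "canadian_variant_2",
    "australian_variant_1", "australian_variant_2",
}
SUB_CATEGORIES = {"abbreviations", "contractions", "proper-names", "upper", "words"}


class ListName(NamedTuple):
    """Hypothetical container for the three parts of a word list file name."""
    spelling: str
    sub_category: str
    size: int


def parse_list_name(name: str) -> ListName:
    """Split '<spelling category>-<sub-category>.<size>' into its parts."""
    # The size follows the last dot; the spelling category ends at the first hyphen.
    stem, dot, size = name.rpartition(".")
    spelling, dash, sub_category = stem.partition("-")
    if not dot or not dash or not size.isdigit():
        raise ValueError(f"not a word list name: {name!r}")
    if spelling not in SPELLING_CATEGORIES:
        raise ValueError(f"unknown spelling category: {spelling!r}")
    if sub_category not in SUB_CATEGORIES:
        raise ValueError(f"unknown sub-category: {sub_category!r}")
    return ListName(spelling, sub_category, int(size))
```

For example, parse_list_name("english-proper-names.50") yields the spelling category "english", the sub-category "proper-names" and the size 50.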
  
-When combining the words lists the "english" spelling category should
-be used as well as one of "american", "british", "british_z" (british
-with ize spelling), or "canadian".  Great care has been taken so that
-that only one spelling for any particular word is included in the main
-list.  When two variants were considered equal I randomly picked one
-for inclusion in the main word list.  Unfortunately this means that my
-choice in how to spell a word may not match your choice.  If this is
-the case you can try including the "variant_0" spelling category which
+The Perl script "mk-list" can be used to create a word list of the
+desired size.  Its usage is:
+  ./mk-list [-f] [-v#] <spelling categories> <size>
+where <spelling categories> is one of the above spelling categories
+(the english and special categories are automatically included as well
+as all sub-categories) and <size> is the desired size.  The
+"-v" option can be used to also include the appropriate
+variants file up to level '#'.  The normal output will be a sorted
+word list.  If you would rather see what files will be included, use the
+"-f" option.
+
+When manually combining the word lists the "english" spelling
+category should be used as well as one of "american", "british",
+"british_z" (british with ize spelling), "canadian" or "australian".
+Great care has been taken so that only one spelling for any particular
+word is included in the main list (with some minor exceptions).  When
+two variants were considered equal I randomly picked one for inclusion
+in the main word list.  Unfortunately this means that my choice in how
+to spell a word may not match your choice.  If this is the case you
+can try including one of the "variant_1" spelling categories which
  includes most variants which are considered almost equal.  The
-"variant_1" spelling category include variants which are also
-generally considered acceptable, and "variant_2" contains variants
-which are seldom used.
+"variant_1" spelling category corresponds mostly to American variants,
+while the "british_variant_1", "canadian_variant_1" and
+"australian_variant_1" are for British, Canadian and Australian
+variants, respectively.  The "variant_2" spelling categories include
+variants which are also generally considered acceptable, and
+"variant_3" contains variants which are seldom used and may not even
+be considered correct.  There is no "british_variant_3",
+"canadian_variant_3" or "australian_variant_3" spelling category since
+the distinction would be almost meaningless.
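
The manual combination described above can be sketched in Python. This is an illustration only and does not reproduce what mk-list actually does; the function name combine_lists is hypothetical. It assumes that the files in final/ follow the naming convention given earlier, that each file holds one word per line, that a file contains only the words first appearing at its size (so a list of a given size is the union of all files up to that size), and that the lists are encoded in ISO-8859-1 as this README states. The caller names every spelling category to include, for example "english", "american" and optionally "variant_1".

```python
from pathlib import Path


def combine_lists(final_dir, spellings, max_size, encoding="iso-8859-1"):
    """Return a sorted, de-duplicated word list from the selected files.

    final_dir -- directory holding files named <spelling>-<sub-category>.<size>
    spellings -- spelling categories to include, e.g. ["english", "american"]
    max_size  -- largest size to include; smaller sizes are included as well
    """
    wanted = set(spellings)
    words = set()
    for path in Path(final_dir).iterdir():
        if not path.is_file():
            continue
        # The size follows the last dot; the spelling category ends at the first hyphen.
        stem, dot, size = path.name.rpartition(".")
        spelling, dash, _sub_category = stem.partition("-")
        if not (dot and dash and size.isdigit()):
            continue  # not a file that follows the naming convention
        if spelling in wanted and int(size) <= max_size:
            text = path.read_text(encoding=encoding)
            words.update(line.strip() for line in text.splitlines() if line.strip())
    return sorted(words)
```

A call such as combine_lists("final", ["english", "american"], 60) would then gather every sub-category of both spelling categories up to size 60.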
  
  The "abbreviation" category includes abbreviations and acronyms which
  are not also normal words. The "contractions" category should be self
  explanatory. The "upper" category includes upper case words and proper
  names which are common enough to appear in a typical dictionary. The
-"proper-names" category included all the additional uppercase words.
-Final the "words" category contains all the normal English words.
+"proper-names" category includes all the additional uppercase words.
+Finally the "words" category contains all the normal English words.
  
  To give you an idea of what the words in the various sizes look like
  here is a sample of 25 random words found only in that size:
  
-10: began both buffer cause collection content documenting easiest
-    equally examines expecting first firstly hence inclining
-    irrelevant justified little logs necessarily ought sadly six
-    thing visible
-
-20: chunks commodity contempt contexts cruelty crush dictatorship
-    disgusted dose elementary evolved frog god hordes notion overdraft
-    overlong overlook phoning poster recordings sand skull substituted
-    throughput
-
-35: aliasing blackouts blowout bluntness corroborated derrick
-    dredging elopements entrancing excising fellowship flagpole
-    germination glimpse gondola guidebook madams minimalism minnows
-    partisans petitions shelling swarmed throng welding
-
-40: altercation blender castigation chump coffeehouse determiners
-    doggoning exhibitor finders flophouse gazebo lumbering masochism
-    mopeds poetically pubic refinance reggae scragglier softhearted
-    stubbornness teargassed township underclassman whoosh
-
-50: accumulative adulterant allegorically amorousness astrophysics
-    camphor coif dickey elusiveness enviousness fakers fetishistic
-    flippantly headsets liefs midyears myna pacification persiflage
-    phosphoric pinhole sappy seres unrealistically unworldly
-
-55: becquerel brickie centralist cine conveyancing courgette
-    disarmingly garçon gobstopper infilling insipidity
-    internationalist kabuki lyrebirds obscurantism rejigged
-    revisionist satsuma slapper sozzled sublieutenants teletext vino
-    wellness wracking
-
-60: absorber acceptableness adventurousness antifascists arrhythmia
-    audiology cartage cruses fontanel forelimbs granter hairlike
-    installers jugglery lappets libbers mandrels micrometeorite
-    mineshaft reconsecrates saccharides smellable spavined sud timbrel
-
-70: atomisms benedict carven coxa cyanite detraining diazonium
-    dogberry dogmatics entresol fatherlessnesses firestone imprecator
-    laterality legitimisms maxwell microfloppies nonteaching pelerine
-    pentane pestiferousness piscator profascist tusche twirp
-
-80: cotransfers embrangled forkednesses giftwrapped globosity hatpegs
-    hepsters hermitess interspecific inurbanities lamiae
-    literaehumaniores literatures masulas misbegun plook prerupt
-    quaalude rosanilin sabbatism scowder subreptive thumbstalls
-    understrata yakows
-
-95: anatropal anientise bakshi brouzes corsie daimiote dhaw dislikened
-    ectoretina fortuitisms guardeen hyperlithuria nonanachronistic
-    overacceleration pamphletic parma phytolith starvedly
-    trophoplasmic ulorrhagia undared undertide unplunderously
-    unworkmanly vasoepididymostomy
-
-And here is a rough count on the number of words in the "english"
-spelling category for each size:
-
-  Size  Words   Proper Names  Running Total 
-
-   10    5,000                    5,000                
-   20    8,700                   14,000
-   35   34,500         200       48,000
-   40    6,000         500       55,000
-   50   23,200      17,200       95,000
-   55    7,500                  103,000
-   60   16,000      12,800      132,000
-   70   45,100      34,300      211,000
-   80  137,000      30,400      379,000
-   95  198,000      51,800      628,000
-
-(The "Words" column does not include the proper name count.)
+10: attempt base borrows clever cold concerned contribution decide deletes
+    easiest inclined mine natural obviously opportunity organized pain
+    potential signed significance standing survey this training trick 
+
+20: brave comma confining conviction delicious embedding enlarging equations
+    era farmer flip frustrates keystrokes officers peoples personalities
+    principally restarts revert risks singular sneaky stealing sweep
+    traditionally 
+
+35: bantered barrens bronzing chisel debtors doorstep earache elaborating
+    expressly glistened humping joyfully leashes lofting logician obsessions
+    paralytics pillowed portrayals pruned rarities reconfigured scrupulous
+    tempos uncommoner 
+
+40: astrologer bestsellers busboys childproofed clapboarded crispiest
+    embroiling enfranchises enthused exorcists firebrand gringo irresponsibly
+    matchstick missteps oinks pocketfuls reinventing scorecard streetlights
+    temped turncoat voyeur warmongering wimps 
+
+50: apologias assay biochemists brashness brattier councilman detainees
+    discontentedly ethnology evincing excoriation halberd housemothers
+    humdinger moraines permutes pilaf purebred putsch quadrature
+    secularization skyjacking snowsuit transmuted zeppelins 
+
+55: articulacy bookbinders chapati faffing gunge hotpots hurtfulness innit
+    kaleidoscopically leching megastars ockers paperclips pedestrianization
+    peeler plainsong rand righto stationmasters sundecks tossers triathletes
+    turbocharges twitchiest yobs 
+
+60: allurements bespangle centripetally dashers eclogue estoppel ethologist
+    gleaners gratingly imputable jobholder mendicancy minnesingers muscats
+    nontransparent nosher obtrusion parasympathetics patroons
+    phosphorescently reforging reintegrate stringiness transecting vixenishly 
+
+70: animalisms bestializing blague chlorpromazine decury dolmans ecclesiology
+    hymnody incommutable listers lucubrator methodic mizenmasts monochord
+    natality ninepence pyrogenic rath sabayons serenata shitwork superlunary
+    talapoin unresigned whickered 
+
+80: batatas diapente discipled doofuses faintheartednesses geophagous gooky
+    grandeurs hypesthesic kagouls mandataries minimalized operettists
+    pseudoephedrine readvertizing rumblegumption sabermetrics scritches
+    sextonship simuliums superspectaculars thickoes tripersonalism unmoneyed
+    whinstones 
+
+95: adalat afdecho basirhinal crossopodia decalomania earthmaker gaudeamuses
+    guanayes haemodoraceous hardsalt heterostrophies kadikane mastoidale
+    misconceited osteoarthrotomy perpetuant photolyte querulation
+    splenonephric storymaker thrangity turgider unquailingly unthriftlike
+    wirrah 
+
+
+And here is a count of the number of words at each size for the
+american and english spelling categories combined:
+
+  Size   Words       Names    Running Total  %
+   10    4,425          13        4,438     0.7
+   20    8,126           0       12,564     1.9
+   35   37,260         220       50,044     7.6
+   40    6,858         489       57,391     8.7
+   50   25,289      18,683      101,363    15.4
+   55    6,487           0      107,850    16.4
+   60   14,551         850      123,251    18.7
+   70   35,294       7,897      166,442    25.3
+   80  144,158      33,368      343,968    52.3
+   95  227,633      86,630      658,231   100.0
+
+
+(The "Words" column does not include the name count.)
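
The "Running Total" and "%" columns in the new table follow from the "Words" and "Names" columns: each running total adds the words and names of that size to the previous total, and the percentage is that total divided by the size 95 total. A short Python check of this arithmetic, using the figures from the table (the names COUNTS and running_totals are chosen here for illustration):

```python
# (size, words, names) copied from the table above.
COUNTS = [
    (10, 4425, 13),
    (20, 8126, 0),
    (35, 37260, 220),
    (40, 6858, 489),
    (50, 25289, 18683),
    (55, 6487, 0),
    (60, 14551, 850),
    (70, 35294, 7897),
    (80, 144158, 33368),
    (95, 227633, 86630),
]


def running_totals(counts):
    """Return (size, running total, percent of grand total) for each row."""
    total = 0
    cumulative = []
    for size, words, names in counts:
        total += words + names
        cumulative.append((size, total))
    grand_total = total
    return [(size, t, round(100 * t / grand_total, 1)) for size, t in cumulative]


for size, total, percent in running_totals(COUNTS):
    print(f"{size:>4} {total:>9,} {percent:>6.1f}")
```

The computed totals and percentages agree with the table, for example 123,251 words (18.7%) at the recommended size 60.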
  
  Size 35 is the recommended small size, 50 the medium and 70 the large.
  Sizes 70 and below contain words found in most dictionaries while the
  80 size contains all the strange and unusual words people like to use
-in word games such as Scrabble (TM).  While a lot of the the words in
-the 80 size are not used very often, they are all generally considered
+in word games such as Scrabble (TM).  While a lot of the words in the
+80 size are not used very often, they are all generally considered
  valid words in the English language.  The 95 contains just about every
  English word in existence and then some.  Many of the words at the 95
-level will probally not be considered valid english words by most
-people.  I don't recommend anyone use levels above 70 for spell
-checking as they contain rarely used words which can hide misspellings
-of similar more commonly used words.  For example the word "ort" can
-hide a common typo of "or".  No one should need to use a size larger
-than 80, the 95 size is labeled insane for a reason.
+level will probably not be considered valid English words by most
+people.
+
+For spell checking I recommend using size 60.  This size is the
+largest size that I am fairly confident does not contain any
+misspellings or invalid words.  In addition an effort is made to
+exclude valid yet problematic words (such as "calender") from the 60
+size that are likely to be a misspelling of a more common word.  The
+70 size is reasonable for those who want a larger list and don't mind a
+few errors.  The 80 or larger sizes are not reasonable for spell
+checking.
  
-Accents are present on certain words such as café in iso8859-1 format.
+Accents are present on certain words such as café in iso8859-1 format.
  
  CHANGES:
  
+From Version 2019.10.06 to 2020.12.07
+
+  Various new words.
+
+  Variant cleanups.
+
+  Bump irregardless, froward (+ derivatives) and perpend to level 70.
+
+From Version 2018.04.16 to 2019.10.06
+
+  Various new words.
+
+  Remove compare's and fail's.
+
+From Version 2017.08.24 to 2018.04.16
+
+  Various new words.
+
+  Fix build problems on macOS.
+
+From Version 2017.01.22 to 2017.08.24
+
+  Various new words.
+
+From Version 2016.11.20 to 2017.01.22
+
+  Various new words.
+
+From Version 2016.06.26 to 2016.11.20
+
+  New Australian spelling category thanks to the work of Benjamin
+  Titze (btitze@protonmail.ch)
+
+  Various new words.
+
+From Version 2016.01.19 to 2016.06.26
+
+  Various new words.
+
+  Updated to Version 6.0.2 of 12dicts
+
+  Other minor changes.
+
+From Version 2015.08.24 to 2016.01.19
+
+  Various new words.
+
+  Clarified README to indicate why the 60 size is the preferred size
+  for spell checking.
+
+  Remove some very uncommon possessive forms.
+
+  Change "SET UTF8" to "SET UTF-8" in hunspell affix file.
+
+From Version 2015.05.18 to 2015.08.24 (Aug 24, 2015)
+
+  Various new words.
+
+From Version 2015.04.24 to 2015.05.18 (May 18, 2015)
+
+  Added some new words found to have a high frequency in the COCA
+  corpus.  (http://corpus.byu.edu/coca/).
+
+  Fix en spelling suggestions for 'alot' and 'exersize' in hunspell
+  dictionary (upstreamed from the changes made in Firefox).
+
+From Version 2015.02.15 to 2015.04.24 (April 24, 2015)
+
+  Added some new words.
+
+  Convert hunspell dictionary to UTF-8 in order to handle smart
+  quotes correctly.
+
+From Version 2015.01.28 to 2015.02.15 (February 15, 2015)
+
+  Added a large number of neologisms (newly invented words)
+  such as "selfie" and "smartwatch" thanks to Alan Beale.
+
+  Various other new words.
+
+  Clean up the special-hacker category by removing some words that
+  didn't exist in the Google Book's Corpus (1980 - 2008) and
+  originated from the "Unofficial Jargon File Word Lists".
+
+From Version 2014.11.17 to 2015.01.28 (January 28, 2015)
+
+  Various new words, many from analyzing the Google Book's Corpus
+  (1980 - 2008).  See http://app.aspell.net/lookup-freq.
+
+  Moved some uncommon words that can easily hide a misspelling of a
+  more common word to level 70.  (calender, adrenalin and Joesph)
+
+  Removed several -er and -est forms from adjectives that were so
+  uncommon that they were not found anywhere in the Google Book's
+  Corpus (1980 - 2008).
+
+From Version 2014.08.11.1 to 2014.11.17 (November 17, 2014)
+
+  Various new words.
+
+  Fix typo in Hunspell readme.
+
+From Version 2014.08.11 to 2014.08.11.1 (August 13, 2014)
+
+  Forgot to mention this important change from 7.1 to 2014.08.11:
+
+    Shifted the variant levels up by one: variant_0 is now variant_1,
+    variant_1 is now variant_2, and variant_2 is now variant_3.
+
+  Other minor fixes in this README.
+
+  No changes to the contents of the lists.
+
+From Revision 7.1 to Version 2014.08.11 (August 11, 2014)
+
+  Added some missing possessive forms.
+
+  Added some new words and proper names.
+
+  Clean up the categories (words, upper, proper-names etc) so that they
+  are more accurate.
+
+  Convert documentation to UTF-8.  For now, the word lists are still in
+  ISO-8859-1 to prevent compatibility problems.
+
+  Add schema and scripts for creating a SQLite database from SCOWL.
+  Add some utility and library functions using them.  This database is
+  used by the new web apps (http://app.aspell.net/lookup & create).
+
+  Enhance speller/make-hunspell-dict.  The biggest improvement is that
+  it now generates several more dictionaries in addition to
+  the official ones.  These additional dictionaries are ones for
+  British English and larger dictionaries that include up to SCOWL
+  size 70.
+
+From Revision 7 to 7.1 (January 6, 2011)
+
+  Updated to revision 5.1 of Varcon which corrected several errors.
+
+  Fixed various problems with the variant processing which corrected a
+  few more errors.
+
+  Added several now common proper names and some other words now
+  in common use.
+
+  Include misc/ and speller/ directory which were in SVN but left
+  out of the release tarball.
+
+  Other minor fixes, including some fixes to the taboo word lists.
+
+From Revision 6 to 7 (December 27, 2010)
+
+  Updated to revision 5.0 of Varcon which corrected many errors,
+  especially in the British and Canadian spelling categories.  Also
+  added new spelling categories for the British and Canadian spelling
+  variants and separated them out from the main variant_* categories.
+  
+  Moved Moby names lists (3897male.nam 4946fema.len 21986na.mes) to 95
+  level since they contain too many errors and rare names.
+
+  Moved frequency class 0 from Brian Kelk's Wordlist from
+  level 60 to 70, and also filtered it with level 80 due to too many
+  misspellings.
+
+  Many other minor fixes.
+
  From Revision 5 to 6 (August 10, 2004)
  
    Updated to version 4.0 of the 12dicts package.
@@ -163,7 +357,7 @@ From Revision 5 to 6 (August 10, 2004)
  
    Updated to version 4.1 of VarCon.
  
-  Added the "british_z" spelling category which it British using the
+  Added the "british_z" spelling category which is British using the
    "ize" spelling.
  
  From Revision 4a to 5 (January 3, 2002)
@@ -174,7 +368,7 @@ From Revision 4a to 5 (January 3, 2002)
    Fixed a bug which caused variants of words to incorrectly appear in
    the non-variant lists.
  
-  Moved rarly used inflections of a word into higher number lists.
+  Moved rarely used inflections of a word into higher number lists.
  
    Added other inflections of a words based on the following criteria
      If the word is in the base form: only include that word.
@@ -186,7 +380,7 @@ From Revision 4a to 5 (January 3, 2002)
    Updated to the latest version of many of the source dictionaries.
  
    Removed the DEC Word List due to the questionable licence and
-  because removing it will not seriously decrese the quality of SCOWL
+  because removing it will not seriously decrease the quality of SCOWL
    (there are a few less proper names).  
  
  From Revision 4 to 4a (April 4, 2001)
@@ -201,13 +395,13 @@ From Revision 3 to 4 (January 28, 2001)
    Added words in the Ispell word list at the 65 level.
  
    Other changes due to using more recent versions of various sources
-  included a more accurete version of AGID thanks to the word of
+  included a more accurate version of AGID thanks to the work of
    Alan Beale
  
  From Revision 2 to 3 (August 18, 2000)
  
    Renamed special-unix-terms to special-hacker and added a large
-  number of communly used words within the hacker (not cracker)
+  number of commonly used words within the hacker (not cracker)
    community.
  
    Added a couple more signature words including "newbie".
@@ -232,10 +426,10 @@ From Revision 1 to 2 (August 5, 2000)
  
  COPYRIGHT, SOURCES, and CREDITS:
  
-The collective work is Copyright 2000-2004 by Kevin Atkinson as well
+The collective work is Copyright 2000-2018 by Kevin Atkinson as well
  as any of the copyrights mentioned below:
  
-  Copyright 2000-2004 by Kevin Atkinson
+  Copyright 2000-2018 by Kevin Atkinson
  
    Permission to use, copy, modify, distribute and sell these word
    lists, the associated scripts, the output created from the scripts,
@@ -346,7 +540,7 @@ The 40 level includes words from Alan's 3esl list found in version 4.0
  of his 12dicts package.  Like his other stuff the 3esl list is also in the
  public domain.
  
-The 50 level includes Brian's frequency class 1, words words appearing
+The 50 level includes Brian's frequency class 1, words appearing
  in at least 5 of 12 of the dictionaries as indicated in the 12Dicts
  package, and uppercase words in at least 4 of the previous 12
  dictionaries.  A decent number of proper names is also included: The
@@ -368,20 +562,18 @@ The 55 level includes words from Alan's 2of4brif list found in version
  4.0 of his 12dicts package.  Like his other stuff the 2of4brif is also
  in the public domain.
  
-The 60 level includes Brian's frequency class 0 and all words
-appearing in at least 2 of the 12 dictionaries as indicated by the
-12Dicts package.  A large number of names are also included: The 4,946
-female names and the 3,897 male names from the MWords package.
+The 60 level includes all words appearing in at least 2 of the 12
+dictionaries as indicated by the 12Dicts package.
  
-The 70 level includes the 74,550 common dictionary words and the
-21,986 names list from the MWords package The common dictionary words,
+The 70 level includes Brian's frequency class 0 and the 74,550 common
+dictionary words from the MWords package.  The common dictionary words,
  like those from the 12Dicts package, have had all likely inflections
  added.  The 70 level also included the 5desk list from version 4.0 of
-the 12Dics package which is the public domain
+the 12Dicts package which is in the public domain.
  
  The 80 level includes the ENABLE word list, all the lists in the
  ENABLE supplement package (except for ABLE), the "UK Advanced Cryptics
-Dictionary" (UKACD), the list of signature words in from YAWL package,
+Dictionary" (UKACD), the list of signature words from the YAWL package,
  and the 10,196 places list from the MWords package.
  
  The ENABLE package, maintained by M\Cooper <thegrendel@theriver.com>,
@@ -417,18 +609,38 @@ following copyright:
    There are no other restrictions: I would like to see the list
    distributed as widely as possible.
  
-The 95 level includes the 354,984 single words and 256,772 compound
-words from the MWords package, ABLE.LST from the ENABLE Supplement,
-and some additional words found in my part-of-speech database that
-were not found anywhere else.
+The 95 level includes the 354,984 single words, 256,772 compound
+words, 4,946 female names and the 3,897 male names, and 21,986 names
+from the MWords package, ABLE.LST from the ENABLE Supplement, and some
+additional words found in my part-of-speech database that were not
+found anywhere else.
  
  Accent information was taken from UKACD.
  
-My VARCON package was used to create the American, British, and
-Canadian word list. 
+The VarCon package was used to create the American, British, Canadian,
+and Australian word list.  It is under the following copyright:
+
+  Copyright 2000-2016 by Kevin Atkinson
+
+  Permission to use, copy, modify, distribute and sell this array, the
+  associated software, and its documentation for any purpose is hereby
+  granted without fee, provided that the above copyright notice appears
+  in all copies and that both that copyright notice and this permission
+  notice appear in supporting documentation. Kevin Atkinson makes no
+  representations about the suitability of this array for any
+  purpose. It is provided "as is" without express or implied warranty.
+
+  Copyright 2016 by Benjamin Titze
+
+  Permission to use, copy, modify, distribute and sell this array, the
+  associated software, and its documentation for any purpose is hereby
+  granted without fee, provided that the above copyright notice appears
+  in all copies and that both that copyright notice and this permission
+  notice appear in supporting documentation. Benjamin Titze makes no
+  representations about the suitability of this array for any
+  purpose. It is provided "as is" without express or implied warranty.
  
-Since the original word lists used used in the VARCON package came
-from the Ispell distribution they are under the Ispell copyright:
+  Since the original words lists come from the Ispell distribution:
  
    Copyright 1993, Geoff Kuenning, Granada Hills, CA
    All rights reserved.
@@ -451,18 +663,18 @@ from the Ispell distribution they are under the Ispell copyright:
       products derived from this software without specific prior
       written permission.
  
-  THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS
-  IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
-  FOR A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL GEOFF
-  KUENNING OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
-  INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
-  BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-  LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
-  CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
-  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
-  ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-  POSSIBILITY OF SUCH DAMAGE.
+  THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
+  ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+  IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+  ARE DISCLAIMED.  IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
+  FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+  DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+  OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+  HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+  LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+  OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+  SUCH DAMAGE.
+
  
  The variant word lists were created from a list of variants found in
  the 12dicts supplement package as well as a list of variants I created
@@ -473,40 +685,49 @@ appropriate directory under the r/ directory.
  
  FUTURE PLANS:
  
-There is a very nice frequency analyse of the BNC corpus done by
-Adam Kilgarriff.  Unlike Brain's word lists the BNC lists include part
-of speech information.  I plan on somehow using these lists as Adam
-Kilgarriff has given me the OK to use it in SCOWL.  These lists will
-greatly reduce the problem of inflected forms of a word appearing at
-different levels due to the part-of-speech information.
+The process of "sort"s, "comm"s, and Perl scripts to combine the many
+word lists and separate out the variant information is inexact and
+error prone.  The whole thing needs to be rewritten to deal with
+words in terms of lemmas.  When the exact lemma is not known a best
+guess should be made.  I'm not sure what form this should be in.  I
+originally thought this should be some sort of database, but maybe I
+should just slurp all that data into memory and process it in one
+giant perl script.  With the amount of memory available these days (at
+least 2 GB, often 4 GB or more) this should not really be a problem.
+
+In addition, there is a very nice frequency analysis of the BNC corpus
+done by Adam Kilgarriff.  Unlike Brian's word lists the BNC lists
+include part of speech information.  I plan on somehow using these
+lists as Adam Kilgarriff has given me the OK to use it in SCOWL.
+These lists will greatly reduce the problem of inflected forms of a
+word appearing at different levels due to the part-of-speech
+information.
  
-I also plan on perhaps putting the data in a database and use SQL
-queries to create the wordlists instead of tons of "sort"s, "comm"s,
-and Perl scripts.
+There is frequency information for some other corpora such as COCA
+(Corpus of Contemporary American English) and ANC (American National
+Corpus) which I might also be able to use.  The former will require
+permission, and the latter is of questionable quality.
  
  RECREATING THE WORD LISTS:
  
  In order to recreate the word lists you need a modern version of Perl,
  bash, the traditional set of shell utilities, a system that supports
-symbolic links, and quite possibly GNU Make.  Once you have downloaded
-all the necessary raw data in the r/ directory you should be able to
-type "rm final/* && make all" and the word lists in the final/
-directory should be recreated.  If you have any problems fell free to
-contact me; however, unless you are interested in improving the
-scripts used, I will likely ignore you as there should be little need
-for anyone not interested in improving the word list to do so.
+symbolic links, and quite possibly GNU Make.  The easiest way to
+recreate the word lists is to check out the corresponding Git version
+(see the version string at the start of the file) and simply type
+"make" (see http://wordlist.aspell.net).  You can try to download all
+the pieces manually, but this method is no longer tested or
+supported.
  
  The src/ directory contains the numerous scripts used in the creation
  of the final product. 
  
-The r/ directory contains the raw data used to
-create the final product.  In order for the scripts to work various
-word lists and databases need to be created and put into this
-directory.  See the README file in the r/ directory for more
-information.
+The r/ directory contains the raw data used to create the final
+product.  If you check out from Git this directory should be populated
+automatically for you.  If you insist on doing it the hard way see the
+README file in the r/ directory for more information.
  
  The l/ directory contains symbolic links used by the actual scripts.
  
  Finally, the working/ directory is where all the intermediate files go
  that are not specific to one source.
-