Imported Upstream version 2015.08.24

[deb_pkgs/scowl.git] / README.in
diff --git a/README.in b/README.in

new file mode 100644 (file)

index 0000000..a631930
--- /dev/null
+++ b/README.in
@@ -0,0 +1,371 @@
+Spell Checking Oriented Word Lists (SCOWL)
+@`if [ "$SCOWL_VERSION" ]; then echo -n "Version $SCOWL_VERSION"; fi`
+@`git log --pretty=format:'%cd [%h]' -n 1 --`
+by Kevin Atkinson (kevina@gnu.org)
+
+The SCOWL is a collection of word lists split up in various sizes, and
+other categories, intended to be suitable for use in spell checkers.
+However, I am sure it will have numerous other uses as well.
+
+The latest version can be found at http://wordlist.aspell.net/.
+
+The directory final/ contains the actual word lists broken up into
+various sizes and categories.  The r/ directory contains Readmes from
+the various sources used to create this package.
+
+The misc/ contains a small list of taboo words, see the README file
+for more info.  The speller/ directory contains scripts for creating
+spelling dictionaries for Aspell and Hunspell.
+
+The other directories contain the necessary information to recreate the
+word lists from the raw data.  Unless you are interested in improving the
+words lists you should not need to worry about what's here.  See the
+section on recreating the words lists for more information on what's
+there.
+
+Except for the special word lists the files follow the following
+naming convention:
+  <spelling category>-<sub-category>.<size>
+Where the spelling category is one of
+  english, american, british, british_z, canadian, 
+  variant_1, variant_2, variant_3,
+  british_variant_1, british_variant_2,
+  canadian_variant_1, canadian_variant_2,
+Sub-category is one of
+  abbreviations, contractions, proper-names, upper, words
+And size is one of
+  10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large), 
+  80 (huge), 95 (insane)
+The special word lists follow are in the following format:
+  special-<description>.<size>
+Where description is one of:
+  roman-numerals, hacker
+
+The perl script "mk-list" can be used to create a word list of the
+desired size, its usage is:
+  ./mk-list [-f] [-v#] <spelling categories> <size>
+where <spelling categories> is one of the above spelling categories
+(the english and special categories are automatically included as well
+as all sub-categories) and <size> is the desired size.  The
+"-v" option can be used to also include the appropriate
+variants file up to level '#'.  The normal output will be a sorted
+word list.  If you rather see what files will be included, use the
+"-f" option.
+
+When manually combining the words lists the "english" spelling
+category should be used as well as one of "american", "british",
+"british_z" (british with ize spelling), or "canadian".  Great care
+has been taken so that only one spelling for any particular word
+is included in the main list (with some minor exceptions).  When two
+variants were considered equal I randomly picked one for inclusion in
+the main word list.  Unfortunately this means that my choice in how to
+spell a word may not match your choice.  If this is the case you can
+try including one of the "variant_1" spelling categories which
+includes most variants which are considered almost equal.  The
+"variant_1" spelling category corresponds mostly to American variants,
+while the "british_variant_1" and "canadian_variant_1" are for British
+and Canadian variants, respectively.  The "variant_2" spelling
+categories include variants which are also generally considered
+acceptable, and "variant_3" contains variants which are seldom used
+and may not even be considered correct.  There is no
+"british_variant_3" or "canadian_variant_3" spelling category since
+the distinction would be almost meaningless.
+
+The "abbreviation" category includes abbreviations and acronyms which
+are not also normal words. The "contractions" category should be self
+explanatory. The "upper" category includes upper case words and proper
+names which are common enough to appear in a typical dictionary. The
+"proper-names" category includes all the additional uppercase words.
+Finally the "words" category contains all the normal English words.
+
+To give you an idea of what the words in the various sizes look like
+here is a sample of 25 random words found only in that size:
+
+@`src/rand-samples | iconv -f iso-8859-1 -t utf-8`
+
+And here is a count on the number of words in each spelling category
+(american + english spelling category):
+
+@`src/count`
+
+(The "Words" column does not include the name count.)
+
+Size 35 is the recommended small size, 50 the medium and 70 the large.
+For spell checking I recommend using 60.  Sizes 70 and below contain
+words found in most dictionaries while the 80 size contains all the
+strange and unusual words people like to use in word games such as
+Scrabble (TM).  While a lot of the words in the 80 size are not
+used very often, they are all generally considered valid words in the
+English language.  The 95 contains just about every English word in
+existence and then some.  Many of the words at the 95 level will
+probably not be considered valid English words by most people.  I use
+the 60 size for the English dictionary for Aspell, and I don't
+recommend anyone use levels above 70 for spell checking.  Levels above
+70 contain rarely used words which can hide misspellings of similar
+more commonly used words.  For example the word "ort" can hide a
+common typo of "or".  No one should need to use a size larger than 80,
+the 95 size is labeled insane for a reason.
+
+Accents are present on certain words such as café in iso8859-1 format.
+
+CHANGES:
+
+From Version 2015.05.18 to 2015.08.24 (Aug 24, 2015)
+
+  Various new words.
+
+From Version 2015.04.24 to 2015.05.18 (May 18, 2015)
+
+  Added some new words found to have a high frequency in the COCA
+  corpus.  (http://corpus.byu.edu/coca/).
+
+  Fix en spelling suggestions for 'alot' and 'exersize' in hunspell
+  dictionary (upstreamed from the changes made in Firefox).
+
+From Version 2015.02.15 to 2015.04.24 (April 24, 2015)
+
+  Added some new words.
+
+  Convert hunspell dictionary to UTF-8 in order to handle smart
+  quotes correctly.
+
+From Version 2015.01.28 to 2015.02.15 (February 15, 2015)
+
+  Added a large number of neologisms (newly invented words)
+  such as "selfie" and "smartwatch" thanks to Alan Beale.
+
+  Various other new words.
+
+  Clean up the special-hacker category by removing some words that
+  didn't exist in the Google Book's Corpus (1980 - 2008) and
+  originated from the "Unofficial Jargon File Word Lists".
+
+From Version 2014.11.17 to 2015.01.28 (January 28, 2015)
+
+  Various new words, many from analyzing the Google Book's Corpus
+  (1980 - 2008).  See http://app.aspell.net/lookup-freq.
+
+  Moved some uncommon words that can easily hide a misspelling of a
+  more common word to level 70.  (calender, adrenalin and Joesph)
+
+  Removed several -er and -est forms from adjectives that were so
+  uncommon that they were not found anywhere is the Google Book's
+  Corpus (1980 - 2008).
+
+From Version 2014.08.11.1 to 2014.11.17 (November 17, 2014)
+
+  Various new words.
+
+  Fix typo in Hunspell readme.
+
+From Version 2014.08.11 to 2014.08.11.1 (August 13, 2014)
+
+  Forgot to mention this important change from 7.1 to 2014.08.11:
+
+    Shifted the variant levels up by one: variant_0 is now variant_1,
+    variant_1 is now variant_2, and variant_2 is now variant_3.
+
+  Other minor fixes in this README.
+
+  No changes to the contents of the lists.
+
+From Revision 7.1 to Version 2014.08.11 (August 11, 2014)
+
+  Added some missing possessive forms.
+
+  Added some new words and proper names.
+
+  Clean up the categories (words, upper, proper-names etc) so that they
+  are more accurate.
+
+  Convert documentation to UTF-8.  For now, the wordlist are still in
+  ISO-8859-1 to prevent compatibility problems.
+
+  Add schema and scripts for creating a SQLite database from SCOWL.
+  Add some utility and library functions using them.  This database is
+  used by the new web app's (http://app.aspel.net/lookup & create).
+
+  Enhance speller/make-hunspell-dict.  The biggest improvement is that
+  it that it now generates several more dictionaries in addition to
+  the official ones.  These additional dictionaries are ones for
+  British English and larger dictionaries that include up to SCOWL
+  size 70.
+
+From Revision 7 to 7.1 (January 6, 2011)
+
+  Updated to revision 5.1 of Varcon which corrected several errors.
+
+  Fixed various problems with the variant processing which corrected a
+  few more errors.
+
+  Added several now common proper names and some other words now
+  in common use.
+
+  Include misc/ and speller/ directory which were in SVN but left
+  out of the release tarball.
+
+  Other minor fixes, including some fixes to the taboo word lists.
+
+From Revision 6 to 7 (December 27, 2010)
+
+  Updated to revision 5.0 of Varcon which corrected many errors,
+  especially in the British and Canadian spelling categories.  Also
+  added new spelling categories for the British and Canadian spelling
+  variants and separated them out from the main variant_* categories.
+  
+  Moved Moby names lists (3897male.nam 4946fema.len 21986na.mes) to 95
+  level since they contain too many errors and rare names.
+
+  Moved frequently class 0 from Brian Kelk's Wordlist from 
+  level 60 to 70, and also filter it with level 80 due to, too many
+  misspellings.
+
+  Many other minor fixes.
+
+From Revision 5 to 6 (August 10, 2004)
+
+  Updated to version 4.0 of the 12dicts package.
+
+  Included the 3esl, 2of4brif, and 5desk list from the new 12dicts
+  package.  The 3esl was included in the 40 size, the 2of4brif in the
+  55 size and the 5desk in the 70 size.
+
+  Removed the Ispell word list as it was a source of too many errors.
+  This eliminated the 65 size.
+
+  Removed clause 4 from the Ispell copyright with permission of Geoff
+  Kuenning.
+
+  Updated to version 4.1 of VarCon.
+
+  Added the "british_z" spelling category which is British using the
+  "ize" spelling.
+
+From Revision 4a to 5 (January 3, 2002)
+
+  Added variants that were not really spelling variants (such as
+  forwards) back into the main list.
+
+  Fixed a bug which caused variants of words to incorrectly appear in
+  the non-variant lists.
+
+  Moved rarely used inflections of a word into higher number lists.
+
+From 7.1
+
+  Shifted the variant levels so that level 0 is now 1, level 1 now 2,
+  and level 2 now 3.
+
+  Added other inflections of a words based on the following criteria
+    If the word is in the base form: only include that word.
+    If the word is in a plural form: include the base word and the plural
+    If the word is a verb form (other than plural):  include all verb forms
+    If the word is an ad* form: include all ad* forms
+    If the word is in a possessive form: also include the non-possessive
+
+  Updated to the latest version of many of the source dictionaries.
+
+  Removed the DEC Word List due to the questionable licence and
+  because removing it will not seriously decrease the quality of SCOWL
+  (there are a few less proper names).  
+
+From Revision 4 to 4a (April 4, 2001)
+
+  Reran the scripts on a never version of AGID (3a) which fixes a bug
+  which caused some common words to be improperly marked as variants.
+
+From Revision 3 to 4 (January 28, 2001)
+
+  Split the variant "spelling category" up into 3 different levels.
+  
+  Added words in the Ispell word list at the 65 level.
+
+  Other changes due to using more recent versions of various sources
+  included a more accurate version of AGID thanks to the work of
+  Alan Beale
+
+From Revision 2 to 3 (August 18, 2000)
+
+  Renamed special-unix-terms to special-hacker and added a large
+  number of commonly used words within the hacker (not cracker)
+  community.
+
+  Added a couple more signature words including "newbie".
+
+  Minor changes due to changes in the inflection database.
+
+From Revision 1 to 2 (August 5, 2000)
+
+  Moved the male and female name lists from the mwords package and the
+  DEC name lists form the 50 level to the 60 level and moved Alan's
+  name list from the 60 level to the 50 level.  Also added the top
+  1000 male, female, and last names from the 1990 Census report to the
+  50 level.  This reduced the number of names in the 50 level from
+  17,000 to 7,000.
+
+  Added a large number of Uppercase words to the 50 level.
+
+  Properly accented the possessive form of some words.
+
+  Minor other changes due to changes in my raw data files which have
+  not been released yet.  Email if you are interested in these files.
+
+COPYRIGHT, SOURCES, and CREDITS:
+
+@`cat Copyright`
+
+The variant word lists were created from a list of variants found in
+the 12dicts supplement package as well as a list of variants I created
+myself.
+
+The Readmes for the various packages used can be found in the
+appropriate directory under the r/ directory.
+
+FUTURE PLANS:
+
+The process of "sort"s, "comm"s, and Perl scripts to combine the many
+word lists and separate out the variant information is inexact and
+error prone.  The whole things needs to be rewritten to deal with
+words in terms of lemmas.  When the exact lemma is not known a best
+guess should be made.  I'm not sure what form this should be in.  I
+originally thought this should be some sort of database, but maybe I
+should just slurp all that data into memory and process it in one
+giant perl script.  With the amount of memory available these days (at
+least 2 GB, often 4 GB or more) this should not really be a problem.
+
+In addition, there is a very nice frequency analyze of the BNC corpus
+done by Adam Kilgarriff.  Unlike Brian's word lists the BNC lists
+include part of speech information.  I plan on somehow using these
+lists as Adam Kilgarriff has given me the OK to use it in SCOWL.
+These lists will greatly reduce the problem of inflected forms of a
+word appearing at different levels due to the part-of-speech
+information.
+
+There is frequency information for some other corpus such as COCA
+(Corpus of Contemporary American English) and ANS (American National
+Corpus) which I might also be able to use.  The former will require
+permission, and the latter is of questionable quality.
+
+RECREATING THE WORD LISTS:
+
+In order to recreate the word lists you need a modern version of Perl,
+bash, the traditional set of shell utilities, a system that supports
+symbolic links, and quite possibly GNU Make.  The easiest way to
+recreate the word lists is to checkout the corresponding Git version
+(see the version string at the start of the file) and simply type
+"make" (see http://wordlist.aspell.net).  You can try to download all
+the pieces manually, but this method is not no longer tested nor
+supported.
+
+The src/ directory contains the numerous scripts used in the creation
+of the final product. 
+
+The r/ directory contains the raw data used to create the final
+product.  If you checkout from Git this directory should be populated
+automatically for you.  If you insist on doing it the hard way see the
+README file in the r/ directory for more information.
+
+The l/ directory contains symbolic links used by the actual scripts.
+
+Finally, the working/ directory is where all the intermittent files go
+that are not specific to one source.