X-Git-Url: https://git.donarmstrong.com/?a=blobdiff_plain;f=README.in;fp=README.in;h=a6319301c5e79534a060420b8451e303ef99c2ed;hb=b13ea8a082364672c6de2b010e558211ff52ec9a;hp=0000000000000000000000000000000000000000;hpb=01534a94130c1f5a3a230cf4fe18365a235ba271;p=deb_pkgs%2Fscowl.git

diff --git a/README.in b/README.in
new file mode 100644
index 0000000..a631930
--- /dev/null
+++ b/README.in
@@ -0,0 +1,371 @@
+Spell Checking Oriented Word Lists (SCOWL)
+@`if [ "$SCOWL_VERSION" ]; then echo -n "Version $SCOWL_VERSION"; fi`
+@`git log --pretty=format:'%cd [%h]' -n 1 --`
+by Kevin Atkinson (kevina@gnu.org)
+
+The SCOWL is a collection of word lists split up in various sizes, and
+other categories, intended to be suitable for use in spell checkers.
+However, I am sure it will have numerous other uses as well.
+
+The latest version can be found at http://wordlist.aspell.net/.
+
+The directory final/ contains the actual word lists broken up into
+various sizes and categories. The r/ directory contains Readmes from
+the various sources used to create this package.
+
+The misc/ directory contains a small list of taboo words; see the
+README file for more info. The speller/ directory contains scripts for
+creating spelling dictionaries for Aspell and Hunspell.
+
+The other directories contain the necessary information to recreate the
+word lists from the raw data. Unless you are interested in improving the
+word lists you should not need to worry about what's here. See the
+section on recreating the word lists for more information on what's
+there.
+
+Except for the special word lists, the files follow this
+naming convention:
+  <spelling category>-<sub-category>.<size>
+Where the spelling category is one of
+  english, american, british, british_z, canadian,
+  variant_1, variant_2, variant_3,
+  british_variant_1, british_variant_2,
+  canadian_variant_1, canadian_variant_2,
+Sub-category is one of
+  abbreviations, contractions, proper-names, upper, words
+And size is one of
+  10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large),
+  80 (huge), 95 (insane)
+The special word lists are in the following format:
+  special-<description>.<size>
+Where description is one of:
+  roman-numerals, hacker
+
+The Perl script "mk-list" can be used to create a word list of the
+desired size. Its usage is:
+  ./mk-list [-f] [-v#] <spelling categories> <size>
+where <spelling categories> is one of the above spelling categories
+(the english and special categories are automatically included as well
+as all sub-categories) and <size> is the desired size. The
+"-v" option can be used to also include the appropriate
+variants file up to level '#'. The normal output will be a sorted
+word list. If you would rather see which files will be included, use
+the "-f" option.
+
+When manually combining the word lists the "english" spelling
+category should be used as well as one of "american", "british",
+"british_z" (british with ize spelling), or "canadian". Great care
+has been taken so that only one spelling for any particular word
+is included in the main list (with some minor exceptions). When two
+variants were considered equal I randomly picked one for inclusion in
+the main word list. Unfortunately this means that my choice in how to
+spell a word may not match your choice. If this is the case you can
+try including one of the "variant_1" spelling categories which
+includes most variants which are considered almost equal. The
+"variant_1" spelling category corresponds mostly to American variants,
+while the "british_variant_1" and "canadian_variant_1" are for British
+and Canadian variants, respectively. The "variant_2" spelling
+categories include variants which are also generally considered
+acceptable, and "variant_3" contains variants which are seldom used
+and may not even be considered correct. There is no
+"british_variant_3" or "canadian_variant_3" spelling category since
+the distinction would be almost meaningless.
+
+The "abbreviations" category includes abbreviations and acronyms which
+are not also normal words. The "contractions" category should be self
+explanatory. The "upper" category includes upper case words and proper
+names which are common enough to appear in a typical dictionary. The
+"proper-names" category includes all the additional uppercase words.
+Finally the "words" category contains all the normal English words.
+
+To give you an idea of what the words in the various sizes look like
+here is a sample of 25 random words found only in that size:
+
+@`src/rand-samples | iconv -f iso-8859-1 -t utf-8`
+
+And here is a count of the number of words in each spelling category
+(american + english spelling category):
+
+@`src/count`
+
+(The "Words" column does not include the name count.)
+
+Size 35 is the recommended small size, 50 the medium and 70 the large.
+For spell checking I recommend using 60. Sizes 70 and below contain
+words found in most dictionaries while the 80 size contains all the
+strange and unusual words people like to use in word games such as
+Scrabble (TM). While a lot of the words in the 80 size are not
+used very often, they are all generally considered valid words in the
+English language. The 95 size contains just about every English word in
+existence and then some. Many of the words at the 95 level will
+probably not be considered valid English words by most people. I use
+the 60 size for the English dictionary for Aspell, and I don't
+recommend anyone use levels above 70 for spell checking. Levels above
+70 contain rarely used words which can hide misspellings of similar
+more commonly used words. For example the word "ort" can hide a
+common typo of "or". No one should need to use a size larger than 80;
+the 95 size is labeled insane for a reason.
+
+Accents are present on certain words such as café in iso8859-1 format.
+
+CHANGES:
+
+From Version 2015.05.18 to 2015.08.24 (August 24, 2015)
+
+  Various new words.
+
+From Version 2015.04.24 to 2015.05.18 (May 18, 2015)
+
+  Added some new words found to have a high frequency in the COCA
+  corpus (http://corpus.byu.edu/coca/).
+
+  Fix en spelling suggestions for 'alot' and 'exersize' in hunspell
+  dictionary (upstreamed from the changes made in Firefox).
+
+From Version 2015.02.15 to 2015.04.24 (April 24, 2015)
+
+  Added some new words.
+
+  Convert hunspell dictionary to UTF-8 in order to handle smart
+  quotes correctly.
+
+From Version 2015.01.28 to 2015.02.15 (February 15, 2015)
+
+  Added a large number of neologisms (newly invented words)
+  such as "selfie" and "smartwatch" thanks to Alan Beale.
+
+  Various other new words.
+
+  Clean up the special-hacker category by removing some words that
+  didn't exist in the Google Books Corpus (1980 - 2008) and
+  originated from the "Unofficial Jargon File Word Lists".
+
+From Version 2014.11.17 to 2015.01.28 (January 28, 2015)
+
+  Various new words, many from analyzing the Google Books Corpus
+  (1980 - 2008). See http://app.aspell.net/lookup-freq.
+
+  Moved some uncommon words that can easily hide a misspelling of a
+  more common word to level 70 (calender, adrenalin and Joesph).
+
+  Removed several -er and -est forms from adjectives that were so
+  uncommon that they were not found anywhere in the Google Books
+  Corpus (1980 - 2008).
+
+From Version 2014.08.11.1 to 2014.11.17 (November 17, 2014)
+
+  Various new words.
+
+  Fix typo in Hunspell readme.
+
+From Version 2014.08.11 to 2014.08.11.1 (August 13, 2014)
+
+  Forgot to mention this important change from 7.1 to 2014.08.11:
+
+    Shifted the variant levels up by one: variant_0 is now variant_1,
+    variant_1 is now variant_2, and variant_2 is now variant_3.
+
+  Other minor fixes in this README.
+
+  No changes to the contents of the lists.
+
+From Revision 7.1 to Version 2014.08.11 (August 11, 2014)
+
+  Added some missing possessive forms.
+
+  Added some new words and proper names.
+
+  Clean up the categories (words, upper, proper-names etc) so that they
+  are more accurate.
+
+  Convert documentation to UTF-8. For now, the word lists are still in
+  ISO-8859-1 to prevent compatibility problems.
+
+  Add schema and scripts for creating a SQLite database from SCOWL.
+  Add some utility and library functions using them. This database is
+  used by the new web apps (http://app.aspell.net/lookup & create).
+
+  Enhance speller/make-hunspell-dict. The biggest improvement is that
+  it now generates several more dictionaries in addition to
+  the official ones. These additional dictionaries are ones for
+  British English and larger dictionaries that include up to SCOWL
+  size 70.
+
+From Revision 7 to 7.1 (January 6, 2011)
+
+  Updated to revision 5.1 of VarCon which corrected several errors.
+
+  Fixed various problems with the variant processing which corrected a
+  few more errors.
+
+  Added several now common proper names and some other words now
+  in common use.
+
+  Include misc/ and speller/ directories which were in SVN but left
+  out of the release tarball.
+
+  Other minor fixes, including some fixes to the taboo word lists.
+
+From Revision 6 to 7 (December 27, 2010)
+
+  Updated to revision 5.0 of VarCon which corrected many errors,
+  especially in the British and Canadian spelling categories. Also
+  added new spelling categories for the British and Canadian spelling
+  variants and separated them out from the main variant_* categories.
+
+  Moved Moby names lists (3897male.nam 4946fema.len 21986na.mes) to the
+  95 level since they contain too many errors and rare names.
+
+  Moved frequency class 0 from Brian Kelk's Wordlist from
+  level 60 to 70, and also filter it with level 80 due to too many
+  misspellings.
+
+  Many other minor fixes.
+
+From Revision 5 to 6 (August 10, 2004)
+
+  Updated to version 4.0 of the 12dicts package.
+
+  Included the 3esl, 2of4brif, and 5desk lists from the new 12dicts
+  package. The 3esl was included in the 40 size, the 2of4brif in the
+  55 size and the 5desk in the 70 size.
+
+  Removed the Ispell word list as it was a source of too many errors.
+  This eliminated the 65 size.
+
+  Removed clause 4 from the Ispell copyright with permission of Geoff
+  Kuenning.
+
+  Updated to version 4.1 of VarCon.
+
+  Added the "british_z" spelling category which is British using the
+  "ize" spelling.
+
+From Revision 4a to 5 (January 3, 2002)
+
+  Added variants that were not really spelling variants (such as
+  forwards) back into the main list.
+
+  Fixed a bug which caused variants of words to incorrectly appear in
+  the non-variant lists.
+
+  Moved rarely used inflections of a word into higher number lists.
+
+From 7.1
+
+  Shifted the variant levels so that level 0 is now 1, level 1 now 2,
+  and level 2 now 3.
+
+  Added other inflections of a word based on the following criteria:
+    If the word is in the base form: only include that word.
+    If the word is in a plural form: include the base word and the plural.
+    If the word is a verb form (other than plural): include all verb forms.
+    If the word is an ad* form: include all ad* forms.
+    If the word is in a possessive form: also include the non-possessive.
+
+  Updated to the latest version of many of the source dictionaries.
+
+  Removed the DEC Word List due to the questionable licence and
+  because removing it will not seriously decrease the quality of SCOWL
+  (there are a few fewer proper names).
+
+From Revision 4 to 4a (April 4, 2001)
+
+  Reran the scripts on a newer version of AGID (3a) which fixes a bug
+  which caused some common words to be improperly marked as variants.
+
+From Revision 3 to 4 (January 28, 2001)
+
+  Split the variant "spelling category" up into 3 different levels.
+
+  Added words in the Ispell word list at the 65 level.
+
+  Other changes due to using more recent versions of various sources,
+  including a more accurate version of AGID thanks to the work of
+  Alan Beale.
+
+From Revision 2 to 3 (August 18, 2000)
+
+  Renamed special-unix-terms to special-hacker and added a large
+  number of commonly used words within the hacker (not cracker)
+  community.
+
+  Added a couple more signature words including "newbie".
+
+  Minor changes due to changes in the inflection database.
+
+From Revision 1 to 2 (August 5, 2000)
+
+  Moved the male and female name lists from the mwords package and the
+  DEC name lists from the 50 level to the 60 level and moved Alan's
+  name list from the 60 level to the 50 level. Also added the top
+  1000 male, female, and last names from the 1990 Census report to the
+  50 level. This reduced the number of names in the 50 level from
+  17,000 to 7,000.
+
+  Added a large number of Uppercase words to the 50 level.
+
+  Properly accented the possessive form of some words.
+
+  Minor other changes due to changes in my raw data files which have
+  not been released yet. Email me if you are interested in these files.
+
+COPYRIGHT, SOURCES, and CREDITS:
+
+@`cat Copyright`
+
+The variant word lists were created from a list of variants found in
+the 12dicts supplement package as well as a list of variants I created
+myself.
+
+The Readmes for the various packages used can be found in the
+appropriate directory under the r/ directory.
+
+FUTURE PLANS:
+
+The process of "sort"s, "comm"s, and Perl scripts to combine the many
+word lists and separate out the variant information is inexact and
+error prone. The whole thing needs to be rewritten to deal with
+words in terms of lemmas. When the exact lemma is not known a best
+guess should be made. I'm not sure what form this should be in. I
+originally thought this should be some sort of database, but maybe I
+should just slurp all that data into memory and process it in one
+giant Perl script. With the amount of memory available these days (at
+least 2 GB, often 4 GB or more) this should not really be a problem.
+
+In addition, there is a very nice frequency analysis of the BNC corpus
+done by Adam Kilgarriff. Unlike Brian's word lists the BNC lists
+include part of speech information. I plan on somehow using these
+lists as Adam Kilgarriff has given me the OK to use them in SCOWL.
+These lists will greatly reduce the problem of inflected forms of a
+word appearing at different levels due to the part-of-speech
+information.
+
+There is frequency information for some other corpora such as COCA
+(Corpus of Contemporary American English) and ANC (American National
+Corpus) which I might also be able to use. The former will require
+permission, and the latter is of questionable quality.
+
+RECREATING THE WORD LISTS:
+
+In order to recreate the word lists you need a modern version of Perl,
+bash, the traditional set of shell utilities, a system that supports
+symbolic links, and quite possibly GNU Make. The easiest way to
+recreate the word lists is to check out the corresponding Git version
+(see the version string at the start of the file) and simply type
+"make" (see http://wordlist.aspell.net). You can try to download all
+the pieces manually, but this method is no longer tested or
+supported.
+
+The src/ directory contains the numerous scripts used in the creation
+of the final product.
+
+The r/ directory contains the raw data used to create the final
+product. If you check out from Git this directory should be populated
+automatically for you. If you insist on doing it the hard way see the
+README file in the r/ directory for more information.
+
+The l/ directory contains symbolic links used by the actual scripts.
+
+Finally, the working/ directory is where all the intermediate files go
+that are not specific to one source.
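
Editor's illustration (not part of the diff above): the README describes manually combining the "english" spelling category with one regional category. Below is a minimal shell sketch of that idea, assuming the lists live in one directory such as final/, are named by spelling category, sub-category and size, and that each size file holds only the words new at that size. The function name scowl_combine is made up for this example and is not part of SCOWL. Unlike mk-list, the sketch ignores the special and variant lists.

```shell
#!/bin/sh
# Hypothetical helper, not part of SCOWL: print the combined, de-duplicated
# word list for one regional spelling category up to a maximum size.
# usage: scowl_combine DIR SPELLING MAXSIZE
scowl_combine() {
    sc_dir=$1 sc_spelling=$2 sc_max=$3
    for sc_file in "$sc_dir"/english-* "$sc_dir"/"$sc_spelling"-*; do
        [ -f "$sc_file" ] || continue      # skip unmatched glob patterns
        sc_size=${sc_file##*.}             # text after the last dot
        case $sc_size in
            ''|*[!0-9]*) continue ;;       # ignore files without a numeric size
        esac
        if [ "$sc_size" -le "$sc_max" ]; then
            cat "$sc_file"
        fi
    done | LC_ALL=C sort -u                # byte-wise sort; lists are ISO-8859-1
}

# Example (assumes the current directory is the top of the SCOWL tree):
#   scowl_combine final american 60 > american-60.txt
```

The byte-wise sort is a deliberate choice for this sketch: the lists are stored in ISO-8859-1, so a UTF-8 locale could trip over accented words. For real use, mk-list is likely the better tool since it also handles the special and variant files.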