1 Spell Checking Oriented Word Lists (SCOWL)
2 Revision 7.1 (SVN Revision 161)
4 by Kevin Atkinson (kevina@gnu.org)
6 The SCOWL is a collection of word lists split up in various sizes, and
7 other categories, intended to be suitable for use in spell checkers.
8 However, I am sure it will have numerous other uses as well.
10 The latest version can be found at http://wordlist.sourceforge.net/.
12 The directory final/ contains the actual word lists broken up into
13 various sizes and categories. The r/ directory contains Readmes from
14 the various sources used to create this package.
The misc/ directory contains a small list of taboo words; see the
README file for more info. The speller/ directory contains scripts for
creating spelling dictionaries for Aspell and Hunspell.
The other directories contain the information necessary to recreate the
word lists from the raw data. Unless you are interested in improving the
word lists you should not need to worry about what's here. See the
section on recreating the word lists for more information.
Except for the special word lists, the files are named in the following
format:
28 <spelling category>-<sub-category>.<size>
29 Where the spelling category is one of
30 english, american, british, british_z, canadian,
variant_0, variant_1, variant_2,
32 british_variant_0, british_variant_1,
canadian_variant_0, canadian_variant_1
34 Sub-category is one of
35 abbreviations, contractions, proper-names, upper, words
And size is one of
10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large),
80 (huge), 95 (insane)
The special word lists are in the following format:
40 special-<description>.<size>
41 Where description is one of:
42 roman-numerals, hacker
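The naming scheme above can be taken apart mechanically. The following
is a minimal Python sketch, not part of SCOWL itself; the names ListFile
and parse_list_name are made up for illustration. It assumes that
category names never contain a hyphen, which holds for every category
listed above, so the first hyphen separates the category from the
sub-category (or from the description, for special lists).

```python
from typing import NamedTuple


class ListFile(NamedTuple):
    category: str      # e.g. "american", "british_variant_1", "special"
    sub_category: str  # e.g. "words", "proper-names", "roman-numerals"
    size: int          # e.g. 35


def parse_list_name(name: str) -> ListFile:
    """Split '<category>-<sub-category>.<size>' into its three parts."""
    # The size is whatever follows the last dot.
    stem, _, size = name.rpartition(".")
    # Categories use underscores, never hyphens, so the first hyphen
    # ends the category; "proper-names" keeps its own hyphen.
    category, _, sub_category = stem.partition("-")
    if not (category and sub_category and size.isdigit()):
        raise ValueError(f"not a SCOWL list name: {name!r}")
    return ListFile(category, sub_category, int(size))


print(parse_list_name("american-words.35"))
print(parse_list_name("english-proper-names.50"))
print(parse_list_name("special-roman-numerals.35"))
```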
The Perl script "mk-list" can be used to create a word list of the
desired size. Its usage is:
46 ./mk-list [-f] [-v#] <spelling categories> <size>
where <spelling categories> is one of the above spelling categories
(the english and special categories are automatically included, as well
as all sub-categories) and <size> is the desired size. The "-v" option
can be used to also include the appropriate variants file up to level
'#'. The normal output will be a sorted word list. If you would rather
see what files will be included, use the "-f" option.
When manually combining the word lists the "english" spelling
category should be used as well as one of "american", "british",
"british_z" (British with -ize spelling), or "canadian". Great care
has been taken so that only one spelling for any particular word
is included in the main list (with some minor exceptions). When two
variants were considered equal I randomly picked one for inclusion in
the main word list. Unfortunately this means that my choice in how to
spell a word may not match your choice. If this is the case you can
try including one of the "variant_0" spelling categories, which
include most variants that are considered almost equal. The
"variant_0" spelling category corresponds mostly to American variants,
while the "british_variant_0" and "canadian_variant_0" are for British
and Canadian variants, respectively. The "variant_1" spelling
categories include variants which are also generally considered
acceptable, and "variant_2" contains variants which are seldom used
and may not even be considered correct. There is no
"british_variant_2" or "canadian_variant_2" spelling category since
the distinction would be almost meaningless.
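Manual combining as described above might look like the following
Python sketch. It is an illustration only and not how mk-list is
implemented; the function name combine_lists is hypothetical. It assumes
each file holds one word per line and that a list of a given size is the
union of all files at or below that size, since each file holds only the
words new at its size.

```python
from pathlib import Path


def combine_lists(final_dir, categories, max_size):
    """Return the sorted union of the selected word list files.

    A file '<category>-<sub-category>.<size>' is selected when its
    category is in `categories` and its size is at most `max_size`.
    All sub-categories of a selected category are included.
    """
    words = set()
    for path in Path(final_dir).iterdir():
        stem, _, size = path.name.rpartition(".")
        category = stem.partition("-")[0]
        if category in categories and size.isdigit() and int(size) <= max_size:
            # The lists store accented words in ISO-8859-1.
            text = path.read_text(encoding="iso-8859-1")
            words.update(w.strip() for w in text.splitlines() if w.strip())
    return sorted(words)
```

For example, combine_lists("final", {"english", "american", "variant_0"}, 60)
would give a size 60 American list that also accepts the nearly equal
variants.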
The "abbreviations" category includes abbreviations and acronyms which
are not also normal words. The "contractions" category should be
self-explanatory. The "upper" category includes upper case words and
proper names which are common enough to appear in a typical dictionary.
The "proper-names" category includes all the additional uppercase words.
Finally, the "words" category contains all the normal English words.
81 To give you an idea of what the words in the various sizes look like
82 here is a sample of 25 random words found only in that size:
84 10: advertised agreeing artificial bucket changes closest currently finding
85 implications learning liable obvious partial peace planet preparing
86 produced regulations shortly tries under unnecessary vacations vast wind
88 20: accomplishes addict baffles blink chapel corrections depresses dripping
89 erased infant interfere launch nicking novels paranoid passport pursued
90 recruitment rectifying relaxed sixteen sundry tab undergone withdraws
92 35: adores affixes brisks caking conciliates decimates discretionary
93 dispatches forensics glorify gridiron healed hurling kelp massacring
94 necks pits placarding pyramids ratting recreates renovated sandals shirks
97 40: demoed dichotomy dilapidation disheveled ebullience estimable finagling
98 hemorrhoid lazily medalists mintiest motherboards ostracism pornographers
99 predilections remarries southbound steamrolled sympathizers tads tampons
100 tattletale upchucked vainly viscous
102 50: bootless brawler bulkhead canoeist declassifying farthings hake hectors
103 helpmate hermitage humanoid kitsch mercerize pawnshops pleasingly
104 retrorockets scurrilously solemnizes superficiality symbiosis tangelo
105 timetabling unenviable unmoral unreconstructed
107 55: beachfront bicarbonate caff campanologists execrably fab fightback
108 firebricks insipidity laboriousness megawatts mirthlessly misnames
109 nymphos photocell potholed psychoactive psychoanalytically schoolmarmish
110 simulacra subeditors supremo sweated turbocharges yogic
112 60: assayer banteringly besmeared brazer chromatin cremes deciliters
113 doubtfulness enshrinement ephemerally fibular globalist gypper
114 legitimatized mensch mopers oversea pantyliner paratyphoid redivide
115 rehabilitative salesladies sensualists superposition univalves
117 70: adactylous anticapitalist bezant bister boraginaceous civically cossacks
118 cousinly curricle dekaliter grippingly grugrus gurging hermaphroditism
119 levanted magnetizer nonapplicable panegyrists parametrize radomes
120 refilter ruinations teths truistic uts
122 80: bodikin buhrs covetiveness diarch disaccharidases drumbeater empusas
123 flyings hyperexcitability hyperpolarizations janizaries overwash
124 physiocrats postform postsecondary preambulate puzzlehead remixer
125 snoutier tetrathlons toothdrawing triff unaffectionate wearish yawy
127 95: actinophone aerobious anadenia biochemics chromatopathia ciclatouns
128 gaspiest guapinol hagigah interdorsal melanotekite minnicking
129 nonretrenchment overloftily oystriges peltandra retromaxillary
130 subterraqueous transphysically unconfidential unvalidating upspew
131 verminlike vetiveria yerth
And here is a count of the number of words at each size
(american + english spelling categories):
Size     Words    Names   Running Total       %
  10     4,427       15           4,442     0.7
  20     8,122        0          12,564     1.9
  35    37,251      224          50,039     7.7
  40     6,802      503          57,344     8.8
  50    24,505   15,455          97,304    14.9
  55     6,555        0         103,859    15.9
  60    13,633      775         118,267    18.1
  70    35,507    7,747         161,521    24.8
  80   143,791   33,293         338,605    51.9
  95   227,056   86,814         652,475   100.0
148 (The "Words" column does not include the name count.)
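The "Running Total" and "%" columns follow directly from the "Words" and
"Names" columns. As a rough check, this Python sketch recomputes them
from the counts in the table above; the names COUNTS and running_totals
are made up for the example.

```python
# (size, words, names) copied from the table above.
COUNTS = [
    (10, 4427, 15),
    (20, 8122, 0),
    (35, 37251, 224),
    (40, 6802, 503),
    (50, 24505, 15455),
    (55, 6555, 0),
    (60, 13633, 775),
    (70, 35507, 7747),
    (80, 143791, 33293),
    (95, 227056, 86814),
]


def running_totals(rows):
    """Return (size, cumulative words + names) for each row."""
    total = 0
    result = []
    for size, words, names in rows:
        total += words + names
        result.append((size, total))
    return result


totals = running_totals(COUNTS)
grand_total = totals[-1][1]
for size, total in totals:
    # The percentage is each running total relative to the size 95 total.
    print(f"{size:>4} {total:>9,} {100 * total / grand_total:6.1f}")
```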
150 Size 35 is the recommended small size, 50 the medium and 70 the large.
151 For spell checking I recommend using 60. Sizes 70 and below contain
152 words found in most dictionaries while the 80 size contains all the
153 strange and unusual words people like to use in word games such as
Scrabble (TM). While a lot of the words in the 80 size are not
155 used very often, they are all generally considered valid words in the
156 English language. The 95 contains just about every English word in
157 existence and then some. Many of the words at the 95 level will
158 probably not be considered valid English words by most people. I use
159 the 60 size for the English dictionary for Aspell, and I don't
160 recommend anyone use levels above 70 for spell checking. Levels above
161 70 contain rarely used words which can hide misspellings of similar
more commonly used words. For example, the word "ort" can hide a
common typo of "or". No one should need to use a size larger than 80;
the 95 size is labeled insane for a reason.
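To make the "ort" example concrete, here is a toy Python sketch of how a
larger dictionary hides a typo. The two word sets are tiny stand-ins
invented for the example, not real SCOWL contents, and the function
flagged is hypothetical.

```python
def flagged(text_words, dictionary):
    """Return the words a checker using `dictionary` would flag."""
    return [w for w in text_words if w.lower() not in dictionary]


# Toy stand-ins: a smaller list, and a larger one that adds the rare
# word "ort".
smaller = {"this", "or", "that"}
larger = smaller | {"ort"}

typed = ["this", "ort", "that"]  # "ort" here is a typo for "or"

print(flagged(typed, smaller))  # ['ort']  (typo caught)
print(flagged(typed, larger))   # []       (typo hidden)
```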
166 Accents are present on certain words such as café in iso8859-1 format.
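Because the accented words are stored as ISO-8859-1, a program that
expects UTF-8 needs to decode them explicitly. A minimal Python sketch,
with a hypothetical helper name to_utf8:

```python
def to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


raw = b"caf\xe9"  # "café" as stored in ISO-8859-1
print(raw.decode("iso-8859-1"))
print(to_utf8(raw))  # b'caf\xc3\xa9'
```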
170 From Revision 7 to 7.1 (January 6, 2011)
172 Updated to revision 5.1 of Varcon which corrected several errors.
174 Fixed various problems with the variant processing which corrected a
177 Added several now common proper names and some other words now
Included the misc/ and speller/ directories, which were in SVN but
left out of the release tarball.
183 Other minor fixes, including some fixes to the taboo word lists.
185 From Revision 6 to 7 (December 27, 2010)
187 Updated to revision 5.0 of Varcon which corrected many errors,
188 especially in the British and Canadian spelling categories. Also
189 added new spelling categories for the British and Canadian spelling
190 variants and separated them out from the main variant_* categories.
192 Moved Moby names lists (3897male.nam 4946fema.len 21986na.mes) to 95
193 level since they contain too many errors and rare names.
Moved frequency class 0 from Brian Kelk's Wordlist from
level 60 to 70, and also filtered it with level 80 due to too many
199 Many other minor fixes.
201 From Revision 5 to 6 (August 10, 2004)
203 Updated to version 4.0 of the 12dicts package.
Included the 3esl, 2of4brif, and 5desk lists from the new 12dicts
206 package. The 3esl was included in the 40 size, the 2of4brif in the
207 55 size and the 5desk in the 70 size.
209 Removed the Ispell word list as it was a source of too many errors.
210 This eliminated the 65 size.
212 Removed clause 4 from the Ispell copyright with permission of Geoff
215 Updated to version 4.1 of VarCon.
Added the "british_z" spelling category which is British using the
220 From Revision 4a to 5 (January 3, 2002)
222 Added variants that were not really spelling variants (such as
223 forwards) back into the main list.
225 Fixed a bug which caused variants of words to incorrectly appear in
226 the non-variant lists.
228 Moved rarely used inflections of a word into higher number lists.
Added other inflections of a word based on the following criteria:
231 If the word is in the base form: only include that word.
232 If the word is in a plural form: include the base word and the plural
233 If the word is a verb form (other than plural): include all verb forms
234 If the word is an ad* form: include all ad* forms
235 If the word is in a possessive form: also include the non-possessive
237 Updated to the latest version of many of the source dictionaries.
Removed the DEC Word List due to the questionable licence and
because removing it will not seriously decrease the quality of SCOWL
(there are a few fewer proper names).
243 From Revision 4 to 4a (April 4, 2001)
Reran the scripts on a newer version of AGID (3a) which fixes a bug
246 which caused some common words to be improperly marked as variants.
248 From Revision 3 to 4 (January 28, 2001)
250 Split the variant "spelling category" up into 3 different levels.
252 Added words in the Ispell word list at the 65 level.
Other changes due to using more recent versions of various sources,
including a more accurate version of AGID thanks to the work of
258 From Revision 2 to 3 (August 18, 2000)
260 Renamed special-unix-terms to special-hacker and added a large
261 number of commonly used words within the hacker (not cracker)
264 Added a couple more signature words including "newbie".
266 Minor changes due to changes in the inflection database.
268 From Revision 1 to 2 (August 5, 2000)
270 Moved the male and female name lists from the mwords package and the
DEC name lists from the 50 level to the 60 level and moved Alan's
272 name list from the 60 level to the 50 level. Also added the top
273 1000 male, female, and last names from the 1990 Census report to the
274 50 level. This reduced the number of names in the 50 level from
277 Added a large number of Uppercase words to the 50 level.
279 Properly accented the possessive form of some words.
281 Minor other changes due to changes in my raw data files which have
282 not been released yet. Email if you are interested in these files.
284 COPYRIGHT, SOURCES, and CREDITS:
286 The collective work is Copyright 2000-2011 by Kevin Atkinson as well
287 as any of the copyrights mentioned below:
289 Copyright 2000-2011 by Kevin Atkinson
291 Permission to use, copy, modify, distribute and sell these word
292 lists, the associated scripts, the output created from the scripts,
293 and its documentation for any purpose is hereby granted without fee,
294 provided that the above copyright notice appears in all copies and
295 that both that copyright notice and this permission notice appear in
296 supporting documentation. Kevin Atkinson makes no representations
297 about the suitability of this array for any purpose. It is provided
298 "as is" without express or implied warranty.
300 Alan Beale <biljir@pobox.com> also deserves special credit as he has,
301 in addition to providing the 12Dicts package and being a major
302 contributor to the ENABLE word list, given me an incredible amount of
303 feedback and created a number of special lists (those found in the
304 Supplement) in order to help improve the overall quality of SCOWL.
306 The 10 level includes the 1000 most common English words (according to
307 the Moby (TM) Words II [MWords] package), a subset of the 1000 most
308 common words on the Internet (again, according to Moby Words II), and
frequency class 16 from Brian Kelk's "UK English Wordlist
310 with Frequency Classification".
312 The MWords package was explicitly placed in the public domain:
314 The Moby lexicon project is complete and has
315 been place into the public domain. Use, sell,
316 rework, excerpt and use in any way on any platform.
318 Placing this material on internal or public servers is
319 also encouraged. The compiler is not aware of any
320 export restrictions so freely distribute world-wide.
322 You can verify the public domain status by contacting
326 Arcata, CA 95521-4884
331 The "UK English Wordlist With Frequency Classification" is also in the
334 Date: Sat, 08 Jul 2000 20:27:21 +0100
335 From: Brian Kelk <Brian.Kelk@cl.cam.ac.uk>
337 > I was wondering what the copyright status of your "UK English
338 > Wordlist With Frequency Classification" word list as it seems to
339 > be lacking any copyright notice.
341 There were many many sources in total, but any text marked
342 "copyright" was avoided. Locally-written documentation was one
343 source. An earlier version of the list resided in a filespace called
344 PUBLIC on the University mainframe, because it was considered public
347 Date: Tue, 11 Jul 2000 19:31:34 +0100
349 > So are you saying your word list is also in the public domain?
351 That is the intention.
353 The 20 level includes frequency classes 7-15 from Brian's word list.
355 The 35 level includes frequency classes 2-6 and words appearing in at
356 least 11 of 12 dictionaries as indicated in the 12Dicts package. All
357 words from the 12Dicts package have had likely inflections added via
358 my inflection database.
360 The 12Dicts package and Supplement is in the Public Domain.
362 The WordNet database, which was used in the creation of the
363 Inflections database, is under the following copyright:
365 This software and database is being provided to you, the LICENSEE,
366 by Princeton University under the following license. By obtaining,
367 using and/or copying this software and database, you agree that you
368 have read, understood, and will comply with these terms and
371 Permission to use, copy, modify and distribute this software and
372 database and its documentation for any purpose and without fee or
373 royalty is hereby granted, provided that you agree to comply with
374 the following copyright notice and statements, including the
375 disclaimer, and that the same appear on ALL copies of the software,
376 database and documentation, including modifications that you make
377 for internal use or for distribution.
379 WordNet 1.6 Copyright 1997 by Princeton University. All rights
382 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
383 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
384 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
385 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
386 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE
387 LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY
388 THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
390 The name of Princeton University or Princeton may not be used in
391 advertising or publicity pertaining to distribution of the software
392 and/or database. Title to copyright in this software, database and
393 any associated documentation shall at all times remain with
394 Princeton University and LICENSEE agrees to preserve same.
396 The 40 level includes words from Alan's 3esl list found in version 4.0
397 of his 12dicts package. Like his other stuff the 3esl list is also in the
The 50 level includes Brian's frequency class 1, words appearing
401 in at least 5 of 12 of the dictionaries as indicated in the 12Dicts
402 package, and uppercase words in at least 4 of the previous 12
403 dictionaries. A decent number of proper names is also included: The
top 1000 male, female, and last names from the 1990 Census report; a
405 list of names sent to me by Alan Beale; and a few names that I added
406 myself. Finally a small list of abbreviations not commonly found in
407 other word lists is included.
The name files from the Census report are a government document, which
I don't think can be copyrighted.
The file special-jargon.50 uses common.lst and word.lst from the
"Unofficial Jargon File Word Lists", which is derived from "The Jargon
File", all of which are in the Public Domain. This file also contains
415 a few extra UNIX terms which are found in the file "unix-terms" in the
418 The 55 level includes words from Alan's 2of4brif list found in version
419 4.0 of his 12dicts package. Like his other stuff the 2of4brif is also
420 in the public domain.
422 The 60 level includes all words appearing in at least 2 of the 12
423 dictionaries as indicated by the 12Dicts package.
425 The 70 level includes Brian's frequency class 0 and the 74,550 common
426 dictionary words from the MWords package. The common dictionary words,
427 like those from the 12Dicts package, have had all likely inflections
added. The 70 level also includes the 5desk list from version 4.0 of
the 12Dicts package, which is in the public domain.
431 The 80 level includes the ENABLE word list, all the lists in the
432 ENABLE supplement package (except for ABLE), the "UK Advanced Cryptics
Dictionary" (UKACD), the list of signature words from the YAWL package,
434 and the 10,196 places list from the MWords package.
The ENABLE package, maintained by M\Cooper <thegrendel@theriver.com>,
437 is in the Public Domain:
439 The ENABLE master word list, WORD.LST, is herewith formally released
440 into the Public Domain. Anyone is free to use it or distribute it in
441 any manner they see fit. No fee or registration is required for its
442 use nor are "contributions" solicited (if you feel you absolutely
443 must contribute something for your own peace of mind, the authors of
444 the ENABLE list ask that you make a donation on their behalf to your
445 favorite charity). This word list is our gift to the Scrabble
446 community, as an alternate to "official" word lists. Game designers
447 may feel free to incorporate the WORD.LST into their games. Please
448 mention the source and credit us as originators of the list. Note
449 that if you, as a game designer, use the WORD.LST in your product,
450 you may still copyright and protect your product, but you may *not*
451 legally copyright or in any way restrict redistribution of the
452 WORD.LST portion of your product. This *may* under law restrict your
453 rights to restrict your users' rights, but that is only fair.
455 UKACD, by J Ross Beresford <ross@bryson.demon.co.uk>, is under the
458 Copyright (c) J Ross Beresford 1993-1999. All Rights Reserved.
460 The following restriction is placed on the use of this publication:
461 if The UK Advanced Cryptics Dictionary is used in a software package
462 or redistributed in any form, the copyright notice must be
463 prominently displayed and the text of this document must be included
466 There are no other restrictions: I would like to see the list
467 distributed as widely as possible.
469 The 95 level includes the 354,984 single words, 256,772 compound
470 words, 4,946 female names and the 3,897 male names, and 21,986 names
471 from the MWords package, ABLE.LST from the ENABLE Supplement, and some
472 additional words found in my part-of-speech database that were not
475 Accent information was taken from UKACD.
477 My VARCON package was used to create the American, British, and
Since the original word lists used in the VARCON package came
481 from the Ispell distribution they are under the Ispell copyright:
483 Copyright 1993, Geoff Kuenning, Granada Hills, CA
486 Redistribution and use in source and binary forms, with or without
487 modification, are permitted provided that the following conditions
490 1. Redistributions of source code must retain the above copyright
491 notice, this list of conditions and the following disclaimer.
492 2. Redistributions in binary form must reproduce the above copyright
493 notice, this list of conditions and the following disclaimer in the
494 documentation and/or other materials provided with the distribution.
495 3. All modifications to the source code must be clearly marked as
496 such. Binary redistributions based on modified source code
497 must be clearly marked as modified versions in the documentation
498 and/or other materials provided with the distribution.
499 (clause 4 removed with permission from Geoff Kuenning)
500 5. The name of Geoff Kuenning may not be used to endorse or promote
501 products derived from this software without specific prior
504 THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS
505 IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
506 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
507 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL GEOFF
508 KUENNING OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
509 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
510 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
511 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
512 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
513 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
514 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
515 POSSIBILITY OF SUCH DAMAGE.
517 The variant word lists were created from a list of variants found in
518 the 12dicts supplement package as well as a list of variants I created
521 The Readmes for the various packages used can be found in the
522 appropriate directory under the r/ directory.
526 The process of "sort"s, "comm"s, and Perl scripts to combine the many
527 word lists and separate out the variant information is inexact and
error prone. The whole thing needs to be rewritten to deal with
529 words in terms of lemmas. When the exact lemma is not known a best
530 guess should be made. I'm not sure what form this should be in. I
531 originally thought this should be some sort of database, but maybe I
532 should just slurp all that data into memory and process it in one
533 giant perl script. With the amount of memory available these days (at
534 least 2 GB, often 4 GB or more) this should not really be a problem.
In addition, there is a very nice frequency analysis of the BNC corpus
done by Adam Kilgarriff. Unlike Brian's word lists the BNC lists
include part of speech information. I plan on somehow using these
lists as Adam Kilgarriff has given me the OK to use them in SCOWL.
540 These lists will greatly reduce the problem of inflected forms of a
541 word appearing at different levels due to the part-of-speech
There is frequency information for some other corpora such as COCA
(Corpus of Contemporary American English) and ANC (American National
Corpus) which I might also be able to use. The former will require
permission, and the latter is of questionable quality.
549 RECREATING THE WORD LISTS:
551 In order to recreate the word lists you need a modern version of Perl,
552 bash, the traditional set of shell utilities, a system that supports
553 symbolic links, and quite possibly GNU Make. The easiest way to
recreate the word lists is to check out SVN revision 161 (or tag
555 scowl-7.1) and simply type "make" (see http://wordlist.sourceforge.net).
556 You can try to download all the pieces manually, but you may not get
557 the same result since the latest version of some parts used to create
558 SCOWL may not have been released yet.
560 The src/ directory contains the numerous scripts used in the creation
561 of the final product.
563 The r/ directory contains the raw data used to create the final
564 product. If you checkout from SVN this directory should be populated
565 automatically for you. If you insist on doing it the hard way see the
566 README file in the r/ directory for more information.
568 The l/ directory contains symbolic links used by the actual scripts.
Finally, the working/ directory is where all the intermediate files go
571 that are not specific to one source.