1 Spell Checking Oriented Word Lists (SCOWL)
3 Mon Apr 16 22:11:56 2018 -0400 [7bbe293]
4 by Kevin Atkinson (kevina@gnu.org)
6 The SCOWL is a collection of word lists split up in various sizes, and
7 other categories, intended to be suitable for use in spell checkers.
8 However, I am sure it will have numerous other uses as well.
10 The latest version can be found at http://wordlist.aspell.net/.
12 The directory final/ contains the actual word lists broken up into
13 various sizes and categories. The r/ directory contains Readmes from
14 the various sources used to create this package.
16 The misc/ contains a small list of taboo words, see the README file
17 for more info. The speller/ directory contains scripts for creating
18 spelling dictionaries for Aspell and Hunspell.
The other directories contain the necessary information to recreate the
word lists from the raw data. Unless you are interested in improving the
word lists you should not need to worry about what's here. See the
section on recreating the word lists for more information on what's
there.
Except for the special word lists the files follow the following
naming convention:
  <spelling category>-<sub-category>.<size>
Where the spelling category is one of
  english, american, british, british_z, canadian, australian,
  variant_1, variant_2, variant_3,
  british_variant_1, british_variant_2,
  canadian_variant_1, canadian_variant_2,
  australian_variant_1, australian_variant_2
Sub-category is one of
  abbreviations, contractions, proper-names, upper, words
Size is one of
  10, 20, 35 (small), 40, 50 (medium), 55, 60, 70 (large),
  80 (huge), 95 (insane)
The special word lists are in the following format:
  special-<description>.<size>
Where description is one of:
  roman-numerals, hacker
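
The naming convention above can be taken apart mechanically. The following
is a minimal sketch in Python, not part of SCOWL itself; the names ScowlFile
and parse_scowl_name are hypothetical, and the file names in the usage note
are illustrative rather than a listing of final/.

```python
from typing import NamedTuple


class ScowlFile(NamedTuple):
    category: str      # spelling category, or "special"
    sub_category: str  # sub-category, or the description for special lists
    size: int


def parse_scowl_name(name: str) -> ScowlFile:
    """Split '<category>-<sub-category>.<size>' into its three parts."""
    stem, _, size = name.rpartition(".")
    # Categories use underscores, never hyphens, so the first hyphen
    # separates the category from the (possibly hyphenated) sub-category.
    category, sep, sub_category = stem.partition("-")
    if not sep or not sub_category or not size.isdigit():
        raise ValueError(f"not a SCOWL list name: {name!r}")
    return ScowlFile(category, sub_category, int(size))
```

For example, parse_scowl_name("american-proper-names.50") yields the
category "american", the sub-category "proper-names" and the size 50. A
name without a hyphen or a numeric size raises ValueError.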
The Perl script "mk-list" can be used to create a word list of the
desired size. Its usage is:
  ./mk-list [-f] [-v#] <spelling categories> <size>
where <spelling categories> is one of the above spelling categories
(the english and special categories are automatically included, as
are all sub-categories) and <size> is the desired size. The
"-v" option can be used to also include the appropriate
variants file up to level '#'. The normal output will be a sorted
word list. If you would rather see what files will be included, use the
"-f" option.
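
To make the selection rule concrete, here is a rough Python sketch of
choosing list files by name. It is not the mk-list script and does not
reproduce its options. The function name select_lists is hypothetical, and
the sketch assumes that a list of a given size is built from every file at
or below that size, as the running totals later in this file suggest.

```python
def select_lists(names, spelling_category, max_size):
    """Return the list file names at or below max_size for one category."""
    # The english and special categories are always included.
    wanted = {"english", "special", spelling_category}
    selected = []
    for name in names:
        stem, _, size = name.rpartition(".")
        category = stem.partition("-")[0]
        if category in wanted and size.isdigit() and int(size) <= max_size:
            selected.append(name)
    return sorted(selected)
```

Given the names english-words.10, american-words.35, british-words.35 and
american-words.70, a call with "american" and 50 keeps only the first two.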
When manually combining the word lists the "english" spelling
category should be used as well as one of "american", "british",
"british_z" (British with -ize spelling), "canadian" or "australian".
59 Great care has been taken so that only one spelling for any particular
60 word is included in the main list (with some minor exceptions). When
61 two variants were considered equal I randomly picked one for inclusion
62 in the main word list. Unfortunately this means that my choice in how
63 to spell a word may not match your choice. If this is the case you
64 can try including one of the "variant_1" spelling categories which
65 includes most variants which are considered almost equal. The
66 "variant_1" spelling category corresponds mostly to American variants,
67 while the "british_variant_1", "canadian_variant_1" and
68 "australian_variant_1" are for British, Canadian and Australian
69 variants, respectively. The "variant_2" spelling categories include
70 variants which are also generally considered acceptable, and
71 "variant_3" contains variants which are seldom used and may not even
72 be considered correct. There is no "british_variant_3",
73 "canadian_variant_3" or "australian_variant_3" spelling category since
74 the distinction would be almost meaningless.
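
The variant categories described above can be named programmatically. This
is a small Python sketch under stated assumptions: the function name
variant_categories is hypothetical, "american" is mapped to the unprefixed
variant_N categories because the text says they correspond mostly to
American variants, and "british_z" is left out because the text does not
say which variant files go with it.

```python
def variant_categories(spelling_category, level):
    """Name the variant categories to include up to the given level."""
    prefixes = {
        "american": "",
        "british": "british_",
        "canadian": "canadian_",
        "australian": "australian_",
    }
    if spelling_category not in prefixes:
        raise ValueError(f"no variant mapping for {spelling_category!r}")
    prefix = prefixes[spelling_category]
    # Only the unprefixed (mostly American) variants have a level 3.
    max_level = 3 if prefix == "" else 2
    if not 0 <= level <= max_level:
        raise ValueError(f"level must be between 0 and {max_level}")
    return [f"{prefix}variant_{n}" for n in range(1, level + 1)]
```

Level 0 returns an empty list, meaning only the main word list is used.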
The "abbreviations" category includes abbreviations and acronyms which
are not also normal words. The "contractions" category should be
self-explanatory. The "upper" category includes upper case words and proper
79 names which are common enough to appear in a typical dictionary. The
80 "proper-names" category includes all the additional uppercase words.
81 Finally the "words" category contains all the normal English words.
83 To give you an idea of what the words in the various sizes look like
84 here is a sample of 25 random words found only in that size:
86 10: able bidding bound built contact direct discouraged every experts flight
87 impose live lower mail negative plant repeat sorry spot strongly success
88 technique trouble workers yours
90 20: adapting astronomy asynchronous beer believer ceasing comedy comprised
91 daughter deletion ditch dripped gathers generalizations infinity
92 interacted orchestral padding petty risked stems struggled tiny tribes
95 35: amnesty annihilating authorship barged bathtubs curdled debilities
96 excusable founders glimpsing ladled lieutenants mobbed naturalness
97 naughtily obesity ogling pinpointing scabbing semester sirens soloed
98 soundly visuals witched
100 40: capitulation catamaran closeout cobblestones crosstown defector enamoring
101 fractionally homecoming honorably hypes impersonator incontinent mopeds
102 niggle nondenominational quads redeveloping retrial skydives slalom
103 speckle speedway uncharacteristically unzips
105 50: anthropomorphism bespeaks bodega colossi debauches disobediently drub
106 enervated headboard helot hex idealistically lambkin linebackers
107 minimization misruled pinafores popovers scratchiness spiny tiresomeness
108 underplaying wardrooms waxworks wrongness
110 55: anglophiles anticoagulant borehole choirboys commonality cutaway defogged
111 exeunt gatecrashing holdalls kvetch loughs photostat pitheads potholer
112 provincially ritualized roadshows rota themed toecaps townies uneatable
115 60: adventurousness anodizing austral beanstalk bemusedly chorea cotangents
116 counterspy crackheads fishponds floggers hairsplitter hombre homiletic
117 invalidism kepi provability redyes reifying riflers tarsus trainman
118 tritium unchains valetudinarian
120 70: acrophobic algor artel benny bilharziasis colocynth cotoneaster duckpond
121 electrostatically evulsion exegetics felly fictionist ironsides laminator
122 monandries nonhero pervious pily presbyopic relativizing squeteague
123 syringomyelia verifiers volleyer
125 80: boltings bouldering butcherings disaffirmances extrications exulceration
126 flybanes glossators moderations naething ois paddlefishes perimorphisms
127 phagocytose plicating protiums recanters redliner semple
128 septendecillionths shebeeners tegmentum tritubercular uncasked vacuates
130 95: bufonidae callityping cantingness cavish electrosensitive entayles
131 halcyonian hexagonial idoxuridines insorbent lowbrowism mantispid
132 overspeak predeclaring prerealized propheticalness pseudoracemism
133 reaedifyes selaginellales setover supercomplex theogeological unlasting
And here is a count of the number of words at each size
(american + english spelling categories):
Size     Words    Names  Running Total       %
  10     4,426       13          4,439     0.7
  20     8,128        0         12,567     1.9
  35    37,259      222         50,048     7.6
  40     6,853      491         57,392     8.7
  50    25,226   18,316        100,934    15.3
  55     6,489        0        107,423    16.3
  60    14,425      846        122,694    18.7
  70    35,328    7,899        165,921    25.2
  80   144,216   33,367        343,504    52.2
  95   227,661   86,633        657,798   100.0
153 (The "Words" column does not include the name count.)
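
The running total and percentage columns follow directly from the word and
name counts. As a quick check, this Python sketch recomputes them from the
figures in the table; the names ROWS and running_totals are made up for
this illustration.

```python
# (size, words, names) taken from the table above.
ROWS = [
    (10, 4426, 13),
    (20, 8128, 0),
    (35, 37259, 222),
    (40, 6853, 491),
    (50, 25226, 18316),
    (55, 6489, 0),
    (60, 14425, 846),
    (70, 35328, 7899),
    (80, 144216, 33367),
    (95, 227661, 86633),
]


def running_totals(rows):
    """Return (size, running total, percent of the full list) per row."""
    grand_total = sum(words + names for _, words, names in rows)
    total = 0
    result = []
    for size, words, names in rows:
        total += words + names
        result.append((size, total, round(100 * total / grand_total, 1)))
    return result
```

Each running total is the sum of words and names for that size and all
smaller sizes, and each percentage is that total divided by the size 95
running total.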
155 Size 35 is the recommended small size, 50 the medium and 70 the large.
156 Sizes 70 and below contain words found in most dictionaries while the
157 80 size contains all the strange and unusual words people like to use
158 in word games such as Scrabble (TM). While a lot of the words in the
159 80 size are not used very often, they are all generally considered
valid words in the English language. The 95 size contains just about every
English word in existence and then some. Many of the words at the 95
level will probably not be considered valid English words by most
people.
For spell checking I recommend using size 60. This size is the
largest size that I am fairly confident does not contain any
misspellings or invalid words. In addition, an effort is made to
exclude from the 60 size valid yet problematic words (such as
"calender") that are likely to be a misspelling of a more common word.
The 70 size is reasonable for those who want a larger list and don't
mind a few errors. The 80 or larger sizes are not reasonable for spell
checking.
Accents are present on certain words, such as café, and are encoded in
ISO-8859-1.
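
Because the lists are ISO-8859-1 rather than UTF-8, a program reading them
needs to say so explicitly. The Python sketch below shows one way to read
a list and re-save it as UTF-8; the function names read_word_list and
write_utf8 are hypothetical, and the sketch assumes one word per line.

```python
def read_word_list(path):
    """Read one word per line from an ISO-8859-1 encoded list."""
    with open(path, encoding="iso-8859-1") as handle:
        return [line.strip() for line in handle if line.strip()]


def write_utf8(words, path):
    """Write the words back out, one per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.writelines(word + "\n" for word in words)
```

Opening the files with the wrong encoding will either raise a decoding
error or silently mangle words such as café, so it is worth being explicit.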
178 From Version 2017.08.24 to 2018.04.16
182 Fix build problems on macOS.
184 From Version 2017.01.22 to 2017.08.24
188 From Version 2016.11.20 to 2017.01.22
192 From Version 2016.06.26 to 2016.11.20
194 New Australian spelling category thanks to the work of Benjamin
195 Titze (btitze@protonmail.ch)
199 From Version 2016.01.19 to 2016.06.26
203 Updated to Version 6.0.2 of 12dicts
207 From Version 2015.08.24 to 2016.01.19
211 Clarified README to indicate why the 60 size is the preferred size
214 Remove some very uncommon possessive forms.
216 Change "SET UTF8" to "SET UTF-8" in hunspell affix file.
218 From Version 2015.05.18 to 2015.08.24 (Aug 24, 2015)
222 From Version 2015.04.24 to 2015.05.18 (May 18, 2015)
224 Added some new words found to have a high frequency in the COCA
225 corpus. (http://corpus.byu.edu/coca/).
227 Fix en spelling suggestions for 'alot' and 'exersize' in hunspell
228 dictionary (upstreamed from the changes made in Firefox).
230 From Version 2015.02.15 to 2015.04.24 (April 24, 2015)
232 Added some new words.
234 Convert hunspell dictionary to UTF-8 in order to handle smart
237 From Version 2015.01.28 to 2015.02.15 (February 15, 2015)
239 Added a large number of neologisms (newly invented words)
240 such as "selfie" and "smartwatch" thanks to Alan Beale.
242 Various other new words.
Clean up the special-hacker category by removing some words that
didn't exist in the Google Books Corpus (1980 - 2008) and
originated from the "Unofficial Jargon File Word Lists".
248 From Version 2014.11.17 to 2015.01.28 (January 28, 2015)
Various new words, many from analyzing the Google Books Corpus
(1980 - 2008). See http://app.aspell.net/lookup-freq.
253 Moved some uncommon words that can easily hide a misspelling of a
254 more common word to level 70. (calender, adrenalin and Joesph)
Removed several -er and -est forms from adjectives that were so
uncommon that they were not found anywhere in the Google Books
Corpus (1980 - 2008).
260 From Version 2014.08.11.1 to 2014.11.17 (November 17, 2014)
264 Fix typo in Hunspell readme.
266 From Version 2014.08.11 to 2014.08.11.1 (August 13, 2014)
268 Forgot to mention this important change from 7.1 to 2014.08.11:
270 Shifted the variant levels up by one: variant_0 is now variant_1,
271 variant_1 is now variant_2, and variant_2 is now variant_3.
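
For anyone with scripts written against 7.1 or earlier, the shift can be
expressed as a simple lookup. This Python sketch covers only the three
names listed above; the names RENAMED and current_variant_name are
hypothetical.

```python
# Old (7.1 and earlier) name -> name from 2014.08.11 onward.
RENAMED = {
    "variant_0": "variant_1",
    "variant_1": "variant_2",
    "variant_2": "variant_3",
}


def current_variant_name(old_name):
    """Translate a pre-2014.08.11 variant category name to the new one."""
    return RENAMED.get(old_name, old_name)
```

The mapping must be applied exactly once to an old name, since the old and
new name sets overlap.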
273 Other minor fixes in this README.
275 No changes to the contents of the lists.
277 From Revision 7.1 to Version 2014.08.11 (August 11, 2014)
279 Added some missing possessive forms.
281 Added some new words and proper names.
283 Clean up the categories (words, upper, proper-names etc) so that they
Convert documentation to UTF-8. For now, the word lists are still in
ISO-8859-1 to prevent compatibility problems.
Add schema and scripts for creating a SQLite database from SCOWL.
Add some utility and library functions using them. This database is
used by the new web apps (http://app.aspell.net/lookup & create).
Enhance speller/make-hunspell-dict. The biggest improvement is that
it now generates several more dictionaries in addition to
295 the official ones. These additional dictionaries are ones for
296 British English and larger dictionaries that include up to SCOWL
299 From Revision 7 to 7.1 (January 6, 2011)
301 Updated to revision 5.1 of Varcon which corrected several errors.
303 Fixed various problems with the variant processing which corrected a
306 Added several now common proper names and some other words now
309 Include misc/ and speller/ directory which were in SVN but left
310 out of the release tarball.
312 Other minor fixes, including some fixes to the taboo word lists.
314 From Revision 6 to 7 (December 27, 2010)
316 Updated to revision 5.0 of Varcon which corrected many errors,
317 especially in the British and Canadian spelling categories. Also
318 added new spelling categories for the British and Canadian spelling
319 variants and separated them out from the main variant_* categories.
321 Moved Moby names lists (3897male.nam 4946fema.len 21986na.mes) to 95
322 level since they contain too many errors and rare names.
Moved frequency class 0 from Brian Kelk's Wordlist from
level 60 to 70, and also filter it with level 80 due to too many
328 Many other minor fixes.
330 From Revision 5 to 6 (August 10, 2004)
332 Updated to version 4.0 of the 12dicts package.
334 Included the 3esl, 2of4brif, and 5desk list from the new 12dicts
335 package. The 3esl was included in the 40 size, the 2of4brif in the
336 55 size and the 5desk in the 70 size.
338 Removed the Ispell word list as it was a source of too many errors.
339 This eliminated the 65 size.
341 Removed clause 4 from the Ispell copyright with permission of Geoff
344 Updated to version 4.1 of VarCon.
346 Added the "british_z" spelling category which is British using the
349 From Revision 4a to 5 (January 3, 2002)
351 Added variants that were not really spelling variants (such as
352 forwards) back into the main list.
354 Fixed a bug which caused variants of words to incorrectly appear in
355 the non-variant lists.
357 Moved rarely used inflections of a word into higher number lists.
Added other inflections of a word based on the following criteria:
360 If the word is in the base form: only include that word.
361 If the word is in a plural form: include the base word and the plural
362 If the word is a verb form (other than plural): include all verb forms
363 If the word is an ad* form: include all ad* forms
364 If the word is in a possessive form: also include the non-possessive
366 Updated to the latest version of many of the source dictionaries.
368 Removed the DEC Word List due to the questionable licence and
369 because removing it will not seriously decrease the quality of SCOWL
370 (there are a few less proper names).
372 From Revision 4 to 4a (April 4, 2001)
Reran the scripts on a newer version of AGID (3a) which fixes a bug
which caused some common words to be improperly marked as variants.
377 From Revision 3 to 4 (January 28, 2001)
379 Split the variant "spelling category" up into 3 different levels.
381 Added words in the Ispell word list at the 65 level.
383 Other changes due to using more recent versions of various sources
384 included a more accurate version of AGID thanks to the work of
387 From Revision 2 to 3 (August 18, 2000)
389 Renamed special-unix-terms to special-hacker and added a large
390 number of commonly used words within the hacker (not cracker)
393 Added a couple more signature words including "newbie".
395 Minor changes due to changes in the inflection database.
397 From Revision 1 to 2 (August 5, 2000)
Moved the male and female name lists from the mwords package and the
DEC name lists from the 50 level to the 60 level and moved Alan's
401 name list from the 60 level to the 50 level. Also added the top
402 1000 male, female, and last names from the 1990 Census report to the
403 50 level. This reduced the number of names in the 50 level from
406 Added a large number of Uppercase words to the 50 level.
408 Properly accented the possessive form of some words.
410 Minor other changes due to changes in my raw data files which have
411 not been released yet. Email if you are interested in these files.
413 COPYRIGHT, SOURCES, and CREDITS:
415 The collective work is Copyright 2000-2016 by Kevin Atkinson as well
416 as any of the copyrights mentioned below:
418 Copyright 2000-2016 by Kevin Atkinson
420 Permission to use, copy, modify, distribute and sell these word
421 lists, the associated scripts, the output created from the scripts,
422 and its documentation for any purpose is hereby granted without fee,
423 provided that the above copyright notice appears in all copies and
424 that both that copyright notice and this permission notice appear in
425 supporting documentation. Kevin Atkinson makes no representations
426 about the suitability of this array for any purpose. It is provided
427 "as is" without express or implied warranty.
429 Alan Beale <biljir@pobox.com> also deserves special credit as he has,
430 in addition to providing the 12Dicts package and being a major
431 contributor to the ENABLE word list, given me an incredible amount of
432 feedback and created a number of special lists (those found in the
433 Supplement) in order to help improve the overall quality of SCOWL.
The 10 level includes the 1000 most common English words (according to
the Moby (TM) Words II [MWords] package), a subset of the 1000 most
common words on the Internet (again, according to Moby Words II), and
frequency class 16 from Brian Kelk's "UK English Wordlist
with Frequency Classification".
441 The MWords package was explicitly placed in the public domain:
443 The Moby lexicon project is complete and has
444 been place into the public domain. Use, sell,
445 rework, excerpt and use in any way on any platform.
447 Placing this material on internal or public servers is
448 also encouraged. The compiler is not aware of any
449 export restrictions so freely distribute world-wide.
451 You can verify the public domain status by contacting
455 Arcata, CA 95521-4884
The "UK English Wordlist With Frequency Classification" is also in the
Public Domain:
463 Date: Sat, 08 Jul 2000 20:27:21 +0100
464 From: Brian Kelk <Brian.Kelk@cl.cam.ac.uk>
466 > I was wondering what the copyright status of your "UK English
467 > Wordlist With Frequency Classification" word list as it seems to
468 > be lacking any copyright notice.
470 There were many many sources in total, but any text marked
471 "copyright" was avoided. Locally-written documentation was one
472 source. An earlier version of the list resided in a filespace called
473 PUBLIC on the University mainframe, because it was considered public
476 Date: Tue, 11 Jul 2000 19:31:34 +0100
478 > So are you saying your word list is also in the public domain?
480 That is the intention.
482 The 20 level includes frequency classes 7-15 from Brian's word list.
484 The 35 level includes frequency classes 2-6 and words appearing in at
485 least 11 of 12 dictionaries as indicated in the 12Dicts package. All
486 words from the 12Dicts package have had likely inflections added via
487 my inflection database.
489 The 12Dicts package and Supplement is in the Public Domain.
491 The WordNet database, which was used in the creation of the
492 Inflections database, is under the following copyright:
494 This software and database is being provided to you, the LICENSEE,
495 by Princeton University under the following license. By obtaining,
496 using and/or copying this software and database, you agree that you
497 have read, understood, and will comply with these terms and
500 Permission to use, copy, modify and distribute this software and
501 database and its documentation for any purpose and without fee or
502 royalty is hereby granted, provided that you agree to comply with
503 the following copyright notice and statements, including the
504 disclaimer, and that the same appear on ALL copies of the software,
505 database and documentation, including modifications that you make
506 for internal use or for distribution.
508 WordNet 1.6 Copyright 1997 by Princeton University. All rights
511 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
512 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
513 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
514 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
515 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE
516 LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY
517 THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
519 The name of Princeton University or Princeton may not be used in
520 advertising or publicity pertaining to distribution of the software
521 and/or database. Title to copyright in this software, database and
522 any associated documentation shall at all times remain with
523 Princeton University and LICENSEE agrees to preserve same.
The 40 level includes words from Alan's 3esl list found in version 4.0
of his 12dicts package. Like his other stuff the 3esl list is also in the
public domain.
529 The 50 level includes Brian's frequency class 1, words appearing
530 in at least 5 of 12 of the dictionaries as indicated in the 12Dicts
531 package, and uppercase words in at least 4 of the previous 12
dictionaries. A decent number of proper names is also included: the
top 1000 male, female, and last names from the 1990 Census report; a
534 list of names sent to me by Alan Beale; and a few names that I added
535 myself. Finally a small list of abbreviations not commonly found in
536 other word lists is included.
The name files from the Census report are government documents, which I
don't think can be copyrighted.
The file special-jargon.50 uses common.lst and word.lst from the
"Unofficial Jargon File Word Lists", which are derived from "The Jargon
File". All of these are in the Public Domain. This file also contains
544 a few extra UNIX terms which are found in the file "unix-terms" in the
547 The 55 level includes words from Alan's 2of4brif list found in version
548 4.0 of his 12dicts package. Like his other stuff the 2of4brif is also
549 in the public domain.
551 The 60 level includes all words appearing in at least 2 of the 12
552 dictionaries as indicated by the 12Dicts package.
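
The 12Dicts thresholds quoted for the 35, 50 and 60 levels (at least 11,
5 and 2 of the 12 dictionaries) can be summarized as a lookup. This Python
sketch considers the 12Dicts dictionary count alone; the real levels also
draw on the other sources described in this section, and the function name
level_from_dictionary_count is hypothetical.

```python
def level_from_dictionary_count(count):
    """Lowest level whose 12Dicts threshold the count satisfies, or None."""
    if not 0 <= count <= 12:
        raise ValueError("count must be between 0 and 12")
    # (minimum number of dictionaries, SCOWL level), strictest first.
    for minimum, level in ((11, 35), (5, 50), (2, 60)):
        if count >= minimum:
            return level
    return None
```

A word found in 10 of the 12 dictionaries would land at level 50 by this
rule, and a word found in only one dictionary gets no level from 12Dicts.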
554 The 70 level includes Brian's frequency class 0 and the 74,550 common
555 dictionary words from the MWords package. The common dictionary words,
556 like those from the 12Dicts package, have had all likely inflections
added. The 70 level also includes the 5desk list from version 4.0 of
the 12Dicts package, which is in the public domain.
560 The 80 level includes the ENABLE word list, all the lists in the
561 ENABLE supplement package (except for ABLE), the "UK Advanced Cryptics
562 Dictionary" (UKACD), the list of signature words from the YAWL package,
563 and the 10,196 places list from the MWords package.
The ENABLE package, maintained by M\Cooper <thegrendel@theriver.com>,
is in the Public Domain:
568 The ENABLE master word list, WORD.LST, is herewith formally released
569 into the Public Domain. Anyone is free to use it or distribute it in
570 any manner they see fit. No fee or registration is required for its
571 use nor are "contributions" solicited (if you feel you absolutely
572 must contribute something for your own peace of mind, the authors of
573 the ENABLE list ask that you make a donation on their behalf to your
574 favorite charity). This word list is our gift to the Scrabble
575 community, as an alternate to "official" word lists. Game designers
576 may feel free to incorporate the WORD.LST into their games. Please
577 mention the source and credit us as originators of the list. Note
578 that if you, as a game designer, use the WORD.LST in your product,
579 you may still copyright and protect your product, but you may *not*
580 legally copyright or in any way restrict redistribution of the
581 WORD.LST portion of your product. This *may* under law restrict your
582 rights to restrict your users' rights, but that is only fair.
UKACD, by J Ross Beresford <ross@bryson.demon.co.uk>, is under the
following copyright:
587 Copyright (c) J Ross Beresford 1993-1999. All Rights Reserved.
589 The following restriction is placed on the use of this publication:
590 if The UK Advanced Cryptics Dictionary is used in a software package
591 or redistributed in any form, the copyright notice must be
592 prominently displayed and the text of this document must be included
595 There are no other restrictions: I would like to see the list
596 distributed as widely as possible.
598 The 95 level includes the 354,984 single words, 256,772 compound
599 words, 4,946 female names and the 3,897 male names, and 21,986 names
600 from the MWords package, ABLE.LST from the ENABLE Supplement, and some
601 additional words found in my part-of-speech database that were not
604 Accent information was taken from UKACD.
The VarCon package was used to create the American, British, Canadian,
and Australian word lists. It is under the following copyright:
609 Copyright 2000-2016 by Kevin Atkinson
611 Permission to use, copy, modify, distribute and sell this array, the
612 associated software, and its documentation for any purpose is hereby
613 granted without fee, provided that the above copyright notice appears
614 in all copies and that both that copyright notice and this permission
615 notice appear in supporting documentation. Kevin Atkinson makes no
616 representations about the suitability of this array for any
617 purpose. It is provided "as is" without express or implied warranty.
619 Copyright 2016 by Benjamin Titze
621 Permission to use, copy, modify, distribute and sell this array, the
622 associated software, and its documentation for any purpose is hereby
623 granted without fee, provided that the above copyright notice appears
624 in all copies and that both that copyright notice and this permission
625 notice appear in supporting documentation. Benjamin Titze makes no
626 representations about the suitability of this array for any
627 purpose. It is provided "as is" without express or implied warranty.
Since the original word lists come from the Ispell distribution:
631 Copyright 1993, Geoff Kuenning, Granada Hills, CA
634 Redistribution and use in source and binary forms, with or without
635 modification, are permitted provided that the following conditions
638 1. Redistributions of source code must retain the above copyright
639 notice, this list of conditions and the following disclaimer.
640 2. Redistributions in binary form must reproduce the above copyright
641 notice, this list of conditions and the following disclaimer in the
642 documentation and/or other materials provided with the distribution.
643 3. All modifications to the source code must be clearly marked as
644 such. Binary redistributions based on modified source code
645 must be clearly marked as modified versions in the documentation
646 and/or other materials provided with the distribution.
647 (clause 4 removed with permission from Geoff Kuenning)
648 5. The name of Geoff Kuenning may not be used to endorse or promote
649 products derived from this software without specific prior
652 THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND
653 ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
654 IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
655 ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE
656 FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
657 DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
658 OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
659 HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
660 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
661 OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
665 The variant word lists were created from a list of variants found in
666 the 12dicts supplement package as well as a list of variants I created
669 The Readmes for the various packages used can be found in the
670 appropriate directory under the r/ directory.
The process of "sort"s, "comm"s, and Perl scripts to combine the many
word lists and separate out the variant information is inexact and
error prone. The whole thing needs to be rewritten to deal with
words in terms of lemmas. When the exact lemma is not known a best
guess should be made. I'm not sure what form this should be in. I
originally thought this should be some sort of database, but maybe I
should just slurp all that data into memory and process it in one
giant Perl script. With the amount of memory available these days (at
least 2 GB, often 4 GB or more) this should not really be a problem.
In addition, there is a very nice frequency analysis of the BNC corpus
done by Adam Kilgarriff. Unlike Brian's word lists the BNC lists
include part of speech information. I plan on somehow using these
lists as Adam Kilgarriff has given me the OK to use them in SCOWL.
688 These lists will greatly reduce the problem of inflected forms of a
689 word appearing at different levels due to the part-of-speech
There is frequency information for some other corpora such as COCA
(Corpus of Contemporary American English) and ANC (American National
Corpus) which I might also be able to use. The former will require
permission, and the latter is of questionable quality.
697 RECREATING THE WORD LISTS:
In order to recreate the word lists you need a modern version of Perl,
bash, the traditional set of shell utilities, a system that supports
symbolic links, and quite possibly GNU Make. The easiest way to
recreate the word lists is to check out the corresponding Git version
(see the version string at the start of the file) and simply type
"make" (see http://wordlist.aspell.net). You can try to download all
the pieces manually, but this method is no longer tested nor
708 The src/ directory contains the numerous scripts used in the creation
709 of the final product.
711 The r/ directory contains the raw data used to create the final
product. If you check out from Git this directory should be populated
713 automatically for you. If you insist on doing it the hard way see the
714 README file in the r/ directory for more information.
716 The l/ directory contains symbolic links used by the actual scripts.
Finally, the working/ directory is where all the intermediate files go
that are not specific to one source.