X-Git-Url: https://git.donarmstrong.com/?a=blobdiff_plain;ds=sidebyside;f=6%2Fr%2Falt12dicts%2FREADME-orig;fp=6%2Fr%2Falt12dicts%2FREADME-orig;h=7bb518ad4f734cc9a4a9976d30a7bad3ce439de0;hb=d5cbd5b855d8157fe44cd0979c2e517b93fb5004;hp=0000000000000000000000000000000000000000;hpb=0ba1587a200ebb44aea6355ee9330a720e3ecde2;p=deb_pkgs%2Fscowl.git

diff --git a/6/r/alt12dicts/README-orig b/6/r/alt12dicts/README-orig
new file mode 100644
index 0000000..7bb518a
--- /dev/null
+++ b/6/r/alt12dicts/README-orig
@@ -0,0 +1,644 @@
+Introduction
+
+12dicts is a collection of English word lists. It differs in several
+important ways from most of the other free word lists you can download.
+
+  * The 12dicts lists are oriented towards common words. If you're
+    looking for myriads of archaic, scientific or computer jargon words,
+    you should look elsewhere.
+  * The 12dicts lists have been rigorously checked for errors. (This is
+    not to say that they are error-free, merely that enough care has
+    been taken that errors are rather infrequent.)
+  * 12dicts contains a variety of lists, of different sizes and
+    characteristics. One size does not fit all. Because each list has
+    different characteristics, I do not recommend combining them, except
+    as noted below.
+
+Originally, 12dicts was composed of lists derived from a specific set of
+12 source dictionaries. In addition to these "classic" lists, 12dicts
+now includes lists derived from other sources. It would perhaps be
+appropriate to rename 12dicts to something more generic, such as BAWL
+(Beale's Assorted Word Lists), but I have not done so in order to
+preserve continuity.
+
+A quick summary of the 12dicts lists and their characteristics is as
+follows:
+
++---------------------------------------------------------+
+|               |3esl |6of12|2of12|2of4brif|5desk|2of12inf|
+|---------------+-----+-----+-----+--------+-----+--------|
+|Size           |21877|32153|41236|60387   |61406|81520   |
+|---------------+-----+-----+-----+--------+-----+--------|
+|Abbreviations  |Y    |Y    |N    |N       |N    |N       |
+|---------------+-----+-----+-----+--------+-----+--------|
+|Acronyms       |Y    |Y    |N    |N       |Y    |N       |
+|---------------+-----+-----+-----+--------+-----+--------|
+|British English|N    |N    |N    |Y       |N    |N       |
+|---------------+-----+-----+-----+--------+-----+--------|
+|Hyphenations   |Y    |Y    |Y    |N       |N    |N       |
+|---------------+-----+-----+-----+--------+-----+--------|
+|Inflections    |N    |N    |N    |Y       |N    |Y       |
+|---------------+-----+-----+-----+--------+-----+--------|
+|Names          |Y    |Y    |N    |N       |Y    |N       |
+|---------------+-----+-----+-----+--------+-----+--------|
+|Phrases        |Y    |Y    |N    |N       |N    |N       |
++---------------------------------------------------------+
+
+The remainder of this document is organized as follows:
+
+  * This release
+  * The classic 12dicts lists
+       + The 6of12 and 2of12 lists
+       + The 2of12inf list
+  * The 3esl list
+  * The 2of4brif list
+  * The 5desk list
+  * How 12dicts came to be
+  * Conclusions
+
+This release
+
+This is release 4.0 of 12dicts, released Jan. 18, 2003. It differs from
+previous versions by containing three additional lists which are not
+derived from the "classic" 12dicts sources. Changes to the classic lists
+are limited to error corrections.
+
+The classic 12dicts lists
+
+The 12dicts project began as the n-dicts project, n being a variable
+whose value finally stabilized as 12. The purpose of the project was to
+create a list of words approximating the common core of the vocabulary
+of American English.
+
+The methodology of the project was to record and correlate the words
+listed in a number of small dictionaries.
The number of dictionaries so +recorded is now 12, comprising 8 ESL (English as a Second Language) +dictionaries and 4 "desk dictionaries". The dictionaries chosen vary +widely by publisher, by style, by completeness and by depth. In this +version of 12dicts, all of them are dictionaries of American English +(three from British publishers). The smallest of them contains about +20,000 entries, and the largest 46,000. (All totaled, there are about +75,000 entries, many of which appear in only a single dictionary.) All +but two of them were published in the last seven years. + +The 6of12 and 2of12 lists + +I initially tried two different ways of winnowing the 12dicts data to +produce lists of common words. Both produced interesting results. One +list, the 6of12 list, contains all words and phrases listed in 6 of the +12 dictionaries. One way of describing this list is that it contains +those words and phrases which a (seeming) majority of lexicographers +believe are relevant to people learning English, and/or to everyday +usage. This list contains about 32,000 words and phrases. The other +list, the 2of12 list, is more inclusive in that it includes words listed +in as few as two of the source dictionaries, but less inclusive in that +it excludes items of various sorts, including multiword phrases, proper +names and abbreviations. This list contains about 41,000 words. It is +perhaps more suitable for use in areas like spell checking or word games +than the 6of12 list. (Honesty compels me to admit that neither of these +lists is, by itself, a good choice for spell checking, due to the +absence of inflections, proper names, Roman numerals, etc.) + +A third list, 2of12inf.txt, developed later, is of a rather different +character, and is discussed in a later section. 
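
As a rough illustration of the counting step behind the 6of12 and 2of12 lists, the sketch below keeps every word that is listed by at least a given number of sources. It is not the tooling used to build 12dicts: the function name `words_in_at_least` and the toy source data are assumptions made for this example, and the real lists also apply the exclusions and hand edits described in the sections that follow.

```python
from collections import Counter


def words_in_at_least(dictionaries, threshold):
    """Return the sorted words listed by at least `threshold` sources.

    `dictionaries` maps a (hypothetical) source name to an iterable of
    its headwords. Each source counts at most once per word.
    """
    counts = Counter()
    for headwords in dictionaries.values():
        counts.update(set(headwords))  # set() guards against repeats within one source
    return sorted(word for word, n in counts.items() if n >= threshold)


# Toy data standing in for the source dictionaries (purely illustrative).
toy_sources = {
    "source_a": ["cat", "dog", "ogee"],
    "source_b": ["cat", "dog"],
    "source_c": ["cat"],
}

common = words_in_at_least(toy_sources, 2)    # words cited by 2 or more sources
stricter = words_in_at_least(toy_sources, 3)  # words cited by all 3 sources
```

With twelve real sources, a threshold of 6 or 2 would correspond to the starting point for the 6of12 and 2of12 lists respectively, before any exclusions or signature words are applied.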
+ +A more precise description of the criteria by which the above lists were +composed is as follows: + +6of12 list word selection + + * The 6of12 list contains all non-excluded words and phrases which + appear in 6 or more of the source dictionaries. + * Prefixes and suffixes are excluded. Abbreviations are included; + however, if they are entirely lower-case and alphabetic, they are + terminated with a colon (":") so they can be easily distinguished + from regular words. + * Inflections of included words are not themselves included unless + they are separately defined or irregular. + * It sometimes occurs that a word is listed in several forms (e.g., + with and without hyphenation) in 6 or more dictionaries, even though + no single form is so listed. In this case, if one spelling is + clearly more accepted, this spelling and this spelling only is + listed. If all spellings seem equally accepted, one spelling has + been selected arbitrarily for inclusion. + * The 6of12 list contains a significant number of words which do not + meet either criterion 1 or 4 above. These words, sometimes called + "signature words", are discussed below. All of these words are + listed in at least one of the source dictionaries. + * In addition to the ":" suffix discussed above, other special suffix + characters are used to mark words with certain characteristics, as + discussed below. + +2of12 list word selection + + * The 2of12 list contains all non-excluded words which appear in at + least 2 of the source dictionaries. + * This list excludes capitalized words, multiword phrases, and + abbreviations, as well as prefixes and suffixes. It does not exclude + hyphenated words or contractions. If a word occurs in both a + hyphenated and an unhyphenated form, the unhyphenated form is + listed, even if the hyphenated form is generally preferred. + * The list excludes spellings which are considered (by a majority of + the dictionaries listing it) to be non-American usage. 
It also + excludes secondary spellings which are mentioned by fewer than four + of the source dictionaries. + * Inflections of included words are not themselves included unless + they are separately defined, or irregular. + * Several of the source dictionaries include listings for obscure + currencies, such as ringgit, khoum and ngwee. I was unable to regard + such words as part of the English "core vocabulary", and so I + required citation in over a third of the dictionaries for inclusion + of monetary units. A side-effect was the elimination of the word + lepton, which, in addition to its use in particle physics, is also + .01 Greek drachmas. + * This list also includes a small number of signature words, as + discussed below. + +Signature words + +As indicated, both lists have been augmented with words (and, in the +case of the 6of12 list, phrases) which fail to meet the formal +requirements for inclusion. In the case of the 6of12 list, 1024 words +were added (about 3 % of the total). These are all words which, in the +judgment of the compiler, are as familiar as many of the words which met +the criteria for inclusion. Examples of some of the sorts of words which +were added are: + + * Words of the same category as other included words. An example is + the astrological sign Cancer, which alone of all the astrological + signs fails to appear in 6 or more of the dictionaries. Similarly + added were the omitted holidays Thanksgiving and Christmas Eve. + * Vulgarities, sexual terms and insults. Some such words were already + included, but most of the source dictionaries were quite squeamish + about them. These words are very widely known indeed; I hold that + any list of "common" words which does not include the infamous + f-word is simply discredited thereby. Some may feel that it would + have been better to leave some or all of these terms unmentioned. 
+ Nevertheless, the expression of blasphemy, unwarranted contempt and + perverse lust, whether in words or in deeds, is a very human trait. + Suppressing the evidence of these aspects of the human condition in + our language makes no more sense than excluding leprosy, gangrene + and dementia, no matter how unpleasant they may be to contemplate. + * Conventional conversational phrases so common as to be practically + invisible to native speakers. Examples are thank you, good night, + uh-huh, of course and gesundheit. + * Sports terminology, especially for football and baseball. (If I, who + am practically sports-blind, noticed this deficiency, it must be of + major proportions indeed.) + +Note that the signature words in the 6of12 list can be identified via +the suffix character "+", and eliminated if desired. + +A much smaller set of words (49) was added to the 2of12 list. These were +of two sorts: + + * Signature words from the 6of12 list which were not already present + in the 2of12 list, and which are not excluded due to being + abbreviations, phrases, etc. + * Inflections of irregular verbs not explicitly mentioned in 2 source + dictionaries, such as outfought and reheard. + +Annotations + +Some of the 6of12 list entries are annotated with a suffix character, +giving additional information about the associated word. The annotations +can be easily removed with an editor or script if they are unwanted. + +These annotations are: + +: The word is an otherwise unmarked abbreviation. This suffix may appear + in combination with another suffix. +& The word is primarily a non-American usage. +# The word is generally held to be a variant or less preferred form of + another word. +< This form of a word is held to be the primary form by fewer + dictionaries than some other form of the word. +^ This form of the word was selected arbitrarily from a set of variants, + none of which was clearly preferred. += Roughly, this indicates a "second class" word, as described below. 
++ The word is a signature word. + +The reasons a word might be marked with the = annotation are: + + * The word is an inflection which was defined in the same entry as the + base word. + * The word is a derived word (-ly, -ness or -er/or) which was not + defined in a separate entry. + * The word appeared in a list of undefined words with a common prefix, + such as un- or re-. + +The words in the 2of12 list are not annotated. + +The 2of12inf list + +The 2of12inf list is of a rather different character from the two +original "classic" lists. Conceptually, it is simple. It consists of all +the words in the 2of12 list, plus their inflections, amounting to about +81,000 words. This list may be more useful than the other lists for +applications like word games. It was created to help Kevin Atkinson in +his Aspell and SCOWL projects (for which, follow this link). Unlike the +6of12 and 2of12 lists, this list is not based exclusively on the +contents of my 12 source dictionaries, and for this reason it has, I +feel, less authority than the other classic 12dicts lists. It also +probably has a significantly higher error rate than the other lists, for +reasons explained below. + +The criteria defining the 2of12inf list are as follows: + + * The 2of12inf list contains all non-excluded words which appear in at + least 2 of the source dictionaries. + * This list excludes capitalized words, multiword phrases, + abbreviations, contractions, hyphenated words and single-letter + words, as well as prefixes and suffixes. + * The list does not exclude secondary spellings, non-American usages + or monetary units. + * The list includes inflections of all included words. Any inflection + mentioned or clearly implied by any of the source dictionaries is + included (i.e., two citations are not required). Additionally, some + inflections have been added from other sources. + * Plurals of "uncountable" nouns were included, annotated with the "%" + suffix character. 
See below for an extended discussion of the + inclusion of these words. + * Signature words from the other lists, plus their inflections, were + added. No other signature words were added. + +Though the 2of12inf list still consists mostly of very common words, +criteria 3 through 5 above cause the 2of12inf list to contain a greater +proportion of unfamiliar and unusual words than the other classic +12dicts lists. + +The 2of12inf list was not derived directly from the 12 source +dictionaries. The starting point was a subset of Kevin Atkinson's AGID +list, a list of words, parts of speech and inflections derived from +public-domain sources, notably Moby Words and WordNet. (See the file +agid.txt in the 12dicts archive, which is a copy of the AGID "readme", +for more information on the antecedents of AGID.) 2of12inf was created +by a process of editing the AGID subset to remove spurious entries and +those which reflected a more esoteric English vocabulary than the other +12dicts lists, and to add inflections which AGID failed to identify. +This process required significantly less effort than would have been +needed to derive the list directly from the source dictionaries. +Unfortunately, a side effect of the process is that the result is likely +to be somewhat less reliable than the other 12dicts lists. In +particular, Moby Words is notoriously unreliable, and I find it unlikely +that I have successfully identified all the spurious inflections its use +has introduced. It is my hope in the future to release another edition +of 2of12inf which is not derived from AGID, and therefore not "infected" +by Moby Words. + +After the first version of the 2of12inf list was released, I replaced +one of the source dictionaries, officially an international dictionary +but in actuality rather British in its orientation, with a more American +dictionary by the same publisher. 
It was not practical (nor necessarily +desirable) for me to go through the list removing inflections endorsed +only by the superseded dictionary. For this reason, the 2of12inf list +has a slightly more international character than the other 12dicts +lists. It is not altogether clear that this is a bad thing. + +Selection of inflections + +Ideally, the 2of12inf list would contain only inflections listed in one +of the 12dicts source dictionaries. This proved not to be practical. The +reason for this has to do with the nature of these sources, which are +mostly ESL dictionaries. An ESL dictionary might well list the word +esophagus, but, because an English learner is unlikely to need to talk +about this organ in the plural, it will probably not bother to list the +plural form esophagi. For words of this sort, I therefore needed to +obtain their inflections from other sources. Obviously, the decisions on +when to include additional inflections were judgment calls, as were the +choices of which inflections to add. + +Adjectival inflections (comparatives and superlatives) proved to be an +especially annoying problem. Only 2 of my 12 source dictionaries +provided remotely reliable information of this sort. In fact, such +information is sparse and inconsistent in most dictionaries of any size. +I relied on a small set of additional dictionaries for this information, +which was mostly disjoint from the sources for plurals and verb forms. +Several of these sources were Scrabble(r)-related, and therefore +inclined to include forms of little plausibility such as iller/illest or +fertiler/fertilest. Accordingly, I ended up rejecting some of the +documented inflections on grounds of implausibility. I have no doubt +that, in the process, I made a number of errors of both inclusion and +exclusion and, in any case, many of the forms listed have no connection +with any of the 12dicts source dictionaries. 
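
The suffix annotations introduced in the sections above (":", "&", "#", "<", "^", "=" and "+" in the 6of12 list, "%" in the 2of12inf list) lend themselves to a small script. The following is a minimal sketch, assuming one entry per line and assuming that none of these characters ever occurs as the legitimate last character of a word. The function names and the sample entries are made up for the example and are not taken from the 12dicts files.

```python
# Annotation characters described in this document; treating them as a
# trailing run of suffixes is an assumption of this sketch.
ANNOTATION_CHARS = ":&#<^=+%"


def split_entry(line):
    """Split one list entry into (word, annotation suffixes)."""
    entry = line.strip()
    word = entry.rstrip(ANNOTATION_CHARS)
    return word, entry[len(word):]


def filter_entries(lines, drop=""):
    """Return bare words, skipping entries that carry any mark in `drop`."""
    kept = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        word, marks = split_entry(line)
        if any(ch in marks for ch in drop):
            continue
        kept.append(word)
    return kept


# Hypothetical sample entries, for illustration only.
sample = ["cat", "touchdown+", "colour&", "muds%"]
plain = filter_entries(sample)                # strip all marks, keep every entry
trimmed = filter_entries(sample, drop="+%")   # also drop signature words and "%" plurals
```

Passing `drop="%"` alone would remove only the marked plurals of uncountable nouns from a 2of12inf-style list while keeping everything else.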
+ +One additional problem in the creation of the 2of12inf list was that of +"uncountable" nouns and their plurals. Some English dictionaries, +especially ESL dictionaries, as well as other linguistic sources attest +to the existence of nouns which cannot be counted, or used in the +plural. Examples of such nouns include mud, rayon, oregano, chess, +fairness, wisdom, aluminum, training, materialism and chickenpox. This +is an entirely commonsense notion, but a difficulty is the fact that the +boundary between the countable and the uncountable is extremely vague +and ill-defined. For example, the word coffee is ordinarily uncountable, +but not when ordering in a restaurant, as is the word symmetry, except +in physics or math. In general, it is possible to contrive a context +where use of the plural of any noun whatsoever is reasonable. + +An alternate position, therefore, is that in fact no nouns are +uncountable, and that any noun which is not already plural possesses a +plural. This position is especially useful in the context of word games, +where words such as zeals and anthraxes may produce large scores. For +this reason, the official Scrabble dictionaries list words such as +thens, onces and mankinds, which most people find rather implausible. +The fact that the 2of12inf list might well be useful in gaming contexts, +together with the fact that the boundary between countable and +uncountable nouns is so ill-defined, served as a powerful argument for +inclusion of all plural forms, whether commonly used or not, while its +derivation from ESL sources argued for including only the plurals of +countable nouns, however distinguished. + +In the end, I was unable to resolve this dilemma, and adopted a +compromise. The 2of12inf list includes all plurals, but with the plurals +of uncountable nouns marked, making it easy to remove them if they are +not wanted. That left the issue of how to establish countability. 
Six of
+my source dictionaries included information on countability, which was
+adequate to decide the status of most of the included nouns. As for the
+rest, as usual, I used my best judgment. I will confess to occasionally
+overriding the source dictionaries when I believed they were clearly
+incorrect. (For instance, I chose not to mark the word hatreds as an
+uncountable plural, in defiance of the opinion of all my sources, on the
+grounds that it has been used in too many news stories from Bosnia to be
+considered unusual.) It is interesting to note that most of the plurals
+I added from auxiliary sources were of words considered uncountable.
+
+The difficulties listed above, and the fact that I was forced to
+exercise personal judgment frequently in creating it, emphasize a
+fundamental difference between this list and the other classic 12dicts
+lists. I have tried to make the 6of12 and 2of12 lists reflect only the
+source dictionaries, and to keep my own judgments and opinions out of
+the picture (except for my addition of signature words). This has proved
+impossible to achieve for the 2of12inf list, which accordingly
+represents a less authoritative and more arbitrary collection.
+Additionally, the 2of12inf list has undergone less proofreading and
+validation than the other lists, and I suspect the error rate is
+considerably higher than the idealistic goal of 0.02 % I advocate
+elsewhere in this document. Nevertheless, I hope it may prove to be of
+some use and interest.
+
+I wish to offer my special thanks to Kevin Atkinson, for supplying me
+with the AGID list, and for encouraging me to add the inflections. Of
+course, any errors that remain in the 2of12inf list are my own
+responsibility, and should not be blamed on Kevin, AGID, or even on
+Moby.
+
+The 3esl list
+
+The 3esl list represents another attempt to produce an English "core
+vocabulary" list.
It is about 2/3 of the size of the 6of12 list, which +it resembles in terms of the sorts of words included. + +The 3esl list is a far more subjective list than any of the classic +12dicts lists. It was compiled from 3 small ESL dictionaries, using the +same criteria for eligibility as the 6of12 list. I started with a list +composed of all words from the smallest of the 3 sources, plus all words +contained in both of the others. This list was then edited in the +following ways: + + 1. I removed alternate spellings for included words, such as grey and + off-stage. I also removed very similar synonyms for the same + concept, for instance, removing cable television as a duplicate of + cable TV. + 2. I added one form of each word which would have been included if the + sources had agreed on spelling, such as shortchange and back seat. + 3. I removed some words which were present in the smallest of the + sources but seemed too esoteric, such as the symbols of chemical + elements. I did this only for words which were not present in the + other sources. + 4. I added some words which were present in only one of the two larger + sources, but which seemed appropriate to add. These words were + frequently of the sort added to the 6of12 list as signature words, + as well as some inflections that often function as words with + meanings of their own, such as comforting and notes. + +All of these changes were quite subjective in nature, and quite +numerous. Probably more than 10 % of the candidate words were added or +removed in this way. For this reason, it is pointless to speak of +signature words for this list; the composition of the list is too +arbitrary for the term to make any sense. (I will note that the list is +still not entirely arbitrary, as I added only words found in some form +in one of the sources, and removed no words present in two of the +sources other than duplicates. 
Thus, words like front page were not
+added, no matter how familiar, and words such as lugubrious were not
+removed, despite clearly not being part of any "core vocabulary".)
+
+Like the 6of12 list, the 3esl list marks lower-case abbreviations with a
+":" suffix, to prevent them from being mistaken for regular English
+words.
+
+One final note on this list. The 3esl list contains about 1500 words not
+present in the 6of12 list. Because these two lists have the same rules
+for the kinds of words included, one could easily combine the two to
+produce a slightly larger list including a number of words whose
+omission from 6of12 is rather surprising. Be warned that in a few cases,
+the spelling chosen for words with multiple spellings is different in
+the two lists, and I would recommend that the duplicates be removed.
+(I'll be happy to provide a list of the duplicates if anyone wants one.)
+
+The 2of4brif list
+
+All of the classic 12dicts lists are unabashedly oriented towards
+American English. I've received a few expressions of interest in a
+British English list. The result is the 2of4brif list. This list was
+compiled from 4 large "international" ESL dictionaries, published by
+British publishers. To this American, they are more British than they
+are international; quite possibly, they seem more American than
+international to British readers. It is interesting to note that,
+although there were only a third as many sources for this list as for
+the 12dicts lists, these dictionaries resembled each other far more
+closely than their American counterparts, which could mean that the
+2of4brif list is as good an approximation of a "core" British English
+vocabulary as the 6of12 list is for American English. (Or, alternately,
+it may simply mean that my choice of sources was too narrow.)
+
+The criteria for inclusion in this list were basically those of the
+2of12inf list.
In particular, inflections are included for all words, +but hyphenated words, contractions, phrases, proper names and +abbreviations are all excluded. One important difference between the two +is the way in which inflections were determined for inclusion. The +2of12inf list includes some inflections found in one (or even none) of +its sources. Further, as discussed in detail above, it includes plurals +for words which are not normally considered to have plurals. The +2of4brif list differs in both of these regards. It includes only +inflections endorsed by two or more of the sources, specifically +excluding any plural forms for nouns listed as uncountable. + +The 2of4brif list includes no signature words as such. I made a small +number of adjustments for consistency, such as making sure that -ise and +-ize spellings were equally represented, and adding plurals for ordinal +numbers. (Why fourteenth would be defined as a fraction, but not +seventeenth, I must simply regard as a mystery.) These edits were so +few, and so clearly harmless, that I have not marked them. + +Prospective users of the 2of4brif list should realize that it was +compiled by an American. If my sources contained any glaring errors (and +most dictionaries have a few), I might well not have noticed, and +perpetuated them in the list. The fact that two citations were required +is some protection against such an event, but no guarantee. + +As the 2of4brif list is very similar in makeup to the 2of12inf list, a +user who wants a larger, more international list than either could +reasonably merge the two. If you do this, you should remove the unusual +plurals (marked with a "%") from the 2of12inf list in the process, for +consistency. + +The 5desk list + +I created the 5desk list in an attempt to do a better /usr/dict/words +(about which I offer many harsh criticisms elsewhere in this document). +The sorts of words admitted are the same sorts that /usr/dict/words +contains. 
Though somewhat larger in size than most versions of
+/usr/dict/words, this is still a short word list, striving for inclusion
+of words one is likely to encounter rather than the complete jargon of
+every possible scientific, artistic or occult endeavor.
+
+5desk was assembled primarily from five "desk dictionaries". It was
+augmented by words from five minor sources, including a "vocabulary
+builder" and a collection of proper names. The list excludes prefixes,
+suffixes, phrases, hyphenated words, contractions and most abbreviations
+and acronyms. There was no requirement for multiple listings; all
+qualifying words from each of the sources were included. Inflections of
+included words were not included themselves except when irregular, or
+separately defined. Variant and non-American spellings were not
+excluded, and no signature words were added.
+
+Words commonly considered to be abbreviations/acronyms were included if
+they contained at least one upper case character, and were defined with
+an explicit part of speech. This excluded items like Mr and Feb, which
+are abbreviations in the classic sense, but allowed words like DNA and
+ATM, which are used far more frequently than that which they abbreviate.
+While there is a trend in modern dictionaries to list such words as
+nouns (or occasionally verbs, adverbs, etc.), it is a trend in progress,
+and rather inconsistently applied. For this reason, the set of such
+words in the 5desk list is somewhat incoherent, including SPCA but not
+PETA, AIDS but not SIDS, KGB but not CIA, and PDQ but not ASAP.
+
+One class of commonly-used words is regrettably absent from the 5desk
+list, because I was unable to find a satisfactory source for them. This
+is the class of commercial names such as Exxon, Tylenol, Pepsi and
+Chevy. This is probably forgivable, as this class of names is as
+ephemeral and transitory as teenage slang.
The one-time household words Kool,
+Ovaltine, Philco and Ipana serve now only as answers to trivia
+questions, with modern wonders like Starbucks, Google, Ritalin and TiVo
+taking their place on the tongues of the trendy.
+
+The 5desk list has clearly moved beyond any "core vocabulary" concept.
+It includes quite esoteric words (ogee, pleonastic), very uncommon
+spellings (thiamine, yuppy), and obscure geographical and historical
+names (Paricutin, Nevelson). Like /usr/dict/words, it is frequently
+inconsistent and arbitrary, but I hope at the least I have avoided
+including spelling errors, and overlooking the stuff of everyday
+conversation. Perhaps it will be useful as a compromise between basic
+lists such as 3esl, and truly massive lists like Mendel Cooper's ENABLE.
+
+How 12dicts came to be
+
+It may have occurred to some to wonder about how something like the
+n-dicts project came to be (though I assume that anyone who bothers to
+download this archive must already have some idea that such a project
+could be of interest).
+
+Some years ago, there was a post to the sci.crypt Usenet newsgroup, on
+the subject of creating PGP passphrases using randomly selected entries
+from a supplied list of very short words. (If this sounds interesting,
+follow this link for an expanded version of the post.) The word list,
+which was extracted from /usr/dict/words on some UNIX system, seemed to
+me ill-suited to its intended purpose. It included arcane acronyms
+(bstj, fmc), misspellings (diety, ouvre) and words of amazing obscurity
+(bhoy, kombu). I decided I could do better (and eventually did). This
+caused me to start downloading English word lists, of which there are
+many, from the Internet. I was not impressed by the overall quality of
+these lists, and the few which were high-quality were all-inclusive,
+burying the everyday words under a mountain of archaisms and esoterica.
+The flaws of the vast majority of these lists are worth recounting:
+
+  * Failure to proofread. Many of these lists are littered with
+    misspellings and typos, sometimes approaching gibberish. (I presume,
+    for instance, that the bizarre string nondploe, which was found in a
+    purported Scrabble word list, is a typo for something more or less
+    legitimate, but I have no idea what.) Working on my own lists has
+    helped me understand that 100 % accuracy is a very demanding goal,
+    seldom actually achieved, but I still feel it reasonable to expect
+    no more than 1 or 2 errors per 10,000 words.
+  * Acceptance of completely undocumented lazy spellings, such as
+    bullseye and courtmartial.
+  * Failure to respect capitalization.
+  * Failure to distinguish abbreviations from other entries.
+  * Treating esoteric computer jargon, and especially UNIX jargon, as
+    everyday English. (Beware any list which includes bitblt, emacs,
+    inode or lvalue.)
+  * Apparently random word selection. For instance, the most common
+    version of /usr/dict/words contains a large set of apparently
+    randomly chosen personal names (uncapitalized, of course, and
+    missing wanda, marge, polly and sid).
+  * Inconsistent inflection. Some lists include all inflections of their
+    vocabulary, while others include only singulars and infinitives.
+    Either policy is fine, and has its advantages. I am personally very
+    annoyed when inflected forms appear at random. I find this generally
+    happens when a compiler merges several lists with different
+    characteristics, with no attempt to reconcile their divergent
+    styles.
+  * Omission of everyday words. I've seen a purported general-purpose
+    list that includes bremsstrahlung, yet omits log and beer. Or that
+    includes saxophone but not sax, and rhinoceros but not rhino. Of
+    course, due to my original purpose in seeking out common short
+    words, I found this especially annoying.
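
Two of the flaws above can be checked mechanically with a quick smoke test. The sketch below flags the jargon words named in the list as warning signs and reports which of the everyday words named there are missing. It is only an illustration: the function name `audit_word_list` and the choice of probe words (taken from the examples in the bullets above) are assumptions, and passing this check says nothing about the overall quality of a list.

```python
# Probe words taken from the examples in the list of flaws above.
JARGON_WARNING_SIGNS = {"bitblt", "emacs", "inode", "lvalue"}
EVERYDAY_WORDS = {"log", "beer", "sax", "rhino"}


def audit_word_list(words):
    """Report jargon probes that are present and everyday probes that are missing."""
    present = {w.strip().lower() for w in words if w.strip()}
    return {
        "jargon_present": sorted(JARGON_WARNING_SIGNS & present),
        "everyday_missing": sorted(EVERYDAY_WORDS - present),
    }


# A tiny made-up list, for illustration only.
report = audit_word_list(["emacs", "beer", "bremsstrahlung"])
```

A list that yields an empty `jargon_present` and an empty `everyday_missing` has merely avoided these particular examples; a realistic check would need far larger probe sets.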
+ +One result of my frustration with this situation was my working with +Mendel Cooper on ENABLE (for further information, check out this link), +which was close to unique in having an active caretaker, one clearly +concerned with quality, and in being oriented towards American rather +than British English. But ENABLE is an all-encompassing list and, even +if it had been complete at the time I started my search for a list of +common words, it would not have been what I wanted for that reason. + +I finally decided that only starting from scratch with a systematic +approach was likely to get me what I was looking for, and that +dictionaries intended for non-native speakers of English were the best +possible source for words that are in some cases so familiar that we +never think of them. This has led to the 12dicts lists, which I hope +have managed to avoid the flaws recited above. + +(I should acknowledge one form of inconsistency exhibited by the 12dicts +lists, which is that sometimes related words are spelled inconsistently. +For instance, the 2of12 list contains both broadminded and +broad-mindedness. This generally occurs as a result of the methodology +used to build the lists. In the case of broadminded, only one source +dictionary listed broadmindedness, which was therefore excluded. I felt +unequal to trying to correct these inconsistencies, some of which are +real and not mere artifacts of 12dicts, such as the contrast between +self-conscious and unselfconscious.) + +Conclusions + +When I released the first version of 12dicts in 1999, I assumed I was +done with it. It hasn't worked out that way. Before I declare it +finished for a second time, there are a few more things I'd like to +accomplish. + + * As mentioned above, I would like to rework the 2of12inf list to + remove the dependency on the Moby lists. 
+ * As may be seen by inspecting the table of file characteristics, the + 12dicts files now form a spectrum of word lists, with contents + ranging from the extremely common to the mildly esoteric. I would + like to extend the spectrum further by applying the 12dicts + methodology to dictionaries of larger size. Whether I will ever get + the time for a project this large remains to be seen. If it ever + comes to pass, it will probably be released separately from 12dicts + itself, as anything larger than the 5desk list will be too large to + even pretend to represent a "core English" vocabulary. (Even the + 5desk list itself is too large for that purpose.) + * It is possible that in the future the "n" of n-dicts will increase + again, but, in fact, consideration of an additional dictionary now + generally ends with the discovery that its vocabulary matches + 12dicts pretty closely. At the very least, this phenomenon gives me + hope that the 12dicts lists have now fulfilled their basic purpose. + +The 12dicts lists were compiled by Alan Beale. I explicitly release them +to the public domain, but request acknowledgment of their use. +(Actually, the dependency of the 2of12inf list on AGID prevents its +release into the public domain. However, I do not impose any additional +requirements on its use beyond those imposed by AGID and its sources, as +described in agid.txt.) Feel free to send comments, suggestions, +inquiries and/or large sums of money to me at biljir@pobox.com. If you +find 12dicts useful, I'd love to hear about it.