X-Git-Url: https://git.donarmstrong.com/?p=deb_pkgs%2Fscowl.git;a=blobdiff_plain;f=current%2Fr%2Falt12dicts%2FREADME-orig;fp=current%2Fr%2Falt12dicts%2FREADME-orig;h=0000000000000000000000000000000000000000;hp=7bb518ad4f734cc9a4a9976d30a7bad3ce439de0;hb=b13ea8a082364672c6de2b010e558211ff52ec9a;hpb=01534a94130c1f5a3a230cf4fe18365a235ba271
diff --git a/current/r/alt12dicts/README-orig b/current/r/alt12dicts/README-orig
deleted file mode 100644
index 7bb518a..0000000
--- a/current/r/alt12dicts/README-orig
+++ /dev/null
@@ -1,644 +0,0 @@
-Introduction
-
-12dicts is a collection of English word lists. It differs in several
-important ways from most of the other free word lists you can download.
-
-  * The 12dicts lists are oriented towards common words. If you're
-    looking for myriads of archaic, scientific or computer jargon words,
-    you should look elsewhere.
-  * The 12dicts lists have been rigorously checked for errors. (This is
-    not to say that they are error-free, merely that enough care has
-    been taken that errors are rather infrequent.)
-  * 12dicts contains a variety of lists, of different sizes and
-    characteristics. One size does not fit all. Because each list has
-    different characteristics, I do not recommend combining them, except
-    as noted below.
-
-Originally, 12dicts was composed of lists derived from a specific set of
-12 source dictionaries. In addition to these "classic" lists, 12dicts
-now includes lists derived from other sources. It would perhaps be
-appropriate to rename 12dicts to something more generic, such as BAWL
-(Beale's Assorted Word Lists), but I have not done so in order to
-preserve continuity.
-
-A quick summary of the 12dicts lists and their characteristics is as
-follows:
-
-+---------------------------------------------------------+
-|               |3esl |6of12|2of12|2of4brif|5desk|2of12inf|
-|---------------+-----+-----+-----+--------+-----+--------|
-|Size           |21877|32153|41236|60387   |61406|81520   |
-|---------------+-----+-----+-----+--------+-----+--------|
-|Abbreviations  |Y    |Y    |N    |N       |N    |N       |
-|---------------+-----+-----+-----+--------+-----+--------|
-|Acronyms       |Y    |Y    |N    |N       |Y    |N       |
-|---------------+-----+-----+-----+--------+-----+--------|
-|British English|N    |N    |N    |Y       |N    |N       |
-|---------------+-----+-----+-----+--------+-----+--------|
-|Hyphenations   |Y    |Y    |Y    |N       |N    |N       |
-|---------------+-----+-----+-----+--------+-----+--------|
-|Inflections    |N    |N    |N    |Y       |N    |Y       |
-|---------------+-----+-----+-----+--------+-----+--------|
-|Names          |Y    |Y    |N    |N       |Y    |N       |
-|---------------+-----+-----+-----+--------+-----+--------|
-|Phrases        |Y    |Y    |N    |N       |N    |N       |
-+---------------------------------------------------------+
-
-The remainder of this document is organized as follows:
-
-  * This release
-  * The classic 12dicts lists
-      + The 6of12 and 2of12 lists
-      + The 2of12inf list
-  * The 3esl list
-  * The 2of4brif list
-  * The 5desk list
-  * How 12dicts came to be
-  * Conclusions
-
-This release
-
-This is release 4.0 of 12dicts, released Jan. 18, 2003. It differs from
-previous versions by containing three additional lists which are not
-derived from the "classic" 12dicts sources. Changes to the classic lists
-are limited to error corrections.
-
-The classic 12dicts lists
-
-The 12dicts project began as the n-dicts project, n being a variable
-whose value finally stabilized as 12. The purpose of the project was to
-create a list of words approximating the common core of the vocabulary
-of American English.
-
-The methodology of the project was to record and correlate the words
-listed in a number of small dictionaries.
-The number of dictionaries so
-recorded is now 12, comprising 8 ESL (English as a Second Language)
-dictionaries and 4 "desk dictionaries". The dictionaries chosen vary
-widely by publisher, by style, by completeness and by depth. In this
-version of 12dicts, all of them are dictionaries of American English
-(three from British publishers). The smallest of them contains about
-20,000 entries, and the largest 46,000. (All totaled, there are about
-75,000 entries, many of which appear in only a single dictionary.) All
-but two of them were published in the last seven years.
-
-The 6of12 and 2of12 lists
-
-I initially tried two different ways of winnowing the 12dicts data to
-produce lists of common words. Both produced interesting results. One
-list, the 6of12 list, contains all words and phrases listed in 6 of the
-12 dictionaries. One way of describing this list is that it contains
-those words and phrases which a (seeming) majority of lexicographers
-believe are relevant to people learning English, and/or to everyday
-usage. This list contains about 32,000 words and phrases. The other
-list, the 2of12 list, is more inclusive in that it includes words listed
-in as few as two of the source dictionaries, but less inclusive in that
-it excludes items of various sorts, including multiword phrases, proper
-names and abbreviations. This list contains about 41,000 words. It is
-perhaps more suitable for use in areas like spell checking or word games
-than the 6of12 list. (Honesty compels me to admit that neither of these
-lists is, by itself, a good choice for spell checking, due to the
-absence of inflections, proper names, Roman numerals, etc.)
-
-A third list, 2of12inf.txt, developed later, is of a rather different
-character, and is discussed in a later section.
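
The counting idea behind the 6of12 and 2of12 lists can be sketched in a few lines of Python. This is only an illustration, not the tooling actually used to build 12dicts: the function name `select_by_citations` and the toy data are hypothetical, and the sketch ignores the exclusions, spelling variants and signature words that the real lists take into account.

```python
from collections import Counter

def select_by_citations(dictionaries, threshold):
    """Return the sorted entries that appear in at least `threshold` dictionaries."""
    counts = Counter()
    for entries in dictionaries:
        # set() ensures an entry repeated inside one dictionary is counted once
        counts.update(set(entries))
    return sorted(word for word, n in counts.items() if n >= threshold)

# Hypothetical toy data: three tiny "dictionaries", not the real 12 sources
toy = [
    ["cat", "dog", "ogee"],
    ["cat", "dog"],
    ["cat", "zebra"],
]

common = select_by_citations(toy, 2)  # analogous to a "2 of 3" list
core = select_by_citations(toy, 3)    # analogous to a "3 of 3" list
```

With twelve real source lists, a threshold of 6 or 2 would give the raw candidates for a 6of12-style or 2of12-style list, before any of the editorial rules are applied.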
-
-A more precise description of the criteria by which the above lists were
-composed is as follows:
-
-6of12 list word selection
-
-  * The 6of12 list contains all non-excluded words and phrases which
-    appear in 6 or more of the source dictionaries.
-  * Prefixes and suffixes are excluded. Abbreviations are included;
-    however, if they are entirely lower-case and alphabetic, they are
-    terminated with a colon (":") so they can be easily distinguished
-    from regular words.
-  * Inflections of included words are not themselves included unless
-    they are separately defined or irregular.
-  * It sometimes occurs that a word is listed in several forms (e.g.,
-    with and without hyphenation) in 6 or more dictionaries, even though
-    no single form is so listed. In this case, if one spelling is
-    clearly more accepted, this spelling and this spelling only is
-    listed. If all spellings seem equally accepted, one spelling has
-    been selected arbitrarily for inclusion.
-  * The 6of12 list contains a significant number of words which do not
-    meet either criterion 1 or 4 above. These words, sometimes called
-    "signature words", are discussed below. All of these words are
-    listed in at least one of the source dictionaries.
-  * In addition to the ":" suffix discussed above, other special suffix
-    characters are used to mark words with certain characteristics, as
-    discussed below.
-
-2of12 list word selection
-
-  * The 2of12 list contains all non-excluded words which appear in at
-    least 2 of the source dictionaries.
-  * This list excludes capitalized words, multiword phrases, and
-    abbreviations, as well as prefixes and suffixes. It does not exclude
-    hyphenated words or contractions. If a word occurs in both a
-    hyphenated and an unhyphenated form, the unhyphenated form is
-    listed, even if the hyphenated form is generally preferred.
-  * The list excludes spellings which are considered (by a majority of
-    the dictionaries listing it) to be non-American usage.
-    It also
-    excludes secondary spellings which are mentioned by fewer than four
-    of the source dictionaries.
-  * Inflections of included words are not themselves included unless
-    they are separately defined, or irregular.
-  * Several of the source dictionaries include listings for obscure
-    currencies, such as ringgit, khoum and ngwee. I was unable to regard
-    such words as part of the English "core vocabulary", and so I
-    required citation in over a third of the dictionaries for inclusion
-    of monetary units. A side-effect was the elimination of the word
-    lepton, which, in addition to its use in particle physics, is also
-    .01 Greek drachmas.
-  * This list also includes a small number of signature words, as
-    discussed below.
-
-Signature words
-
-As indicated, both lists have been augmented with words (and, in the
-case of the 6of12 list, phrases) which fail to meet the formal
-requirements for inclusion. In the case of the 6of12 list, 1024 words
-were added (about 3 % of the total). These are all words which, in the
-judgment of the compiler, are as familiar as many of the words which met
-the criteria for inclusion. Examples of some of the sorts of words which
-were added are:
-
-  * Words of the same category as other included words. An example is
-    the astrological sign Cancer, which alone of all the astrological
-    signs fails to appear in 6 or more of the dictionaries. Similarly
-    added were the omitted holidays Thanksgiving and Christmas Eve.
-  * Vulgarities, sexual terms and insults. Some such words were already
-    included, but most of the source dictionaries were quite squeamish
-    about them. These words are very widely known indeed; I hold that
-    any list of "common" words which does not include the infamous
-    f-word is simply discredited thereby. Some may feel that it would
-    have been better to leave some or all of these terms unmentioned.
-    Nevertheless, the expression of blasphemy, unwarranted contempt and
-    perverse lust, whether in words or in deeds, is a very human trait.
-    Suppressing the evidence of these aspects of the human condition in
-    our language makes no more sense than excluding leprosy, gangrene
-    and dementia, no matter how unpleasant they may be to contemplate.
-  * Conventional conversational phrases so common as to be practically
-    invisible to native speakers. Examples are thank you, good night,
-    uh-huh, of course and gesundheit.
-  * Sports terminology, especially for football and baseball. (If I, who
-    am practically sports-blind, noticed this deficiency, it must be of
-    major proportions indeed.)
-
-Note that the signature words in the 6of12 list can be identified via
-the suffix character "+", and eliminated if desired.
-
-A much smaller set of words (49) was added to the 2of12 list. These were
-of two sorts:
-
-  * Signature words from the 6of12 list which were not already present
-    in the 2of12 list, and which are not excluded due to being
-    abbreviations, phrases, etc.
-  * Inflections of irregular verbs not explicitly mentioned in 2 source
-    dictionaries, such as outfought and reheard.
-
-Annotations
-
-Some of the 6of12 list entries are annotated with a suffix character,
-giving additional information about the associated word. The annotations
-can be easily removed with an editor or script if they are unwanted.
-
-These annotations are:
-
-: The word is an otherwise unmarked abbreviation. This suffix may appear
-  in combination with another suffix.
-& The word is primarily a non-American usage.
-# The word is generally held to be a variant or less preferred form of
-  another word.
-< This form of a word is held to be the primary form by fewer
-  dictionaries than some other form of the word.
-^ This form of the word was selected arbitrarily from a set of variants,
-  none of which was clearly preferred.
-= Roughly, this indicates a "second class" word, as described below.
-+ The word is a signature word.
-
-The reasons a word might be marked with the = annotation are:
-
-  * The word is an inflection which was defined in the same entry as the
-    base word.
-  * The word is a derived word (-ly, -ness or -er/or) which was not
-    defined in a separate entry.
-  * The word appeared in a list of undefined words with a common prefix,
-    such as un- or re-.
-
-The words in the 2of12 list are not annotated.
-
-The 2of12inf list
-
-The 2of12inf list is of a rather different character from the two
-original "classic" lists. Conceptually, it is simple. It consists of all
-the words in the 2of12 list, plus their inflections, amounting to about
-81,000 words. This list may be more useful than the other lists for
-applications like word games. It was created to help Kevin Atkinson in
-his Aspell and SCOWL projects (for which, follow this link). Unlike the
-6of12 and 2of12 lists, this list is not based exclusively on the
-contents of my 12 source dictionaries, and for this reason it has, I
-feel, less authority than the other classic 12dicts lists. It also
-probably has a significantly higher error rate than the other lists, for
-reasons explained below.
-
-The criteria defining the 2of12inf list are as follows:
-
-  * The 2of12inf list contains all non-excluded words which appear in at
-    least 2 of the source dictionaries.
-  * This list excludes capitalized words, multiword phrases,
-    abbreviations, contractions, hyphenated words and single-letter
-    words, as well as prefixes and suffixes.
-  * The list does not exclude secondary spellings, non-American usages
-    or monetary units.
-  * The list includes inflections of all included words. Any inflection
-    mentioned or clearly implied by any of the source dictionaries is
-    included (i.e., two citations are not required). Additionally, some
-    inflections have been added from other sources.
-  * Plurals of "uncountable" nouns were included, annotated with the "%"
-    suffix character.
-    See below for an extended discussion of the
-    inclusion of these words.
-  * Signature words from the other lists, plus their inflections, were
-    added. No other signature words were added.
-
-Though the 2of12inf list still consists mostly of very common words,
-criteria 3 through 5 above cause the 2of12inf list to contain a greater
-proportion of unfamiliar and unusual words than the other classic
-12dicts lists.
-
-The 2of12inf list was not derived directly from the 12 source
-dictionaries. The starting point was a subset of Kevin Atkinson's AGID
-list, a list of words, parts of speech and inflections derived from
-public-domain sources, notably Moby Words and WordNet. (See the file
-agid.txt in the 12dicts archive, which is a copy of the AGID "readme",
-for more information on the antecedents of AGID.) 2of12inf was created
-by a process of editing the AGID subset to remove spurious entries and
-those which reflected a more esoteric English vocabulary than the other
-12dicts lists, and to add inflections which AGID failed to identify.
-This process required significantly less effort than would have been
-needed to derive the list directly from the source dictionaries.
-Unfortunately, a side effect of the process is that the result is likely
-to be somewhat less reliable than the other 12dicts lists. In
-particular, Moby Words is notoriously unreliable, and I find it unlikely
-that I have successfully identified all the spurious inflections its use
-has introduced. It is my hope in the future to release another edition
-of 2of12inf which is not derived from AGID, and therefore not "infected"
-by Moby Words.
-
-After the first version of the 2of12inf list was released, I replaced
-one of the source dictionaries, officially an international dictionary
-but in actuality rather British in its orientation, with a more American
-dictionary by the same publisher.
-It was not practical (nor necessarily
-desirable) for me to go through the list removing inflections endorsed
-only by the superseded dictionary. For this reason, the 2of12inf list
-has a slightly more international character than the other 12dicts
-lists. It is not altogether clear that this is a bad thing.
-
-Selection of inflections
-
-Ideally, the 2of12inf list would contain only inflections listed in one
-of the 12dicts source dictionaries. This proved not to be practical. The
-reason for this has to do with the nature of these sources, which are
-mostly ESL dictionaries. An ESL dictionary might well list the word
-esophagus, but, because an English learner is unlikely to need to talk
-about this organ in the plural, it will probably not bother to list the
-plural form esophagi. For words of this sort, I therefore needed to
-obtain their inflections from other sources. Obviously, the decisions on
-when to include additional inflections were judgment calls, as were the
-choices of which inflections to add.
-
-Adjectival inflections (comparatives and superlatives) proved to be an
-especially annoying problem. Only 2 of my 12 source dictionaries
-provided remotely reliable information of this sort. In fact, such
-information is sparse and inconsistent in most dictionaries of any size.
-I relied on a small set of additional dictionaries for this information,
-which was mostly disjoint from the sources for plurals and verb forms.
-Several of these sources were Scrabble(r)-related, and therefore
-inclined to include forms of little plausibility such as iller/illest or
-fertiler/fertilest. Accordingly, I ended up rejecting some of the
-documented inflections on grounds of implausibility. I have no doubt
-that, in the process, I made a number of errors of both inclusion and
-exclusion and, in any case, many of the forms listed have no connection
-with any of the 12dicts source dictionaries.
-
-One additional problem in the creation of the 2of12inf list was that of
-"uncountable" nouns and their plurals. Some English dictionaries,
-especially ESL dictionaries, as well as other linguistic sources attest
-to the existence of nouns which cannot be counted, or used in the
-plural. Examples of such nouns include mud, rayon, oregano, chess,
-fairness, wisdom, aluminum, training, materialism and chickenpox. This
-is an entirely commonsense notion, but a difficulty is the fact that the
-boundary between the countable and the uncountable is extremely vague
-and ill-defined. For example, the word coffee is ordinarily uncountable,
-but not when ordering in a restaurant, as is the word symmetry, except
-in physics or math. In general, it is possible to contrive a context
-where use of the plural of any noun whatsoever is reasonable.
-
-An alternate position, therefore, is that in fact no nouns are
-uncountable, and that any noun which is not already plural possesses a
-plural. This position is especially useful in the context of word games,
-where words such as zeals and anthraxes may produce large scores. For
-this reason, the official Scrabble dictionaries list words such as
-thens, onces and mankinds, which most people find rather implausible.
-The fact that the 2of12inf list might well be useful in gaming contexts,
-together with the fact that the boundary between countable and
-uncountable nouns is so ill-defined, served as a powerful argument for
-inclusion of all plural forms, whether commonly used or not, while its
-derivation from ESL sources argued for including only the plurals of
-countable nouns, however distinguished.
-
-In the end, I was unable to resolve this dilemma, and adopted a
-compromise. The 2of12inf list includes all plurals, but with the plurals
-of uncountable nouns marked, making it easy to remove them if they are
-not wanted. That left the issue of how to establish countability.
-Six of
-my source dictionaries included information on countability, which was
-adequate to decide the status of most of the included nouns. As for the
-rest, as usual, I used my best judgment. I will confess to occasionally
-overriding the source dictionaries when I believed they were clearly
-incorrect. (For instance, I chose not to mark the word hatreds as an
-uncountable plural, in defiance of the opinion of all my sources, on the
-grounds that it has been used in too many news stories from Bosnia to be
-considered unusual.) It is interesting to note that most of the plurals
-I added from auxiliary sources were of words considered uncountable.
-
-The difficulties listed above, and the fact that I was forced to
-exercise personal judgment frequently in creating it, emphasize a
-fundamental difference between this list and the other classic 12dicts
-lists. I have tried to make the 6of12 and 2of12 lists reflect only the
-source dictionaries, and to keep my own judgments and opinions out of
-the picture (except for my addition of signature words). This has proved
-impossible to achieve for the 2of12inf list, which accordingly
-represents a less authoritative and more arbitrary collection.
-Additionally, the 2of12inf list has undergone less proofreading and
-validation than the other lists, and I suspect the error rate is
-considerably higher than the idealistic goal of 0.02 % I advocate
-elsewhere in this document. Nevertheless, I hope it may prove to be of
-some use and interest.
-
-I wish to offer my special thanks to Kevin Atkinson, for supplying me
-with the AGID list, and for encouraging me to add the inflections. Of
-course, any errors that remain in the 2of12inf list are my own
-responsibility, and should not be blamed on Kevin, AGID, or even on
-Moby.
-
-The 3esl list
-
-The 3esl list represents another attempt to produce an English "core
-vocabulary" list.
-It is about 2/3 of the size of the 6of12 list, which
-it resembles in terms of the sorts of words included.
-
-The 3esl list is a far more subjective list than any of the classic
-12dicts lists. It was compiled from 3 small ESL dictionaries, using the
-same criteria for eligibility as the 6of12 list. I started with a list
-composed of all words from the smallest of the 3 sources, plus all words
-contained in both of the others. This list was then edited in the
-following ways:
-
-  1. I removed alternate spellings for included words, such as grey and
-     off-stage. I also removed very similar synonyms for the same
-     concept, for instance, removing cable television as a duplicate of
-     cable TV.
-  2. I added one form of each word which would have been included if the
-     sources had agreed on spelling, such as shortchange and back seat.
-  3. I removed some words which were present in the smallest of the
-     sources but seemed too esoteric, such as the symbols of chemical
-     elements. I did this only for words which were not present in the
-     other sources.
-  4. I added some words which were present in only one of the two larger
-     sources, but which seemed appropriate to add. These words were
-     frequently of the sort added to the 6of12 list as signature words,
-     as well as some inflections that often function as words with
-     meanings of their own, such as comforting and notes.
-
-All of these changes were quite subjective in nature, and quite
-numerous. Probably more than 10 % of the candidate words were added or
-removed in this way. For this reason, it is pointless to speak of
-signature words for this list; the composition of the list is too
-arbitrary for the term to make any sense. (I will note that the list is
-still not entirely arbitrary, as I added only words found in some form
-in one of the sources, and removed no words present in two of the
-sources other than duplicates.
-Thus, words like front page were not
-added, no matter how familiar, and words such as lugubrious were not
-removed, despite clearly not being part of any "core vocabulary".)
-
-Like the 6of12 list, the 3esl list marks lower-case abbreviations with a
-":" suffix, to prevent them from being mistaken for regular English
-words.
-
-One final note on this list. The 3esl list contains about 1500 words not
-present in the 6of12 list. Because these two lists have the same rules
-for the kinds of words included, one could easily combine the two to
-produce a slightly larger list including a number of words whose
-omission from 6of12 is rather surprising. Be warned that in a few cases,
-the spelling chosen for words with multiple spellings is different in
-the two lists, and I would recommend that the duplicates be removed.
-(I'll be happy to provide a list of the duplicates if anyone wants one.)
-
-The 2of4brif list
-
-All of the classic 12dicts lists are unabashedly oriented towards
-American English. I've received a few expressions of interest in a
-British English list. The result is the 2of4brif list. This list was
-compiled from 4 large "international" ESL dictionaries, published by
-British publishers. To this American, they are more British than they
-are international; quite possibly, they seem more American than
-international to British readers. It is interesting to note that,
-although there were only a third as many sources for this list as for
-the 12dicts lists, these dictionaries resembled each other far more
-closely than their American counterparts, which could mean that the
-2of4brif list is as good an approximation of a "core" British English
-vocabulary as the 6of12 list is for American English. (Or, alternately,
-it may simply mean that my choice of sources was too narrow.)
-
-The criteria for inclusion in this list were basically those of the
-2of12inf list.
-In particular, inflections are included for all words,
-but hyphenated words, contractions, phrases, proper names and
-abbreviations are all excluded. One important difference between the two
-is the way in which inflections were determined for inclusion. The
-2of12inf list includes some inflections found in one (or even none) of
-its sources. Further, as discussed in detail above, it includes plurals
-for words which are not normally considered to have plurals. The
-2of4brif list differs in both of these regards. It includes only
-inflections endorsed by two or more of the sources, specifically
-excluding any plural forms for nouns listed as uncountable.
-
-The 2of4brif list includes no signature words as such. I made a small
-number of adjustments for consistency, such as making sure that -ise and
--ize spellings were equally represented, and adding plurals for ordinal
-numbers. (Why fourteenth would be defined as a fraction, but not
-seventeenth, I must simply regard as a mystery.) These edits were so
-few, and so clearly harmless, that I have not marked them.
-
-Prospective users of the 2of4brif list should realize that it was
-compiled by an American. If my sources contained any glaring errors (and
-most dictionaries have a few), I might well not have noticed, and
-perpetuated them in the list. The fact that two citations were required
-is some protection against such an event, but no guarantee.
-
-As the 2of4brif list is very similar in makeup to the 2of12inf list, a
-user who wants a larger, more international list than either could
-reasonably merge the two. If you do this, you should remove the unusual
-plurals (marked with a "%") from the 2of12inf list in the process, for
-consistency.
-
-The 5desk list
-
-I created the 5desk list in an attempt to do a better /usr/dict/words
-(about which I offer many harsh criticisms elsewhere in this document).
-The sorts of words admitted are the same sorts that /usr/dict/words
-contains.
-Though somewhat larger in size than most versions of /usr/dict/words,
-this is still a short word list, striving for inclusion of words
-one is likely to encounter rather than the complete jargon of every
-possible scientific, artistic or occult endeavor.
-
-5desk was assembled primarily from five "desk dictionaries". It was
-augmented by words from five minor sources, including a "vocabulary
-builder" and a collection of proper names. The list excludes prefixes,
-suffixes, phrases, hyphenated words, contractions and most abbreviations
-and acronyms. There was no requirement for multiple listings; all
-qualifying words from each of the sources were included. Inflections of
-included words were not included themselves except when irregular, or
-separately defined. Variant and non-American spellings were not
-excluded, and no signature words were added.
-
-Words commonly considered to be abbreviations/acronyms were included if
-they contained at least one upper case character, and were defined with
-an explicit part of speech. This excluded items like Mr and Feb, which
-are abbreviations in the classic sense, but allowed words like DNA and
-ATM, which are used far more frequently than that which they abbreviate.
-While there is a trend in modern dictionaries to list such words as
-nouns (or occasionally verbs, adverbs, etc.), it is a trend in progress,
-and rather inconsistently applied. For this reason, the set of such
-words in the 5desk list is somewhat incoherent, including SPCA but not
-PETA, AIDS but not SIDS, KGB but not CIA, and PDQ but not ASAP.
-
-One class of commonly-used words is regrettably absent from the 5desk
-list, because I was unable to find a satisfactory source for them. This
-is the class of commercial names such as Exxon, Tylenol, Pepsi and
-Chevy. This is probably forgivable, as this class of names is as ephemeral
-and transitory as teenage slang.
-The one-time household words Kool,
-Ovaltine, Philco and Ipana serve now only as answers to trivia
-questions, with modern wonders like Starbucks, Google, Ritalin and TiVo
-taking their place on the tongues of the trendy.
-
-The 5desk list has clearly moved beyond any "core vocabulary" concept.
-It includes quite esoteric words (ogee, pleonastic), very uncommon
-spellings (thiamine, yuppy), and obscure geographical and historical
-names (Paricutin, Nevelson). Like /usr/dict/words, it is frequently
-inconsistent and arbitrary, but I hope at the least I have avoided
-including spelling errors, and overlooking the stuff of everyday
-conversation. Perhaps it will be useful as a compromise between basic
-lists such as 3esl, and truly massive lists like Mendel Cooper's ENABLE.
-
-How 12dicts came to be
-
-It may have occurred to some to wonder about how something like the
-n-dicts project came to be (though I assume that anyone who bothers to
-download this archive must already have some idea that such a project
-could be of interest).
-
-Some years ago, there was a post to the sci.crypt Usenet newsgroup, on
-the subject of creating PGP passphrases using randomly selected entries
-from a supplied list of very short words. (If this sounds interesting,
-follow this link for an expanded version of the post.) The word list,
-which was extracted from /usr/dict/words on some UNIX system, seemed to
-me ill-suited to its intended purpose. It included arcane acronyms
-(bstj, fmc), misspellings (diety, ouvre) and words of amazing obscurity
-(bhoy, kombu). I decided I could do better (and eventually did). This
-caused me to start downloading English word lists, of which there are
-many, from the Internet. I was not impressed by the overall quality of
-these lists, and the few which were high-quality were all-inclusive,
-burying the everyday words under a mountain of archaisms and esoterica.
-The flaws of the vast majority of these lists are worth recounting:
-
-  * Failure to proofread. Many of these lists are littered with
-    misspellings and typos, sometimes approaching gibberish. (I presume,
-    for instance, that the bizarre string nondploe, which was found in a
-    purported Scrabble word list, is a typo for something more or less
-    legitimate, but I have no idea what.) Working on my own lists has
-    helped me understand that 100 % accuracy is a very demanding goal,
-    seldom actually achieved, but I still feel it reasonable to expect
-    no more than 1 or 2 errors per 10,000 words.
-  * Acceptance of completely undocumented lazy spellings, such as
-    bullseye and courtmartial.
-  * Failure to respect capitalization.
-  * Failure to distinguish abbreviations from other entries.
-  * Treating esoteric computer jargon, and especially UNIX jargon, as
-    everyday English. (Beware any list which includes bitblt, emacs,
-    inode or lvalue.)
-  * Apparently random word selection. For instance, the most common
-    version of /usr/dict/words contains a large set of apparently
-    randomly chosen personal names (uncapitalized, of course, and
-    missing wanda, marge, polly and sid).
-  * Inconsistent inflection. Some lists include all inflections of their
-    vocabulary, while others include only singulars and infinitives.
-    Either policy is fine, and has its advantages. I am personally very
-    annoyed when inflected forms appear at random. I find this generally
-    happens when a compiler merges several lists with different
-    characteristics, with no attempt to reconcile their divergent
-    styles.
-  * Omission of everyday words. I've seen a purported general-purpose
-    list that includes bremsstrahlung, yet omits log and beer. Or that
-    includes saxophone but not sax, and rhinoceros but not rhino. Of
-    course, due to my original purpose in seeking out common short
-    words, I found this especially annoying.
-
-One result of my frustration with this situation was my working with
-Mendel Cooper on ENABLE (for further information, check out this link),
-which was close to unique in having an active caretaker, one clearly
-concerned with quality, and in being oriented towards American rather
-than British English. But ENABLE is an all-encompassing list and, even
-if it had been complete at the time I started my search for a list of
-common words, it would not have been what I wanted for that reason.
-
-I finally decided that only starting from scratch with a systematic
-approach was likely to get me what I was looking for, and that
-dictionaries intended for non-native speakers of English were the best
-possible source for words that are in some cases so familiar that we
-never think of them. This has led to the 12dicts lists, which I hope
-have managed to avoid the flaws recited above.
-
-(I should acknowledge one form of inconsistency exhibited by the 12dicts
-lists, which is that sometimes related words are spelled inconsistently.
-For instance, the 2of12 list contains both broadminded and
-broad-mindedness. This generally occurs as a result of the methodology
-used to build the lists. In the case of broadminded, only one source
-dictionary listed broadmindedness, which was therefore excluded. I felt
-unequal to trying to correct these inconsistencies, some of which are
-real and not mere artifacts of 12dicts, such as the contrast between
-self-conscious and unselfconscious.)
-
-Conclusions
-
-When I released the first version of 12dicts in 1999, I assumed I was
-done with it. It hasn't worked out that way. Before I declare it
-finished for a second time, there are a few more things I'd like to
-accomplish.
-
-  * As mentioned above, I would like to rework the 2of12inf list to
-    remove the dependency on the Moby lists.
-  * As may be seen by inspecting the table of file characteristics, the
-    12dicts files now form a spectrum of word lists, with contents
-    ranging from the extremely common to the mildly esoteric. I would
-    like to extend the spectrum further by applying the 12dicts
-    methodology to dictionaries of larger size. Whether I will ever get
-    the time for a project this large remains to be seen. If it ever
-    comes to pass, it will probably be released separately from 12dicts
-    itself, as anything larger than the 5desk list will be too large to
-    even pretend to represent a "core English" vocabulary. (Even the
-    5desk list itself is too large for that purpose.)
-  * It is possible that in the future the "n" of n-dicts will increase
-    again, but, in fact, consideration of an additional dictionary now
-    generally ends with the discovery that its vocabulary matches
-    12dicts pretty closely. At the very least, this phenomenon gives me
-    hope that the 12dicts lists have now fulfilled their basic purpose.
-
-The 12dicts lists were compiled by Alan Beale. I explicitly release them
-to the public domain, but request acknowledgment of their use.
-(Actually, the dependency of the 2of12inf list on AGID prevents its
-release into the public domain. However, I do not impose any additional
-requirements on its use beyond those imposed by AGID and its sources, as
-described in agid.txt.) Feel free to send comments, suggestions,
-inquiries and/or large sums of money to me at biljir@pobox.com. If you
-find 12dicts useful, I'd love to hear about it.
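
For readers who want the lists without their markers, the suffix characters described under "Annotations" (and the "%" marker used in the 2of12inf list) can be handled with a short script. The following Python is a minimal sketch, not part of 12dicts itself. It assumes one entry per line with the suffix characters appended directly to the word; the names `split_annotations` and `clean_list` and the sample entries are hypothetical.

```python
# Suffix characters described in the README: the 6of12 annotations plus "%"
ANNOTATION_CHARS = ":&#<^=+%"

def split_annotations(entry):
    """Split a list entry into (bare_word, trailing_annotation_characters)."""
    text = entry.strip()
    bare = text.rstrip(ANNOTATION_CHARS)
    return bare, text[len(bare):]

def clean_list(lines, drop=""):
    """Yield bare words, skipping entries that carry any marker listed in `drop`."""
    for line in lines:
        word, marks = split_annotations(line)
        if not word:
            continue  # ignore blank lines
        if any(mark in drop for mark in marks):
            continue  # e.g. drop="+%" removes signature words and marked plurals
        yield word

# Hypothetical sample entries, invented for illustration only
sample = ["cat", "etc:", "colour&", "touchdown+", "muds%"]

all_words = list(clean_list(sample))              # markers stripped, nothing dropped
core_words = list(clean_list(sample, drop="+%"))  # signature words and "%" plurals removed
```

Passing `drop="%"` alone corresponds to the advice given for merging 2of12inf with 2of4brif, where the marked plurals are removed for consistency.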