1 Automatically Generated Inflection Database (AGID)
6 Copyright 2000-2003 by Kevin Atkinson <kevina@gnu.org>
8 The file "infl.txt" is an automatically created database of the
9 inflected forms of words from a rather large word list.
11 The latest version can be found at http://aspell.sourceforge.net/wl/.
13 Entries are in the following form.
15 <word><sp><pos>[?]:<sp><inflected forms>
16 <word> := [[A-Za-z']]+
17 <sp> := <literal space>
19 <inflected forms> := <inflected form><sp>|<sp>...<sp>|<sp><inflected form>
20 <inflected form> := <individual entry>,<sp>...,<sp><individual entry>
21 <individual entry> := <word><word tags>[<sp><variant level>][<sp>{<explanation>}]
22 <word tags> := [~][<][!][?]
23 <explanation> := [<explanation text>][:<distinguishing number>]
24 <explanation text> := [[A-Za-z'_/]]+
26 where stuff between [ ] is optional, stuff between [[ ]] indicates a
27 range of possible characters for that entry. If a [[ ]] is followed by
28 a + it means the entry can consist of one or more characters in
29 that range. { } are literal.
31 A typical entry will look like
33 WORD V: WORDed, WORed 2, WORD {EXPL} | WORDing, WORing 2 | WORDs
35 <pos> is V for verb, N for noun, or A for adjective or adverb.
36 If <pos> is followed by a ? that means that the part-of-speech was not
37 in the part-of-speech database; however, the inflected forms of the word
38 were found in the word list.
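
The entry format above can be read with a short regular expression. The
following Python sketch is only an illustration and is not part of the AGID
scripts (which are written in Perl); the function name parse_line and the
dictionary keys are made up for this example. It assumes <pos> is always
exactly one of V, N, or A.

```python
import re

# Pattern for "<word><sp><pos>[?]:<sp><inflected forms>".
# Assumption: <pos> is a single letter V, N, or A.
LINE_RE = re.compile(r"^([A-Za-z']+) ([VNA])(\??): (.+)$")


def parse_line(line):
    """Split one infl.txt line into word, part of speech, and form groups."""
    match = LINE_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not an AGID entry: %r" % line)
    word, pos, unsure, forms = match.groups()
    return {
        "word": word,
        "pos": pos,
        # True when the part of speech was not in the part-of-speech database.
        "pos_uncertain": unsure == "?",
        # Forms are separated by " | "; alternatives within a form by ", ".
        "forms": [group.split(", ") for group in forms.split(" | ")],
    }
```

Each element of "forms" is a list of individual entries still carrying their
tags, variant level, and explanation as raw text.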
40 The inflected forms are in the following order for verbs (except for
42 <past tense> [<past participle>] <-ing form> <-s form>
43 and for adjectives or adverbs:
44 <-er form> <-est form>
45 Each form is separated by a ' | '.
49 <past 1st & 3d singular> <2d singular, plural, past subjunctive>
50 <past participle> <present participle> <present 1st singular>
51 <2d singular> <3d singular> <plural present>
53 <past & past participle> <present participle> <present participle>
54 <present 1st & 3d singular> <2d singular> <plural present>
56 An absence of a variant level implies a variant level of 0. Two words
57 with the same whole number variant level are considered almost equal
58 with a slight preference given to the entry with a lower number. A
59 whole number variant level of 1 indicates a less preferred form of the
60 word. A whole number variant level of 2 indicates any number of
61 things. It could mean that it is from an archaic use of the word, or
62 a variant that is hardly ever used or for an extremely obscure meaning
63 of the word, or finally it could mean that the word looked like it
64 could possibly be an inflected form of the base word but I could not
65 find any evidence for it. If two words have the same variant level
66 and explanation it means that both inflections were found and the
67 script was not sure which one to use.
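
As a rough illustration of the variant level rules, the Python sketch below
reads the level from an individual entry and picks the entry with the lowest
level. It is not taken from the AGID scripts; the names variant_level and
preferred are hypothetical, and it assumes the level is written as a plain
decimal number (such as 1, 2, or 0.1) directly after the word.

```python
import re

# Assumption: the variant level, when present, is a decimal number that
# follows the word (and its tags) after a single space.
LEVEL_RE = re.compile(r"^\S+(?: (\d+(?:\.\d+)?))?")


def variant_level(entry):
    """Return the variant level of an individual entry; a missing level means 0."""
    match = LEVEL_RE.match(entry)
    if match is None or match.group(1) is None:
        return 0.0
    return float(match.group(1))


def preferred(entries):
    """Pick the entry with the lowest variant level (the first one wins ties)."""
    return min(entries, key=variant_level)
```

The whole number part, int(variant_level(entry)), is what separates almost
equal forms (0), less preferred forms (1), and doubtful or archaic forms (2).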
69 Sometimes the inflected form to use depends on the meaning of the
70 word. If this is the case the two entries will have different
71 explanations. If the distinction can be made in a few words it is
72 given with underscores (_) replacing spaces. Otherwise the two
73 entries will have different distinguishing numbers.
75 A < after a word means that there is a good chance that this is an
76 inflected form of the word, a ~ after a word means that there is a
77 slight chance. A ! after a word indicates that the word is likely an
78 inflection of a similar word (generally one ending in e) and not the
79 current word. A ? after a word means that the word was not in the
80 word list but if it were it would be considered an inflected form of the word.
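
The word tags, variant level, and explanation of an individual entry can be
pulled apart as sketched below in Python. This is an illustration based only
on the grammar given earlier, not code from the AGID scripts; parse_item and
its result keys are invented names, and the decimal format assumed for the
variant level is a guess from the examples in this file.

```python
import re

# <word><word tags>[<sp><variant level>][<sp>{<explanation>}]
# Tags appear in the fixed order ~ < ! ? and each one is optional.
ITEM_RE = re.compile(
    r"^(?P<word>[A-Za-z']+)"
    r"(?P<tags>~?<?!?\??)"
    r"(?: (?P<level>\d+(?:\.\d+)?))?"  # assumed decimal variant level
    r"(?: \{(?P<text>[A-Za-z'_/]*)(?::(?P<num>\d+))?\})?$"
)


def parse_item(item):
    """Parse one individual entry into word, tags, level, and explanation."""
    match = ITEM_RE.match(item)
    if match is None:
        raise ValueError("not an individual entry: %r" % item)
    return {
        "word": match.group("word"),
        "tags": set(match.group("tags")),
        # An absent variant level implies level 0.
        "level": float(match.group("level") or 0),
        # Underscores stand for spaces in the explanation text.
        "explanation": match.group("text") or None,
        "number": int(match.group("num")) if match.group("num") else None,
    }
```

A caller that wants only well-attested forms could, for example, skip any
entry whose "tags" set is not empty.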
83 This version is now almost as accurate as Alan Beale's 2of12id file
84 distributed with the "Unofficial Alternate 12 Dicts Package" for the
85 base words which have an entry in 2of12id.txt with a few notable
86 exceptions. The most obvious one is the "person" entry. Alan Beale
87 considers, based on what his sources have told him, that "persons" is
88 the proper plural for "person" and "people" is considered a variant.
89 I however disagree and decided to consider "people" the primary form
90 and "persons" as the slightly less preferred variant based on my own
91 experience and http://www.quinion.com/words/usagenotes/un-person.htm
94 The normal plural of person was persons ... However, there is
95 evidence from Chaucer onwards that some writers chose to use people
96 as a plural for person, not only in the generalised sense of 'an
97 uncountable or indistinct mass of individuals' but also in specific
98 countable cases. ... Though persons survives, it does so largely in
99 formal or legal contexts ...From the evidence, it seems that the
100 trend towards using people instead of persons is accelerating and
101 that it may not be so long before persons vanishes from the language
102 except in certain set phrases.
104 I considered making "persons" a variant (level 1), but I decided
105 against it as "persons" is for the most part perfectly acceptable and
106 probably considered the proper plural to use by some.
108 I also considered the -people ending the primary form for all words
109 ending in -person such as salesperson and the -persons entry the
110 slightly less preferred variant in spite of what 2of12id.txt said.
112 In some cases a variant of level 2 is listed in AGID where it is not
113 listed at all in 2of12id. In general this means that the script came
114 up with the possibility and, in spite of it not being listed in 2of12id,
115 it seems logical to me.
117 The final case occurs when a word has two or more -s inflections used
118 as both noun and verb forms, and these forms would have different
119 variant levels in 2of12id. For example:
120 ditto N: dittos, dittoes 1
121 ditto V: dittoed | dittoing | dittos, dittoes 0.1
122 For purely technical reasons and because I do not feel that it matters
123 too much I have made the variant levels for the -s forms the same. For
124 example the ditto entries became:
125 ditto N: dittos, dittoes 0.1
126 ditto V: dittoed | dittoing | dittos, dittoes 0.1
127 The choice of the variant levels I used is somewhat arbitrary but in
128 general I went with the lower level.
130 Feel free to send me corrections for any of these questionable
131 words. I am mostly interested in the preferred form of the word when
132 the script was not able to decide or words marked with < or ~ that are
133 valid inflected forms of the words.
135 Also included in this version are the files "variant_0.lst",
136 "variant_1.lst", "variant_2.lst", and "variant.tab". The files
137 "variant_#.lst" include all of the inflected forms at the given level
138 found in infl.txt which are not generally considered to be some other
139 common word. The file variant.tab contains a cross reference of all
140 alternate inflected forms of words. The file variant-wroot.tab
141 is like variant.tab except that it also includes the root form of the
144 Words are in mixed case but all accents have been stripped, so words
145 like café are instead written as cafe.
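
For anyone matching their own accented text against these files, one common
way to strip accents in the same spirit is shown below in Python. This is an
assumption about how a user might normalize input and is not the method used
to build the word list; the function name strip_accents is made up.

```python
import unicodedata


def strip_accents(word):
    """Remove combining accent marks, so that "café" becomes "cafe"."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no base-plus-accent decomposition are left unchanged by
this approach.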
147 The file "variant" contains a list of alternate inflections.
149 The file "irregular" contains extra information where a noun or verb
150 has irregular inflected forms.
152 The file "dontuse" contains a list of words not to consider an
153 inflected form of a word if more than one inflected form of a word is
156 The files "prefixes" and "suffixes" contain lists of common prefixes
157 and suffixes respectively. These files are used by the script to
158 produce inflected forms for words that end in a word in the
159 "irregular" file. If the beginning appears in the word list or the
160 prefixes file and the ending appears in the irregular file I also
161 consider <prefix>+<irregular inflections>. If the prefix is 3 letters
162 or more OR appears in the prefixes file and the suffix is 4 letters or
163 more OR appears in the suffixes file I consider it the most likely
164 choice, otherwise I consider it as a possible candidate but not the
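
The prefix and suffix rule above can be sketched in Python as follows. This
is a reading of the description, not the logic of make-infl itself: the
function name, its arguments, and the "likely"/"possible" labels are all
hypothetical, and the condition is interpreted as (prefix test) AND
(suffix test), which is an assumption about how the two ORs group.

```python
def irregular_candidates(word, word_list, prefixes, suffixes, irregular):
    """Propose <prefix>+<irregular inflection> forms for a compound word.

    word_list, prefixes, and suffixes are sets of strings; irregular maps a
    base word to a list of its irregular inflected forms.
    """
    results = []
    for i in range(1, len(word)):
        begin, end = word[:i], word[i:]
        # The ending must have an entry in the "irregular" data.
        if end not in irregular:
            continue
        # The beginning must be a known word or a listed prefix.
        if begin not in word_list and begin not in prefixes:
            continue
        # Assumed grouping: (3+ letters OR listed prefix) AND
        # (4+ letters OR listed suffix).
        likely = (len(begin) >= 3 or begin in prefixes) and (
            len(end) >= 4 or end in suffixes
        )
        forms = [begin + form for form in irregular[end]]
        results.append((forms, "likely" if likely else "possible"))
    return results
```

With "sales" in the word list and "person" mapped to "people" in the
irregular data, "salesperson" yields "salespeople" as the likely form.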
167 The file "make-infl" is the actual Perl script used to create the
170 The file "find-var" is the Perl script used to create the variant
171 lists and cross reference file.
173 The file "make-all" was used to create the word list used by the script.
177 From Revision 3a to 4 (January 2, 2003)
179 Added variant-wroot.tab
180 Updated find-var script to also produce variant-wroot.tab.
182 From Revision 3 to 3a (April 04, 2001)
184 Fixed a bug in the find-var script which caused some common
185 words which are variants for one usage of a word but not
186 variants for any other common usage to improperly appear in
189 From Revision 2 to 3 (January 28, 2001)
191 Changed the format of infl.txt to something which is slightly harder
192 to read but a lot less ambiguous and easier to parse.
194 Updated various files, including the actual script, so that the
195 output is almost as accurate as Alan Beale's 2of12id.txt
197 Eliminated Moby Words and ABLE from the word list used by the script
198 to give more accurate results.
200 From Revision 1 to 2 (August 18, 2000)
202 Classified variants as either almost equal, also used, or
205 The / is now used to indicate equal variants. "/?" is now used to
206 mean what "/" used to be.
208 Lots of additional rules added which greatly improved the results.
210 COPYRIGHT AND SOURCE:
212 The final product is under the following copyright, as well as any
213 copyrights mentioned below.
215 Copyright 2000-2003 by Kevin Atkinson
217 Permission to use, copy, modify, distribute and sell this database,
218 the associated scripts, the output created form the scripts and its
219 documentation for any purpose is hereby granted without fee,
220 provided that the above copyright notice appears in all copies and
221 that both that copyright notice and this permission notice appear in
222 supporting documentation. Kevin Atkinson makes no representations
223 about the suitability of this array for any purpose. It is provided
224 "as is" without express or implied warranty.
226 The part-of-speech database is taken from Alan Beale's 2of12id
227 and the WordNet database which is under the following copyright:
229 This software and database is being provided to you, the LICENSEE, by
230 Princeton University under the following license. By obtaining, using
231 and/or copying this software and database, you agree that you have
232 read, understood, and will comply with these terms and conditions.:
234 Permission to use, copy, modify and distribute this software and
235 database and its documentation for any purpose and without fee or
236 royalty is hereby granted, provided that you agree to comply with
237 the following copyright notice and statements, including the disclaimer,
238 and that the same appear on ALL copies of the software, database and
239 documentation, including modifications that you make for internal
240 use or for distribution.
242 WordNet 1.6 Copyright 1997 by Princeton University. All rights reserved.
244 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
245 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
246 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
247 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
248 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
249 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
250 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
253 The name of Princeton University or Princeton may not be used in
254 advertising or publicity pertaining to distribution of the software
255 and/or database. Title to copyright in this software, database and
256 any associated documentation shall at all times remain with
257 Princeton University and LICENSEE agrees to preserve same.
259 Alan Beale's 2of12id.txt is indirectly derived from the Moby part-of-speech
260 database and the WordNet database. The Moby part-of-speech is in the
263 The Moby lexicon project is complete and has
264 been place into the public domain. Use, sell,
265 rework, excerpt and use in any way on any platform.
267 Placing this material on internal or public servers is
268 also encouraged. The compiler is not aware of any
269 export restrictions so freely distribute world-wide.
271 You can verify the public domain status by contacting
275 Arcata, CA 95521-4884
281 The word list used is a combination of several word lists:
283 1) The ENABLE2K word list, which is in the public domain:
285 The ENABLE master word list, WORD.LST, is herewith formally
286 released into the Public Domain. Anyone is free to use it or
287 distribute it in any manner they see fit. No fee or registration
288 is required for its use nor are "contributions" solicited (if you
289 feel you absolutely must contribute something for your own peace
290 of mind, the authors of the ENABLE list ask that you make a
291 donation on their behalf to your favorite charity). This word
292 list is our gift to the Scrabble community, as an alternate to
293 "official" word lists. Game designers may feel free to
294 incorporate the WORD.LST into their games. Please mention the
295 source and credit us as originators of the list. Note that if
296 you, as a game designer, use the WORD.LST in your product, you
297 may still copyright and protect your product, but you may *not*
298 legally copyright or in any way restrict redistribution of the
299 WORD.LST portion of your product. This *may* under law restrict
300 your rights to restrict your users' rights, but that is only
303 2) All of the word lists except ABLE.LST in the ENABLE2K Supplement
306 2DICTS.LST ALSO.LST LETTERS.LST OSPDADD.LST UCACR.LST
307 LCACR.LST NOPOS.LST PLURALS.LST UPPER.LST
309 All of these word lists are also in the public domain.
311 3) The list of signature words from the YAWL package which is in the
314 4) The UK Advanced Cryptics Dictionary, which is under the following
317 Copyright (c) J Ross Beresford 1993-1999. All Rights Reserved.
319 The following restriction is placed on the use of this
320 publication: if The UK Advanced Cryptics Dictionary is used
321 in a software package or redistributed in any form, the
322 copyright notice must be prominently displayed and the text
323 of this document must be included verbatim.
325 There are no other restrictions: I would like to see the
326 list distributed as widely as possible.
328 5) Some extra words found in the Part-Of-Speech database that were not
329 found in any of the above word lists.
331 6) Words found in the Jargon File Word List package, available at
332 http://aspell.sourceforge.net/wl/, which is in the Public Domain.
334 7) Words in 2of12id.txt not in any of the word lists above. 2of12id is
335 indirectly derived from all the above sources and most of the word
336 lists from the Moby Words package:
338 10196pla.ces 113809of.fic 21986na.mes 256772co.mpo 354984si.ngl
339 3897male.nam 4160offi.cia 4946fema.len 6213acro.nym 74550com.mon
341 The Moby Words package, like the Part-Of-Speech database, is in the
344 8) And finally some extra words that I added myself. These words can be
345 found in the file "extra-words"
347 The "dontuse", "irregular", and "variant" files were created by me
348 (Kevin Atkinson) from numerous sources.