X-Git-Url: https://git.donarmstrong.com/?a=blobdiff_plain;f=r%2Fvarcon%2FREADME;fp=r%2Fvarcon%2FREADME;h=75fa22c3693bd720fcf981c4913bd49b11961d15;hb=b13ea8a082364672c6de2b010e558211ff52ec9a;hp=0000000000000000000000000000000000000000;hpb=01534a94130c1f5a3a230cf4fe18365a235ba271;p=deb_pkgs%2Fscowl.git diff --git a/r/varcon/README b/r/varcon/README new file mode 100644 index 0000000..75fa22c --- /dev/null +++ b/r/varcon/README @@ -0,0 +1,425 @@ +Variant Conversion Info (VarCon) + +Version 2015.08.24 + +Copyright 2000-2015 by Kevin Atkinson (kevina@gnu.org) + +This package contains information to convert between American, +British, and Canadian spellings and vocabulary as well and other +variant information. + +The latest version can be found at http://wordlist.aspell.net/. + +The main data file is varcon.txt. It contains information on the +preferred American, British, and Canadian spelling of a word as well +as other variant information. + +Each line contains a mapping between the various spellings of a word. +Words are tagged to indicate where the spelling is used, and each +word/tag pair is separated with a " / ". For example in the line: + A Cv: acknowledgment / Av B C: acknowledgement +"acknowledgment" and "acknowledgement" are two spellings of the same +word and "A", "Cv", "B", etc are the tags. Tags are separated by +spaces and the group of tags is separated from the word with a ": ". +Here, "acknowledgment" is the preferred American spelling (as +indicated by the "A") of the word, and "acknowledgement" is the +preferred Canadian and British spelling ("B" and "C"). However the +American spelling is sometimes used in Canada (as indicated by "Cv", +where the lowercase "v" indicated a variant form) and the British +spelling is sometimes used in America (as indicated the the "Av"). + +More generally each tag consists of a spelling category (for example +"A") followed possible by a variant indicator. The spelling +categories are as follows: + A: American + B: British "ise" spelling + Z: British "ize" spelling or OED preferred Spelling + C: Canadian + _: Other (Variant info based on American dictionaries, never used + with any of the above). +and the variants tags are as follows: + .: equal + v: variant + V: seldom used variant + -: possible variant, should generally not used + x: improper variant (should not use) + +The "." or equal variant tags are reserved for special cases when +there is little agreement between dictionaries or when I think the +dictionary is wrong. The "v" indicator is used for most words marked +as variants in the dictionary. However, some variants will be demoted +to a "V". For example, if the variant is marked as "also" by +Merriam-Webster, or also if only some dictionaries acknowledge the +existence the variant. "-" is used when the variant is generally not +listed is the dictionary but I could find some evidence of it use, or +when it is it marked as as a archaic spelling for the word. The "x" +is used when the spelling is almost generally considered a +misspelling, and is only included for completeness. + +If there are no tags with the 'Z' spelling category on the line than +'B' implies 'Z'. Similarly if there are no 'C' tags than 'Z' implies +'C'. + +For ease of reading and maintaining the data file, each line is +grouped in a cluster of closely related words. Each cluster is +uniquely identified by a headword, which is generally the American +spelling of word on the first line of the cluster. Each cluster is +started with a '#' and is followed by the headword with some +additional information after it. For example the cluster for +acknowledgment is: + # acknowledgment (level 35) + A Cv: acknowledgment / Av B C: acknowledgement + A Cv: acknowledgments / Av B C: acknowledgements + A Cv: acknowledgment's / Av B C: acknowledgement's +The "" tag will be explained latter, and "(level 35)" +indicate what level in SCOWL (see http://wordlist.sourceforge.net) +the headword is found in. The levels generally mean the following: + <= 35: Very common word + <= 70: Can be found in the dictionary + 80: Likely a valid word, can likely be found in an + unabridged dictionary + > 80: May not even be a legal word + +Sometimes the spelling of a word depends on the usage. If so the word +is listed more than once within a cluster, with any usage information +being indicated after a " | ". For example here is part of the cluster +for prize: + A B: prize | reward + A B: prizes | reward + A C: prize / B: prise | otherwise + A C: prizes / B: prises | otherwise +which indicated than the preferred spelling of prize is always with a +"z" when meaning a reward, but otherwise is spelled with a "s" is +British English. In the example above a brief definition of the word +is given, but often no such attempt is made, and the definition simply +consists of a number, for example: + A B: sake | :1 + A C: sake / Av B Cv: saki | :2 + +Sometimes part-of-speech (POS) info is given to help distinguish which +form is used. For example: + A B C: practice / AV Cv: practise | + A Cv: practice / AV B C: practise | +POS info is always given given in the form "" and if a definition +is also given the the POS info is always first. The POS tags used are as +follows: + : Noun + : Verb + : Adjective + : Adverb + +A "(-)" before the definition indicated a rarely used or archaic form +of a word, for example: + A B: bark | :1 + A: bark / Av B: barque | (-) ship + +A "--" indicates a note rather than definition. This is generally +used to indicate that the spelling of the plural form not depend on +the spelling of the root word, for example: + _: cabby / _.: cabbie + _: cabbies | -- plural + +Misc. notes on a particular form of a word are given after a "#" on +the same line. Misc. notes for the cluster are given at the end of +the cluster and are prefixed with "##", for example: + # coloration (level 50) + A B C: coloration / B. Cv: colouration + A B C: colorations / B. Cv: colourations + A B C: coloration's / B. Cv: colouration's + ## OED has coloration as the prefered spelling and discolouration as a + ## variant for British Engl or some reason +In the notes ODE (not to be confused with OED) stands for Oxford +Dictionary of English, "Ox" is used for any Oxford dictionary, and +"M-W" for Merriam-Webster. + +Earlier versions of varcon contained numerous errors. With version +5.0 massive effort has been made to correct many of these errors. +Clusters that have undergone some form of verification (and likely +correction) are marked with "". As of version 5.0, most +clusters with headwords word in common usage (SCOWL level 35 and +below) should now be checked, as well as many others. No effort was +made to check clusters with headwords in SCOWL level 80 and above; +many of those entries are unlikely to be in the dictionary anyway. + +The file variant-also.tab contains additional mappings between various +spellings of a word which are not yet in varcon.txt. No attempt is +made to distinguish the primary form of a word. The file +variant-infl.tab is like variant-also.tab except that it is created +automatically from the AGID inflection database. The file +variant-wroot.tab is like variant-infl.tab except that it also +included the root form of the word. + +The file voc.tab is similar to varcon.txt but converts between +vocabulary instead of spelling. Unlike varcon.tab it is a simple tab +separated file with the fields corresponding to the American, British, +and Canadian words. If more than one word if often used to describe +the same thing the words are separated with commas. The last column +contains additional notes on when the word is used. Unlike varcon.txt +it is generally not suitable for automatic conversion. + +The "make-variant" Perl script will combine varcon.txt, +variant-also.tab, and variant-infl.tab into one huge mapping and will +output the result to "variant.tab". If the "no-infl" option is given +than variant-infl.tab will not be included. + +The "split" script will split out the information in varcon.txt into +several word lists named as follows: + [-v][-uncommon].lst +where is one of: american, british, british_z, canadian, +common, or other. "common" is used for words which appear in +varcon.txt, yet are used in all versions of english, such as "prize", +and "other" is used for the "_" spelling category. The mapping from +the variant indicators in varcon.txt to the numeric variant level is +as follows: + v => 0 + V => 1 + - => 2 +"-uncommon" is used for forms marked with "(-)" as already described. + +The "translate" Perl script will translate a text file from one +spelling to another. Its usage is: + +translate [] + is any of + -?,-h,--help this screen + -m,--mark mark words where the translation is questionable + -i,--include include words where the translation is questionable + is the file name of the translation array, + defaults to "abbc.tab". + and are one of: american, british, british_z, or canadian. +british-ise and british-ize can also be used. + +Text is read in from standard input and is outputted to standard out. +Words are marked with a '?' before and after the questionable word +when the option is enabled. + +The file varcon.pm contains some library routines for parsing +varcon.txt and is used by many of the scripts above. + +If you discover any errors in these mappings or have suggestions for +additions please file a bug report at +https://github.com/kevina/wordlist/issues, or alternatively email me +directly at kevina@gnu.org, but I will likely tell you to file a bug +report so that I don't forget about it. + +SOURCE: + +These mappings were compiled from numerous sources. + +The abc.tab was originally created from the American and British word +lists found in the Ispell distribution and the Canadian word list +created by Garst R. Reese : + + What I have discovered is that Canadian is a modification of British. + Canadians use ize ization, izing izable like Americans, and gram instead + of gramme. The one exception I found was practise. It does not go to + practize. Otherwise they use British spelling. So, what I am currently + checking books with is a an edited version of British, where I have + changed all occurrences of ise to ize, isab to izab, isation to ization, + ising to izing, and gramme to gram except I allow programme, which is + sometimes proper unless you are talking about a computer program. I did + bunches of greps to be sure these substitutions would work as expected. + +Many other words have been added to abc.tab which were not in the +original Ispell word lists. + +Many different web sources were consulted when crating the tables. They +include: + + The American-British British-American Dictionary + http://www.peak.org/~jeremy/dictionary/dictionary.html + American and British Spelling Differences + http://www.peak.org/~jeremy/dictionary/spellcat.html + Dave (VE7CNV)'s Truly Canadian Dictionary of Canadian Spelling + http://www.luther.bc.ca/~dave7cnv/cdnspelling/cdnspelling.html + Canadian Spelling Convention + http://imej.wfu.edu/articles/1999/1/02/demo/tutorial/canas.html + Cornerstone's Canadian English Page + http://www.web.net/cornerstone/cdneng.htm + Inter-Play Translation: British/Canadian/American Spelling + http://www.inter-play.com/translation/spel-ukus.htm + Inter-Play Translation: British/Canadian/American Vocabulary + http://www.inter-play.com/translation/voc-ukus.htm + +As well as several online dictionaries: + + Marriam-Webster: http://www.m-w.com/ + American Heritage: http://www.bartleby.com/61/ + Cambridge (ESL): http://dictionary.cambridge.org/ + +In version 5.0 a massive effort to correct the numerous errors in +VarCon was done. The primary sources used for verification where: + + Marriam-Webster: http://www.m-w.com/ + Free version of Oxford Dictionaries Online: + http://www.oxforddictionaries.com/ + Oxford dictionaries available via Oxford Reference Online + (subscription service, http://www.oxfordreference.com/): + The New Oxford American Dictionary (2nd edition, 2006) + and sometimes: The Oxford American Dictionary of Current English (2002) + The Concise Oxford English Dictionary (11th edition revised, 2008) + and sometimes: The Oxford Dictionary of English (2nd edition revised, 2005) + The Canadian Oxford Dictionary (2004) + +I also used Tysto UK vs US spelling list available at: + http://www.tysto.com/articles05/q1/20050324uk-us.shtml +to make sure I didn't leave out any information in VarCon, however any +additions from his lists where verified using the dictionaries +mentioned above as his lists contained numerous errors (such as +including archaic spellings of words) + +I also made indirect use of Luke's Canadian, British and American +Spelling page available at: + http://www.lukemastin.com/testing/spelling/cgi-bin/database.cgi?database=spelling +but only to perform some initial verification, in the end I rechecked +his data using the dictionaries above. (However, his data is, by far, +more accurate than Tysto's) + +CHANGELOG: + +From 2014.02.15 to 2015.08.24 (Aug 24, 2015) + + - Added entry for Koran/Koranic. + + - Tweaked "adviser" cluster. + + - Fix formatting problems. + +From 2015.01.28 to 2014.02.15 (February 15, 2015) + + - Various new entries + +From 2014.11.17 to 2015.01.28 (January 28, 2015) + + - Minor adjustments to a few entries (analytic, amid) + + - Added entry for shareable + + - Remove a junk entry (ted/taed). + +From 2014.08.11 to 2014.11.17 (November 17, 2014) + + - Fix typos in README + + - Enhancement to VarCon translate script. It will now, by default, + filter clusters with a SCOWL level > 80. This behavior can be + controlled with the new "--thresh" option. + + - Remove a few junk entries. + +From Revision 5.1 to Version 2014.08.11 (August 8, 2014) + + - Various corrections. Most of them minor. Two notable exceptions: + + - Added an entry for furor as the correct British spelling is furore + + - Fixed racket entries as Canadians still use racquet even + though it is a British English (at least according to the + Oxford dictionaries) + + - Other minor changes. + +From Revision 5.0 to Revision 5.1 (January 6, 2010) + + - Corrected numerous errors after running various forms + of verification on varcon.txt. + + - Reordered the clusters in varcon.txt so that they are + mostly in alphabetic order based on the headword. + +From Revision 4.1 to Revision 5.0 (December 27, 2010) + + - Completely new format for the main table which, in addition to + providing the preferred spelling of a word for various forms of + English, also records variant and other information. To reflect + this change, the name of the file was renamed from abbc.tab to + varcon.txt. + + - Massive effort to verify the variant information against + authoritative sources (mainly Oxford dictionaries). Most entries + for common words (SCOWL level 35 and below) have been checked + against at least a British and Canadian dictionary. + + - Added variant information for numerous other words, even when + there is no difference between the various forms on English. + + - Other changes corresponding to the new format. + +From Revision 4 to Revision 4.1 (August 10, 2004) + + - Fixed various errors in abbc.tab + + - Removed clause 4 from the Ispell copyright with permission of Geoff + Kuenning. + +From Revision 3 to Revision 4 (August 7, 2004) + + - Added a column to "abc.tab" for the British "ize" spelling and + renamed the file to abbc.tab. + - Added verb forms of prize/prise to abbc.tab, removed from + variant-also.tab + +From Revision 2 to Revision 3 (January 2, 2003) + + - Added an option for not including variant-infl.tab for the + make-variant perl script + - Added the file variant-wroot.tab + - Added a few entries given to me by Francis Bond and Edward Betts + +From Revision 1 to Revision 2 (January 27, 2001) + + - Removed all "B" markers because I could not find any evidence for + them + - Corrected a few Canadian entries, especially those with the "B" + markers + - Added some more entries by trying fixed changes (such as ize to + ise) to words in SCOWL and hand-checking over the ones with semi-common + words in them. + - Added variant-infl.tab + +COPYRIGHT: + +Copyright 2000-2010 by Kevin Atkinson + +Permission to use, copy, modify, distribute and sell this array, the +associated software, and its documentation for any purpose is hereby +granted without fee, provided that the above copyright notice appears +in all copies and that both that copyright notice and this permission +notice appear in supporting documentation. Kevin Atkinson makes no +representations about the suitability of this array for any +purpose. It is provided "as is" without express or implied warranty. + +Since the original words lists come from the Ispell distribution: + +Copyright 1993, Geoff Kuenning, Granada Hills, CA +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + +1. Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. +2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. +3. All modifications to the source code must be clearly marked as + such. Binary redistributions based on modified source code + must be clearly marked as modified versions in the documentation + and/or other materials provided with the distribution. +(clause 4 removed with permission from Geoff Kuenning) +5. The name of Geoff Kuenning may not be used to endorse or promote + products derived from this software without specific prior + written permission. + +THIS SOFTWARE IS PROVIDED BY GEOFF KUENNING AND CONTRIBUTORS ``AS IS'' AND +ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE +ARE DISCLAIMED. IN NO EVENT SHALL GEOFF KUENNING OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS +OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) +HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT +LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF +SUCH DAMAGE.