r/uk-freq-class/notes.txt

     UK English Wordlist With Frequency Classification

This wordlist is primarily intended to be useful for
checking spelling. Editorial policy is conservative.

Principal omissions:

   -   words requiring a capital letter
   -   abbreviations
   -   slang

Colloquialisms and archaisms are generally excluded. A rare
word similar to a common word may be excluded. Both -ise and
-ize spellings are included.

The character set is: lowercase letters, hyphen, apostrophe.
Words which can be spelt with accents occur here in their
plain form.
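
As a rough illustration of the character set described above, the
following Python sketch checks whether a word fits it and reduces an
accented spelling to a plain form. The function names, and the choice
of Unicode decomposition to drop accents, are assumptions made for
this example; they are not part of the wordlist or its tooling.

```python
import re
import unicodedata

# Assumed reading of the stated character set:
# lowercase a-z, hyphen, apostrophe.
ALLOWED = re.compile(r"[a-z'-]+")


def to_plain_form(word: str) -> str:
    """Drop combining accents from a word (illustrative approach only)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fits_character_set(word: str) -> bool:
    """True if the word uses only lowercase letters, hyphen, apostrophe."""
    return ALLOWED.fullmatch(word) is not None
```

Characters that have no accent decomposition are left unchanged by
to_plain_form, so a word containing one would still fail the
fits_character_set check.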

If this wordlist is to be used with ispell the following
lines may be appropriate for the affix file:

   boundarychars [---]
   boundarychars '
   wordchars [a-z] [A-Z]

The commonest words are labelled 16 and the least common 0.
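
One possible use of the labels is to order candidate spelling
corrections so that commoner words come first. The Python sketch
below assumes the labels have already been loaded into a dictionary
mapping each word to its class; the function name, the dictionary and
the handling of unlisted words are hypothetical, and these notes do
not describe how the list itself is stored.

```python
def rank_corrections(candidates, freq_class):
    """Order candidate corrections from most to least common.

    freq_class is an assumed mapping of word -> class, where 16 is the
    commonest and 0 the least common. Words missing from the mapping
    sort last; candidates with equal classes keep their input order.
    """
    return sorted(candidates, key=lambda w: freq_class.get(w, -1), reverse=True)
```

For example, with invented classes {"there": 16, "thane": 3}, the
candidates ["thane", "there", "thew"] would be ordered as "there",
"thane", "thew".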

Coverage of common words should be good, but note the
categories excluded.

                        Brian Kelk bck22@bckelk.uklinux.net
                        April 2002


Here are bits of a brief conversation I had with the author:

From: Brian Kelk <Brian.Kelk@cl.cam.ac.uk>
Date: Sat, 08 Jul 2000 20:27:21 +0100

> I was wondering what the copyright status of your "UK English Wordlist
> With Frequency Classification" word list as it seems to be lacking any
> copyright notice.  Also, how did you arrive at the "Frequency
> Classification".

There were many many sources in total, but any text marked
"copyright" was avoided. Locally-written documentation was one
source. An earlier version of the list resided in a filespace
called PUBLIC on the University mainframe, because it was
considered public domain.

Briefly about frequency: rather than counting occurrences of
a word this classification is more along the lines of counting
the number of texts in which the word occurs. That way you
get some noise immunity, which you very much need. It's based
on maybe 5-10 million words of text on the Cambridge mainframe
in the 1980s. I had in mind that it might be useful for ranking
possible corrections ...

Date: Tue, 11 Jul 2000 19:31:34 +0100

> So are you saying your word list is also in the public domain?

That is the intention.
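
The remark on frequency in the first reply above (counting the number
of texts in which a word occurs, rather than its total occurrences)
can be sketched in Python as follows. This is only an illustration of
the general idea: the tokenizer and function name are assumptions, it
is not the author's actual procedure, and the conversation does not
say how such counts were turned into the classes 0 to 16.

```python
from collections import Counter
import re

# Assumed tokenizer: runs of lowercase letters, hyphens and apostrophes.
WORD = re.compile(r"[a-z'-]+")


def document_frequency(texts):
    """Count, for each word, the number of texts in which it appears."""
    counts = Counter()
    for text in texts:
        # set() ensures a word is counted once per text,
        # however often it is repeated within that text.
        counts.update(set(WORD.findall(text.lower())))
    return counts
```

A word repeated many times in a single text still scores 1, which is
the kind of noise immunity the reply mentions.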