7.1/r/census/nam_meth.txt

10/95

DOCUMENTATION AND METHODOLOGY FOR FREQUENTLY OCCURRING NAMES IN THE U.S.--1990

The Census Bureau has a primary obligation to protect the
confidentiality of individual responses to the Census.  As part
of this confidentiality commitment, the Census Bureau does not
currently release individual census questionnaires (or any other
information that could identify an individual) until 72 years
after a Decennial Census was taken.  In 1992, the Census Bureau
released the 1920 Census schedules to the National Archives.  In
fact, the Census Bureau is so concerned about confidentiality
that names have not been entered into the basic internal
electronic data used to tabulate census results.

However, there have been numerous demands for summary data on the
frequency of surnames for genealogical reasons.  Similar interest
has arisen for the frequency of first names by sex.  This data
set attempts to satisfy these demands while still providing
utmost confidentiality of individual results.


BACKGROUND

In the summer of 1990, immediately following the 1990 Decennial
Census, the United States Census Bureau conducted a large-scale
survey to measure undercount in the 1990 Census.  This
independent post-census operation (the 1990 Post-Enumeration
Survey--PES) collected items of demographic data (race, sex, age
and NAME) from 377,000 persons living in 165,000 housing units in
5,300 predefined blocks (or block clusters).

The information acquired from this independent (PES) operation
was matched against actual 1990 Census records for persons living
in those same 5,300 blocks plus additional surrounding ring
blocks.  The PES blocks plus the surrounding ring--"the Search
Area"--contained 7.2 million census records replete with name.
It is this Search Area data set that provides the basis for the
three name files at this internet site.


FILE FORMATS

In July 1995, the Census Bureau placed abridged summary
information from the Search Area on its internet site.  Selected
data from these files have appeared in the print media with the
citation "source--Census Bureau".  Since the documentation
accompanying the original release of this information was
sketchy, we are supplying additional explanatory material about
the limitations of these data.

Each of the three files, (dist.all.last), (dist.male.first), and
(dist.female.first), contains four items of data.  The four items
are:

         (1).  A "Name"
         (2).  Frequency in percent
         (3).  Cumulative Frequency in percent
         (4).  Rank

In the file (dist.all.last) one entry appears as:

        MOORE       0.312       5.312       9

In our Search Area sample, MOORE ranks 9th in terms of frequency.
The surname MOORE is possessed by 0.312 percent of our population
sample, and 5.312 percent of the sample population is covered by
MOORE and the 8 names occurring more frequently than MOORE.
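
The four-item layout can be read mechanically.  The following is a
minimal Python sketch, assuming each record is exactly four
whitespace-separated fields as in the MOORE entry.  The class and
field names (NameRecord, freq_pct, cum_freq_pct) are illustrative
and are not part of the files themselves.

```python
from typing import NamedTuple


class NameRecord(NamedTuple):
    """One line of a name file (field names are illustrative)."""
    name: str
    freq_pct: float      # item 2: frequency in percent
    cum_freq_pct: float  # item 3: cumulative frequency in percent
    rank: int            # item 4: rank


def parse_record(line: str) -> NameRecord:
    # Assumes exactly four whitespace-separated fields, as in the
    # MOORE example; any other shape raises ValueError.
    name, freq, cum_freq, rank = line.split()
    return NameRecord(name, float(freq), float(cum_freq), int(rank))


record = parse_record("        MOORE       0.312       5.312       9")
print(record.name, record.freq_pct, record.cum_freq_pct, record.rank)

# Share of the sample held by the 8 more frequent surnames combined.
print(round(record.cum_freq_pct - record.freq_pct, 3))
```

Subtracting item 2 from item 3 gives the combined share of all
higher-ranked names, which for MOORE is about 5 percent.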


EDITING

Producing that summary line for the name MOORE required a great
deal of program editing.  For example, we immediately realized
that it was necessary to convert the entries MOORE JR, MOORE SR,
and MOORE III in the last name field to MOORE.  For purposes of
consistency we also converted entries such as MOORE JONES or
MOORE-JONES to MOORE.
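
The surname edits just described can be sketched in a few lines
of Python.  This is an illustration under stated assumptions, not
the program actually used: the suffix list contains only the three
suffixes named in the text, and the function name
normalize_surname is hypothetical.

```python
import re

# Assumed suffix list: the text names only JR, SR and III.
SUFFIXES = {"JR", "SR", "III"}


def normalize_surname(entry: str) -> str:
    """Reduce a keyed last-name entry to a single surname."""
    # Split on spaces and hyphens: "MOORE-JONES" -> ["MOORE", "JONES"].
    parts = [p for p in re.split(r"[\s-]+", entry.strip().upper()) if p]
    # Drop generational suffixes such as JR, SR, III.
    parts = [p for p in parts if p not in SUFFIXES]
    # Keep only the first remaining word, as in MOORE JONES -> MOORE.
    return parts[0] if parts else ""


for raw in ["MOORE JR", "MOORE SR", "MOORE III", "MOORE JONES", "MOORE-JONES"]:
    print(raw, "->", normalize_surname(raw))
```

Each of the five sample entries reduces to MOORE.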

In addition to those rather simple edits, we also examined
each name entry for the possibility of an inversion (e.g., a
first name appearing in the last name field and a last name
placed in the first name area).  Consider a 2 person household
with the entries MOORE   ROBERT, and MOORE   CAROLYN in the name
fields.  From our sample name universe, we can empirically
determine that the inversion (ROBERT   MOORE and CAROLYN   MOORE)
has a far greater probability of being "right" than the keyed
entry.  When the odds in favor of an inversion reached 10,000 to
1, the inversion was done.

Many names can be inverted and sound absolutely right.  For
example, there is absolutely no reason to suspect that HENRY
THOMAS is wrong and THOMAS    HENRY is preferable.  However, if
HENRY  THOMAS had a spouse listed as HENRY   MARTHA  and a female
child named HENRY   SUSAN, that additional information suffices
to invert the name field for the entire family.
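
The text does not say how the odds of an inversion were computed.
One plausible reading, sketched below in Python, multiplies a
per-person ratio across the household: how often the two keyed
words occur in the swapped roles versus the keyed roles.  The
counts are invented for illustration and are not taken from the
Census files.  The function name and the treatment of household
members as independent are assumptions.

```python
# Invented counts of each word as a first name and as a last name.
FIRST_COUNTS = {"ROBERT": 90000, "CAROLYN": 12000, "MOORE": 3,
                "HENRY": 9000, "THOMAS": 40000,
                "MARTHA": 13000, "SUSAN": 25000}
LAST_COUNTS = {"MOORE": 19000, "ROBERT": 40, "CAROLYN": 1,
               "HENRY": 5000, "THOMAS": 12000,
               "MARTHA": 2, "SUSAN": 1}
THRESHOLD = 10_000  # the 10,000 to 1 cutoff from the text


def inversion_odds(household):
    """Odds that every (first, last) pair in the household is swapped.

    Unseen words get a count of 1 to avoid dividing by zero.
    """
    odds = 1.0
    for keyed_first, keyed_last in household:
        swapped = (FIRST_COUNTS.get(keyed_last, 1)
                   * LAST_COUNTS.get(keyed_first, 1))
        as_keyed = (FIRST_COUNTS.get(keyed_first, 1)
                    * LAST_COUNTS.get(keyed_last, 1))
        odds *= swapped / as_keyed
    return odds


moores = [("MOORE", "ROBERT"), ("MOORE", "CAROLYN")]
henry_alone = [("HENRY", "THOMAS")]
henry_family = henry_alone + [("HENRY", "MARTHA"), ("HENRY", "SUSAN")]

for label, people in [("MOORE household", moores),
                      ("HENRY THOMAS alone", henry_alone),
                      ("HENRY family", henry_family)]:
    print(label, "invert" if inversion_odds(people) >= THRESHOLD else "keep")
```

With these invented counts the MOORE household and the full HENRY
family clear the cutoff, while HENRY THOMAS alone does not.  This
matches the behavior the text describes.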

For first names, we considered concatenating entries but finally
decided against it.  Among males the combinations JOHN PAUL and
JOSE LUIS in the first name field were far more frequent than any
other set of spaced names.  We could possibly have formed them as
JOHNPAUL and JOSELUIS, but instead each was reduced to its first
word.  As a result the male first names JOHN and JOSE may be
marginally overstated.

The one name that is most affected by our decision not to
concatenate is the grand old name of MARY.  The entries  (MARY
ANN, MARY BETH, MARY CATHERINE, MARY ELLEN, MARY FRANCES, MARY
GRACE  etc) wind up as MARY.  MARY may or may not be the most
common first name among American women, but our decision to avoid
concatenation did add a significant number of MARYs.

Finally, we came to the conclusion that a single letter (an
initial perhaps?) appearing in either the first or last name
field would not qualify as a name; but an initial in one field
would not disqualify the other name field.  For example the 19th
century financier (J      P  MORGAN) has a valid last name but
the letter J does not meet these standards for a first name.
MUHAMMED    X is an example of an acceptable first name with an
unusable surname.
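
The single-letter rule treats the two name fields independently.
A minimal Python sketch, assuming the fields arrive as separate
strings and ignoring every other validity check (the function name
usable_fields is hypothetical):

```python
from typing import Tuple


def usable_fields(first: str, last: str) -> Tuple[bool, bool]:
    """Return (first_ok, last_ok) under the single-letter rule only."""
    # A field counts as a name only if it has more than one character.
    first_ok = len(first.strip()) > 1
    last_ok = len(last.strip()) > 1
    return first_ok, last_ok


print(usable_fields("J", "MORGAN"))    # first rejected, last kept
print(usable_fields("MUHAMMED", "X"))  # first kept, last rejected
```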


MISSING DATA

Although the Search Area contains 7.2 million persons, almost 15
percent of those persons did not provide enough information to
form a name.  In the previous paragraph we described one such
situation, where we decided that a single letter would not
constitute a name.  Other situations are listed below.

     1. The respondents did not enter a "name" at the top of page
2 of the 1990 Census form, even though names might have appeared
on the roster of page 1.  A name must appear at the top of page 2
for the name to be keyed.

     2. The respondent may have inadvertently left sex (gender)
off his or her Census record.  In that instance we accept the
last name, but we have no "certain" way of placing the first name
in the male or the female file.  We do not assume that JENNIFER
without a sex designator would be female even though common sense
suggests that this is indeed the case.

     3. A family may have put down a last name for the
householder but not for any other household member.   We may have
the following family: JOHN    SMITH,   MARY    (blank),  JOHN JR
(blank),  ROBERT     (blank),  JENNIFER      (blank),  SUSAN
(blank).  In that family we have a first and last name for
householder John, but first names only for the remaining 5 family
members.

     4. The keyed name may not follow acceptable form.  Some
examples of invalid entries in either the first or last name
field are: BABY   GIRL,  MR    JONES,   DR    BROWN,  FILIPINO
FEMALE.

Each of these situations contributes to reducing our
original sample of 7.2 million person records to its present
size of 6.3 million.  The actual numbers of person records making
up the unabridged files are:

     File Name             Valid Records      Unique Names

  1. dist.all.last          6,290,251            88,799
  2. dist.female.first      3,184,399             4,275
  3. dist.male.first        3,003,954             1,219

For purposes of both confidentiality and elimination of data
noise we restricted the number of unique names available at this
internet site to the minimum number of entries that contain 90
percent of the population in that data file.  There is an
extremely small chance that an individual with a truly "unique"
name could have been captured in sample, and this is far more
likely for surnames than for first names.   A second basis for
limiting entries is that a smattering of entries exists because
of bad handwriting coupled with poor typing.  Consider the entry
JOSEHP in dist.male.first.  Although JOSEHP may be a name, it is
much more likely that all of the JOSEHP entries are really
miskeys of JOSEPH.
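
The 90 percent restriction amounts to ranking names by count and
keeping the shortest prefix of that ranking whose cumulative share
reaches the target.  The Python sketch below shows one way to do
this.  The toy counts are invented for illustration, the function
name is hypothetical, and the handling of ties (alphabetical) is
an assumption, since the text does not address it.

```python
def entries_for_coverage(counts, target_pct=90.0):
    """Smallest set of most-frequent names reaching target_pct.

    counts maps name -> number of records.  Returns a list of
    (name, freq_pct, cum_pct, rank) tuples in rank order.
    """
    total = sum(counts.values())
    # Most frequent first; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept, running = [], 0
    for rank, (name, n) in enumerate(ranked, start=1):
        running += n
        cum_pct = 100 * running / total
        kept.append((name, 100 * n / total, cum_pct, rank))
        if cum_pct >= target_pct:
            break
    return kept


# Toy counts, invented for illustration only.
toy = {"JAMES": 50, "JOHN": 30, "ROBERT": 12, "JOSEHP": 5, "ZEBEDIAH": 3}
for row in entries_for_coverage(toy):
    print(row)
```

In the toy data the first three names already cover 92 percent, so
the rare entries, including the probable miskey JOSEHP, fall below
the cutoff and are not published.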


LIMITATIONS

For the names at the top of the distribution, (SMITH, JOHNSON,
WILLIAMS, JONES etc),  or (MARY, PATRICIA, LINDA, BARBARA etc) or
(JAMES, JOHN, ROBERT, MICHAEL etc) the data speak for
themselves.  However as the sample thins, one might draw
conclusions about frequency that are not warranted.

The PES sample intentionally oversampled both Blacks and
Hispanics, and it is likely that the Search Area also contains an
excess of these two groups.  Thus the frequently occurring
surnames: GARCIA, RODRIGUEZ, GOMEZ and WASHINGTON as well as
first names: JUAN, JOSE, GUADALUPE and WILLIE might attain higher
rankings than their actual population numbers within the United
States would warrant.

But the limitations due to sampling are much more noticeable when
looking at rarely occurring names--especially surnames.  Consider
a surname appearing 63 times (out of 6.3 million entries) in the
file dist.all.last.  Here the frequency would appear as 0.001
percent, but it is possible that that sample frequency may not be
close to "truth".

Ignoring clustering (persons in the same household usually have
the same surname), the coefficient of variation on a number of
that magnitude would be approximately 12 percent.  But most
people who do not live alone share a last name with other people
in that household.  Thus the 63 persons with that rare name may
be the result of 16 households, which would raise the coefficient
of variation to approximately 25 percent.
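
The two coefficients of variation quoted above are consistent with
the usual approximation for a rare count, CV = 1 / sqrt(n), where
n is the number of independent units.  The text does not state
this formula, so the Python sketch below treats it as an
assumption.

```python
import math


def approx_cv_pct(independent_units: int) -> float:
    """CV of a count in percent, assuming Poisson-like variation."""
    return 100 / math.sqrt(independent_units)


sample_size = 6_300_000  # rounded sample size used in the text
persons = 63

print(round(100 * persons / sample_size, 3))  # 0.001 (percent)
print(round(approx_cv_pct(63), 1))  # 12.6, treating 63 persons as independent
print(round(approx_cv_pct(16), 1))  # 25.0, treating 16 households as the units
```

Under this approximation 63 independent persons give about 12.6
percent, close to the quoted 12 percent, and 16 independent
households give exactly 25 percent.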

But we are not done.  Even in the last years of the 20th century,
families tend to live close to each other, and it is not
impossible to conceive a situation where all Americans with a
certain surname appear in sample.  Were that situation to occur
it would be possible to overstate the frequency of that surname
by a factor of 40.  The number 40 arises because the number
of Census records in the sample (6,290,251) is approximately one
fortieth of the United States population.
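
The factor of 40 can be checked with a short calculation.  The
sample size comes from the text.  The national total used below
(248,709,873, the 1990 Census resident population) is not stated
in this document and is supplied here as an outside figure.

```python
sample_records = 6_290_251        # from the text
us_population_1990 = 248_709_873  # outside figure, not from this document

# How many times larger the nation is than the sample.
factor = us_population_1990 / sample_records
print(round(factor, 1))

# Worst case: a surname borne by 63 people nationwide, all of whom
# happen to fall in the sample.
bearers = 63
sample_freq_pct = 100 * bearers / sample_records
true_freq_pct = 100 * bearers / us_population_1990
print(round(sample_freq_pct / true_freq_pct, 1))  # same ratio as factor
```

The ratio comes out a little under 40, which agrees with the
"approximately one fortieth" in the text.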

The fact that a name doesn't appear in these three files does not
mean that it is nonexistent, only that it is reasonably rare.

In conclusion we do realize that misleading frequencies are much
less likely in the files (dist.female.first) and
(dist.male.first).  Although a father and one son may share a
first name, brothers almost never share the same first name.


ADDITIONAL INFORMATION

Persons wanting or needing more information about the contents of
these three files can contact David L. Word (dword@census.gov)
301-457-2103 or Randy M. Klear (rklear@census.gov) 301-457-1727.