X-Git-Url: https://git.donarmstrong.com/?p=deb_pkgs%2Fscowl.git;a=blobdiff_plain;f=7.1%2Fr%2Fcensus%2Fnam_meth.txt;fp=7.1%2Fr%2Fcensus%2Fnam_meth.txt;h=d5ca8c65d38bef7d1008075a7db739583d34a1d9;hp=0000000000000000000000000000000000000000;hb=01534a94130c1f5a3a230cf4fe18365a235ba271;hpb=7b14ba883fb1046508c44be37b4c6ba5da5feacf
diff --git a/7.1/r/census/nam_meth.txt b/7.1/r/census/nam_meth.txt
new file mode 100644
index 0000000..d5ca8c6
--- /dev/null
+++ b/7.1/r/census/nam_meth.txt
@@ -0,0 +1,236 @@
+10/95
+
+DOCUMENTATION AND METHODOLOGY FOR FREQUENTLY OCCURRING NAMES IN THE U.S.--1990
+
+The Census Bureau has a primary obligation to protect the
+confidentiality of individual responses to the Census. As part
+of this confidentiality commitment, the Census Bureau does not
+currently release individual census questionnaires (or any other
+information that could identify an individual) until 72 years
+after a Decennial Census was taken. In 1992, the Census Bureau
+released the 1920 Census schedules to the National Archives. In
+fact, the Census Bureau is so concerned about confidentiality
+that names have not been entered into the basic internal electronic
+data used to tabulate census results.
+
+However, there have been numerous demands for summary data on the
+frequency of surnames for genealogical reasons. Similar interest
+has arisen for the frequency of first names by sex. This data
+set attempts to satisfy these demands while still providing
+utmost confidentiality of individual results.
+
+
+BACKGROUND
+
+In the summer of 1990, immediately following the 1990 Decennial
+Census, the United States Census Bureau conducted a large-scale
+survey to measure undercount in the 1990 Census. This
+independent post-Census operation (the 1990 Post-Enumeration
+Survey--PES) collected items of demographic data (race, sex, age
+and NAME) from 377,000 persons living in 165,000 housing units in
+5,300 predefined blocks (or block clusters).
+
+The information acquired from this independent (PES) operation
+was matched against actual 1990 Census records for persons living
+in those same 5,300 blocks plus additional surrounding ring
+blocks. The PES blocks plus the surrounding ring--"the Search
+Area"--contained 7.2 million census records complete with names.
+It is this Search Area data set that provides the basis for the
+three name files at this internet site.
+
+FILE FORMATS
+
+In July 1995, the Census Bureau placed abridged summary
+information from the Search Area on its internet site. Selected
+data from these files have appeared in the print media with the
+citation "source--Census Bureau". Since the documentation
+accompanying the original release of this information was
+sketchy, we are supplying additional explanatory material about
+the limitations of these data.
+
+
+Each of the three files, (dist.all.last), (dist.male.first), and
+(dist.female.first), contains four items of data. The four items
+are:
+
+     (1). A "Name"
+     (2). Frequency in percent
+     (3). Cumulative Frequency in percent
+     (4). Rank
+
+In the file (dist.all.last) one entry appears as:
+
+     MOORE          0.312   5.312      9
+
+In our Search Area sample, MOORE ranks 9th in terms of frequency.
+5.312 percent of the sample population is covered by MOORE and
+the 8 names occurring more frequently than MOORE. The surname
+MOORE is possessed by 0.312 percent of our population sample.
+
+
+EDITING
+
+Producing that summary line for the name MOORE required a great
+deal of program editing. For example, we immediately realized
+that it was necessary to convert the entries MOORE JR, MOORE SR,
+and MOORE III in the last name field to MOORE. For purposes of
+consistency we also converted entries such as MOORE JONES or
+MOORE-JONES to MOORE.
+
+In addition to those rather simple edits, we also examined
+each name entry for the possibility of an inversion.
+(e.g., a first
+name appearing in the last name field and a last name placed in
+the first name area). Consider a 2-person household with the
+entries MOORE ROBERT and MOORE CAROLYN in the name fields.
+From our sample name universe, we can empirically determine
+that the inversion (ROBERT MOORE and CAROLYN MOORE) has a far
+greater probability of being "right" than the keyed entry. When
+the odds in favor of an inversion reached 10,000 to 1, the
+inversion was done.
+
+Many names can be inverted and sound absolutely right. For
+example, there is absolutely no reason to suspect that HENRY
+THOMAS is wrong and THOMAS HENRY is preferable. However, if
+HENRY THOMAS had a spouse listed as HENRY MARTHA and a female
+child named HENRY SUSAN, that additional information suffices
+to invert the name field for the entire family.
+
+For first names, we considered concatenating entries but finally
+decided against it. Among males the combinations JOHN PAUL and
+JOSE LUIS in the first name field were far more frequent than any
+other set of spaced names. We could possibly have formed them as
+JOHNPAUL and JOSELUIS. Because we did not, the male first names
+JOHN and JOSE may be marginally overstated.
+
+The one name that is most affected by our decision not to
+concatenate is the grand old name of MARY. The entries (MARY
+ANN, MARY BETH, MARY CATHERINE, MARY ELLEN, MARY FRANCES, MARY
+GRACE etc) wind up as MARY. MARY may or may not be the most
+common first name among American women, but our decision to avoid
+concatenation did add a significant number of MARYs.
+
+Finally, we came to the conclusion that a single letter (an
+initial perhaps?) appearing in either the first or last name
+field would not qualify as a name; but an initial in one field
+would not disqualify the other name field. For example
+the 19th century financier (J P MORGAN) has a valid last
+name but the letter J does not meet these standards for a first
+name.
+MUHAMMED X is an example of an acceptable first name
+with an unusable surname.
+
+
+MISSING DATA
+
+Although the search area contains 7.2 million persons, almost 15
+percent of those persons do not provide enough information to
+form a name. In the previous paragraph we described the situation
+in which we decided that a single letter would not
+constitute a name. Other situations are listed below.
+
+
+     1. The respondents did not enter a "name" at the top of page
+2 of the 1990 Census form, even though names might have appeared
+on the roster of page 1. A name must appear at the top of page 2
+for the name to be keyed.
+
+     2. The respondent may have inadvertently left sex (gender)
+off his or her Census record. In that instance we accept the
+last name, but we have no "certain" way of placing the first name
+in the male or the female file. We do not assume that JENNIFER
+without a sex designator would be female even though common sense
+suggests that this is indeed the case.
+
+     3. A family may have put down a last name for the
+householder but not for any other household member. We may have
+the following family: JOHN SMITH, MARY (blank), JOHN JR
+(blank), ROBERT (blank), JENNIFER (blank), SUSAN
+(blank). In that family we have a first and last name for
+householder John, but first names only for the remaining 5 family
+members.
+
+     4. The keyed name may not follow acceptable form. Some
+examples of invalid entries in either the first or last name
+field are: BABY GIRL, MR JONES, DR BROWN, FILIPINO
+FEMALE.
+
+Each of these situations is responsible for reducing our
+original sample of 7.2 million person records to its present
+size of 6.3 million. The actual numbers of person records making
+up the unabridged files are:
+
+     File Name               Valid Records     Unique Names
+
+  1. dist.all.last             6,290,251          88,799
+  2. dist.female.first         3,184,399           4,275
+  3. dist.male.first
+                             3,003,954           1,219
+
+For purposes of both confidentiality and elimination of data
+noise we restricted the number of unique names available at this
+internet site to the minimum number of entries that contain 90
+percent of the population in that data file. There is an
+extremely small chance that an individual with a truly "unique"
+name could have been captured in the sample, and this is far more
+likely for surnames than for first names. A second basis for
+limiting entries is that a smattering of entries exists because
+of the combination of bad handwriting coupled with poor typing.
+Consider the entry JOSEHP in dist.male.first. Although JOSEHP
+may be a name, it is much more likely that all of the JOSEHP
+entries are really miskeys of JOSEPH.
+
+
+LIMITATIONS
+
+For the names at the top of the distribution, (SMITH, JOHNSON,
+WILLIAMS, JONES etc), or (MARY, PATRICIA, LINDA, BARBARA etc) or
+(JAMES, JOHN, ROBERT, MICHAEL etc) the data speak for
+themselves. However as the sample thins, one might draw
+conclusions about frequency that are not warranted.
+
+The PES sample intentionally oversampled both Blacks and
+Hispanics, and it is likely that the Search Area also contains an
+excess of these two groups. Thus the frequently occurring
+surnames: GARCIA, RODRIGUEZ, GOMEZ and WASHINGTON as well as
+first names: JUAN, JOSE, GUADALUPE and WILLIE might attain higher
+rankings than their actual population numbers within the United
+States would warrant.
+
+But the limitations due to sampling are much more noticeable when
+looking at rarely occurring names--especially surnames. Consider
+a surname appearing 63 times (out of 6.3 million entries) in the
+file dist.all.last. Here the frequency would appear as 0.001
+percent, but it is possible that that sample frequency may not be
+close to "truth".
+
+Ignoring clustering, (persons in the same household usually have
+the same surname) the coefficient of variation on a number of
+that magnitude would be approximately 12 percent.
+But most
+people who do not live alone share a last name with other people
+in that household. Thus the 63 persons with that rare name may
+be the result of 16 households, which would raise the coefficient
+of variation to approximately 25 percent.
+
+But we are not done. Even in the last years of the 20th century,
+families tend to live close to each other, and it is not
+impossible to conceive a situation where all Americans with a
+certain surname appear in the sample. Were that situation to
+occur it would be possible to overstate the frequency of that
+surname by a factor of 40. The number 40 arises because the number
+of Census records in the sample (6,290,251) is approximately one
+fortieth of the United States population.
+
+The fact that a name doesn't appear in these three files does not
+mean that it is nonexistent, only that it is reasonably rare.
+
+In conclusion we do realize that misleading frequencies are much
+less likely in the files (dist.female.first) and
+(dist.male.first). Although a father and one son may share a
+first name, brothers almost never share the same first name.
+
+
+ADDITIONAL INFORMATION
+
+Persons wanting or needing more information about the contents of
+these three files can contact David L. Word (dword@census.gov)
+301-457-2103 or Randy M. Klear (rklear@census.gov) 301-457-1727.
+
+
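The four-item record layout described under FILE FORMATS and the sampling arithmetic under LIMITATIONS can be sketched in a few lines of Python. This is an illustrative sketch only, not the Census Bureau's code. The names `NameRecord`, `parse_record`, `sample_frequency_pct` and `coefficient_of_variation` are hypothetical, and the 1/sqrt(n) approximation is an assumption chosen because it reproduces the quoted figures (roughly 12 percent for 63 independent persons, 25 percent for 16 households); the document does not state which formula was used.

```python
import math
from typing import NamedTuple


class NameRecord(NamedTuple):
    """One line of a dist.* file (hypothetical container)."""
    name: str
    freq_pct: float      # item (2): frequency in percent
    cum_freq_pct: float  # item (3): cumulative frequency in percent
    rank: int            # item (4): rank


def parse_record(line: str) -> NameRecord:
    """Split one whitespace-separated record into its four items."""
    name, freq, cum_freq, rank = line.split()
    return NameRecord(name, float(freq), float(cum_freq), int(rank))


def sample_frequency_pct(count: int, total: int) -> float:
    """Frequency of a name in percent of all valid records."""
    return 100.0 * count / total


def coefficient_of_variation(n_independent: int) -> float:
    """Assumed approximation: CV of a rare count is about 1/sqrt(n)."""
    return 1.0 / math.sqrt(n_independent)


# The example entry from the document.
moore = parse_record("MOORE          0.312   5.312      9")

# A surname seen 63 times among the 6,290,251 valid surname records.
rare_pct = sample_frequency_pct(63, 6_290_251)

# Treating the 63 persons as independent, then as 16 households.
cv_persons = coefficient_of_variation(63)
cv_households = coefficient_of_variation(16)

print(moore)
print(f"{rare_pct:.3f} percent")
print(f"CV ignoring clustering: {cv_persons:.1%}")
print(f"CV with 16 households:  {cv_households:.1%}")
```

Counting households rather than persons shrinks the number of independent observations, which is why the coefficient of variation roughly doubles in the clustered case.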