Imported Upstream version 2015.08.24

diff --git a/6/r/census/nam_meth.txt b/6/r/census/nam_meth.txt
deleted file mode 100644
index d5ca8c6..0000000
--- a/6/r/census/nam_meth.txt
+++ /dev/null
@@ -1,236 +0,0 @@
-10/95\r
-\r
-DOCUMENTATION AND METHODOLOGY FOR FREQUENTLY OCCURRING NAMES IN THE U.S.--1990\r
-\r
-The Census Bureau has a primary obligation to protect the\r
-confidentiality of individual responses to the Census.  As part\r
-of this confidentiality commitment, the Census Bureau does not\r
-currently release individual census questionnaires (or any other\r
-information that could identify an individual) until 72 years\r
-after a Decennial Census was taken.  In 1992, the Census Bureau\r
-released the 1920 Census schedules to the National Archives.  In\r
-fact, the Census Bureau is so concerned about confidentiality\r
-that name has not been entered into the basic internal electronic data\r
-used to tabulate census results.\r
-\r
-However, there have been numerous demands for summary data on the\r
-frequency of surnames for genealogical reasons.  Similar interest\r
-has arisen for the frequency of first names by sex.  This data\r
-set attempts to satisfy these demands while still providing\r
-utmost confidentiality of individual results.\r
-\r
-\r
-BACKGROUND\r
-\r
-In the summer of 1990, immediately following the 1990 Decennial\r
-Census, the United States Census Bureau conducted a large scale\r
-survey to measure undercount in the 1990 Census.  This\r
-independent post Census operation (the 1990 Post-Enumeration\r
-Survey--PES) collected items of demographic data (race, sex, age\r
-and NAME) from 377,000 persons living in 165,000 housing units in\r
-5,300 predefined blocks (or block clusters). \r
-\r
-The information acquired from this independent (PES) operation\r
-was matched against actual 1990 Census records for persons living\r
-in those same 5,300 blocks plus additional surrounding ring\r
-blocks.  The PES blocks plus the surrounding ring--"the Search\r
-Area"--contained 7.2 million census records replete with name. \r
-It is this Search Area data set that provides the impetus for the\r
-three name files at this internet site.\r
-  \r
-FILE FORMATS\r
-\r
-In July 1995, the Census Bureau placed abridged summary\r
-information from the Search Area on its internet site.  Selected\r
-data from these files have appeared in the print media with the\r
-citation "source--Census Bureau".  Since the documentation\r
-accompanying the original release of this information was\r
-sketchy, we are supplying additional explanatory material about\r
-the limitations of these data.\r
-\r
-\r
-Each of the three files, (dist.all.last), (dist.male.first), and\r
-(dist.female.first), contains four items of data.  The four items\r
-are:    \r
-\r
-         (1).  A "Name"\r
-         (2).  Frequency in percent\r
-         (3).  Cumulative Frequency in percent \r
-         (4).  Rank\r
-\r
-In the file (dist.all.last) one entry appears as:\r
-\r
-        MOORE       0.312       5.312       9\r
-\r
-In our Search Area sample, MOORE ranks 9th in terms of frequency. \r
-5.312 percent of the sample population is covered by MOORE and\r
-the 8 names occurring more frequently than MOORE.  The surname,\r
-MOORE, is possessed by 0.312 percent of our population sample.\r
-\r
-\r
-EDITING\r
-\r
-Producing that summary line for the name MOORE required a great\r
-deal of program editing.  For example, we immediately realized\r
-that it was necessary to convert the entries MOORE JR, MOORE SR,\r
-and MOORE III in the last name field to MOORE.  For purposes of\r
-consistency we also converted entries such as MOORE JONES or\r
-MOORE-JONES to MOORE.\r
-\r
-In addition to those rather simplistic edits, we also examined\r
-each name entry for the possibility of an inversion (e.g., a first\r
-name appearing in the last name field and a last name placed in\r
-the first name area).  Consider a 2 person household with the\r
-entries MOORE   ROBERT, and MOORE   CAROLYN in the name fields. \r
-From our sample name universe, we can empirically determine that\r
-the inversion (ROBERT   MOORE and CAROLYN   MOORE) has a far\r
-greater probability of being "right" than the keyed entry.  When\r
-the odds in favor of an inversion reached 10,000 to 1, the\r
-inversion was done.  \r
-\r
-Many names can be inverted and sound absolutely right.  For\r
-example, there is absolutely no reason to suspect that HENRY  \r
-THOMAS is wrong and THOMAS    HENRY is preferable.  However, if\r
-HENRY  THOMAS had a spouse listed as HENRY   MARTHA  and a female\r
-child named HENRY   SUSAN, that additional information suffices\r
-to invert the name field for the entire family.\r
-\r
-For first names, we considered concatenating entries but finally\r
-decided against it.  Among males the combinations JOHN PAUL and\r
-JOSE LUIS in the first name field were far more frequent than any\r
-other set of spaced names.  We could possibly have formed them as\r
-JOHNPAUL and JOSELUIS.  Because we did not, the male first names\r
-JOHN and JOSE may be marginally overstated. \r
-\r
-The one name that is most affected by our decision not to\r
-concatenate is the grand old name of MARY.  The entries  (MARY\r
-ANN, MARY BETH, MARY CATHERINE, MARY ELLEN, MARY FRANCES, MARY\r
-GRACE  etc) wind up as MARY.  MARY may or may not be the most common\r
-first name among American women, but our decision to avoid\r
-concatenation did add a significant number of MARY's.\r
-                      \r
-Finally, we came to the conclusion that the existence of a single\r
-letter (an initial perhaps?) appearing in either the first or\r
-last name field would not qualify as a name; but an initial in\r
-one field would not disqualify the other name field.  For example\r
-the 19th century financier (J      P  MORGAN) has a valid last\r
-name but the letter J does not meet these standards for a first\r
-name.  MUHAMMED    X is an example of an acceptable first name\r
-with an unusable surname.\r
-\r
-\r
-MISSING DATA\r
-\r
-Although the search area contains 7.2 million persons, almost 15\r
-percent of those persons do not provide enough information to\r
-form a name.  In the previous paragraph we described the situation\r
-where we decided that a single letter would not\r
-constitute a name.  Other situations are listed below.\r
-\r
-\r
-     1. The respondents did not enter a "name" at the top of page\r
-2 of the 1990 Census form, even though names might have appeared\r
-on the roster of page 1.  A name must appear at the top of page 2\r
-for the name to be keyed. \r
-\r
-     2. The respondent may have inadvertently left sex (gender)\r
-off his or her Census record.  In that instance we accept the\r
-last name, but we have no "certain" way of placing the first name\r
-in the male or the female file.  We do not assume that JENNIFER\r
-without a sex designator would be female even though common sense\r
-suggests that this is indeed the case.\r
-\r
-     3. A family may have put down a last name for the\r
-householder but not for any other household member.   We may have\r
-the following family JOHN    SMITH,   MARY    (blank),  JOHN JR   \r
-(blank),  ROBERT     (blank),  JENNIFER      (blank),  SUSAN  \r
-(blank).  In that family we have a first and last name for\r
-householder John, but first names only for the remaining 5 family\r
-members.\r
-\r
-     4. The keyed name may not follow acceptable form.  Some\r
-examples of invalid entries in either the first or last name\r
-field are: BABY   GIRL,  MR    JONES,   DR    BROWN,  FILIPINO \r
-FEMALE.  \r
-\r
-Each of these situations is responsible for reducing our\r
-original sample of 7.2 million person records to its present\r
-size of 6.3 million.  The actual numbers of person records making\r
-up the unabridged files are: \r
-\r
-     File Name             Valid Records      Unique Names  \r
-                         \r
-  1. dist.all.last          6,290,251            88,799 \r
-  2. dist.female.first      3,184,399             4,275\r
-  3. dist.male.first        3,003,954             1,219 \r
-\r
-For purposes of both confidentiality and elimination of data\r
-noise we restricted the number of unique names available at this\r
-internet site to the minimum number of entries that contain 90\r
-percent of the population in that data file.  There is an\r
-extremely small chance that an individual with a truly "unique"\r
-name could have been captured in sample, and this is far more likely\r
-for surnames than for first names.   A second basis for limiting\r
-entries is that a smattering of entries exist because of\r
-bad handwriting coupled with poor typing. \r
-Consider the entry JOSEHP in dist.male.first.  Although JOSEHP\r
-may be a name, it is much more likely that all of the JOSEHP\r
-entries are really miskeys of JOSEPH.\r
-\r
-\r
-LIMITATIONS\r
-\r
-For the names at the top of the distribution, (SMITH, JOHNSON,\r
-WILLIAMS, JONES etc),  or (MARY, PATRICIA, LINDA, BARBARA etc) or\r
-(JAMES, JOHN, ROBERT, MICHAEL etc) the data speak for \r
-themselves.  However as the sample thins, one might draw\r
-conclusions about frequency that are not warranted.\r
-\r
-The PES sample intentionally oversampled both Blacks and\r
-Hispanics, and it is likely that the Search Area also contains an\r
-excess of these two groups.  Thus the frequently occurring\r
-surnames: GARCIA, RODRIGUEZ, GOMEZ and WASHINGTON as well as\r
-first names: JUAN, JOSE, GUADALUPE and WILLIE might attain higher\r
-rankings than their actual population numbers within the United\r
-States would warrant.  \r
-\r
-But the limitations due to sampling are much more noticeable when\r
-looking at rarely occurring names--especially surnames.  Consider\r
-a surname appearing 63 times (out of 6.3 million entries) in the\r
-file dist.all.last.  Here the frequency would appear as 0.001\r
-percent, but it is possible that that sample frequency may not be\r
-close to "truth".\r
-\r
-Ignoring clustering, (persons in the same household usually have\r
-the same surname) the coefficient of variation on a number of\r
-that magnitude would be approximately 12 percent.  But most\r
-people who do not live alone share a last name with other people\r
-in that household.  Thus the 63 persons with that rare name may\r
-be the result of 16 households, which would raise the coefficient\r
-of variation to approximately 25 percent.  \r
-\r
-But we are not done.  Even in the last years of the 20th century,\r
-families tend to live close to each other, and it is not\r
-impossible to conceive a situation where all Americans with a\r
-certain surname appear in sample.  Were that situation to occur\r
-it would be possible to overstate the frequency of that surname\r
-by a factor of 40.  The number 40 arises because the number\r
-of Census records in the sample (6,290,251) is approximately one\r
-fortieth of the United States population.\r
-\r
-The fact that a name doesn't appear in these three files does not\r
-mean that it is nonexistent, only that it is reasonably rare.\r
-\r
-In conclusion we do realize that misleading frequencies are much\r
-less likely in the files (dist.female.first) and\r
-(dist.male.first).  Although a father and one son may share a\r
-first name, brothers almost never share the same first name.\r
-\r
-\r
-ADDITIONAL INFORMATION\r
-\r
-Persons wanting or needing more information about the contents of\r
-these three files can contact David L. Word (dword@census.gov)\r
-301-457-2103 or Randy M. Klear (rklear@census.gov) 301-457-1727.\r
-\r
-\r
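
For readers of the removed file, the four-field record layout from FILE FORMATS and the coefficient-of-variation figures from LIMITATIONS can be checked with a minimal Python sketch. The function names parse_record and approx_cv are hypothetical, and the 1/sqrt(n) rule is an assumption (a simple Poisson-style approximation); the document does not state which formula the Census Bureau used, only that 63 persons give roughly 12 percent and 16 households give roughly 25 percent.

```python
import math


def parse_record(line):
    """Split one whitespace-delimited record into the four documented fields.

    Field order follows the removed file: name, frequency in percent,
    cumulative frequency in percent, rank.
    """
    name, freq, cum_freq, rank = line.split()
    return {
        "name": name,
        "freq_pct": float(freq),
        "cum_freq_pct": float(cum_freq),
        "rank": int(rank),
    }


def approx_cv(independent_units):
    """Approximate coefficient of variation of a sample count.

    Assumption: the count behaves like a Poisson variable, so the
    CV is 1 / sqrt(n), where n is the number of independent units
    (persons if clustering is ignored, households otherwise).
    """
    return 1.0 / math.sqrt(independent_units)


# The example record quoted in the documentation.
record = parse_record("MOORE       0.312       5.312       9")

# A surname held by 63 sampled persons, first ignoring clustering,
# then treating the 16 households as the independent units.
cv_persons = approx_cv(63)
cv_households = approx_cv(16)

print(record)
print(f"CV over 63 persons:    {cv_persons:.1%}")
print(f"CV over 16 households: {cv_households:.1%}")
```

Under this assumption the household-level figure comes out at exactly 25 percent and the person-level figure at a little over 12 percent, which is consistent with the approximate values quoted in the text.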