X-Git-Url: https://git.donarmstrong.com/?p=deb_pkgs%2Fscowl.git;a=blobdiff_plain;f=7.1%2Fr%2Fcensus%2Fnam_meth.txt;fp=7.1%2Fr%2Fcensus%2Fnam_meth.txt;h=d5ca8c65d38bef7d1008075a7db739583d34a1d9;hp=0000000000000000000000000000000000000000;hb=01534a94130c1f5a3a230cf4fe18365a235ba271;hpb=7b14ba883fb1046508c44be37b4c6ba5da5feacf

diff --git a/7.1/r/census/nam_meth.txt b/7.1/r/census/nam_meth.txt
new file mode 100644
index 0000000..d5ca8c6
--- /dev/null
+++ b/7.1/r/census/nam_meth.txt
@@ -0,0 +1,236 @@
+10/95
+
+DOCUMENTATION AND METHODOLOGY FOR FREQUENTLY OCCURRING NAMES IN THE U.S.--1990
+
+The Census Bureau has a primary obligation to protect the
+confidentiality of individual responses to the Census.  As part
+of this confidentiality commitment, the Census Bureau does not
+currently release individual census questionnaires (or any other
+information that could identify an individual) until 72 years
+after a Decennial Census was taken.  In 1992, the Census Bureau
+released the 1920 Census schedules to National Archives.  In
+fact, the Census Bureau is so concerned about confidentiality
+that name has not been entered into the basic internal electronic data
+used to tabulate census results.
+
+However, there have been numerous demands for sunmary data on the
+frequency of surnames for genealogical reasons.  Similar interest
+has arisen for the frequency of first names by sex.  This data
+set attempts to satisfy these demands while still providing
+utmost confidentiality of individual results.
+
+
+BACKGROUND
+
+In the summer of 1990, immediately following the 1990 Decennial
+Census, the United States Census Bureau conducted a large scale
+survey to measure undercount in the 1990 Census.  This
+independent post Census operation (the 1990 Post-Enumeration
+Survey--PES) collected items of demographic data (race, sex, age
+and NAME) from 377,000 persons living in 165,000 housing units in
+5,300 predefined blocks (or block clusters). 
+
+The information acquired from this independent (PES) operation
+was matched against actual 1990 Census records for persons living
+in those same 5300 blocks plus additional surrounding ring
+blocks.  The PES blocks plus the surrounding ring--"the Search
+Area"--contained 7.2 million census records replete with name. 
+It is this Search Area data set that provides the impetus for the
+three name files at this internet site.
+  
+FILE FORMATS
+
+In July 1995, the Census Bureau placed abridged summary
+information from the Search Area on its internet site.  Selected
+data from these files have appeared in the print media with the
+citation "source--Census Bureau".  Since the documentation
+accompanying the original release of this information was
+sketchy, we are supplying additional explanatory material about
+the limitations of these data.
+
+
+Each of the three files, (dist.all.last), (dist. male.first), and
+(dist female.first) contain four items of data.  The four items
+are:    
+
+         (1).  A "Name"
+         (2).  Frequency in percent
+         (3).  Cumulative Frequency in percent 
+         (4).  Rank
+
+In the file (dist.all.last) one entry appears as:
+
+        MOORE       0.312       5.312       9
+
+In our Search Area sample, MOORE ranks 9th in terms of frequency. 
+5.312 percent of the sample population is covered by MOORE and
+the 8 names occurring more frequently than MOORE.  The surname,
+MOORE, is possessed by 0.312 percent of our population sample.
+
+
+EDITING
+
+Producing that summary line for the name MOORE required a great
+deal of program editing.  For example, we immediately realized
+that it was necessary to convert the entries MOORE JR, MOORE SR,
+and MOORE III in the last name field to MOORE.  For purposes of
+consistency we also converted entries such as MOORE JONES or
+MOORE-JONES to MOORE.
+
+In addition to those rather simplistic edits, we also examined
+each name entry for the possibility of an inversion. (eg: a first
+name appearing in the last name field and a last name placed in
+the first name area).  Consider a 2 person household with the
+entries MOORE   ROBERT, and MOORE   CAROLYN in the name fields. 
+From our sample name universe, we can empirically determine the
+probability that the inversion (ROBERT   MOORE and CAROLYN  
+MOORE) as a far greater probability of being "right" than the
+keyed entry.  When the probability that the odds of an inversion
+attained odds of 10,000 to 1, the inversion was done.  
+
+Many names can be inverted and sound absolutely right.  For
+example, there is absolutely no reason to suspect that HENRY  
+THOMAS is wrong and THOMAS    HENRY is preferable.  However, if
+HENRY  THOMAS had a spouse listed as HENRY   MARTHA  and a female
+child named HENRY   SUSAN, that additional information suffices
+to invert the name field for the entire family.
+
+For first names, we considered concatenating entries but finally
+decided against it.  Among males the combinations JOHN PAUL and
+JOSE LUIS in the first name field were far more frequent than any
+other set of spaced names.  We could possibly have formed them as
+JOHNPAUL and JOSELUIS.  As a result the male first names JOHN and
+JOSE may be marginally overstated. 
+
+The one name that is most affected by our decision not to
+concatenate is the grand old name of MARY.  The entries  (MARY
+ANN, MARY BETH, MARY CATHERINE, MARY ELLEN, MARY FRANCES, MARY
+GRACE  etc) wind up as MARY.  MARY may or may be the most common
+first name among American women, but our decision to avoid
+concatenation did add a significant number of MARY's.
+                      
+Finally, we came to the conclusion that the existence of a single
+letter (an initial perhaps?) appearing in either the first or
+last name field would not qualify as a name; but an initial in
+one field would not disqualify the other name field.  For example
+the 19th century financier (J      P  MORGAN) has a valid last
+name but the letter J does not meet these standards for a first
+name.  MUHAMMED    X is an example of a acceptable first name
+with an unusable surname.
+
+
+MISSING DATA
+
+Although the search area contains 7.2 million persons, almost 15
+percent of those persons do not provide enough information to
+form a name.  In the previous paragraph we provided the situation
+where we decided that a single a single letter would not
+constitute a name.  Other situation are listed below.
+
+
+     1. The respondents did not enter a "name" at the top of page
+2 of the 1990 Census form, even though names might have appeared
+on the roster of page 1.  A name must appear at the top of page 2
+for the name to be keyed. 
+
+     2. The respondent may have inadvertently left sex (gender)
+off his or her Census record.  In that instance we accept the
+last name, but we have no "certain" way of placing the first name
+in the male or the female file.  We do not assume that JENNIFER
+without a sex designator would be female even though common sense
+suggests that this is indeed the case.
+
+     3. A family may have put down a last name for the
+householder but not for any other household member.   We may have
+the following family JOHN    SMITH,   MARY    (blank),  JOHN JR   
+(blank),  ROBERT     (blank),  JENNIFER      (blank),  SUSAN  
+(blank).  In that family we have a first and last name for
+householder John, but first names only for the remaining 5 family
+members.
+
+     4. The keyed name may not follow acceptable form.  Some
+examples of invalid entries in either the first or last name
+field are: BABY   GIRL,  MR    JONES,   DR    BROWN,  FILIPINO 
+FEMALE.  
+
+Each of these situations are responsible for limiting our
+original sample of 7.2 million person records down to its present
+size of 6.3 million.  The actual number of person records making
+up the unabridged files are: 
+
+     File Name             Valid Records      Unique Names  
+                         
+  1. dist.all.last          6.290,251            88,799 
+  2. dist.female.first      3,184,399             4,275
+  3. dist.male.first.       3,003,954             1.219 
+
+For purposes of both confidentiality and elimination of data
+noise we restricted the number of unique names available at this
+internet site to the minimum number of entries that contain 90
+percent of the population in that data file.  There is an
+extremely small chance that an individual with a truly "unique"
+name could have been captured in sample, and is far more likely
+for surnames than for first names.   A second basis for limiting
+entries is that a smattering of entries exist because of the
+combination of bad handwriting coupled with poor typing. 
+Consider the entry JOSEHP in dist.male.first.  Although JOSEHP
+may be a name, it is much more likely that all of the JOSEHP
+entries are really miskeys of JOSEPH.
+
+
+LIMITATIONS
+
+For the names at the top of the distribution, (SMITH, JOHNSON,
+WILLIAMS, JONES etc),  or (MARY, PATRICIA, LINDA, BARBARA etc) or
+(JAMES. JOHN, ROBERT, MICHAEL etc) the data speaks for 
+themselves.  However as the sample thins, one might draw
+conclusions about frequency that are not warranted.
+
+The PES sample intentionally over sampled both Blacks and
+Hispanics, and it is likely that the Search Area also contains an
+excess of these two groups.  Thus the frequently occurring
+surnames: GARCIA, RODRIGUEZ, GOMEZ and WASHINGTON as well as
+first names: JUAN, JOSE, GUADALUPE and WILLIE might attain higher
+rankings than their actual population numbers within the United
+States would warrant.  
+
+But the limitations due to sampling are much more noticeable when
+looking at rarely occurring names--especially surnames.  Consider
+a surname appearing 63 (out of 6.3 million entries) times in the
+file dist.all.last.  Here the frequency would appear as 0.001
+percent, but it is possible that that sample frequency may not be
+close to "truth".
+
+Ignoring clustering, (persons in the same household usually have
+the same surname) the coefficient of variation on a number of
+that magnitude would be approximately 12 percent.  But most
+people who do not live alone share a last name with other people
+in that household.  Thus the 63 persons with that rare name may
+be the result of 16 households, which would raise the coefficient
+of variation to approximately 25 percent.  
+
+But we are not done.  Even in the last years of the 20th century,
+families tend to live close to each other, and it is not
+impossible to conceive a situation where all Americans with a
+certain surname appear in sample.  Were that situation to occur
+it would be possible to overstate the frequency of that surname
+name by a factor of 40.  The number 40 arises because the number
+of Census records in the sample (6,290,251) is approximately one
+fortieth of the United States population.
+
+The fact that a name doesn't appear in these three files does not
+mean that it is non existent, only that it is reasonably rare.
+
+In conclusion we do realize that misleading frequencies are much
+less likely in  the files (dist.female first) and
+(dist.male.first).  Although fathers and one son may share a
+first name, brothers almost never share the same first name.
+
+
+ADDITIONAL INFORMATION
+
+Persons wanting or needing more information about the contents of
+these three files can contact David L. Word (dword@census.gov)
+301-457-2103 or Randy M. Klear (rklear@census.gov) 301-457-1727.
+
+