genetics/dbsnp_mirror.mdwn

   1 [[!meta title="Postgresql dbsnp Mirror"]]
   2
   3 Getting Files
   4 -------------
   5
   6 Assuming you are interested in humans (like I am) the following will
   7 download the databases and schemas for the current release of dbsnp.
   8 For humans, this is currently ≈60G, and takes a while to retrieve
   9 (about 24 hours or so).
  10
  11     rsync -rvP --include 'organism_*' --exclude '**' rsync://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/ .
  12     rsync -rvP rsync://ftp.ncbi.nlm.nih.gov/snp/database/shared_data .
  13     rsync -rvP rsync://ftp.ncbi.nlm.nih.gov/snp/database/shared_schema .
  14
  15 Preparing SQL Schemas
  16 ---------------------
  17
  18 If you're loading the human databases, I've already done the work for
  19 you. Simply use my
  20 [git repository](http://git.donarmstrong.com/dbsnp.git) (`git init
  21 db_snp_utils; git pull http://git.donarmstrong.com/dbsnp.git`).
  22
  23 Otherwise, you'll want to use
  24 [mssql_psql_conversion.pl](http://git.donarmstrong.com/?p=dbsnp.git;a=blob;f=utils/mssql_psql_conversion.pl;hb=HEAD)
  25 in my dbsnp git repository to convert your organism's schema into
  26 appropriate bits. Something like the following will get you close (you
  27 may still need to tweak the sql manually):
  28
  29     for a in {organism,data}_schema/*.sql; do
  30             ./mssql_psql_conversion.pl ${a} > ${a%%.sql}_postgrseql.sql;
  31         done;
  32
  33 Loading data
  34 ------------
  35
  36 Once the schema are correct, you want to load the schema and the data,
  37 then apply the constraints and indexes. This will take some time even
  38 on a fairly fast machine.
  39 [I would expect at least 2-3 days, unless you have exceptionally fast disks.]
  40
  41 I have included a script in the utils directory of the git repository
  42 above called
  43 [load_snp_data.sh](http://git.donarmstrong.com/?p=dbsnp.git;a=blob;f=utils/load_snp_data.sh;hb=HEAD),
  44 which applies the schema, loads the data, and then applies the indexes
  45 and constraints. It looks like the following:
  46
  47     psql -c 'DROP DATABASE snp';
  48     psql -c 'CREATE DATABASE snp';
  49
  50     DATA_DIR=/srv/ncbi/db_snp/
  51     SCHEMA_DIR=/srv/ncbi/db_snp_utils/schema
  52     UTIL_DIR=${SCHEMA_DIR}/../utils/
  53
  54     (cd ${SCHEMA_DIR}/shared_schema;
  55         cat dbSNP_main_table_postgresql.sql |psql snp;
  56     )
  57     (cd ${SCHEMA_DIR}/human_9606_schema;
  58         cat *_table_postgresql.sql|psql snp;
  59         ${UTIL_DIR}/human_gty1_indexes_creation.pl create trigger |psql snp;
  60     )
  61     (cd ${DATA_DIR}/shared_data;
  62         for a in $(find -type f -iname '*.bcp.gz' -printf '%f\n'|sort); do
  63         echo $a;
  64         zcat $a | perl -pe 's/\r/\\r/g' |psql snp -c "COPY ${a%%.bcp.gz} FROM STDIN WITH NULL ''";
  65         done;
  66     )
  67     (cd ${DATA_DIR}/organism_data;
  68         for a in $(find -type f -iname '*.bcp.gz' -printf '%f\n'|sort); do
  69         echo $a;
  70         zcat $a | perl -pe 's/\r/\\r/g' |psql snp -c "COPY ${a%%.bcp.gz} FROM STDIN WITH NULL ''";
  71         done;
  72     )
  73     (cd ${SCHEMA_DIR}/shared_schema;
  74         cat dbSNP_main_index_postgresql.sql dbSNP_main_constraint_postgresql.sql|psql snp;
  75     )
  76     (cd ${SCHEMA_DIR}/human_9606_schema;
  77         cat *_{index,constraint}_postgresql.sql|psql snp;
  78         ${UTIL_DIR}/human_gty1_indexes_creation.pl index |psql snp;
  79     )
  80
  81 Permissions on the database
  82 ---------------------------
  83
  84 Since I have my database on a server separate from the workstations
  85 (and other machines) that I often do work on, I need remote access to
  86 the database. To make this easy (and avoid having to hard code
  87 database details into the few dozen scripts I use), I created a
  88 postgresql service called *snp*.
  89
  90 An entry like this:
  91
  92     [snp]
  93      dbname=snp
  94      user=snpuser
  95      password=somepassword
  96      port=9212
  97      host=snpdb.donarmstrong.com
  98
  99 in the pg_service.conf file (in `/etc/postgresql-common` or
 100 `PGSYSCONFDIR`) will configure the service.
 101
 102 You then need to make sure that the database server is listening on
 103 the appropriate ip address (edit the database's `postgresql.conf`
 104 file), and that *snpuser* has select privileges and can connect.
 105 A line like the following in pg_hba.conf
 106
 107     host    snp             snpuser         192.168.0.0/24           md5
 108
 109 and the following sql will set that up for you.
 110
 111     CREATE USER snpuser WITH PASSWORD ('somepassword');
 112         GRANT SELECT ON ALL TABLES IN SCHEMA public TO snpuser;
 113
 114 Then, to test, you should be able to run:
 115
 116     psql "service=snp" -c 'SELECT * FROM snp LIMIT 5';
 117
 118 On another machine.
 119
 120 Querying the database
 121 ---------------------
 122
 123 Once the process above has finished, you can actually query the
 124 database. For example, to select information about rs17849502 with its
 125 chromosome and position, you do something like the following:
 126
 127     SELECT scpr.snp_id AS snp_id,
 128            scpr.chr AS chr,
 129            scpr.pos AS pos,
 130            scpr.orien AS orien,
 131            scl.allele AS ref,
 132            uv.var_str AS var_str,
 133            ruv.var_str AS rev_var_str,
 134            ml.locus_id AS uid,
 135            ml.locus_symbol AS symbol,
 136            gitn.gene_name AS description
 137            FROM snp s
 138              JOIN b135_snpchrposonref_37_3 scpr ON s.snp_id=scpr.snp_id
 139              JOIN b135_snpcontigloc_37_3 scl ON scpr.snp_id=scl.snp_id
 140              JOIN b135_contiginfo_37_3 ci ON scl.ctg_id = ci.ctg_id
 141              LEFT OUTER JOIN b132_snpcontiglocusid_37_1 ml ON s.snp_id=ml.snp_id
 142              LEFT OUTER JOIN geneidtoname gitn ON ml.locus_id=gitn.gene_id
 143              JOIN univariation uv ON s.univar_id=uv.univar_id
 144              JOIN univariation ruv ON uv.rev_univar_id=ruv.univar_id
 145     WHERE ci.group_term LIKE 'GRCh%' AND s.snp_id='17849502';
 146
 147 [I personally use this very query in a program called snp_info, which I'll probably share later.]