[[!meta title="PostgreSQL dbSNP Mirror"]]

Getting Files
-------------

Assuming you are interested in humans (like I am), the following will
download the databases and schemas for the current release of dbSNP.
For humans, this is currently about 60 GB and takes a while to retrieve
(roughly 24 hours).

    for a in {organism,shared}_{data,schema}; do
        lftp -c "open ftp.ncbi.nlm.nih.gov; cd /snp/organisms/human_9606/database/; mirror $a"
    done
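
A transfer that runs for a day can leave truncated files behind, so it may be worth testing the compressed dumps before loading them. The following is a minimal sketch, not part of the dbsnp repository; the function name `check_gz_dir` is hypothetical, and it assumes the `*.bcp.gz` files sit directly inside the directory you pass in.

```shell
# Hypothetical helper: report every *.bcp.gz file directly inside a
# directory that fails gzip's built-in integrity test.
# Returns 0 if all files pass and 1 if at least one is corrupt.
check_gz_dir() {
    gz_dir=$1
    gz_failed=0
    for gz_file in "$gz_dir"/*.bcp.gz; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$gz_file" ] || continue
        if ! gzip -t "$gz_file" 2>/dev/null; then
            echo "CORRUPT: $gz_file"
            gz_failed=1
        fi
    done
    return "$gz_failed"
}
```

Running something like `check_gz_dir organism_data && check_gz_dir shared_data` after the mirror finishes should flag any file that needs to be fetched again.
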

Preparing SQL Schemas
---------------------

If you're loading the human databases, I've already done the work for
you. Simply clone my
[git repository](http://git.donarmstrong.com/dbsnp.git)
(`git clone http://git.donarmstrong.com/dbsnp.git db_snp_utils`).

Otherwise, you'll want to use
[mssql_psql_conversion.pl](http://git.donarmstrong.com/?p=dbsnp.git;a=blob;f=utils/mssql_psql_conversion.pl;hb=HEAD)
in my dbsnp git repository to convert your organism's schema from
Microsoft SQL Server syntax into PostgreSQL syntax. Something like the
following will get you close (you may still need to tweak the SQL
manually):

    for a in {organism,shared}_schema/*.sql; do
        ./mssql_psql_conversion.pl "${a}" > "${a%.sql}_postgresql.sql"
    done
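
One thing to watch for when re-running the conversion: the `*.sql` glob also matches the `_postgresql.sql` files written by an earlier pass. The following is a small sketch of one way to skip them; `list_unconverted` is a hypothetical name and is not something the repository provides.

```shell
# Hypothetical helper: print the *.sql files in a schema directory that
# are conversion inputs, skipping *_postgresql.sql output files left
# behind by an earlier run.
list_unconverted() {
    for sql_file in "$1"/*.sql; do
        # If nothing matched, the glob stays literal; skip it.
        [ -e "$sql_file" ] || continue
        case "$sql_file" in
            *_postgresql.sql) continue ;;  # already-converted output
        esac
        echo "$sql_file"
    done
}

# Possible use (assumes file names without whitespace):
#   for a in $(list_unconverted organism_schema); do
#       ./mssql_psql_conversion.pl "$a" > "${a%.sql}_postgresql.sql"
#   done
```
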

Loading data
------------

Once the schemas are correct, you want to load the schema and the data,
then apply the constraints and indexes. This will take some time even
on a fairly fast machine.
[I would expect at least 2-3 days, unless you have exceptionally fast disks.]

I have included a script in the utils directory of the git repository
above called
[load_snp_data.sh](http://git.donarmstrong.com/?p=dbsnp.git;a=blob;f=utils/load_snp_data.sh;hb=HEAD),
which applies the schema, loads the data, and then applies the indexes
and constraints. It looks like the following:

    # Start from a fresh database.
    psql -c 'DROP DATABASE snp';
    psql -c 'CREATE DATABASE snp';

    DATA_DIR=/srv/ncbi/db_snp/
    SCHEMA_DIR=/srv/ncbi/db_snp_utils/schema
    UTIL_DIR=${SCHEMA_DIR}/../utils/

    # Create the shared and human tables (no indexes or constraints yet).
    (cd "${SCHEMA_DIR}/shared_schema";
        cat dbSNP_main_table_postgresql.sql | psql snp;
    )
    (cd "${SCHEMA_DIR}/human_9606_schema";
        cat *_table_postgresql.sql | psql snp;
        "${UTIL_DIR}/human_gty1_indexes_creation.pl" create trigger | psql snp;
    )
    # Load each dump into the table named after its file, escaping
    # carriage returns so that COPY accepts them.
    (cd "${DATA_DIR}/shared_data";
        for a in $(find . -type f -iname '*.bcp.gz' -printf '%f\n' | sort); do
            echo "$a";
            zcat "$a" | perl -pe 's/\r/\\r/g' | psql snp -c "COPY ${a%%.bcp.gz} FROM STDIN WITH NULL ''";
        done;
    )
    (cd "${DATA_DIR}/organism_data";
        for a in $(find . -type f -iname '*.bcp.gz' -printf '%f\n' | sort); do
            echo "$a";
            zcat "$a" | perl -pe 's/\r/\\r/g' | psql snp -c "COPY ${a%%.bcp.gz} FROM STDIN WITH NULL ''";
        done;
    )
    # Add indexes and constraints once all of the data is in place.
    (cd "${SCHEMA_DIR}/shared_schema";
        cat dbSNP_main_index_postgresql.sql dbSNP_main_constraint_postgresql.sql | psql snp;
    )
    (cd "${SCHEMA_DIR}/human_9606_schema";
        cat *_{index,constraint}_postgresql.sql | psql snp;
        "${UTIL_DIR}/human_gty1_indexes_creation.pl" index | psql snp;
    )

Querying the database
---------------------

Once the process above has finished, you can query the database. For
example, to select information about rs17849502 along with its
chromosome and position, you can do something like the following:

    SELECT scpr.snp_id AS snp_id,
           scpr.chr AS chr,
           scpr.pos AS pos,
           scpr.orien AS orien,
           scl.allele AS ref,
           uv.var_str AS var_str,
           ruv.var_str AS rev_var_str,
           ml.locus_id AS uid,
           ml.locus_symbol AS symbol,
           gitn.gene_name AS description
      FROM snp s
      JOIN b135_snpchrposonref_37_3 scpr ON s.snp_id=scpr.snp_id
      JOIN b135_snpcontigloc_37_3 scl ON scpr.snp_id=scl.snp_id
      JOIN b135_contiginfo_37_3 ci ON scl.ctg_id = ci.ctg_id
      LEFT OUTER JOIN b132_snpcontiglocusid_37_1 ml ON s.snp_id=ml.snp_id
      LEFT OUTER JOIN geneidtoname gitn ON ml.locus_id=gitn.gene_id
      JOIN univariation uv ON s.univar_id=uv.univar_id
      JOIN univariation ruv ON uv.rev_univar_id=ruv.univar_id
     WHERE ci.group_term LIKE 'GRCh%' AND s.snp_id='17849502';

[I personally use this very query in a program called snp_info, which I'll probably share later.]
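
To run a lookup like this for arbitrary rs numbers from the shell, the number can be passed to psql as a variable instead of being pasted into the SQL. The following is a rough sketch and is not the snp_info program mentioned above; the function names are hypothetical, the query is cut down to the position table alone, and it assumes the database is called `snp` as in the load script.

```shell
# Hypothetical helper: accept "rs17849502" or "17849502" and print the
# bare numeric snp_id; reject anything that is not purely digits.
rs_to_snp_id() {
    rs_digits=${1#rs}
    case "$rs_digits" in
        ''|*[!0-9]*) echo "not an rs number: $1" >&2; return 1 ;;
    esac
    echo "$rs_digits"
}

# Hypothetical helper: print chromosome, position and orientation for
# one rs number. The value is handed to psql with -v and interpolated as
# a quoted literal via :'snp_id'. The SQL is fed on stdin because psql
# does not expand variables in a -c command string.
snp_position() {
    lookup_id=$(rs_to_snp_id "$1") || return 1
    psql snp -v snp_id="$lookup_id" <<'SQL'
SELECT snp_id, chr, pos, orien
  FROM b135_snpchrposonref_37_3
 WHERE snp_id = :'snp_id';
SQL
}
```

For example, `snp_position rs17849502` would run the cut-down query for the SNP used in the example above.
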