Assuming you are interested in humans (like I am) the following will
download the databases and schemas for the current release of dbsnp.
-For humans, this is currently ≈60G, and takes about a while to retrieve.
-
- for a in {organism,shared}_{data,schema}; do
- lftp -c "open ftp.ncbi.nlm.nih.gov; cd /snp/organisms/human_9606/database/; mirror $a";
- done;
+For humans, this is currently ≈60G, and takes a while to retrieve
+(about 24 hours or so).
+ rsync -rvP --include 'organism_**' --exclude '**' rsync://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/database/ .
+ rsync -rvP rsync://ftp.ncbi.nlm.nih.gov/snp/database/shared_data .
+ rsync -rvP rsync://ftp.ncbi.nlm.nih.gov/snp/database/shared_schema .
Preparing SQL Schemas
---------------------
${UTIL_DIR}/human_gty1_indexes_creation.pl index |psql snp;
)
+Permissions on the database
+---------------------------
+
+Since I have my database on a server separate from the workstations
+(and other machines) that I often do work on, I need remote access to
+the database. To make this easy (and avoid having to hard code
+database details into the few dozen scripts I use), I created a
+postgresql service called *snp*.
+
+An entry like this:
+
+ [snp]
+ dbname=snp
+ user=snpuser
+ password=somepassword
+ port=9212
+ host=snpdb.donarmstrong.com
+
+in the pg_service.conf file (in `/etc/postgresql-common` or
+`PGSYSCONFDIR`) will configure the service.
+
+You then need to make sure that the database server is listening on
+the appropriate ip address (edit the database's `postgresql.conf`
+file), and that *snpuser* has select privileges and can connect.
+A line like the following in pg_hba.conf
+
+ host snp snpuser 192.168.0.0/24 md5
+
+and the following sql will set that up for you.
+
+ CREATE USER snpuser WITH PASSWORD ('somepassword');
+ GRANT SELECT ON ALL TABLES IN SCHEMA public TO snpuser;
+
+Then, to test, you should be able to run:
+
+ psql "service=snp" -c 'SELECT * FROM snp LIMIT 5';
+
+On another machine.
+
Querying the database
---------------------
database. For example, to select information about rs17849502 with its
chromosome and position, you do something like the following:
- SELECT scpr.snp_id AS snp_id,
- scpr.chr AS chr,
- scpr.pos AS pos,
- scpr.orien AS orien,
- scl.allele AS ref,
- uv.var_str AS var_str,
- ruv.var_str AS rev_var_str,
- ml.locus_id AS uid,
- ml.locus_symbol AS symbol,
- gitn.gene_name AS description
- FROM snp s
- JOIN b135_snpchrposonref_37_3 scpr ON s.snp_id=scpr.snp_id
- JOIN b135_snpcontigloc_37_3 scl ON scpr.snp_id=scl.snp_id
- JOIN b135_contiginfo_37_3 ci ON scl.ctg_id = ci.ctg_id
- LEFT OUTER JOIN b132_snpcontiglocusid_37_1 ml ON s.snp_id=ml.snp_id
- LEFT OUTER JOIN geneidtoname gitn ON ml.locus_id=gitn.gene_id
- JOIN univariation uv ON s.univar_id=uv.univar_id
- JOIN univariation ruv ON uv.rev_univar_id=ruv.univar_id
- WHERE ci.group_term LIKE 'GRCh%' AND s.snp_id='17849502';
+ SELECT scpr.snp_id AS snp_id,
+ scpr.chr AS chr,
+ scpr.pos AS pos,
+ scpr.orien AS orien,
+ scl.allele AS ref,
+ uv.var_str AS var_str,
+ ruv.var_str AS rev_var_str,
+ ml.locus_id AS uid,
+ ml.locus_symbol AS symbol,
+ gitn.gene_name AS description
+ FROM snp s
+ JOIN b135_snpchrposonref_37_3 scpr ON s.snp_id=scpr.snp_id
+ JOIN b135_snpcontigloc_37_3 scl ON scpr.snp_id=scl.snp_id
+ JOIN b135_contiginfo_37_3 ci ON scl.ctg_id = ci.ctg_id
+ LEFT OUTER JOIN b132_snpcontiglocusid_37_1 ml ON s.snp_id=ml.snp_id
+ LEFT OUTER JOIN geneidtoname gitn ON ml.locus_id=gitn.gene_id
+ JOIN univariation uv ON s.univar_id=uv.univar_id
+ JOIN univariation ruv ON uv.rev_univar_id=ruv.univar_id
+ WHERE ci.group_term LIKE 'GRCh%' AND s.snp_id='17849502';
[I personally use this very query in a program called snp_info, which I'll probably share later.]