1 [[!meta title="Shrinking lists of gene names in R"]]
3 I've been trying to finish a paper where I compare gene expression in
4 14 different placentas. One of the supplemental figures compares
5 median expression in gene trees across all 14 species, but because
7 [ENSGT00840000129673](http://www.ensembl.org/Multi/GeneTree/Image?gt=ENSGT00840000129673)
8 aren't very expressive, and names like
9 "COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5,
10 COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2,
11 COL5A1, COL24A1" take up too much space, I wanted a function which could
12 collapse the gene names into something which uses bash glob syntax to
13 more succinctly list the gene names, like:
14 COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.
16 Thus, a crazy function which uses `lcprefix` from `Biostrings` and
17 some looping was born:
20 collapse.gene.names <- function(x,min.collapse=2) {
21 ## longest common substring
22 if (is.null(x) || length(x)==0) {
23 return(as.character(NA))
26 str_collapse <- function(y,len) {
27 if (len == 1 || length(y) < 2) {
31 gsub(paste0("^(.{",len,"}).*$"),"\\1",y[1])
33 gsub(paste0("^.{",len,"}"),"",y)
35 sum(combn(y.rem,2,function(x){Biostrings::lcprefix(x[1],x[2])}) >= 2)
36 if (length(y.rem) > 3 &&
40 collapse.gene.names(y.rem,min.collapse=1)
43 "{",paste(collapse=",",
48 while (i <= length(x)) {
50 pmin(sapply(x,Biostrings::lcprefix,x[i]))
52 which(col.pmin > min.collapse)
53 if (length(collapseable) == 0) {
58 str_collapse(x[collapseable],
59 min(col.pmin[collapseable]))
61 i <- max(collapseable)+1
64 return(paste0(collapse=",",ret))
68 [[!tag genetics biology tech R]]