1 [[!meta title="Shrinking lists of gene names in R"]]
3 I've been trying to finish a paper where I compare gene expression in
4 14 different placentas. One of the supplemental figures compares
5 median expression in gene trees across all 14 species, but because
6 tree names like ENT000001 aren't very expressive, and names like
7 "COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5,
8 COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2,
9 COL5A1, COL24A1" take up too much space, wanted a function which could
10 collapse the gene names into something which uses bash glob syntax to
11 more succinctly list the gene names, like:
12 COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.
14 Thus, a crazy function which uses `lcprefix` from `Biostrings` and
15 some looping was born:
18 collapse.gene.names <- function(x,min.collapse=2) {
19 ## longest common substring
20 if (is.null(x) || length(x)==0) {
21 return(as.character(NA))
24 str_collapse <- function(y,len) {
25 if (len == 1 || length(y) < 2) {
29 gsub(paste0("^(.{",len,"}).*$"),"\\1",y[1])
31 gsub(paste0("^.{",len,"}"),"",y)
33 sum(combn(y.rem,2,function(x){Biostrings::lcprefix(x[1],x[2])}) >= 2)
34 if (length(y.rem) > 3 &&
38 collapse.gene.names(y.rem,min.collapse=1)
41 "{",paste(collapse=",",
46 while (i <= length(x)) {
48 pmin(sapply(x,Biostrings::lcprefix,x[i]))
50 which(col.pmin > min.collapse)
51 if (length(collapseable) == 0) {
56 str_collapse(x[collapseable],
57 min(col.pmin[collapseable]))
59 i <- max(collapseable)+1
62 return(paste0(collapse=",",ret))
66 [[!tag genetics biology tech R]]