[[!meta title="Shrinking lists of gene names in R"]]

I've been trying to finish a paper where I compare gene expression in 14 different placentas. One of the supplemental figures compares median expression in gene trees across all 14 species, but tree ids like [ENSGT00840000129673](http://www.ensembl.org/Multi/GeneTree/Image?gt=ENSGT00840000129673) aren't very expressive, and names like "COL11A2, COL5A3, COL4A1, COL1A1, COL2A1, COL1A2, COL4A6, COL4A5, COL7A1, COL27A1, COL11A1, COL4A4, COL4A3, COL3A1, COL4A2, COL5A2, COL5A1, COL24A1" take up too much space. I wanted a function which could collapse the gene names into something more succinct using bash brace-expansion syntax, like: COL{11A{1,2},1A{1,2},24A1,27A1,2A1,3A1,4A{1,2,3,4,5,6},5A{1,2,3},7A1}.

Thus, a crazy function which uses `lcprefix` from `Biostrings` and some looping was born:

[[!format R """
collapse.gene.names <- function(x,min.collapse=2) {
    ## collapse names which share a long enough common prefix
    if (is.null(x) || length(x)==0) {
        return(as.character(NA))
    }
    x <- sort(unique(x))
    ## write y as prefix{rest1,rest2,...}, where the prefix is the
    ## first len characters of y[1]
    str_collapse <- function(y,len) {
        if (len == 1 || length(y) < 2) {
            return(y)
        }
        y.tree <-
            gsub(paste0("^(.{",len,"}).*$"),"\\1",y[1])
        y.rem <-
            gsub(paste0("^.{",len,"}"),"",y)
        ## number of pairs of remainders which share a prefix of at
        ## least two characters
        y.rem.prefix <-
            sum(combn(y.rem,2,function(x){Biostrings::lcprefix(x[1],x[2])}) >= 2)
        ## collapse the remainders recursively if enough of them
        ## share a prefix
        if (length(y.rem) > 3 && y.rem.prefix >= 2) {
            y.rem <-
                collapse.gene.names(y.rem,min.collapse=1)
        }
        paste0(y.tree,
               "{",paste(collapse=",",
                         y.rem),"}")
    }
    i <- 1
    ret <- NULL
    while (i <= length(x)) {
        ## length of the prefix which each name shares with x[i]
        col.pmin <-
            pmin(sapply(x,Biostrings::lcprefix,x[i]))
        collapseable <-
            which(col.pmin > min.collapse)
        if (length(collapseable) == 0) {
            ret <- c(ret,x[i])
            i <- i+1
        } else {
            ret <- c(ret,
                     str_collapse(x[collapseable],
                                  min(col.pmin[collapseable]))
                     )
            ## x is sorted, so skip past everything just collapsed
            i <- max(collapseable)+1
        }
    }
    return(paste0(collapse=",",ret))
}
"""]]

[[!tag genetics biology tech R]]
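
The only piece of `Biostrings` the function relies on is `lcprefix`, which gives the length of the longest common prefix of two strings. For anyone who would rather not install Bioconductor just for that, here is a rough base-R sketch of the same idea. The name `lcprefix_len` is my own invention for illustration, not part of any package, and I have only meant it for plain single-byte strings like gene symbols.

```r
## Hypothetical base-R stand-in for Biostrings::lcprefix: returns the
## number of leading characters that two strings have in common.
lcprefix_len <- function(a, b) {
    ca <- strsplit(a, "")[[1]]
    cb <- strsplit(b, "")[[1]]
    n <- min(length(ca), length(cb))
    if (n == 0) {
        return(0L)
    }
    ## positions (within the first n characters) where the strings differ
    mismatch <- which(ca[seq_len(n)] != cb[seq_len(n)])
    if (length(mismatch) == 0) {
        ## the shorter string is a prefix of the longer one
        return(n)
    }
    mismatch[1] - 1L
}

lcprefix_len("COL4A1", "COL4A2")   # 5: "COL4A" is shared
lcprefix_len("COL11A1", "COL1A1")  # 4: "COL1" is shared
```

Swapping this in for `Biostrings::lcprefix` inside `collapse.gene.names` should work in principle, though I have not benchmarked it and would expect it to be slower on long lists.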