From 28d4a428fd425b0abe7d6b6685f58d68a58a48d3 Mon Sep 17 00:00:00 2001
From: Don Armstrong
Date: Mon, 16 Apr 2012 11:32:10 -0700
Subject: [PATCH] add unnecessary data frame

---
 posts/unnecessary_data_frame_slow.mdwn | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)
 create mode 100644 posts/unnecessary_data_frame_slow.mdwn

diff --git a/posts/unnecessary_data_frame_slow.mdwn b/posts/unnecessary_data_frame_slow.mdwn
new file mode 100644
index 0000000..23ffb12
--- /dev/null
+++ b/posts/unnecessary_data_frame_slow.mdwn
@@ -0,0 +1,20 @@
+[[!meta title="Unnecessary Use of data.frame is Slow"]]
+
+I've been working for a while on a reasonably large Genome-Wide
+Association Study dataset which has led me through various
+interesting parts of handling large datasets in R. This dataset is
+approximately 320,000 rows by 5,000 columns. After getting Rmpi
+working, and handling the dataset by row so I don't run out of memory,
+I've managed to get pretty decent performance. However, one small
+section of the code seemed to be taking forever to run.
+
+It turns out that assigning data to a data.frame by row is incredibly
+slow in R. Thus, a section of my code which should have taken
+microseconds was taking tenths of seconds, and threatening to run all
+week. Using a matrix instead (which is basically what I want anyway)
+and converting to a data.frame at the very end makes the code multiple
+orders of magnitude faster.
+
+Moral of the story? Don't use data.frame unnecessarily.
+
+[[!tag tech r]]
-- 
2.39.2