X-Git-Url: https://git.donarmstrong.com/?p=don.git;a=blobdiff_plain;f=posts%2Funnecessary_data_frame_slow.mdwn;fp=posts%2Funnecessary_data_frame_slow.mdwn;h=23ffb1291635a7c0d00c111a1eeb83bd38d23c49;hp=0000000000000000000000000000000000000000;hb=28d4a428fd425b0abe7d6b6685f58d68a58a48d3;hpb=ca652cfbd2e7c1f63cc8ac6a310b37cd0c8fafc0

diff --git a/posts/unnecessary_data_frame_slow.mdwn b/posts/unnecessary_data_frame_slow.mdwn
new file mode 100644
index 0000000..23ffb12
--- /dev/null
+++ b/posts/unnecessary_data_frame_slow.mdwn
@@ -0,0 +1,20 @@
+[[!meta title="Unnecessary Use of data.frame is Slow"]]
+
+I've been working for a while on a reasonably large Genome-Wide
+Association Study dataset which has led me through various
+interesting parts of handling large datasets in R. This dataset is
+approximately 320,000 rows by 5000 columns. After getting Rmpi
+working, and handling the dataset by row so I don't run out of memory,
+I've managed to get pretty decent performance. However, one small
+section of the code seemed to be taking forever to run.
+
+It turns out that assigning data to a data.frame by row is incredibly
+slow in R. Thus, a section of my code which should have taken
+microseconds was taking tenths of seconds, and threatening to run all
+week. Using a matrix instead (which is basically what I want anyway)
+and converting to a data.frame at the very end makes the code multiple
+orders of magnitude faster.
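+
+As a rough sketch of the difference (the sizes and function names
+here are made up for illustration, and are far smaller than the real
+dataset), the two approaches look something like this:
+
+```r
+# Hypothetical sizes, chosen only so the comparison finishes quickly.
+n_rows <- 2000L
+n_cols <- 10L
+
+# Slow pattern: assign into a preallocated data.frame one row at a time.
+fill_data_frame <- function(n_rows, n_cols) {
+  df <- as.data.frame(matrix(0, nrow = n_rows, ncol = n_cols))
+  for (i in seq_len(n_rows)) {
+    df[i, ] <- rep(i, n_cols)
+  }
+  df
+}
+
+# Faster pattern: fill a matrix by row, convert to a data.frame once.
+fill_matrix <- function(n_rows, n_cols) {
+  m <- matrix(0, nrow = n_rows, ncol = n_cols)
+  for (i in seq_len(n_rows)) {
+    m[i, ] <- rep(i, n_cols)
+  }
+  as.data.frame(m)
+}
+
+# Time both versions; system.time() also lets us keep the results.
+time_df  <- system.time(slow <- fill_data_frame(n_rows, n_cols))
+time_mat <- system.time(fast <- fill_matrix(n_rows, n_cols))
+
+print(time_df)
+print(time_mat)
+```
+
+Both functions should produce the same data.frame; only the
+container being filled inside the loop differs. Actual timings will
+depend on the machine and the R version.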
+
+Moral of the story? Don't use data.frame unnecessarily.
+
+[[!tag tech r]]