Our next "R and big data tip" is: summarizing big data.
We always say "if you are not looking at the data, you are not doing science", and for big data you are very dependent on summaries (as you can't actually look at everything).
Simple question: is there an easy way to summarize big data in R?
The answer is: yes, but we suggest you use the replyr package to do so.
Let’s set up a trivial example.
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.5.0'
library("sparklyr")
packageVersion("sparklyr")
## [1] '0.5.5'
library("replyr")
packageVersion("replyr")
## [1] '0.3.902'
sc <- sparklyr::spark_connect(version='2.0.2',
master = "local")
diris <- copy_to(sc, iris, 'diris')
The usual S3 summary() summarizes the handle, not the data.
summary(diris)
##     Length Class          Mode
## src 1      src_spark      list
## ops 3      op_base_remote list
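To see why the output above describes the handle rather than the rows, here is a minimal base-R sketch. It assumes only that a remote table handle is, at bottom, a classed list; the class names `fake_remote_tbl`, `fake_src`, and `fake_op` are made up for illustration and are not sparklyr or dplyr classes. With no `summary()` method registered for the class, S3 dispatch falls through to `summary.default()`, which tabulates the list's components.

```r
# A stand-in for a remote table handle: a list with two components,
# tagged with a hypothetical class that has no summary() method.
handle <- structure(
  list(
    src = structure(list(con = NULL), class = "fake_src"),
    ops = structure(list(x = "diris", vars = names(iris), extra = NULL),
                    class = "fake_op")
  ),
  class = "fake_remote_tbl"
)

# Dispatch falls back to summary.default(), which reports the
# Length / Class / Mode of each list component, not any data.
handle_summary <- summary(handle)
print(handle_summary)
```

The printed table has one row per list component (`src`, `ops`), the same shape as the `summary(diris)` output above. Nothing about the 150 rows of data is consulted.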
tibble::glimpse() throws an error.
packageVersion("tibble")
## [1] '1.3.3'
# errors-out
glimpse(diris)
## Observations: 150
## Variables: 5
## Error in if (width[i] <= max_width[i]) next: missing value where TRUE/FALSE needed
broom::glance() throws an error.
packageVersion("broom")
## [1] '0.4.2'
broom::glance(diris)
## Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl
replyr_summary() works, and returns results in a data.frame.
replyr_summary(diris) %>%
select(-nunique, -index, -nrows)
##         column     class nna min max     mean        sd lexmin    lexmax
## 1 Sepal_Length   numeric   0 4.3 7.9 5.843333 0.8280661   <NA>      <NA>
## 2  Sepal_Width   numeric   0 2.0 4.4 3.057333 0.4358663   <NA>      <NA>
## 3 Petal_Length   numeric   0 1.0 6.9 3.758000 1.7652982   <NA>      <NA>
## 4  Petal_Width   numeric   0 0.1 2.5 1.199333 0.7622377   <NA>      <NA>
## 5      Species character   0  NA  NA       NA        NA setosa virginica
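As a sanity check on the numbers above, a comparable table can be computed on the local in-memory iris with base R alone. This is only a sketch: `local_column_summary()` is a hypothetical helper written for this note, not part of replyr, and it does not reflect how replyr_summary() works against a remote source. Note that the local iris uses dotted column names (`Sepal.Length`) and stores `Species` as a factor, whereas the Spark copy shows underscores and a character column.

```r
# Hypothetical helper: per-column summary of a local data.frame,
# loosely mirroring the columns shown in the replyr_summary() output.
local_column_summary <- function(d) {
  rows <- lapply(names(d), function(col) {
    v <- d[[col]]
    is_num <- is.numeric(v)
    chr <- as.character(v)
    data.frame(
      column = col,
      class  = class(v)[1],
      nna    = sum(is.na(v)),
      # numeric summaries only make sense for numeric columns
      min    = if (is_num) min(v, na.rm = TRUE) else NA_real_,
      max    = if (is_num) max(v, na.rm = TRUE) else NA_real_,
      mean   = if (is_num) mean(v, na.rm = TRUE) else NA_real_,
      sd     = if (is_num) sd(v, na.rm = TRUE) else NA_real_,
      # lexical extremes for non-numeric columns
      lexmin = if (is_num) NA_character_ else min(chr, na.rm = TRUE),
      lexmax = if (is_num) NA_character_ else max(chr, na.rm = TRUE),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

local_summary <- local_column_summary(iris)
print(local_summary)
```

This works only because iris fits in memory. For data that lives in Spark or a database, the summaries have to be computed where the data is, which is what replyr_summary() is for.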
sparklyr::spark_disconnect(sc)
rm(list=ls())
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells  762515 40.8    1442291 77.1  1168576 62.5
## Vcells 1394407 10.7    2552219 19.5  1820135 13.9
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.