Recently I noticed that the R package
sparklyr had the following odd behavior:
suppressPackageStartupMessages(library("dplyr"))
library("sparklyr")
packageVersion("dplyr")
#> [1] '0.7.2.9000'
packageVersion("sparklyr")
#> [1] '0.6.2'
packageVersion("dbplyr")
#> [1] '126.96.36.19900'
sc <- spark_connect(master = 'local')
#> * Using Spark: 2.1.0
d <- dplyr::copy_to(sc, data.frame(x = 1:2))
dim(d)
#> [1] NA
ncol(d)
#> [1] NA
nrow(d)
#> [1] NA
This means user code or user analyses that depend on one of
dim(), ncol(), or nrow() may break.
nrow() used to return something other than
NA, so older work may not be reproducible.
In fact, I first noticed this deep in debugging a client project (not in a trivial example such as the one above).
Tron: fights for the users.
In my opinion: this choice is going to be a great source of surprises, unexpected behavior, and bugs going forward for both
nrow()” and “
print.tbl_spark is too slow since
tibble as the default way of printing records”.
A little digging gets us to this:
The above might make sense if tibble and
dbplyr were the only users of
nrow(). Frankly, if I call
nrow() I expect to learn the number of rows in a table.
The suggestion is for all user code to adapt to use
sdf_nrow() (instead of
tibble adapting). Even if practical (there are already a lot of existing
sparklyr analyses), this prohibits the writing of generic
dplyr code that works the same over local data, databases, and
Spark (by generic code, we mean code that does not check the data source type and adapt). The situation is possibly even worse for non-sparklyr
dbplyr users (i.e., databases such as
PostgreSQL), as I don’t see any obvious, convenient “no please really calculate the number of rows for me” option (other than “
d %>% tally %>% pull”, but that turns out to not always work).
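To make the idea of generic row-counting code concrete, here is a minimal sketch. The helper name generic_nrow() is hypothetical (it is not part of dplyr, sparklyr, or replyr), and the fallback is only one plausible approach: it trusts nrow() when it returns a usable value and otherwise asks the data source to count rows with summarise(). As noted above, count-based workarounds of this kind may not work on every back end, so treat this as an illustration rather than a tested fix.

```r
library(dplyr)

# Hypothetical helper, not part of dplyr, sparklyr, or replyr.
# Uses nrow() when it gives a usable answer; otherwise asks the
# data source itself to count the rows.
generic_nrow <- function(tbl) {
  rows <- nrow(tbl)
  if (length(rows) == 1 && !is.na(rows)) {
    return(as.numeric(rows))
  }
  # Fallback for sources where nrow() is NA (assumed to be remote
  # tables): count on the source, then bring the single value back.
  tbl %>%
    ungroup() %>%
    summarise(row_count = n()) %>%
    pull(row_count) %>%
    as.numeric()
}

# Local data takes the first branch.
generic_nrow(data.frame(x = 1:2))
```

On a local data frame the first branch is taken, so the cost is the same as a plain nrow() call; the extra query is only issued when nrow() declines to answer.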
I admit, calling
nrow() against an arbitrary query can be expensive. However, I am usually calling
nrow() on physical tables (not on arbitrary
dplyr queries or pipelines). Physical tables often deliberately carry explicit metadata to make it possible for
nrow() to be a cheap operation.
Allowing the user to write reliable generic code that works against many
dplyr data sources is the purpose of our
replyr package. Being able to use the same code in many places increases the value of the code (without user-facing complexity) and allows one to rehearse procedures in-memory before trying databases or
Spark. Below are the functions
replyr supplies for examining the size of tables:
library("replyr")
packageVersion("replyr")
#> [1] '0.5.4'
replyr_hasrows(d)
#> [1] TRUE
replyr_dim(d)
#> [1] 2 1
replyr_ncol(d)
#> [1] 1
replyr_nrow(d)
#> [1] 2
spark_disconnect(sc)
Note: the above only works properly in the development version of
replyr, as I only found out about the issue and made the fix recently.
replyr_hasrows() was added as I found in many projects the primary use of
nrow() was to determine if there was any data in a table. The idea is: user code uses the
replyr functions, and the
replyr functions handle the complexities of working with different data sources. This also gives us a central place to collect patches and fixes as we run into future problems.
replyr accretes functionality as our group runs into different use cases (and we try to put use cases first, prior to other design considerations).
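As an illustration of why a "has any rows" check can be much cheaper than a full count, here is a minimal sketch. The function has_rows() is hypothetical and is not how replyr_hasrows() is implemented (the source does not show that); it simply asks the data source for at most one row and checks whether anything came back.

```r
library(dplyr)

# Hypothetical helper, not the replyr implementation.
# Fetches at most one row and tests whether any row was returned.
has_rows <- function(tbl) {
  first_row <- tbl %>%
    ungroup() %>%
    head(1) %>%   # on remote sources this is expected to limit the query
    collect()     # bring the (at most one) row into local memory
  nrow(first_row) > 0
}

has_rows(data.frame(x = 1:2))
#> [1] TRUE
has_rows(data.frame(x = integer(0)))
#> [1] FALSE
```

The design choice here is to call nrow() only on a small local result, so the check does not depend on what nrow() reports for the remote table.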
The point of
replyr is to provide reusable workarounds for design choices far away from our influence.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.