spark-reviews mailing list archives

From olarayej <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-11031][SPARKR] Method str() on a DataFr...
Date Sat, 21 Nov 2015 00:57:29 GMT
Github user olarayej commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9613#discussion_r45537232
  
    --- Diff: R/pkg/R/DataFrame.R ---
    @@ -2199,3 +2199,97 @@ setMethod("coltypes",
     
                 rTypes
               })
    +
    +#' Display the structure of a DataFrame, including column names, column types, as well as a
    +#' a small sample of rows.
    +#' @name str
    +#' @title Compactly display the structure of a dataset
    +#' @rdname str
    +#' @family DataFrame functions
    +#' @param object a DataFrame
    +#' @examples \dontrun{
    +#' # Create a DataFrame from the Iris dataset
    +#' irisDF <- createDataFrame(sqlContext, iris)
    +#' 
    +#' # Show the structure of the DataFrame
    +#' str(irisDF)
    +#' }
    +setMethod("str",
    +          signature(object = "DataFrame"),
    +          function(object) {
    +
    +            # TODO: These could be made global parameters, though in R it's not the case
    +            MAX_CHAR_PER_ROW <- 120
    +            MAX_COLS <- 100
    +
    +            # Get the column names and types of the DataFrame
    +            names <- names(object)
    +            types <- coltypes(object)
    +
    +            # Get the number of rows.
    +            # TODO: Ideally, this should be cached
    +            cachedCount <- nrow(object)
    +
    +            # Get the first elements of the dataset. Limit number of columns accordingly
    +            dataFrame <- if (ncol(object) > MAX_COLS) {
    +                           head(object[, c(1:MAX_COLS)])
    +                         } else {
    +                           head(object)
    +                         }
    +
    +            # The number of observations will be displayed only if the number
    +            # of rows of the dataset has already been cached.
    +            if (!is.null(cachedCount)) {
    --- End diff --
    
    Yes, that's why I added the TODO. In our implementation, we had a global cache that
    stored the row count of each dataset in the current session. At some point we'll need a
    caching mechanism here as well, so that running str() or nrow() doesn't trigger a full
    data scan every time. Once such a mechanism exists, all we'll need to do is replace this
    line:
    
    cachedCount <- nrow(object)
    
    with
    
    cachedCount <- FUNCTION_TO_GET_CACHED_NROW(object)
    
    str() is written so that if the row count hasn't been cached (that is, cachedCount is
    NULL), the number of rows is simply not shown.
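
For illustration only, here is a minimal sketch of what such a session-level cache might look like in plain R. Everything in it is an assumption: `.rowCountCache`, `cacheRowCount()`, `getCachedRowCount()` and `describeRows()` are hypothetical names that are not part of SparkR or of this pull request, and the cache is keyed by a caller-supplied string because the comment does not say how a DataFrame would be identified.

```r
# Hypothetical session-level cache of row counts (not part of SparkR).
.rowCountCache <- new.env(parent = emptyenv())

# Hypothetical helper: remember a row count under a caller-chosen key.
cacheRowCount <- function(key, count) {
  assign(key, count, envir = .rowCountCache)
  invisible(count)
}

# Hypothetical stand-in for FUNCTION_TO_GET_CACHED_NROW: returns the cached
# count, or NULL when nothing has been cached. It never scans the data.
getCachedRowCount <- function(key) {
  if (exists(key, envir = .rowCountCache, inherits = FALSE)) {
    get(key, envir = .rowCountCache, inherits = FALSE)
  } else {
    NULL
  }
}

# Mirrors the `if (!is.null(cachedCount))` check in the diff: the row count
# is reported only when it is already cached. The message text is made up.
describeRows <- function(key) {
  cachedCount <- getCachedRowCount(key)
  if (!is.null(cachedCount)) {
    paste("rows:", cachedCount)
  } else {
    "rows: not shown"
  }
}
```

With a cache along these lines, nrow() could store its result after the first full scan, and str() would only ever read from the cache, so it would stay cheap on large datasets.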

