Class

org.apache.spark.sql

AQPDataFrame

Related Doc: package sql

case class AQPDataFrame(snappySession: SnappySession, qe: QueryExecution) extends DataFrame with Product with Serializable

Linear Supertypes
Product, Equals, Dataset[Row], Serializable, Serializable, AnyRef, Any

Instance Constructors

  1. new AQPDataFrame(snappySession: SnappySession, qe: QueryExecution)

Value Members

  1. final def !=(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  2. final def ##(): Int

    Definition Classes
    AnyRef → Any
  3. final def ==(arg0: Any): Boolean

    Definition Classes
    AnyRef → Any
  4. def agg(expr: Column, exprs: Column*): DataFrame

    Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg(max($"age"), avg($"salary"))
    ds.groupBy().agg(max($"age"), avg($"salary"))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  5. def agg(exprs: java.util.Map[String, String]): DataFrame

    (Java-specific) Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg(Map("age" -> "max", "salary" -> "avg"))
    ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Definition Classes
    Dataset
    Since

    2.0.0

  6. def agg(exprs: Map[String, String]): DataFrame

    (Scala-specific) Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg(Map("age" -> "max", "salary" -> "avg"))
    ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
    Definition Classes
    Dataset
    Since

    2.0.0

  7. def agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame

    (Scala-specific) Aggregates on the entire Dataset without groups.

    // ds.agg(...) is a shorthand for ds.groupBy().agg(...)
    ds.agg("age" -> "max", "salary" -> "avg")
    ds.groupBy().agg("age" -> "max", "salary" -> "avg")
    Definition Classes
    Dataset
    Since

    2.0.0

  8. def alias(alias: Symbol): Dataset[Row]

    (Scala-specific) Returns a new Dataset with an alias set. Same as as.

    Definition Classes
    Dataset
    Since

    2.0.0

  9. def alias(alias: String): Dataset[Row]

    Returns a new Dataset with an alias set. Same as as.

    Definition Classes
    Dataset
    Since

    2.0.0

  10. def apply(colName: String): Column

    Selects a column based on the column name and returns it as a Column.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    The column name can also refer to a nested column like a.b.

  11. def as(alias: Symbol): Dataset[Row]

    (Scala-specific) Returns a new Dataset with an alias set.

    Definition Classes
    Dataset
    Since

    2.0.0

  12. def as(alias: String): Dataset[Row]

    Returns a new Dataset with an alias set.

    Definition Classes
    Dataset
    Since

    1.6.0

  13. def as[U](implicit arg0: Encoder[U]): Dataset[U]

    :: Experimental :: Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depends on the type of U:

    • When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
    • When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
    • When U is a primitive type (e.g. String, Int), the first column of the DataFrame will be used.

    If the schema of the Dataset does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.

    Note that as[] only changes the view of the data that is passed into typed operations, such as map(), and does not eagerly project away any columns that are not present in the specified class.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0
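
The mapping rules for as[U] can be sketched as follows. This is a minimal illustration on a plain Spark DataFrame, assuming a local SparkSession and assuming that AQPDataFrame inherits this behavior unchanged from Dataset. Person, AsExample, and the column names are hypothetical and exist only for this example. The primitive case selects a single column first to keep the result unambiguous.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical record type, defined at the top level so Spark can derive an encoder for it.
case class Person(name: String, age: Int)

object AsExample {
  // Returns the same two rows viewed as a class, as a tuple, and as a primitive.
  def views(spark: SparkSession): (Seq[Person], Seq[(String, Int)], Seq[String]) = {
    import spark.implicits._
    val df: DataFrame = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

    // Class: columns are matched to fields by name.
    val byName = df.as[Person].collect().toSeq
    // Tuple: columns are matched by position (first column becomes _1).
    val byOrdinal = df.as[(String, Int)].collect().toSeq
    // Primitive: a single column is selected before the conversion.
    val single = df.select("name").as[String].collect().toSeq
    (byName, byOrdinal, single)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("as-example").getOrCreate()
    try println(views(spark)) finally spark.stop()
  }
}
```

Because as[U] only changes the typed view, the underlying columns are still available to untyped operations on the original DataFrame.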

  14. final def asInstanceOf[T0]: T0

    Definition Classes
    Any
  15. def cache(): AQPDataFrame.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Definition Classes
    Dataset
    Since

    1.6.0

  16. def checkpoint(eager: Boolean): Dataset[Row]

    Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.1.0

  17. def checkpoint(): Dataset[Row]

    Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with SparkContext#setCheckpointDir.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.1.0

  18. def clone(): AnyRef

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  19. def coalesce(numPartitions: Int): Dataset[Row]

    Returns a new Dataset that has exactly numPartitions partitions when fewer partitions are requested. Similar to coalesce defined on an RDD, this operation results in a narrow dependency: if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, the Dataset stays at its current number of partitions.

    However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).

    Definition Classes
    Dataset
    Since

    1.6.0
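
The difference between coalesce and repartition can be observed through partition counts. The sketch below assumes a local SparkSession and a plain Spark DataFrame standing in for an AQPDataFrame; CoalesceExample and the starting count of 8 partitions are illustrative choices, not part of this API.

```scala
import org.apache.spark.sql.SparkSession

object CoalesceExample {
  // Returns the partition counts after coalesce(2), coalesce(16), and repartition(16)
  // on a DataFrame that starts with 8 partitions.
  def partitionCounts(spark: SparkSession): (Int, Int, Int) = {
    // range(start, end, step, numPartitions): 8 initial partitions.
    val df = spark.range(0L, 1000L, 1L, 8).toDF("id")

    // Narrow dependency: existing partitions are merged without a shuffle.
    val narrowed = df.coalesce(2).rdd.getNumPartitions
    // Requesting more partitions than exist leaves the count unchanged.
    val unchanged = df.coalesce(16).rdd.getNumPartitions
    // repartition adds a shuffle, so the count can grow.
    val shuffled = df.repartition(16).rdd.getNumPartitions
    (narrowed, unchanged, shuffled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("coalesce-example").getOrCreate()
    try println(partitionCounts(spark)) finally spark.stop()
  }
}
```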

  20. def col(colName: String): Column

    Selects a column based on the column name and returns it as a Column.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    The column name can also refer to a nested column like a.b.

  21. def collect(): Array[Row]

    Returns an array that contains all rows in this Dataset.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

    For Java API, use collectAsList.

    Definition Classes
    Dataset
    Since

    1.6.0

  22. def collectAsList(): List[Row]

    Returns a Java list that contains all rows in this Dataset.

    Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.

    Definition Classes
    Dataset
    Since

    1.6.0

  23. def columns: Array[String]

    Returns all column names as an array.

    Definition Classes
    Dataset
    Since

    1.6.0

  24. def count(): Long

    Returns the number of rows in the Dataset.

    Definition Classes
    Dataset
    Since

    1.6.0

  25. def createGlobalTempView(viewName: String): Unit

    Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.

    A global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It is tied to a system-preserved database, global_temp, and we must use the qualified name to refer to a global temp view, e.g. SELECT * FROM global_temp.view1.

    Definition Classes
    Dataset
    Annotations
    @throws( ... )
    Since

    2.1.0

    Exceptions thrown

    AnalysisException if the view name already exists

  26. def createOrReplaceTempView(viewName: String): Unit

    Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

    Definition Classes
    Dataset
    Since

    2.0.0

  27. def createTempView(viewName: String): Unit

    Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.

    Local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It's not tied to any databases, i.e. we can't use db1.view1 to reference a local temporary view.

    Definition Classes
    Dataset
    Annotations
    @throws( ... )
    Since

    2.0.0

    Exceptions thrown

    AnalysisException if the view name already exists
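
The scoping of local and global temporary views can be sketched as follows. This example assumes a local SparkSession, a plain Spark DataFrame in place of an AQPDataFrame, and the stock Spark default name global_temp for the global temporary database (configurable through spark.sql.globalTempDatabase; a SnappySession may be set up differently). TempViewExample and the view names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object TempViewExample {
  // Returns (row count read through the local view,
  //          row count read through the global view from a second session).
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val df = Seq(("Ann", 30), ("Bob", 25)).toDF("name", "age")

    // Session-scoped view: referenced by its bare name, replaced if it already exists.
    df.createOrReplaceTempView("people_local")
    // Application-scoped view: throws AnalysisException if the name is already taken.
    df.createGlobalTempView("people_global")

    val local = spark.sql("SELECT * FROM people_local").count()
    // Assumption: the global temp database uses its default name, global_temp.
    val global = spark.newSession().sql("SELECT * FROM global_temp.people_global").count()

    // Clean up so the example can be run repeatedly in one application.
    spark.catalog.dropTempView("people_local")
    spark.catalog.dropGlobalTempView("people_global")
    (local, global)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("temp-view-example").getOrCreate()
    try println(counts(spark)) finally spark.stop()
  }
}
```

The second session can read the global view but would not see people_local, which belongs to the session that created it.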

  28. def crossJoin(right: Dataset[_]): DataFrame

    Explicit cartesian join with another DataFrame.

    right

    Right side of the join operation.

    Definition Classes
    Dataset
    Since

    2.1.0

    Note

    Cartesian joins are very expensive without an extra filter that can be pushed down.

  29. def cube(col1: String, cols: String*): RelationalGroupedDataset

    Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns cubed by department and group.
    ds.cube("department", "group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    ds.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  30. def cube(cols: Column*): RelationalGroupedDataset

    Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns cubed by department and group.
    ds.cube($"department", $"group").avg()
    
    // Compute the max age and average salary, cubed by department and gender.
    ds.cube($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  31. def describe(cols: String*): DataFrame

    Computes statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.

    This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the agg function instead.

    ds.describe("age", "height").show()
    
    // output:
    // summary age   height
    // count   10.0  10.0
    // mean    53.3  178.05
    // stddev  11.6  15.7
    // min     18.0  163.0
    // max     92.0  192.0
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    1.6.0

  32. def distinct(): Dataset[Row]

    Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for dropDuplicates.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

  33. def drop(col: Column): DataFrame

    Returns a new Dataset with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.

    Definition Classes
    Dataset
    Since

    2.0.0

  34. def drop(colNames: String*): DataFrame

    Returns a new Dataset with columns dropped. This is a no-op if the schema doesn't contain the given column name(s).

    This method can only be used to drop top-level columns. The colName string is treated literally without further interpretation.

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  35. def drop(colName: String): DataFrame

    Returns a new Dataset with a column dropped. This is a no-op if the schema doesn't contain the given column name.

    This method can only be used to drop top-level columns. The colName string is treated literally without further interpretation.

    Definition Classes
    Dataset
    Since

    2.0.0

  36. def dropDuplicates(col1: String, cols: String*): Dataset[Row]

    Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  37. def dropDuplicates(colNames: Array[String]): Dataset[Row]

    Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    Definition Classes
    Dataset
    Since

    2.0.0

  38. def dropDuplicates(colNames: Seq[String]): Dataset[Row]

    (Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.

    Definition Classes
    Dataset
    Since

    2.0.0

  39. def dropDuplicates(): Dataset[Row]

    Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for distinct.

    Definition Classes
    Dataset
    Since

    2.0.0
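
The dropDuplicates variants differ only in which columns take part in the comparison. A minimal sketch, assuming a local SparkSession and a plain Spark DataFrame in place of an AQPDataFrame; DropDuplicatesExample and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object DropDuplicatesExample {
  // Returns (rows left after whole-row de-duplication,
  //          rows left after de-duplication on the "name" column only).
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val df = Seq(
      ("Ann", "NY"),
      ("Ann", "NY"), // exact duplicate of the first row
      ("Ann", "LA"), // same name, different city
      ("Bob", "NY")
    ).toDF("name", "city")

    // All columns are compared: only the exact duplicate is removed.
    val wholeRow = df.dropDuplicates().count()
    // Only "name" is compared: one row per distinct name is kept.
    val byName = df.dropDuplicates(Seq("name")).count()
    (wholeRow, byName)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("dedup-example").getOrCreate()
    try println(counts(spark)) finally spark.stop()
  }
}
```

When only a subset of columns is compared, which of the matching rows survives is not specified by this documentation, so the example checks counts rather than row contents.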

  40. def dtypes: Array[(String, String)]

    Returns all column names and their data types as an array.

    Definition Classes
    Dataset
    Since

    1.6.0

  41. final def eq(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  42. def except(other: Dataset[Row]): Dataset[Row]

    Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to EXCEPT in SQL.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

  43. def explain(): Unit

    Prints the physical plan to the console for debugging purposes.

    Definition Classes
    Dataset
    Since

    1.6.0

  44. def explain(extended: Boolean): Unit

    Prints the plans (logical and physical) to the console for debugging purposes.

    Definition Classes
    Dataset
    Since

    1.6.0

  45. def filter(func: FilterFunction[Row]): Dataset[Row]

    :: Experimental :: (Java-specific) Returns a new Dataset that only contains elements where func returns true.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  46. def filter(func: (Row) ⇒ Boolean): Dataset[Row]

    :: Experimental :: (Scala-specific) Returns a new Dataset that only contains elements where func returns true.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  47. def filter(conditionExpr: String): Dataset[Row]

    Filters rows using the given SQL expression.

    peopleDs.filter("age > 15")
    Definition Classes
    Dataset
    Since

    1.6.0

  48. def filter(condition: Column): Dataset[Row]

    Filters rows using the given condition.

    // The following are equivalent:
    peopleDs.filter($"age" > 15)
    peopleDs.where($"age" > 15)
    Definition Classes
    Dataset
    Since

    1.6.0

  49. def finalize(): Unit

    Attributes
    protected[java.lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  50. def first(): Row

    Returns the first row. Alias for head().

    Definition Classes
    Dataset
    Since

    1.6.0

  51. def flatMap[U](f: FlatMapFunction[Row, U], encoder: Encoder[U]): Dataset[U]

    :: Experimental :: (Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  52. def flatMap[U](func: (Row) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]

    :: Experimental :: (Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  53. def foreach(func: ForeachFunction[Row]): Unit

    (Java-specific) Runs func on each element of this Dataset.

    Definition Classes
    Dataset
    Since

    1.6.0

  54. def foreach(f: (Row) ⇒ Unit): Unit

    Applies a function f to all rows.

    Definition Classes
    Dataset
    Since

    1.6.0

  55. def foreachPartition(func: ForeachPartitionFunction[Row]): Unit

    (Java-specific) Runs func on each partition of this Dataset.

    Definition Classes
    Dataset
    Since

    1.6.0

  56. def foreachPartition(f: (Iterator[Row]) ⇒ Unit): Unit

    Applies a function f to each partition of this Dataset.

    Definition Classes
    Dataset
    Since

    1.6.0

  57. final def getClass(): Class[_]

    Definition Classes
    AnyRef → Any
  58. def groupBy(col1: String, cols: String*): RelationalGroupedDataset

    Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns grouped by department.
    ds.groupBy("department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    ds.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  59. def groupBy(cols: Column*): RelationalGroupedDataset

    Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns grouped by department.
    ds.groupBy($"department").avg()
    
    // Compute the max age and average salary, grouped by department and gender.
    ds.groupBy($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  60. def groupByKey[K](func: MapFunction[Row, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, Row]

    :: Experimental :: (Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.0.0

  61. def groupByKey[K](func: (Row) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, Row]

    :: Experimental :: (Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key func.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.0.0

  62. def head(): Row

    Returns the first row.

    Definition Classes
    Dataset
    Since

    1.6.0

  63. def head(n: Int): Array[Row]

    Returns the first n rows.

    Definition Classes
    Dataset
    Since

    1.6.0

    Note

    This method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.

  64. def inputFiles: Array[String]

    Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.

    Definition Classes
    Dataset
    Since

    2.0.0

  65. def intersect(other: Dataset[Row]): Dataset[Row]

    Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to INTERSECT in SQL.

    Definition Classes
    Dataset
    Since

    1.6.0

    Note

    Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.

  66. final def isInstanceOf[T0]: Boolean

    Definition Classes
    Any
  67. def isLocal: Boolean

    Returns true if the collect and take methods can be run locally (without any Spark executors).

    Definition Classes
    Dataset
    Since

    1.6.0

  68. def isStreaming: Boolean

    Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a StreamingQuery using the start() method in DataStreamWriter. Methods that return a single answer, e.g. count() or collect(), will throw an AnalysisException when there is a streaming source present.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.0.0

  69. def javaRDD: JavaRDD[Row]

    Returns the content of the Dataset as a JavaRDD of Ts.

    Definition Classes
    Dataset
    Since

    1.6.0

  70. def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

    Join with another DataFrame, using the given join expression. The following performs a full outer join between df1 and df2.

    // Scala:
    import org.apache.spark.sql.functions._
    df1.join(df2, $"df1Key" === $"df2Key", "outer")
    
    // Java:
    import static org.apache.spark.sql.functions.*;
    df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
    right

    Right side of the join.

    joinExprs

    Join expression.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

    Definition Classes
    Dataset
    Since

    2.0.0

  71. def join(right: Dataset[_], joinExprs: Column): DataFrame

    Inner join with another DataFrame, using the given join expression.

    // The following two are equivalent:
    df1.join(df2, $"df1Key" === $"df2Key")
    df1.join(df2).where($"df1Key" === $"df2Key")
    Definition Classes
    Dataset
    Since

    2.0.0

  72. def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

    Equi-join with another DataFrame using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use the crossJoin method.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. These columns must exist on both sides.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

  73. def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame

    Inner equi-join with another DataFrame using the given columns.

    Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the columns "user_id" and "user_name"
    df1.join(df2, Seq("user_id", "user_name"))
    right

    Right side of the join operation.

    usingColumns

    Names of the columns to join on. These columns must exist on both sides.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

  74. def join(right: Dataset[_], usingColumn: String): DataFrame

    Inner equi-join with another DataFrame using the given column.

    Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

    // Joining df1 and df2 using the column "user_id"
    df1.join(df2, "user_id")
    right

    Right side of the join operation.

    usingColumn

    Name of the column to join on. This column must exist on both sides.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    If you perform a self-join using this function without aliasing the input DataFrames, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.

  75. def join(right: Dataset[_]): DataFrame

    Join with another DataFrame.

    Behaves as an INNER JOIN and requires a subsequent join predicate.

    right

    Right side of the join operation.

    Definition Classes
    Dataset
    Since

    2.0.0

  76. def joinWith[U](other: Dataset[U], condition: Column): Dataset[(Row, U)]

    :: Experimental :: Joins this Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.

    other

    Right side of the join.

    condition

    Join expression.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  77. def joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(Row, U)]

    :: Experimental :: Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.

    This is similar to the relation join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.

    This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.

    other

    Right side of the join.

    condition

    Join expression.

    joinType

    Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0
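
    The nested result schema can be seen in a small sketch. It assumes a local SparkSession and plain Spark Datasets; the `Dept` case class and the sample rows are hypothetical. An inner join type is used here to keep the illustration simple.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object JoinWithSketch {
  // Hypothetical record type for the right side of the join.
  case class Dept(id: Int, name: String)

  def paired(spark: SparkSession): Dataset[(Row, Dept)] = {
    import spark.implicits._
    val people = Seq(("ann", 1), ("bob", 2), ("cy", 9)).toDF("name", "deptId")
    val depts  = Seq(Dept(1, "eng"), Dept(2, "ops")).toDS()
    // Both sides have a "name" column; joinWith keeps them apart under _1 and _2.
    people.joinWith(depts, people("deptId") === depts("id"), "inner")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("joinwith-sketch").getOrCreate()
    val pairs = paired(spark)
    pairs.printSchema() // top-level columns _1 (the Row side) and _2 (the Dept side)
    pairs.show()
    spark.stop()
  }
}
```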

  78. def limit(n: Int): Dataset[Row]

    Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.

    Definition Classes
    Dataset
    Since

    2.0.0
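
    A short sketch of the difference between limit and head, assuming a local SparkSession and a hypothetical 100-row table:

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object LimitVsHeadSketch {
  // limit is a transformation: it returns a new, still-lazy Dataset.
  def firstRowsLazy(spark: SparkSession, n: Int): Dataset[Row] =
    spark.range(100).toDF("id").limit(n)

  // head is an action: it runs the query and returns the rows to the driver.
  def firstRowsEager(spark: SparkSession, n: Int): Array[Row] =
    spark.range(100).toDF("id").head(n)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("limit-head-sketch").getOrCreate()
    val lazyRows = firstRowsLazy(spark, 5)   // no job has run yet
    println(lazyRows.count())                // the count action triggers execution
    println(firstRowsEager(spark, 5).length) // rows are already materialized
    spark.stop()
  }
}
```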

  79. def map[U](func: MapFunction[Row, U], encoder: Encoder[U]): Dataset[U]

    :: Experimental :: (Java-specific) Returns a new Dataset that contains the result of applying func to each element.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  80. def map[U](func: (Row) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]

    :: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each element.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  81. def mapPartitions[U](f: MapPartitionsFunction[Row, U], encoder: Encoder[U]): Dataset[U]

    :: Experimental :: (Java-specific) Returns a new Dataset that contains the result of applying f to each partition.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  82. def mapPartitions[U](func: (Iterator[Row]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]

    :: Experimental :: (Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  83. def na: DataFrameNaFunctions

    Returns a DataFrameNaFunctions for working with missing data.

    // Dropping rows containing any null values.
    ds.na.drop()
    Definition Classes
    Dataset
    Since

    1.6.0

  84. final def ne(arg0: AnyRef): Boolean

    Definition Classes
    AnyRef
  85. final def notify(): Unit

    Definition Classes
    AnyRef
  86. final def notifyAll(): Unit

    Definition Classes
    AnyRef
  87. def orderBy(sortExprs: Column*): Dataset[Row]

    Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  88. def orderBy(sortCol: String, sortCols: String*): Dataset[Row]

    Returns a new Dataset sorted by the given expressions. This is an alias of the sort function.

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  89. def persist(newLevel: StorageLevel): AQPDataFrame.this.type

    Persist this Dataset with the given storage level.

    newLevel

    One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.

    Definition Classes
    Dataset
    Since

    1.6.0

  90. def persist(): AQPDataFrame.this.type

    Persist this Dataset with the default storage level (MEMORY_AND_DISK).

    Definition Classes
    Dataset
    Since

    1.6.0

  91. def printSchema(): Unit

    Prints the schema to the console in a nice tree format.

    Definition Classes
    Dataset
    Since

    1.6.0

  92. val qe: QueryExecution

  93. val queryExecution: QueryExecution

    Definition Classes
    Dataset
  94. def randomSplit(weights: Array[Double]): Array[Dataset[Row]]

    Randomly splits this Dataset with the provided weights.

    weights

    Weights for the splits; they are normalized if they don't sum to 1.

    Definition Classes
    Dataset
    Since

    2.0.0

  95. def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[Row]]

    Randomly splits this Dataset with the provided weights.

    weights

    Weights for the splits; they are normalized if they don't sum to 1.

    seed

    Seed for sampling. For Java API, use randomSplitAsList.

    Definition Classes
    Dataset
    Since

    2.0.0
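
    The weight normalization can be sketched as below. The `normalize` helper is only an illustration of the arithmetic, not Spark's internal code, and the 1000-row table, weights, and seed are hypothetical. A local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Dataset, Row, SparkSession}

object RandomSplitSketch {
  // Illustrative normalization: Array(3.0, 1.0) becomes Array(0.75, 0.25).
  def normalize(weights: Array[Double]): Array[Double] = {
    val total = weights.sum
    weights.map(_ / total)
  }

  // Splits a hypothetical 1000-row table roughly 75% / 25% with a fixed seed.
  def split(spark: SparkSession): Array[Dataset[Row]] =
    spark.range(1000).toDF("id").randomSplit(Array(3.0, 1.0), 42L)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("random-split-sketch").getOrCreate()
    println(normalize(Array(3.0, 1.0)).mkString(", "))
    // The split sizes are approximate, but together they cover every input row once.
    split(spark).foreach(part => println(part.count()))
    spark.stop()
  }
}
```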

  96. def randomSplitAsList(weights: Array[Double], seed: Long): List[Dataset[Row]]

    Returns a Java list containing the Datasets produced by randomly splitting this Dataset with the provided weights.

    weights

    Weights for the splits; they are normalized if they don't sum to 1.

    seed

    Seed for sampling.

    Definition Classes
    Dataset
    Since

    2.0.0

  97. lazy val rdd: RDD[Row]

    Represents the content of the Dataset as an RDD of T (Row for this class).

    Definition Classes
    Dataset
    Since

    1.6.0

  98. def reduce(func: ReduceFunction[Row]): Row

    :: Experimental :: (Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  99. def reduce(func: (Row, Row) ⇒ Row): Row

    :: Experimental :: (Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  100. def repartition(partitionExprs: Column*): Dataset[Row]

    Returns a new Dataset partitioned by the given partitioning expressions, using spark.sql.shuffle.partitions as number of partitions. The resulting Dataset is hash partitioned.

    This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  101. def repartition(numPartitions: Int, partitionExprs: Column*): Dataset[Row]

    Returns a new Dataset partitioned by the given partitioning expressions into numPartitions. The resulting Dataset is hash partitioned.

    This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  102. def repartition(numPartitions: Int): Dataset[Row]

    Returns a new Dataset that has exactly numPartitions partitions.

    Definition Classes
    Dataset
    Since

    1.6.0
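
    The repartition variants above can be compared in a small sketch, assuming a local SparkSession; the 100-row table and the partition counts are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RepartitionSketch {
  // Returns the partition counts after the two explicit-count variants.
  def partitionCounts(spark: SparkSession): (Int, Int) = {
    val df = spark.range(100).toDF("id")
    val byCount = df.repartition(4)            // exactly 4 partitions
    val byExpr  = df.repartition(3, col("id")) // hash partitioned by id into 3 partitions
    (byCount.rdd.getNumPartitions, byExpr.rdd.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("repartition-sketch").getOrCreate()
    println(partitionCounts(spark))
    spark.stop()
  }
}
```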

  103. def rollup(col1: String, cols: String*): RelationalGroupedDataset

    Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).

    // Compute the average for all numeric columns, rolled up by department and group.
    ds.rollup("department", "group").avg()
    
    // Compute the max age and average salary, rolled up by department and gender.
    ds.rollup("department", "gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  104. def rollup(cols: Column*): RelationalGroupedDataset

    Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.

    // Compute the average for all numeric columns, rolled up by department and group.
    ds.rollup($"department", $"group").avg()
    
    // Compute the max age and average salary, rolled up by department and gender.
    ds.rollup($"department", $"gender").agg(Map(
      "salary" -> "avg",
      "age" -> "max"
    ))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0
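
    To make the rollup levels concrete, here is a minimal sketch on a hypothetical three-row staff table, assuming a local SparkSession. The result contains one row per (department, gender) pair, one subtotal row per department, and one grand-total row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object RollupSketch {
  def rolledUp(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical sample data.
    val staff = Seq(
      ("eng", "f", 100.0, 30),
      ("eng", "m", 80.0, 40),
      ("ops", "f", 60.0, 50)
    ).toDF("department", "gender", "salary", "age")
    // Subtotal and grand-total rows carry null in the rolled-up columns.
    staff.rollup("department", "gender").agg(Map("salary" -> "avg", "age" -> "max"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("rollup-sketch").getOrCreate()
    rolledUp(spark).show() // 3 detail rows + 2 department subtotals + 1 grand total
    spark.stop()
  }
}
```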

  105. def sample(withReplacement: Boolean, fraction: Double): Dataset[Row]

    Returns a new Dataset by sampling a fraction of rows, using a random seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    Definition Classes
    Dataset
    Since

    1.6.0

    Note

    This is NOT guaranteed to provide exactly the fraction of the total count of the given Dataset.

  106. def sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[Row]

    Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.

    withReplacement

    Sample with replacement or not.

    fraction

    Fraction of rows to generate.

    seed

    Seed for sampling.

    Definition Classes
    Dataset
    Since

    1.6.0

    Note

    This is NOT guaranteed to provide exactly the fraction of the count of the given Dataset.
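
    The note above can be checked with a small sketch, assuming a local SparkSession; the 1000-row table, fraction, and seed are hypothetical. The sampled row count is close to, but not necessarily exactly, fraction times the input size, while a fixed seed makes the result repeatable on the same input.

```scala
import org.apache.spark.sql.SparkSession

object SampleSketch {
  // Counts the rows kept when sampling about 10% of a 1000-row table.
  def sampledCount(spark: SparkSession, seed: Long): Long =
    spark.range(1000).toDF("id").sample(false, 0.1, seed).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("sample-sketch").getOrCreate()
    println(sampledCount(spark, 42L)) // roughly 100, not guaranteed to be exactly 100
    println(sampledCount(spark, 42L)) // same seed and same input: same count
    spark.stop()
  }
}
```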

  107. def schema: StructType

    Returns the schema of this Dataset.

    Definition Classes
    Dataset
    Since

    1.6.0

  108. def select[U1, U2, U3, U4, U5](c1: TypedColumn[Row, U1], c2: TypedColumn[Row, U2], c3: TypedColumn[Row, U3], c4: TypedColumn[Row, U4], c5: TypedColumn[Row, U5]): Dataset[(U1, U2, U3, U4, U5)]

    :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  109. def select[U1, U2, U3, U4](c1: TypedColumn[Row, U1], c2: TypedColumn[Row, U2], c3: TypedColumn[Row, U3], c4: TypedColumn[Row, U4]): Dataset[(U1, U2, U3, U4)]

    :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  110. def select[U1, U2, U3](c1: TypedColumn[Row, U1], c2: TypedColumn[Row, U2], c3: TypedColumn[Row, U3]): Dataset[(U1, U2, U3)]

    :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  111. def select[U1, U2](c1: TypedColumn[Row, U1], c2: TypedColumn[Row, U2]): Dataset[(U1, U2)]

    :: Experimental :: Returns a new Dataset by computing the given Column expressions for each element.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  112. def select[U1](c1: TypedColumn[Row, U1]): Dataset[U1]

    :: Experimental :: Returns a new Dataset by computing the given Column expression for each element.

    val ds = Seq(1, 2, 3).toDS()
    val newDS = ds.select(expr("value + 1").as[Int])
    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    1.6.0

  113. def select(col: String, cols: String*): DataFrame

    Selects a set of columns. This is a variant of select that can only select existing columns using column names (i.e. cannot construct expressions).

    // The following two are equivalent:
    ds.select("colA", "colB")
    ds.select($"colA", $"colB")
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  114. def select(cols: Column*): DataFrame

    Selects a set of column based expressions.

    ds.select($"colA", $"colB" + 1)
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  115. def selectExpr(exprs: String*): DataFrame

    Selects a set of SQL expressions. This is a variant of select that accepts SQL expressions.

    // The following are equivalent:
    ds.selectExpr("colA", "colB as newName", "abs(colC)")
    ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0
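
    The equivalence of the two forms can be checked on a one-row table. This sketch assumes a local SparkSession; the sample values are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

object SelectExprSketch {
  // One hypothetical row with the column names used in the example above.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq((1, 2, -3)).toDF("colA", "colB", "colC")
  }

  def viaSelectExpr(df: DataFrame): DataFrame =
    df.selectExpr("colA", "colB as newName", "abs(colC)")

  def viaSelect(df: DataFrame): DataFrame =
    df.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("selectexpr-sketch").getOrCreate()
    val df = sample(spark)
    viaSelectExpr(df).show()
    viaSelect(df).show() // same columns and values as the selectExpr form
    spark.stop()
  }
}
```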

  116. def selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]

    Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.

    Attributes
    protected
    Definition Classes
    Dataset
  117. def show(numRows: Int, truncate: Int): Unit

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    If set to more than 0, strings are truncated to truncate characters and all cells are right-aligned.

    Definition Classes
    Dataset
    Since

    1.6.0

  118. def show(numRows: Int, truncate: Boolean): Unit

    Displays the Dataset in a tabular form. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters are truncated and all cells are right-aligned.

    Definition Classes
    Dataset
    Since

    1.6.0

  119. def show(truncate: Boolean): Unit

    Displays the top 20 rows of Dataset in a tabular form.

    truncate

    Whether to truncate long strings. If true, strings longer than 20 characters are truncated and all cells are right-aligned.

    Definition Classes
    Dataset
    Since

    1.6.0

  120. def show(): Unit

    Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.

    Definition Classes
    Dataset
    Since

    1.6.0

  121. def show(numRows: Int): Unit

    Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:

    year  month AVG('Adj Close) MAX('Adj Close)
    1980  12    0.503218        0.595103
    1981  01    0.523289        0.570307
    1982  02    0.436504        0.475256
    1983  03    0.410516        0.442194
    1984  04    0.450090        0.483521
    numRows

    Number of rows to show

    Definition Classes
    Dataset
    Since

    1.6.0

  122. val snappySession: SnappySession

  123. def sort(sortExprs: Column*): Dataset[Row]

    Returns a new Dataset sorted by the given expressions. For example:

    ds.sort($"col1", $"col2".desc)
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  124. def sort(sortCol: String, sortCols: String*): Dataset[Row]

    Returns a new Dataset sorted by the specified columns, all in ascending order.

    // The following 3 are equivalent
    ds.sort("sortcol")
    ds.sort($"sortcol")
    ds.sort($"sortcol".asc)
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  125. def sortWithinPartitions(sortExprs: Column*): Dataset[Row]

    Returns a new Dataset with each partition sorted by the given expressions.

    This is the same operation as "SORT BY" in SQL (Hive QL).

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  126. def sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[Row]

    Returns a new Dataset with each partition sorted by the given expressions.

    This is the same operation as "SORT BY" in SQL (Hive QL).

    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  127. val sparkSession: SparkSession

    Definition Classes
    Dataset
  128. lazy val sqlContext: SQLContext

    Definition Classes
    Dataset
  129. def stat: DataFrameStatFunctions

    Returns a DataFrameStatFunctions for working with statistical functions.

    // Finding frequent items in column with name 'a'.
    ds.stat.freqItems(Seq("a"))
    Definition Classes
    Dataset
    Since

    1.6.0

  130. def storageLevel: StorageLevel

    Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.

    Definition Classes
    Dataset
    Since

    2.1.0
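
    How storageLevel interacts with persist and unpersist can be sketched as follows. The example assumes a local SparkSession on a Spark version that provides storageLevel (2.1.0 or later), and the 10-row table is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object StorageLevelSketch {
  // Returns the storage level before persisting, while persisted, and after unpersisting.
  def levels(spark: SparkSession): (StorageLevel, StorageLevel, StorageLevel) = {
    val df = spark.range(10).toDF("id")
    val before = df.storageLevel // not persisted yet
    df.persist()                 // default level for a Dataset
    val during = df.storageLevel
    df.unpersist(true)           // block until the cached blocks are removed
    val after = df.storageLevel
    (before, during, after)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("storage-level-sketch").getOrCreate()
    println(levels(spark))
    spark.stop()
  }
}
```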

  131. final def synchronized[T0](arg0: ⇒ T0): T0

    Definition Classes
    AnyRef
  132. def take(n: Int): Array[Row]

    Returns the first n rows in the Dataset.

    Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Definition Classes
    Dataset
    Since

    1.6.0

  133. def takeAsList(n: Int): List[Row]

    Returns the first n rows in the Dataset as a list.

    Running take requires moving data into the application's driver process, and doing so with a very large n can crash the driver process with OutOfMemoryError.

    Definition Classes
    Dataset
    Since

    1.6.0

  134. def toDF(colNames: String*): DataFrame

    Converts this strongly typed collection of data to a generic DataFrame with columns renamed. This is convenient when converting an RDD of tuples into a DataFrame with meaningful column names. For example:

    val rdd: RDD[(Int, String)] = ...
    rdd.toDF()  // this implicit conversion creates a DataFrame with column name `_1` and `_2`
    rdd.toDF("id", "name")  // this creates a DataFrame with column name "id" and "name"
    Definition Classes
    Dataset
    Annotations
    @varargs()
    Since

    2.0.0

  135. def toDF(): DataFrame

    Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.

    Definition Classes
    Dataset
    Since

    1.6.0

  136. def toJSON: Dataset[String]

    Returns the content of the Dataset as a Dataset of JSON strings.

    Definition Classes
    Dataset
    Since

    2.0.0

  137. def toJavaRDD: JavaRDD[Row]

    Returns the content of the Dataset as a JavaRDD of T (Row for this class).

    Definition Classes
    Dataset
    Since

    1.6.0

  138. def toLocalIterator(): Iterator[Row]

    Returns an iterator that contains all the rows in this Dataset.

    The iterator will consume as much memory as the largest partition in this Dataset.

    Definition Classes
    Dataset
    Since

    2.0.0

    Note

    This results in multiple Spark jobs. If the input Dataset is the result of a wide transformation (e.g. a join with different partitioners), cache it first to avoid recomputation.

  139. def toString(): String

    Definition Classes
    Dataset → AnyRef → Any
  140. def transform[U](t: (Dataset[Row]) ⇒ Dataset[U]): Dataset[U]

    Concise syntax for chaining custom transformations.

    def featurize(ds: Dataset[T]): Dataset[U] = ...
    
    ds
      .transform(featurize)
      .transform(...)
    Definition Classes
    Dataset
    Since

    1.6.0
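
    A concrete version of the chaining pattern, as a sketch on plain Spark DataFrames with a local SparkSession. The two step functions and the `value` column are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object TransformSketch {
  // Two small, reusable steps that each take and return a DataFrame.
  def onlyPositive(df: DataFrame): DataFrame = df.filter(col("value") > 0)
  def withDoubled(df: DataFrame): DataFrame = df.withColumn("doubled", col("value") * 2)

  def pipeline(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(-1, 2, 3).toDF("value")
      .transform(onlyPositive)
      .transform(withDoubled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("transform-sketch").getOrCreate()
    pipeline(spark).show() // rows for value 2 and 3, with doubled 4 and 6
    spark.stop()
  }
}
```

    Compared with nesting the calls as withDoubled(onlyPositive(df)), the chained form reads in execution order.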

  141. def union(other: Dataset[Row]): Dataset[Row]

    Returns a new Dataset containing union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.

    To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.

    Definition Classes
    Dataset
    Since

    2.0.0
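
    The difference between union alone and union followed by distinct can be seen on two small hypothetical tables. This sketch assumes a local SparkSession.

```scala
import org.apache.spark.sql.SparkSession

object UnionSketch {
  // Returns (row count after union, row count after union + distinct).
  def counts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    val a = Seq(1, 2, 3).toDF("n")
    val b = Seq(3, 4).toDF("n")
    val all = a.union(b) // UNION ALL semantics: the duplicate 3 is kept
    (all.count(), all.distinct().count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("union-sketch").getOrCreate()
    println(counts(spark)) // (5,4)
    spark.stop()
  }
}
```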

  142. def unpersist(): AQPDataFrame.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    Definition Classes
    Dataset
    Since

    1.6.0

  143. def unpersist(blocking: Boolean): AQPDataFrame.this.type

    Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.

    blocking

    Whether to block until all blocks are deleted.

    Definition Classes
    Dataset
    Since

    1.6.0

  144. final def wait(): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  145. final def wait(arg0: Long, arg1: Int): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  146. final def wait(arg0: Long): Unit

    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  147. def where(conditionExpr: String): Dataset[Row]

    Filters rows using the given SQL expression.

    peopleDs.where("age > 15")
    Definition Classes
    Dataset
    Since

    1.6.0

  148. def where(condition: Column): Dataset[Row]

    Filters rows using the given condition. This is an alias for filter.

    // The following are equivalent:
    peopleDs.filter($"age" > 15)
    peopleDs.where($"age" > 15)
    Definition Classes
    Dataset
    Since

    1.6.0

  149. def withColumn(colName: String, col: Column): DataFrame

    Returns a new Dataset by adding a column or replacing the existing column that has the same name.

    Definition Classes
    Dataset
    Since

    2.0.0

  150. def withColumnRenamed(existingName: String, newName: String): DataFrame

    Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.

    Definition Classes
    Dataset
    Since

    2.0.0
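
    The add, replace, and no-op cases of withColumn and withColumnRenamed can be sketched as follows, assuming a local SparkSession and a hypothetical one-row `people` table.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnEditSketch {
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("ann", 34)).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("column-edit-sketch").getOrCreate()
    val df = people(spark)
    df.withColumn("age", col("age") + 1).show()      // same name: the age column is replaced
    df.withColumn("senior", col("age") > 60).show()  // new name: a column is added
    df.withColumnRenamed("age", "years").show()      // existing column: renamed
    df.withColumnRenamed("missing", "x").show()      // unknown column: schema is unchanged
    spark.stop()
  }
}
```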

  151. def withError(error: Double, confidence: Double = Constant.DEFAULT_CONFIDENCE, behavior: String = Constant.DEFAULT_BEHAVIOR): DataFrame

  152. def withWatermark(eventTime: String, delayThreshold: String): Dataset[Row]

    :: Experimental :: Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.

    Spark will use this watermark for several purposes:

    • To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
    • To minimize the amount of state that we need to keep for on-going aggregations.

    The current watermark is computed by looking at the MAX(eventTime) seen across all of the partitions in the query minus a user specified delayThreshold. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at least delayThreshold behind the actual event time. In some cases we may still process records that arrive more than delayThreshold late.

    eventTime

    the name of the column that contains the event time of the row.

    delayThreshold

    The minimum delay to wait for late data to arrive, relative to the latest record that has been processed, in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.1.0
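
    The watermark arithmetic described above can be worked through in plain Scala. This is a simplified model only: it does not call Spark, it ignores the cross-partition coordination mentioned in the description, and the function names and timestamps are hypothetical.

```scala
import java.time.{Duration, Instant}

object WatermarkSketch {
  // Simplified model: the maximum event time seen so far, minus the delay threshold.
  def watermark(eventTimes: Seq[Instant], delayThreshold: Duration): Instant =
    eventTimes.maxBy(_.toEpochMilli).minus(delayThreshold)

  // In this model a record is late when its event time falls before the watermark.
  def isLate(eventTime: Instant, wm: Instant): Boolean = eventTime.isBefore(wm)

  def main(args: Array[String]): Unit = {
    val seen = Seq("2020-01-01T12:00:00Z", "2020-01-01T12:07:00Z", "2020-01-01T12:03:00Z")
      .map(s => Instant.parse(s))
    val wm = watermark(seen, Duration.ofMinutes(5))
    println(wm)                                                // 2020-01-01T12:02:00Z
    println(isLate(Instant.parse("2020-01-01T12:01:00Z"), wm)) // true
    println(isLate(Instant.parse("2020-01-01T12:04:00Z"), wm)) // false
  }
}
```

    On a streaming Dataset the corresponding call would take the form ds.withWatermark("eventTime", "5 minutes"), where eventTime names a timestamp column.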

  153. def write: DataFrameWriter[Row]

    Interface for saving the content of the non-streaming Dataset out into external storage.

    Definition Classes
    Dataset
    Since

    1.6.0

  154. def writeStream: DataStreamWriter[Row]

    :: Experimental :: Interface for saving the content of the streaming Dataset out into external storage.

    Definition Classes
    Dataset
    Annotations
    @Experimental() @Evolving()
    Since

    2.0.0

Deprecated Value Members

  1. def explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame

    (Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.

    Given that this is deprecated, as an alternative, you can explode columns either using functions.explode():

    ds.select(explode(split('words, " ")).as("word"))

    or flatMap():

    ds.flatMap(_.words.split(" "))
    Definition Classes
    Dataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) use flatMap() or select() with functions.explode() instead

    Since

    2.0.0

  2. def explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame

    (Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a LATERAL VIEW in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.

    Because this method is deprecated, explode columns using either functions.explode() or flatMap() instead. The following example uses these alternatives to count the number of books that contain a given word:

    case class Book(title: String, words: String)
    val ds: Dataset[Book]
    
    val allWords = ds.select('title, explode(split('words, " ")).as("word"))
    
    val bookCountPerWord = allWords.groupBy("word").agg(countDistinct("title"))

    With flatMap(), the same expansion into words can be written as:

    ds.flatMap(_.words.split(" "))

    Definition Classes
    Dataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) use flatMap() or select() with functions.explode() instead

    Since

    2.0.0
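
    A self-contained variant of the book-count example might look like the sketch below. The sample titles and texts, the object name BookCountPerWord, and the output column alias books are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct, explode, split}

object BookCountPerWord {
  case class Book(title: String, words: String)

  // Returns, for each word, the number of distinct titles containing it.
  def run(spark: SparkSession): Map[String, Long] = {
    import spark.implicits._
    // Hypothetical sample data.
    val ds = Seq(
      Book("A", "spark scala"),
      Book("B", "spark sql")
    ).toDS()

    // One output row per (title, word) pair.
    val allWords = ds.select(col("title"), explode(split(col("words"), " ")).as("word"))

    // Count the distinct titles in which each word occurs.
    val bookCountPerWord = allWords
      .groupBy("word")
      .agg(countDistinct("title").as("books"))

    bookCountPerWord.as[(String, Long)].collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("book-count-per-word")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```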

  3. def registerTempTable(tableName: String): Unit

    Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.

    Definition Classes
    Dataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) Use createOrReplaceTempView(viewName) instead.

    Since

    1.6.0
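
    The recommended replacement, createOrReplaceTempView, is used in much the same way. In this minimal sketch the view name people, the sample rows, and the object name TempViewSketch are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  // Registers a small DataFrame as a temporary view and queries it with SQL.
  def run(spark: SparkSession): Long = {
    import spark.implicits._
    // Hypothetical sample data.
    val people = Seq(("Ann", 31), ("Bob", 17)).toDF("name", "age")

    // Replaces any existing temporary view with the same name in this session.
    people.createOrReplaceTempView("people")

    spark.sql("SELECT name FROM people WHERE age >= 18").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("temp-view-sketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```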

  4. def unionAll(other: Dataset[Row]): Dataset[Row]

    Returns a new Dataset containing union of rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL.

    For a SQL-style set union (which removes duplicate rows), call distinct on the result of this function.

    Definition Classes
    Dataset
    Annotations
    @deprecated
    Deprecated

    (Since version 2.0.0) use union()

    Since

    2.0.0
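
    The difference between UNION ALL semantics and a deduplicating set union can be sketched with the non-deprecated union(). The column name n, the sample values, and the object name UnionSketch are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object UnionSketch {
  // Returns (row count after union, row count after union followed by distinct).
  def run(spark: SparkSession): (Long, Long) = {
    import spark.implicits._
    // Hypothetical inputs that share the value 2.
    val left = Seq(1, 2).toDF("n")
    val right = Seq(2, 3).toDF("n")

    val all = left.union(right)   // keeps duplicates, like UNION ALL
    val deduped = all.distinct()  // removes duplicates, like SQL UNION

    (all.count(), deduped.count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("union-sketch")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

    Note that union resolves columns by position rather than by name, so both inputs should list their columns in the same order.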
