A relation produced by applying func to each element of the child, concatenating the resulting columns at the end of the input row.
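The append-columns behaviour described above can be illustrated with a small plain-Python model. This is only a sketch of the semantics, not Catalyst code: rows are modelled as tuples, and the names append_columns and rows are hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple  # assumption: a row is modelled as a plain tuple of column values


def append_columns(rows: Iterable[Row], func: Callable[[Row], Row]) -> Iterator[Row]:
    """Yield each input row followed by the columns that func computes from it."""
    for row in rows:
        yield row + tuple(func(row))


# Hypothetical input: (name, quantity) rows; func derives one extra column.
rows = [("a", 1), ("b", 2)]
result = list(append_columns(rows, lambda r: (r[1] * 10,)))
```

Each output row keeps the original columns and gains the new ones at the end, so the input schema is always a prefix of the output schema.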
An optimized version of AppendColumns that can be executed directly on deserialized objects.
A logical plan node with a left and right child.
A hint for the optimizer that we should broadcast the child if used in a join operator.
A relation produced by applying func to each grouping key and associated values from left and right children.
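As a rough illustration of the co-grouping described above, the following plain-Python sketch gathers the rows of two inputs by key and hands both groups to func. The helper name co_group, the key extractors, and the sorted iteration order (used only to keep the example deterministic) are assumptions, not part of the documented API.

```python
from collections import defaultdict


def co_group(left, right, key_left, key_right, func):
    """Call func(key, left_rows, right_rows) once per distinct key in either input."""
    groups = defaultdict(lambda: ([], []))
    for row in left:
        groups[key_left(row)][0].append(row)
    for row in right:
        groups[key_right(row)][1].append(row)
    out = []
    for key in sorted(groups):  # sorted only so the example output is stable
        left_rows, right_rows = groups[key]
        out.extend(func(key, iter(left_rows), iter(right_rows)))
    return out


left = [(1, "a"), (2, "b")]
right = [(1, "x"), (3, "y")]
counts = co_group(
    left, right,
    lambda r: r[0], lambda r: r[0],
    lambda key, ls, rs: [(key, len(list(ls)), len(list(rs)))],
)
```

A key that appears on only one side still produces a call to func, with an empty iterator for the other side.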
Statistics collected for a column.
1. Supported data types are defined in ColumnStat.supportsType.
2. The JVM data type stored in min/max is the external data type (used in Row) for the
corresponding Catalyst data type. For example, for DateType we store java.sql.Date, and for
TimestampType we store java.sql.Timestamp.
3. All integral types are upcast to longs; for example, shorts are stored as longs.
4. There is no guarantee that the statistics collected are accurate. Approximation algorithms
(sketches) might have been used, and the data collected can also be stale.
number of distinct values
minimum value
maximum value
number of nulls
average length of the values. For fixed-length types, this should be a constant.
maximum length of the values. For fixed-length types, this should be a constant.
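The six fields listed above can be pictured with a small exact computation over an in-memory column. This is a sketch under stated assumptions: the class ColumnStatSketch, the function collect_fixed_width_stats, and the 8-byte width are hypothetical, and a real collector may use approximate sketches rather than exact counts, as noted in item 4.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ColumnStatSketch:
    distinct_count: int
    min: Optional[int]
    max: Optional[int]
    null_count: int
    avg_len: float
    max_len: int


def collect_fixed_width_stats(values: Sequence[Optional[int]], width_in_bytes: int) -> ColumnStatSketch:
    """Compute exact statistics for a fixed-length integral column (None models null)."""
    non_null = [v for v in values if v is not None]
    return ColumnStatSketch(
        distinct_count=len(set(non_null)),
        min=min(non_null, default=None),
        max=max(non_null, default=None),
        null_count=len(values) - len(non_null),
        # For a fixed-length type both length statistics are the same constant.
        avg_len=float(width_in_bytes),
        max_len=width_in_bytes,
    )


# Assumed width of 8 bytes, in the spirit of storing integral values as longs.
stats = collect_fixed_width_stats([3, None, 7, 3], width_in_bytes=8)
```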
A logical node that represents a non-query command to be executed by the system. For example, commands can be used by parsers to represent DDL operations. Commands, unlike queries, are eagerly executed.
Takes the input row from child and turns it into an object using the given deserializer expression.
Returns a new logical plan that deduplicates input rows.
Used to mark a user specified column as holding the event time for a row.
Applies a number of projections to every input row, so each input row yields multiple output rows.
The projections to apply.
The output attributes of all projections.
The child operator.
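A minimal plain-Python model of applying several projections to each row might look as follows. It assumes a projection is a list of per-column functions of the input row; the names expand, rows, and projections are hypothetical and this is not the Catalyst implementation.

```python
def expand(rows, projections):
    """Emit one output row per (input row, projection) pair."""
    for row in rows:
        for projection in projections:
            yield tuple(expr(row) for expr in projection)


rows = [("a", "x", 1)]
# Two projections: one nulls out the second column, the other the first.
projections = [
    [lambda r: r[0], lambda r: None, lambda r: r[2]],
    [lambda r: None, lambda r: r[1], lambda r: r[2]],
]
expanded = list(expand(rows, projections))
```

With n projections, every input row produces exactly n output rows, all sharing the same output schema.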
Applies a Generator to a stream of input rows, combining the output of each into a new stream of rows. This operation is similar to a flatMap in functional programming with one important additional feature, which allows the input rows to be joined with their output.
the generator expression
when true, each output row is implicitly joined with the input tuple that produced it.
when true, each input row will be output at least once, even if the output of the given generator is empty. outer has no effect when join is false.
Qualifier for the attributes of the generator (UDTF).
The output schema of the Generator.
Child logical plan node
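The interaction of the join and outer flags described above can be sketched in plain Python. The function generate, the explicit output_width argument (used to build the all-null generator output), and the explode-style generator are assumptions made for illustration only.

```python
def generate(rows, generator, output_width, join=False, outer=False):
    """flatMap-like operator: optionally join each generated row with its input row."""
    for row in rows:
        produced = [tuple(out) for out in generator(row)]
        if not produced and join and outer:
            # Emit the input row once, padded with nulls for the generator columns.
            produced = [(None,) * output_width]
        for out in produced:
            yield row + out if join else out


rows = [("a", [1, 2]), ("b", [])]
explode = lambda r: ((x,) for x in r[1])  # hypothetical generator over the list column

plain = list(generate(rows, explode, 1))
joined = list(generate(rows, explode, 1, join=True))
outer_joined = list(generate(rows, explode, 1, join=True, outer=True))
```

Only the last call keeps row "b": its list is empty, so without outer the generator contributes nothing for it.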
A GROUP BY clause with GROUPING SETS can generate a result set equivalent to that generated by a UNION ALL of multiple simple GROUP BY clauses.
The Analyzer transforms GROUPING SETS into the logical plan Aggregate(.., Expand).
A list of bitmasks; each bitmask indicates the selected GROUP BY expressions.
The candidate GROUP BY expressions; each takes effect only if its associated bit in the bitmask is set to 1.
Child operator
The aggregation expressions; any GROUP BY expression that is not selected is treated as a constant null if it appears in these expressions.
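To make the equivalence with a UNION ALL of simple GROUP BY clauses concrete, here is a plain-Python sketch that evaluates a SUM per bitmask and concatenates the results. The bit order (bit i selects the i-th GROUP BY expression) is an assumption for this example, as are the names grouping_sets_sum and rows.

```python
from collections import defaultdict


def grouping_sets_sum(rows, group_by_cols, bitmasks, value_col):
    """Run one simple GROUP BY per bitmask and concatenate the results (UNION ALL)."""
    result = []
    for mask in bitmasks:
        totals = defaultdict(int)
        for row in rows:
            # Unselected grouping columns are replaced by a constant null.
            key = tuple(
                row[col] if mask & (1 << i) else None
                for i, col in enumerate(group_by_cols)
            )
            totals[key] += row[value_col]
        result.extend(key + (total,) for key, total in totals.items())
    return result


rows = [
    {"a": "x", "b": "p", "v": 1},
    {"a": "x", "b": "q", "v": 2},
    {"a": "y", "b": "p", "v": 4},
]
# Models GROUP BY a, b GROUPING SETS ((a), (b)).
grouped = grouping_sets_sum(rows, ["a", "b"], [0b01, 0b10], "v")
```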
Insert some data into a table.
the logical plan representing the table. In the future this should be an org.apache.spark.sql.catalyst.catalog.CatalogTable once we converge Hive tables and data source tables.
a map from the partition key to the partition value (optional). If a partition value is omitted (None), a dynamic partition insert will be performed. As an example, INSERT INTO tbl PARTITION (a=1, b=2) AS ... would have Map('a' -> Some('1'), 'b' -> Some('2')), and INSERT INTO tbl PARTITION (a=1, b) AS ... would have Map('a' -> Some('1'), 'b' -> None).
the logical plan representing the data to write.
overwrite existing table or partitions.
If true, only write if the table or partition does not exist.
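The partition map in the examples above can be modelled in plain Python, with None standing in for a missing (dynamic) partition value. The helper split_partition_spec is hypothetical and only illustrates how static and dynamic partition columns could be told apart.

```python
def split_partition_spec(spec):
    """Separate statically specified partition values from dynamic partition columns."""
    static = {key: value for key, value in spec.items() if value is not None}
    dynamic = [key for key, value in spec.items() if value is None]
    return static, dynamic


# Models PARTITION (a=1, b=2): every value is given, so nothing is dynamic.
all_static = split_partition_spec({"a": "1", "b": "2"})
# Models PARTITION (a=1, b): b has no value, so it is resolved dynamically.
mixed = split_partition_spec({"a": "1", "b": None})
```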
A logical plan node with no children.
A relation produced by applying func to each element of the child.
Applies func to each unique group in child, based on the evaluation of groupingAttributes. Func is invoked with an object representation of the grouping key and an iterator containing the object representation of all the rows with that key.
used to extract the key object for each group.
used to extract the items in the iterator from an input row.
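The per-group invocation described above, with a key object and an iterator over the group's items, can be sketched in plain Python. Sorting before grouping is an implementation choice of this sketch only, and the names map_groups, key_of, and value_of are hypothetical stand-ins for the key and value extractors.

```python
from itertools import groupby


def map_groups(rows, key_of, value_of, func):
    """Call func(key, iterator_of_values) once per distinct key and flatten the results."""
    ordered = sorted(rows, key=key_of)  # groupby needs equal keys to be adjacent
    out = []
    for key, group in groupby(ordered, key=key_of):
        out.extend(func(key, (value_of(row) for row in group)))
    return out


rows = [("a", 1), ("b", 2), ("a", 3)]
sums = map_groups(
    rows,
    key_of=lambda r: r[0],
    value_of=lambda r: r[1],
    func=lambda key, values: [(key, sum(values))],
)
```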
A relation produced by applying func to each partition of the child.
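In contrast to a per-element function, a per-partition function receives an iterator over a whole partition. The following plain-Python sketch models partitions as lists of elements; map_partitions is a hypothetical name.

```python
def map_partitions(partitions, func):
    """Apply func once per partition, passing it an iterator over that partition."""
    return [list(func(iter(partition))) for partition in partitions]


partitions = [[1, 2], [3]]
# func sees all elements of a partition at once, so it can emit one sum per partition.
partition_sums = map_partitions(partitions, lambda elements: [sum(elements)])
```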
A relation produced by applying a serialized R function func to each partition of the child.
A trait for logical operators that consume domain objects as input. The output of its child must be a single-field row containing the input object.
A trait for logical operators that produce domain objects as output. The output of this operator is a single-field safe row containing the produced object.
Options for writing new data into a table.
whether to overwrite existing data in the table.
if non-empty, specifies that we only want to overwrite partitions that match this partial partition spec. If empty, all partitions will be overwritten.
Returns a new RDD that has exactly numPartitions partitions. Differs from RepartitionByExpression as this method is called directly by DataFrames, because the user asked for coalesce or repartition. RepartitionByExpression is used when the consumer of the output requires some specific ordering or distribution of the data.
This method repartitions data using Expressions into numPartitions, and receives information about the number of partitions during execution. Used when a specific ordering or distribution is expected by the consumer of the query result. Use Repartition for RDD-like coalesce and repartition.
If numPartitions is not specified, the number of partitions will be the number set by spark.sql.shuffle.partitions.
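One way to picture repartitioning by an expression is to route each row to a partition chosen from the expression's value. The text does not state which partitioning function is used, so the hash-modulo rule below is an assumption, and repartition_by_expression is a hypothetical name.

```python
def repartition_by_expression(rows, expr, num_partitions):
    """Distribute rows into num_partitions buckets based on the value of expr(row)."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Assumption: hash of the expression value modulo the partition count.
        partitions[hash(expr(row)) % num_partitions].append(row)
    return partitions


rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
by_id = repartition_by_expression(rows, lambda r: r[0], num_partitions=2)
```

Rows with equal expression values always land in the same partition, which is the distribution guarantee a downstream consumer would rely on.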
When planning take() or collect() operations, this special node is inserted at the top of the logical plan before invoking the query planner.
Rules can pattern-match on this node in order to apply transformations that only take effect at the top of the logical query plan.
Sample the dataset.
Lower-bound of the sampling probability (usually 0.0)
Upper-bound of the sampling probability. The expected fraction sampled will be ub - lb.
Whether to sample with replacement.
the random seed
the LogicalPlan
Is created from TABLESAMPLE in the parser.
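The relationship between the two bounds and the expected sampled fraction can be illustrated with a plain-Python sketch of sampling without replacement. It assumes each row draws one uniform number in [0, 1) and is kept when the draw falls in [lb, ub); this reading and the name sample_without_replacement are assumptions, not a statement about the actual sampler.

```python
import random


def sample_without_replacement(rows, lower_bound, upper_bound, seed):
    """Keep a row when its uniform draw falls inside [lower_bound, upper_bound)."""
    rng = random.Random(seed)
    return [row for row in rows if lower_bound <= rng.random() < upper_bound]


rows = list(range(10_000))
# Expected fraction kept is ub - lb = 0.3.
kept = sample_without_replacement(rows, 0.0, 0.3, seed=42)
# Same seed with the complementary range keeps exactly the remaining rows.
rest = sample_without_replacement(rows, 0.3, 1.0, seed=42)
```

Under this model, reusing a seed with adjacent ranges splits a dataset into disjoint pieces, which is one reason to express a sample as a range rather than a single fraction.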
Input and output properties when passing data to a script. For example, in Hive this would specify which SerDes to use.
Transforms the input by forking and running the specified script.
the set of expressions that should be passed to the script.
the command that should be executed.
the attributes that are produced by the script.
the input and output schema applied in the execution of the script.
Takes the input object from child and turns it into an unsafe row using the given serializer expression.
The ordering expressions
True means the sort applies globally to the entire data set; false means the sort applies only within each partition.
Child logical plan
Estimates of various statistics. The default estimation logic simply lazily multiplies the corresponding statistic produced by the children. To override this behavior, override statistics and assign it an overridden version of Statistics.
NOTE: concrete and/or overridden versions of statistics fields should pay attention to the performance of the implementations. The reason is that estimations might get triggered in performance-critical processes, such as query plan planning.
Note that we are using a BigInt here since it is easy to overflow a 64-bit integer in cardinality estimation (e.g. cartesian joins).
Physical size in bytes. For leaf operators this defaults to 1, otherwise it defaults to the product of children's sizeInBytes.
Estimated number of rows.
Column-level statistics.
If true, output is small enough to be used in a broadcast join.
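The default size estimate and the reason for using an arbitrary-precision integer can be shown with a short plain-Python sketch. Python integers are unbounded, which plays the role of BigInt here; the names default_size_in_bytes and LONG_MAX are hypothetical.

```python
from math import prod

LONG_MAX = 2**63 - 1  # largest value of a signed 64-bit integer


def default_size_in_bytes(children_sizes):
    """Default estimate: the product of the children's sizes, or 1 for a leaf."""
    return prod(children_sizes)  # prod of an empty sequence is 1


leaf_size = default_size_in_bytes([])
# Two children of 1 TiB each, as in a cartesian join under the default estimate.
join_size = default_size_in_bytes([2**40, 2**40])
```

The product of two 1 TiB inputs is 2**80, far beyond what a 64-bit integer can hold, so a fixed-width counter would overflow.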
This node is inserted at the top of a subquery when it is optimized. This makes sure we can recognize a subquery as such, and it allows us to write subquery aware transformations.
A relation produced by applying func to each element of the child and filtering the elements by the resulting boolean value.
This is logically equal to a normal Filter operator whose condition expression decodes the input row to an object and applies the given function to the decoded object. However, we need the encapsulation of TypedFilter to make the concept clearer and to make it easier to write optimizer rules.
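The decode-then-test behaviour described above can be modelled in plain Python: each row is deserialized into a domain object, the predicate runs on that object, and the original row is kept when it returns true. The Person class and the typed_filter helper are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int


def typed_filter(rows, deserialize, func):
    """Keep the rows whose decoded object satisfies func."""
    return [row for row in rows if func(deserialize(row))]


rows = [("ann", 34), ("bob", 17)]
adults = typed_filter(rows, lambda r: Person(*r), lambda p: p.age >= 18)
```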
A logical plan node with a single child.
A container for holding named common table expressions (CTEs) and a query plan. This operator will be removed during analysis and the relations will be substituted into child.
The final query of this CTE.
A sequence of pairs (alias, CTE definition) that this CTE defines. Each CTE can see only the base tables and the previously defined CTEs.
Factory for constructing new AppendColumns nodes.
Factory for constructing new CoGroup nodes.
Factory for constructing new FlatMapGroupsInR nodes.
Factory for constructing new MapGroups nodes.
A relation with one row. This is used in "SELECT ..." without a FROM clause.
Factory for constructing new Range nodes.
Factory for constructing new Union nodes.
A relation produced by applying func to each element of the child, concatenating the resulting columns at the end of the input row.
used to extract the input to func from an input row.
used to serialize the output of func.