org.apache.spark.sql.execution
All the attributes that are used for this plan.
Returns the tree node at the specified number, used primarily for interactive debugging. Numbers for each node can be found in the numberedTreeString.
Note that this cannot return BaseType because a logical plan node might return a physical plan for innerChildren, e.g. the in-memory relation logical plan node holds a reference to the physical plan node it is referencing.
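For instance, a hedged usage sketch in a Spark shell (df is an assumed Dataset; numberedTreeString and p are the methods documented here):

```scala
// Print the plan with node numbers, then fetch a specific subtree by number.
val plan = df.queryExecution.executedPlan
println(plan.numberedTreeString)  // numbers for each node
val subtree = plan(2)             // apply: returns the node as a generic TreeNode
val typed   = plan.p(2)           // p: same node, typed as BaseType (here SparkPlan)
```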
Returns a string representing the arguments to this node, minus any children.
Returns a 'scala code' representation of this TreeNode and its children. Intended for use when debugging where the prettier toString function is obfuscating the actual structure. In the case of 'pure' TreeNodes that only contain primitives and other TreeNodes, the result can be pasted in the REPL to build an equivalent Tree.
Canonicalized copy of this query plan.
Returns a Seq of the children of this node. Children should not change; immutability is required for the containsChild optimization.
Args that have been cleaned such that differences in expression id do not affect equality.
Eagerly clear any broadcasts created by this plan execution.
Returns a Seq containing the result of applying a partial function to all elements in this tree on which the function is defined.
Finds and returns the first TreeNode of the tree for which the given partial function is defined (pre-order), and applies the partial function to it.
Returns a Seq containing the leaves in this tree.
An ExpressionSet that contains invariants about the rows output by this operator. For example, if this set contains the expression a = 2, then that expression is guaranteed to evaluate to true for all rows produced.
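A hedged illustration (df is an assumed Dataset; constraints is inspected on the analyzed logical plan):

```scala
import org.apache.spark.sql.functions.col

val filtered = df.filter(col("a") === 2)
// The constraint set should now contain (a = 2) and isnotnull(a); both are
// guaranteed to evaluate to true for every row this operator produces.
println(filtered.queryExecution.analyzed.constraints)
```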
Consume the generated columns or row from the current SparkPlan, calling its parent's doConsume().
Generate the Java source code to process the rows from the child SparkPlan. This should be overridden by subclasses to support codegen. For example, Filter will generate code like this:

```
# code to evaluate the predicate expression, result is isNull1 and value2
if (isNull1 || !value2) continue;
# call consume(), which will call parent.doConsume()
```

Note: A plan can either consume the rows as UnsafeRow (row), or a list of variables (input).
Overridden by concrete implementations of SparkPlan. Produces the result of the query as an RDD[InternalRow].
Overridden by concrete implementations of SparkPlan. Produces the result of the query as a broadcast variable.
Overridden by concrete implementations of SparkPlan. It is guaranteed to run before any execute of SparkPlan. This is helpful if we want to set up some state before executing the query, e.g., BroadcastHashJoin uses it to broadcast asynchronously.
Note: the prepare method has already walked down the tree, so the implementation doesn't need to call children's prepare methods. This will only be called once, protected by this.
Generate the Java source code to process; this should be overridden by subclasses to support codegen. doProduce() usually generates the framework; for example, aggregation could generate this:

```
if (!initialized) {
  # create a hash map, then build the aggregation hash map
  # call child.produce()
  initialized = true;
}
while (hashmap.hasNext()) {
  row = hashmap.next();
  # build the aggregation results
  # create variables for results
  # call consume(), which will call parent.doConsume()
  if (shouldStop()) return;
}
```
Returns source code to evaluate the variables for required attributes, and clears the code of evaluated variables to prevent them from being evaluated twice.
Returns source code to evaluate all the variables, and clears their code to prevent them from being evaluated twice.
Returns the result of this query as an RDD[InternalRow] by delegating to doExecute after preparations. Concrete implementations of SparkPlan should override doExecute.
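A hedged usage sketch of this internal API (df is an assumed Dataset):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow

val plan = df.queryExecution.executedPlan
// execute() runs prepare() and the other preparations first, then delegates
// to the operator's doExecute().
val rows: RDD[InternalRow] = plan.execute()
```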
Returns the result of this query as a broadcast variable by delegating to doExecuteBroadcast after preparations. Concrete implementations of SparkPlan should override doExecuteBroadcast.
Runs this query returning the result as an array.
Runs this query returning the result as an array, using the external Row format.
Execute a query after preparing the query and adding query plan information to created RDDs for visualization.
Runs this query returning the first n rows as an array. This is modeled after RDD.take but never runs any job locally on the driver.
Runs this query returning the result as an iterator of InternalRow. Note: this will trigger multiple jobs (one for each partition).
Returns all of the expressions present in this query plan operator.
Faster version of equality which short-circuits when two treeNodes are the same instance. We don't just override Object.equals, as doing so prevents the scala compiler from generating case class equals methods.
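A minimal sketch of the short-circuit (assumed shape, not copied from the source):

```scala
def fastEquals(other: TreeNode[_]): Boolean =
  this.eq(other) || this == other  // reference equality first, then case-class equals
```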
Find the first TreeNode that satisfies the condition specified by f.
Returns a Seq by applying a function to all nodes in this tree and using the elements of the resulting collections.
Runs the given function on this node and then recursively on children.
Runs the given function recursively on children then on this node.
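Minimal sketches of the two traversal orders (assuming a children: Seq[BaseType] field as documented above):

```scala
def foreach(f: BaseType => Unit): Unit = {
  f(this)                            // pre-order: visit this node first
  children.foreach(_.foreach(f))
}

def foreachUp(f: BaseType => Unit): Unit = {
  children.foreach(_.foreachUp(f))   // post-order: visit children first
  f(this)
}
```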
Generate code to compare equality of a given object (objVar) against key column variables.
Generate code to calculate the hash code for given column variables that correspond to the key columns in this class.
Generate code to look up the map, or insert a new key and value if not found.
Appends the string representation of this node and its children to the given StringBuilder.
The i-th element in lastChildren indicates whether the ancestor of the current node at depth i + 1 is the last child of its own parent node. The depth of the root node is 0, and lastChildren for the root node should be empty.
Note that this traversal (numbering) order must be the same as getNodeNumbered.
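A hedged sketch of how lastChildren can drive the indentation prefix (an illustrative shape, not necessarily the exact source):

```scala
// For every ancestor except the immediate parent, emit a continuation bar
// unless that ancestor was the last child; the final entry picks the branch glyph.
if (lastChildren.nonEmpty) {
  lastChildren.init.foreach { isLast =>
    builder.append(if (isLast) "   " else ":  ")
  }
  builder.append(if (lastChildren.last) "+- " else ":- ")
}
```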
Generate code to update a class object's fields with the given resultVars. If accessors for fields have been generated (using getColumnVars) then those can be passed for faster reads where required.

Parameters:
- the variable holding the reference to the class object
- accessors for object fields, if available
- result values to be assigned to object fields
- if true, update key fields, else value fields
- if true, a copy of reference values is assigned, else only a reference copy is done
- if true, this is for initialization of fields after object creation, so some checks can be skipped

Returns: code to assign objVar fields to the given resultVars
Get the generated class name.
Get the ExprCode for the key and/or value columns given a class object variable. This also returns initialization code that should be inserted in the generated code first. The last element in the result tuple is the names of the null mask variables.
Extracts the relevant constraints from a given set of constraints based on the attributes that appear in the outputSet.
All the nodes that should be shown as an inner nested tree of this node.
Returns all the RDDs of InternalRow which generate the input rows. Note: right now we support up to two RDDs.
The set of all attributes that are input to this operator by its children.
Return a LongSQLMetric according to the name.
Overridden makeCopy also propagates sqlContext to the copied plan.
Returns a Seq containing the result of applying the given function to each node in this tree in a preorder traversal.

Parameter: the function to be applied
Returns a copy of this node where f has been applied to all the node's children.
Apply a map function to each expression present in this query operator, and return a new query operator based on the mapped expressions.
Efficient alternative to productIterator.map(f).toArray.
Return all metadata that describes more details of this SparkPlan.
Creates a metric using the specified name. Returns the name of the variable representing the metric.
Return all the metrics of this SparkPlan.
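A hedged sketch of declaring and updating a metric in an operator (the metric name and the SQLMetrics.createMetric helper are assumptions based on contemporary Spark versions):

```scala
// In the operator definition:
override lazy val metrics = Map(
  "numOutputRows" -> SQLMetrics.createMetric(sparkContext, "number of output rows"))

// Inside doExecute(), look the metric up by name and bump it per row:
val numOutputRows = longMetric("numOutputRows")
numOutputRows += 1
```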
Attributes that are referenced by expressions but not provided by this node's children. Subclasses should override this method if they produce attributes internally, as it is used by assertions designed to prevent the construction of invalid plans.
Creates a row ordering for the given schema, in natural ascending order.
Returns the name of this type of TreeNode. Defaults to the class name. Note that we remove the "Exec" suffix for physical operators here.
Returns a string representation of the nodes in this tree, where each operator is numbered. The numbers can be used with TreeNode.apply to easily access specific subtrees. The numbers are based on a depth-first traversal of the tree (with innerChildren traversed before children).
Args to the constructor that should be copied, but not transformed. These are appended to the transformed args automatically by makeCopy.
Specifies how data is ordered in each partition.
Specifies how data is partitioned across different nodes in the cluster.
Returns the set of attributes that are output by this node.
Returns the tree node at the specified number, used primarily for interactive debugging. Numbers for each node can be found in the numberedTreeString. This is a variant of apply that returns the node as BaseType (if the type matches).
Which SparkPlan is calling produce() of this one. It's itself for the first SparkPlan.
Prepare a SparkPlan for execution. It's idempotent.
Finds scalar subquery expressions in this plan node and starts evaluating them.
Prints out the schema in the tree format.
Returns Java source code to process the rows from the input RDD.
The set of all attributes that are produced by this node.
All Attributes that appear in expressions from this operator. Note that this set does not include attributes that are implicitly referenced by being passed through to the output tuple.
Specifies any partition requirements on the input data for this operator.
Specifies the required sort order for each partition of the input data for this operator.
Reset all the metrics.
Returns true when the given query plan will return the same results as this query plan.

Since it's likely undecidable to generally determine whether two given plans will produce the same results, it is okay for this function to return false even if the results are actually the same. Such behavior will not affect correctness, only the application of performance enhancements like caching. However, it is not acceptable to return true if the results could possibly be different.

By default this function performs a modified version of equality that is tolerant of cosmetic differences like attribute naming and/or expression id differences. Operators that can do better should override this function.
Returns the output schema in the tree format.
ONE line description of this node.
A handle to the SQL Context that was used to create this plan. Since many operators need access to the sqlContext for RDD operations or configuration, this field is automatically populated by the query planning infrastructure.
A prefix string used when printing the plan. We use "!" to indicate an invalid plan, and "'" to indicate an unresolved plan.
The arguments that should be included in the arg string. Defaults to the productIterator.
All the subqueries of the current plan.
Whether this SparkPlan supports whole-stage codegen or not.
Returns a copy of this node where rule has been recursively applied to the tree. When rule does not apply to a given node it is left unchanged. Users should not expect a specific directionality; if a specific directionality is needed, transformDown or transformUp should be used.

Parameter: the function used to transform this node's children
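For example, a hedged sketch of a rewrite over a logical plan (Catalyst imports assumed; nodes the partial function does not match pass through unchanged):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.plans.logical.Filter
import org.apache.spark.sql.types.BooleanType

// Drop filters whose condition is the literal true.
val simplified = plan transform {
  case Filter(Literal(true, BooleanType), child) => child
}
```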
Returns the result of running transformExpressions on this node and all its children.
Returns a copy of this node where rule has been recursively applied to it and all of its children (pre-order). When rule does not apply to a given node it is left unchanged.

Parameter: the function used to transform this node's children
Runs transform with rule on all expressions present in this query operator. Users should not expect a specific directionality; if a specific directionality is needed, transformExpressionsDown or transformExpressionsUp should be used.

Parameter: the rule to be applied to every expression in this operator
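A hedged example, assuming the two-argument Add of this Spark version:

```scala
import org.apache.spark.sql.catalyst.expressions.{Add, Literal}
import org.apache.spark.sql.types.IntegerType

// Fold integer additions of two literals wherever they occur in this operator.
val folded = plan transformExpressions {
  case Add(Literal(a: Int, IntegerType), Literal(b: Int, IntegerType)) =>
    Literal(a + b)
}
```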
Runs transformDown with rule on all expressions present in this query operator.

Parameter: the rule to be applied to every expression in this operator
Runs transformUp with rule on all expressions present in this query operator.

Parameter: the rule to be applied to every expression in this operator
Returns a copy of this node where rule has been recursively applied first to all of its children and then to itself (post-order). When rule does not apply to a given node, it is left unchanged.

Parameter: the function used to transform this node's children
Returns a string representation of the nodes in this tree.
The subset of inputSet that should be evaluated before this plan. We will use this to insert code to access those columns that are actually used by the current plan before calling doConsume().
This method can be overridden by any child class of QueryPlan to specify a set of constraints based on the given operator's constraint propagation logic. These constraints are then canonicalized and filtered automatically to contain only those attributes that appear in the outputSet. See Canonicalize for more details.
ONE line description of this node with more information.
Blocks the thread until all subqueries finish evaluation and update the results.
Returns a copy of this node with the children replaced. TODO: Validate somewhere (in debug mode?) that children are ordered correctly.
Provides helper methods for generated code to use ObjectHashSet with a generated class (having key and value columns as corresponding java type fields). This implementation saves the entire overhead of UnsafeRow conversion for both key type (like in BytesToBytesMap) and value type (like in BytesToBytesMap and VectorizedHashMapGenerator).
It has been carefully optimized to minimize memory reads/writes, with minimalistic code to fit better in CPU instruction cache. Unlike the other two maps used by HashAggregateExec, this has no limitations on the key or value column types.
The basic idea is that all of the key and value columns will be individual fields in a generated java class having corresponding java types. Storing a column value in the map is a simple matter of assigning the incoming variable to the corresponding field of the class object, and access likewise reads from that field of the class. Nullability information is packed into long bit-mask fields, of which only as many as required are generated (instead of the unnecessary overhead of something like a BitSet).
Hashcode and equals methods are generated for the key column fields. Having both key and value fields in the same class object helps cut down the generated code as well as improving cache locality, and saves at least one memory access for each row. In testing, this alone has been shown to improve performance by ~25% in simple group by queries. Furthermore, this class also provides inline hashcode and equals methods so that incoming register variables in generated code can be used directly (instead of being stuffed into a lookup key whose fields would be read again inside). The class hashcode method is supposed to be used only internally by rehashing, and even that is just a field cached in the class object that is filled in during the initial insert (from the inline hashcode).
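A hand-written Scala analogue of the kind of entry class the generator emits (an illustrative assumption; the real generated code is Java, with one typed field per key/value column):

```scala
import org.apache.spark.unsafe.types.UTF8String

final class GeneratedEntry(
    var key1: Long,          // key columns as plain typed fields
    var key2: UTF8String,
    var value1: Double,      // value columns live in the same object
    var nullMask: Long,      // one bit per nullable column, packed into longs
    var cachedHash: Int) {   // filled on initial insert, reused only by rehashing

  // Inline-style key equality against incoming register variables: compares key
  // fields directly instead of materializing a separate lookup-key object.
  def keyEquals(k1: Long, k2: UTF8String): Boolean =
    key1 == k1 && ((key2 eq k2) || key2.equals(k2))
}
```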
For memory management this uses a simple approach of starting with an estimated size, then improving that estimate for the future in a rehash, where the rehash will also collect the actual size of the current entries. If the rehash indicates that no memory is available, then it will fall back to dumping the current map into the MemoryManager and creating a new one, with the merge being done by an external sorter in a manner similar to how UnsafeFixedWidthAggregationMap handles the situation. The caller can instead decide to dump the entire map in that scenario, e.g. when used for a HashJoin.
Overall this map is 5-10X faster than UnsafeFixedWidthAggregationMap and 2-4X faster than VectorizedHashMapGenerator. It is generic enough to be used for both group by aggregation as well as for HashJoins.