A container for an AggregateFunction with its AggregateMode and a field (isDistinct) indicating whether the DISTINCT keyword is specified for this function.
AggregateFunction is the superclass of the two aggregation function interfaces, ImperativeAggregate and DeclarativeAggregate.
The mode of an AggregateFunction.
The ApproximatePercentile function returns the approximate percentile(s) of a column at the given percentage(s).
The ApproximatePercentile function returns the approximate percentile(s) of a column at the given percentage(s). A percentile is a watermark value below which a given percentage of the column values fall. For example, the percentile of column col at percentage 50% is the median of column col.
This function supports partial aggregation.
child expression that can produce column value with child.eval(inputRow)
Expression that represents a single percentage value or an array of percentage values. Each percentage value must be between 0.0 and 1.0.
Integer literal expression of approximation accuracy. Higher values yield better accuracy; the default value is DEFAULT_PERCENTILE_ACCURACY.
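A brief usage sketch, assuming a SparkSession named spark; ApproximatePercentile is commonly reachable through the percentile_approx SQL function, whose exact name and optional accuracy argument may vary across Spark versions.

    val df = spark.range(0, 1000).toDF("col")
    df.createOrReplaceTempView("t")

    // Median (percentage 0.5), then several percentiles with an explicit accuracy literal.
    spark.sql("SELECT percentile_approx(col, 0.5) FROM t").show()
    spark.sql("SELECT percentile_approx(col, array(0.25, 0.5, 0.75), 10000) FROM t").show()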
A central moment is the expected value of a specified power of the deviation of a random variable from the mean.
A central moment is the expected value of a specified power of the deviation of a random variable from the mean. Central moments are often used to characterize properties of the shape of a distribution.
This class implements online, one-pass algorithms for computing the central moments of a set of points.
Behavior: returns Double.NaN when the column contains Double.NaN values.
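For illustration, the following is a minimal sketch of the kind of one-pass (Welford-style) update and merge such an algorithm performs for the mean and the second central moment. It is an illustrative example, not Spark's exact CentralMomentAgg implementation.

    // Running mean and second central moment (m2 = sum of squared deviations from the mean).
    final class MomentBuffer {
      var n: Long = 0L
      var mean: Double = 0.0
      var m2: Double = 0.0

      // One-pass update with a single value.
      def update(x: Double): Unit = {
        n += 1
        val delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
      }

      // Merge a buffer computed on another partition (parallel combine formula).
      def merge(other: MomentBuffer): Unit = {
        if (other.n > 0) {
          val delta = other.mean - mean
          val total = n + other.n
          m2 += other.m2 + delta * delta * n * other.n / total
          mean = (mean * n + other.mean * other.n) / total
          n = total
        }
      }

      // Population variance, i.e. the second central moment; NaN when no rows were seen.
      def populationVariance: Double = if (n == 0) Double.NaN else m2 / n
    }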
The Collect aggregate function collects all seen expression values into a list of values.
The Collect aggregate function collects all seen expression values into a list of values.
The operator is bound to the slower sort-based aggregation path because the number of elements (and their memory usage) cannot be determined in advance. This also means that the collected elements are stored on the heap, and too many elements can cause GC pauses and eventually OutOfMemoryErrors.
Collect a list of elements.
Collect a list of elements.
Collect a set of unique elements.
Collect a set of unique elements.
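A brief usage sketch, assuming a SparkSession named spark; CollectList and CollectSet back the collect_list and collect_set functions.

    import org.apache.spark.sql.functions.{collect_list, collect_set}
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")
    df.groupBy("k")
      .agg(collect_list("v").as("all_values"), collect_set("v").as("distinct_values"))
      .show()
    // All collected elements of a group are kept in memory, so very large groups can
    // cause GC pauses or OutOfMemoryErrors, as noted above.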
Compute Pearson correlation between two expressions.
Compute Pearson correlation between two expressions. When applied on empty data (i.e., count is zero), it returns NULL.
Definition of Pearson correlation can be found at http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Compute the covariance between two expressions.
Compute the covariance between two expressions. When applied on empty data (i.e., count is zero), it returns NULL.
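As a hedged usage sketch, assuming a SparkSession named spark: these aggregates are commonly exposed as corr, covar_samp, and covar_pop, and Pearson correlation is the covariance divided by the product of the standard deviations.

    import org.apache.spark.sql.functions.{corr, covar_pop, covar_samp}
    import spark.implicits._

    val df = Seq((1.0, 2.0), (2.0, 4.1), (3.0, 6.2)).toDF("x", "y")
    // corr(x, y) = cov(x, y) / (stddev(x) * stddev(y)); NULL on empty input.
    df.agg(corr("x", "y"), covar_samp("x", "y"), covar_pop("x", "y")).show()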
API for aggregation functions that are expressed in terms of Catalyst expressions.
API for aggregation functions that are expressed in terms of Catalyst expressions.
When implementing a new expression-based aggregate function, start by implementing bufferAttributes, defining attributes for the fields of the mutable aggregation buffer. You can then use these attributes when defining updateExpressions, mergeExpressions, and evaluateExpressions.
Please note that the children of an aggregate function can be unresolved (this happens when the function is created in the DataFrame API). So, if there are any fields in the implementing class that need to access fields of its children, please make those fields lazy vals.
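A minimal sketch of a declarative aggregate (a simple long sum) following the pattern described above. The member names shown follow recent Spark versions (aggBufferAttributes, initialValues, evaluateExpression); older versions use slightly different names, and some versions require additional overrides, so treat this as illustrative rather than exact.

    import org.apache.spark.sql.catalyst.expressions._
    import org.apache.spark.sql.catalyst.expressions.aggregate.DeclarativeAggregate
    import org.apache.spark.sql.types.{DataType, LongType}

    case class SimpleLongSum(child: Expression) extends DeclarativeAggregate {
      override def children: Seq[Expression] = child :: Nil
      override def nullable: Boolean = true
      override def dataType: DataType = LongType

      // Buffer attribute is a lazy val because `child` may still be unresolved.
      private lazy val sum = AttributeReference("sum", LongType)()
      override lazy val aggBufferAttributes: Seq[AttributeReference] = sum :: Nil

      override lazy val initialValues: Seq[Expression] = Literal(0L) :: Nil

      // Add the (null-safe) child value to the running sum.
      override lazy val updateExpressions: Seq[Expression] =
        Add(sum, Coalesce(Seq(Cast(child, LongType), Literal(0L)))) :: Nil

      // Combine the running sums of two partial buffers.
      override lazy val mergeExpressions: Seq[Expression] =
        Add(sum.left, sum.right) :: Nil

      override lazy val evaluateExpression: Expression = sum
    }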
Returns the first value of child for a group of rows.
Returns the first value of child for a group of rows. If the first value of child is null, it returns null (respecting nulls). Even if First is used on an already sorted column, if we do partial aggregation and final aggregation (when mergeExpression is used), its result will not be deterministic (unless the input table is sorted and has a single partition, and we use a single reducer to do the aggregation).
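A brief usage sketch, assuming a SparkSession named spark; First is exposed as the first function, and whether nulls are skipped is controlled by an ignoreNulls flag. As explained above, the value returned for an unsorted, multi-partition input may differ from run to run.

    import org.apache.spark.sql.functions.first
    import spark.implicits._

    val df = Seq(("a", 10), ("a", 20), ("b", 30)).toDF("k", "v")
    df.groupBy("k").agg(first("v")).show()                       // respects nulls (default)
    df.groupBy("k").agg(first("v", ignoreNulls = true)).show()   // skips leading nulls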
HyperLogLog++ (HLL++) is a state of the art cardinality estimation algorithm.
HyperLogLog++ (HLL++) is a state of the art cardinality estimation algorithm. This class implements the dense version of the HLL++ algorithm as an Aggregate Function.
This implementation has been based on the following papers: HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/40671.pdf
Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen#
child expression to estimate the cardinality of.
the maximum estimation error allowed.
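A brief usage sketch, assuming a SparkSession named spark; HyperLogLogPlusPlus backs approximate distinct counting, commonly exposed as approx_count_distinct, whose second argument is the maximum allowed relative standard deviation.

    import org.apache.spark.sql.functions.approx_count_distinct

    val df = spark.range(0, 100000).toDF("id")
    df.agg(approx_count_distinct("id")).show()        // default estimation error
    df.agg(approx_count_distinct("id", 0.01)).show()  // allow at most ~1% relative error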
API for aggregation functions that are expressed in terms of imperative initialize(), update(), and merge() functions which operate on Row-based aggregation buffers.
API for aggregation functions that are expressed in terms of imperative initialize(), update(), and merge() functions which operate on Row-based aggregation buffers.
Within these functions, code should access fields of the mutable aggregation buffer by adding the bufferSchema-relative field number to mutableAggBufferOffset and then using this new field number to access the buffer Row. This is necessary because this aggregation function's buffer is embedded inside of a larger shared aggregation buffer when an aggregation operator evaluates multiple aggregate functions at the same time.
We need to perform similar field number arithmetic when merging multiple intermediate aggregate buffers together in merge() (in this case, use inputAggBufferOffset when accessing the input buffer).
Correct ImperativeAggregate evaluation depends on the correctness of mutableAggBufferOffset and inputAggBufferOffset, but not on the correctness of the attribute ids in aggBufferAttributes and inputAggBufferAttributes.
Returns the last value of child for a group of rows.
Returns the last value of child for a group of rows. If the last value of child is null, it returns null (respecting nulls). Even if Last is used on an already sorted column, if we do partial aggregation and final aggregation (when mergeExpression is used), its result will not be deterministic (unless the input table is sorted and has a single partition, and we use a single reducer to do the aggregation).
The Percentile aggregate function returns the exact percentile(s) of numeric column expr at the given percentage(s) with value range in [0.0, 1.0].
The Percentile aggregate function returns the exact percentile(s) of numeric column expr at the given percentage(s) with value range in [0.0, 1.0].
The operator is bound to the slower sort-based aggregation path because the number of elements and their partial order cannot be determined in advance. Therefore we have to store all the elements in memory, and too many elements can cause GC pauses and eventually OutOfMemoryErrors.
child expression that produces a numeric column value with child.eval(inputRow)
Expression that represents a single percentage value or an array of percentage values. Each percentage value must be in the range [0.0, 1.0].
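A brief usage sketch, assuming a SparkSession named spark; the exact Percentile aggregate is commonly reachable through the percentile SQL function. Unlike the approximate variant, all elements are buffered in memory, as noted above.

    spark.range(0, 1000).toDF("col").createOrReplaceTempView("t")
    spark.sql("SELECT percentile(col, 0.5) FROM t").show()                    // exact median
    spark.sql("SELECT percentile(col, array(0.25, 0.5, 0.75)) FROM t").show() // several percentiles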
PivotFirst is an aggregate function used in the second phase of a two phase pivot to do the required rearrangement of values into pivoted form.
PivotFirst is an aggregate function used in the second phase of a two phase pivot to do the required rearrangement of values into pivoted form.
For example, given the input

     A | B
    ---+---
     x | 1
     y | 2
     z | 3

with pivotColumn=A, valueColumn=B, and pivotColumnValues=[z,y], the output is [3,2].
column that determines which output position to put valueColumn in.
the column that is being rearranged.
the list of pivotColumn values in the order of desired output. Values not listed here will be ignored.
Aggregation function which allows **arbitrary** user-defined java object to be used as internal aggregation buffer.
Aggregation function which allows **arbitrary** user-defined java object to be used as internal aggregation buffer.
     aggregation buffer for normal aggregation function `avg`      aggregate buffer for `sum`
                   |                                                              |
                   v                                                              v
    +--------------+---------------+-----------------------------------+-------------+
    | sum1 (Long)  | count1 (Long) | generic user-defined java objects | sum2 (Long) |
    +--------------+---------------+-----------------------------------+-------------+
                                                     ^
                                                     |
         aggregation buffer object for `TypedImperativeAggregate` aggregation function
General work flow:

Stage 1: initialize aggregate buffer object.
  1. The framework calls initialize(buffer: MutableRow) to set up the empty aggregate buffer.
  2. In initialize, we call createAggregationBuffer(): T to get the initial buffer object, and set it to the global buffer row.

Stage 2: process input rows.
  If the aggregate mode is Partial or Complete:
  1. The framework calls update(buffer: MutableRow, input: InternalRow) to process the input row.
  2. In update, we get the buffer object from the global buffer row and call update(buffer: T, input: InternalRow): Unit.

  If the aggregate mode is PartialMerge or Final:
  1. The framework calls merge(buffer: MutableRow, inputBuffer: InternalRow) to process the input rows, which are serialized buffer objects shuffled from other nodes.
  2. In merge, we get the buffer object from the global buffer row, get the binary data from the input row and deserialize it to a buffer object, then call merge(buffer: T, input: T): Unit to merge these two buffer objects.

Stage 3: output results.
  If the aggregate mode is Partial or PartialMerge:
  1. The framework calls serializeAggregateBufferInPlace to replace the buffer object in the global buffer row with binary data.
  2. In serializeAggregateBufferInPlace, we get the buffer object from the global buffer row and call serialize(buffer: T): Array[Byte] to serialize the buffer object to binary.
  3. The framework outputs the buffer attributes and shuffles them to other nodes.

  If the aggregate mode is Final or Complete:
  1. The framework calls eval(buffer: InternalRow) to calculate the final result.
  2. In eval, we get the buffer object from the global buffer row and call eval(buffer: T): Any to get the final result.
  3. The framework outputs these final results.

Window function work flow:
  The framework calls update(buffer: MutableRow, input: InternalRow) several times and then calls eval(buffer: InternalRow), so there is no need for the window operator to call serializeAggregateBufferInPlace.
NOTE: SQL with TypedImperativeAggregate functions is planned on the sort-based aggregation path instead of hash-based aggregation, because TypedImperativeAggregate uses BinaryType as the aggregation buffer's storage format, which is not supported by hash-based aggregation. Hash-based aggregation only supports aggregation buffers of mutable types (such as LongType and IntType, which have fixed length and can be mutated in place in an UnsafeRow).
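A minimal sketch of a TypedImperativeAggregate that counts distinct long values using a mutable HashSet as its buffer object, following the work flow above. The method signatures mirror the description here (update and merge returning Unit); newer Spark versions instead return the buffer object, and other details may differ, so treat this as illustrative.

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

    import scala.collection.mutable

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.Expression
    import org.apache.spark.sql.catalyst.expressions.aggregate.{ImperativeAggregate, TypedImperativeAggregate}
    import org.apache.spark.sql.types.{DataType, LongType}

    case class ExactDistinctCount(
        child: Expression,
        mutableAggBufferOffset: Int = 0,
        inputAggBufferOffset: Int = 0)
      extends TypedImperativeAggregate[mutable.HashSet[Long]] {

      override def children: Seq[Expression] = child :: Nil
      override def nullable: Boolean = false
      override def dataType: DataType = LongType

      // Stage 1: the initial buffer object that is placed in the global buffer row.
      override def createAggregationBuffer(): mutable.HashSet[Long] = mutable.HashSet.empty[Long]

      // Stage 2 (Partial/Complete): fold one input row into the buffer object.
      override def update(buffer: mutable.HashSet[Long], input: InternalRow): Unit = {
        val value = child.eval(input)
        if (value != null) buffer += value.asInstanceOf[Long]
      }

      // Stage 2 (PartialMerge/Final): merge a deserialized buffer shuffled from another node.
      override def merge(buffer: mutable.HashSet[Long], other: mutable.HashSet[Long]): Unit = {
        buffer ++= other
      }

      // Stage 3 (Final/Complete): compute the final result from the buffer object.
      override def eval(buffer: mutable.HashSet[Long]): Any = buffer.size.toLong

      // Stage 3 (Partial/PartialMerge): the binary storage format used when shuffling buffers.
      override def serialize(buffer: mutable.HashSet[Long]): Array[Byte] = {
        val bytes = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(bytes)
        out.writeObject(buffer)
        out.close()
        bytes.toByteArray
      }

      override def deserialize(bytes: Array[Byte]): mutable.HashSet[Long] = {
        val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
        try in.readObject().asInstanceOf[mutable.HashSet[Long]] finally in.close()
      }

      override def withNewMutableAggBufferOffset(newOffset: Int): ImperativeAggregate =
        copy(mutableAggBufferOffset = newOffset)
      override def withNewInputAggBufferOffset(newOffset: Int): ImperativeAggregate =
        copy(inputAggBufferOffset = newOffset)
    }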
An AggregateFunction with Complete mode is used to evaluate this function directly from original input rows without any partial aggregation.
An AggregateFunction with Complete mode is used to evaluate this function directly from original input rows without any partial aggregation. This function updates the given aggregation buffer with the original input of this function. When it has processed all input rows, the final result of this function is returned.
An AggregateFunction with Final mode is used to merge aggregation buffers containing intermediate results for this function and then generate final result.
An AggregateFunction with Final mode is used to merge aggregation buffers containing intermediate results for this function and then generate final result. This function updates the given aggregation buffer by merging multiple aggregation buffers. When it has processed all input rows, the final result of this function is returned.
Constants used in the implementation of the HyperLogLogPlusPlus aggregate function.
Constants used in the implementation of the HyperLogLogPlusPlus aggregate function.
See the Appendix to HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm (https://docs.google.com/document/d/1gyjfMHy43U9OWBXxfaeG-3MjGzejW1dlpyMwEYAAWEI/view?fullscreen) for more information.
A placeholder expression used in code generation; it does not change the corresponding value in the row.
An AggregateFunction with Partial mode is used for partial aggregation.
An AggregateFunction with Partial mode is used for partial aggregation. This function updates the given aggregation buffer with the original input of this function. When it has processed all input rows, the aggregation buffer is returned.
An AggregateFunction with PartialMerge mode is used to merge aggregation buffers containing intermediate results for this function.
An AggregateFunction with PartialMerge mode is used to merge aggregation buffers containing intermediate results for this function. This function updates the given aggregation buffer by merging multiple aggregation buffers. When it has processed all input rows, the aggregation buffer is returned.
AggregateFunction is the superclass of two aggregation function interfaces:
  - ImperativeAggregate, an API for aggregation functions that are expressed in terms of imperative initialize(), update(), and merge() functions which operate on Row-based aggregation buffers;
  - DeclarativeAggregate, an API for aggregation functions that are expressed in terms of Catalyst expressions.
In both interfaces, aggregates must define the schema (aggBufferSchema) and attributes (aggBufferAttributes) of an aggregation buffer which is used to hold partial aggregate results. At runtime, multiple aggregate functions are evaluated by the same operator using a combined aggregation buffer which concatenates the aggregation buffers of the individual aggregate functions.
Code which accepts AggregateFunction instances should be prepared to handle both types of aggregate functions.