A helper trait used to create specialized setters and getters for the types supported by org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap's buffer (see UnsafeFixedWidthAggregationMap.supportsAggregationBufferSchema).
Special plan to collect a top-level aggregation on the driver itself, avoiding an exchange for simple aggregates.
Hash-based aggregate operator that can also fall back to sorting when data exceeds memory size.
This is a helper class to generate an append-only row-based hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and falls back to the BytesToBytesMap if a given key isn't found). This is 'codegened' in HashAggregate to speed up aggregates with a key.
NOTE: the generated hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
This is a helper class to generate an append-only row-based hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and falls back to the BytesToBytesMap if a given key isn't found). This is 'codegened' in HashAggregate to speed up aggregates with a key.
We also have VectorizedHashMapGenerator, which generates an append-only vectorized hash map. We choose one of the two as the first-level, fast hash map during aggregation.
NOTE: This row-based hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
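The two-level "fast map first, fall back on miss" pattern described above can be sketched in plain Java. This is illustrative only, not Spark's generated code: ordinary HashMaps stand in for the codegened row-based map and for the BytesToBytesMap, and a tiny fixed capacity stands in for the fast map filling up.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A minimal sketch (not Spark's actual generated code) of the two-level lookup
 * pattern: a small, append-only fast map is tried first, and a fallback map
 * (standing in for BytesToBytesMap) is used when the fast map is full.
 */
public class TwoLevelAggMap {
    private static final int FAST_MAP_CAPACITY = 4; // tiny on purpose, for illustration
    private final Map<String, long[]> fastMap = new HashMap<>();
    private final Map<String, long[]> fallbackMap = new HashMap<>(); // stand-in for BytesToBytesMap

    /** Returns the aggregation buffer for `key`, appending a new one if absent. */
    public long[] findOrInsert(String key) {
        long[] buf = fastMap.get(key);
        if (buf != null) return buf;
        if (fastMap.size() < FAST_MAP_CAPACITY) {
            buf = new long[]{0L};  // single COUNT-style slot
            fastMap.put(key, buf); // append-only: entries are never removed
            return buf;
        }
        // Fast map is full: fall back to the slower, unbounded map.
        return fallbackMap.computeIfAbsent(key, k -> new long[]{0L});
    }

    public long count(String key) {
        long[] b = fastMap.containsKey(key) ? fastMap.get(key) : fallbackMap.get(key);
        return b == null ? 0L : b[0];
    }

    public static void main(String[] args) {
        TwoLevelAggMap m = new TwoLevelAggMap();
        String[] rows = {"a", "b", "a", "c", "d", "e", "a", "e"};
        for (String r : rows) m.findOrInsert(r)[0] += 1; // COUNT(*) per key
        System.out.println(m.count("a")); // 3
        System.out.println(m.count("e")); // 2 ("e" overflowed into the fallback map)
    }
}
```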
The internal wrapper used to hook a UserDefinedAggregateFunction udaf in the internal aggregation code path.
Hash-based aggregate operator that can also fall back to sorting when data exceeds memory size.
Parts of this class have been adapted from Spark's HashAggregateExec. That class is not extended because it enforces the limitations of HashAggregateExec.supportsAggregate in its constructor, while this implementation has no such restriction.
Sort-based aggregate operator.
An iterator used to evaluate AggregateFunction. It assumes the input rows have been sorted by the values of groupingExpressions.
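A minimal sketch of what sorted input buys a sort-based iterator: because each group's rows are contiguous, a single reused aggregation buffer suffices, with the buffer emitted and reset at each group boundary. This is illustrative plain Java, not the iterator's actual code; SUM stands in for an arbitrary aggregate function.

```java
import java.util.ArrayList;
import java.util.List;

/** Sort-based aggregation sketch (SUM per key), assuming the input rows are
 *  already sorted by the grouping key, as a sort-based iterator requires. */
public class SortBasedSum {
    /** rows: (key, value) pairs sorted by key; returns "key=sum" strings in key order. */
    public static List<String> aggregate(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        String currentKey = null;
        long buffer = 0L; // the single, reused aggregation buffer
        for (String[] row : rows) {
            String key = row[0];
            if (currentKey != null && !currentKey.equals(key)) {
                out.add(currentKey + "=" + buffer); // group boundary: emit and reset
                buffer = 0L;
            }
            currentKey = key;
            buffer += Long.parseLong(row[1]); // update the buffer with the input row
        }
        if (currentKey != null) out.add(currentKey + "=" + buffer); // emit the last group
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[]{"a", "1"}, new String[]{"a", "2"},
            new String[]{"b", "5"}, new String[]{"c", "3"}, new String[]{"c", "4"});
        System.out.println(aggregate(rows)); // [a=3, b=5, c=7]
    }
}
```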
An iterator used to evaluate aggregate functions. It operates on UnsafeRows.
This iterator first uses hash-based aggregation to process input rows. It uses a hash map to store groups and their corresponding aggregation buffers. If this map cannot allocate memory from the memory manager, it spills the map to disk and creates a new one. After all the input has been processed, it merges all the spills together using an external sorter and performs sort-based aggregation.
The process has the following steps:
The code of this class is organized as follows:
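The hash-first, spill-and-sort strategy above can be sketched as follows. This is illustrative Java, not the iterator's actual code: in-memory sorted maps stand in for on-disk spill files and the external sorter, and a fixed entry limit stands in for the memory-manager allocation failure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory sketch of hash aggregation with a sort-based fallback
 *  (computing COUNT per key). "Spills" are sorted in-memory maps rather
 *  than on-disk files; the map "exceeds memory" after MAX_ENTRIES keys. */
public class SpillableCountAgg {
    private static final int MAX_ENTRIES = 3; // stand-in for a memory-manager limit
    private final List<TreeMap<String, Long>> spills = new ArrayList<>();
    private Map<String, Long> hashMap = new HashMap<>();

    public void processRow(String key) {
        if (!hashMap.containsKey(key) && hashMap.size() >= MAX_ENTRIES) {
            spills.add(new TreeMap<>(hashMap)); // spill the map in sorted key order
            hashMap = new HashMap<>();          // and start a fresh one
        }
        hashMap.merge(key, 1L, Long::sum);
    }

    /** Merge the current map with all spills. A real implementation merges
     *  sorted runs with an external sorter; here TreeMap sorts for us. */
    public TreeMap<String, Long> result() {
        TreeMap<String, Long> merged = new TreeMap<>(hashMap);
        for (TreeMap<String, Long> spill : spills)
            spill.forEach((k, v) -> merged.merge(k, v, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        SpillableCountAgg agg = new SpillableCountAgg();
        for (String k : new String[]{"a", "b", "a", "c", "d", "a", "b", "e"})
            agg.processRow(k);
        System.out.println(agg.result()); // {a=3, b=2, c=1, d=1, e=1}
    }
}
```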
A helper class to hook Aggregator into the aggregation system.
This is a helper class to generate an append-only vectorized hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and falls back to the BytesToBytesMap if a given key isn't found). This is 'codegened' in HashAggregate to speed up aggregates with a key.
It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small maximum number of tries) and use an inexpensive hash function, which makes it really efficient for the majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust than the BytesToBytesMap (especially for a large number of keys or for certain distributions of keys) and requires us to fall back on the latter for correctness. We also use a secondary columnar batch that logically projects over the original columnar batch and is equivalent to the BytesToBytesMap's aggregate buffer.
NOTE: This vectorized hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
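The index-lookup scheme described above (power-of-2-sized array, inexpensive hash, linear probing with a small bound on tries, and a fallback signal on failure) can be sketched like this. All names, sizes, and the hash function are illustrative, not Spark's generated code.

```java
/** Sketch of a bounded linear-probing index. A failed probe returns -1,
 *  signalling the caller to fall back to a slower map (BytesToBytesMap
 *  in Spark). Keys here are longs; Spark's version is codegened. */
public class ProbingIndex {
    private static final int CAPACITY = 16; // power of 2, so we can mask instead of mod
    private static final int MAX_TRIES = 3; // small bound keeps worst-case lookups cheap
    private final long[] keys = new long[CAPACITY];
    private final boolean[] used = new boolean[CAPACITY];

    private static int hash(long key) {
        return (int) (key ^ (key >>> 32)); // inexpensive hash function
    }

    /** Returns the slot for `key` (inserting if needed), or -1 to request fallback. */
    public int findOrInsert(long key) {
        int idx = hash(key) & (CAPACITY - 1); // masking works because CAPACITY is a power of 2
        for (int tries = 0; tries < MAX_TRIES; tries++) {
            if (!used[idx]) {                 // empty slot: append the key here
                used[idx] = true;
                keys[idx] = key;
                return idx;
            }
            if (keys[idx] == key) return idx; // found the existing slot
            idx = (idx + 1) & (CAPACITY - 1); // linear probing: try the next slot
        }
        return -1; // too many collisions: caller must fall back to the slower map
    }

    public static void main(String[] args) {
        ProbingIndex index = new ProbingIndex();
        System.out.println(index.findOrInsert(42L) == index.findOrInsert(42L)); // true
        // Keys 0, 16, 32 all hash to slot 0; a fourth colliding key exhausts MAX_TRIES.
        index.findOrInsert(0L); index.findOrInsert(16L); index.findOrInsert(32L);
        System.out.println(index.findOrInsert(48L)); // -1: fall back
    }
}
```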
Utility functions used by the query planner to convert our plan to the new aggregation code path.
The base class of SortBasedAggregationIterator and TungstenAggregationIterator. It mainly contains two parts:
1. It initializes aggregate functions.
2. It creates two functions, processRow and generateOutput, based on the AggregateMode of its aggregate functions. processRow is the function used to handle an input row; generateOutput is used to generate the result.
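The two-function pattern can be sketched for AVG in plain, illustrative Java: the mode decides once, up front, how processRow and generateOutput behave, so no per-row branching on the mode is needed. The class and field names below are hypothetical stand-ins mirroring the description, not the actual Scala implementation.

```java
import java.util.function.BiConsumer;
import java.util.function.Function;

/** Sketch of mode-driven processRow / generateOutput for AVG.
 *  Buffers are (sum, count) pairs stored in a long[2]. */
public class ModeBasedAvg {
    enum Mode { PARTIAL, FINAL }

    final BiConsumer<long[], long[]> processRow;   // (buffer, input) -> updates buffer
    final Function<long[], String> generateOutput; // buffer -> result row

    ModeBasedAvg(Mode mode) {
        if (mode == Mode.PARTIAL) {
            // Partial: input is a raw value; output is the intermediate buffer.
            processRow = (buf, in) -> { buf[0] += in[0]; buf[1] += 1; };
            generateOutput = buf -> buf[0] + "," + buf[1];
        } else {
            // Final: input is another (sum, count) buffer; output is the average.
            processRow = (buf, in) -> { buf[0] += in[0]; buf[1] += in[1]; };
            generateOutput = buf -> String.valueOf((double) buf[0] / buf[1]);
        }
    }

    public static void main(String[] args) {
        ModeBasedAvg partial = new ModeBasedAvg(Mode.PARTIAL);
        long[] buf = new long[]{0, 0};
        for (long v : new long[]{1, 2, 3}) partial.processRow.accept(buf, new long[]{v});
        System.out.println(partial.generateOutput.apply(buf)); // 6,3

        ModeBasedAvg fin = new ModeBasedAvg(Mode.FINAL);
        long[] merged = new long[]{0, 0};
        fin.processRow.accept(merged, new long[]{6, 3});
        fin.processRow.accept(merged, new long[]{4, 1});
        System.out.println(fin.generateOutput.apply(merged)); // 2.5
    }
}
```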