Generates a method that returns true if the group-by keys exist at a given index in the associated org.apache.spark.sql.execution.vectorized.ColumnarBatch. For instance, if we have 2 long group-by keys, the generated function would be of the form:
private boolean equals(int idx, long agg_key, long agg_key1) {
  return batch.column(0).getLong(buckets[idx]) == agg_key &&
         batch.column(1).getLong(buckets[idx]) == agg_key1;
}
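The generator would presumably follow the same pattern for other supported fixed-width key types by swapping in the matching column accessor. As a hedged illustration (not taken from the source), one long key and one int key might produce something like:

private boolean equals(int idx, long agg_key, int agg_key1) {
  // Compare each stored key column at the row id recorded in buckets[idx].
  return batch.column(0).getLong(buckets[idx]) == agg_key &&
         batch.column(1).getInt(buckets[idx]) == agg_key1;
}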
Generates a method that returns a mutable org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row which keeps track of the aggregate value(s) for a given set of keys. If no such row exists, the generated method adds the corresponding row to the associated org.apache.spark.sql.execution.vectorized.ColumnarBatch. For instance, if we have 2 long group-by keys, the generated function would be of the form:
public org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row findOrInsert(
    long agg_key, long agg_key1) {
  long h = hash(agg_key, agg_key1);
  int step = 0;
  int idx = (int) h & (numBuckets - 1);
  while (step < maxSteps) {
    // Return bucket index if it's either an empty slot or already contains the key
    if (buckets[idx] == -1) {
      batch.column(0).putLong(numRows, agg_key);
      batch.column(1).putLong(numRows, agg_key1);
      batch.column(2).putLong(numRows, 0);
      buckets[idx] = numRows++;
      return batch.getRow(buckets[idx]);
    } else if (equals(idx, agg_key, agg_key1)) {
      return batch.getRow(buckets[idx]);
    }
    idx = (idx + 1) & (numBuckets - 1);
    step++;
  }
  // Didn't find it
  return null;
}
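For context, a hedged sketch of how the surrounding generated aggregation code might consume this method is shown below. The names fastRow and value, the choice of column 2 as the aggregate buffer ordinal, and the use of InternalRow-style getLong/setLong on the returned mutable row are illustrative assumptions, not taken from the source:

// Illustrative only: per input row, look the keys up in the fast map and update
// the aggregate buffer in place; fall back to the BytesToBytesMap path on failure.
org.apache.spark.sql.execution.vectorized.ColumnarBatch.Row fastRow =
    findOrInsert(agg_key, agg_key1);
if (fastRow != null) {
  // Column 2 was initialized to 0 in findOrInsert and holds the running aggregate.
  fastRow.setLong(2, fastRow.getLong(2) + value);
} else {
  // Probing exceeded maxSteps: route this row through the regular,
  // BytesToBytesMap-backed aggregation path instead.
}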
Generates a method that currently computes a hash by XOR-ing all the individual group-by keys. For instance, if we have 2 long group-by keys, the generated function would be of the form:
private long hash(long agg_key, long agg_key1) {
  return agg_key ^ agg_key1;
}
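Because XOR is commutative and self-inverse, this hash is very cheap but also weak. The short, self-contained snippet below (illustrative only, not part of the generated code) demonstrates two easy collision patterns, which is one reason the class-level description below notes that the map is less robust than the BytesToBytesMap:

public class XorHashCollisions {
  private static long hash(long agg_key, long agg_key1) {
    return agg_key ^ agg_key1;
  }

  public static void main(String[] args) {
    // Swapped keys collide because XOR is commutative: both print 3.
    System.out.println(hash(1L, 2L));  // 3
    System.out.println(hash(2L, 1L));  // 3
    // Identical keys always hash to 0 because XOR is self-inverse.
    System.out.println(hash(7L, 7L));  // 0
  }
}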
This is a helper class to generate an append-only vectorized hash map that can act as a 'cache' for extremely fast key-value lookups while evaluating aggregates (and fall back to the BytesToBytesMap if a given key isn't found). This is 'codegened' in HashAggregate to speed up aggregates with keys. It is backed by a power-of-2-sized array for index lookups and a columnar batch that stores the key-value pairs. The index lookups in the array rely on linear probing (with a small maximum number of tries) and use an inexpensive hash function, which makes it very efficient for the majority of lookups. However, using linear probing and an inexpensive hash function also makes it less robust than the BytesToBytesMap (especially for a large number of keys or for certain key distributions) and requires us to fall back on the latter for correctness. We also use a secondary columnar batch that logically projects over the original columnar batch and is equivalent to the BytesToBytesMap aggregate buffer.

NOTE: This vectorized hash map currently doesn't support nullable keys and falls back to the BytesToBytesMap to store them.
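To make the layout concrete, below is a minimal, self-contained sketch of the same idea, not taken from the Spark source: an append-only map whose power-of-2 buckets array maps hash slots to row ids, with linear probing capped at maxSteps and the 'columns' simplified to plain long arrays instead of a ColumnarBatch. All class, field, and method names are illustrative, and a -1 result stands in for the fall-back to the BytesToBytesMap.

import java.util.Arrays;

// Illustrative, simplified model of the generated vectorized hash map:
// an append-only, linear-probing map over "columns" (plain arrays here),
// with a power-of-2 buckets array mapping hash slots to row ids.
public class AppendOnlyLinearProbingMap {
  private final int numBuckets = 1 << 16;               // must be a power of 2
  private final int maxSteps = 2;                       // cap on linear-probe attempts
  private final int capacity = 1 << 16;                 // max number of distinct keys
  private final int[] buckets = new int[numBuckets];
  private final long[] keyColumn = new long[capacity];  // "column 0": group-by key
  private final long[] sumColumn = new long[capacity];  // "column 1": aggregate buffer
  private int numRows = 0;

  public AppendOnlyLinearProbingMap() {
    Arrays.fill(buckets, -1);                            // -1 marks an empty slot
  }

  private long hash(long key) {
    return key;                                          // cheap hash, mirroring the XOR idea
  }

  // Returns the row id for the key, appending a new row if needed,
  // or -1 if probing fails (the caller would fall back to a slower map).
  public int findOrInsert(long key) {
    int idx = (int) hash(key) & (numBuckets - 1);
    for (int step = 0; step < maxSteps; step++) {
      if (buckets[idx] == -1) {                          // empty slot: append a new row
        if (numRows == capacity) return -1;
        keyColumn[numRows] = key;
        sumColumn[numRows] = 0;
        buckets[idx] = numRows++;
        return buckets[idx];
      } else if (keyColumn[buckets[idx]] == key) {
        return buckets[idx];                             // existing row for this key
      }
      idx = (idx + 1) & (numBuckets - 1);                // linear probe to the next slot
    }
    return -1;                                           // give up; caller must fall back
  }

  // Example aggregate update: sum(value) grouped by key.
  public boolean addToSum(long key, long value) {
    int row = findOrInsert(key);
    if (row == -1) return false;                         // signal: use the fallback map
    sumColumn[row] += value;
    return true;
  }
}

In the actual generated code the columns are ColumnVectors inside a ColumnarBatch, and a failed lookup routes the row through the regular BytesToBytesMap-backed path rather than returning an error code.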