Store the type of each column once to avoid checking it for every row at runtime.
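As an illustration of the idea, here is a minimal sketch that resolves a converter for each column once and reuses it for every row. The type tags, `CONVERTERS`, and `make_row_converter` are hypothetical names made up for this example; they are not taken from the code this comment documents.

```python
from typing import Any, Callable, List, Sequence

# Hypothetical type tags; the real column type representation is an assumption.
CONVERTERS = {"int": int, "double": float, "string": str}


def make_row_converter(
    column_types: Sequence[str],
) -> Callable[[Sequence[Any]], List[Any]]:
    # Resolve the converter for each column once, up front.
    converters = [CONVERTERS[t] for t in column_types]

    def convert(row: Sequence[Any]) -> List[Any]:
        # Per-row work is a positional call; no type lookup happens here.
        return [conv(value) for conv, value in zip(converters, row)]

    return convert


convert = make_row_converter(["int", "double", "string"])
rows = [convert(r) for r in [("1", "2.5", "a"), ("2", "3.0", "b")]]
```

The type lookup cost is paid once per column rather than once per row per column.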
Store pending values to be flushed in a separate buffer so that we do not end up creating ColumnBatches that are too small.
Note that this mini-cache is copy-on-write (to avoid copy-on-read for readers), so the buffer inside must never be modified in place; instead, the whole buffer is replaced when required. This replacement should happen only inside flushCache.
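The copy-on-write rule can be sketched as follows. This is an illustrative model only, assuming an immutable tuple as the buffer and a hypothetical `PendingCache` class with a `flush_cache` method standing in for flushCache; the minimum batch size and the `batches` list are also assumptions made for the example.

```python
import threading
from typing import Any, Iterable, List, Tuple


class PendingCache:
    """Hypothetical copy-on-write holder for rows awaiting a flush."""

    def __init__(self, min_batch_size: int) -> None:
        self._min_batch_size = min_batch_size
        self._pending: Tuple[Any, ...] = ()  # never mutated, only replaced
        self._lock = threading.Lock()  # serializes writers; readers take no lock
        self.batches: List[Tuple[Any, ...]] = []

    def snapshot(self) -> Tuple[Any, ...]:
        # Readers get the current buffer without copying it, which is safe
        # because no writer ever changes a buffer in place.
        return self._pending

    def flush_cache(self, rows: Iterable[Any], force: bool = False) -> bool:
        # The only place where the buffer reference is replaced.
        with self._lock:
            combined = self._pending + tuple(rows)  # builds a new tuple
            if combined and (force or len(combined) >= self._min_batch_size):
                self.batches.append(combined)  # large enough to emit
                self._pending = ()
                return True
            self._pending = combined  # too small: keep as pending
            return False


cache = PendingCache(min_batch_size=4)
first = cache.flush_cache([1, 2])      # too small, stays pending
held = cache.snapshot()                # a reader holds the old buffer
second = cache.flush_cache([3, 4, 5])  # combined size reaches the minimum
```

A reader that obtained `held` before the second flush still sees the same rows afterwards, because the flush installed a new buffer rather than changing the old one.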
Random number generator for sampling.
Map of each stratum key (i.e. a unique combination of values of the columns in qcs) to its related metadata and reservoir.
A stratified sampling implementation that takes a fraction and an initial cache size. The latter is used as the initial per-stratum reservoir size for reservoir sampling. The sampler primarily tries to satisfy the fraction of the total data, repeatedly filling up the cache as required (and expanding the cache size to allow a bigger reservoir in later rounds if required). While doing so, it tries to ensure that the selected rows are divided equally among the current strata (that is, among those that received any rows).
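The description above can be illustrated with a simplified sketch: one round of per-stratum reservoir sampling (the classic "Algorithm R") with the same capacity for every stratum, plus a calculation of the per-stratum capacity a later round would need to meet the fraction. This is not the documented implementation. The class and method names (`StratifiedSampler`, `add`, `next_capacity`) are hypothetical, and how the real sampler carries reservoirs over between rounds is not modeled here.

```python
import math
import random
from typing import Any, Dict, Hashable, List


class Stratum:
    """Per-stratum metadata (rows seen) and the reservoir itself."""

    def __init__(self) -> None:
        self.seen = 0
        self.reservoir: List[Any] = []


class StratifiedSampler:
    def __init__(self, fraction: float, initial_cache_size: int, seed: int = 0) -> None:
        self.fraction = fraction
        self.capacity = initial_cache_size  # reservoir size for every stratum
        self.rng = random.Random(seed)      # random number generator for sampling
        self.strata: Dict[Hashable, Stratum] = {}
        self.total_seen = 0

    def add(self, key: Hashable, row: Any) -> None:
        stratum = self.strata.setdefault(key, Stratum())
        stratum.seen += 1
        self.total_seen += 1
        if len(stratum.reservoir) < self.capacity:
            stratum.reservoir.append(row)  # reservoir not yet full
        else:
            # Algorithm R: keep the new row with probability capacity / seen.
            j = self.rng.randrange(stratum.seen)
            if j < self.capacity:
                stratum.reservoir[j] = row

    def next_capacity(self) -> int:
        # Per-stratum size needed so that equally sized reservoirs together
        # hold at least fraction * total rows; never shrinks.
        target = math.ceil(self.fraction * self.total_seen)
        per_stratum = math.ceil(target / max(1, len(self.strata)))
        return max(self.capacity, per_stratum)


sampler = StratifiedSampler(fraction=0.1, initial_cache_size=5, seed=42)
for i in range(100):
    sampler.add("a", ("a", i))
for i in range(3):
    sampler.add("b", ("b", i))
sample = {key: list(s.reservoir) for key, s in sampler.strata.items()}
```

In this run, stratum "a" is capped at the initial cache size while the small stratum "b" keeps all of its rows. Because the two reservoirs together hold fewer rows than the requested fraction of the 103 rows seen, `next_capacity()` suggests a larger per-stratum reservoir for the following round.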