Store type of column once to avoid checking for every row at runtime
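As a rough illustration of this idea, the sketch below resolves a per-column accessor once and reuses it for every row, so the loop itself does no type checks. The `ColumnType` enum, `accessorFor` and `sum` are hypothetical names chosen for this example and do not come from the source.

```java
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: names and the set of supported types are assumptions.
final class ColumnAccess {
    enum ColumnType { INT, LONG, DOUBLE }

    private ColumnAccess() {}

    /** Resolves the conversion for a column type; called once per column, not per row. */
    static ToDoubleFunction<Object> accessorFor(ColumnType type) {
        switch (type) {
            case INT:
                return v -> ((Integer) v).doubleValue();
            case LONG:
                return v -> ((Long) v).doubleValue();
            case DOUBLE:
                return v -> ((Double) v).doubleValue();
            default:
                throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    /** Sums one column; the type is inspected before the loop, never inside it. */
    static double sum(Object[][] rows, int column, ColumnType type) {
        ToDoubleFunction<Object> accessor = accessorFor(type);
        double total = 0.0;
        for (Object[] row : rows) {
            total += accessor.applyAsDouble(row[column]);
        }
        return total;
    }
}
```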
Check whether the cache needs to be flushed. This should be invoked whenever memory consumption may have increased significantly.
Returns java.lang.Boolean.TRUE if the cache needs to be flushed and fully reset, java.lang.Boolean.FALSE if the cache needs to be flushed without a full reset, and null if the cache does not need to be flushed.
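A minimal sketch of how such a three-valued result might be produced and consumed. The source does not say what triggers each outcome, so the byte thresholds and the names `checkFlush`, `softLimit`, `hardLimit` and `describe` are assumptions made for illustration only.

```java
// Hypothetical sketch: the thresholds and names below are illustrative assumptions.
final class FlushCheck {
    private FlushCheck() {}

    /**
     * @return Boolean.TRUE to flush and fully reset, Boolean.FALSE to flush
     *         without a full reset, or null when no flush is needed
     */
    static Boolean checkFlush(long usedBytes, long softLimit, long hardLimit) {
        if (usedBytes >= hardLimit) {
            return Boolean.TRUE;   // flush and fully reset
        } else if (usedBytes >= softLimit) {
            return Boolean.FALSE;  // flush, but keep the remaining state
        }
        return null;               // no flush required
    }

    /** Shows how a caller has to test for null before unboxing the result. */
    static String describe(Boolean result) {
        if (result == null) {
            return "no flush";
        }
        return result ? "flush and reset" : "flush only";
    }
}
```

Callers need the explicit null check first, because unboxing a null `Boolean` in a plain `if (result)` would throw a `NullPointerException`.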
Store pending values to be flushed in a separate buffer so that we do not end up creating too small ColumnBatches.
Note that this mini-cache is copy-on-write (to avoid copy-on-read for readers), so the buffer inside should never be modified in place; instead, the whole buffer should be replaced if required. This should happen only inside flushCache.
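The copy-on-write rule can be sketched as follows. This is an assumed, simplified shape and not the actual implementation: `PendingValues`, `read` and `replaceWith` are hypothetical names, and `replaceWith` stands in for the replacement step that the note says belongs inside flushCache.

```java
import java.util.Arrays;

// Hypothetical sketch of a copy-on-write pending buffer; names are assumptions.
final class PendingValues {
    // Published arrays are never mutated, so readers can use them without copying.
    private volatile Object[] buffer = new Object[0];

    /** Readers get the current array as-is; it will never change underneath them. */
    Object[] read() {
        return buffer;
    }

    /** Intended to be called only from the flush path: build a new array, then swap it in. */
    synchronized void replaceWith(Object[] additional) {
        Object[] current = buffer;
        Object[] updated = Arrays.copyOf(current, current.length + additional.length);
        System.arraycopy(additional, 0, updated, current.length, additional.length);
        buffer = updated; // single volatile write publishes the new buffer
    }
}
```

A reader that obtained an array before the swap keeps seeing a consistent, unchanged snapshot, which is the point of paying the copy cost on the writer side.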
Random number generator for sampling.
Map of each stratum key (i.e. a unique combination of values of the columns in qcs) to its related metadata and reservoir.
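To make the shape of such a map concrete, here is a small sketch that keys each stratum by the values of the qcs columns and keeps a fixed-capacity reservoir per key. It uses the textbook reservoir sampling scheme (Algorithm R) as a stand-in; the class, its fields and the choice of a `List<Object>` key are assumptions, not the source's data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: per-stratum reservoirs keyed by the qcs column values.
final class StratumReservoirs {
    /** Metadata and reservoir kept for one stratum. */
    static final class Stratum {
        long seen;                                        // rows observed for this stratum
        final List<Object[]> reservoir = new ArrayList<>();
    }

    private final Map<List<Object>, Stratum> strata = new HashMap<>();
    private final int[] qcsIndexes;
    private final int capacity;
    private final Random rng;

    StratumReservoirs(int[] qcsIndexes, int capacity, long seed) {
        this.qcsIndexes = qcsIndexes.clone();
        this.capacity = capacity;
        this.rng = new Random(seed);
    }

    void add(Object[] row) {
        // The stratum key is the combination of values in the qcs columns.
        List<Object> key = new ArrayList<>(qcsIndexes.length);
        for (int i : qcsIndexes) {
            key.add(row[i]);
        }
        Stratum s = strata.computeIfAbsent(key, k -> new Stratum());
        s.seen++;
        if (s.reservoir.size() < capacity) {
            s.reservoir.add(row);                         // still filling the reservoir
        } else {
            // Algorithm R: pick a slot in [0, seen) and replace only if it falls in the reservoir.
            long slot = (long) (rng.nextDouble() * s.seen);
            if (slot < capacity) {
                s.reservoir.set((int) slot, row);
            }
        }
    }

    Map<List<Object>, Stratum> strata() {
        return strata;
    }
}
```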
Not used by this implementation, so return the initial size.
A stratified sampling implementation that uses an error limit with a confidence level on a numerical column to sample as much as required to maintain the expected error within the limit. An optional initial cache size can be specified, which is used as the initial reservoir size per stratum for reservoir sampling. The implementation attempts to honour the error limit for each stratum independently, and the sampling rate is increased or decreased accordingly. It uses a standard closed-form estimate of the sampling error, increasing or decreasing the sampling as required (and expanding the cache size for a bigger reservoir in subsequent rounds if required).
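The description does not state which closed-form estimate is used. As one plausible reading, the sketch below assumes the quantity of interest is the mean of the numerical column and applies the usual normal-approximation formula with a finite population correction: relative error = z * sqrt(s^2 / n * (1 - n / N)) / |mean|. The class name, method names and the "grow when over the limit" rule are assumptions for illustration.

```java
// Hypothetical sketch: assumes a mean estimate with a normal approximation.
final class ErrorLimit {
    private ErrorLimit() {}

    /**
     * Relative error of the sample mean for one stratum.
     *
     * @param sample         sampled values of the numerical column
     * @param populationSize total rows seen in the stratum (N)
     * @param z              z-score for the desired confidence, e.g. 1.96 for about 95%
     */
    static double relativeError(double[] sample, long populationSize, double z) {
        int n = sample.length;
        if (n < 2) {
            return Double.POSITIVE_INFINITY;      // variance cannot be estimated
        }
        double mean = 0.0;
        for (double v : sample) {
            mean += v;
        }
        mean /= n;
        double sumSquares = 0.0;
        for (double v : sample) {
            sumSquares += (v - mean) * (v - mean);
        }
        double variance = sumSquares / (n - 1);   // unbiased sample variance
        // Finite population correction: the error vanishes once the whole stratum is sampled.
        double fpc = Math.max(0.0, 1.0 - (double) n / populationSize);
        double stdErr = Math.sqrt(variance / n * fpc);
        if (mean == 0.0) {
            return stdErr == 0.0 ? 0.0 : Double.POSITIVE_INFINITY;
        }
        return z * stdErr / Math.abs(mean);
    }

    /** True when the stratum's estimated error exceeds the limit, so more sampling is needed. */
    static boolean shouldGrow(double[] sample, long populationSize, double z, double errorLimit) {
        return relativeError(sample, populationSize, z) > errorLimit;
    }
}
```

Under this assumption, a stratum with widely spread values reports a large relative error and would have its reservoir expanded in the next round, while a stratum with near-constant values stays within the limit with a small sample.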