org.apache.spark.sql.execution.columnar.encoding
Finish encoding the current column and return the data as a ByteBuffer. The encoder can be reused for new column data of the same type again.
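For example (a minimal sketch; a finish(cursor: Long) signature returning the ByteBuffer is assumed from the description above, not confirmed):

    // Sketch only: encoder/cursor placeholders stand in for real state.
    val encoder: ColumnEncoder = ???  // an encoder holding written data
    val cursor: Long = ???            // cursor position after the last write
    val encoded: java.nio.ByteBuffer = encoder.finish(cursor)
    // the same encoder instance may now be initialized again for a new
    // column of the same type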
Temporary offset results are meant to be read by generated code immediately after initializeComplexType, so this is not an issue for nested types.
Close and relinquish all resources of this encoder. The encoder may no longer be usable after this call.
The final size of the encoded column (excluding header and nulls). This should match the size occupied after finish, but is computed without writing anything.
Expand the underlying bytes if required and return the new cursor.
Flush any pending data when finish is not going to be invoked explicitly.
Initialize this ColumnEncoder.
the DataType of the field to be written
true if the field is nullable, false otherwise
the initial estimated number of elements to be written
true if a header (typeId etc.) is to be written to the data
the BufferAllocator to use for the data
the minimum size of the initial buffer to use (ignored if <= 0)
Returns the initial position of the cursor that the caller must use to write.
Initialize this ColumnEncoder (a usage sketch follows the parameters below).
the DataType of the field to be written
true if the field is nullable, false otherwise
the initial estimated number of elements to be written
true if a header (typeId etc.) is to be written to the data
the BufferAllocator to use for the data
Returns the initial position of the cursor that the caller must use to write.
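As a rough illustration of how these overloads are meant to be used (a hedged sketch: the parameter order follows the descriptions above, but the exact method names, signatures and the writeLong call are assumptions, not the library's confirmed API):

    // Sketch only: ColumnEncoder and BufferAllocator come from this
    // package; signatures are inferred from the parameter list above.
    import org.apache.spark.sql.types.LongType

    val allocator: BufferAllocator = ???  // allocator for the data
    val encoder: ColumnEncoder = ???      // encoder matching the column type
    var cursor: Long = encoder.initialize(
      LongType,   // DataType of the field to be written
      true,       // the field is nullable
      1024,       // initial estimated number of elements
      true,       // write the header (typeId etc.)
      allocator)  // BufferAllocator to use for the data
    cursor = encoder.writeLong(cursor, 1L)  // hypothetical per-value write
    val data: java.nio.ByteBuffer = encoder.finish(cursor)  // ready for storage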
Complex types are written in a layout similar to UnsafeRows, except that the format is always little-endian regardless of platform endianness, which makes it appropriate for storage. There are also other minor differences related to size writing and interval type handling. The general layout looks like below:
.--------------------------- Optional total size including itself (4 bytes)
|   .----------------------- Optional number of elements (4 bytes)
|   |   .------------------- Null bitset longs (8 x (N / 8) bytes)
|   |   |
|   |   |     .------------- Offsets+Sizes of elements (8 x N bytes)
|   |   |     |     .------- Variable length elements
V   V   V     V     V
+---+---+-----+-------------+
|   |   | ... | ...  ... ...|
+---+---+-----+-------------+
\-----/ \-----------------/
 header        body
The above generic layout is used for ARRAY and STRUCT types.
The total size of the data is written for top-level complex types. Nested complex objects write their sizes in the "Offsets+Sizes" portion of the respective parent object.
ARRAY types also write the number of elements in the array in the header, while STRUCT types skip it since it is fixed in the metadata.
The null bitset follows the header. To keep reads aligned at 8-byte boundaries while preserving space, the implementation combines the header and the null bitset portion and pads them together to an 8-byte boundary (in particular, it treats the header as some additional empty fields in the null bitset itself).
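For instance, the combined header-plus-bitset sizing rule can be expressed as below (a self-contained sketch of the rule just described; the helper name is made up):

    // The header bytes count as extra leading slots of the null bitset and
    // the combined prefix is rounded up to whole 8-byte longs.
    def headerPlusNullBitsetBytes(headerBytes: Int, numFields: Int): Int = {
      val totalBits = headerBytes * 8 + numFields
      ((totalBits + 63) / 64) * 8  // round up to a multiple of 64 bits
    }
    // e.g. an ARRAY header (4 + 4 bytes) with 100 elements:
    // headerPlusNullBitsetBytes(8, 100) == 24, i.e. three longs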
After this follows the "Offsets+Sizes" portion, which keeps the offset and size for variable length elements. Fixed length elements that fit in 8 bytes are written directly in the offset+size slot. Variable length elements have their offsets (from the start of this array) and sizes encoded in this portion as a single long (4 bytes each for offset and size). Fixed width elements larger than 8 bytes are encoded like variable length elements. CalendarInterval is currently the only such type; its "months" portion is encoded into the size while the "microseconds" portion is written into the variable length part.
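The offset+size long for one element can be pictured as below (a sketch; whether the offset sits in the upper or lower 32 bits is an assumption here, mirroring the UnsafeRow convention rather than this class's confirmed layout):

    // Sketch of packing a 4-byte offset and 4-byte size into one long.
    def packOffsetAndSize(offset: Int, size: Int): Long =
      (offset.toLong << 32) | (size & 0xffffffffL)

    def offsetOf(packed: Long): Int = (packed >>> 32).toInt
    def sizeOf(packed: Long): Int = packed.toInt
    // For CalendarInterval, the size slot would carry the months while the
    // microseconds go into the variable length part (per the text above).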
MAP types are written as an ARRAY of keys followed by an ARRAY of values, as in Spark. To keep things simpler, both ARRAYs always have the optional size header at their respective starts, which together determine the total size of the encoded MAP object. For nested MAP types, the total size is skipped from the "Offsets+Sizes" portion and only the offset is written (which is the start of the key ARRAY).
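Since both ARRAYs start with their total size (inclusive of itself), the value ARRAY of an encoded MAP can be located from the key ARRAY's header; a minimal sketch, assuming a little-endian ByteBuffer view of the encoded bytes:

    // Sketch: the key ARRAY's total size (written at its start, including
    // itself) is also the relative offset of the value ARRAY.
    import java.nio.{ByteBuffer, ByteOrder}

    def valueArrayOffset(data: ByteBuffer, keyArrayOffset: Int): Int = {
      val buf = data.duplicate().order(ByteOrder.LITTLE_ENDIAN)
      keyArrayOffset + buf.getInt(keyArrayOffset)
    }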
Get the allocator for the final data to be sent for storage. It is on-heap for now in embedded mode and off-heap for connector mode, to minimize copying in both cases. This should be changed to use the allocator matching the storage being used by the column store in embedded mode.
Write any internal structures (e.g. dictionary) of the encoder that would normally be written by finish after the header and null bit mask.