org.apache.spark.sql.execution.streaming
Filter out the obsolete logs.
If we delete the old files immediately after compaction, there is a race condition in S3: other
processes may see that the old files have been deleted but still cannot see the compaction file
using "list". allFiles handles this by looking for the next compaction file directly. However,
a live lock may happen if compaction happens too frequently: one process keeps deleting
old files while another keeps retrying. Setting a reasonable cleanup delay can avoid it.
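The cleanup delay described above can be sketched as a simple age check. This is a minimal illustration, not Spark's implementation: `LogFile`, `expiredFiles`, and all parameter names are hypothetical, and the file listing is modeled as an in-memory sequence.

```scala
object CleanupDelaySketch {
  // Hypothetical stand-in for a file listing entry; not a Spark or Hadoop type.
  final case class LogFile(batchId: Long, modificationTimeMs: Long)

  /** Files that are obsolete (older than the latest compaction batch)
    * and have also outlived the cleanup delay. */
  def expiredFiles(
      files: Seq[LogFile],
      latestCompactionBatchId: Long,
      cleanupDelayMs: Long,
      nowMs: Long): Seq[LogFile] =
    files.filter { f =>
      f.batchId < latestCompactionBatchId &&
        nowMs - f.modificationTimeMs > cleanupDelayMs
    }
}
```

Under these assumptions, an obsolete file stays visible for at least `cleanupDelayMs` after it was written, which gives readers that listed it earlier time to finish before it disappears.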
Store the metadata for the specified batchId and return true
if successful. If the batchId's metadata has already been stored,
this method will return false.
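The add-once contract above can be illustrated with an in-memory stand-in. This is only a sketch: `InMemoryLog` is a hypothetical class, and it keeps entries in a map rather than writing batch files.

```scala
import scala.collection.mutable

object AddOnceSketch {
  // Hypothetical in-memory log; the documented log persists to a file system.
  final class InMemoryLog[T] {
    private val entries = mutable.Map.empty[Long, T]

    /** Stores metadata for batchId; returns false if it was already stored. */
    def add(batchId: Long, metadata: T): Boolean =
      if (entries.contains(batchId)) false
      else { entries.put(batchId, metadata); true }

    /** Returns the metadata for batchId if it is stored, otherwise None. */
    def get(batchId: Long): Option[T] = entries.get(batchId)
  }
}
```

A second `add` for the same batchId leaves the first value in place, so callers can use the Boolean result to detect that another writer got there first.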
Returns all files except the deleted ones.
A PathFilter to filter only batch files
Return metadata for batches between startId (inclusive) and endId (inclusive). If startId
is None, just return all batches before endId (inclusive).
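The inclusive bounds and the optional lower bound can be sketched as follows. `getRange` and its signature are assumptions made for illustration, with the log modeled as an immutable map from batch id to metadata.

```scala
object RangeSketch {
  /** Entries with startId <= id <= endId, ordered by id.
    * A startId of None means there is no lower bound. */
  def getRange[T](
      entries: Map[Long, T],
      startId: Option[Long],
      endId: Long): Seq[(Long, T)] =
    entries.toSeq
      .filter { case (id, _) => startId.forall(id >= _) && id <= endId }
      .sortBy(_._1)
}
```

`Option.forall` returns true for `None`, which is what makes the missing lower bound fall through to "all batches up to endId".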
Return the metadata for the specified batchId if it's stored. Otherwise, return None.
Returns the deserialized metadata in a batch file, or None if the file does not exist.
Throws IllegalArgumentException when path does not point to a batch file.
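One way to decide whether a path points to a batch file is to parse the batch id from its file name. The sketch below assumes that batch files are named by their numeric id, optionally followed by a `.compact` suffix; `batchIdOf` and `isBatchFile` are hypothetical helpers operating on plain file-name strings rather than on Hadoop paths.

```scala
object BatchFileNameSketch {
  /** Parses a batch id from a file name such as "12" or "19.compact".
    * Throws IllegalArgumentException for any other name. */
  def batchIdOf(fileName: String): Long = {
    val base = fileName.stripSuffix(".compact")
    if (base.nonEmpty && base.forall(_.isDigit)) base.toLong
    else throw new IllegalArgumentException(s"'$fileName' is not a batch file")
  }

  /** Filter predicate: true only for names that parse as batch files. */
  def isBatchFile(fileName: String): Boolean =
    try { batchIdOf(fileName); true }
    catch { case _: IllegalArgumentException => false }
}
```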
Return the latest batch id and its metadata if they exist.
Get an array of [FileStatus] referencing batch files. The array is sorted from the most recent batch file to the oldest.
Removes all log entries earlier than thresholdBatchId (exclusive).
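The exclusive threshold means the entry at thresholdBatchId itself survives. A minimal sketch, with the log again modeled as an immutable map and `purge` as a hypothetical pure function that returns the remaining entries:

```scala
object PurgeSketch {
  /** Drops every entry with id < thresholdBatchId; the threshold entry is kept. */
  def purge[T](entries: Map[Long, T], thresholdBatchId: Long): Map[Long, T] =
    entries.filter { case (id, _) => id >= thresholdBatchId }
}
```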
An abstract class for compactible metadata logs. It will write one log file for each batch. The first line of the log file is the version number, and there are multiple serialized metadata lines following.
Reading from many small files is usually slow, and too many small files in one folder put a strain on the file system, so CompactibleFileStreamLog compacts log files into one big file every 10 batches by default. When doing a compaction, it reads all old log files and merges them with the new batch.
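The compaction schedule can be sketched with two small helpers. This is an illustration under stated assumptions: batch ids start at 0, a compaction file is written at the last batch of each interval, and each compaction file already contains everything from earlier batches. The names `isCompactionBatch` and `batchesBefore` are chosen for this sketch.

```scala
object CompactionSketch {
  /** True when batchId is the last batch of a compaction interval. */
  def isCompactionBatch(batchId: Long, compactInterval: Int): Boolean =
    (batchId + 1) % compactInterval == 0

  /** Ids of the log files to read and merge when compacting at compactionBatchId.
    * The previous compaction file (if any) is included because, by assumption,
    * it already holds all earlier batches. */
  def batchesBefore(compactionBatchId: Long, compactInterval: Int): Seq[Long] = {
    require(isCompactionBatch(compactionBatchId, compactInterval))
    math.max(0L, compactionBatchId - compactInterval) until compactionBatchId
  }
}
```

With an interval of 10, batches 9, 19, 29 and so on are compaction batches. Compacting at batch 19 reads files 9 through 18 and merges them with the new batch.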