org.apache.spark.sql.execution.streaming
Store the metadata for the specified batchId and return true if successful. If the batchId's metadata has already been stored, this method will return false.
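Because add returns false when the batch is already stored, the log can serve as a simple commit protocol: only the first writer for a given batchId succeeds. A minimal in-memory sketch of that contract (the class and names below are illustrative, not Spark's actual internals):

  import scala.collection.concurrent.TrieMap

  // Illustrative stand-in for the metadata log contract described above.
  class InMemoryMetadataLog[T] {
    private val batches = TrieMap.empty[Long, T]

    // Store metadata for batchId; return false if it was already stored.
    def add(batchId: Long, metadata: T): Boolean =
      batches.putIfAbsent(batchId, metadata).isEmpty
  }

  val log = new InMemoryMetadataLog[String]
  assert(log.add(0, "offsets-for-batch-0"))    // first write wins
  assert(!log.add(0, "conflicting-metadata"))  // duplicate commit is rejected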
Returns all files except the deleted ones.
A PathFilter to filter only batch files.
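For illustration, such a filter might accept only files whose name parses as a batch id (an assumption about the file layout; compacted logs may use a different naming scheme):

  import org.apache.hadoop.fs.{Path, PathFilter}

  // Accept only files named by a numeric batch id (e.g. "0", "17"), skipping
  // temporary files and anything else living in the metadata directory.
  val batchFilesFilter = new PathFilter {
    override def accept(path: Path): Boolean =
      scala.util.Try(path.getName.toLong).isSuccess
  }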
Filter out the obsolete logs.
If we delete the old files after compaction at once, there is a race condition in S3: other processes may see that the old files are deleted but still fail to see the compaction file via "list". The allFiles method handles this by looking for the next compaction file directly; however, a livelock may happen if compaction occurs too frequently: one process keeps deleting old files while another keeps retrying. Setting a reasonable cleanup delay avoids this.
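A sketch of the delayed-cleanup idea (the helper and its parameters are illustrative, not the exact Spark internals): instead of deleting obsolete files immediately after compaction, delete only those older than a configurable delay, so readers that have not yet discovered the compaction file can still list the old ones.

  import org.apache.hadoop.fs.{FileStatus, FileSystem}

  // Delete only obsolete log files whose modification time has passed the
  // cleanup delay; recently superseded files stay visible to slow readers.
  def deleteExpired(fs: FileSystem, obsolete: Seq[FileStatus], cleanupDelayMs: Long): Unit = {
    val expirationTime = System.currentTimeMillis() - cleanupDelayMs
    obsolete
      .filter(_.getModificationTime <= expirationTime)
      .foreach(f => fs.delete(f.getPath, false))
  }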
Return metadata for batches between startId (inclusive) and endId (inclusive). If startId is None, just return all batches before endId (inclusive).
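An in-memory sketch of the range semantics (illustrative; the actual signature may differ across versions):

  import scala.collection.immutable.SortedMap

  // Both bounds are inclusive; a missing startId means "from the beginning".
  def getRange(batches: SortedMap[Long, String],
               startId: Option[Long],
               endId: Long): Array[(Long, String)] =
    batches.range(startId.getOrElse(Long.MinValue), endId + 1).toArray

  val batches = SortedMap(0L -> "b0", 1L -> "b1", 2L -> "b2", 3L -> "b3")
  getRange(batches, Some(1L), 2L)  // Array((1, "b1"), (2, "b2"))
  getRange(batches, None, 1L)      // Array((0, "b0"), (1, "b1"))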
Return the metadata for the specified batchId if it's stored. Otherwise, return None.
The deserialized metadata in a batch file, or None if the file does not exist.
IllegalArgumentException when the path does not point to a batch file.
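The exception case can be pictured as a strict path-to-id conversion (names and mechanism illustrative, assuming batch files are named by their numeric id):

  import org.apache.hadoop.fs.Path

  // Derive a batch id from a batch file path, failing loudly on anything
  // that is not a batch file (mirroring the IllegalArgumentException above).
  def pathToBatchId(path: Path): Long =
    try path.getName.toLong
    catch {
      case _: NumberFormatException =>
        throw new IllegalArgumentException(s"File $path is not a batch file")
    }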
Return the latest batch id and its metadata if they exist.
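On restart, this is typically how a stream finds where to resume. An in-memory sketch (illustrative):

  import scala.collection.immutable.SortedMap

  // The latest batch is simply the largest stored batch id, if any.
  def getLatest[T](batches: SortedMap[Long, T]): Option[(Long, T)] =
    batches.lastOption

  getLatest(SortedMap(0L -> "b0", 2L -> "b2"))  // Some((2, "b2"))
  getLatest(SortedMap.empty[Long, String])      // None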
Get an array of [FileStatus] referencing batch files. The array is sorted from the most recent batch file to the oldest.
Removes all log entries earlier than thresholdBatchId (exclusive).
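A sketch of the exclusive-threshold semantics (illustrative, in-memory):

  import scala.collection.mutable

  // Drop every entry strictly below thresholdBatchId; the threshold batch
  // itself is kept ("exclusive" in the description above).
  def purge[T](batches: mutable.SortedMap[Long, T], thresholdBatchId: Long): Unit =
    batches.keys.toList.filter(_ < thresholdBatchId).foreach(batches.remove)

  val batches = mutable.SortedMap(0L -> "b0", 1L -> "b1", 2L -> "b2")
  purge(batches, 2L)  // only 2 -> "b2" remains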
A special log for FileStreamSink. It writes one log file for each batch. The first line of the log file is the version number, followed by one JSON line per entry; each JSON line is the JSON representation of a SinkFileStatus.
As reading from many small files is usually pretty slow, FileStreamSinkLog compacts log files every "spark.sql.sink.file.log.compactLen" batches into one big file. When doing a compaction, it reads all old log files and merges them with the new batch; during the compaction, it also drops the entries marked as deleted (via SinkFileStatus.action). When a reader uses allFiles to list all files, it gets back only the visible files (the deleted files are dropped).
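Put together, a batch log file as described would look roughly like the comment below, and compaction reduces to merging entries and dropping deleted paths. The trimmed SinkFileStatus and the "delete" literal are illustrative; the real class carries full file metadata:

  // A batch log file, roughly:
  //   v1
  //   {"path":"hdfs://out/part-00000","size":1024,...,"action":"add"}
  //   {"path":"hdfs://out/part-00001","size":2048,...,"action":"delete"}

  case class SinkFileStatus(path: String, action: String)

  // Merge all entries and drop every path marked as deleted; this is also
  // why allFiles afterwards sees only the visible files.
  def compactLogs(logs: Seq[SinkFileStatus]): Seq[SinkFileStatus] = {
    val deletedPaths = logs.filter(_.action == "delete").map(_.path).toSet
    logs.filterNot(s => deletedPaths.contains(s.path))
  }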