org.apache.spark.sql.execution.datasources
The main class responsible for representing a pluggable Data Source in Spark SQL. In addition to acting as the canonical set of parameters that can describe a Data Source, this class is used to resolve a description to a concrete implementation that can be used in a query plan (either batch or streaming) or to write out data using an external library.

From an end user's perspective, a DataSource description can be created explicitly using org.apache.spark.sql.DataFrameReader or CREATE TABLE USING DDL. Additionally, this class is used when resolving a description from a metastore to a concrete implementation.

Many of the arguments to this class are optional, though depending on the specific API being used these optional arguments might be filled in during resolution using either inference or external metadata. For example, when reading a partitioned table from a file system, partition columns will be inferred from the directory layout even if they are not specified. The arguments are described below; a usage sketch follows the list.

A list of file system paths that hold data. These will be globbed and qualified before use. This option only works when reading from a FileFormat.

An optional specification of the schema of the data. When present we skip attempting to infer the schema.

A list of column names that the relation is partitioned by. This list is generally empty during the read path, unless this DataSource is managed by Hive, in which case resolveRelation calls getOrInferFileFormatSchema for file-based DataSources to infer the partitioning. In other cases, if this list is empty, then the table is unpartitioned.

An optional specification for bucketing (hash-partitioning) of the data.

Optional catalog table reference that can be used to push down operations over the data source to the catalog service.
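As a usage sketch of the end-user entry points mentioned above, the snippet below builds a data source description through DataFrameReader, supplying an explicit schema (so inference is skipped) and a glob path (which is expanded and qualified), and then does the same through CREATE TABLE USING DDL. Paths, table name, and column names are illustrative assumptions, not taken from this page.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

  val spark = SparkSession.builder().appName("datasource-sketch").getOrCreate()

  // Explicit schema: with a user-specified schema the DataSource skips inference.
  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)))

  // DataFrameReader builds the DataSource description; the glob pattern is
  // expanded and qualified before any files are read (FileFormat sources only).
  val events = spark.read
    .format("csv")
    .schema(schema)
    .option("header", "true")
    .load("/data/events/2024-*/part-*.csv")   // hypothetical path

  // The same kind of description can be created with CREATE TABLE ... USING DDL.
  spark.sql(
    """CREATE TABLE events_tbl (id BIGINT, name STRING)
      |USING parquet
      |LOCATION '/warehouse/events'""".stripMargin)   // hypothetical location

The descriptions that follow cover the class's individual members: streaming sink and source creation, metadata-log detection, relation resolution, and the write path.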
Returns a sink that can be used to continually write data.
Returns a source that can be used to continually read data.
Returns true if there is a single path that has a metadata log indicating which files should be read.
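The streaming source, streaming sink, and metadata-log behaviour described above are reached through the public readStream/writeStream API. A minimal sketch, assuming hypothetical input, output, and checkpoint paths: the file sink maintains a metadata log (the _spark_metadata directory) in its output path, and a later batch read of that single path detects the log and reads only the files it lists.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.streaming.Trigger

  val spark = SparkSession.builder().appName("streaming-sketch").getOrCreate()

  // Source side: continually read new files appearing under the input directory.
  val incoming = spark.readStream
    .format("text")
    .load("/data/incoming")                      // hypothetical input directory

  // Sink side: continually write data; the file sink records committed files
  // in a metadata log inside the output directory.
  val query = incoming.writeStream
    .format("parquet")
    .option("path", "/data/output")              // hypothetical output directory
    .option("checkpointLocation", "/data/chk")   // hypothetical checkpoint location
    .trigger(Trigger.ProcessingTime("1 minute"))
    .start()

  // A later batch read of the same single path finds the metadata log and reads
  // only the files the log marks as committed.
  val committed = spark.read.parquet("/data/output")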
Create a resolved BaseRelation that can be used to read data from or write data into this DataSource.

Whether to confirm that the files exist when generating a non-streaming, file-based data source. Structured Streaming jobs already list the files and verify that they exist, and when generating incremental jobs each batch is treated as a non-streaming file-based data source; since the files are known to exist, they do not need to be checked again.
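For reference, the sketch below resolves a relation by constructing a DataSource directly, mirroring what DataFrameReader does for non-streaming reads. This class lives in an internal package, so the constructor parameters shown (className, paths, options) and the default for checkFilesExist are assumptions that may differ between Spark versions; the path is hypothetical.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.execution.datasources.DataSource
  import org.apache.spark.sql.sources.BaseRelation

  val spark = SparkSession.builder().appName("resolve-sketch").getOrCreate()

  // Assumed internal constructor parameters; verify against your Spark version.
  val ds = DataSource(
    sparkSession = spark,
    className = "parquet",
    paths = Seq("/data/output"),                // hypothetical path
    options = Map("mergeSchema" -> "false"))

  // checkFilesExist = true verifies the globbed paths up front; batches derived
  // from a streaming job pass false because the files are already known to exist.
  val relation: BaseRelation = ds.resolveRelation(checkFilesExist = true)
  val df = spark.baseRelationToDataFrame(relation)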
Writes the given DataFrame out to this DataSource.
Writes the given DataFrame out to this DataSource and returns a BaseRelation that can be used for subsequent reads.
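The write path above corresponds to the public DataFrameWriter API. A minimal sketch, assuming a hypothetical output path: it writes a partitioned dataset through the DataSource write path and reads it back, illustrating the partition-column inference from the directory layout mentioned in the class description.

  import org.apache.spark.sql.{SaveMode, SparkSession}

  val spark = SparkSession.builder().appName("write-sketch").getOrCreate()
  import spark.implicits._

  val sales = Seq((1L, "2024-01-01", 9.99), (2L, "2024-01-02", 4.50))
    .toDF("id", "day", "amount")

  // DataFrameWriter routes the write through the DataSource; partitionBy lays the
  // files out as day=<value>/ subdirectories under the target path.
  sales.write
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .partitionBy("day")
    .save("/data/sales")                        // hypothetical path

  // Reading the location back infers the 'day' partition column from the
  // directory layout, even though no schema or partition columns are given.
  val roundTrip = spark.read.parquet("/data/sales")
  roundTrip.printSchema()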