datasources

Type Members

case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan] with Product with Serializable

Analyze CreateTable and do some normalization and checking.
Analyze CreateTable and do some normalization and checking. For CREATE TABLE AS SELECT, the SELECT query is also analyzed.
class CatalogFileIndex extends FileIndex

A FileIndex for a metastore catalog table.
case class CreateTable(tableDesc: CatalogTable, mode: SaveMode, query: Option[LogicalPlan]) extends LeafNode with Command with Product with Serializable
case class CreateTempViewUsing(tableIdent: TableIdentifier, userSpecifiedSchema: Option[StructType], replace: Boolean, global: Boolean, provider: String, options: Map[String, String]) extends LeafNode with RunnableCommand with Product with Serializable

Create or replace a local/global temporary view with given data source.
case class DataSource(sparkSession: SparkSession, className: String, paths: Seq[String] = Nil, userSpecifiedSchema: Option[StructType] = None, partitionColumns: Seq[String] = Seq.empty, bucketSpec: Option[BucketSpec] = None, options: Map[String, String] = Map.empty, catalogTable: Option[CatalogTable] = None) extends internal.Logging with Product with Serializable

The main class responsible for representing a pluggable Data Source in Spark SQL.
The main class responsible for representing a pluggable Data Source in Spark SQL. In addition to acting as the canonical set of parameters that can describe a Data Source, this class is used to resolve a description to a concrete implementation that can be used in a query plan (either batch or streaming) or to write out data using an external library.
From an end user's perspective a DataSource description can be created explicitly using org.apache.spark.sql.DataFrameReader or CREATE TABLE USING DDL. Additionally, this class is used when resolving a description from a metastore to a concrete implementation.
Many of the arguments to this class are optional, though depending on the specific API being used these optional arguments might be filled in during resolution using either inference or external metadata. For example, when reading a partitioned table from a file system, partition columns will be inferred from the directory layout even if they are not specified.
paths
A list of file system paths that hold data. These will be globbed before and qualified. This option only works when reading from a FileFormat.
userSpecifiedSchema
An optional specification of the schema of the data. When present we skip attempting to infer the schema.
partitionColumns
A list of column names that the relation is partitioned by. This list is generally empty during the read path, unless this DataSource is managed by Hive. In these cases, during resolveRelation, we will call getOrInferFileFormatSchema for file based DataSources to infer the partitioning. In other cases, if this list is empty, then this table is unpartitioned.
bucketSpec
An optional specification for bucketing (hash-partitioning) of the data.
catalogTable
Optional catalog table reference that can be used to push down operations over the datasource to the catalog service.
case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] with Product with Serializable

Replaces generic operations with specific variants that are designed to work with Spark SQL Data Sources.
trait FileFormat extends AnyRef

Used to read and write data stored in files to/from the InternalRow format.
trait FileIndex extends AnyRef

An interface for objects capable of enumerating the root paths of a relation as well as the partitions of a relation subject to some pruning expressions.
case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends Partition with Product with Serializable

A collection of file blocks that should be read as a single task (possibly from multiple partitioned directories).
class FileScanRDD extends RDD[InternalRow]

An RDD that scans a list of file partitions.
abstract class FileStatusCache extends AnyRef

A cache of the leaf files of partition directories.
A cache of the leaf files of partition directories. We cache these files in order to speed up iterated queries over the same set of partitions. Otherwise, each query would have to hit remote storage in order to gather file statistics for physical planning.
Each resolved catalog table has its own FileStatusCache. When the backing relation for the table is refreshed via refreshTable() or refreshByPath(), this cache will be invalidated.
class FindDataSourceTable extends Rule[LogicalPlan]

Replaces SimpleCatalogRelation with data source table if its table property contains data source information.
class HadoopFileLinesReader extends Iterator[Text] with Closeable

An adaptor from a PartitionedFile to an Iterator of Text, which are all of the lines in that file.
case class HadoopFsRelation(location: FileIndex, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String])(sparkSession: SparkSession) extends BaseRelation with FileRelation with Product with Serializable

Acts as a container for all of the metadata required to read from a datasource.
Acts as a container for all of the metadata required to read from a datasource. All discovery, resolution and merging logic for schemas and partitions has been removed.
location
A FileIndex that can enumerate the locations of all the files that comprise this relation.
partitionSchema
The schema of the columns (if any) that are used to partition the relation
dataSchema
The schema of any remaining columns. Note that if any partition columns are present in the actual data files as well, they are preserved.
bucketSpec
Describes the bucketing (hash-partitioning of the files by some column values).
fileFormat
A file format that can be used to read and write the data in files.
options
Configuration used when reading / writing data.
class InMemoryFileIndex extends PartitioningAwareFileIndex

A FileIndex that generates the list of files to process by recursively listing all the files present in paths.
case class InsertIntoDataSourceCommand(logicalRelation: LogicalRelation, query: LogicalPlan, overwrite: OverwriteOptions) extends LeafNode with RunnableCommand with Product with Serializable

Inserts the results of query in to a relation that extends InsertableRelation.
case class InsertIntoHadoopFsRelationCommand(outputPath: Path, staticPartitionKeys: TablePartitionSpec, customPartitionLocations: Map[TablePartitionSpec, String], partitionColumns: Seq[Attribute], bucketSpec: Option[BucketSpec], fileFormat: FileFormat, refreshFunction: (Seq[TablePartitionSpec]) ⇒ Unit, options: Map[String, String], query: LogicalPlan, mode: SaveMode, catalogTable: Option[CatalogTable]) extends LeafNode with RunnableCommand with Product with Serializable

A command for writing data to a HadoopFsRelation.
A command for writing data to a HadoopFsRelation. Supports both overwriting and appending. Writing to dynamic partitions is also supported.
staticPartitionKeys
partial partitioning spec for write. This defines the scope of partition overwrites: when the spec is empty, all partitions are overwritten. When it covers a prefix of the partition keys, only partitions matching the prefix are overwritten.
customPartitionLocations
mapping of partition specs to their custom locations. The caller should guarantee that exactly those table partitions falling under the specified static partition keys are contained in this map, and that no other partitions are.
case class LogicalRelation(relation: BaseRelation, expectedOutputAttributes: Option[Seq[Attribute]] = None, catalogTable: Option[CatalogTable] = None) extends LeafNode with MultiInstanceRelation with Product with Serializable

Used to link a BaseRelation in to a logical query plan.
Used to link a BaseRelation in to a logical query plan.
Note that sometimes we need to use LogicalRelation to replace an existing leaf node without changing the output attributes' IDs. The expectedOutputAttributes parameter is used for this purpose. See https://issues.apache.org/jira/browse/SPARK-10741 for more details.
abstract class OutputWriter extends AnyRef

OutputWriter is used together with HadoopFsRelation for persisting rows to the underlying file system.
OutputWriter is used together with HadoopFsRelation for persisting rows to the underlying file system. Subclasses of OutputWriter must provide a zero-argument constructor. An OutputWriter instance is created and initialized when a new output file is opened on executor side. This instance is used to persist rows to this single output file.
abstract class OutputWriterFactory extends Serializable

A factory that produces OutputWriters.
A factory that produces OutputWriters. A new OutputWriterFactory is created on driver side for each write job issued when writing to a HadoopFsRelation, and then gets serialized to executor side to create actual OutputWriters on the fly.
case class PartitionDirectory(values: InternalRow, files: Seq[FileStatus]) extends Product with Serializable

A collection of data files from a partitioned relation, along with the partition values in the form of an InternalRow.
case class PartitionPath(values: InternalRow, path: Path) extends Product with Serializable

Holds a directory in a partitioned collection of files as well as as the partition values in the form of a Row.
Holds a directory in a partitioned collection of files as well as as the partition values in the form of a Row. Before scanning, the files at path need to be enumerated.
case class PartitionSpec(partitionColumns: StructType, partitions: Seq[PartitionPath]) extends Product with Serializable
case class PartitionedFile(partitionValues: InternalRow, filePath: String, start: Long, length: Long, locations: Array[String] = Array.empty) extends Product with Serializable

A part (i.e.
A part (i.e. "block") of a single file that should be read, along with partition column values that need to be prepended to each row.
partitionValues
value of partition columns to be prepended to each row.
filePath
path of the file to read
start
the beginning offset (in bytes) of the block.
length
number of bytes to read.
locations
locality information (list of nodes that have the data).
abstract class PartitioningAwareFileIndex extends FileIndex with internal.Logging

An abstract class that represents FileIndexs that are aware of partitioned tables.
An abstract class that represents FileIndexs that are aware of partitioned tables. It provides the necessary methods to parse partition data based on a set of files.
case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog) extends (LogicalPlan) ⇒ Unit with Product with Serializable

A rule to do various checks before inserting into or writing to a data source table.
case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan] with Product with Serializable

Preprocess the InsertIntoTable plan.
Preprocess the InsertIntoTable plan. Throws exception if the number of columns mismatch, or specified partition columns are different from the existing partition columns in the target table. It also does data type casting and field renaming, to make sure that the columns to be inserted have the correct data type and fields have the correct names.
class RecordReaderIterator[T] extends Iterator[T] with Closeable

An adaptor from a Hadoop RecordReader to an Iterator over the values returned.
An adaptor from a Hadoop RecordReader to an Iterator over the values returned.
Note that this returns Objects instead of InternalRow because we rely on erasure to pass column batches by pretending they are rows.
case class RefreshResource(path: String) extends LeafNode with RunnableCommand with Product with Serializable
case class RefreshTable(tableIdent: TableIdentifier) extends LeafNode with RunnableCommand with Product with Serializable
class ResolveDataSource extends Rule[LogicalPlan]

Try to replaces UnresolvedRelations with ResolveDataSource.
class SQLHadoopMapReduceCommitProtocol extends HadoopMapReduceCommitProtocol with Serializable with internal.Logging

A variant of HadoopMapReduceCommitProtocol that allows specifying the actual Hadoop output committer using an option specified in SQLConf.
abstract class TextBasedFileFormat extends FileFormat

The base class file format that is based on text file.

Value Members

object BucketingUtils
object CodecStreams
object DataSource extends Serializable
object DataSourceStrategy extends Strategy with internal.Logging

A Strategy for planning scans over data sources defined using the sources API.
object FileFormatWriter extends internal.Logging

A helper object for writing FileFormat data out to a location.
object FileSourceStrategy extends Strategy with internal.Logging

A strategy for planning scans over collections of files that might be partitioned or bucketed by user specified columns.
A strategy for planning scans over collections of files that might be partitioned or bucketed by user specified columns.
At a high level planning occurs in several phases:
- Split filters by when they need to be evaluated.
- Prune the schema of the data requested based on any projections present. Today this pruning is only done on top level columns, but formats should support pruning of nested columns as well.
- Construct a reader function by passing filters and the schema into the FileFormat.
- Using a partition pruning predicates, enumerate the list of files that should be read.
- Split the files into tasks and construct a FileScanRDD.
- Add any projection or filters that must be evaluated after the scan.
Files are assigned into tasks using the following algorithm:
- If the table is bucketed, group files by bucket id into the correct number of partitions.
- If the table is not bucketed or bucketing is turned off:
  - If any file is larger than the threshold, split it into pieces based on that threshold
  - Sort the files by decreasing file size.
  - Assign the ordered files to buckets using the following algorithm. If the current partition is under the threshold with the addition of the next file, add it. If not, open a new bucket and add it. Proceed to the next file.
object FileStatusCache

Use FileStatusCache.getOrCreate() to construct a globally shared file status cache.
object HiveOnlyCheck extends (LogicalPlan) ⇒ Unit

A rule to check whether the functions are supported only when Hive support is enabled
object NoopCache extends FileStatusCache

A non-caching implementation used when partition file status caching is disabled.
object PartitionPath extends Serializable
object PartitionSpec extends Serializable
object PartitioningAwareFileIndex extends internal.Logging
object PartitioningUtils
package csv
package jdbc
package json
package parquet
package text

package datasources

Type Members

case class AnalyzeCreateTable(sparkSession: SparkSession) extends Rule[LogicalPlan] with Product with Serializable

class CatalogFileIndex extends FileIndex

case class CreateTable(tableDesc: CatalogTable, mode: SaveMode, query: Option[LogicalPlan]) extends LeafNode with Command with Product with Serializable

case class CreateTempViewUsing(tableIdent: TableIdentifier, userSpecifiedSchema: Option[StructType], replace: Boolean, global: Boolean, provider: String, options: Map[String, String]) extends LeafNode with RunnableCommand with Product with Serializable

case class DataSourceAnalysis(conf: CatalystConf) extends Rule[LogicalPlan] with Product with Serializable

trait FileFormat extends AnyRef

trait FileIndex extends AnyRef

case class FilePartition(index: Int, files: Seq[PartitionedFile]) extends Partition with Product with Serializable

class FileScanRDD extends RDD[InternalRow]

abstract class FileStatusCache extends AnyRef

class FindDataSourceTable extends Rule[LogicalPlan]

class HadoopFileLinesReader extends Iterator[Text] with Closeable

case class HadoopFsRelation(location: FileIndex, partitionSchema: StructType, dataSchema: StructType, bucketSpec: Option[BucketSpec], fileFormat: FileFormat, options: Map[String, String])(sparkSession: SparkSession) extends BaseRelation with FileRelation with Product with Serializable

class InMemoryFileIndex extends PartitioningAwareFileIndex

case class InsertIntoDataSourceCommand(logicalRelation: LogicalRelation, query: LogicalPlan, overwrite: OverwriteOptions) extends LeafNode with RunnableCommand with Product with Serializable

case class LogicalRelation(relation: BaseRelation, expectedOutputAttributes: Option[Seq[Attribute]] = None, catalogTable: Option[CatalogTable] = None) extends LeafNode with MultiInstanceRelation with Product with Serializable

abstract class OutputWriter extends AnyRef

abstract class OutputWriterFactory extends Serializable

case class PartitionDirectory(values: InternalRow, files: Seq[FileStatus]) extends Product with Serializable

case class PartitionPath(values: InternalRow, path: Path) extends Product with Serializable

case class PartitionSpec(partitionColumns: StructType, partitions: Seq[PartitionPath]) extends Product with Serializable

case class PartitionedFile(partitionValues: InternalRow, filePath: String, start: Long, length: Long, locations: Array[String] = Array.empty) extends Product with Serializable

abstract class PartitioningAwareFileIndex extends FileIndex with internal.Logging

case class PreWriteCheck(conf: SQLConf, catalog: SessionCatalog) extends (LogicalPlan) ⇒ Unit with Product with Serializable

case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan] with Product with Serializable

class RecordReaderIterator[T] extends Iterator[T] with Closeable

case class RefreshResource(path: String) extends LeafNode with RunnableCommand with Product with Serializable

case class RefreshTable(tableIdent: TableIdentifier) extends LeafNode with RunnableCommand with Product with Serializable

class ResolveDataSource extends Rule[LogicalPlan]

class SQLHadoopMapReduceCommitProtocol extends HadoopMapReduceCommitProtocol with Serializable with internal.Logging

abstract class TextBasedFileFormat extends FileFormat

Value Members

object BucketingUtils

object CodecStreams

object DataSource extends Serializable

object DataSourceStrategy extends Strategy with internal.Logging

object FileFormatWriter extends internal.Logging

object FileSourceStrategy extends Strategy with internal.Logging

object FileStatusCache

object HiveOnlyCheck extends (LogicalPlan) ⇒ Unit

object NoopCache extends FileStatusCache

object PartitionPath extends Serializable

object PartitionSpec extends Serializable

object PartitioningAwareFileIndex extends internal.Logging

object PartitioningUtils

package csv

package jdbc

package json

package parquet

package text

Ungrouped