org.apache.spark.sql.execution.datasources.parquet
Figures out a merged Parquet schema with a distributed Spark job.
Note that locality is not taken into consideration here because:

1. FileSystem only provides an API to retrieve the locations of all blocks, which can be potentially expensive.
2. This optimization is mainly useful for S3, where file metadata operations can be pretty slow, and locality is basically not available when using S3 anyway (you can't run computation on S3 nodes).
Reads the Spark SQL schema from a Parquet footer. If a valid serialized Spark SQL schema string can be found in the file metadata, returns the deserialized StructType; otherwise, returns a StructType converted from the MessageType stored in the footer.
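A minimal sketch of that lookup order, using the parquet-hadoop footer API. The object and method names here are hypothetical; the metadata key is the one Spark writes into footers, and ParquetToSparkSchemaConverter is a Spark-internal class whose constructor defaults may vary across versions:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile
import org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter
import org.apache.spark.sql.types.{DataType, StructType}

object FooterSchemaSketch {
  // The key under which Spark serializes its schema into Parquet footers.
  val SparkMetadataKey = "org.apache.spark.sql.parquet.row.metadata"

  def readSchemaFromFooter(path: String, conf: Configuration): StructType = {
    val reader = ParquetFileReader.open(HadoopInputFile.fromPath(new Path(path), conf))
    try {
      val fileMetaData = reader.getFooter.getFileMetaData
      Option(fileMetaData.getKeyValueMetaData.get(SparkMetadataKey)) match {
        // Preferred path: a serialized Spark SQL schema round-trips through
        // DataType.fromJson. (A robust version would also fall back to the
        // next branch if this JSON fails to parse, as Spark's does.)
        case Some(json) => DataType.fromJson(json).asInstanceOf[StructType]
        // Fallback: convert the raw Parquet MessageType. This converter is
        // Spark-internal; applications normally rely on spark.read.parquet
        // doing this conversion for them.
        case None => new ParquetToSparkSchemaConverter().convert(fileMetaData.getSchema)
      }
    } finally {
      reader.close()
    }
  }
}
```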