SnappyData Smart Connector Mode
In this mode, the Spark cluster executes in its own independent JVM processes and connects to SnappyData as a Spark data source. Conceptually, this is similar to how Spark applications work with stores like Cassandra or Redis. Smart Connector mode also implements several performance optimizations, as described in this section.
Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program).
Specifically, to run on a cluster, the SparkContext can connect to several types of cluster managers (either local-only runners, Spark’s own standalone cluster manager, Mesos or YARN), which allocate resources across applications. Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks to the executors to run.
Key Points:
- Can work with SnappyData store from a compatible Spark distribution (2.1.1, 2.1.2, or 2.1.3)
- The Spark application executes in its own independent JVM processes
- The Spark application connects to SnappyData as a Spark data source (a minimal sketch follows this list)
- Supports any of the Spark supported resource managers (for example, local/local-cluster, Spark Standalone Manager, YARN, or Mesos)
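For illustration, the following is a minimal sketch of a standalone Smart Connector application. It assumes a SnappyData locator running on `localhost:1527` and an existing table named `some_table`; both are placeholders, not part of any shipped example.

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

object SmartConnectorApp {
  def main(args: Array[String]): Unit = {
    // The Spark application runs in its own JVMs under any supported cluster
    // manager; spark.snappydata.connection points at the SnappyData locator
    // (host:client-port, 1527 by default).
    val spark = SparkSession.builder()
      .appName("SmartConnectorApp")
      .master("local[*]") // or spark://host:7077, yarn, mesos://...
      .config("spark.snappydata.connection", "localhost:1527")
      .getOrCreate()

    // SnappySession wraps the SparkContext and exposes the SnappyData store
    // as a Spark data source.
    val snappy = new SnappySession(spark.sparkContext)
    snappy.sql("SELECT COUNT(*) FROM some_table").show() // placeholder table

    spark.stop()
  }
}
```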
Some of the advantages of this mode are:
Performance
- When Spark partitions store data in column tables, the connector automatically attempts to localize the partitions onto the SnappyData store buckets on the local node. The connector uses the same column store format and compression techniques as the store, avoiding data-formatting inefficiencies and unnecessary serialization costs. This is the fastest way to ingest data when Spark and the SnappyData cluster are operating as independent clusters.
- When storing to row tables, or when the partitioning in Spark differs from the partitioning configured on the table, data batches may be shuffled across nodes. Whenever Spark applications write to SnappyData tables, the data is always batched for the highest possible throughput.
- When queries are executed, the entire query planning and execution is coordinated by the Spark engine (Catalyst), but the Smart Connector still carries out a number of optimizations:
    - Routes jobs to the same machines as SnappyData data nodes if the executor nodes are co-hosted on those machines. The job for each partition tries to fetch data only from the data store on the same machine where possible.
    - Fetches data for column tables in its raw columnar byte format, and optimized generated plans read the columnar data directly. This gives performance equivalent to, or exceeding, Spark's ColumnVectors while avoiding the most expensive step of copying external data into ColumnVectors.
    - Colocated joins: if the underlying tables are colocated partition-wise, and executor nodes are co-hosted with SnappyData data nodes, then the column batches are fetched from local machines and the join itself is partition-wise and does not require any exchange.
- Optimized column batch inserts, as in the Embedded mode, with jobs routed to the same machines as the data stores where possible.
Allows creation of `SnappySession`, which extends `SparkSession`, in various ways (a brief sketch follows this list):

- Enhanced SQL parser allowing for the DML operations `UPDATE`, `DELETE`, `PUT INTO`, and the DDL `TRUNCATE` operation
- Enhanced APIs to perform inline row-wise or bulk `UPDATE`, `DELETE` and `PUT INTO` operations
- New `CREATE POLICY` operation (and corresponding `DROP POLICY`) to create a security access policy using SQL or API
- Tokenization and plan caching that allows reuse of plans and generated code for queries that differ only in constant values, which substantially improves overall query performance
- Automatically uses the embedded cluster for the hive meta-data by default, so any tables created show up in the embedded cluster and vice-versa
- Ability to also attach to an external hive meta-store in addition to the embedded one
- Uniform APIs corresponding to the DDL operations to `CREATE` and `DROP` tables, views and policies, and to `ALTER` and `TRUNCATE` tables
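The sketch below illustrates some of these extensions through the `SnappySession` API and SQL. The table names, schemas, option values, and data are illustrative only, and the sketch assumes a locator reachable on `localhost:1527`.

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

val spark = SparkSession.builder()
  .appName("ConnectorFeatures")
  .master("local[*]")
  .config("spark.snappydata.connection", "localhost:1527")
  .getOrCreate()
val snappy = new SnappySession(spark.sparkContext)

// Column table: DataFrame writes are batched into the store's columnar format.
// BUCKETS '8' is an illustrative option value.
snappy.sql("CREATE TABLE quotes (symbol STRING, price DOUBLE) USING column OPTIONS (BUCKETS '8')")

import snappy.implicits._
val df = Seq(("AAPL", 180.0), ("MSFT", 410.0)).toDF("symbol", "price")
df.write.insertInto("quotes") // optimized column batch insert

// DML supported by the enhanced SQL parser.
snappy.sql("UPDATE quotes SET price = 412.0 WHERE symbol = 'MSFT'")
snappy.sql("DELETE FROM quotes WHERE symbol = 'AAPL'")
snappy.sql("TRUNCATE TABLE quotes")

// PUT INTO (upsert) shown here on a row table with a primary key.
snappy.sql("CREATE TABLE quote_latest (symbol VARCHAR(10) PRIMARY KEY, price DOUBLE) USING row")
snappy.sql("PUT INTO quote_latest VALUES ('AAPL', 181.5)")
```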
Example: Launch a Spark local mode cluster and use Smart Connector to access SnappyData cluster
Step 1: Start the SnappyData cluster: You can either start SnappyData members using the `snappy_start_all` script or you can start them individually.
Step 2: Launch the Apache Spark program
In the Local mode
./bin/spark-shell --master local[*] --conf spark.snappydata.connection=localhost:1527 \
--packages "io.snappydata:snappydata-spark-connector_2.11:1.3.1-HF-1"
Note
- The `spark.snappydata.connection` property points to the locator of a running SnappyData cluster. The value of this property is a combination of the locator host and the JDBC client port on which the locator listens for connections (default is 1527).
- In Smart Connector mode, all `snappydata.*` SQL configuration properties should be prefixed with `spark.` when running with upstream Apache Spark (this is not required when running SnappyData's bundled Spark). For example, `spark.snappydata.column.batchSize`. A programmatic sketch of these properties follows below.
This opens a Scala Shell.
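In a standalone application, the same properties can be set programmatically when building the `SparkSession` instead of passing `--conf` on the command line. A minimal sketch follows; the batch size value is illustrative only.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SmartConnectorConfig")
  .master("local[*]")
  // Locator host and JDBC client port of the running SnappyData cluster.
  .config("spark.snappydata.connection", "localhost:1527")
  // A snappydata.* SQL property carrying the spark. prefix required on upstream Spark.
  .config("spark.snappydata.column.batchSize", "24m") // illustrative value
  .getOrCreate()
```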
Step 3: Import any or all of the following:
- SnappySession
- SparkSession
The shell is now connected to the SnappyData cluster in Smart Connector mode. Create a SnappySession to interact with the SnappyData store.
// Create a SnappySession to work with SnappyData store
$scala > import org.apache.spark.sql.SnappySession
$scala > val snSession = new SnappySession(spark.sparkContext)
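As a quick check from the shell, you can create a table and query it through the session; the table name and schema below are illustrative only.
// Quick check: create a column table, insert a couple of rows, and query it
$scala > snSession.sql("CREATE TABLE demo (id INT, name STRING) USING column")
$scala > snSession.sql("INSERT INTO demo VALUES (1, 'one')")
$scala > snSession.sql("INSERT INTO demo VALUES (2, 'two')")
$scala > snSession.sql("SELECT * FROM demo ORDER BY id").show()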
The code example for writing a Smart Connector application program is located in SmartConnectorExample.
Using External Cluster Manager
Cluster mode
./bin/spark-submit --deploy-mode cluster --class somePackage.someClass --master spark://localhost:7077 \
--conf spark.snappydata.connection=localhost:1527 --packages "io.snappydata:snappydata-spark-connector_2.11:1.3.1-HF-1"
Client mode
./bin/spark-submit --deploy-mode client --class somePackage.someClass --master spark://localhost:7077 \
--conf spark.snappydata.connection=localhost:1527 --packages "io.snappydata:snappydata-spark-connector_2.11:1.3.1-HF-1"
Using YARN as a Cluster Manager
Cluster mode
./spark-submit --master yarn --deploy-mode cluster \
--conf spark.driver.extraClassPath='/home/snappyuser/snappydata-1.3.1-bin/jars/*' \
--conf spark.executor.extraClassPath='/home/snappyuser/snappydata-1.3.1-bin/jars/*' \
--class MainClass SampleProjectYarn.jar
Client mode