Using spark-shell and spark-submit
SnappyData, out of the box, colocates Spark executors and the SnappyData store for efficient data-intensive computations. However, you may need to isolate the computational cluster for other reasons, for instance, a computationally intensive MapReduce machine-learning algorithm that needs to iterate over a cached data set repeatedly. Refer to SnappyData Smart Connector Mode for examples.
To support such cases, you can also run native Spark jobs that access a SnappyData cluster as a storage layer in parallel. To connect to the SnappyData store, provide the spark.snappydata.connection property when starting the Spark shell.

To use SnappyData functionality, you need to create a SnappySession.
```
// from the SnappyData base directory
// Start the Spark shell in local mode. Pass SnappyData's locators host:clientPort as a conf parameter.
$ ./bin/spark-shell --master local[*] --conf spark.snappydata.connection=locatorhost:clientPort --conf spark.ui.port=4041
scala>
// Try a few commands in the spark-shell. The following commands query a table created using snappy-sql.
scala> val snappy = new org.apache.spark.sql.SnappySession(spark.sparkContext)
scala> val airlineDF = snappy.table("airline")
scala> airlineDF.show
scala> val resultset = snappy.sql("select * from airline")
```
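The same SnappySession can also write DataFrames back into the SnappyData store. The following is a minimal sketch that reuses the airline table from above; the target table name `airline_copy` and the use of the column format are illustrative assumptions, not part of the quickstart.

```
scala> // Run an aggregate query through the SnappySession
scala> snappy.sql("select count(*) from airline").show

scala> // Save the DataFrame as a new column table in the SnappyData store
scala> // (the table name "airline_copy" is only an example)
scala> airlineDF.write.format("column").saveAsTable("airline_copy")
```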
Any Spark application can also use SnappyData as the store and Spark as the computational engine by providing the spark.snappydata.connection property as shown below:
```
// Start the Spark standalone cluster from the SnappyData base directory
$ ./sbin/start-all.sh

// Submit AirlineDataSparkApp to the Spark cluster with SnappyData's locator host:clientPort.
$ ./bin/spark-submit --class io.snappydata.examples.AirlineDataSparkApp --master spark://masterhost:7077 --conf spark.snappydata.connection=locatorhost:clientPort $SNAPPY_HOME/examples/jars/quickstart.jar

// The results can be seen on the command line.
```
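For reference, an application submitted this way typically builds a SnappySession on top of its SparkSession, in the same way as the shell example above. The sketch below is a hypothetical, minimal application, not the actual AirlineDataSparkApp; the queried table is assumed to already exist in the store.

```scala
import org.apache.spark.sql.{SnappySession, SparkSession}

// Hypothetical minimal application; not the actual AirlineDataSparkApp.
// The spark.snappydata.connection property is supplied via --conf on spark-submit.
object SimpleSnappyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SimpleSnappyApp")
      .getOrCreate()

    // Create a SnappySession over the existing SparkContext to access
    // tables stored in the SnappyData cluster.
    val snappy = new SnappySession(spark.sparkContext)

    // Query a table that is assumed to already exist in the SnappyData store.
    snappy.sql("select count(*) from airline").show()

    spark.stop()
  }
}
```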