Accessing Cloud Data¶
AWS¶
To access the data on AWS cloud storage, you must set the credentials using any of the following methods:
- Using hive-site.xml File
- Setting the Access or Secret Key Properties
- Setting the Access or Secret Key Properties through Environment Variable
Using hive-site.xml File¶
Add the following properties to the conf/hive-site.xml file:
<property>
  <name>fs.s3a.access.key</name>
  <value>Amazon S3 Access Key</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>Amazon S3 Secret Key</value>
</property>
Setting the Access or Secret Key Properties¶
1. Add the following properties in the conf/leads file (see the full example entry after this list):

   -spark.hadoop.fs.s3a.access.key=<Amazon S3 Access Key> -spark.hadoop.fs.s3a.secret.key=<Amazon S3 Secret Key>

2. Create an external table using an s3a://<bucketName>/<folderName> path in the CREATE EXTERNAL TABLE command, as shown in Accessing Data from AWS Cloud Storage below.
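For reference, a lead entry in the conf/leads file with these properties appended might look like the following. This is a sketch that assumes a single lead on localhost with a locator at localhost:10334; adjust the host names and other options for your cluster:

localhost -locators=localhost:10334 -spark.hadoop.fs.s3a.access.key=<Amazon S3 Access Key> -spark.hadoop.fs.s3a.secret.key=<Amazon S3 Secret Key>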
Setting the Access or Secret Key Properties through Environment Variable¶
Set credentials as environment variables:
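For example, the standard AWS environment variables, which the S3A connector's default credential chain picks up, can be exported in the environment of the SnappyData processes before the cluster is started (the key values are placeholders):

export AWS_ACCESS_KEY_ID=<Amazon S3 Access Key>
export AWS_SECRET_ACCESS_KEY=<Amazon S3 Secret Key>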
Accessing Data from AWS Cloud Storage¶
Create an external table using the following command:
create external table staging_parquet using parquet options (path 's3a://<bucketName>/<folderName>');
create table parquet_test using column as select * from staging_parquet;
Unsetting the Access or Secret Key Properties¶
To unset the credentials, run org.apache.hadoop.fs.FileSystem.closeAll() on the snappy-scala shell or in a job. This clears the cached file systems. Ensure that no queries are running on the cluster when you execute this call. After this, you can set the new credentials.
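For example, from the snappy-scala shell (a minimal sketch; the statement simply invokes the standard Hadoop FileSystem API):

// Close and clear all cached FileSystem instances so that new credentials take effect.
org.apache.hadoop.fs.FileSystem.closeAll()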
Azure Blob¶
To access the data on Azure Blob storage, you must set the credentials using any of the following methods:
- Setting Credentials through hive-site.xml
- Setting Credentials through Spark Property
Setting Credentials through hive-site.xml¶
Set the following property in hive-site.xml:
<property>
  <name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
  <value>YOUR ACCESS KEY</value>
</property>
Setting Credentials through Spark Property¶
sc.hadoopConfiguration.set("fs.azure.account.key.<your-storage_account_name>.dfs.core.windows.net", "<YOUR ACCESS KEY>")
Accessing Data from Azure Unsecured BLOB Storage¶
CREATE EXTERNAL TABLE testADLS1 USING PARQUET Options (path 'wasb://container_name@storage_account_name.blob.core.windows.net/dir/file')
Accessing Data from Azure Secured BLOB Storage¶
CREATE EXTERNAL TABLE testADLS1 USING PARQUET Options (path 'wasbs://container_name@storage_account_name.blob.core.windows.net/dir/file')
GCS¶
To access the data on GCS cloud storage, you must set the credentials using any of the following methods:
- Setting Credentials through hive-site.xml
- Setting Credentials through Spark Property
Setting Credentials through hive-site.xml¶
Set the following property in hive-site.xml:
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/path/to/keyfile</value>
</property>
Setting Credentials through Spark property on Shell¶
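The keyfile location can also be set on the Hadoop configuration of the Spark context, for example from the snappy-scala shell. This is a sketch that assumes the same property name used in hive-site.xml above:

sc.hadoopConfiguration.set("google.cloud.auth.service.account.json.keyfile", "/path/to/keyfile")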
Accessing Data from GCS Cloud Storage¶
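A minimal sketch, assuming the bucket is reachable through the gs:// scheme of the GCS Hadoop connector (the table, bucket, and folder names below are placeholders):

create external table staging_gcs using parquet options (path 'gs://<bucketName>/<folderName>');
create table parquet_gcs using column as select * from staging_gcs;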
EMR HDFS¶
You can use the following command to access data from an EMR HDFS location for a cluster that is running on EC2:
create external table categories using csv options(path 'hdfs://<masternode IP address>/<file_path>');
Azure Data Lake Storage (ADLS)¶
SnappyData supports ADLS Gen 2 out of the box. Refer to the Hadoop Azure documentation for more details. The abfs: URLs mentioned in that documentation can be used directly in SnappyData as file paths to store or load data.
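For example, an external table over an ADLS Gen 2 path might be created as follows (a sketch; the table, container, storage account, and file names are placeholders):

CREATE EXTERNAL TABLE testADLS2 USING PARQUET OPTIONS (path 'abfs://container_name@storage_account_name.dfs.core.windows.net/dir/file')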