1. Getting Started

Let us get started with Spark SQL. In this module we will see how to launch and use Spark SQL.

  • Overview of Spark Documentation

  • Launching and Using Spark SQL

  • Overview of Spark SQL Properties

  • Running OS Commands using Spark SQL

  • Understanding Warehouse Directory

  • Managing Spark Metastore Databases

  • Managing Spark Metastore Tables

  • Retrieve Metadata of Tables

  • Role of Spark or Hive Metastore

1.1. Overview of Spark Documentation

Let us go through the details related to Spark Documentation. It is very important to get comfortable with the Spark Documentation if you are preparing for open book certification exams like CCA 175.

  • Click here to go to the latest Spark SQL and Data Frames documentation.

  • We typically get documentation for the latest version.

  • We can replace latest in the URL with a specific version of Spark to get that version’s official documentation.

  • We also have resources provided by Databricks.

1.2. Launching and Using Spark SQL

Let us understand how to launch Spark SQL CLI.

  • Log on to the gateway node of the cluster.

  • We have 2 versions of Spark in our labs. One can use spark-sql to launch Spark SQL using 1.6.x and spark2-sql to launch Spark SQL using 2.3.x.

  • Launch the Spark SQL CLI using spark-sql. In clustered mode we might have to pass additional arguments. For example:

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • One can get help using spark-sql --help

  • For example, we can use spark-sql --database training_retail to connect to a specific database. Here is the example in clustered mode.

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse \
    --database ${USER}_retail
  • The Spark SQL CLI will be launched and connected to the ${USER}_retail database.

  • We can validate which database we are connected to using SELECT current_database(), as shown below.
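
For example, after launching the Spark SQL CLI connected to the database, we can run the following statement to confirm (in our labs it should return ${USER}_retail):

SELECT current_database();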

1.3. Overview of Spark SQL Properties

Let us understand the details of Spark SQL properties which control the Spark SQL runtime environment.

  • Spark SQL inherits the properties defined for Spark. There are some Spark SQL specific properties as well, and these are applicable even for Data Frames.

  • We can review these properties using management tools such as the Ambari or Cloudera Manager Web UI.

  • Spark runtime behavior is also controlled by HDFS properties files, YARN properties files, Hive properties files, etc. in clusters where Spark is integrated with Hadoop and Hive.

  • We can get all the properties using SET; in the Spark SQL CLI.

Let us review some important properties in Spark SQL.

spark.sql.warehouse.dir
spark.sql.catalogImplementation
  • We can review the current value using SET spark.sql.warehouse.dir;

// Create a Spark Session with Hive support so that Spark SQL can work with the metastore
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", "/user/itversity/warehouse").
    enableHiveSupport.
    master("yarn").
    appName("Getting Started - Spark SQL").
    getOrCreate
%%sql

SET
%%sql

SET spark.sql.warehouse.dir
  • Properties with default values do not show up as part of the SET command. But we can check and overwrite the values - for example:

%%sql

SET spark.sql.shuffle.partitions
  • We can overwrite a property by setting its value using the same SET command, e.g.:

%%sql

SET spark.sql.shuffle.partitions=2
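
We can run SET on the same property once more to confirm that the override took effect - it should now return 2.

%%sql

SET spark.sql.shuffle.partitions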

1.4. Running OS Commands using Spark SQL

Let us understand how to run OS commands using Spark SQL CLI.

  • We can run OS commands using ! at the beginning.

    • Listing local files: !ls -ltr;

    • Listing HDFS files: !hdfs dfs -ls /public/retail_db;
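
For example, inside the Spark SQL CLI (the paths below are from our lab environment):

!ls -ltr;

!hdfs dfs -ls /public/retail_db;

In a notebook we can achieve the same using Scala's sys.process package, as shown below.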

import sys.process._

// List files in the local file system on the gateway node
"ls -ltr"!
import sys.process._

// List files in HDFS
"hdfs dfs -ls /public/retail_db"!

1.5. Understanding Warehouse Directory

Let us go through the details related to Spark Metastore Warehouse Directory.

  • A Database in Spark SQL is nothing but a directory in the underlying file system, such as HDFS.

  • A Spark Metastore Table is nothing but a directory in the underlying file system, such as HDFS.

  • A Partition of a Spark Metastore Table is nothing but a subdirectory under the table's directory in the underlying file system, such as HDFS.

  • The Warehouse Directory is the base directory where the directories related to databases and tables go by default.

  • It is controlled by spark.sql.warehouse.dir. You can get the value by saying SET spark.sql.warehouse.dir;

Do not overwrite this property within a Spark SQL CLI session. It will not have any effect.

  • The underlying directory for a database will have a .db extension.

%%sql

SET spark.sql.warehouse.dir
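
We can also list the contents of the warehouse directory to see these database directories. This is a quick sketch assuming our lab convention of /user/${username}/warehouse; adjust the path to whatever SET spark.sql.warehouse.dir returns in your environment.

val username = System.getProperty("user.name")
import sys.process._

// Database directories under the warehouse directory show up with a .db extension
s"hdfs dfs -ls /user/${username}/warehouse"!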

1.6. Managing Spark Metastore Databases

Let us understand how to manage Spark Metastore Databases.

  • Make a habit of reviewing the Language Manual.

  • We can create a database using the CREATE DATABASE command.

  • For example: CREATE DATABASE itversity_demo;

  • If the database already exists, the command will fail. If you want it to be ignored without throwing an error, you can use IF NOT EXISTS.

  • e.g.: CREATE DATABASE IF NOT EXISTS itversity_demo;

  • We can list the databases using SHOW databases;

  • Spark Metastore is a multi-tenant database. To switch to a database, you can use the USE command, e.g.: USE itversity_demo;

  • We can drop an empty database using DROP DATABASE itversity_demo;.

  • Add CASCADE to drop all the tables and then the database: DROP DATABASE itversity_demo CASCADE;.

  • We can also specify a location while creating the database - CREATE DATABASE itversity_demo LOCATION '/user/itversity/custom/itversity_demo.db'

%%sql

DROP DATABASE IF EXISTS itversity_demo
%%sql

CREATE DATABASE itversity_demo
%%sql

CREATE DATABASE itversity_demo
%%sql

CREATE DATABASE IF NOT EXISTS itversity_demo
%%sql

SHOW databases
%%sql

SELECT current_database()
%%sql

USE itversity_demo
%%sql

SELECT current_database()
// List the warehouse directory of the database we just created
val username = System.getProperty("user.name")
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_demo.db"!
%%sql

CREATE TABLE table_demo (i INT)
import sys.process._
// The directory for table_demo should now appear under the database directory
s"hdfs dfs -ls /user/${username}/warehouse/${username}_demo.db"!
%%sql

DROP DATABASE itversity_demo CASCADE
import sys.process._
// This listing should fail now, as dropping the database with CASCADE removes its directory
s"hdfs dfs -ls /user/${username}/warehouse/${username}_demo.db"!
%%sql

CREATE DATABASE itversity_demo LOCATION '/user/itversity/custom/itversity_demo.db'
import sys.process._
s"hdfs dfs -ls /user/${username}/custom"! // Directory will be created if it does not exists
import sys.process._
s"hdfs dfs -ls /user/${username}/custom/${username}_demo.db"!

1.7. Managing Spark Metastore Tables

Let us create our first Spark Metastore table. We will also take a look at how to list the tables.

  • We will get into details related to DDL Commands at a later point in time.

  • For now we will just create our first table. We will get into the details about creating tables as part of subsequent sections.

Use your OS username as a prefix for the databases if you are using our labs.

%%sql

SELECT current_database()
%%sql

DROP DATABASE itversity_retail CASCADE
%%sql

CREATE DATABASE itversity_retail
%%sql

USE itversity_retail
%%sql

CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_customer_id INT,
  order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  • We can list the tables using SHOW tables;

%%sql

SHOW tables
  • We can also drop the table using the DROP TABLE command. We will get into more details at a later point in time.

  • We can also truncate managed tables using the TRUNCATE TABLE command, as shown below.
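
Here is a quick illustration of TRUNCATE TABLE using the orders table we just created. As the table has no data yet, this only demonstrates the syntax; we will keep the table itself, as the next section describes its metadata.

%%sql

TRUNCATE TABLE orders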

1.8. Retrieve Metadata of Tables

Now that the table is created, let us understand how to get the metadata of a table.

  • We can get metadata of Hive Tables using several commands.

    • DESCRIBE - e.g.: DESCRIBE orders;

    • DESCRIBE EXTENDED - e.g.: DESCRIBE EXTENDED orders;

    • DESCRIBE FORMATTED - e.g.: DESCRIBE FORMATTED orders;

  • DESCRIBE will give only field names and data types.

  • DESCRIBE EXTENDED will give all the metadata, but not in a readable format in Hive. It is the same as DESCRIBE FORMATTED in Spark SQL.

  • DESCRIBE FORMATTED will give the metadata in a readable format.

As the output is truncated in Jupyter, we will actually see the details using spark-sql.

%%sql

SELECT current_database()
%%sql

USE itversity_retail
%%sql

SHOW tables
%%sql

DESCRIBE orders
%%sql

DESCRIBE EXTENDED orders
%%sql

DESCRIBE FORMATTED orders

1.9. Role of Spark or Hive Metastore

Let us understand the role of the Spark Metastore or Hive Metastore. We first need to understand the details related to the metadata generated for Spark Metastore tables.

  • When we create a Spark Metastore table, there is metadata associated with it.

    • Table Name

    • Column Names and Data Types

    • Location

    • File Format

    • and more

  • This metadata has to be stored somewhere so that query engines such as Spark SQL can access the information to serve our queries.

Let us understand where the metadata is stored.

  • The information is typically stored in a relational database and is called the metastore.

  • It is extensively used by the Hive or Spark SQL engine for syntax and semantics checks as well as execution of queries.

  • In our case it is stored in a MySQL database. Let us review the details by going through the relevant properties.
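
For example, we can check the catalog implementation that is in use. With Hive support enabled, it should return hive, which means the metadata is managed through the Hive Metastore (backed by MySQL in our labs).

%%sql

SET spark.sql.catalogImplementation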

1.10. Exercise - Getting Started with Spark SQL

Let’s do a simple exercise to conclude this section.

  • Launch Spark SQL (don’t use a database) (use the spark-sql command). Here is the script for our labs. In other environments, you can skip the last line. I have also included commands to launch Spark using Scala or Python (for CCA 175 certification purposes).

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Pyspark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse
  • Create a database and exit (make sure to prefix the database name with your OS username, e.g.: training_retail for OS user training)

  • Exit and launch again, connecting to your database (see the sketch below)
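
Here is a sketch of these two steps for our labs, assuming the OS user is training (substitute your own username). Inside the Spark SQL CLI launched above:

CREATE DATABASE training_retail;
exit;

Then relaunch, connecting to the new database:

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse \
    --database training_retail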

  • Create the table orders. You can use the below script to create the table.

CREATE TABLE orders (
  order_id INT,
  order_date STRING,
  order_customer_id INT,
  order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
  • List the tables

  • Describe the table and review the whole metadata
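
A sketch of the final steps, run while connected to your database:

SHOW tables;

DESCRIBE FORMATTED orders;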