1. Getting Started¶
Let us get started with Spark SQL. In this module we will see how to launch and use Spark SQL.
Overview of Spark Documentation
Launching and Using Spark SQL
Overview of Spark SQL Properties
Running OS Commands using Spark SQL
Understanding Warehouse Directory
Managing Spark Metastore Databases
Managing Spark Metastore Tables
Retrieve Metadata of Tables
Role of Spark or Hive Metastore
1.1. Overview of Spark Documentation¶
Let us go through the details related to the Spark documentation. It is very important for you to get comfortable with the Spark documentation if you are aspiring for open-book certification exams like CCA 175.
Click here to go to the latest Spark SQL and Data Frames documentation.
We typically get the documentation for the latest version.
We can replace latest in the URL with a specific Spark version to get that version's official documentation.
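For example, the latest Spark SQL guide is typically at a URL like https://spark.apache.org/docs/latest/sql-programming-guide.html, and replacing latest with a release number (e.g. https://spark.apache.org/docs/2.3.1/sql-programming-guide.html) gives that version's guide.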
We also have resources provided by Databricks.
1.2. Launching and Using Spark SQL¶
Let us understand how to launch the Spark SQL CLI.
Log on to the gateway node of the cluster.
We have 2 versions of Spark in our labs. One can use spark-sql to launch Spark SQL using 1.6.x and spark2-sql to launch Spark SQL using 2.3.x.
Launch the Spark SQL CLI using spark-sql. In clustered mode we might have to add additional arguments. For example:
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
One can get help using
spark-sql --help
For example, we can use spark-sql --database training_retail to connect to a specific database. Here is the example in clustered mode:
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse \
--database ${USER}_retail
Spark SQL CLI will be launched and will be connected to ${USER}_retail database.
We can validate which database we are connected to using
SELECT current_database()
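To come out of the Spark SQL CLI we can typically type exit; (or press Ctrl+C), similar to the Hive CLI.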
1.3. Overview of Spark SQL Properties¶
Let us understand the details of the Spark SQL properties which control the Spark SQL runtime environment.
Spark SQL inherits properties defined for Spark. There are some Spark SQL specific properties as well, and these are applicable even for Data Frames.
We can review these properties using management tools such as the Ambari or Cloudera Manager web UI.
Spark runtime behavior is controlled by HDFS properties files, YARN properties files, Hive properties files, etc. in those clusters where Spark is integrated with Hadoop and Hive.
We can get all the properties using
SET;
in the Spark SQL CLI.
Let us review some important properties in Spark SQL.
spark.sql.warehouse.dir
spark.sql.catalogImplementation
We can review the current value using
SET spark.sql.warehouse.dir;
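In our Jupyter notebooks we first create a SparkSession with Hive support before running any %%sql cells; the snippet below reflects our lab setup (the warehouse directory and application name are specific to our environment).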
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", "/user/itversity/warehouse").
    enableHiveSupport.
    master("yarn").
    appName("Getting Started - Spark SQL").
    getOrCreate
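With the session created, the %%sql cells below list all the properties and then check the warehouse directory specifically.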
%%sql
SET
%%sql
SET spark.sql.warehouse.dir
Properties with default values do not show up as part of the SET command. But we can check and override the values, for example:
%%sql
SET spark.sql.shuffle.partitions
We can override a property by setting its value using the same SET command, e.g.:
%%sql
SET spark.sql.shuffle.partitions=2
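Running the same SET command again should now show the overridden value (2 in this case).
%%sql
SET spark.sql.shuffle.partitions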
1.4. Running OS Commands using Spark SQL¶
Let us understand how to run OS commands using the Spark SQL CLI.
We can run OS commands using ! at the beginning.
Listing local Files
!ls -ltr;
Listing HDFS Files
!hdfs dfs -ls /public/retail_db;
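From the notebook we can run the same commands using Scala's sys.process package, as shown below.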
import sys.process._
"ls -ltr"!
import sys.process._
"hdfs dfs -ls /public/retail_db"!
1.5. Understanding Warehouse Directory¶
Let us go through the details related to Spark Metastore Warehouse Directory.
A Database in Spark SQL is nothing but a directory in the underlying file system like HDFS.
A Spark Metastore Table is nothing but a directory in the underlying file system like HDFS.
A Partition of a Spark Metastore Table is nothing but a directory under the table directory in the underlying file system like HDFS.
The Warehouse Directory is the base directory where the directories related to databases and tables go by default.
It is controlled by
spark.sql.warehouse.dir
. You can get the value by running SET spark.sql.warehouse.dir;
Do not override this property within the Spark SQL CLI session; it will not have any effect.
The underlying directory for a database will have a .db extension.
%%sql
SET spark.sql.warehouse.dir
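To see the database directories (each with a .db extension) under the warehouse, we can list the warehouse directory from the notebook. Here is a minimal sketch assuming the lab warehouse location /user/<username>/warehouse.
val username = System.getProperty("user.name")

import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse"!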
1.6. Managing Spark Metastore Databases¶
Let us understand how to manage Spark Metastore Databases.
Make a habit of reviewing the Language Manual.
We can create a database using the CREATE DATABASE command. For example:
CREATE DATABASE itversity_demo;
If the database already exists, the command will fail. If you want to ignore it without throwing an error, you can use IF NOT EXISTS, e.g.:
CREATE DATABASE IF NOT EXISTS itversity_demo;
We can list the databases using
SHOW databases;
Spark Metastore is a multi-tenant database. To switch to a database, you can use the USE command, e.g.:
USE itversity_demo;
We can drop an empty database using
DROP DATABASE itversity_demo;
Add CASCADE to drop all the tables and then the database:
DROP DATABASE itversity_demo CASCADE;
We can also specify the location while creating the database:
CREATE DATABASE itversity_demo LOCATION '/user/itversity/custom/itversity_demo.db'
%%sql
DROP DATABASE IF EXISTS itversity_demo
%%sql
CREATE DATABASE itversity_demo
%%sql
CREATE DATABASE itversity_demo
%%sql
CREATE DATABASE IF NOT EXISTS itversity_demo
%%sql
SHOW databases
%%sql
SELECT current_database()
%%sql
USE itversity_demo
%%sql
SELECT current_database()
val username = System.getProperty("user.name")
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_demo.db"!
%%sql
CREATE TABLE table_demo (i INT)
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_demo.db"!
%%sql
DROP DATABASE itversity_demo CASCADE
import sys.process._
s"hdfs dfs -ls /user/${username}/warehouse/${username}_demo.db"!
%%sql
CREATE DATABASE itversity_demo LOCATION '/user/itversity/custom/itversity_demo.db'
import sys.process._
s"hdfs dfs -ls /user/${username}/custom"! // Directory will be created if it does not exists
import sys.process._
s"hdfs dfs -ls /user/${username}/custom/${username}_demo.db"!
1.7. Managing Spark Metastore Tables¶
Let us create our first Spark Metastore table. We will also have a look into how to list the tables.
We will get into details related to DDL Commands at a later point in time.
For now we will just create our first table. We will get into the details about creating tables as part of subsequent sections.
Use your OS username as a prefix for the database names if you are using our labs.
%%sql
SELECT current_database()
%%sql
DROP DATABASE IF EXISTS itversity_retail CASCADE
%%sql
CREATE DATABASE itversity_retail
%%sql
USE itversity_retail
%%sql
CREATE TABLE orders (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
We can list the tables using
SHOW tables;
%%sql
SHOW tables
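To quickly confirm that the table is usable, we can insert a test record and query it back. A minimal sketch (the row below is just sample values, not data loaded from retail_db):
%%sql
INSERT INTO orders VALUES (1, '2013-07-25 00:00:00.0', 11599, 'CLOSED')
%%sql
SELECT * FROM orders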
We can also drop the table using the DROP TABLE command. We will get into more details at a later point in time.
We can also truncate managed tables using the TRUNCATE TABLE command.
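For example, truncating orders removes the data (including the test record inserted above) while keeping the table definition.
%%sql
TRUNCATE TABLE orders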
1.8. Retrieve Metadata of Tables¶
Now that the table is created, let us understand how to get the metadata of a table.
We can get metadata of Hive Tables using several commands.
DESCRIBE - e.g.:
DESCRIBE orders;
DESCRIBE EXTENDED - e.g.:
DESCRIBE EXTENDED orders;
DESCRIBE FORMATTED - e.g.:
DESCRIBE FORMATTED orders;
DESCRIBE will give only field names and data types.
DESCRIBE EXTENDED will give all the metadata, but not in a readable format in Hive. It is the same as DESCRIBE FORMATTED in Spark SQL.
DESCRIBE FORMATTED will give metadata in readable format.
As the output is truncated in Jupyter, we will actually see the details using spark-sql.
%%sql
SELECT current_database()
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
DESCRIBE orders
%%sql
DESCRIBE EXTENDED orders
%%sql
DESCRIBE FORMATTED orders
1.9. Role of Spark or Hive Metastore¶
Let us understand the role of the Spark Metastore or Hive Metastore. We need to first understand the details related to the metadata generated for Spark Metastore tables.
When we create a Spark Metastore table, there is metadata associated with it.
Table Name
Column Names and Data Types
Location
File Format
and more
This metadata has to be stored somewhere so that query engines such as Spark SQL can access the information to serve our queries.
Let us understand where the metadata is stored.
The information is typically stored in a relational database and is called the metastore.
It is extensively used by the Hive or Spark SQL engine for syntax and semantics checks as well as execution of queries.
In our case it is stored in a MySQL database. Let us review the details by going through the relevant properties.
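For example, spark.sql.catalogImplementation (mentioned earlier) tells us which catalog implementation is in use; when Hive support is enabled it should show hive.
%%sql
SET spark.sql.catalogImplementation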
1.10. Exercise - Getting Started with Spark SQL¶
Let’s do a simple exercise to conclude this section.
Launch Spark SQL (without connecting to a database) using the spark-sql command. Here is the script for our labs. In other environments, you can skip the last line. I have also included commands to launch Spark using Scala or Python (for CCA 175 certification purposes).
Using Spark SQL
spark2-sql \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Scala
spark2-shell \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Using Pyspark
pyspark2 \
--master yarn \
--conf spark.ui.port=0 \
--conf spark.sql.warehouse.dir=/user/${USER}/warehouse
Create a database and exit (make sure to prefix the database name with your OS username, e.g., training_retail for OS user training)
Launch Spark SQL again, this time connecting to your database
Create the orders table. You can use the below script to create the table.
CREATE TABLE orders (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
List the tables
Describe the table and review the whole metadata