5. Pre-defined Functions¶
Let us go through the pre-defined functions that can be used while processing data. These are typically applied on columns to derive new values from existing column values.
Overview of Functions
Validating Functions
String Manipulation Functions
Date Manipulation Functions
Overview of Numeric Functions
Data Type Conversion
Handling NULL Values
Using CASE and WHEN
Query Example - Word Count
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  config("spark.ui.port", "0").
  config("spark.sql.warehouse.dir", "/user/itversity/warehouse").
  enableHiveSupport.
  appName("Spark SQL - Overview of Functions").
  master("yarn").
  getOrCreate
5.1. Overview of Functions¶
Let us get an overview of the pre-defined functions in Spark SQL.
We can get the list of functions by running
SHOW functions
We can use the DESCRIBE command to get the syntax and semantics of a function, e.g.:
DESCRIBE FUNCTION substr
Following are the categories of functions that are more commonly used.
String Manipulation
Date Manipulation
Numeric Functions
Type Conversion Functions
CASE and WHEN
and more
%%sql
SHOW functions
spark.sql("SHOW functions").show(300, false)
spark.catalog.listFunctions.show(300, false)
%%sql
DESCRIBE FUNCTION substr
spark.sql("DESCRIBE FUNCTION substr").show(false)
5.2. Validating Functions¶
Let us see how we can validate Spark SQL functions.
Spark SQL follows the MySQL style. To validate functions we can just use the SELECT clause, e.g.:
SELECT current_date;
Another example -
SELECT substr('Hello World', 1, 5);
If you want to use the Oracle style, you can create a table named dual and insert one record.
You can also create a temporary view on top of a Data Frame and start writing SQL queries against it. Here are the code snippets using both Scala as well as Pyspark.
Using Scala
val orders = spark.read.
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING").
  csv("/public/retail_db/orders")
orders.createOrReplaceTempView("orders_temp")
Using Python
orders = spark.read. \
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING"). \
  csv("/public/retail_db/orders")
orders.createOrReplaceTempView("orders_temp")
%%sql
SELECT current_date AS current_date
%%sql
SELECT substr('Hello World', 1, 5) AS result
%%sql
USE itversity_retail
%%sql
SELECT current_database()
%%sql
DROP TABLE IF EXISTS dual
%%sql
CREATE TABLE dual (dummy STRING)
%%sql
INSERT INTO dual VALUES ('X')
%%sql
SELECT current_date AS current_date FROM dual
%%sql
SELECT substr('Hello World', 1, 5) AS result FROM dual
Here is how you can validate functions using a Data Frame.
Create a Data Frame.
Create a temporary view using the Data Frame.
Run queries using the view with relevant functions.
val orders = spark.read.
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING").
  csv("/public/retail_db/orders")
orders.createOrReplaceTempView("orders_temp")
%%sql
SELECT o.*, lower(order_status) AS order_status_lower FROM orders_temp AS o LIMIT 10
5.3. String Manipulation Functions¶
We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.
Case Conversion - lower, upper and initcap
Getting size of the column value - length
Extracting Data - substr and split
Trimming and Padding functions - trim, rtrim, ltrim, rpad and lpad
Reversing strings - reverse
Concatenating multiple strings - concat and concat_ws
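As a quick preview before going into the details, here is a minimal sketch that exercises a few of these functions in one query via spark.sql (the literal values are just illustrative):
spark.sql("""
SELECT initcap('hello world') AS initcap_result,
  length('hello world') AS length_result,
  trim('  hello  ') AS trim_result,
  reverse('hello') AS reverse_result,
  concat_ws('-', 'spark', 'sql') AS concat_ws_result
""").show(false)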
5.3.1. Case Conversion and Length¶
Let us understand how to perform case conversion on a string and also get the length of a string.
Case Conversion Functions - lower, upper and initcap
%%sql
SELECT lower('hEllo wOrlD') AS lower_result,
upper('hEllo wOrlD') AS upper_result,
initcap('hEllo wOrlD') AS initcap_result
Getting length - length
%%sql
SELECT length('hEllo wOrlD') AS result
Let us see how to use these functions on top of a table. We will use the orders table which was loaded as part of the last section.
order_status for some of the orders is in lower case; the query below standardizes everything to one case using lower and also gets its length.
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT * FROM orders LIMIT 10
%%sql
SELECT order_id, order_date, order_customer_id,
lower(order_status) AS order_status,
length(order_status) AS order_status_length
FROM orders LIMIT 10
5.3.2. Extracting Data - substr and split¶
Let us understand how to extract data from strings using substr/substring and split.
We can get the syntax and semantics of these functions using DESCRIBE FUNCTION.
We can extract the first four characters from a string using substr or substring.
%%sql
DESCRIBE FUNCTION substr
%%sql
DESCRIBE FUNCTION substring
spark.sql("DESCRIBE FUNCTION substring").show(false)
%%sql
SELECT substr('2013-07-25 00:00:00.0', 1, 4) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 6, 2) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 9, 2) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 12) AS result
Let us see how we can extract the date part from order_date of orders.
%%sql
SELECT * FROM orders LIMIT 10
%%sql
SELECT order_id,
substr(order_date, 1, 10) AS order_date,
order_customer_id,
order_status
FROM orders
Let us understand how to extract information from a string that contains a delimiter.
split converts a delimited string into an array.
%%sql
SELECT split('2013-07-25', '-') AS result
%%sql
SELECT split('2013-07-25', '-')[1] AS result
We can use explode to convert an array into individual records.
%%sql
SELECT explode(split('2013-07-25', '-')) AS result
5.3.3. Trimming and Padding Functions¶
Let us understand how to trim or remove leading and/or trailing spaces in a string.
ltrim is used to remove the spaces on the left side of the string.
rtrim is used to remove the spaces on the right side of the string.
trim is used to remove the spaces on both sides of the string.
%%sql
SELECT ltrim(' Hello World') AS result
%%sql
SELECT rtrim(' Hello World ') AS result
%%sql
SELECT length(trim(' Hello World ')) AS result
Let us understand how to use padding to pad characters to a string.
Let us assume that there are 3 fields - year, month and date which are of type integer.
If we have to concatenate all the 3 fields and create a date, we might have to pad month and date with 0.
lpad is used more often than rpad, especially when we try to build the date from separate columns.
%%sql
SELECT 2013 AS year, 7 AS month, 25 AS myDate
%%sql
SELECT lpad(7, 2, 0) AS result
%%sql
SELECT lpad(10, 2, 0) AS result
%%sql
SELECT lpad(100, 2, 0) AS result
5.3.4. Reverse and Concatenating multiple strings¶
Let us understand how to reverse a string as well as concatenate multiple strings.
We can use reverse to reverse a string.
We can concatenate multiple strings using concat and concat_ws.
concat_ws is typically used if we want the same delimiter between all the strings that are being concatenated.
%%sql
SELECT reverse('Hello World') AS result
%%sql
SELECT concat('Hello ', 'World') AS result
%%sql
SELECT concat('Order Status is ', order_status) AS result
FROM orders LIMIT 10
spark.sql("""
SELECT concat('Order Status is ', order_status) AS result
FROM orders_part LIMIT 10
""").show(false)
%%sql
SELECT * FROM (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql
SELECT concat(year, '-', lpad(month, 2, 0), '-',
lpad(myDate, 2, 0)) AS order_date
FROM
(SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql
SELECT concat_ws('-', year, lpad(month, 2, 0),
lpad(myDate, 2, 0)) AS order_date
FROM
(SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
5.4. Date Manipulation Functions¶
Let us go through some of the important date manipulation functions.
Getting Current Date and Timestamp
Date Arithmetic such as date_add
Getting beginning date or time using trunc or date_trunc
Extracting information using date_format as well as calendar functions
Dealing with unix timestamp using from_unixtime and to_unix_timestamp
5.4.1. Getting Current Date and Timestamp¶
Let us understand how to get the details about current or today’s date as well as current timestamp.
current_date is the function or operator which will return today's date.
current_timestamp is the function or operator which will return the current time up to milliseconds.
These are not like other functions and do not use () at the end.
These are not listed as part of SHOW functions, but we can still get help on them using DESCRIBE.
There is a format associated with date and timestamp.
Date - yyyy-MM-dd
Timestamp - yyyy-MM-dd HH:mm:ss.SSS
Keep in mind that a date or timestamp in Spark SQL is nothing but a special string containing a value in the above specified format. We can apply all string manipulation functions on a date or timestamp.
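For example, as a date follows the yyyy-MM-dd format, we can apply substr on it to pull out the year; a minimal sketch:
spark.sql("SELECT substr(current_date, 1, 4) AS year").show(false)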
%%sql
SELECT current_date AS current_date
%%sql
SELECT current_date() AS current_date
%%sql
SELECT current_timestamp AS current_timestamp
spark.sql("SELECT current_timestamp AS current_timestamp").show(false)
5.4.2. Date Arithmetic¶
Let us understand how to perform arithmetic on dates or timestamps.
date_add can be used to add or subtract days.
date_sub can be used to subtract or add days.
datediff can be used to get the difference between 2 dates.
add_months can be used to add months to a date.
%%sql
SELECT date_add(current_date, 32) AS result
%%sql
SELECT date_add('2018-04-15', 730) AS result
%%sql
SELECT date_add('2018-04-15', -730) AS result
%%sql
SELECT date_sub(current_date, 30) AS result
%%sql
SELECT datediff('2019-03-30', '2017-12-31') AS result
%%sql
SELECT datediff('2017-12-31', '2019-03-30') AS result
%%sql
SELECT add_months(current_date, 3) AS result
%%sql
SELECT add_months('2019-01-31', 1) AS result
%%sql
SELECT add_months('2019-05-31', 1) AS result
%%sql
SELECT add_months(current_timestamp, 3) AS result
%%sql
SELECT date_add(current_timestamp, -730) AS result
5.4.3. Beginning Date or Time - trunc and date_trunc¶
Let us understand how to use trunc and date_trunc on dates or timestamps and get the beginning date of the period.
We can use MM to get the beginning date of the month.
YY can be used to get the beginning date of the year.
We can apply trunc either on a date or a timestamp; however, we cannot apply it at any level other than month or year (such as hour or day).
%%sql
DESCRIBE FUNCTION trunc
spark.sql("DESCRIBE FUNCTION trunc").show(false)
%%sql
SELECT trunc(current_date, 'MM') AS beginning_date_month
%%sql
SELECT trunc('2019-01-23', 'MM') AS beginning_date_month
%%sql
SELECT trunc(current_date, 'YY') AS beginning_date_year
This will not work
%%sql
SELECT trunc(current_timestamp, 'HH') AS doesnt_work
While trunc can only be used to get the beginning date of a given month or year, date_trunc can get the beginning time at granularities down to the second.
spark.sql("DESCRIBE FUNCTION date_trunc").show(false)
%%sql
SELECT date_trunc('HOUR', current_timestamp) AS hour_beginning
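date_trunc also works at other granularities such as DAY or MINUTE; here is a quick sketch, assuming a Spark version (2.3 or later) where date_trunc is available:
spark.sql("SELECT date_trunc('MINUTE', current_timestamp) AS minute_beginning").show(false)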
5.4.4. Extracting information using date_format¶
Let us understand how to use date_format to extract information from a date or timestamp.
Here is how we can get date related information such as year, month, day etc. from a date or timestamp.
spark.sql("DESCRIBE FUNCTION date_format").show(false)
%%sql
SELECT current_timestamp AS current_timestamp
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'yyyy') AS year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'yy') AS year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MM') AS month
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'dd') AS day_of_month
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'DD') AS day_of_year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MMM') AS month_name
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MMMM') AS month_name
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'EE') AS dayname
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'EEEE') AS dayname
Here is how we can get time related information such as hour, minute, seconds, milliseconds etc. from a timestamp.
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'HH') AS hour24
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'hh') AS hour12
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'mm') AS minutes
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'ss') AS seconds
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'SS') AS millis
Here is how we can get the information from date or timestamp in the format we require.
%%sql
SELECT date_format(current_timestamp, 'yyyyMM') AS current_month
%%sql
SELECT date_format(current_timestamp, 'yyyyMMdd') AS current_date
%%sql
SELECT date_format(current_timestamp, 'yyyy/MM/dd') AS current_date
5.4.5. Extracting information - Calendar functions¶
We can get year, month, day etc. from a date or timestamp using calendar functions. There are functions such as day, dayofmonth, month, weekofyear, year etc. available for us.
spark.sql("DESCRIBE FUNCTION day").show(false)
spark.sql("DESCRIBE FUNCTION dayofmonth").show(false)
spark.sql("DESCRIBE FUNCTION month").show(false)
spark.sql("DESCRIBE FUNCTION weekofyear").show(false)
spark.sql("DESCRIBE FUNCTION year").show(false)
Let us see the usage of the functions such as day, dayofmonth, month, weekofyear, year etc.
%%sql
SELECT year(current_date) AS year
%%sql
SELECT month(current_date) AS month
%%sql
SELECT weekofyear(current_date) AS weekofyear
%%sql
SELECT day(current_date) AS day
%%sql
SELECT dayofmonth(current_date) AS dayofmonth
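Depending on the Spark version, there are additional calendar functions such as dayofweek and quarter as well; a quick sketch, assuming a recent enough version:
spark.sql("SELECT dayofweek(current_date) AS dayofweek, quarter(current_date) AS quarter").show(false)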
5.4.6. Dealing with Unix Timestamp¶
Let us go through the functions that can be used to deal with Unix Timestamp.
from_unixtime can be used to convert Unix epoch to a regular timestamp.
unix_timestamp or to_unix_timestamp can be used to convert a timestamp to Unix epoch.
We can get the Unix epoch or Unix timestamp by running date '+%s' in a Unix/Linux terminal.
We can run DESCRIBE on the above functions to get details about them.
Let us see how we can use functions such as from_unixtime, unix_timestamp or to_unix_timestamp to convert between a regular timestamp and a Unix timestamp or epoch.
%%sql
SELECT from_unixtime(1556662731) AS timestamp
%%sql
SELECT to_unix_timestamp('2019-04-30 18:18:51') AS unixtime
%%sql
SELECT from_unixtime(1556662731, 'yyyyMM') AS month
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd') AS date
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd HH:mm') AS timestamp
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd hh:mm') AS timestamp
%%sql
SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd') AS date
%%sql
SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd HH:mm:ss') AS timestamp
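As a sanity check we can round-trip between the two representations; a minimal sketch (unix_timestamp with no arguments returns the current Unix epoch):
spark.sql("SELECT from_unixtime(unix_timestamp()) AS current_time").show(false)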
5.5. Overview of Numeric Functions¶
Here are some of the numeric functions we might use quite often.
abs - always returns a positive number
sum, avg
round - rounds off to the specified precision
ceil, floor - always return an integer
greatest
min, max
rand
pow, sqrt
cume_dist, stddev, variance
Some of the functions highlighted are aggregate functions, e.g. sum, avg, min, max etc.
%%sql
SELECT abs(-10.5), abs(10)
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT order_item_order_id, order_item_subtotal FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT avg(order_item_subtotal) AS order_revenue_avg FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT order_item_order_id,
avg(order_item_subtotal) AS order_revenue_avg
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT order_item_order_id,
sum(order_item_subtotal) AS order_revenue_sum
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT
round(10.58) rnd,
floor(10.58) flr,
ceil(10.58) cl
%%sql
SELECT
round(10.44) rnd1,
round(10.44, 1) rnd2,
round(10.46, 1) rnd3,
floor(10.44) flr,
ceil(10.44) cl
%%sql
SELECT avg(order_item_subtotal) AS order_revenue_avg FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT round(avg(order_item_subtotal), 2) AS order_revenue_avg
FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT order_item_order_id,
round(avg(order_item_subtotal), 2) AS order_revenue_avg
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT order_item_order_id,
round(sum(order_item_subtotal), 2) AS order_revenue_sum
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT greatest(10, 11, 13, -13)
%%sql
SELECT rand() AS rand
%%sql
SELECT cast(round(rand() * 1) AS int) AS random_int
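pow and sqrt from the overview can be validated the same way; a quick sketch:
spark.sql("SELECT pow(2, 10) AS two_power_ten, sqrt(1024) AS sqrt_result").show(false)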
%%sql
SELECT order_item_order_id,
round(sum(order_item_subtotal), 2) AS order_revenue_sum,
min(order_item_subtotal) AS order_item_subtotal_min,
max(order_item_subtotal) AS order_item_subtotal_max
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT order_item_order_id, order_item_subtotal
FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT round(sum(order_item_subtotal), 2) AS order_revenue_sum,
min(order_item_subtotal) AS order_item_subtotal_min,
max(order_item_subtotal) AS order_item_subtotal_max
FROM order_items
WHERE order_item_order_id = 2
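The overview also lists stddev and variance; here is a minimal sketch applying them on the same order, using the order_items table from above:
spark.sql("""
SELECT round(stddev(order_item_subtotal), 2) AS subtotal_stddev,
  round(variance(order_item_subtotal), 2) AS subtotal_variance
FROM order_items
WHERE order_item_order_id = 2
""").show(false)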
5.6. Data Type Conversion¶
Let us understand how we can use type casting to change the data type of an extracted value back to its original type.
%%sql
SELECT current_date AS current_date
%%sql
SELECT split(current_date, '-')[1] AS month
%%sql
SELECT cast(split(current_date, '-')[1] AS INT) AS month
%%sql
SELECT cast('0.04' AS FLOAT) AS result
%%sql
SELECT cast('0.04' AS INT) AS zero
%%sql
SELECT cast('xyz' AS INT) AS returns_null
%%sql
CREATE EXTERNAL TABLE IF NOT EXISTS orders_single_column (
s STRING
) LOCATION '/user/itversity/warehouse/itversity_retail.db/orders'
%%sql
SELECT * FROM orders_single_column LIMIT 10
%%sql
SELECT split(s, ',')[0] AS order_id,
split(s, ',')[1] AS order_date,
split(s, ',')[2] AS order_customer_id,
split(s, ',')[3] AS order_status
FROM orders_single_column LIMIT 10
%%sql
SELECT cast(split(s, ',')[0] AS INT) AS order_id,
cast(split(s, ',')[1] AS TIMESTAMP) AS order_date,
cast(split(s, ',')[2] AS INT) AS order_customer_id,
cast(split(s, ',')[3] AS STRING) AS order_status
FROM orders_single_column LIMIT 10
5.7. Handling NULL Values¶
Let us understand how to handle nulls using specific functions in Spark SQL.
By default, if we try to add or concatenate null to another column, expression or literal, it will return null.
If we want to replace null with some default value, we can use nvl. For non-null values, nvl returns the original expression value.
Replace commission_pct with 0 if it is null.
We can also use coalesce in the place of nvl.
coalesce returns the first non-null value if we pass multiple arguments to it.
nvl2 can be used to perform one action when the value is not null and some other action when the value is null.
We want to increase commission_pct by 1 if it is not null and set commission_pct to 2 if it is null.
We can also use CASE WHEN ELSE END for any conditional logic.
%%sql
SELECT 1 + NULL AS result
%%sql
SELECT concat('Hello', NULL) AS result
%%sql
SELECT nvl(1, 0) nvl, coalesce(1, 0) AS coalesce
%%sql
SELECT nvl(NULL, 0) AS nvl, coalesce(1, 0) AS coalesce
%%sql
SELECT coalesce(NULL, NULL, 2, NULL, 3) AS result
This will not work as nvl takes only 2 arguments
%%sql
SELECT nvl(NULL, NULL, 2, NULL, 3) AS result
%%sql
USE itversity_retail
%%sql
DROP TABLE IF EXISTS sales
%%sql
CREATE TABLE IF NOT EXISTS sales(
sales_person_id INT,
sales_amount FLOAT,
commission_pct INT
)
%%sql
INSERT INTO sales VALUES
(1, 1000, 10),
(2, 1500, 8),
(3, 500, NULL),
(4, 800, 5),
(5, 250, NULL)
%%sql
SELECT * FROM sales
%%sql
SELECT s.*,
nvl(commission_pct, 0) AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
coalesce(commission_pct, 0) AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * commission_pct / 100, 2) AS incorrect_commission_amount
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * nvl(commission_pct, 0) / 100, 2) AS commission_amount
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * coalesce(commission_pct, 0) / 100, 2) AS commission_amount
FROM sales AS s
%%sql
SELECT s.*,
nvl2(commission_pct, commission_pct + 1, 2) AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * nvl2(commission_pct, commission_pct + 1, 2) / 100, 2) AS commission_amount
FROM sales AS s
%%sql
SELECT s.*,
CASE WHEN commission_pct IS NULL
THEN 2
ELSE commission_pct + 1
END AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
CASE WHEN commission_pct IS NOT NULL
THEN commission_pct + 1
ELSE 2
END AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
CASE WHEN commission_pct IS NULL
THEN round((sales_amount * 2 / 100), 2)
ELSE round((sales_amount * (commission_pct + 1)/ 100), 2)
END AS commission_amount
FROM sales AS s
5.8. Using CASE and WHEN¶
At times we might have to select values from multiple columns conditionally.
We can use CASE and WHEN for that.
Let us implement this conditional logic to come up with a derived order_status.
If order_status is COMPLETE or CLOSED, set COMPLETED.
If order_status has PENDING in it, then we will say PENDING.
If order_status has PROCESSING or PAYMENT_REVIEW in it, then we will say PENDING.
We will set all others as OTHER.
We can also have ELSE as part of CASE and WHEN.
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT DISTINCT order_status FROM orders LIMIT 10
%%sql
SELECT o.*,
CASE WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT o.*,
CASE WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
ELSE order_status
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT o.*,
CASE
WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
WHEN order_status LIKE '%PENDING%' THEN 'PENDING'
ELSE 'OTHER'
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT o.*,
CASE
WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
WHEN order_status LIKE '%PENDING%' OR order_status IN ('PROCESSING', 'PAYMENT_REVIEW')
THEN 'PENDING'
ELSE 'OTHER'
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT DISTINCT order_status,
CASE
WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
WHEN order_status LIKE '%PENDING%' OR order_status IN ('PROCESSING', 'PAYMENT_REVIEW')
THEN 'PENDING'
ELSE 'OTHER'
END AS updated_order_status
FROM orders
ORDER BY updated_order_status
5.9. Query Example - Word Count¶
Let us see how we can perform word count using Spark SQL. Using word count as an example, we will understand how to come up with a solution using the available pre-defined functions.
Create a table named lines.
Insert data into the table.
Split lines into array of words.
Explode array of words from each line into individual records.
Use GROUP BY and get the count. We cannot use GROUP BY directly on exploded records and hence we need to use a nested subquery.
%%sql
DROP DATABASE IF EXISTS itversity_demo CASCADE
%%sql
CREATE DATABASE IF NOT EXISTS itversity_demo
%%sql
USE itversity_demo
%%sql
CREATE TABLE lines (s STRING)
%%sql
INSERT INTO lines VALUES
('Hello World'),
('How are you'),
('Let us perform the word count'),
('The definition of word count is'),
('to get the count of each word from this data')
%%sql
SELECT * FROM lines
%%sql
SELECT split(s, ' ') AS word_array FROM lines
spark.sql("SHOW functions").show(300, false)
%%sql
SELECT explode(split(s, ' ')) AS words FROM lines
%%sql
SELECT count(1) FROM (SELECT explode(split(s, ' ')) AS words FROM lines)
This will not work as explode cannot be used directly in GROUP BY
%%sql
SELECT explode(split(s, ' ')) AS words, count(1) FROM lines
GROUP BY explode(split(s, ' '))
%%sql
SELECT word, count(1) FROM (
SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word
%%sql
SELECT count(1) FROM
(
SELECT word, count(1) FROM (
SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word
)
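For reference, here is a minimal sketch of the same word count using the Data Frame API in Scala, assuming the lines table created above:
import org.apache.spark.sql.functions.{col, explode, split}

// Read the lines table, explode each line into words, then count per word
spark.table("lines").
  select(explode(split(col("s"), " ")).alias("word")).
  groupBy("word").
  count().
  show(false)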