5. Pre-defined Functions¶
Let us go through the pre-defined functions that can be used while processing data. These are typically applied on columns to derive new values from existing column values.
Overview of Functions
Validating Functions
String Manipulation Functions
Date Manipulation Functions
Overview of Numeric Functions
Data Type Conversion
Handling NULL Values
Using CASE and WHEN
Query Example - Word Count
import org.apache.spark.sql.SparkSession

val spark = SparkSession.
  builder.
  config("spark.ui.port", "0").
  config("spark.sql.warehouse.dir", "/user/itversity/warehouse").
  enableHiveSupport.
  appName("Spark SQL - Overview of Functions").
  master("yarn").
  getOrCreate
5.1. Overview of Functions¶
Let us get an overview of the pre-defined functions in Spark SQL.
We can get the list of functions by running
SHOW functions
We can use the DESCRIBE command to get the syntax and semantics of a function, e.g.:
DESCRIBE FUNCTION substr
Following are the categories of functions that are more commonly used.
String Manipulation
Date Manipulation
Numeric Functions
Type Conversion Functions
CASE and WHEN
and more
%%sql
SHOW functions
spark.sql("SHOW functions").show(300, false)
spark.catalog.listFunctions.show(300, false)
%%sql
DESCRIBE FUNCTION substr
spark.sql("DESCRIBE FUNCTION substr").show(false)
5.2. Validating Functions¶
Let us see how we can validate Spark SQL functions.
Spark SQL follows the MySQL style. To validate functions we can just use the SELECT clause, e.g.:
SELECT current_date;
Another example -
SELECT substr('Hello World', 1, 5);
If you want to use the Oracle style, you can create a table named dual and insert one record.
You can also create a temporary view on top of a Data Frame and start writing SQL queries against it. Here are the code snippets using both Scala as well as Pyspark.
Using Scala
val orders = spark.read.
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING").
  csv("/public/retail_db/orders")
orders.createOrReplaceTempView("orders_temp")
Using Python
orders = spark.read. \
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING"). \
  csv("/public/retail_db/orders")
orders.createOrReplaceTempView("orders_temp")
%%sql
SELECT current_date AS current_date
%%sql
SELECT substr('Hello World', 1, 5) AS result
%%sql
USE itversity_retail
%%sql
SELECT current_database()
%%sql
DROP TABLE IF EXISTS dual
%%sql
CREATE TABLE dual (dummy STRING)
%%sql
INSERT INTO dual VALUES ('X')
%%sql
SELECT current_date AS current_date FROM dual
%%sql
SELECT substr('Hello World', 1, 5) AS result FROM dual
Here is how you can validate functions using a Data Frame.
Create a Data Frame.
Create a temporary view using the Data Frame.
Run queries using the view with relevant functions.
val orders = spark.read.
  schema("order_id INT, order_date STRING, order_customer_id INT, order_status STRING").
  csv("/public/retail_db/orders")
orders.createOrReplaceTempView("orders_temp")
%%sql
SELECT o.*, lower(order_status) AS order_status_lower FROM orders_temp AS o LIMIT 10
5.3. String Manipulation Functions¶
We use string manipulation functions quite extensively. Here are some of the important functions which we typically use.
Case Conversion - lower, upper and initcap
Getting size of the column value - length
Extracting Data - substr and split
Trimming and Padding functions - trim, rtrim, ltrim, rpad and lpad
Reversing strings - reverse
Concatenating multiple strings - concat and concat_ws
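As a quick preview before going into the details, here is a minimal sketch that exercises a few of these functions in one query via spark.sql (the literal values are just illustrative):
spark.sql("""
SELECT initcap('hello world') AS initcap_result,
  length('hello world') AS length_result,
  trim('  hello  ') AS trim_result,
  reverse('hello') AS reverse_result,
  concat_ws('-', 'spark', 'sql') AS concat_ws_result
""").show(false)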
5.3.1. Case Conversion and Length¶
Let us understand how to perform case conversion on a string and also get the length of a string.
Case Conversion Functions - lower, upper and initcap
%%sql
SELECT lower('hEllo wOrlD') AS lower_result,
upper('hEllo wOrlD') AS upper_result,
initcap('hEllo wOrlD') AS initcap_result
Getting length - length
%%sql
SELECT length('hEllo wOrlD') AS result
Let us see how to use these functions on top of a table. We will use the orders table which was loaded as part of the last section.
order_status for some of the orders is in lower case; the query below standardizes everything to one case using lower and also gets its length.
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT * FROM orders LIMIT 10
%%sql
SELECT order_id, order_date, order_customer_id,
lower(order_status) AS order_status,
length(order_status) AS order_status_length
FROM orders LIMIT 10
5.3.2. Extracting Data - substr and split¶
Let us understand how to extract data from strings using substr/substring and split.
We can get the syntax and semantics of these functions using DESCRIBE FUNCTION.
We can extract the first four characters from a string using substr or substring.
%%sql
DESCRIBE FUNCTION substr
%%sql
DESCRIBE FUNCTION substring
spark.sql("DESCRIBE FUNCTION substring").show(false)
%%sql
SELECT substr('2013-07-25 00:00:00.0', 1, 4) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 6, 2) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 9, 2) AS result
%%sql
SELECT substr('2013-07-25 00:00:00.0', 12) AS result
Let us see how we can extract the date part from order_date of orders.
%%sql
SELECT * FROM orders LIMIT 10
%%sql
SELECT order_id,
substr(order_date, 1, 10) AS order_date,
order_customer_id,
order_status
FROM orders
Let us understand how to extract information from a string that contains a delimiter.
split converts a delimited string into an array.
%%sql
SELECT split('2013-07-25', '-') AS result
%%sql
SELECT split('2013-07-25', '-')[1] AS result
We can use explode to convert an array into individual records.
%%sql
SELECT explode(split('2013-07-25', '-')) AS result
5.3.3. Trimming and Padding Functions¶
Let us understand how to trim or remove leading and/or trailing spaces in a string.
ltrim is used to remove the spaces on the left side of the string.
rtrim is used to remove the spaces on the right side of the string.
trim is used to remove the spaces on both sides of the string.
%%sql
SELECT ltrim(' Hello World') AS result
%%sql
SELECT rtrim(' Hello World ') AS result
%%sql
SELECT length(trim(' Hello World ')) AS result
Let us understand how to use padding to pad characters to a string.
Let us assume that there are 3 fields - year, month and date which are of type integer.
If we have to concatenate all the 3 fields and create a date, we might have to pad month and date with 0.
lpad is used more often than rpad, especially when we try to build the date from separate columns.
%%sql
SELECT 2013 AS year, 7 AS month, 25 AS myDate
%%sql
SELECT lpad(7, 2, 0) AS result
%%sql
SELECT lpad(10, 2, 0) AS result
%%sql
SELECT lpad(100, 2, 0) AS result
5.3.4. Reverse and Concatenating multiple strings¶
Let us understand how to reverse a string as well as concatenate multiple strings.
We can use reverse to reverse a string.
We can concatenate multiple strings using concat and concat_ws.
concat_ws is typically used if we want the same delimiter between all the strings that are being concatenated.
%%sql
SELECT reverse('Hello World') AS result
%%sql
SELECT concat('Hello ', 'World') AS result
%%sql
SELECT concat('Order Status is ', order_status) AS result
FROM orders LIMIT 10
spark.sql("""
SELECT concat('Order Status is ', order_status) AS result
FROM orders_part LIMIT 10
""").show(false)
%%sql
SELECT * FROM (SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql
SELECT concat(year, '-', lpad(month, 2, 0), '-',
lpad(myDate, 2, 0)) AS order_date
FROM
(SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
%%sql
SELECT concat_ws('-', year, lpad(month, 2, 0),
lpad(myDate, 2, 0)) AS order_date
FROM
(SELECT 2013 AS year, 7 AS month, 25 AS myDate) q
5.4. Date Manipulation Functions¶
Let us go through some of the important date manipulation functions.
Getting Current Date and Timestamp
Date Arithmetic such as date_add
Getting beginning date or time using trunc or date_trunc
Extracting information using date_format as well as calendar functions
Dealing with unix timestamp using from_unixtime and to_unix_timestamp
5.4.1. Getting Current Date and Timestamp¶
Let us understand how to get the details about current or today’s date as well as current timestamp.
current_date is the function or operator which will return today's date.
current_timestamp is the function or operator which will return the current time up to milliseconds.
These are not like other functions and do not use () at the end.
These are not listed as part of SHOW functions, but we can still get help on them using DESCRIBE.
There is a format associated with date and timestamp.
Date - yyyy-MM-dd
Timestamp - yyyy-MM-dd HH:mm:ss.SSS
Keep in mind that a date or timestamp in Spark SQL is nothing but a special string containing a value in the above specified format. We can apply all string manipulation functions on a date or timestamp.
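For example, as a date follows the yyyy-MM-dd format, we can apply substr on it to pull out the year; a minimal sketch:
spark.sql("SELECT substr(current_date, 1, 4) AS year").show(false)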
%%sql
SELECT current_date AS current_date
%%sql
SELECT current_date() AS current_date
%%sql
SELECT current_timestamp AS current_timestamp
spark.sql("SELECT current_timestamp AS current_timestamp").show(false)
5.4.2. Date Arithmetic¶
Let us understand how to perform arithmetic on dates or timestamps.
date_add can be used to add or subtract days.
date_sub can be used to subtract or add days.
datediff can be used to get the difference between 2 dates.
add_months can be used to add months to a date.
%%sql
SELECT date_add(current_date, 32) AS result
%%sql
SELECT date_add('2018-04-15', 730) AS result
%%sql
SELECT date_add('2018-04-15', -730) AS result
%%sql
SELECT date_sub(current_date, 30) AS result
%%sql
SELECT datediff('2019-03-30', '2017-12-31') AS result
%%sql
SELECT datediff('2017-12-31', '2019-03-30') AS result
%%sql
SELECT add_months(current_date, 3) AS result
%%sql
SELECT add_months('2019-01-31', 1) AS result
%%sql
SELECT add_months('2019-05-31', 1) AS result
%%sql
SELECT add_months(current_timestamp, 3) AS result
%%sql
SELECT date_add(current_timestamp, -730) AS result
5.4.3. Beginning Date or Time - trunc and date_trunc¶
Let us understand how to use trunc and date_trunc on dates or timestamps and get the beginning date of the period.
We can use MM to get the beginning date of the month.
YY can be used to get the beginning date of the year.
We can apply trunc either on a date or a timestamp; however, we cannot apply it at any level other than month or year (such as hour or day).
%%sql
DESCRIBE FUNCTION trunc
spark.sql("DESCRIBE FUNCTION trunc").show(false)
%%sql
SELECT trunc(current_date, 'MM') AS beginning_date_month
%%sql
SELECT trunc('2019-01-23', 'MM') AS beginning_date_month
%%sql
SELECT trunc(current_date, 'YY') AS beginning_date_year
This will not work
%%sql
SELECT trunc(current_timestamp, 'HH') AS doesnt_work
While trunc can only be used to get the beginning date of a given month or year, date_trunc can get the beginning time at granularities down to the second.
spark.sql("DESCRIBE FUNCTION date_trunc").show(false)
%%sql
SELECT date_trunc('HOUR', current_timestamp) AS hour_beginning
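date_trunc also works at other granularities such as DAY or MINUTE; here is a quick sketch, assuming a Spark version (2.3 or later) where date_trunc is available:
spark.sql("SELECT date_trunc('MINUTE', current_timestamp) AS minute_beginning").show(false)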
5.4.4. Extracting information using date_format¶
Let us understand how to use date_format to extract information from a date or timestamp.
Here is how we can get date related information such as year, month, day etc. from a date or timestamp.
spark.sql("DESCRIBE FUNCTION date_format").show(false)
%%sql
SELECT current_timestamp AS current_timestamp
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'yyyy') AS year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'yy') AS year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MM') AS month
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'dd') AS day_of_month
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'DD') AS day_of_year
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MMM') AS month_name
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'MMMM') AS month_name
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'EE') AS dayname
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'EEEE') AS dayname
Here is how we can get time related information such as hour, minute, seconds, milliseconds etc. from a timestamp.
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'HH') AS hour24
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'hh') AS hour12
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'mm') AS minutes
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'ss') AS seconds
%%sql
SELECT current_timestamp AS current_timestamp,
date_format(current_timestamp, 'SS') AS millis
Here is how we can get the information from date or timestamp in the format we require.
%%sql
SELECT date_format(current_timestamp, 'yyyyMM') AS current_month
%%sql
SELECT date_format(current_timestamp, 'yyyyMMdd') AS current_date
%%sql
SELECT date_format(current_timestamp, 'yyyy/MM/dd') AS current_date
5.4.5. Extracting information - Calendar functions¶
We can get year, month, day etc. from a date or timestamp using calendar functions. There are functions such as day, dayofmonth, month, weekofyear, year etc. available for us.
spark.sql("DESCRIBE FUNCTION day").show(false)
spark.sql("DESCRIBE FUNCTION dayofmonth").show(false)
spark.sql("DESCRIBE FUNCTION month").show(false)
spark.sql("DESCRIBE FUNCTION weekofyear").show(false)
spark.sql("DESCRIBE FUNCTION year").show(false)
Let us see the usage of the functions such as day, dayofmonth, month, weekofyear, year etc.
%%sql
SELECT year(current_date) AS year
%%sql
SELECT month(current_date) AS month
%%sql
SELECT weekofyear(current_date) AS weekofyear
%%sql
SELECT day(current_date) AS day
%%sql
SELECT dayofmonth(current_date) AS dayofmonth
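Depending on the Spark version, there are additional calendar functions such as dayofweek and quarter as well; a quick sketch, assuming a recent enough version:
spark.sql("SELECT dayofweek(current_date) AS dayofweek, quarter(current_date) AS quarter").show(false)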
5.4.6. Dealing with Unix Timestamp¶
Let us go through the functions that can be used to deal with Unix Timestamp.
from_unixtime can be used to convert Unix epoch to a regular timestamp.
unix_timestamp or to_unix_timestamp can be used to convert a timestamp to Unix epoch.
We can get the Unix epoch or Unix timestamp by running date '+%s' in a Unix/Linux terminal.
We can run DESCRIBE on the above functions to get details about them.
Let us see how we can use functions such as from_unixtime, unix_timestamp or to_unix_timestamp to convert between a regular timestamp and a Unix timestamp or epoch.
%%sql
SELECT from_unixtime(1556662731) AS timestamp
%%sql
SELECT to_unix_timestamp('2019-04-30 18:18:51') AS unixtime
%%sql
SELECT from_unixtime(1556662731, 'yyyyMM') AS month
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd') AS date
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd HH:mm') AS timestamp
%%sql
SELECT from_unixtime(1556662731, 'yyyy-MM-dd hh:mm') AS timestamp
%%sql
SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd') AS date
%%sql
SELECT to_unix_timestamp('20190430 18:18:51', 'yyyyMMdd HH:mm:ss') AS timestamp
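As a sanity check we can round-trip between the two representations; a minimal sketch (unix_timestamp with no arguments returns the current Unix epoch):
spark.sql("SELECT from_unixtime(unix_timestamp()) AS current_time").show(false)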
5.5. Overview of Numeric Functions¶
Here are some of the numeric functions we might use quite often.
abs - always returns a positive number
sum, avg
round - rounds off to the specified precision
ceil, floor - always return an integer
greatest
min, max
rand
pow, sqrt
cume_dist, stddev, variance
Some of the functions highlighted are aggregate functions, e.g. sum, avg, min, max etc.
%%sql
SELECT abs(-10.5), abs(10)
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT order_item_order_id, order_item_subtotal FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT avg(order_item_subtotal) AS order_revenue_avg FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT order_item_order_id,
avg(order_item_subtotal) AS order_revenue_avg
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT order_item_order_id,
sum(order_item_subtotal) AS order_revenue_sum
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT
round(10.58) rnd,
floor(10.58) flr,
ceil(10.58) cl
%%sql
SELECT
round(10.44) rnd1,
round(10.44, 1) rnd2,
round(10.46, 1) rnd3,
floor(10.44) flr,
ceil(10.44) cl
%%sql
SELECT avg(order_item_subtotal) AS order_revenue_avg FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT round(avg(order_item_subtotal), 2) AS order_revenue_avg
FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT order_item_order_id,
round(avg(order_item_subtotal), 2) AS order_revenue_avg
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT order_item_order_id,
round(sum(order_item_subtotal), 2) AS order_revenue_sum
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT greatest(10, 11, 13, -13)
%%sql
SELECT rand() AS rand
%%sql
SELECT cast(round(rand() * 1) AS int) AS random_int
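pow and sqrt from the overview can be validated the same way; a quick sketch:
spark.sql("SELECT pow(2, 10) AS two_power_ten, sqrt(1024) AS sqrt_result").show(false)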
%%sql
SELECT order_item_order_id,
round(sum(order_item_subtotal), 2) AS order_revenue_sum,
min(order_item_subtotal) AS order_item_subtotal_min,
max(order_item_subtotal) AS order_item_subtotal_max
FROM order_items
GROUP BY order_item_order_id
LIMIT 10
%%sql
SELECT order_item_order_id, order_item_subtotal
FROM order_items
WHERE order_item_order_id = 2
%%sql
SELECT round(sum(order_item_subtotal), 2) AS order_revenue_sum,
min(order_item_subtotal) AS order_item_subtotal_min,
max(order_item_subtotal) AS order_item_subtotal_max
FROM order_items
WHERE order_item_order_id = 2
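The overview also lists stddev and variance; here is a minimal sketch applying them on the same order, using the order_items table from above:
spark.sql("""
SELECT round(stddev(order_item_subtotal), 2) AS subtotal_stddev,
  round(variance(order_item_subtotal), 2) AS subtotal_variance
FROM order_items
WHERE order_item_order_id = 2
""").show(false)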
5.6. Data Type Conversion¶
Let us understand how we can use type casting to change the data type of an extracted value back to its original type.
%%sql
SELECT current_date AS current_date
%%sql
SELECT split(current_date, '-')[1] AS month
%%sql
SELECT cast(split(current_date, '-')[1] AS INT) AS month
%%sql
SELECT cast('0.04' AS FLOAT) AS result
%%sql
SELECT cast('0.04' AS INT) AS zero
%%sql
SELECT cast('xyz' AS INT) AS returns_null
%%sql
CREATE EXTERNAL TABLE IF NOT EXISTS orders_single_column (
s STRING
) LOCATION '/user/itversity/warehouse/itversity_retail.db/orders'
%%sql
SELECT * FROM orders_single_column LIMIT 10
%%sql
SELECT split(s, ',')[0] AS order_id,
split(s, ',')[1] AS order_date,
split(s, ',')[2] AS order_customer_id,
split(s, ',')[3] AS order_status
FROM orders_single_column LIMIT 10
%%sql
SELECT cast(split(s, ',')[0] AS INT) AS order_id,
cast(split(s, ',')[1] AS TIMESTAMP) AS order_date,
cast(split(s, ',')[2] AS INT) AS order_customer_id,
cast(split(s, ',')[3] AS STRING) AS order_status
FROM orders_single_column LIMIT 10
5.7. Handling NULL Values¶
Let us understand how to handle nulls using specific functions in Spark SQL.
By default, if we try to add or concatenate null to another column, expression or literal, it will return null.
If we want to replace null with some default value, we can use nvl. For non-null values, nvl returns the original expression value.
Replace commission_pct with 0 if it is null.
We can also use coalesce in the place of nvl.
coalesce returns the first non-null value if we pass multiple arguments to it.
nvl2 can be used to perform one action when the value is not null and some other action when the value is null.
We want to increase commission_pct by 1 if it is not null and set commission_pct to 2 if it is null.
We can also use CASE WHEN ELSE END for any conditional logic.
%%sql
SELECT 1 + NULL AS result
%%sql
SELECT concat('Hello', NULL) AS result
%%sql
SELECT nvl(1, 0) nvl, coalesce(1, 0) AS coalesce
%%sql
SELECT nvl(NULL, 0) AS nvl, coalesce(1, 0) AS coalesce
%%sql
SELECT coalesce(NULL, NULL, 2, NULL, 3) AS result
This will not work as nvl takes only 2 arguments
%%sql
SELECT nvl(NULL, NULL, 2, NULL, 3) AS result
%%sql
USE itversity_retail
%%sql
DROP TABLE IF EXISTS sales
%%sql
CREATE TABLE IF NOT EXISTS sales(
sales_person_id INT,
sales_amount FLOAT,
commission_pct INT
)
%%sql
INSERT INTO sales VALUES
(1, 1000, 10),
(2, 1500, 8),
(3, 500, NULL),
(4, 800, 5),
(5, 250, NULL)
%%sql
SELECT * FROM sales
%%sql
SELECT s.*,
nvl(commission_pct, 0) AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
coalesce(commission_pct, 0) AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * commission_pct / 100, 2) AS incorrect_commission_amount
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * nvl(commission_pct, 0) / 100, 2) AS commission_amount
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * coalesce(commission_pct, 0) / 100, 2) AS commission_amount
FROM sales AS s
%%sql
SELECT s.*,
nvl2(commission_pct, commission_pct + 1, 2) AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
round(sales_amount * nvl2(commission_pct, commission_pct + 1, 2) / 100, 2) AS commission_amount
FROM sales AS s
%%sql
SELECT s.*,
CASE WHEN commission_pct IS NULL
THEN 2
ELSE commission_pct + 1
END AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
CASE WHEN commission_pct IS NOT NULL
THEN commission_pct + 1
ELSE 2
END AS commission_pct
FROM sales AS s
%%sql
SELECT s.*,
CASE WHEN commission_pct IS NULL
THEN round((sales_amount * 2 / 100), 2)
ELSE round((sales_amount * (commission_pct + 1)/ 100), 2)
END AS commission_amount
FROM sales AS s
5.8. Using CASE and WHEN¶
At times we might have to select values from multiple columns conditionally.
We can use CASE and WHEN for that.
Let us implement this conditional logic to come up with a derived order_status.
If order_status is COMPLETE or CLOSED, set COMPLETED.
If order_status has PENDING in it, then we will say PENDING.
If order_status has PROCESSING or PAYMENT_REVIEW in it, then we will say PENDING.
We will set all others as OTHER.
We can also have ELSE as part of CASE and WHEN.
%%sql
USE itversity_retail
%%sql
SHOW tables
%%sql
SELECT DISTINCT order_status FROM orders LIMIT 10
%%sql
SELECT o.*,
CASE WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT o.*,
CASE WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
ELSE order_status
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT o.*,
CASE
WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
WHEN order_status LIKE '%PENDING%' THEN 'PENDING'
ELSE 'OTHER'
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT o.*,
CASE
WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
WHEN order_status LIKE '%PENDING%' OR order_status IN ('PROCESSING', 'PAYMENT_REVIEW')
THEN 'PENDING'
ELSE 'OTHER'
END AS updated_order_status
FROM orders o
LIMIT 10
%%sql
SELECT DISTINCT order_status,
CASE
WHEN order_status IN ('COMPLETE', 'CLOSED') THEN 'COMPLETED'
WHEN order_status LIKE '%PENDING%' OR order_status IN ('PROCESSING', 'PAYMENT_REVIEW')
THEN 'PENDING'
ELSE 'OTHER'
END AS updated_order_status
FROM orders
ORDER BY updated_order_status
5.9. Query Example - Word Count¶
Let us see how we can perform word count using Spark SQL. Using word count as an example, we will understand how to come up with a solution using the available pre-defined functions.
Create a table named lines.
Insert data into the table.
Split lines into array of words.
Explode array of words from each line into individual records.
Use GROUP BY and get the count. We cannot use GROUP BY directly on exploded records and hence we need to use a nested subquery.
%%sql
DROP DATABASE IF EXISTS itversity_demo CASCADE
%%sql
CREATE DATABASE IF NOT EXISTS itversity_demo
%%sql
USE itversity_demo
%%sql
CREATE TABLE lines (s STRING)
%%sql
INSERT INTO lines VALUES
('Hello World'),
('How are you'),
('Let us perform the word count'),
('The definition of word count is'),
('to get the count of each word from this data')
%%sql
SELECT * FROM lines
%%sql
SELECT split(s, ' ') AS word_array FROM lines
spark.sql("SHOW functions").show(300, false)
%%sql
SELECT explode(split(s, ' ')) AS words FROM lines
%%sql
SELECT count(1) FROM (SELECT explode(split(s, ' ')) AS words FROM lines)
This will not work as explode cannot be used directly in GROUP BY
%%sql
SELECT explode(split(s, ' ')) AS words, count(1) FROM lines
GROUP BY explode(split(s, ' '))
%%sql
SELECT word, count(1) FROM (
SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word
%%sql
SELECT count(1) FROM
(
SELECT word, count(1) FROM (
SELECT explode(split(s, ' ')) AS word FROM lines
) q
GROUP BY word
)
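For reference, here is a minimal sketch of the same word count using the Data Frame API in Scala, assuming the lines table created above:
import org.apache.spark.sql.functions.{col, explode, split}

// Read the lines table, explode each line into words, then count per word
spark.table("lines").
  select(explode(split(col("s"), " ")).alias("word")).
  groupBy("word").
  count().
  show(false)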