2. Quick Recap of Python

Let us quickly recap some of the core programming concepts of Python before we get into Spark.

2.1. Data Engineering Life Cycle

Let us first understand the Data Engineering Life Cycle. We typically read the data, process it by applying business rules, and write the data back to different targets. A minimal sketch follows the list below.

  • Read the data from different sources.

    • Files

    • Databases

    • Mainframes

    • APIs

  • Process the data.

    • Row Level Transformations

    • Aggregations

    • Sorting

    • Ranking

    • Joining multiple data sets

  • Write data to different targets.

    • Files

    • Databases

    • Mainframes

    • APIs
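
The sketch below ties the three phases together in plain Python: read from a source, apply a row level transformation, and write to a target. The paths are illustrative.

# read: load lines from a source file (illustrative path)
orders = open("/data/retail_db/orders/part-00000"). \
    read(). \
    splitlines()
# process: a row level transformation extracting the order status
statuses = map(lambda o: o.split(",")[3], orders)
# write: save the results to a target file (illustrative path)
with open("/tmp/order_statuses.txt", "w") as target:
    target.write("\n".join(statuses))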

2.2. Python CLI or Jupyter Notebook

We can use the Python CLI or Jupyter Notebook to explore APIs.

  • We can launch the Python CLI using the python command.

  • We can launch the Jupyter Notebook using the jupyter notebook command.

  • A web server will be started on port 8888 by default.

  • We can go to the browser and connect to the web server using the IP address and port number.

  • We should be able to explore code in an interactive fashion.

  • We can issue magic commands such as %%sh to run shell commands, %%markdown to render Markdown, etc., as shown below.
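
For example, a cell starting with %%sh runs its contents as a shell script; note that magic commands work only in Jupyter or IPython, not in the plain Python CLI.

%%sh
ls -ltr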

2.2.1. Tasks

Let us perform these tasks to recollect how to use the Python CLI or Jupyter Notebook.

  • Create variables i and j assigning 10 and 20.5 respectively.

i = 10
j = 20.5
  • Add the values and assign the result to res.

res = i + j
print(str(res))
  • Get the type of i, j and res.

type(i)
type(j)
type(res)
  • Get the help on int.

help(int)
  • Get the help on startswith that is available on str.

help(str.startswith)

2.3. Basic Programming Constructs

Let us recollect some of the basic programming constructs of Python.

  • Comparison Operations (==, !=, <, >, <=, >=, etc)

    • All comparison operators return True or False (a Boolean value).

  • Conditionals (if)

    • We typically use comparison operators as part of conditionals.

  • Loops (for)

    • We can iterate through a collection using for i in l, where l is a standard collection such as a list or set.

    • Python provides a special function called range, which returns a sequence of integers within the given bounds. It excludes the upper bound value (see the example after this list).

  • In Python, scope is defined by indentation.
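
A quick illustration of these constructs, using nothing beyond the standard library; note how range excludes the upper bound.

for i in range(1, 5):    # yields 1, 2, 3 and 4 -- the upper bound 5 is excluded
    if i >= 3:           # comparison operator used as part of a conditional
        print(i)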

2.3.1. Tasks

Let us perform a few tasks to quickly recap the basic programming constructs of Python.

  • Get all the odd numbers between 1 and 15.

list(range(1, 16, 2))
  • Print the numbers from the above list which are divisible by 3.

for i in range(1, 16, 2):
    if i % 3 == 0:
        print(i)

2.4. Developing Functions

Let us understand how to develop functions using Python as the programming language.

  • A function starts with the def keyword followed by the function name.

  • Parameters can be of different types.

    • Required

    • Keyword

    • Variable Number

    • Functions

  • Functions which take another function as an argument are called higher order functions, as illustrated in the sketch below.
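
The sketch below illustrates these parameter types; the function names are made up for illustration.

def greet(name, greeting="Hello", *others):
    # name is required, greeting is a keyword parameter with a default,
    # and *others captures a variable number of positional arguments
    print(greeting + ", " + name + "!", others)

def apply(f, value):
    # higher order function: f is itself a function
    return f(value)

greet("World")
greet("World", "Hi", "extra")
apply(len, "hello")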

2.4.1. Tasks

Let us perform a few tasks to understand how to develop functions in Python.

  • Sum of integers between lower bound and upper bound using a formula.

def sumOfN(n):
    # sum of the first n integers using the formula n * (n + 1) / 2
    return n * (n + 1) // 2

sumOfN(10)
def sumOfIntegers(lb, ub):
    # difference of two prefix sums gives the sum between the bounds
    return sumOfN(ub) - sumOfN(lb - 1)

sumOfIntegers(5, 10)
  • Sum of integers between lower bound and upper bound using loops.

def sumOfIntegers(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e
    return total

sumOfIntegers(1, 10)
  • Sum of squares of integers between lower bound and upper bound using loops.

def sumOfSquares(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e * e
    return total

sumOfSquares(2, 4)
  • Sum of the even numbers between lower bound and upper bound using loops.

def sumOfEvens(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e if e % 2 == 0 else 0
    return total

sumOfEvens(2, 4)

2.5. Lambda Functions

Let us recap details related to lambda functions.

  • We can develop functions without names. They are called lambda functions, also known as anonymous functions.

  • We typically pass them as arguments to higher order functions, which take functions as arguments. A quick example follows.
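
A minimal illustration with made-up names: a named function and its lambda equivalent.

def double(i):
    return 2 * i

double_l = lambda i: 2 * i  # the same logic as an anonymous function
double(5) == double_l(5)    # True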

2.5.1. Tasks

Let us perform a few tasks related to lambda functions.

  • Create a generic function mySum which performs arithmetic on integers within a range.

    • It takes 3 arguments - lb, ub and f.

    • Function f should be invoked inside the function on each element within the range.

def mySum(lb, ub, f):
    total = 0
    for e in range(lb, ub + 1):
        total += f(e)
    return total
  • Sum of integers between lower bound and upper bound using mySum.

mySum(2, 4, lambda i: i)
  • Sum of squares of integers between lower bound and upper bound using mySum.

mySum(2, 4, lambda i: i * i)
  • Sum of the even numbers between lower bound and upper bound using mySum.

mySum(2, 4, lambda i: i if i%2 == 0 else 0)

2.6. Overview of Collections and Tuples

Let’s quickly recap Collections and Tuples in Python. We will primarily talk about collections and tuples that come as part of the Python standard library, such as list, set, dict and tuple.

  • Group of elements with length and index - list

  • Group of unique elements - set

  • Group of key value pairs - dict

  • While list, set and dict contain groups of homogeneous elements, a tuple contains a group of heterogeneous elements.

  • We can consider list, set and dict as a table in a database, and a tuple as a row or record in a given table.

  • Typically we create a list of tuples or a set of tuples; a dict is nothing but a collection of two-element tuples where the key is unique.

  • We typically use Map Reduce APIs to process the data in collections. There are also pre-defined functions such as len, sum, min and max for aggregating data in collections. A short example follows.
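
The snippet below sketches these ideas with made-up data: list, set, dict and tuple, a list of tuples as table rows, and map along with pre-defined functions for processing.

l = [1, 2, 3, 3]                           # list: indexed, allows duplicates
s = {1, 2, 3}                              # set: unique elements only
d = {1: "CLOSED", 2: "COMPLETE"}           # dict: key value pairs, keys unique
orders = [(1, "CLOSED"), (2, "COMPLETE")]  # list of tuples, like rows in a table
list(map(lambda o: o[1], orders))          # project the second field of each row
sum(l)                                     # pre-defined aggregation such as sum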

2.6.1. Tasks

Let us perform a few tasks to quickly recap details about Collections and Tuples in Python. We will also quickly recap the Map Reduce APIs.

  • Create a collection of orders by reading data from a file.

%%sh
ls -ltr /data/retail_db/orders/part-00000
orders_path = "/data/retail_db/orders/part-00000"
orders = open(orders_path). \
    read(). \
    splitlines()
  • Get all unique order statuses. Make sure the data is sorted in alphabetical order.

sorted(set(map(lambda o: o.split(",")[3], orders)))
  • Get count of all unique dates.

len(set(map(lambda o: o.split(",")[1], orders)))
  • Sort the data in orders in ascending order by order_customer_id and then order_date.

sorted(orders, key=lambda k: (int(k.split(",")[2]), k.split(",")[1]))
  • Create a collection of order_items by reading data from a file.

order_items_path = "/data/retail_db/order_items/part-00000"
order_items = open(order_items_path). \
    read(). \
    splitlines()
  • Get revenue for a given order_item_order_id.

def get_order_revenue(order_items, order_id):
    # filter for the given order id
    order_items_filtered = filter(lambda oi:
                                  int(oi.split(",")[1]) == order_id,
                                  order_items
                                 )
    order_items_map = map(lambda oi: 
                          float(oi.split(",")[4]), 
                          order_items_filtered
                         )
    return round(sum(order_items_map), 2)

get_order_revenue(order_items, 2)

2.7. Overview of Pandas Data Frames

Collections are typically groups of objects, tuples or simple strings; we need to parse them to process the data further, which can be tedious at times.

  • With Data Frames we can define the structure.

  • A Data Frame is nothing but a group of rows where each row has multiple named attributes.

  • A Data Frame is similar to a database table or a spreadsheet with a header.

  • Pandas provides rich and simple functions to convert data in files into Data Frames and process them.

  • Data can be read from files into Data Frame using functions such as read_csv.

  • We can perform all standard operations on Data Frames (see the sketch after this list).

    • Projection or Selection

    • Filtering

    • Aggregations

    • Joins

    • Sorting
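
The sketch below demonstrates these operations on a tiny in-memory Data Frame built from made-up data; the file-based tasks follow.

import pandas as pd

df = pd.DataFrame([(1, 299.98), (2, 199.99), (2, 129.99)],
                  columns=["order_id", "subtotal"])
df[["order_id"]]                           # projection
df.query("order_id == 2")                  # filtering
df.groupby("order_id")["subtotal"].sum()   # aggregation
df.sort_values("subtotal")                 # sorting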

2.7.1. Tasks

Let us perform a few tasks to recap the usage of Pandas Data Frames.

  • Read order items data from the location on your system. In mine it is /data/retail_db/order_items/part-00000. Use the information below to define schema.

  • It has 6 fields with the names below, in the same order as specified.

  • order_item_id

  • order_item_order_id

  • order_item_product_id

  • order_item_quantity

  • order_item_subtotal

  • order_item_product_price

import pandas as pd
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = pd. \
    read_csv(order_items_path,
             names=["order_item_id", "order_item_order_id",
                    "order_item_product_id", "order_item_quantity",
                    "order_item_subtotal", "order_item_product_price"
                   ]
            )
  • Project order_item_order_id and order_item_subtotal

order_items[["order_item_order_id", "order_item_subtotal"]]
  • Filter for order_item_order_id 2

order_items.query("order_item_order_id == 2")
  • Compute revenue for order_item_order_id 2

order_items. \
    query("order_item_order_id == 2")["order_item_subtotal"]. \
    sum()
  • Get the number of items and revenue for each order id. Alias the order revenue as revenue.

# revenue alone for each order id
order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    sum()
# number of items and revenue for each order id, aliasing sum as revenue
order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    agg(['sum', 'count']). \
    rename(columns={'sum': 'revenue'})

2.8. Limitations of Pandas

We can use Pandas for data processing. It provides rich APIs to read data from different sources, process the data and then write it to different targets.

  • Pandas works well for lightweight data processing.

  • Pandas is typically single threaded, which means only one process takes care of processing the data.

  • As data volume grows, the processing time can grow disproportionately, and we might also run into resource contention.

  • It is not trivial to achieve distributed processing with Pandas APIs; we would end up struggling with multi-threading rather than with business logic.

  • There are distributed computing frameworks such as Hadoop Map Reduce and Spark to take care of data processing at scale on multi-node Hadoop or Spark clusters.

  • Both Hadoop Map Reduce and Spark come with distributed computing engines as well as APIs.

Pandas is typically used for lightweight data processing, while Spark is used for data processing at scale.

2.9. Development Life Cycle

Let us understand the development life cycle. We typically use IDEs such as PyCharm to develop Python based applications.

  • Create Project - retail

  • Choose a Python 3.x interpreter.

  • Make sure packages such as pandas are installed.

  • Create a config.py script for externalizing runtime parameters such as the input path and output path.

  • Create an app folder for the source code.

2.9.1. Tasks

Let us develop a simple application to understand end to end development life cycle.

  • Read the data from order_items

  • Get revenue for each order id

  • Save the output, which contains order id and revenue, to a file.

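A minimal sketch of such an application is shown below, assuming the order_items path used earlier in this chapter and a hypothetical config.py that externalizes the runtime parameters.

# config.py (illustrative)
# order_items_path = "/data/retail_db/order_items/part-00000"
# output_path = "/tmp/order_revenue.csv"

# app/main.py (illustrative)
import pandas as pd
from config import order_items_path, output_path  # hypothetical config module

order_items = pd. \
    read_csv(order_items_path,
             names=["order_item_id", "order_item_order_id",
                    "order_item_product_id", "order_item_quantity",
                    "order_item_subtotal", "order_item_product_price"
                   ]
            )
order_revenue = order_items. \
    groupby("order_item_order_id")["order_item_subtotal"]. \
    sum(). \
    rename("revenue")
order_revenue.to_csv(output_path, header=True)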

2.10. Exercises

Let us perform a few exercises to understand how to process the data. We will use LinkedIn data to perform some basic data processing using Python.

  • Get LinkedIn archive.

    • Go to https://linkedin.com

    • Me on top -> Settings & Privacy

    • Then go to “How LinkedIn uses your data” -> Getting a copy of your data

    • Register and download. You will get a link as part of the email.

  • The data contains multiple CSV files. We will limit the analysis to Contacts.csv and Connections.csv.

  • Get the number of contacts without email ids.

  • Get the number of contacts from each source.

  • Get the number of connections with each title.

  • Get the number of connections from each company.

  • Get the number of contacts for each month in the year 2018.

  • Use Postgres or MySQL as the database (you can set it up on your laptop) and write the connections data to the database.