2. Quick Recap of Python
Let us quickly recap some of the core programming concepts of Python before we get into Spark.
2.1. Data Engineering Life Cycle
Let us first understand the Data Engineering Life Cycle. We typically read the data, process it by applying business rules, and write the data back to different targets.
Read the data from different sources.
Files
Databases
Mainframes
APIs
Processing the data
Row Level Transformations
Aggregations
Sorting
Ranking
Joining multiple data sets
Write data to different targets.
Files
Databases
Mainframes
APIs
2.2. Python CLI or Jupyter Notebook
We can use Python CLI or Jupyter Notebook to explore APIs.
We can launch the Python CLI using the python command.
We can launch Jupyter Notebook using the jupyter notebook command.
A web service will be started on port 8888 by default.
We can go to the browser and connect to the web server using IP address and port number.
We should be able to explore code in an interactive fashion.
We can issue magic commands such as %%sh to run shell commands and %%md to document using markdown, etc.
2.2.1. Tasks
Let us perform these tasks to recollect how to use the Python CLI or Jupyter Notebook.
Create variables i and j assigning 10 and 20.5 respectively.
i = 10
j = 20.5
Add the values and assign it to res.
res = i + j
print(res)
Get the type of i, j and res.
type(i)
type(j)
type(res)
Get the help on int.
help(int)
Get the help on startswith that is available on str.
help(str.startswith)
2.3. Basic Programming Constructs
Let us recollect some of the basic programming constructs of Python.
Comparison Operations (==, !=, <, >, <=, >=, etc)
All the comparison operators return True or False (a Boolean value).
Conditionals (if)
We typically use comparison operators as part of conditionals.
Loops (for)
We can iterate through a collection using for i in l, where l is a standard collection such as a list or set.
Python provides a built-in function called range which returns a collection of integers between the given bounds. It excludes the upper bound value.
In Python, scope is defined by indentation.
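The exclusive upper bound of range can be seen with a quick sketch (the values are chosen purely for illustration):

```python
# range(start, stop[, step]) excludes the stop value
print(list(range(1, 5)))      # [1, 2, 3, 4] -- 5 is excluded
print(list(range(1, 16, 2)))  # odd numbers between 1 and 15
```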
2.3.1. Tasks
Let us perform a few tasks to quickly recap basic programming constructs of Python.
Get all the odd numbers between 1 and 15.
list(range(1, 16, 2))
Print all those numbers which are divisible by 3 from the above list.
for i in range(1, 16, 2):
    if i % 3 == 0:
        print(i)
2.4. Developing Functions
Let us understand how to develop functions using Python as programming language.
A function starts with the def keyword followed by the function name.
Parameters can be of different types.
Required
Keyword
Variable Number
Functions
Functions which take another function as an argument are called higher order functions.
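The parameter types and a higher order function can be sketched as follows (the function names here are illustrative, not from the original material):

```python
def describe(required, keyword="default", *args):
    # required: must be passed by the caller
    # keyword: optional, falls back to its default value
    # *args: variable number of positional arguments
    return f"{required}-{keyword}-{len(args)}"

def apply_twice(f, x):
    # higher order function: takes another function f as an argument
    return f(f(x))

print(describe(1))                       # 1-default-0
print(describe(1, "k", 5, 6))            # 1-k-2
print(apply_twice(lambda i: i + 1, 10))  # 12
```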
2.4.1. Tasks
Let us perform a few tasks to understand how to develop functions in Python.
Sum of integers between lower bound and upper bound using formula.
def sumOfN(n):
    return int((n * (n + 1)) / 2)

sumOfN(10)
def sumOfIntegers(lb, ub):
    return sumOfN(ub) - sumOfN(lb - 1)

sumOfIntegers(5, 10)
Sum of integers between lower bound and upper bound using loops.
def sumOfIntegers(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e
    return total

sumOfIntegers(1, 10)
Sum of squares of integers between lower bound and upper bound using loops.
def sumOfSquares(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e * e
    return total

sumOfSquares(2, 4)
Sum of the even numbers between lower bound and upper bound using loops.
def sumOfEvens(lb, ub):
    total = 0
    for e in range(lb, ub + 1):
        total += e if e % 2 == 0 else 0
    return total

sumOfEvens(2, 4)
2.5. Lambda Functions
Let us recap details related to lambda functions.
We can develop functions without names. They are called Lambda Functions, also known as Anonymous Functions.
We typically pass them as arguments to higher order functions, which take functions as arguments.
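A minimal example of passing a lambda to a built-in higher order function (the word list is made up for illustration):

```python
# sorted accepts a key function that is applied to each element
words = ["banana", "fig", "apple"]
by_length = sorted(words, key=lambda w: len(w))
print(by_length)  # ['fig', 'apple', 'banana']
```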
2.5.1. Tasks
Let us perform a few tasks related to lambda functions.
Create a generic function mySum which is supposed to perform arithmetic using integers within a range.
It takes 3 arguments - lb, ub and f.
Function f should be invoked inside the function on each element within the range.
def mySum(lb, ub, f):
    total = 0
    for e in range(lb, ub + 1):
        total += f(e)
    return total
Sum of integers between lower bound and upper bound using mySum.
mySum(2, 4, lambda i: i)
Sum of squares of integers between lower bound and upper bound using mySum.
mySum(2, 4, lambda i: i * i)
Sum of the even numbers between lower bound and upper bound using mySum.
mySum(2, 4, lambda i: i if i%2 == 0 else 0)
2.6. Overview of Collections and Tuples
Let’s quickly recap Collections and Tuples in Python. We will primarily talk about the collections and tuples that come as part of the Python standard library, such as list, set, dict and tuple.
Group of elements with length and index - list
Group of unique elements - set
Group of key value pairs - dict
While list, set and dict typically contain groups of homogeneous elements, a tuple contains a group of heterogeneous elements.
We can think of list, set and dict as a table in a database and a tuple as a row or record in a given table.
Typically we create a list of tuples or a set of tuples; a dict is nothing but a collection of two-element tuples where the key is unique.
We typically use Map Reduce APIs such as map, filter and reduce to process the data in collections. There are also pre-defined functions such as len, sum, min, max, etc. for aggregating data in collections.
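A minimal sketch of these APIs on a list of tuples (the data is made up for illustration):

```python
from functools import reduce

# hypothetical list of (order_id, amount) tuples
orders = [(1, 10.0), (2, 25.5), (3, 8.0), (4, 30.0)]

# filter: keep orders with amount greater than 10
filtered = filter(lambda o: o[1] > 10, orders)

# map: extract just the amounts
amounts = map(lambda o: o[1], filtered)

# reduce: aggregate the amounts into a total
total = reduce(lambda a, b: a + b, amounts)
print(total)  # 55.5
```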
2.6.1. Tasks
Let us perform a few tasks to quickly recap details about Collections and Tuples in Python. We will also quickly recap the Map Reduce APIs.
Create a collection of orders by reading data from a file.
%%sh
ls -ltr /data/retail_db/orders/part-00000
orders_path = "/data/retail_db/orders/part-00000"
orders = open(orders_path). \
read(). \
splitlines()
Get all unique order statuses. Make sure data is sorted in alphabetical order.
sorted(set(map(lambda o: o.split(",")[3], orders)))
Get count of all unique dates.
len(set(map(lambda o: o.split(",")[1], orders)))
Sort the data in orders in ascending order by order_customer_id and then order_date.
sorted(orders, key=lambda k: (int(k.split(",")[2]), k.split(",")[1]))
Create a collection of order_items by reading data from a file.
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = open(order_items_path). \
read(). \
splitlines()
Get revenue for a given order_item_order_id.
def get_order_revenue(order_items, order_id):
    order_items_filtered = filter(
        lambda oi: int(oi.split(",")[1]) == order_id,
        order_items
    )
    order_items_map = map(
        lambda oi: float(oi.split(",")[4]),
        order_items_filtered
    )
    return round(sum(order_items_map), 2)

get_order_revenue(order_items, 2)
2.7. Overview of Pandas Data Frames
While collections are typically groups of objects, tuples or simple strings, we need to parse them to process the data further. This can be tedious at times.
With Data Frames we can define the structure.
A Data Frame is nothing but a group of rows where each row has multiple attributes with names.
Data Frame is similar to a Database Table or Spreadsheet with Header.
Pandas provides rich and simple functions to convert data in files into Data Frames and process them.
Data can be read from files into Data Frame using functions such as read_csv.
We can perform all standard operations on Data Frames.
Projection or Selection
Filtering
Aggregations
Joins
Sorting
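The tasks that follow cover projection, filtering and aggregations; joins and sorting can be sketched with a couple of made-up frames (the column values here are purely illustrative):

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_status": ["CLOSED", "COMPLETE", "PENDING"]
})
order_items = pd.DataFrame({
    "order_item_order_id": [1, 1, 2],
    "order_item_subtotal": [100.0, 50.0, 75.0]
})

# join the two frames on the order id (inner join by default)
joined = orders.merge(order_items,
                      left_on="order_id",
                      right_on="order_item_order_id")

# sort by subtotal in descending order
result = joined.sort_values("order_item_subtotal", ascending=False)
print(result[["order_id", "order_item_subtotal"]])
```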
2.7.1. Tasks
Let us perform a few tasks to recap the usage of Pandas Data Frames.
Read the order items data from the location on your system. In mine it is /data/retail_db/order_items/part-00000. Use the information below to define the schema.
It has 6 fields, with the names listed below in the same order.
order_item_id
order_item_order_id
order_item_product_id
order_item_quantity
order_item_subtotal
order_item_product_price
import pandas as pd
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = pd. \
    read_csv(order_items_path,
             names=["order_item_id", "order_item_order_id",
                    "order_item_product_id", "order_item_quantity",
                    "order_item_subtotal", "order_item_product_price"]
            )
Project order_item_order_id and order_item_subtotal
order_items[["order_item_order_id", "order_item_subtotal"]]
Filter for order_item_order_id 2
order_items.query("order_item_order_id == 2")
Compute revenue for order_item_order_id 2
order_items. \
query("order_item_order_id == 2")["order_item_subtotal"]. \
sum()
Get the number of items and the revenue for each order id. Alias the order revenue as revenue.
order_items. \
groupby("order_item_order_id")["order_item_subtotal"]. \
sum()
order_items. \
groupby("order_item_order_id")["order_item_subtotal"]. \
agg(['sum', 'count']). \
rename(columns={'sum': 'revenue'})
2.8. Limitations of Pandas
We can use Pandas for data processing. It provides rich APIs to read data from different sources, process the data and then write it to different targets.
Pandas works well for lightweight data processing.
Pandas is typically single threaded, which means only one process takes care of processing the data.
As data volume grows, the processing time can grow steeply, and we may also run into resource contention.
It is not trivial to achieve distributed processing using Pandas APIs. We would end up struggling with multi threading rather than focusing on business logic.
There are Distributed Computing Frameworks such as Hadoop Map Reduce, Spark, etc. to take care of data processing at scale on multi node Hadoop or Spark Clusters.
Both Hadoop Map Reduce and Spark come with Distributed Computing Frameworks as well as APIs.
Pandas is typically used for lightweight Data Processing, while Spark is used for Data Processing at Scale.
2.9. Development Life Cycle
Let us understand the development life cycle. We typically use IDEs such as PyCharm to develop Python based applications.
Create Project - retail
Choose the interpreter 3.x
Make sure plugins such as pandas are installed.
Create config.py script for externalizing run time parameters such as input path, output path etc.
Create app folder for the source code.
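A minimal config.py might look like the sketch below (the paths, environment variable names and variable names are assumptions; adjust them for your environment):

```python
# config.py - externalize run time parameters such as input and output paths
import os

# fall back to illustrative defaults when environment variables are not set
input_base_dir = os.environ.get("INPUT_BASE_DIR", "/data/retail_db")
output_base_dir = os.environ.get("OUTPUT_BASE_DIR", "/tmp/retail_output")
```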
2.10. Exercises
Let us perform a few exercises to understand how to process data. We will use LinkedIn data to perform some basic data processing using Python.
Get LinkedIn archive.
Go to https://linkedin.com
Me on top -> Settings & Privacy
Then go to “How LinkedIn uses your data” -> Getting a copy of your data
Register and download. You will get a link as part of the email.
The data contains multiple CSV files. We will limit the analysis to Contacts.csv and Connections.csv.
Get the number of contacts without email ids.
Get the number of contacts from each source.
Get the number of connections with each title.
Get the number of connections from each company.
Get the number of contacts for each month in the year 2018.
Use Postgres or MySQL as the database (you can set it up on your laptop) and write the connections data to the database.