1. Getting Started¶
1.1. Platforms to Practice¶
Let us understand different platforms we can leverage to practice Apache Spark using Python.
Local Setup
Databricks Platform
Setting up your own cluster
Cloud-based labs
1.2. Setup Spark Locally - Ubuntu¶
Let us set up Spark locally on Ubuntu.
Install the latest version of Anaconda.
Make sure Jupyter Notebook is set up and validated.
Set up Spark and validate the installation (see the validation sketch after this list).
Set up environment variables to integrate Pyspark with Jupyter Notebook.
Launch Jupyter Notebook using the pyspark command.
Set up PyCharm (IDE) for application development.
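The environment variables that integrate pyspark with Jupyter Notebook are typically PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS, so that the pyspark command launches Jupyter Notebook instead of the plain Python shell. Once the notebook is up, the setup can be validated with a small PySpark snippet such as the sketch below; the application name and the local[*] master are only illustrative.
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession to confirm the installation works
spark = SparkSession. \
    builder. \
    master('local[*]'). \
    appName('Validate Spark Setup'). \
    getOrCreate()

print(spark.version)    # version of Spark that was picked up
spark.range(10).show()  # small DataFrame to confirm end-to-end execution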
1.3. Setup Spark Locally - Mac¶
1.3.1. Let us set up Spark locally on Mac.¶
Install the latest version of Anaconda.
Make sure Jupyter Notebook is set up and validated.
Set up Spark and validate the installation.
Set up environment variables to integrate Pyspark with Jupyter Notebook.
Launch Jupyter Notebook using the pyspark command.
Set up PyCharm (IDE) for application development (a minimal application sketch follows this list).
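For the PyCharm step, a minimal standalone application can be used to confirm that the IDE is wired to the local Spark installation. This is only a sketch; the file name app.py, the application name, and the local[*] master are illustrative.
# app.py - minimal standalone PySpark application (run directly from PyCharm)
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession. \
        builder. \
        master('local[*]'). \
        appName('Getting Started'). \
        getOrCreate()

    # Build a small DataFrame and display it to confirm the setup
    df = spark.createDataFrame([(1, 'Hello'), (2, 'World')], schema=['id', 'message'])
    df.show()

    spark.stop()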
1.4. Signing up for ITVersity Labs¶
1.5. Using ITVersity Labs¶
Let us understand how to submit Spark jobs in ITVersity Labs.
As we are using Python, we can also use the help command to get the documentation, for example help(spark.read.csv), as shown below.
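For example, assuming a pyspark session where the spark object is already created, we can review the documentation of commonly used APIs as follows.
help(spark.read.csv)         # documentation for reading CSV files
help(spark.createDataFrame)  # documentation for creating a DataFrame from local data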
1.6. Interacting with File Systems¶
Let us understand how to interact with the file system using the %fs command from a Databricks notebook.
We can access datasets using the %fs magic command in a Databricks notebook.
By default, we will see files under dbfs.
We can list the files using the ls command, e.g. %fs ls.
Databricks provides a lot of datasets for free under databricks-datasets.
If the cluster is integrated with AWS or Azure Blob Storage, we can access files by specifying the appropriate protocol (e.g. s3:// for S3).
List of commands available under %fs (equivalent dbutils.fs calls are sketched after this list):
Copying files or directories - cp
Moving files or directories - mv
Creating directories - mkdirs
Deleting files and directories - rm
We can copy or delete directories recursively using -r or --recursive.
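In a Databricks notebook, the same operations are also available from Python through dbutils.fs, for which %fs is a shorthand. Here is a minimal sketch; the dbfs:/tmp/demo paths are only illustrative.
dbutils.fs.ls('dbfs:/databricks-datasets')                          # list files and folders
dbutils.fs.mkdirs('dbfs:/tmp/demo')                                 # create a directory
dbutils.fs.put('dbfs:/tmp/demo/hello.txt', 'Hello DBFS', True)      # create a small file (True = overwrite)
dbutils.fs.cp('dbfs:/tmp/demo', 'dbfs:/tmp/demo_copy', True)        # copy a directory recursively
dbutils.fs.mv('dbfs:/tmp/demo_copy', 'dbfs:/tmp/demo_moved', True)  # move (rename) a directory
dbutils.fs.rm('dbfs:/tmp/demo', True)                               # delete a directory recursively
dbutils.fs.rm('dbfs:/tmp/demo_moved', True)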
1.7. Getting File Metadata¶
Let us review the source location to get the number of files and the size of the data we are going to process.
Location of airlines data - dbfs:/databricks-datasets/airlines
We can get the first 1000 files using %fs ls dbfs:/databricks-datasets/airlines
The location contains 1919 files; however, we will not be able to see all the details using the %fs command.
Databricks File System commands do not have the capability to show detailed metadata of files, such as size.
When a Spark cluster is started, it will create 2 objects - spark and sc
sc is of type SparkContext and spark is of type SparkSession
Spark uses HDFS APIs to interact with the file system, and we can access those APIs using sc._jsc and sc._jvm to get file metadata.
Here are the steps to get the file metadata.
Get Hadoop Configuration using sc._jsc.hadoopConfiguration() - let’s say conf
We can pass conf to sc._jvm.org.apache.hadoop.fs.FileSystem.get to get a FileSystem object - let's say fs
We can build a Path object by passing the path as a string to sc._jvm.org.apache.hadoop.fs.Path
We can invoke listStatus on fs by passing the path, which will return an array of FileStatus objects - let's say files.
Each FileStatus object has all the metadata of the corresponding file.
We can use len on files to get the number of files.
We can use getLen on each FileStatus object to get the size of each file. The cumulative size of all files can be computed using sum(map(lambda file: file.getLen(), files)).
Let us first get the list of files.
%fs ls dbfs:/databricks-datasets/airlines
Here is the consolidated script to get the number of files and the cumulative size of all files in a given folder.
conf = sc._jsc.hadoopConfiguration()                                            # Hadoop Configuration
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)                          # Hadoop FileSystem handle
path = sc._jvm.org.apache.hadoop.fs.Path("dbfs:/databricks-datasets/airlines")  # Path object for the folder
files = fs.listStatus(path)                                                     # array of FileStatus objects
print(len(files))                                                               # number of files
sum(map(lambda file: file.getLen(), files))/1024/1024/1024                      # cumulative size in GB
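We can also inspect individual entries. This sketch assumes the script above has already been run; getPath, getLen and getModificationTime are standard Hadoop FileStatus getters.
file = files[0]                     # FileStatus of the first entry in the folder
print(file.getPath())               # full path of the file
print(file.getLen())                # size in bytes
print(file.getModificationTime())   # last modification time (epoch in milliseconds)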