# Getting Started

## Platforms to Practice

Let us understand the different platforms we can use to practice Apache Spark using Python.

* Local Setup
* Databricks Platform
* Setting up your own cluster
* Cloud-based labs


## Setup Spark Locally - Ubuntu

Let us set up Spark locally on Ubuntu.

* Install the latest version of Anaconda.
* Make sure Jupyter Notebook is set up and validated.
* Set up Spark and validate it.
* Set up environment variables to integrate PySpark with Jupyter Notebook.
* Launch Jupyter Notebook using the <mark>pyspark</mark> command.
* Set up PyCharm (IDE) for application development.
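
The environment variable step above can be sketched as follows. This is a minimal sketch, not the only way to do it: `PYSPARK_DRIVER_PYTHON` and `PYSPARK_DRIVER_PYTHON_OPTS` are the variables the <mark>pyspark</mark> launcher reads to pick its driver program, while the helper name `pyspark_jupyter_exports` and the install location `/opt/spark` are assumptions made for illustration.

```python
def pyspark_jupyter_exports(spark_home):
    """Return shell export lines that make the `pyspark` command start Jupyter Notebook."""
    return [
        f'export SPARK_HOME="{spark_home}"',
        'export PATH="$SPARK_HOME/bin:$PATH"',
        # Tell the pyspark launcher to use Jupyter as the driver program.
        "export PYSPARK_DRIVER_PYTHON=jupyter",
        "export PYSPARK_DRIVER_PYTHON_OPTS=notebook",
    ]


# /opt/spark is a hypothetical install location; use the folder where Spark was extracted.
print("\n".join(pyspark_jupyter_exports("/opt/spark")))
```

The printed lines can be appended to the shell profile (for example `~/.bashrc`, depending on the shell in use) so that they are available in every new terminal session.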


## Setup Spark Locally - Mac

Let us set up Spark locally on Mac.

* Install the latest version of Anaconda.
* Make sure Jupyter Notebook is set up and validated.
* Set up Spark and validate it.
* Set up environment variables to integrate PySpark with Jupyter Notebook.
* Launch Jupyter Notebook using the <mark>pyspark</mark> command.
* Set up PyCharm (IDE) for application development.


## Signing up for ITVersity Labs

* 


## Using ITVersity Labs

Let us understand how to submit Spark jobs in ITVersity Labs.

* As we are using Python, we can also use the <mark>help</mark> function to get the documentation, for example <mark>help(spark.read.csv)</mark>.
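
As a small illustration of how <mark>help</mark> works, the sketch below captures the same documentation text as a string using the standard `pydoc` module. It runs on a built-in function so that it works without a Spark session; the helper name `doc_text` is hypothetical. In a notebook where `spark` is defined, `spark.read.csv` could be passed in the same way.

```python
import pydoc


def doc_text(obj):
    """Return the documentation that help(obj) would display, as plain text."""
    return pydoc.render_doc(obj, renderer=pydoc.plaintext)


# str.split stands in for spark.read.csv, which needs a running Spark session.
print(doc_text(str.split))
```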


## Interacting with File Systems

Let us understand how to interact with the file system using the %fs command from a Databricks notebook.

* We can access datasets using the %fs magic command in a Databricks notebook.
* By default, we will see files under dbfs.
* We can list files using the ls command, for example <mark>%fs ls</mark>.
* Databricks provides a lot of datasets for free under databricks-datasets.
* If the cluster is integrated with AWS S3 or Azure Blob Storage, we can access files by specifying the appropriate protocol (e.g. s3:// for S3).
* List of commands available under %fs:

  * Copying files or directories - <mark>cp</mark>
  * Moving files or directories - <mark>mv</mark>
  * Creating directories - <mark>mkdirs</mark>
  * Deleting files and directories - <mark>rm</mark>
  * We can copy or delete directories recursively using <mark>-r</mark> or <mark>--recursive</mark>
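
The same operations are also available from Python through `dbutils.fs`, which a Databricks notebook provides as a predefined object. The sketch below is illustrative only: the function name `fs_tour`, the file names, and the paths in the usage comment are assumptions, and the function takes `dbutils` as an argument so that it can be defined outside a notebook as well.

```python
def fs_tour(dbutils, source_file, scratch_dir):
    """Run mkdirs, cp, mv, ls and rm against scratch_dir; return the names seen by ls."""
    dbutils.fs.mkdirs(scratch_dir)                           # %fs mkdirs
    dbutils.fs.cp(source_file, f"{scratch_dir}/copy.txt")    # %fs cp
    dbutils.fs.mv(f"{scratch_dir}/copy.txt",
                  f"{scratch_dir}/renamed.txt")              # %fs mv
    names = [info.name for info in dbutils.fs.ls(scratch_dir)]  # %fs ls
    dbutils.fs.rm(scratch_dir, True)                         # %fs rm -r (True = recursive)
    return names


# In a Databricks notebook (both paths are hypothetical):
# fs_tour(dbutils, "dbfs:/databricks-datasets/README.md", "dbfs:/tmp/fs_tour")
```

Passing `True` as the second argument to `rm` plays the role of the recursive flag, so the scratch directory is removed together with its contents.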

## Getting File Metadata

Let us review the source location to get the number of files and the size of the data we are going to process.

* Location of the airlines data: <mark>dbfs:/databricks-datasets/airlines</mark>
* We can get the first 1000 files using <mark>%fs ls dbfs:/databricks-datasets/airlines</mark>
* The location contains 1919 files; however, we will not be able to see all the details using the %fs command.
* Databricks File System commands do not have the capability to report file metadata such as size in detail.
* When the Spark cluster is started, it creates 2 objects - <mark>spark</mark> and <mark>sc</mark>.
* sc is of type SparkContext and spark is of type SparkSession.
* Spark uses HDFS APIs to interact with the file system, and we can access the HDFS APIs using <mark>sc._jsc</mark> and <mark>sc._jvm</mark> to get file metadata.


* Here are the steps to get the file metadata.

  * Get the Hadoop configuration using <mark>sc._jsc.hadoopConfiguration()</mark> - let's say <mark>conf</mark>.
  * Pass conf to <mark>sc._jvm.org.apache.hadoop.fs.FileSystem.get</mark> to get the FileSystem object - let's say <mark>fs</mark>.
  * Build a <mark>path</mark> object by passing the path as a string to <mark>sc._jvm.org.apache.hadoop.fs.Path</mark>.
  * Invoke <mark>listStatus</mark> on fs by passing path, which returns an array of FileStatus objects - let's say <mark>files</mark>.
  * Each <mark>FileStatus</mark> object has the metadata of one file.
  * We can use <mark>len</mark> on files to get the number of files.
  * We can use <mark>getLen</mark> on each <mark>FileStatus</mark> object to get the size of each file in bytes.
  * The cumulative size of all files can be computed using <mark>sum(map(lambda file: file.getLen(), files))</mark>.

Let us first get the list of files.

In [None]:
%fs ls dbfs:/databricks-datasets/airlines

Here is the consolidated script to get the number of files and the cumulative size (in GB) of all files in a given folder.

In [None]:
conf = sc._jsc.hadoopConfiguration()
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(conf)
path = sc._jvm.org.apache.hadoop.fs.Path("dbfs:/databricks-datasets/airlines")

files = fs.listStatus(path)
print(f"Number of files: {len(files)}")
print(f"Total size (GB): {sum(map(lambda file: file.getLen(), files)) / 1024 / 1024 / 1024}")
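
The aggregation above can also be wrapped in a small reusable helper. This is a sketch under a few assumptions: the names `summarize` and `to_gib` are hypothetical, and the helper only relies on each object offering `getLen()` and `isDirectory()`, which are the methods Hadoop's FileStatus provides. Directories are skipped so that only regular files are counted.

```python
def summarize(statuses):
    """Return (file_count, total_bytes) for an iterable of FileStatus-like objects."""
    # Skip sub-directories so that only regular files contribute to the totals.
    sizes = [status.getLen() for status in statuses if not status.isDirectory()]
    return len(sizes), sum(sizes)


def to_gib(num_bytes):
    """Convert a byte count to gibibytes (1 GiB = 1024 ** 3 bytes)."""
    return num_bytes / 1024 ** 3


# In a notebook, with fs and path built as in the script above:
# count, total = summarize(fs.listStatus(path))
# print(count, to_gib(total))
```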