Initializing Spark


>>> from pyspark import SparkContext
>>> sc = SparkContext(master='local[2]')

Inspect SparkContext

>>> sc.version #Retrieve SparkContext version
>>> sc.pythonVer #Retrieve Python version
>>> sc.master #Master URL to connect to
>>> str(sc.sparkHome) #Path where Spark is installed on worker nodes
>>> str(sc.sparkUser()) #Retrieve name of the Spark User running SparkContext
>>> sc.appName #Return application name
>>> sc.applicationId #Retrieve application ID
>>> sc.defaultParallelism #Return default level of parallelism
>>> sc.defaultMinPartitions #Default minimum number of partitions for RDDs


Configuration

>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("MyApp")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf=conf)

Using The Shell

In the PySpark shell, a special interpreter aware SparkContext is already created in the variable called sc.

$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py

Set which master the context connects to with the --master argument, and add Python .zip, .egg, or .py files to the
runtime path by passing a comma-separated list to --py-files.

Loading Data

Parallelized Collections

>>> rdd = sc.parallelize([('a',7), ('a',2), ('b',2)])
>>> rdd2 = sc.parallelize([('a',2), ('d',1), ('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a", ["x","y","z"]),
                           ("b", ["p","r"])])

External Data

Read either one text file from HDFS, a local file system, or any Hadoop-supported file system URI with textFile(),
or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt" )
>>> textFile2 = sc.wholeTextFiles("/my/directory/")

Retrieving RDD Information

Basic Information

>>> rdd.getNumPartitions() #List the number of partitions
>>> rdd.count() #Count RDD instances
 3
>>> rdd.countByKey() #Count RDD instances by key
 defaultdict(<type 'int'>, {'a':2, 'b':1})
>>> rdd.countByValue() #Count RDD instances by value
 defaultdict(<type 'int'>, {('b',2):1, ('a',2):1, ('a',7):1})
>>> rdd.collectAsMap() #Return (key,value) pairs as a dictionary
 {'a': 2, 'b': 2}
>>> rdd3.sum() #Sum of RDD elements
 4950
>>> sc.parallelize([]).isEmpty() #Check whether RDD is empty
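The countByKey()/countByValue() results above can be cross-checked in plain Python with collections.Counter (a local sketch of the same counting, not a Spark call):

```python
from collections import Counter

pairs = [('a', 7), ('a', 2), ('b', 2)]   # same data as rdd

by_key = Counter(k for k, _ in pairs)    # mirrors rdd.countByKey()
by_value = Counter(pairs)                # mirrors rdd.countByValue()

print(by_key)    # Counter({'a': 2, 'b': 1})
print(by_value)  # each distinct (key, value) pair appears once
```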


Summary

>>> rdd3.max() #Maximum value of RDD elements
>>> rdd3.min() #Minimum value of RDD elements
>>> rdd3.mean() #Mean value of RDD elements
>>> rdd3.stdev() #Standard deviation of RDD elements
>>> rdd3.variance() #Compute variance of RDD elements
>>> rdd3.histogram(3) #Compute histogram by bins
>>> rdd3.stats() #Summary statistics (count, mean, stdev, max & min)
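Since rdd3 holds range(100), these statistics can be verified locally; note that stdev() and variance() are the population versions (dividing by n). A plain-Python cross-check:

```python
data = list(range(100))                            # same values as rdd3
n = len(data)

mean = sum(data) / n                               # 49.5
variance = sum((x - mean) ** 2 for x in data) / n  # population variance: 833.25
stdev = variance ** 0.5                            # ~28.87
```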

Applying Functions

>>> def g(x): print(x)
>>> rdd.foreach(g) #Apply a function to all RDD elements
 ('a', 7)
 ('b', 2)
 ('a', 2)

Selecting Data


>>> rdd.collect() #Return a list with all RDD elements
 [('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) #Take first 2 RDD elements
 [('a', 7), ('a', 2)]
>>> rdd.first() #Take first RDD element
 ('a', 7)
>>> rdd.top(2) #Take top 2 RDD elements
 [('b', 2), ('a', 7)]


Sampling

>>> rdd3.sample(False, 0.15, 81).collect() #Return sampled subset of rdd3



Filtering

>>> rdd.filter(lambda x: "a" in x).collect() #Filter the RDD
 [('a',7), ('a',2)]
>>> rdd5 = sc.parallelize(['a',2,'b',7,'a',2])
>>> rdd5.distinct().collect() #Return distinct RDD values
 ['a',2,'b',7]
>>> rdd.keys().collect() #Return (key,value) RDD's keys
 ['a', 'a', 'b']



Reshaping Data


Reducing

>>> rdd.reduceByKey(lambda x,y: x+y).collect() #Merge the rdd values for each key
 [('a',9), ('b',2)]
>>> rdd.reduce(lambda a,b: a+b) #Merge the rdd values
 ('a', 7, 'a', 2, 'b', 2)

Grouping by

>>> rdd3.groupBy(lambda x: x % 2).mapValues(list).collect() #Return RDD of grouped values
>>> rdd.groupByKey().mapValues(list).collect() #Group rdd by key
 [('a',[7,2]), ('b',[2])]
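groupByKey()'s behavior can be sketched locally with a defaultdict (illustrative only; on a real cluster prefer reduceByKey when you only need an aggregate, since it combines values before shuffling):

```python
from collections import defaultdict

pairs = [('a', 7), ('a', 2), ('b', 2)]   # same data as rdd

grouped = defaultdict(list)
for k, v in pairs:
    grouped[k].append(v)                 # mirrors rdd.groupByKey().mapValues(list)

print(dict(grouped))  # {'a': [7, 2], 'b': [2]}
```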


Aggregating

>>> from operator import add
>>> seqOp = (lambda x,y: (x[0]+y, x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0], x[1]+y[1]))
#Aggregate RDD elements of each partition and then the results
>>> rdd3.aggregate((0,0), seqOp, combOp)
 (4950, 100)
#Aggregate values of each RDD key
>>> rdd.aggregateByKey((0,0), seqOp, combOp).collect()
 [('a',(9,2)), ('b',(2,1))]
#Aggregate the elements of each partition, and then the results
>>> rdd3.fold(0, add)
 4950
#Merge the values for each key
>>> rdd.foldByKey(0, add).collect()
 [('a',9), ('b',2)]
#Create tuples of RDD elements by applying a function
>>> rdd3.keyBy(lambda x: x+x).collect()
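How aggregate() combines seqOp and combOp can be sketched in plain Python with functools.reduce; the two-way partition split below is illustrative, but the (sum, count) result matches rdd3.aggregate((0,0), seqOp, combOp):

```python
from functools import reduce

seqOp = lambda acc, v: (acc[0] + v, acc[1] + 1)   # fold each value into a (sum, count) pair
combOp = lambda a, b: (a[0] + b[0], a[1] + b[1])  # merge per-partition (sum, count) pairs

partitions = [range(0, 50), range(50, 100)]       # illustrative split of range(100)
per_part = [reduce(seqOp, part, (0, 0)) for part in partitions]
total = reduce(combOp, per_part)

print(total)                # (4950, 100)
print(total[0] / total[1])  # mean: 49.5
```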


Sorting

>>> rdd2.sortBy(lambda x: x[1]).collect() #Sort RDD by given function
 [('d',1), ('b',1), ('a',2)]
>>> rdd2.sortByKey().collect() #Sort (key,value) RDD by key
 [('a',2), ('b',1), ('d',1)]

Mathematical Operations

>>> rdd.subtract(rdd2).collect() #Return each rdd value not contained in rdd2
 [('b',2), ('a',7)]
#Return each (key,value) pair of rdd2 with no matching key in rdd
>>> rdd2.subtractByKey(rdd).collect()
 [('b',1)]
>>> rdd.cartesian(rdd2).collect() #Return the Cartesian product of rdd and rdd2
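cartesian() pairs every element of one RDD with every element of the other, so the result size is the product of the two counts; a local sketch with itertools.product (not a Spark call):

```python
from itertools import product

left = [('a', 7), ('a', 2), ('b', 2)]    # same data as rdd
right = [('a', 2), ('d', 1), ('b', 1)]   # same data as rdd2

pairs = list(product(left, right))       # mirrors rdd.cartesian(rdd2).collect()
print(len(pairs))                        # 9 = 3 * 3
```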


Repartitioning

>>> rdd.repartition(4) #New RDD with 4 partitions
>>> rdd.coalesce(1) #Decrease the number of partitions in the RDD to 1

Stopping SparkContext

>>> sc.stop()


Execution

$ ./bin/spark-submit examples/src/main/python/pi.py


Saving

>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         "org.apache.hadoop.mapred.TextOutputFormat")

