Month: April 2017

Iterative TCP Client

Unix socket programming makes it possible to write programs that follow several I/O patterns (blocking, non-blocking, multiplexing, signal-driven, asynchronous), with the goal of obtaining software optimized for the needs of the application.

Here we focus on the blocking pattern: look at the left-side column of the function stack in the image and follow it downwards through the explanation.

Once the socket is created and connected to the server, we send a request with the write() function and then block the execution flow waiting for a reply with the read() function; the heart of this pattern is precisely the write()/read() pair. When the reply reaches the client, the client handles it (how depends on what the application does) and iterates back to write() to send a new request. The process exits when a particular reply from the server is read and processed.
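The loop above can be sketched in Python as follows. This is a minimal sketch, not the program from the repository: the request list and the convention that a reply starting with "bye" ends the iteration are assumptions made for illustration.

```python
import socket

def iterative_client(host, port, requests):
    """Blocking iterative client: send a request, block on the reply, repeat."""
    replies = []
    # create_connection() blocks until the TCP handshake completes
    with socket.create_connection((host, port)) as sock:
        for request in requests:
            sock.sendall(request)          # the write(): push the request out
            reply = sock.recv(1024)        # the read(): block until a reply arrives
            replies.append(reply)
            if reply.startswith(b"bye"):   # a particular reply ends the iteration
                break
    return replies
```

Note that recv() here assumes each small reply arrives in one piece, which holds on a loopback connection but would need a proper framing protocol in general.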

Find the source code of this program in this repository:


Spark MLlib

Spark MLlib is the Spark component that provides machine learning and data mining algorithms: pre-processing techniques, classification, clustering and itemset mining.


Spark MLlib is based on a set of basic local and distributed data types:
1. local vectors,
2. labeled points,
3. distributed matrices,
4. DataFrames, built on top of these.


This library makes use of:
1. DataFrame, used as the ML dataset;
2. Transformer, used to transform one DataFrame into another;
3. Estimator, an algorithm that fits on a DataFrame to build a model;
4. Pipeline, used to chain Transformers and Estimators into a workflow;
5. Parameter: Transformers and Estimators share a common API for specifying their parameters.


Spark MLlib provides only a limited set of classification algorithms:
1. Logistic regression
2. Decision trees
3. SVMs
4. Naive Bayes


The Spark MLlib provides only a limited set of clustering algorithms:
1. K-means
2. Gaussian mixture


Spark MLlib also provides a set of regression algorithms.


Spark MLlib provides the following association rule mining algorithm:
1. FP-Growth

Spark: broadcast variables

Spark supports broadcast variables, which are read-only shared variables that are sent to the nodes of the cluster and used inside the performed actions.

They are usually used to share large lookup tables.

Spark: accumulators

Spark provides a type of shared variable called accumulators: variables to which values can only be added, through an associative operation, so that they can be supported efficiently in parallel. They help avoid the problem of independent per-node copies across the cluster. Let us see how.

When a function is shipped to a cluster node, the node works only on copies of the variables that the function uses, so updates made there never reach the driver. To avoid this scenario, or in contexts where a shared counter is needed, an accumulator as a shared variable is crucial.

Spark natively supports accumulators of numeric types, but programmers can add support for new data types as they see fit.


Spark: persistence and cache

Spark computes the content of an RDD each time an action is invoked on it; if the same RDD is used many times, its content is recomputed every time.

The problem is evident and quite relevant, but there is a clean solution: the cache. Specifically, we can ask Spark to persist some RDDs in the cache, which avoids recomputing their content every time. The first time a cached RDD is used in the application, it is computed and stored across several nodes of the cluster, so on subsequent uses its partitions are already available at each node.

This persistence can be done at different storage levels: it is possible to tell the application to use the main memory, the local disks, or a mixture of both.

Spark also has an automatic feature that monitors cache usage on each node and, when data gets old, drops it in least-recently-used (LRU) fashion.

Spark: key-value pairs

Spark supports RDDs of key-value pairs; they can be created by applying transformations, and they are characterized by specific operations.

The list of transformations that can be performed on one pair RDD is:
1. reduceByKey()
2. foldByKey()
3. combineByKey()
4. groupByKey()
5. mapValues()
6. flatMapValues()
7. keys()
8. values()
9. sortByKey()

The list of transformations that can be performed on two pair RDDs is:
1. subtractByKey()
2. join()
3. cogroup()

The list of actions that can be performed on a pair RDD is:
1. countByKey()
2. collectAsMap()
3. lookup()

The list of transformations that produce a double RDD is:
1. mapToDouble()
2. flatMapToDouble()

The list of actions that can be performed on a double RDD is:
1. sum()
2. mean()
3. stdev()
4. variance()
5. max()
6. min()

HOW-TO Run a Spark Application

Spark programs are executed using the spark-submit command!

This post is part of the Big Data Trail, so if you want a more general view of the whole topic, please follow the link.

Before reading this, you also need to be familiar with:
1. Big Data
2. Overview of Spark

And after reading this, you will learn:
1. the spark-submit command and its options
2. how to run a Spark application on a local machine

Some of the commonly used options are:
--class: the entry point for your application
--master: the master URL for the cluster
--deploy-mode: whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client)
--conf: an arbitrary Spark configuration property in key=value format
But also others related to the application:
application-jar: path to a jar including your application with its dependencies
application-arguments: arguments passed to the main method of the main class
Others related to the executors:
--num-executors, to specify the number of executors
--executor-cores, to specify the number of cores for each executor
--executor-memory, to specify the main memory per executor
And others related to the driver:
--driver-cores, to specify the number of cores to be used by the driver
--driver-memory, to specify the main memory for the driver
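Putting the options together, a hypothetical invocation looks like this (the class name, jar file and paths are placeholders, not a real application):

```shell
# Submit a (hypothetical) jar to a local master in client mode
spark-submit \
  --class it.example.WordCount \
  --master local[*] \
  --deploy-mode client \
  --executor-memory 2G \
  --num-executors 4 \
  --conf spark.ui.showConsoleProgress=false \
  wordcount.jar /input/path /output/path
```

With --master local[*] the application runs on the local machine using all available cores; executor-related options only take full effect on a real cluster manager.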

Other relevant resources are available at:

Spark: Actions for RDD-based programming

The Spark actions are used to store the handled data, expressed as an RDD, either into a local Java variable, paying attention to the size of the returned value, or into an output file.

The basic actions that are possible to perform are:

  1. collect()
  2. count()
  3. countByValue()
  4. take()
  5. top()
  6. takeSample()
  7. reduce()
  8. fold()
  9. aggregate()
  10. foreach()

Let us look at them more closely.

  1. collect()
    Used to store the RDD's objects into a local Java list
  2. count()
    Used to count the number of elements of an RDD
  3. countByValue()
    Counts the number of occurrences of each value and puts them into a local Java map
  4. take()
    Used to take the first n elements of an RDD
  5. top()
    Used to take the n largest elements of an RDD
  6. takeSample()
    Used to take n random elements of an RDD
  7. reduce()
    Used to obtain a single Java object by combining the objects with a user-provided function
  8. fold()
    Like reduce(), but characterized by a “zero” value
  9. aggregate()
    Used to create a single Java object by combining the objects with two user-provided functions
  10. foreach()
    Used to trigger a computation on each element, such as inserting RDD elements into a database.