Spark SQL

Spark SQL is the Spark component for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.


A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database. DataFrames can be built from different sources:
Structured text files (CSV files, JSON files)
Existing RDDs
Hive tables
External relational databases

Creating a DataFrame from json files

Spark SQL provides an API for creating a DataFrame directly from a text file in which each line contains one JSON object (so the file as a whole is not a standard JSON document).

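For example, a file whose two lines are {"name": "Andy", "age": 30} and {"name": "Justin", "age": 19} produces a DataFrame with two rows and the columns age and name.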
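A minimal PySpark sketch of this step follows. The sample records, the file name people.json and the helper load_people are hypothetical, chosen only for illustration, and spark is assumed to be an existing SparkSession. The top-level code only prepares the input file with the standard library; the Spark call itself is the single line inside load_people.

```python
import json
import os
import tempfile

# Hypothetical sample records; each one becomes a single line of the input file.
people = [
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

# Write one JSON object per line, which is the layout Spark SQL expects.
path = os.path.join(tempfile.mkdtemp(), "people.json")
with open(path, "w") as f:
    for person in people:
        f.write(json.dumps(person) + "\n")


def load_people(spark, json_path):
    """Read a one-object-per-line JSON file into a DataFrame.

    spark is assumed to be a pyspark.sql.SparkSession, for example the one
    returned by SparkSession.builder.getOrCreate().
    """
    # The schema (column names and types) is inferred from the data.
    return spark.read.json(json_path)
```

With a running session, load_people(spark, path) returns a DataFrame with one row per line of the file.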

Creating a DataFrame from existing RDD

Spark SQL provides an API for creating a DataFrame from an existing RDD, either by inferring the schema from the RDD's elements or by supplying the schema explicitly.
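As a rough PySpark sketch of this conversion: the helper names parse_person and people_dataframe, the "name,age" line layout and the column names are all assumptions made for the example, and spark is assumed to be an existing SparkSession. The RDD is built from plain text lines, mapped to tuples, and then handed to createDataFrame together with a list of column names.

```python
def parse_person(line):
    """Turn a text line such as 'Andy,30' into a (name, age) tuple."""
    name, age = line.split(",")
    return name.strip(), int(age)


def people_dataframe(spark, lines):
    """Build a DataFrame from an RDD of (name, age) tuples.

    spark is assumed to be a pyspark.sql.SparkSession; lines is a list of
    strings in the hypothetical 'name,age' layout.
    """
    # Create an RDD from the local list and parse every element.
    rdd = spark.sparkContext.parallelize(lines).map(parse_person)
    # Column names are given explicitly; column types are inferred from the tuples.
    return spark.createDataFrame(rdd, schema=["name", "age"])
```

For instance, people_dataframe(spark, ["Andy,30", "Justin,19"]) would give a two-row DataFrame with the columns name and age.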

DataFrame Operations

DataFrames support a set of operations, including:
1. show(), which prints the first rows of a DataFrame
2. printSchema(), which prints the schema of the DataFrame
3. count(), which returns the number of rows in a DataFrame
4. distinct(), which returns a new DataFrame without duplicate rows
5. select(), which selects specific columns from the DataFrame
6. filter(), which keeps only the rows that satisfy a given condition
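The operations listed above can be sketched in PySpark as follows. The function explore, the column names name and age, and the threshold 21 are hypothetical; df is assumed to be a DataFrame that has those two columns.

```python
def explore(df):
    """Apply the basic DataFrame operations to df and collect the results.

    df is assumed to be a pyspark.sql.DataFrame with the columns 'name' and 'age'.
    """
    df.show()          # print the first rows as a text table
    df.printSchema()   # print column names and types as a tree

    return {
        "count": df.count(),                 # number of rows
        "unique": df.distinct(),             # new DataFrame without duplicate rows
        "names": df.select("name"),          # new DataFrame with only the 'name' column
        "adults": df.filter(df["age"] >= 21),  # rows whose age is at least 21
    }
```

Note that distinct(), select() and filter() return new DataFrames and leave df unchanged, while show(), printSchema() and count() produce output or a value directly.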
