Spark SQL is the Spark component for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.
A DataFrame is a distributed collection of data organized into named columns, conceptually equivalent to a table in a relational database. DataFrames can be built from different sources:
– Structured textual data files (CSV files, JSON files)
– Existing RDDs
– Hive tables
– External relational databases
Creating a DataFrame from JSON files
Spark SQL provides an API for creating a DataFrame directly from a text file in which each line contains one self-contained JSON object. The file as a whole is therefore not a standard JSON document; this line-delimited layout is often called JSON Lines.
See the example to understand it better.
Creating a DataFrame from an existing RDD
Spark SQL provides an API that allows creating a DataFrame from an existing RDD.
A set of operations can be applied to DataFrames:
1. show(), prints the first rows of the DataFrame (20 by default)
2. printSchema(), prints the schema of the DataFrame
3. count(), returns the number of rows of the DataFrame
4. distinct(), returns a new DataFrame without duplicate rows
5. select(), returns a new DataFrame containing only the specified columns
6. filter(), returns a new DataFrame containing only the rows that satisfy a given condition