Tuesday, March 13, 2018

Spark Learn



scala> spark.version
scala> spark.readStream.format("rate").load.writeStream
scala> import org.apache.spark.sql.streaming._ // to get Trigger
scala> import scala.concurrent.duration._ // to get .seconds for Integer
scala> spark.readStream.format("rate").load.writeStream.format("console").trigger(Trigger.ProcessingTime(10.seconds)).start
Note: by default Structured Streaming uses Trigger.ProcessingTime(0), i.e. it runs micro-batches as fast as possible, each batch starting as soon as the previous one finishes
Note: Structured Streaming has no "Streaming" tab in the Web UI (that tab belongs to the old DStream-based Spark Streaming)
Here .start on the DataStreamWriter is what actually starts the streaming query --> meaning you can't call .show on a streaming Dataset; it has to be written out via writeStream...start (see the sketch below)
RateStreamSource is a streaming source that generates consecutive numbers with a timestamp, useful for testing and PoCs.
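A minimal sketch of the same rate-source query, this time capturing the StreamingQuery handle that .start returns so the query can be monitored and stopped (the console sink and the 60-second wait are just illustrative choices):

scala> import org.apache.spark.sql.streaming.Trigger
scala> import scala.concurrent.duration._
scala> val query = spark.readStream.format("rate").load.writeStream.format("console").trigger(Trigger.ProcessingTime(10.seconds)).start
scala> query.status                   // current StreamingQueryStatus (e.g. "Waiting for next trigger")
scala> query.awaitTermination(60000)  // block the driver for up to 60 seconds
scala> query.stop                     // stop the streaming query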

---- Notes before March 2018----------------------

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. 
If there are multiple root directories, please load them separately and then union them.

Actually the data path contained a mix of directories: one set of Hive-style partition dirs and another set of dated dirs,
i.e. there were paths like these (see the sketch after the list) --
/pathToMainData/mktCd=<xyz>/event_mo=<202008>/

/pathToMainData/mktCd=<xyz>/event_mo=<202009>/ 

/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>/ 
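A rough sketch of the two-step load this suggests, assuming the data is Parquet (the paths and partition-column names are the placeholders from above):

val partitioned = spark.read
  .option("basePath", "/pathToMainData")   // table root, so mktCd / event_mo stay partition columns
  .parquet("/pathToMainData/mktCd=<xyz>/event_mo=<202008>",
           "/pathToMainData/mktCd=<xyz>/event_mo=<202009>")

// The dated (non-partition) dirs would be loaded separately and unioned in,
// as the error message suggests (illustrative only, schemas must line up):
// val dated = spark.read.parquet("/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>")
// val all   = partitioned.union(dated)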

Partition - repartition, coalesce and minPartitions
Set the minimum number of partitions created while reading a file via the optional minPartitions parameter of textFile:
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=24)
Another way to achieve this is repartition or coalesce: use coalesce if you only need to reduce the number of partitions (it avoids a full shuffle); otherwise use repartition.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(24)

The scheduler splits the RDD graph into stages, based on the transformations.
Narrow transformations (transformations without data movement) are grouped (pipelined) together into a single stage; a wide transformation introduces a shuffle and starts a new stage, so a plan with a single shuffle has two stages, with everything before the ShuffledRDD in the first stage.
Each stage consists of multiple tasks, one per partition.
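For example, a minimal word-count sketch (illustrative path) whose lineage shows this split:

val counts = sc.textFile("/<path_to_text_file>")
  .flatMap(_.split(" "))
  .map(word => (word, 1))   // narrow: pipelined into stage 1 with textFile/flatMap
  .reduceByKey(_ + _)       // wide: shuffle boundary, stage 2 starts here
counts.collect()            // action triggers the job (2 stages, tasks = partitions)
counts.toDebugString        // prints the lineage with the ShuffledRDD boundary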


Here is a summary of the components of execution:
  • Task: a unit of execution that runs on a single machine
  • Stage: a group of tasks, based on partitions of the input data, which will perform the same computation in parallel
  • Job: has one or more stages
  • Pipelining: collapsing of RDDs into a single stage, when RDD transformations can be computed without data movement
  • DAG: Logical graph of RDD operations
  • RDD: Parallel dataset with partitions


Here is how a Spark application runs (a minimal driver-program sketch follows this list):
  • A Spark application runs as independent processes, coordinated by the SparkContext object in the driver program.
  • The task scheduler launches tasks via the cluster manager; in this case it’s YARN.
  • The cluster manager assigns tasks to workers, one task per partition.
  • A task applies its unit of work to the elements in its partition, and outputs a new partition.
    • Partitions can be read from an HDFS block, HBase or other source and cached on a worker node (data does not have to be written to disk between tasks like with MapReduce).
  • Results are sent back to the driver application.
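
A minimal, illustrative driver program matching this flow (the app name, partition count and arithmetic are arbitrary; submission would be via spark-submit --master yarn):

import org.apache.spark.sql.SparkSession

object DriverApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DriverApp").getOrCreate()
    val sc = spark.sparkContext                 // coordinates the executor processes
    val doubled = sc.parallelize(1 to 1000, 8)  // 8 partitions => 8 tasks per stage
      .map(_ * 2)                               // each task transforms its own partition
    println(doubled.sum())                      // action; result comes back to the driver
    spark.stop()
  }
}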

Spark Listener Events (YouTube: watch?v=mVP9sZ6K__Y)
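A small sketch of hooking into these events with a custom SparkListener from the spark-shell, where sc is in scope (the printed messages are just illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended with ${jobEnd.jobResult}")
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stageCompleted.stageInfo.stageId} ran ${stageCompleted.stageInfo.numTasks} tasks")
})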