Tuesday, March 13, 2018

Spark Learn



scala> spark.version
scala> spark.readStream.format("rate").load.writeStream
scala> import org.apache.spark.sql.streaming._ // to get Trigger
scala> import scala.concurrent.duration._ // to get .seconds for Integer
scala> spark.readStream.format("rate").load.writeStream.format("console").trigger(Trigger.ProcessingTime(10.seconds)).start
Note: by default Structured Streaming uses Trigger.ProcessingTime(0), i.e. it runs micro-batches as fast as possible, each batch starting as soon as the previous one finishes
Note: Structured Streaming has no "Streaming" tab in the Web UI (that tab belongs to the old DStream-based Spark Streaming)
Here .start on the DataStreamWriter is what actually starts the streaming query --> meaning you can't call .show on a streaming Dataset; it has to be written out via writeStream...start (see the sketch below)
RateStreamSource is a streaming source that generates consecutive numbers with a timestamp, useful for testing and PoCs.
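A minimal sketch of the same rate-source query, this time capturing the StreamingQuery handle that .start returns so the query can be monitored and stopped (the console sink and the 60-second wait are just illustrative choices):

scala> import org.apache.spark.sql.streaming.Trigger
scala> import scala.concurrent.duration._
scala> val query = spark.readStream.format("rate").load.writeStream.format("console").trigger(Trigger.ProcessingTime(10.seconds)).start
scala> query.status                   // current StreamingQueryStatus (e.g. "Waiting for next trigger")
scala> query.awaitTermination(60000)  // block the driver for up to 60 seconds
scala> query.stop                     // stop the streaming query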

---- Notes before March 2018----------------------

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. 
If there are multiple root directories, please load them separately and then union them.

Actually the data path contained a mix of directories: one set of Hive-style partition dirs and another set of dated dirs,
i.e. there were paths like these (see the sketch after the list) --
/pathToMainData/mktCd=<xyz>/event_mo=<202008>/

/pathToMainData/mktCd=<xyz>/event_mo=<202009>/ 

/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>/ 
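A rough sketch of the two-step load this suggests, assuming the data is Parquet (the paths and partition-column names are the placeholders from above):

val partitioned = spark.read
  .option("basePath", "/pathToMainData")   // table root, so mktCd / event_mo stay partition columns
  .parquet("/pathToMainData/mktCd=<xyz>/event_mo=<202008>",
           "/pathToMainData/mktCd=<xyz>/event_mo=<202009>")

// The dated (non-partition) dirs would be loaded separately and unioned in,
// as the error message suggests (illustrative only, schemas must line up):
// val dated = spark.read.parquet("/pathToMainData/mktCd=<xyz>/<abc_date1_xyz>")
// val all   = partitioned.union(dated)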

Partition - repartition, coalesce and minPartitions
Set the minimum number of partitions created while reading a file via the optional minPartitions parameter of textFile:
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv", minPartitions=24)
Another way to achieve this is repartition or coalesce: use coalesce if you only need to reduce the number of partitions (it avoids a full shuffle); otherwise use repartition.
rating_data_raw = sc.textFile("/<path_to_csv_file>.csv").repartition(24)

The scheduler splits the RDD graph into stages, based on the transformations.
Narrow transformations (transformations without data movement) are grouped (pipelined) together into a single stage; a wide transformation introduces a shuffle and starts a new stage, so a plan with a single shuffle has two stages, with everything before the ShuffledRDD in the first stage.
Each stage consists of multiple tasks, one per partition.
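For example, a minimal word-count sketch (illustrative path) whose lineage shows this split:

val counts = sc.textFile("/<path_to_text_file>")
  .flatMap(_.split(" "))
  .map(word => (word, 1))   // narrow: pipelined into stage 1 with textFile/flatMap
  .reduceByKey(_ + _)       // wide: shuffle boundary, stage 2 starts here
counts.collect()            // action triggers the job (2 stages, tasks = partitions)
counts.toDebugString        // prints the lineage with the ShuffledRDD boundary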


Here is a summary of the components of execution:
  • Task: a unit of execution that runs on a single machine
  • Stage: a group of tasks, based on partitions of the input data, which will perform the same computation in parallel
  • Job: has one or more stages
  • Pipelining: collapsing of RDDs into a single stage, when RDD transformations can be computed without data movement
  • DAG: Logical graph of RDD operations
  • RDD: Parallel dataset with partitions


Here is how a Spark application runs (a minimal driver-program sketch follows this list):
  • A Spark application runs as independent processes, coordinated by the SparkContext object in the driver program.
  • The task scheduler launches tasks via the cluster manager; in this case it’s YARN.
  • The cluster manager assigns tasks to workers, one task per partition.
  • A task applies its unit of work to the elements in its partition, and outputs a new partition.
    • Partitions can be read from an HDFS block, HBase or other source and cached on a worker node (data does not have to be written to disk between tasks like with MapReduce).
  • Results are sent back to the driver application.
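
A minimal, illustrative driver program matching this flow (the app name, partition count and arithmetic are arbitrary; submission would be via spark-submit --master yarn):

import org.apache.spark.sql.SparkSession

object DriverApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DriverApp").getOrCreate()
    val sc = spark.sparkContext                 // coordinates the executor processes
    val doubled = sc.parallelize(1 to 1000, 8)  // 8 partitions => 8 tasks per stage
      .map(_ * 2)                               // each task transforms its own partition
    println(doubled.sum())                      // action; result comes back to the driver
    spark.stop()
  }
}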

Spark Listener Events (YouTube: watch?v=mVP9sZ6K__Y)
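A small sketch of hooking into these events with a custom SparkListener from the spark-shell, where sc is in scope (the printed messages are just illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerStageCompleted}

sc.addSparkListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} ended with ${jobEnd.jobResult}")
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
    println(s"Stage ${stageCompleted.stageInfo.stageId} ran ${stageCompleted.stageInfo.numTasks} tasks")
})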