Spark is one of the big data engines trending in recent times, largely because of its ability to process real-time streaming data. Its advantages over traditional MapReduce are:
- It is faster than MapReduce.
- It is well equipped with machine learning capabilities.
- It supports multiple programming languages.
| Hadoop MapReduce | Spark |
|---|---|
| Stores data on local disk | Stores data in-memory |
| Requires external schedulers | Schedules tasks itself |
| High latency | Low latency |
| Slow speed | Faster speed |
| Suitable for batch processing | Suitable for real-time processing |
However, in spite of all these advantages over Hadoop, we often get stuck in situations that arise from inefficiently written application code. These situations and their solutions are discussed below:
- Always try to use reduceByKey instead of groupByKey.
- Prefer treeReduce over reduce, since treeReduce performs partial aggregation on the executors instead of pulling everything to the driver.
- Always try to keep the output of map operations as small as possible.
- Avoid unnecessary shuffles.
- Avoid data skew and heavily uneven partitions.
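The first tip can be illustrated without a cluster. The following is a pure-Python sketch (not actual Spark code) of why reduceByKey shuffles less data than groupByKey: reduceByKey combines values within each partition before the shuffle, so at most one record per distinct key leaves each partition, while groupByKey sends every record across the network. The partition data here is a made-up word-count example.

```python
from collections import defaultdict

# Hypothetical word-count records, split across two "partitions".
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def group_by_key_shuffle(parts):
    """groupByKey-style: every (key, value) record crosses the network."""
    return sum(len(p) for p in parts)

def reduce_by_key_shuffle(parts):
    """reduceByKey-style: values are combined per partition first,
    so at most one record per distinct key per partition is shuffled."""
    shuffled = 0
    for p in parts:
        local = defaultdict(int)
        for k, v in p:
            local[k] += v          # map-side combine
        shuffled += len(local)     # one record per distinct key
    return shuffled

print(group_by_key_shuffle(partitions))   # 7 records shuffled
print(reduce_by_key_shuffle(partitions))  # 4 records shuffled
```

With millions of records per key, the gap between the two grows accordingly, which is why the map-side combine matters.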
Do not let the jobs slow down:
When an application shuffles large amounts of data, a job that should be quick can take hours to run, slowing the whole system down. Skewed keys make this worse, and salting is a common fix.
With salting, each skewed key is split into several sub-keys, and the aggregation happens in two stages:
- aggregation on the salted keys
- aggregation on the unsalted keys, after the salt is removed
Aggregating on the salted keys first and then combining the partial results spreads the hot key across tasks and shrinks the data that crosses the network, so a huge amount of information is saved from being shuffled.
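The two-stage aggregation can be sketched in plain Python (again, not actual Spark code). The key `"hot"`, the record counts, and `NUM_SALTS` are all made-up values for illustration: stage 1 aggregates on `(key, salt)` pairs so the hot key is spread over several sub-keys, and stage 2 strips the salt and merges the partial sums.

```python
import random
from collections import defaultdict

random.seed(0)
NUM_SALTS = 4  # assumption: split each key into up to 4 sub-keys

# A skewed dataset: one hot key dominates.
records = [("hot", 1)] * 1000 + [("cold", 1)] * 10

# Stage 1: aggregate on salted keys, spreading "hot" across sub-keys.
stage1 = defaultdict(int)
for key, value in records:
    salted = (key, random.randrange(NUM_SALTS))
    stage1[salted] += value

# Stage 2: strip the salt and aggregate the partial sums.
stage2 = defaultdict(int)
for (key, _salt), partial in stage1.items():
    stage2[key] += partial

print(dict(stage2))  # {'hot': 1000, 'cold': 10}
# Only the small stage-1 partials (at most NUM_SALTS per key) are shuffled,
# instead of all 1010 raw records.
print(len(stage1))
```

In real Spark code the salt is usually concatenated into the key column before the first reduceByKey/groupBy, then split off before the second one.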
Avoid wrongly sized executors:
In any Spark job, executors are the worker processes responsible for running the individual tasks of the job. They provide in-memory storage for RDDs that user programs cache through the Block Manager. They are created at the very start of a Spark application and stay up for the application's whole lifespan. After processing their tasks, they deliver the output to the driver. A common mistake when writing a Spark application is choosing the wrong executor size: things typically go wrong in assigning the following:
- Number of Executors
- Cores of each executor
- Memory for each executor
Consider a common cluster: 6 nodes, each with 16 cores and 64 GB of RAM.
At the most granular extreme, one core per executor gives 16 executors per node with 64/16 = 4 GB each, but we lose the advantage of running multiple tasks inside the same Java virtual machine. At the least granular extreme, one fat executor per node with all 16 cores and 64 GB leaves no memory free as overhead for the OS and Hadoop daemons, and executors with more than about 5 cores suffer poor HDFS I/O throughput anyway, so even 15 cores per executor gives bad throughput. The sweet spot is 5 cores per executor. Leaving 1 core per node for the OS and Hadoop daemons, we have 6 × 15 = 90 usable cores, so the number of executors is 90/5 = 18. Leaving one executor for the ApplicationMaster, 17 remain, which works out to about 3 executors per node. Memory per executor is then 63/3 = 21 GB, and after subtracting roughly 7% for off-heap overhead, 21 × (1 − 0.07) ≈ 19 GB. Therefore, for this cluster, the correct configuration is 17 executors with 5 cores and 19 GB of RAM each.
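The arithmetic above can be written down as a small sizing calculation. This is only a sketch of the reasoning for the assumed cluster (6 nodes, 16 cores, 64 GB each); the daemon reservations and the 7% overhead factor are the rule-of-thumb values used in the text, not universal constants.

```python
import math

# Assumed cluster from the text: 6 nodes, 16 cores and 64 GB RAM per node.
nodes, cores_per_node, ram_per_node_gb = 6, 16, 64

cores_for_daemons = 1      # leave 1 core per node for OS/Hadoop daemons
ram_for_daemons_gb = 1     # leave ~1 GB of RAM per node for the daemons
cores_per_executor = 5     # >5 cores per executor hurts HDFS I/O throughput

usable_cores = nodes * (cores_per_node - cores_for_daemons)   # 90
total_executors = usable_cores // cores_per_executor          # 18
executors = total_executors - 1                               # 17 (1 for the AM)
executors_per_node = total_executors // nodes                 # 3
raw_mem_gb = (ram_per_node_gb - ram_for_daemons_gb) / executors_per_node  # 21
executor_mem_gb = math.floor(raw_mem_gb * (1 - 0.07))         # ~7% overhead cut

print(executors, cores_per_executor, executor_mem_gb)  # 17 5 19
```

In practice these would become `--num-executors 17 --executor-cores 5 --executor-memory 19G` on spark-submit, adjusted for whatever cluster you actually have.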
Following the tips above will help you avoid common mistakes while writing an Apache Spark application.
Thanks for dropping by!!! Feel free to comment on this post, or you can also send me an email at firstname.lastname@example.org