Tuesday, September 8, 2020

Repartition vs Coalesce in Apache Spark

 

What is shuffling in Spark?

As we know, Spark shuffles data when we use a wide transformation or certain actions. As a result, data needs to travel across the network between machines, which requires data serialization and deserialization. To control partitioning, Spark provides two transformations: repartition and coalesce.

By default, when we perform a shuffle, Spark outputs 200 shuffle partitions. This can be controlled using the config parameter below.

spark.conf.set("spark.sql.shuffle.partitions", "100") 

Repartition: repartition returns a new RDD with exactly the number of partitions (`numPartitions`) passed to this transformation. It can increase or decrease the level of parallelism in the RDD.

Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in the RDD, consider using `coalesce`, which can avoid performing a shuffle.

Coalesce: coalesce returns a new RDD that is reduced to the `numPartitions` passed to this transformation. This results in a narrow dependency; e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle; instead, each of the 100 new partitions will claim 10 of the current partitions. If a larger number of partitions is requested, it will stay at the current number of partitions.

However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you would like (e.g. one node in the case of numPartitions = 1). 

To avoid this, you can pass shuffle = true. This adds a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is). With shuffle = true, you can actually coalesce to a larger number of partitions. This is useful if you have a small number of partitions, say 100, potentially with a few partitions being abnormally large. Calling coalesce(1000, shuffle = true) will result in 1000 partitions with the data distributed using a hash partitioner. The optional partition coalescer passed in must be serializable.

Coalesce can create unequal partitions, whereas repartition creates roughly equal-sized partitions, which helps downstream processing and avoids data skew problems.

So it depends purely on the use case you have; test it before deploying to production.


Keep Learning! Keep Rocking!!

Tuesday, September 24, 2019

What is MapReduce?


MapReduce is a processing engine in Hadoop. It can process only batch data, i.e. bounded data.
Internally it processes disk to disk, so it is very slow.
Everything must be optimized manually; it allows different ecosystems like Hive, Pig, and more to process the data.

What is YARN?


YARN is a distributed OS, also called a cluster manager, used to process huge amounts of data in parallel and quickly.
It can process different types of workloads at the same time, such as batch, streaming, iterative, and more.
It's a unified stack.

What is HDFS?


HDFS is a file system that stores data in a reliable manner. It consists of two types of nodes, the NameNode and DataNodes, which store the metadata and the actual data respectively.

HDFS is a block-structured file system. Just like Linux file systems, HDFS splits a file into fixed-size blocks, also known as partitions or splits. The default block size is 128 MB.
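As a quick illustration of the block arithmetic (assuming the 128 MB default; the last block holds whatever remains and may be smaller):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def num_blocks(file_size_mb: int) -> int:
    # A file is split into fixed-size blocks; the final block
    # may be partial, hence the ceiling division.
    return math.ceil(file_size_mb / BLOCK_SIZE_MB)

blocks_1gb = num_blocks(1024)   # 8 full blocks
blocks_300mb = num_blocks(300)  # 3 blocks: 128 + 128 + 44 MB
```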

What is Hadoop?



Hadoop is one of the first popular open-source big data technologies. It is a scalable,
fault-tolerant system for processing large datasets across a cluster of commodity hardware.

Internal components:
HDFS & YARN, with MapReduce