What is ‘secondary sort’ in Hadoop, and how does it work?

Secondary sorting is non-trivial, and it’s easier to explain with the help of some pictures.

First things first: let’s look at the definition.

Secondary sort is a technique that allows the MapReduce programmer to control the order in which the values show up within a reduce function call.

Now what the heck does that mean? A concrete example helps.

Let’s assume that our secondary sorting is on a composite key made of Last Name and First Name.

First name + Last name
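
To make the composite key concrete, here is a minimal plain-Java sketch (no Hadoop dependency). It assumes that Last Name is the natural key and First Name is the part used only for ordering; the class and method names (CompositeKeySketch, NameKey, sorted) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java illustration only; in a real job the key would be a Hadoop
// WritableComparable. All names here are hypothetical.
class CompositeKeySketch {

    // Composite key. Assumption: lastName is the natural key and
    // firstName is the secondary field that only affects ordering.
    static final class NameKey {
        final String lastName;
        final String firstName;

        NameKey(String lastName, String firstName) {
            this.lastName = lastName;
            this.firstName = firstName;
        }

        @Override
        public String toString() {
            return lastName + ", " + firstName;
        }
    }

    // Full sort order: natural key first, then the secondary field.
    static final Comparator<NameKey> SORT_ORDER =
            Comparator.comparing((NameKey k) -> k.lastName)
                      .thenComparing(k -> k.firstName);

    // Returns a sorted copy, leaving the input list untouched.
    static List<NameKey> sorted(List<NameKey> keys) {
        List<NameKey> copy = new ArrayList<>(keys);
        copy.sort(SORT_ORDER);
        return copy;
    }
}
```

Sorting on the whole composite key is what puts the first names in order within each last name.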

Now let’s look at the steps involved in secondary sorting.

Steps in secondary sorting

The partitioner and the group comparator use only the natural key. The partitioner uses it to route all records with the same natural key to a single reducer. This partitioning happens in the map phase. Reducers then receive data from the various map tasks, group it, and pass each group to the reduce method. This grouping is where the group comparator comes into the picture: had we not specified a custom group comparator, Hadoop would have used the default implementation, which compares the entire composite key. Every distinct composite key would then form its own group, which would lead to incorrect results.
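
The roles of the partitioner, the sort order, and the group comparator can be sketched as a small plain-Java simulation. This is not Hadoop code: it only mimics what a single reducer does with the keys it receives. It assumes Last Name is the natural key, and every name in it (SecondarySortSketch, NameKey, partition, reduceCalls) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of secondary sort; not the Hadoop API.
class SecondarySortSketch {

    // Composite key. Assumption: lastName is the natural key.
    static final class NameKey {
        final String lastName;
        final String firstName;

        NameKey(String lastName, String firstName) {
            this.lastName = lastName;
            this.firstName = firstName;
        }

        @Override
        public String toString() {
            return lastName + ", " + firstName;
        }
    }

    // Partitioner: looks at the natural key only, so every record with
    // the same last name lands on the same reducer.
    static int partition(NameKey key, int numReducers) {
        return (key.lastName.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Sort comparator: compares the entire composite key.
    static final Comparator<NameKey> SORT =
            Comparator.comparing((NameKey k) -> k.lastName)
                      .thenComparing(k -> k.firstName);

    // Group comparator: compares the natural key only.
    static final Comparator<NameKey> GROUP =
            Comparator.comparing((NameKey k) -> k.lastName);

    // Sorts the received keys, then starts a new group (one simulated
    // reduce call) whenever the given group comparator reports a change.
    static List<List<NameKey>> reduceCalls(List<NameKey> received,
                                           Comparator<NameKey> group) {
        List<NameKey> sorted = new ArrayList<>(received);
        sorted.sort(SORT);
        List<List<NameKey>> calls = new ArrayList<>();
        NameKey previous = null;
        for (NameKey key : sorted) {
            if (previous == null || group.compare(previous, key) != 0) {
                calls.add(new ArrayList<>());
            }
            calls.get(calls.size() - 1).add(key);
            previous = key;
        }
        return calls;
    }
}
```

Passing GROUP to reduceCalls yields one group per last name with the first names already in order, while passing SORT (the whole composite key, as the default would) yields one group per distinct name. In a real Hadoop job the three pieces are registered on the Job with setPartitionerClass, setSortComparatorClass, and setGroupingComparatorClass.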

Finally, reviewing the steps involved in an MR job and relating them to secondary sorting should help clear up any lingering doubts.

Overview

Source: Stack Overflow