Data skew is a significant problem for MapReduce. Hadoop's existing reduce task scheduler is neither locality aware nor partitioning-skew aware. The parallel and distributed nature of the computation can cause unforeseen problems; data skew is a typical one, and high runtime complexity amplifies the skew, leading to highly varying execution times across reducers. Partitioning skew causes shuffle skew, in which some reduce tasks receive more data than others. Shuffle skew can degrade performance, because a job may be delayed by a single reduce task fetching a large amount of input data. In the presence of data skew, a reducer placement method can minimize all-to-all communication between mappers and reducers; the basic idea is to place related map and reduce tasks on the same node, rack, or cluster.

This research point addresses space scheduling in MapReduce. We analyze the sources of data skew and conclude that partitioning skew exists within certain Hadoop applications. Choosing the node at which a reduce task is scheduled can effectively mitigate the shuffle skew problem. In such cases, reducer placement can decrease the traffic between mappers and reducers and improve system performance. We present algorithms that incorporate the network locations and sizes of reducers' partitions into their scheduling decisions in order to reduce network traffic and improve MapReduce performance. Overall, this research point introduces several ways to avoid scheduling delay, scheduling skew, poor system utilization, and a low degree of parallelism.
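To make the idea concrete, the following is a minimal sketch (not the algorithms presented in this work) of a partition-size- and locality-aware placement heuristic. The names greedy_place_reducers, partition_sizes, and node_of_mapper are illustrative assumptions, not Hadoop APIs; the sketch simply places the heaviest (most skewed) partitions first on the nodes that minimize the bytes they must fetch remotely.

```python
# Hypothetical sketch of skew-aware reducer placement; names are illustrative.

def greedy_place_reducers(partition_sizes, node_of_mapper, nodes):
    """
    partition_sizes[r][m]: bytes mapper m will emit for reducer r.
    node_of_mapper[m]:     node hosting mapper m.
    nodes:                 candidate nodes for reduce tasks.
    Returns a dict reducer -> node, chosen to minimize remotely fetched bytes.
    """
    placement = {}
    # Handle the heaviest partitions first so the most skewed reducers
    # get the most local placements.
    order = sorted(partition_sizes,
                   key=lambda r: -sum(partition_sizes[r].values()))
    for r in order:
        best_node, best_remote = None, None
        for n in nodes:
            # Bytes reducer r would have to pull from other nodes if placed on n.
            remote = sum(size for m, size in partition_sizes[r].items()
                         if node_of_mapper[m] != n)
            if best_remote is None or remote < best_remote:
                best_node, best_remote = n, remote
        placement[r] = best_node
    return placement
```

A real scheduler would also account for node load and rack-level bandwidth, but the sketch captures the core trade-off: partition size and network location drive the placement decision.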
Cross-Rack Communication Optimization
In the Hadoop framework, a user provides two functions, a mapper and a reducer, to process data. Mappers produce sets of intermediate files that are sent to the reducers, and each reducer receives files from all the mappers, forming an all-to-all communication model. Cross-rack communication occurs whenever a mapper and a reducer reside in different racks, which is very common in today's data center environments. Typically, Hadoop runs in a data center in which machines are organized into racks. Each rack has a top-of-rack switch, and each top-of-rack switch is connected to a root switch. Every cross-rack transfer must travel through the root switch, so the root switch becomes a bottleneck. Because MapReduce employs an all-to-all communication model between mappers and reducers, the bandwidth of the top-of-rack switches saturates during the shuffle phase, which straggles some reducers and increases job execution time.
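The sketch below illustrates, under assumed inputs, how much shuffle data crosses the root switch in this all-to-all exchange. The names shuffle_bytes, rack_of_mapper, and rack_of_reducer are hypothetical; any pair whose mapper and reducer sit in different racks contributes to root-switch traffic.

```python
# Illustrative estimate of root-switch (cross-rack) shuffle traffic.

def cross_rack_bytes(shuffle_bytes, rack_of_mapper, rack_of_reducer):
    """
    shuffle_bytes[m][r]: intermediate bytes mapper m sends to reducer r.
    rack_of_mapper / rack_of_reducer: rack assignment of each task.
    Returns the total bytes that must traverse the root switch.
    """
    total = 0
    for m, per_reducer in shuffle_bytes.items():
        for r, size in per_reducer.items():
            if rack_of_mapper[m] != rack_of_reducer[r]:
                total += size  # leaves one rack and enters another via the root
    return total
```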
Data Locality and Distributed Reduce Tasks
MapReduce assumes a master-slave architecture and a tree-style network topology. Nodes are spread over different racks within one or more data centers. A salient point is that the bandwidth between two nodes depends on their relative locations in the network topology; for example, nodes in the same rack have higher bandwidth between them than nodes that are off-rack. As such, it pays to minimize data shuffling across racks. The master in MapReduce is responsible for scheduling map tasks and reduce tasks on slave nodes after receiving task requests from the slaves. Hadoop attempts to schedule map tasks in proximity to their input splits in order to avoid transferring the splits over the network. In contrast, Hadoop schedules reduce tasks at the requesting slaves without any data locality consideration. As a result, unnecessary data may be shuffled across the network, causing performance degradation.
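A locality-aware scheduler needs a way to rank candidate nodes by their position in this tree topology. The following is a minimal sketch of such a distance measure, in the spirit of Hadoop's rack-aware topology (same node = 0, same rack = 2, off-rack = 4, counting switch hops); the node paths such as "/dc1/rack1/node3" are assumed for illustration.

```python
# Sketch of a tree-topology distance for ranking candidate nodes.

def topo_distance(path_a, path_b):
    """Count hops between two nodes described by topology paths."""
    a = path_a.strip("/").split("/")
    b = path_b.strip("/").split("/")
    # Drop the shared prefix (common ancestors in the topology tree).
    while a and b and a[0] == b[0]:
        a, b = a[1:], b[1:]
    return len(a) + len(b)

assert topo_distance("/dc1/rack1/node3", "/dc1/rack1/node3") == 0  # same node
assert topo_distance("/dc1/rack1/node3", "/dc1/rack1/node4") == 2  # same rack
assert topo_distance("/dc1/rack1/node3", "/dc1/rack2/node7") == 4  # off-rack
```

A reduce-side scheduler could use such a measure, weighted by partition size, to prefer nodes closer to the bulk of a reducer's input rather than simply granting the task to whichever slave asks first.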
MapReduce Network Traffic
The network is typically a bottleneck in MapReduce-based systems. By scheduling reducers at their center-of-gravity nodes, we argue that network traffic is reduced, which can allow more MapReduce jobs to co-exist in the same system.
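The sketch below shows one plausible reading of "center of gravity": for a single reducer, pick the candidate node that minimizes the partition-size-weighted network distance to the nodes holding its input fragments. The helper names and the example rack map are assumptions for illustration, not the scheme's actual implementation.

```python
# Hypothetical center-of-gravity placement for one reducer.

def center_of_gravity(partition_from_node, candidate_nodes, distance):
    """
    partition_from_node: dict node -> bytes this reducer must fetch from it.
    candidate_nodes:     nodes where the reduce task could run.
    distance:            network-distance function between two nodes.
    Returns the candidate with the smallest size-weighted shuffle cost.
    """
    def cost(candidate):
        return sum(size * distance(candidate, src)
                   for src, size in partition_from_node.items())
    return min(candidate_nodes, key=cost)

# Example with a rack-level distance: 0 on the same node, 2 in-rack, 4 off-rack.
rack = {"n1": "r1", "n2": "r1", "n3": "r2"}
dist = lambda a, b: 0 if a == b else (2 if rack[a] == rack[b] else 4)
print(center_of_gravity({"n1": 700, "n3": 100}, ["n1", "n2", "n3"], dist))  # -> n1
```

In the example, the node already holding most of the reducer's input (n1) wins, which is exactly the behavior that keeps shuffle traffic off the root switch.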