Commit 4b06dc2

docs: updates for MapReduce
1 parent 8ca487e commit 4b06dc2

2 files changed

+77
-40
lines changed

map-reduce/README.md

+77-40
Original file line number | Diff line number | Diff line change
@@ -2,20 +2,26 @@
22
title: "MapReduce Pattern in Java"
33
shortTitle: MapReduce
44
description: "Learn the MapReduce pattern in Java with real-world examples, class diagrams, and tutorials. Understand its intent, applicability, benefits, and known uses to enhance your design pattern knowledge."
5-
category: Structural
5+
category: Functional
66
language: en
77
tag:
8-
- Delegation
8+
- Concurrency
9+
- Data processing
10+
- Data transformation
11+
- Functional decomposition
12+
- Immutable
13+
- Multithreading
14+
- Scalability
915
---
1016

1117
## Also known as
1218

13-
* Split-Apply-Combine Strategy
14-
* Scatter-Gather Pattern
19+
* Map-Reduce
20+
* Divide and Conquer for Data Processing
1521

1622
## Intent of Map Reduce Design Pattern
1723

18-
MapReduce aims to process and generate large datasets with a parallel, distributed algorithm on a cluster. It divides the workload into two main phases: Map and Reduce, allowing for efficient parallel processing of data.
24+
To efficiently process large-scale datasets by dividing computation into two phases: map and reduce, which can be executed in parallel and distributed across multiple nodes.
1925

2026
## Detailed Explanation of Map Reduce Pattern with Real-World Examples
2127

@@ -29,19 +35,22 @@ In plain words
2935
3036
Wikipedia says
3137

32-
> "MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster".
33-
MapReduce consists of two main steps:
34-
The "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
35-
The "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
36-
This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
38+
> MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. MapReduce consists of two main steps:
39+
The Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. The Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
40+
41+
Flowchart
42+
43+
![MapReduce flowchart](./etc/mapreduce-flowchart.png)
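
The master/worker flow described above can be sketched on a single machine. This is only an illustration and not part of the README's code: the class `SplitMapReduceSketch`, the method `sumOfSquares`, the pool size of 4, and the sum-of-squares task are all assumptions chosen for brevity. A thread pool stands in for the worker nodes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; names and task are assumptions, not the README's code.
class SplitMapReduceSketch {

  // "Master": splits the input into chunks, hands each chunk to a worker,
  // then combines the partial results. chunkSize must be positive.
  static long sumOfSquares(List<Integer> numbers, int chunkSize) {
    ExecutorService workers = Executors.newFixedThreadPool(4);
    try {
      List<Future<Long>> partials = new ArrayList<>();
      for (int start = 0; start < numbers.size(); start += chunkSize) {
        List<Integer> chunk =
            numbers.subList(start, Math.min(start + chunkSize, numbers.size()));
        // Map step: each worker solves its sub-problem independently.
        partials.add(workers.submit(() -> {
          long sum = 0;
          for (int n : chunk) {
            sum += (long) n * n;
          }
          return sum;
        }));
      }
      // Reduce step: the master combines the partial answers.
      long total = 0;
      for (Future<Long> partial : partials) {
        total += partial.get();
      }
      return total;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while waiting for workers", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("A worker task failed", e);
    } finally {
      workers.shutdown();
    }
  }
}
```

Calling `SplitMapReduceSketch.sumOfSquares(List.of(1, 2, 3, 4, 5), 2)` splits the input into three chunks and returns 55. Because the chunks share no state, the map step needs no locking; only the final combination runs on the calling thread.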
3744

3845
## Programmatic Example of Map Reduce in Java
3946

4047
### 1. Map Phase (Splitting & Processing Data)
4148

42-
* The Mapper takes an input string, splits it into words, and counts occurrences.
43-
* Output: A map {word → count} for each input line.
49+
* Each input string is split into words, normalized, and counted.
50+
* Output: A map `{word → count}` for each input string.
51+
4452
#### `Mapper.java`
53+
4554
```java
4655
public class Mapper {
4756
public static Map<String, Integer> map(String input) {
@@ -57,13 +66,17 @@ public class Mapper {
5766
}
5867
}
5968
```
69+
6070
Example Input: ```"Hello world hello"```
6171
Output: ```{hello=2, world=1}```
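
The diff hunk above elides the body of `Mapper.map`. As a rough sketch of what a mapper with this signature could look like, the block below uses a separate, hypothetical class `WordCountMapperSketch`. The normalization rules (lowercasing with `Locale.ROOT`, splitting on runs of non-letter characters) are assumptions chosen to reproduce the example output, not the README's confirmed implementation.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the elided Mapper body; normalization rules are assumed.
class WordCountMapperSketch {

  static Map<String, Integer> map(String input) {
    Map<String, Integer> counts = new HashMap<>();
    // Lowercase the input, then split on anything that is not a letter a-z.
    for (String word : input.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
      // split() can yield an empty token when the input starts with a delimiter.
      if (!word.isEmpty()) {
        counts.merge(word, 1, Integer::sum);
      }
    }
    return counts;
  }
}
```

With this sketch, `WordCountMapperSketch.map("Hello world hello")` yields a map equal to `{hello=2, world=1}`. The iteration order of a `HashMap` is unspecified, so printed output may list the entries in a different order.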
6272

63-
### 2. Shuffle Phase (Grouping Data by Key)
73+
### 2. Shuffle Phase – Grouping Words Across Inputs
74+
75+
* Takes results from all mappers and groups values by word.
76+
* Output: A map `{word → list of counts}`.
6477

65-
* The Shuffler collects key-value pairs from multiple mappers and groups values by key.
6678
#### `Shuffler.java`
79+
6780
```java
6881
public class Shuffler {
6982
public static Map<String, List<Integer>> shuffleAndSort(List<Map<String, Integer>> mapped) {
@@ -78,14 +91,18 @@ public class Shuffler {
7891
}
7992
}
8093
```
94+
8195
Example Input:
96+
8297
```
8398
[
8499
{"hello": 2, "world": 1},
85100
{"hello": 1, "java": 1}
86101
]
87102
```
103+
88104
Output:
105+
89106
```
90107
{
91108
"hello": [2, 1],
@@ -94,10 +111,13 @@ Output:
94111
}
95112
```
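
The grouping step shown in the example input and output can be sketched as follows. The class `ShufflerSketch` and method `shuffle` are hypothetical names; the README's own `shuffleAndSort` body is elided in this diff, so whether it also sorts keys is unknown. This sketch groups only.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the shuffle phase; groups counts by word without sorting.
class ShufflerSketch {

  static Map<String, List<Integer>> shuffle(List<Map<String, Integer>> mapped) {
    Map<String, List<Integer>> grouped = new HashMap<>();
    // Visit mapper outputs in list order so each word's counts keep that order.
    for (Map<String, Integer> partial : mapped) {
      partial.forEach((word, count) ->
          grouped.computeIfAbsent(word, key -> new ArrayList<>()).add(count));
    }
    return grouped;
  }
}
```

Feeding it the two example maps produces `hello → [2, 1]`, `world → [1]`, and `java → [1]`. The reducer can then treat each word independently of the others.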
96113

97-
### 3. Reduce Phase (Aggregating Results)
114+
### 3. Reduce Phase – Aggregating Counts
115+
116+
* Sums the list of counts for each word.
117+
* Output: A sorted list of word counts in descending order.
98118

99-
* The Reducer sums up occurrences of each word.
100119
#### `Reducer.java`
120+
101121
```java
102122
public class Reducer {
103123
public static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
@@ -112,15 +132,19 @@ public class Reducer {
112132
}
113133
}
114134
```
135+
115136
Example Input:
137+
116138
```
117139
{
118140
"hello": [2, 1],
119141
"world": [1],
120142
"java": [1]
121143
}
122144
```
145+
123146
Output:
147+
124148
```
125149
[
126150
{"hello": 3},
@@ -129,10 +153,12 @@ Output:
129153
]
130154
```
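
A possible shape for the summing and sorting described above, again as a hypothetical class (`ReducerSketch`) rather than the README's elided `Reducer` body. Breaking ties alphabetically is an assumption added here to make the order deterministic; the source states only that results are sorted in descending order. `Map.entry` requires Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the reduce phase; the tie-breaking rule is an assumption.
class ReducerSketch {

  static List<Map.Entry<String, Integer>> reduce(Map<String, List<Integer>> grouped) {
    List<Map.Entry<String, Integer>> totals = new ArrayList<>();
    for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
      int sum = 0;
      for (int count : entry.getValue()) {
        sum += count;
      }
      totals.add(Map.entry(entry.getKey(), sum));
    }
    // Highest count first; equal counts are ordered alphabetically by word.
    totals.sort(Map.Entry.<String, Integer>comparingByValue().reversed()
        .thenComparing(Map.Entry.<String, Integer>comparingByKey()));
    return totals;
  }
}
```

For the example input, this sketch returns `hello=3` first, followed by `java=1` and `world=1` in alphabetical order.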
131155

132-
### 4. Running the Full MapReduce Process
156+
### 4. MapReduce Coordinator – Running the Whole Pipeline
157+
158+
* Coordinates map, shuffle, and reduce phases.
133159

134-
* The MapReduce class coordinates the three steps.
135160
#### `MapReduce.java`
161+
136162
```java
137163
public class MapReduce {
138164
public static List<Map.Entry<String, Integer>> mapReduce(List<String> inputs) {
@@ -148,10 +174,12 @@ public class MapReduce {
148174
}
149175
```
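
For comparison, the same three phases can be collapsed into a single Java Streams pipeline. This is an alternative sketch under assumptions, not the README's `MapReduce` class: the name `StreamWordCountSketch`, the normalization rules, and the alphabetical tie-break are all hypothetical. It relies on `Collectors.groupingBy` with a downstream `summingInt` to group and sum in one step.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical streams-based variant of the word-count pipeline.
class StreamWordCountSketch {

  static List<Map.Entry<String, Integer>> wordCount(List<String> inputs) {
    Map<String, Integer> totals = inputs.parallelStream()
        // Map: split each input into lowercase words.
        .flatMap(line -> Arrays.stream(line.toLowerCase(Locale.ROOT).split("[^a-z]+")))
        .filter(word -> !word.isEmpty())
        // Shuffle and reduce: group identical words and add 1 per occurrence.
        .collect(Collectors.groupingBy(word -> word, Collectors.summingInt(word -> 1)));

    // Sort by count descending, then alphabetically for a deterministic order.
    return totals.entrySet().stream()
        .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
            .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
        .collect(Collectors.toList());
  }
}
```

The explicit `Mapper`, `Shuffler`, and `Reducer` classes make each phase visible and separately testable, which suits a teaching example. The stream form is shorter but hides the shuffle inside the collector.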
150176

151-
### 4. Main Execution (Calling MapReduce)
177+
### 5. Main Execution – Example Usage
178+
179+
* Runs the MapReduce process and prints results.
152180

153-
* The Main class executes the MapReduce pipeline and prints the final word count.
154181
#### `Main.java`
182+
155183
```java
156184
public static void main(String[] args) {
157185
List<String> inputs = Arrays.asList(
@@ -168,6 +196,7 @@ public class MapReduce {
168196
```
169197

170198
Output:
199+
171200
```
172201
hello: 4
173202
world: 2
@@ -183,10 +212,11 @@ fun: 1
183212
## When to Use the Map Reduce Pattern in Java
184213

185214
Use MapReduce when:
186-
* Processing large datasets that don't fit into a single machine's memory
187-
* Performing computations that can be parallelized
188-
* Dealing with fault-tolerant and distributed computing scenarios
189-
* Analyzing log files, web crawl data, or scientific data
215+
216+
* Processing large datasets that can be broken into independent chunks.
217+
* Data operations divide naturally into map (transformation) and reduce (aggregation) phases.
218+
* Horizontal scalability and parallelization are essential, especially in distributed or big data environments.
219+
* Building on Java-based distributed computing platforms such as Hadoop or Spark.
190220

191221
## Map Reduce Pattern Java Tutorials
192222

@@ -197,32 +227,39 @@ Use MapReduce when:
197227

198228
Benefits:
199229

200-
* Scalability: Can process vast amounts of data across multiple machines
201-
* Fault-tolerance: Handles machine failures gracefully
202-
* Simplicity: Abstracts complex distributed computing details
230+
* Enables massive scalability by distributing processing across nodes.
231+
* Encourages a functional style, promoting immutability and stateless operations.
232+
* Simplifies complex data workflows by separating transformation (map) from aggregation (reduce).
233+
* Provides fault tolerance, because isolated tasks can be retried independently after a failure.
203234

204235
Trade-offs:
205236

206-
* Overhead: Not efficient for small datasets due to setup and coordination costs
207-
* Limited flexibility: Not suitable for all types of computations or algorithms
208-
* Latency: Batch-oriented nature may not be suitable for real-time processing needs
237+
* Requires a suitable problem structure — not all tasks fit the map/reduce paradigm.
238+
* Data shuffling between map and reduce phases can be performance-intensive.
239+
* Higher complexity in debugging and optimizing distributed jobs.
240+
* Intermediate I/O can become a bottleneck in large-scale operations.
209241

210242
## Real-World Applications of Map Reduce Pattern in Java
211243

212-
* Google's original implementation for indexing web pages
213-
* Hadoop MapReduce for big data processing
214-
* Log analysis in large-scale systems
215-
* Genomic sequence analysis in bioinformatics
244+
* Hadoop MapReduce: Java-based framework for distributed data processing using MapReduce.
245+
* Apache Spark: Utilizes similar map and reduce transformations in its RDD and Dataset APIs.
246+
* Elasticsearch: Uses MapReduce-style aggregation pipelines for querying distributed data.
247+
* Google Bigtable: Distributed storage system that can serve as both an input source and an output target for MapReduce jobs.
248+
* MongoDB Aggregation Framework: Conceptually applies MapReduce in its data pipelines.
216249

217250
## Related Java Design Patterns
218251

219-
* Chaining Pattern
220-
* Master-Worker Pattern
221-
* Pipeline Pattern
252+
* [Master-Worker](https://java-design-patterns.com/patterns/master-worker/): Similar distribution of tasks among workers, with a master coordinating job execution.
253+
* [Pipeline](https://java-design-patterns.com/patterns/pipeline/): Can be used to chain multiple MapReduce operations into staged transformations.
254+
* [Iterator](https://java-design-patterns.com/patterns/iterator/): Often used under the hood to process input streams lazily in map and reduce steps.
222255

223256
## References and Credits
224257

225-
* [What is MapReduce](https://www.ibm.com/think/topics/mapreduce)
226-
* [Wy MapReduce is not dead](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)
227-
* [Scalabe Distributed Data Processing Solutions](https://tcpp.cs.gsu.edu/curriculum/?q=system%2Ffiles%2Fch07.pdf)
258+
* [Big Data: Principles and Paradigms](https://amzn.to/3RJIGPZ)
259+
* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://amzn.to/3E6VhtD)
260+
* [Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale](https://amzn.to/4ij2y7F)
261+
* [Java 8 in Action: Lambdas, Streams, and functional-style programming](https://amzn.to/3QCmGXs)
228262
* [Java Design Patterns: A Hands-On Experience with Real-World Examples](https://amzn.to/3HWNf4U)
263+
* [Programming Pig: Dataflow Scripting with Hadoop](https://amzn.to/4cAU36K)
264+
* [What is MapReduce (IBM)](https://www.ibm.com/think/topics/mapreduce)
265+
* [Why MapReduce is not dead (Codemotion)](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)