map-reduce/README.md
+77-40
@@ -2,20 +2,26 @@
2
2
title: "MapReduce Pattern in Java"
3
3
shortTitle: MapReduce
4
4
description: "Learn the MapReduce pattern in Java with real-world examples, class diagrams, and tutorials. Understand its intent, applicability, benefits, and known uses to enhance your design pattern knowledge."
5
-
category: Structural
5
+
category: Functional
6
6
language: en
7
7
tag:
8
-
- Delegation
8
+
- Concurrency
9
+
- Data processing
10
+
- Data transformation
11
+
- Functional decomposition
12
+
- Immutable
13
+
- Multithreading
14
+
- Scalability
9
15
---
10
16
11
17
## Also known as
12
18
13
-
* Split-Apply-Combine Strategy
14
-
* Scatter-Gather Pattern
19
+
* Map-Reduce
20
+
* Divide and Conquer for Data Processing
15
21
16
22
## Intent of Map Reduce Design Pattern
17
23
18
-
MapReduce aims to process and generate large datasets with a parallel, distributed algorithm on a cluster. It divides the workload into two main phases: Map and Reduce, allowing for efficient parallel processing of data.
24
+
To efficiently process large-scale datasets by dividing computation into two phases: map and reduce, which can be executed in parallel and distributed across multiple nodes.
19
25
20
26
## Detailed Explanation of Map Reduce Pattern with Real-World Examples
21
27
@@ -29,19 +35,22 @@ In plain words
29
35
30
36
Wikipedia says
31
37
32
-
> "MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster".
33
-
MapReduce consists of two main steps:
34
-
The "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
35
-
The "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.
36
-
This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
38
+
> MapReduce is a programming model and associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. MapReduce consists of two main steps:
39
+
The Map step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. The Reduce step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. This approach allows for efficient processing of vast amounts of data across multiple machines, making it a fundamental technique in big data analytics and distributed computing.
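The two steps described above can be sketched on a single JVM as a word count. This is only a minimal illustration, not the example from this README: the class name `WordCountSketch` and the methods `map` and `mapReduce` are hypothetical, and a parallel stream stands in for the worker nodes of a real cluster.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-JVM sketch; a real MapReduce job would run on a cluster.
public class WordCountSketch {

  // Map step: split one input line into lowercase words.
  static List<String> map(String line) {
    return Arrays.stream(line.toLowerCase().split("\\W+"))
        .filter(word -> !word.isEmpty())
        .collect(Collectors.toList());
  }

  // Group identical words (the "shuffle"), then reduce each group to a count.
  static Map<String, Long> mapReduce(List<String> lines) {
    return lines.parallelStream()
        .flatMap(line -> map(line).stream())
        .collect(Collectors.groupingBy(
            word -> word,            // key: the word itself
            TreeMap::new,            // sorted map for stable output
            Collectors.counting())); // reduce: count occurrences per key
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("Hello world", "hello MapReduce", "world of data");
    System.out.println(mapReduce(lines));
    // prints {data=1, hello=2, mapreduce=1, of=1, world=2}
  }
}
```

Because `map` keeps no shared state, the lines can be processed in any order or in parallel. The grouping collector plays the role of the shuffle and reduce phases.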
* Enables massive scalability by distributing processing across nodes.
231
+
* Encourages a functional style, promoting immutability and stateless operations.
232
+
* Simplifies complex data workflows by separating transformation (map) from aggregation (reduce).
233
+
* Fault-tolerant due to isolated, recoverable processing tasks.
203
234
204
235
Trade-offs:
205
236
206
-
* Overhead: Not efficient for small datasets due to setup and coordination costs
207
-
* Limited flexibility: Not suitable for all types of computations or algorithms
208
-
* Latency: Batch-oriented nature may not be suitable for real-time processing needs
237
+
* Requires a suitable problem structure — not all tasks fit the map/reduce paradigm.
238
+
* Data shuffling between map and reduce phases can be performance-intensive.
239
+
* Higher complexity in debugging and optimizing distributed jobs.
240
+
* Intermediate I/O can become a bottleneck in large-scale operations.
209
241
210
242
## Real-World Applications of Map Reduce Pattern in Java
211
243
212
-
* Google's original implementation for indexing web pages
213
-
* Hadoop MapReduce for big data processing
214
-
* Log analysis in large-scale systems
215
-
* Genomic sequence analysis in bioinformatics
244
+
* Hadoop MapReduce: Java-based framework for distributed data processing using MapReduce.
245
+
* Apache Spark: Utilizes similar map and reduce transformations in its RDD and Dataset APIs.
246
+
* Elasticsearch: Uses MapReduce-style aggregation pipelines for querying distributed data.
247
+
* Google Bigtable: Underlying storage engine influenced by MapReduce principles.
248
+
* MongoDB Aggregation Framework: Conceptually applies MapReduce in its data pipelines.
216
249
217
250
## Related Java Design Patterns
218
251
219
-
* Chaining Pattern
220
-
* Master-Worker Pattern
221
-
* Pipeline Pattern
252
+
* [Master-Worker](https://java-design-patterns.com/patterns/master-worker/): Similar distribution of tasks among workers, with a master coordinating job execution.
253
+
* [Pipeline](https://java-design-patterns.com/patterns/pipeline/): Can be used to chain multiple MapReduce operations into staged transformations.
254
+
* [Iterator](https://java-design-patterns.com/patterns/iterator/): Often used under the hood to process input streams lazily in map and reduce steps.
222
255
223
256
## References and Credits
224
257
225
-
* [What is MapReduce](https://www.ibm.com/think/topics/mapreduce)
226
-
* [Wy MapReduce is not dead](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)
227
-
* [Scalabe Distributed Data Processing Solutions](https://tcpp.cs.gsu.edu/curriculum/?q=system%2Ffiles%2Fch07.pdf)
258
+
* [Big Data: Principles and Paradigms](https://amzn.to/3RJIGPZ)
259
+
* [Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems](https://amzn.to/3E6VhtD)
260
+
* [Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale](https://amzn.to/4ij2y7F)
261
+
* [Java 8 in Action: Lambdas, Streams, and functional-style programming](https://amzn.to/3QCmGXs)
228
262
* [Java Design Patterns: A Hands-On Experience with Real-World Examples](https://amzn.to/3HWNf4U)
263
+
* [Programming Pig: Dataflow Scripting with Hadoop](https://amzn.to/4cAU36K)
264
+
* [What is MapReduce (IBM)](https://www.ibm.com/think/topics/mapreduce)
265
+
* [Why MapReduce is not dead (Codemotion)](https://www.codemotion.com/magazine/ai-ml/big-data/mapreduce-not-dead-heres-why-its-still-ruling-in-the-cloud/)