@@ -40,301 +40,4 @@ Issue tracking: https://jira.mongodb.org/browse/HADOOP/
Discussion: http://groups.google.com/group/mongodb-user/
- ## Building the Adapter
-
- The Mongo-Hadoop adapter uses the
- [SBT Build Tool](https://github.com/harrah/xsbt) for
- compilation. SBT provides superior support for discrete configurations
- targeting multiple Hadoop versions. The distribution includes a
- self-bootstrapping copy of SBT as `sbt`. Build the jar files with the
- following command:
-
-     ./sbt package
-
- The MongoDB Hadoop Adapter supports a number of Hadoop releases. You
- can change the Hadoop version targeted by the build by modifying the
- value of `hadoopRelease` in the `build.sbt` file. For instance, setting
- this value to:
-
-     hadoopRelease in ThisBuild := "cdh3"
-
- configures a build against Cloudera CDH3u3, while:
-
-     hadoopRelease in ThisBuild := "0.21"
-
- configures a build against Hadoop 0.21 from the mainline Apache distribution.
-
- Unfortunately, we are not aware of any Maven repositories that contain
- artifacts for Hadoop 0.21 at present. You may need to resolve these
- dependencies by hand if you choose to build using this
- configuration. We also publish releases to the central Maven
- repository with artifacts named after the targeted release. Our
- "default" build has no release name attached to the artifact and
- supports Hadoop 1.0.
-
- After building, you will need to place the "core" jar and the
- "mongo-java-driver" jar in the `lib` directory of each Hadoop server.
-
- The MongoDB-Hadoop Adapter supports the following releases. Valid keys
- for configuration and Maven artifacts appear below each release.
-
- ### Cloudera Release 3
-
- This derives from Apache Hadoop 0.20.2 but includes many custom
- patches, among them binary streaming and Pig 0.8.1. This
- target compiles *ALL* modules, including Streaming.
-
- - cdh3
- - Maven artifact: "org.mongodb" / "mongo-hadoop_cdh3u3"
-
- ### Apache Hadoop 0.20.205.0
-
- This includes Pig 0.9.2 and does *NOT* support Hadoop Streaming.
-
- - 0.20
- - 0.20.x
- - Maven artifact: "org.mongodb" / "mongo-hadoop_0.20.205.0"
-
- ### Apache Hadoop 1.0.0
-
- This includes Pig 0.9.1 and does *NOT* support Hadoop Streaming.
-
- - 1.0
- - 1.0.x
- - Maven artifact: "org.mongodb" / "mongo-hadoop_1.0.0"
-
- ### Apache Hadoop 0.21.0
-
- This includes Pig 0.9.1 and Hadoop Streaming.
-
- - 0.21
- - 0.21.x
-
- This build is **not** published to Maven because its upstream
- dependencies are not available there.
-
- ### Apache Hadoop 0.23
-
- Support is *forthcoming*.
-
- This is an alpha branch with ongoing work by
- [Hortonworks](http://hortonworks.com). Apache Hadoop 0.23 is "newer"
- than Apache Hadoop 1.0.
-
- The MongoDB Hadoop Adapter currently supports the following features.
-
- ## Hadoop MapReduce
-
- Provides working *Input* and *Output* adapters for MongoDB. You may
- configure these adapters with XML or programmatically. See the
- WordCount examples for demonstrations of both approaches. You can
- specify a query, fields and sort specs in the XML config as JSON or
- programmatically as a DBObject.
-
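- For illustration, a Hadoop-style configuration snippet along these lines
- could express a query, field projection and sort spec as JSON (the
- `mongo.input.query` key is described below; `mongo.input.fields` and
- `mongo.input.sort` are assumed property names used here only as a sketch):
-
-     <configuration>
-       <!-- JSON query: only process documents where x exists -->
-       <property>
-         <name>mongo.input.query</name>
-         <value>{"x": {"$exists": true}}</value>
-       </property>
-       <!-- Assumed key: project only the field(s) the job needs -->
-       <property>
-         <name>mongo.input.fields</name>
-         <value>{"x": 1}</value>
-       </property>
-       <!-- Assumed key: sort spec, here ascending by _id -->
-       <property>
-         <name>mongo.input.sort</name>
-         <value>{"_id": 1}</value>
-       </property>
-     </configuration>
-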
- ### Splitting up MongoDB Source Data for the InputFormat
-
- The MongoDB Hadoop Adapter makes it possible to create multiple
- *InputSplits* on source data originating from MongoDB to
- optimize/parallelize input processing for Mappers.
-
- If '*mongo.input.split.create_input_splits*' is **false** (it defaults
- to **true**) then MongoHadoop will use **no** splits. Hadoop will
- treat the entire collection as a single, giant, *Input*. This is
- primarily intended for debugging purposes.
-
- When it is true, as by default, the following behaviors are possible
- (a sample configuration follows this list):
-
- 1. For unsharded source collections, MongoHadoop follows the
-    "unsharded split" path. (See below.)
-
- 2. For sharded source collections:
-
-     * If '*mongo.input.split.read_shard_chunks*' is **true**
-       (it defaults to **true**) then we pull the chunk specs from the
-       config server and turn each shard chunk into an *Input
-       Split*. Basically, this means the MongoDB sharding system does
-       99% of the preconfig work for us and is a good thing™.
-
-     * If '*mongo.input.split.read_shard_chunks*' is **false** and
-       '*mongo.input.split.read_from_shards*' is **true** (it defaults
-       to **false**) then we connect to the `mongod` or replica set
-       for each shard individually and each shard becomes an input
-       split. The entire content of the collection on the shard is one
-       split. Only use this configuration in rare situations.
-
-     * If '*mongo.input.split.read_shard_chunks*' is **true** and
-       '*mongo.input.split.read_from_shards*' is **true** (it defaults
-       to **false**) MongoHadoop reads the chunk boundaries from
-       the config server but then reads data directly from the shards
-       without using the `mongos`. While this may seem like a good
-       idea, it can cause erratic behavior if MongoDB balances chunks
-       during a Hadoop job. This is not a recommended configuration
-       for write-heavy applications but may provide effective
-       parallelism in read-heavy apps.
-
-     * If both '*mongo.input.split.read_shard_chunks*' and
-       '*mongo.input.split.read_from_shards*' are **false**, MongoHadoop
-       ignores the chunk layout, reads everything through `mongos`, and
-       follows the "unsharded split" path.
-
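- As a sketch only (assuming the standard Hadoop XML property format), the
- default, recommended behavior for a sharded source collection could be
- spelled out explicitly like this:
-
-     <configuration>
-       <!-- Create InputSplits from the source data (default: true) -->
-       <property>
-         <name>mongo.input.split.create_input_splits</name>
-         <value>true</value>
-       </property>
-       <!-- Turn each shard chunk into an InputSplit (default: true) -->
-       <property>
-         <name>mongo.input.split.read_shard_chunks</name>
-         <value>true</value>
-       </property>
-       <!-- Keep reading through mongos rather than individual shards (default: false) -->
-       <property>
-         <name>mongo.input.split.read_from_shards</name>
-         <value>false</value>
-       </property>
-     </configuration>
-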
181
- ### "Unsharded Splits"
182
-
183
- "Unsharded Splits" refers to the method that MongoHadoop uses to
184
- calculate new splits. You may use "Unsharded splits" with sharded
185
- MongoDB options.
186
-
187
- This is only used:
188
-
189
- - for unsharded collections when
190
- '* mongo.input.split.create_input_splits* ' is ** true** .
191
-
192
- - for sharded collections when
193
- '* mongo.input.split.create_input_splits* ' is ** true** * and*
194
- '* mongo.input.split.read_shard_chunks* ' is ** false** .
195
-
196
- In these cases, MongoHadoop generates multiple InputSplits. Users
197
- have control over two factors in this system.
198
-
199
- * * mongo.input.split_size* - Controls the maximum number of megabytes
200
- of each split. The current default is 8, based on assumptions
201
- prior experience with Hadoop. MongoDB's default of 64 megabytes
202
- may be a bit too large for most deployments.
203
-
204
- * * mongo.input.split.split_key_pattern* - Is a MongoDB key pattern
205
- that follows [ the same rules as shard key selection] ( http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-ShardKeys ) .
206
- This key pattern has some requirements, (i.e. must have an index,
207
- unique, and present in all documents.) MongoHadoop uses this key to
208
- determine split points. It defaults to ` { _id: 1 } ` but you may find
209
- that it's more ideal to optimize the mapper distribution by
210
- configuring this value.
211
-
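- For illustration (again assuming the standard Hadoop XML property
- format), both knobs might be tuned like this; the 16 MB split size and
- the `day` field are arbitrary values chosen for the sketch:
-
-     <configuration>
-       <!-- Cap each generated split at roughly 16 MB -->
-       <property>
-         <name>mongo.input.split_size</name>
-         <value>16</value>
-       </property>
-       <!-- Hypothetical field; it must satisfy the requirements listed above -->
-       <property>
-         <name>mongo.input.split.split_key_pattern</name>
-         <value>{ "day": 1 }</value>
-       </property>
-     </configuration>
-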
- For all three paths, you may specify a custom query filter for the
- input data. *mongo.input.query* is a JSON document containing a
- MongoDB query. It is combined correctly with the index filtering
- on input splits, allowing you to MapReduce a subset of your data
- while still getting efficient splitting.
-
- ### Pig
-
- MongoHadoop includes MongoStorage and MongoLoader modules for Pig.
- Examples of loading from and storing to MongoDB with Pig can be found
- in `examples/pigtutorial`.
-
- ## Examples
-
- ### WordCount
-
- There are two example WordCount processes for Hadoop MapReduce in `examples/wordcount`.
- Both read strings from MongoDB and save word frequency counts.
-
- These examples read documents in the `test` database, stored in the
- collection named `in`. They count the frequency of words in the field
- `x`.
-
- The examples save their results in the `test` database, collection `out`.
-
- `WordCount.java` is a programmatically configured MapReduce job, where
- all of the configuration params are set up in the Java code. You can
- run this example with the ant task `wordcount`.
-
- `WordCountXMLConfig.java` configures the MapReduce job using only XML
- files, with JSON for queries. See
- `examples/wordcount/src/main/resources/mongo-wordcount.xml` for the
- example configuration. You can run this example with the ant task
- `wordcountXML`, or with a Hadoop command that resembles the following:
-
-     hadoop jar core/target/mongo-hadoop-core-1.0.0-rc0.jar com.mongodb.hadoop.examples.WordCountXMLConfig -conf examples/wordcount/src/main/resources/mongo-wordcount.xml
-
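- As a rough sketch of what such a file might minimally contain (the
- `mongo.input.uri` and `mongo.output.uri` property names are assumptions,
- not taken from this document; the values point at the `test.in` and
- `test.out` collections described above):
-
-     <configuration>
-       <!-- Where the job reads its input documents from -->
-       <property>
-         <name>mongo.input.uri</name>
-         <value>mongodb://localhost:27017/test.in</value>
-       </property>
-       <!-- Where the job writes its word-frequency output -->
-       <property>
-         <name>mongo.output.uri</name>
-         <value>mongodb://localhost:27017/test.out</value>
-       </property>
-     </configuration>
-
- See `mongo-wordcount.xml` in the example resources for the complete,
- authoritative configuration.
-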
- You will need to copy the `mongo-java-driver.jar` file into your
- Hadoop `lib` directory before this will work.
-
- ### Treasury Yield
-
- The treasury yield example demonstrates working with a more complex
- input BSON document and calculating an average.
-
- It uses a database of daily US Treasury bid curves from 1990 to
- September 2010 and calculates annual averages from them.
-
- The JSON file `examples/treasury_yield/src/main/resources/yield_historical_in.json`
- should be imported into the `yield_historical.in` collection in
- the `demo` db.
-
- You may import the sample data into the `mongos` host by issuing the
- following command:
-
-     mongoimport --db demo --collection yield_historical.in --type json --file examples/treasury_yield/src/main/resources/yield_historical_in.json
-
- This command assumes that `mongos` is running on the localhost
- interface on port `27017`. You'll need to place the mongo-hadoop and
- mongo-java-driver jars in your Hadoop installation's `lib`
- directory. After importing the data, run the test with the following
- command on the Hadoop master:
-
-     hadoop jar core/target/mongo-hadoop-core-1.0.0-rc0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml
-
- To confirm the test ran successfully, look at the `demo` database and
- query the `yield_historical.out` collection.
-
- ### Pig
-
- The MongoHadoop distribution includes a modified version of the Pig
- Tutorial from the Pig distribution for testing.
-
- This script differs from the Pig tutorial in that it loads the data
- from MongoDB and saves the results to MongoDB.
-
- The use of Pig assumes you have Hadoop and Pig installed and
- configured on your system.
-
- To populate MongoDB with the relevant data for the example,
- configure the script `examples/pigtutorial/populateMongo.pig` to
- include the connection URI to your MongoDB, then run:
-
-     pig -x local examples/pigtutorial/populateMongo.pig
-
- Next, configure the script `examples/pigtutorial/test.pig` to include
- the connection URI to your MongoDB. Make sure you've built
- using `ant jar`, then run:
-
-     pig -x local examples/pigtutorial/test.pig
-
- You should find the data and the results in your MongoDB.
-
- NOTE: Make sure the version artifacts on the built jars match those
- referenced in the script.
-
- ## KNOWN ISSUES
-
- ### Open Issues
-
- * You cannot configure bare regexes (e.g. /^foo/) in the config XML, as
-   they won't parse. Use {"$regex": "^foo", "$options": ""}
-   instead, and make sure to omit the slashes (see the sample
-   configuration after this list).
-
- * [HADOOP-19 - MongoStorage fails when tuples w/i bags are not named](https://jira.mongodb.org/browse/HADOOP-19)
-
-   This is due to an open Apache bug, [PIG-2509](https://issues.apache.org/jira/browse/PIG-2509).
-
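- For example, a query property using that operator form might look like
- the following sketch (standard Hadoop XML property format assumed;
- `mongo.input.query` is the query key described earlier, and the field
- name is purely illustrative):
-
-     <configuration>
-       <!-- Equivalent of the bare regex /^foo/, written with $regex -->
-       <property>
-         <name>mongo.input.query</name>
-         <value>{"x": {"$regex": "^foo", "$options": ""}}</value>
-       </property>
-     </configuration>
-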
- ### Streaming
-
- Streaming support in MongoHadoop **requires** that the Hadoop
- distribution include the patches for the following issues:
-
- * [HADOOP-1722 - Make streaming to handle non-utf8 byte array](https://issues.apache.org/jira/browse/HADOOP-1722)
- * [HADOOP-5450 - Add support for application-specific typecodes to typed bytes](https://issues.apache.org/jira/browse/HADOOP-5450)
- * [MAPREDUCE-764 - TypedBytesInput's readRaw() does not preserve custom type codes](https://issues.apache.org/jira/browse/MAPREDUCE-764)
-
- The mainline Apache Hadoop distribution merged these patches for the
- 0.21.0 release. We have also verified that the
- [Cloudera](http://cloudera.com) distribution (while still based on
- 0.20.x) includes these patches in CDH3 Update 1 and later (we now
- build against Update 3); anecdotal evidence (which needs confirmation)
- indicates they may have been there since CDH2, and likely exist in
- base CDH3 as well.
-
- By default, the Mongo-Hadoop project builds against Apache 0.20.203,
- which does *not* include these patches. To build or enable Streaming
- support you must build against either Cloudera CDH3u1 or Hadoop 0.21.0.
-
- Additionally, note that Hadoop 1.0 is based on the 0.20 release. As
- such, it *does not include* the patches necessary for Streaming. This
- is frustrating and upsetting but unfortunately out of our hands. We
- are working to get these patches backported into a future release or
- to find another workaround.
+ Documentation and Build Details: http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html