
Commit a73fb53

Update master
1 parent ff31c3d commit a73fb53

1 file changed: README.md (+1 -298)
@@ -40,301 +40,4 @@ Issue tracking: https://jira.mongodb.org/browse/HADOOP/

Discussion: http://groups.google.com/group/mongodb-user/

## Building the Adapter

The Mongo-Hadoop adapter uses the [SBT Build Tool](https://github.com/harrah/xsbt) for compilation. SBT provides superior support for discrete configurations targeting multiple Hadoop versions. The distribution includes a self-bootstrapping copy of SBT as `sbt`. Build the jars with the following command:

    ./sbt package

The MongoDB Hadoop Adapter supports a number of Hadoop releases. You can change the Hadoop version supported by the build by modifying the value of `hadoopRelease` in the `build.sbt` file. For instance, setting this value to:

    hadoopRelease in ThisBuild := "cdh3"

configures a build against Cloudera CDH3u3, while:

    hadoopRelease in ThisBuild := "0.21"

configures a build against Hadoop 0.21 from the mainline Apache distribution.

Unfortunately, we are not aware of any Maven repositories that contain artifacts for Hadoop 0.21 at present. You may need to resolve these dependencies by hand if you choose to build using this configuration. We also publish releases to the central Maven repository, with the artifact name carrying the targeted Hadoop release (see the list below). Our "default" build has no release name attached and supports Hadoop 1.0.

After building, you will need to place the "core" jar and the "mongo-java-driver" jar in the `lib` directory of each Hadoop server.

The MongoDB-Hadoop Adapter supports the following releases. Valid keys for configuration and Maven artifacts appear below each release.

### Cloudera Release 3

This derives from Apache Hadoop 0.20.2 but includes many custom patches, among them binary streaming support and Pig 0.8.1. This target compiles *ALL* modules, including Streaming.

- cdh3
- Maven artifact: "org.mongodb" / "mongo-hadoop_cdh3u3"

### Apache Hadoop 0.20.205.0

This includes Pig 0.9.2 and does *NOT* support Hadoop Streaming.

- 0.20
- 0.20.x
- Maven artifact: "org.mongodb" / "mongo-hadoop_0.20.205.0"

### Apache Hadoop 1.0.0

This includes Pig 0.9.1 and does *NOT* support Hadoop Streaming.

- 1.0
- 1.0.x
- Maven artifact: "org.mongodb" / "mongo-hadoop_1.0.0"

### Apache Hadoop 0.21.0

This includes Pig 0.9.1 and Hadoop Streaming.

- 0.21
- 0.21.x

This build is **not** published to Maven because its upstream dependencies are unavailable there.

### Apache Hadoop 0.23

Support is *forthcoming*.

This is an alpha branch with ongoing work by [Hortonworks](http://hortonworks.com). Apache Hadoop 0.23 is "newer" than Apache Hadoop 1.0.

The MongoDB Hadoop Adapter currently supports the following features.

## Hadoop MapReduce

Provides working *Input* and *Output* adapters for MongoDB. You may configure these adapters with XML or programmatically. See the WordCount examples for demonstrations of both approaches. You can specify a query, fields, and sort spec in the XML config as JSON, or programmatically as a DBObject.

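To illustrate the programmatic path, here is a minimal sketch that builds the query as a `DBObject` and stores it in the Hadoop `Configuration` as JSON. The `mongo.input.uri` and `mongo.output.uri` key names are assumptions made for this sketch; the shipped WordCount examples are the authoritative reference, and `mongo.input.query` is described in the splitting section below.

    import org.apache.hadoop.conf.Configuration;

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBObject;

    public class MongoJobConfigSketch {
        public static Configuration configure() {
            Configuration conf = new Configuration();

            // Input and output collection URIs. These key names are an
            // assumption here, not taken from this README.
            conf.set("mongo.input.uri", "mongodb://localhost:27017/test.in");
            conf.set("mongo.output.uri", "mongodb://localhost:27017/test.out");

            // A query built programmatically as a DBObject and serialized to
            // JSON for the configuration; fields and sort specs can be set
            // the same way.
            DBObject query = new BasicDBObject("x", new BasicDBObject("$exists", true));
            conf.set("mongo.input.query", query.toString());

            return conf;
        }
    }
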
### Splitting up MongoDB Source Data for the InputFormat

The MongoDB Hadoop Adapter makes it possible to create multiple *InputSplits* on source data originating from MongoDB to optimize/parallelize input processing for Mappers.

If '*mongo.input.split.create_input_splits*' is **false** (it defaults to **true**) then MongoHadoop will use **no** splits. Hadoop will treat the entire collection as a single, giant, *Input*. This is primarily intended for debugging purposes.

When true, as by default, the following behaviors are possible (a configuration sketch follows the list):

1. For unsharded source collections, MongoHadoop follows the "unsharded split" path. (See below.)

2. For sharded source collections:

    * If '*mongo.input.split.read_shard_chunks*' is **true** (the default), we pull the chunk specs from the config server and turn each shard chunk into an *InputSplit*. This means the MongoDB sharding system has already done most of the partitioning work for us, which is a good thing™.

    * If '*mongo.input.split.read_shard_chunks*' is **false** and '*mongo.input.split.read_from_shards*' is **true** (it defaults to **false**), we connect to the `mongod` or replica set for each shard individually, and each shard becomes one input split holding the entire content of the collection on that shard. Only use this configuration in rare situations.

    * If '*mongo.input.split.read_shard_chunks*' is **true** and '*mongo.input.split.read_from_shards*' is **true** (it defaults to **false**), MongoHadoop reads the chunk boundaries from the config server but then reads data directly from the shards without using `mongos`. While this may seem like a good idea, it can cause erratic behavior if MongoDB balances chunks during a Hadoop job. This is not a recommended configuration for write-heavy applications but may provide effective parallelism in read-heavy apps.

    * If both '*mongo.input.split.read_shard_chunks*' and '*mongo.input.split.read_from_shards*' are **false**, MongoHadoop pretends there is no sharding and follows the "unsharded split" path, reading everything through `mongos`.

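A minimal sketch of how these toggles might be set programmatically on the job's `Configuration` (the same property names work in the XML config); the values shown simply restate the defaults described above:

    import org.apache.hadoop.conf.Configuration;

    public class SplitToggleSketch {
        public static void applySplitSettings(Configuration conf) {
            // Leave split creation on (the default); set this to false only for
            // debugging, which turns the whole collection into one giant input.
            conf.setBoolean("mongo.input.split.create_input_splits", true);

            // For sharded collections: reuse the chunk ranges recorded on the
            // config server as InputSplits (the default, and usually what you want).
            conf.setBoolean("mongo.input.split.read_shard_chunks", true);

            // Read directly from the shards instead of going through mongos.
            // Off by default; combining it with chunk reads can misbehave if the
            // balancer moves chunks during the job.
            conf.setBoolean("mongo.input.split.read_from_shards", false);
        }
    }
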
### "Unsharded Splits"
182-
183-
"Unsharded Splits" refers to the method that MongoHadoop uses to
184-
calculate new splits. You may use "Unsharded splits" with sharded
185-
MongoDB options.
186-
187-
This is only used:
188-
189-
- for unsharded collections when
190-
'*mongo.input.split.create_input_splits*' is **true**.
191-
192-
- for sharded collections when
193-
'*mongo.input.split.create_input_splits*' is **true** *and*
194-
'*mongo.input.split.read_shard_chunks*' is **false**.
195-
196-
In these cases, MongoHadoop generates multiple InputSplits. Users
197-
have control over two factors in this system.
198-
199-
* *mongo.input.split_size* - Controls the maximum number of megabytes
200-
of each split. The current default is 8, based on assumptions
201-
prior experience with Hadoop. MongoDB's default of 64 megabytes
202-
may be a bit too large for most deployments.
203-
204-
* *mongo.input.split.split_key_pattern* - Is a MongoDB key pattern
205-
that follows [the same rules as shard key selection](http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-ShardKeys).
206-
This key pattern has some requirements, (i.e. must have an index,
207-
unique, and present in all documents.) MongoHadoop uses this key to
208-
determine split points. It defaults to `{ _id: 1 }` but you may find
209-
that it's more ideal to optimize the mapper distribution by
210-
configuring this value.
211-
212-
For all three paths, you may specify a custom query filter for the
213-
input data. *mongo.input.query* represents a JSON document containing
214-
a MongoDB query. This will be properly combined with the index
215-
filtering on input splits allowing you to MapReduce a subset of your
216-
data but still get efficient splitting.
217-
218-
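Putting these knobs together, a minimal sketch of setting them on the job's `Configuration` might look like the following (the same keys can go in the XML config). The query here is given directly as a JSON string, equivalent to the DBObject form shown earlier; the specific values, including the `x`-field filter, are illustrative placeholders rather than recommendations.

    import org.apache.hadoop.conf.Configuration;

    public class UnshardedSplitTuningSketch {
        public static void tuneSplits(Configuration conf) {
            // Cap each calculated split at 8 megabytes (the current default).
            conf.set("mongo.input.split_size", "8");

            // Split on an indexed, unique key present in every document.
            // { _id: 1 } is the default; a different key pattern may spread
            // work across the mappers more evenly.
            conf.set("mongo.input.split.split_key_pattern", "{ \"_id\": 1 }");

            // Optional query filter, expressed as JSON. It is combined with the
            // split ranges so only matching documents reach the mappers.
            conf.set("mongo.input.query", "{ \"x\": { \"$exists\": true } }");
        }
    }
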
### Pig

MongoHadoop includes the MongoStorage and MongoLoader modules for Pig. Examples of loading from and storing to MongoDB with Pig can be found in `examples/pigtutorial`.

## Examples

### WordCount

There are two example WordCount processes for Hadoop MapReduce in `examples/wordcount`. Both read strings from MongoDB and save word-frequency counts.

These examples read documents in the `test` database, stored in the collection named `in`, and count the frequency of the words in the field `x`.

The examples save results in the `test` database, collection `out`.

`WordCount.java` is a programmatically configured MapReduce job, where all of the configuration params are set up in the Java code. You can run this with the ant task `wordcount`.

`WordCountXMLConfig.java` configures the MapReduce job using only XML files, with JSON for queries. See `examples/wordcount/src/main/resources/mongo-wordcount.xml` for the example configuration. You can run this with the ant task `wordcountXML`, or with a Hadoop command that resembles the following:

    hadoop jar core/target/mongo-hadoop-core-1.0.0-rc0.jar com.mongodb.hadoop.examples.WordCountXMLConfig -conf examples/wordcount/src/main/resources/mongo-wordcount.xml

You will need to copy the `mongo-java-driver.jar` file into your Hadoop `lib` directory before this will work.

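For orientation, here is a rough sketch of the kind of mapper and reducer logic the WordCount examples implement. This is not the shipped source; in particular, the assumption that each input record arrives as a `BSONObject` value from the MongoDB Java driver is made for this sketch.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.bson.BSONObject;

    public class WordCountSketch {

        // Each input record is one MongoDB document; tokenize its "x" field.
        public static class TokenizerMapper
                extends Mapper<Object, BSONObject, Text, IntWritable> {

            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, BSONObject value, Context context)
                    throws IOException, InterruptedException {
                Object field = value.get("x");
                if (field == null) {
                    return;
                }
                for (String token : field.toString().split("\\s+")) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }

        // Sum the per-word counts; the totals end up in test.out.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }
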
### Treasury Yield

The treasury yield example demonstrates working with a more complex input BSON document and calculating an average.

It uses a database of daily US Treasury bid curves from 1990 to September 2010 and calculates annual averages from them.

There is a JSON file, `examples/treasury_yield/src/main/resources/yield_historical_in.json`, which you should import into the `yield_historical.in` collection in the `demo` db.

You may import the sample data into the `mongos` host by issuing the following command:

    mongoimport --db demo --collection yield_historical.in --type json --file examples/treasury_yield/src/main/resources/yield_historical_in.json

This command assumes that `mongos` is running on the localhost interface on port `27017`. You'll need to set up the mongo-hadoop and mongo-java-driver jars in your Hadoop installation's `lib` directory. After importing the data, run the test with the following command on the Hadoop master:

    hadoop jar core/target/mongo-hadoop-core-1.0.0-rc0.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml

To confirm the test ran successfully, look at the `demo` database and query the `yield_historical.out` collection.

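The averaging itself is ordinary MapReduce arithmetic. A minimal reducer sketch (not the shipped `TreasuryYieldXMLConfig` code, and assuming for illustration that the mapper emits the calendar year as an `IntWritable` key and the day's yield as a `DoubleWritable` value) might look like this:

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Averages the per-day yields emitted for each year.
    public class YearlyAverageReducerSketch
            extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

        @Override
        protected void reduce(IntWritable year, Iterable<DoubleWritable> yields, Context context)
                throws IOException, InterruptedException {
            double sum = 0.0;
            int count = 0;
            for (DoubleWritable yield : yields) {
                sum += yield.get();
                count++;
            }
            if (count > 0) {
                context.write(year, new DoubleWritable(sum / count));
            }
        }
    }
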
### Pig

The MongoHadoop distribution includes a modified version of the Pig tutorial from the Pig distribution for testing.

This script differs from the Pig tutorial in that it loads the data from MongoDB and saves the results to MongoDB.

The use of Pig assumes you have Hadoop and Pig installed and configured on your system.

To populate MongoDB with the relevant data for the example, configure the script `examples/pigtutorial/populateMongo.pig` to include the connection URI to your MongoDB, then run:

    pig -x local examples/pigtutorial/populateMongo.pig

Next, configure the script `examples/pigtutorial/test.pig` to include the connection URI to your MongoDB. Make sure you've built using `ant jar`, then run:

    pig -x local examples/pigtutorial/test.pig

You should find the data and the results in your MongoDB.

NOTE: Make sure the versions of the built jars match those referenced in the script.

## KNOWN ISSUES

### Open Issues

* You cannot configure bare regexes (e.g. /^foo/) in the config XML, as they won't parse. Use {"$regex": "^foo", "$options": ""} instead, and make sure to omit the slashes.

* [HADOOP-19 - MongoStorage fails when tuples w/i bags are not named](https://jira.mongodb.org/browse/HADOOP-19)

  This is due to an open Apache bug, [PIG-2509](https://issues.apache.org/jira/browse/PIG-2509).

### Streaming

Streaming support in MongoHadoop **requires** that the Hadoop distribution include the patches for the following issues:

* [HADOOP-1722 - Make streaming to handle non-utf8 byte array](https://issues.apache.org/jira/browse/HADOOP-1722)
* [HADOOP-5450 - Add support for application-specific typecodes to typed bytes](https://issues.apache.org/jira/browse/HADOOP-5450)
* [MAPREDUCE-764 - TypedBytesInput's readRaw() does not preserve custom type codes](https://issues.apache.org/jira/browse/MAPREDUCE-764)

The mainline Apache Hadoop distribution merged these patches for the 0.21.0 release. We have also verified that the [Cloudera](http://cloudera.com) distribution (while still based on 0.20.x) includes these patches in CDH3 Update 1 and later (we now build against Update 3); anecdotal evidence (which needs confirmation) indicates they may have been there since CDH2, and likely exist in CDH3 as well.

By default, the Mongo-Hadoop project builds against Apache 0.20.203, which does *not* include these patches. To build or enable Streaming support you must build against either Cloudera CDH3u1 or Hadoop 0.21.0.

Additionally, note that Hadoop 1.0 is based on the 0.20 release. As such, it *does not include* the patches necessary for streaming. This is frustrating and upsetting but unfortunately out of our hands. We are working on getting these patches backported into a future release or finding an additional workaround.

Documentation and Build Details: http://api.mongodb.org/hadoop/MongoDB%2BHadoop+Connector.html
