Drops the inline callouts from the docs. These occur when you write `<1>`
anywhere but at the end of a line. Asciidoctor doesn't support them, and
we'd very much like to move to Asciidoctor to generate the docs because
it is actively maintained.
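For illustration, here is the pattern being removed next to the Asciidoctor-friendly form, using a line taken from this very diff:

[source,sql]
----
-- inline callout: Asciidoctor rejects this
USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'<2>);
-- callout moved to the end of the line: Asciidoctor accepts this
USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'); <2>
----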
-<2> Hive column `date` mapped in {es} to `@timestamp`
-<3> Hive column `url` mapped in {es} to `url_123`
+<1> Hive column `date` mapped in {es} to `@timestamp`; Hive column `url` mapped in {es} to `url_123`
 
 TIP: {es} accepts only lower-case field name and, as such, {eh} will always convert Hive column names to lower-case. This poses no issue as Hive is **case insensitive**
 however it is recommended to use the default Hive style and use upper-case names only for Hive commands and avoid mixed-case names.
@@ -97,7 +95,7 @@ CREATE EXTERNAL TABLE artists (
     name STRING,
     links STRUCT<url:STRING, picture:STRING>)
 STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'<1>
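For context, the hunk above is part of a table declaration along these lines — a hedged reconstruction, in which the `id` column and the `TBLPROPERTIES` line are assumptions (they are not shown in this diff):

[source,sql]
----
CREATE EXTERNAL TABLE artists (
    id      BIGINT,
    name    STRING,
    links   STRUCT<url:STRING, picture:STRING>)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'radio/artists'); -- assumed target resource
----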
docs/src/reference/asciidoc/core/pig.adoc (+22 -20)
@@ -18,7 +18,7 @@ In order to use {eh}, its jar needs to be in Pig's classpath. There are various
 REGISTER /path/elasticsearch-hadoop.jar;
 ----
 
-NOTE: the command expects a proper URI that can be found either on the local file-system or remotely. Typically it's best to use a distributed file-system (like HDFS or Amazon S3) and use that since the script might be executed
+NOTE: The command expects a proper URI that can be found either on the local file-system or remotely. Typically it's best to use a distributed file-system (like HDFS or Amazon S3) and use that since the script might be executed
 on various machines.
 
 As an alternative, when using the command-line, one can register additional jars through the `-Dpig.additional.jars` option (that accepts an URI as well):
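A sketch of that command-line alternative (the script name is hypothetical):

[source,bash]
----
pig -Dpig.additional.jars=/path/elasticsearch-hadoop.jar myscript.pig
----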
@@ -44,9 +44,10 @@ With Pig, one can specify the <<configuration,configuration>> properties (as an
 
 [source,sql]
 ----
-STORE B INTO 'radio/artists'<1> USING org.elasticsearch.hadoop.pig.EsStorage
-             ('es.http.timeout = 5m<2>',
-              'es.index.auto.create = false' <3>);
+STORE B INTO 'radio/artists' <1>
+      USING org.elasticsearch.hadoop.pig.EsStorage
+            ('es.http.timeout = 5m', <2>
+             'es.index.auto.create = false'); <3>
 ----
 
 <1> {eh} configuration (target resource)
@@ -163,12 +164,10 @@ For example:
 [source,sql]
 ----
 STORE B INTO '...' USING org.elasticsearch.hadoop.pig.EsStorage(
-<2> Pig column `date` mapped in {es} to `@timestamp`
-<3> Pig column `url` mapped in {es} to `url_123`
+<1> Pig column `date` mapped in {es} to `@timestamp`; Pig column `uRL` mapped in {es} to `url`
 
 TIP: {es} accepts only lower-case field name and, as such, {eh} will always convert Pig column names to lower-case. Because Pig is **case sensitive**, {eh} handles the reverse
 field mapping as well. It is recommended to use the default Pig style and use upper-case names only for commands and avoid mixed-case names.
@@ -186,11 +185,13 @@ A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
 -- transform data
 B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
 -- save the result to Elasticsearch
-STORE B INTO 'radio/artists'<1> USING org.elasticsearch.hadoop.pig.EsStorage(<2>);
+STORE B INTO 'radio/artists'<1>
+      USING org.elasticsearch.hadoop.pig.EsStorage(); <2>
 ----
 
 <1> {es} resource (index and type) associated with the given storage
-<2> additional configuration parameters can be passed here - in this case the defaults are used
+<2> additional configuration parameters can be passed inside the `()` - in this
+case the defaults are used
 
 [float]
 ==== Writing existing JSON to {es}
@@ -213,9 +214,9 @@ IMPORTANT: Make sure the data is properly encoded, in `UTF-8`. The field content
 
 [source,sql]
 ----
-A = LOAD '/resources/artists.json' USING PigStorage() AS (json:chararray<1>);"
+A = LOAD '/resources/artists.json' USING PigStorage() AS (json:chararray);" <1>
 STORE B INTO 'radio/artists'
-    USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true'<2>...);
+    USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true'...); <2>
 ----
 
 <1> Load the (JSON) data as a single field (`json`)
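Put together, the JSON write flow might read as follows — a sketch that stores the loaded relation directly (the `B` relation is not defined in this hunk, so `A` is used here):

[source,sql]
----
A = LOAD '/resources/artists.json' USING PigStorage() AS (json:chararray);
-- each chararray is passed to {es} as-is, thanks to es.input.json=true
STORE A INTO 'radio/artists'
    USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');
----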
@@ -229,8 +230,9 @@ One can index the data to a different resource, depending on the 'row' being rea
 [source,sql]
 ----
 A = LOAD 'src/test/resources/media.dat' USING PigStorage()
-            AS (name:chararray, type:chararray <1>, year: chararray);
-STORE B INTO 'my-collection/{type}'<2> USING org.elasticsearch.hadoop.pig.EsStorage();
+            AS (name:chararray, type:chararray, year: chararray); <1>
+STORE B INTO 'my-collection/{type}' <2>
+    USING org.elasticsearch.hadoop.pig.EsStorage();
 ----
 
 <1> Tuple field used by the resource pattern. Any of the declared fields can be used.
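A self-contained sketch of the dynamic-resource pattern above (storing `A` directly, since `B` is not defined in this hunk):

[source,sql]
----
A = LOAD 'src/test/resources/media.dat' USING PigStorage()
        AS (name:chararray, type:chararray, year:chararray);
-- a row whose `type` field is 'book' is indexed under my-collection/book
STORE A INTO 'my-collection/{type}'
    USING org.elasticsearch.hadoop.pig.EsStorage();
----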
@@ -256,8 +258,8 @@ the table declaration can be as follows:
 
 [source,sql]
 ----
-A = LOAD '/resources/media.json' USING PigStorage() AS (json:chararray<1>);"
-STORE B INTO 'my-collection/{media_type}'<2>
+A = LOAD '/resources/media.json' USING PigStorage() AS (json:chararray);" <1>
+STORE B INTO 'my-collection/{media_type}' <2>
     USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');
 ----
@@ -272,15 +274,15 @@ As you would expect, loading the data is straight forward:
 [source,sql]
 ----
 -- execute Elasticsearch query and load data into Pig
-A = LOAD 'radio/artists'<1>
-    USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'<2>);
+A = LOAD 'radio/artists'<1>
+    USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'); <2>
 DUMP A;
 ----
 
 <1> {es} resource
 <2> search query to execute
 
-IMPORTANT: Due to a https://issues.apache.org/jira/browse/PIG-3646[bug] in Pig, +LoadFunc+tions are not aware of any schema associated with them. This means +EsStorage+ is forced to fully the documents
+IMPORTANT: Due to a https://issues.apache.org/jira/browse/PIG-3646[bug] in Pig, +LoadFunctions+ are not aware of any schema associated with them. This means +EsStorage+ is forced to fully parse the documents
 from Elasticsearch before passing the data to Pig for projection. In practice, this has little impact as long as a document top-level fields are used; for nested fields consider extracting the values
 yourself in Pig.
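Besides URI queries such as `?me*`, `es.query` also accepts a full query DSL document; a hedged sketch (the `name` field is an assumption, not from this diff):

[source,sql]
----
A = LOAD 'radio/artists'
    USING org.elasticsearch.hadoop.pig.EsStorage(
          'es.query = { "query" : { "wildcard" : { "name" : "me*" } } }');
----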
@@ -303,7 +305,7 @@ Pig internally uses native java types for most of its types and {eh} abides to t
 | `double`    | `double`
 | `float`     | `float`
 | `bytearray` | `binary`
-| `tuple`     | `array` or `map` (depending on <<tuple-names,this>> settings)
+| `tuple`     | `array` or `map` (depending on <<tuple-names,this>> setting)
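Assuming the setting referenced by the <<tuple-names,tuple-names>> link is `es.mapping.pig.tuple.use.field.names` (verify against the linked section — the property name is an assumption here), toggling it would look like:

[source,sql]
----
-- with the setting enabled, tuples are serialized as maps keyed by field name
STORE B INTO 'radio/artists'
    USING org.elasticsearch.hadoop.pig.EsStorage(
          'es.mapping.pig.tuple.use.field.names = true');
----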