Drops the inline callouts from the docs. This is when you write `<1>`
anywhere but the end of a line. Asciidoctor doesn't support them and
we'd very much like to move to Asciidoctor to generate the docs because
it is being actively maintained.
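For illustration, the difference looks like this (a made-up snippet mirroring the changes below, not itself part of the diff):

----
-- inline callouts: the <1>/<2> markers sit mid-line, which Asciidoctor rejects
STORE B INTO 'radio/artists'<1> USING org.elasticsearch.hadoop.pig.EsStorage(<2>);

-- end-of-line callouts: each marker is the last thing on its line
STORE B INTO 'radio/artists' <1>
      USING org.elasticsearch.hadoop.pig.EsStorage(); <2>
----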
docs/src/reference/asciidoc/core/hive.adoc

-<2> Hive column `date` mapped in {es} to `@timestamp`
-<3> Hive column `url` mapped in {es} to `url_123`
+<1> Hive column `date` mapped in {es} to `@timestamp`; Hive column `url` mapped in {es} to `url_123`

 TIP: Hive is case **insensitive** while {es} is not. The loss of information can create invalid queries (as the column in Hive might not match the one in {es}). To avoid this, {eh} will always convert Hive column names to lower-case.

 This being said, it is recommended to use the default Hive style and use upper-case names only for Hive commands and avoid mixed-case names.
@@ -99,7 +97,7 @@ CREATE EXTERNAL TABLE artists (
 name STRING,
 links STRUCT<url:STRING, picture:STRING>)
 STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'<1>
[...]
docs/src/reference/asciidoc/core/pig.adoc (+23 -21)
@@ -18,7 +18,7 @@ In order to use {eh}, its jar needs to be in Pig's classpath. There are various
 REGISTER /path/elasticsearch-hadoop.jar;
 ----

-NOTE: the command expects a proper URI that can be found either on the local file-system or remotely. Typically it's best to use a distributed file-system (like HDFS or Amazon S3) and use that since the script might be executed
+NOTE: The command expects a proper URI that can be found either on the local file-system or remotely. Typically it's best to use a distributed file-system (like HDFS or Amazon S3) and use that since the script might be executed
 on various machines.

 As an alternative, when using the command-line, one can register additional jars through the `-Dpig.additional.jars` option (that accepts an URI as well):
@@ -44,9 +44,10 @@ With Pig, one can specify the <<configuration,configuration>> properties (as an

 [source,sql]
 ----
-STORE B INTO 'radio/artists'<1> USING org.elasticsearch.hadoop.pig.EsStorage
-      ('es.http.timeout = 5m<2>',
-       'es.index.auto.create = false' <3>);
+STORE B INTO 'radio/artists' <1>
+      USING org.elasticsearch.hadoop.pig.EsStorage
+      ('es.http.timeout = 5m', <2>
+       'es.index.auto.create = false'); <3>
 ----

 <1> {eh} configuration (target resource)
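Pieced together from the snippets in this file, a complete write script in the new formatting could look like the sketch below; the jar path and the input file's column list are assumptions for illustration, not taken from the diff.

----
REGISTER /path/elasticsearch-hadoop.jar;

-- load a tab-separated file (the schema here is illustrative)
A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
    AS (id:long, name:chararray, url:chararray, picture:chararray);
B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;

-- write to the 'radio/artists' resource with explicit settings
STORE B INTO 'radio/artists'
      USING org.elasticsearch.hadoop.pig.EsStorage
      ('es.http.timeout = 5m',
       'es.index.auto.create = false');
----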
@@ -163,12 +164,10 @@ For example:
 [source,sql]
 ----
 STORE B INTO '...' USING org.elasticsearch.hadoop.pig.EsStorage(
[...]
-<2> Pig column `date` mapped in {es} to `@timestamp`
-<3> Pig column `url` mapped in {es} to `url_123`
+<1> Pig column `date` mapped in {es} to `@timestamp`; Pig column `uRL` mapped in {es} to `url`

 TIP: Since {eh} 2.1, the Pig schema case sensitivity is preserved to {es} and back.

@@ -185,11 +184,13 @@ A = LOAD 'src/test/resources/artists.dat' USING PigStorage()
 -- transform data
 B = FOREACH A GENERATE name, TOTUPLE(url, picture) AS links;
 -- save the result to Elasticsearch
-STORE B INTO 'radio/artists'<1> USING org.elasticsearch.hadoop.pig.EsStorage(<2>);
+STORE B INTO 'radio/artists'<1>
+      USING org.elasticsearch.hadoop.pig.EsStorage(); <2>
 ----

 <1> {es} resource (index and type) associated with the given storage
-<2> additional configuration parameters can be passed here - in this case the defaults are used
+<2> additional configuration parameters can be passed inside the `()` - in this
+    case the defaults are used

 For cases where the id (or other metadata fields like +ttl+ or +timestamp+) of the document needs to be specified, one can do so by setting the appropriate <<cfg-mapping, mapping>> namely +es.mapping.id+. Following the previous example, to indicate to {es} to use the field +id+ as the document id, update the +Storage+ configuration:

@@ -219,9 +220,9 @@ IMPORTANT: Make sure the data is properly encoded, in `UTF-8`. The field content

 [source,sql]
 ----
-A = LOAD '/resources/artists.json' USING PigStorage() AS (json:chararray<1>);"
+A = LOAD '/resources/artists.json' USING PigStorage() AS (json:chararray);" <1>
 STORE B INTO 'radio/artists'
-     USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true'<2>...);
+     USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true'...); <2>
 ----

 <1> Load the (JSON) data as a single field (`json`)
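A self-contained variant of the JSON example above, with two small liberties: the stray trailing `"` (which looks like a pre-existing typo in the docs) is dropped, and the loaded relation is stored directly instead of the undefined `B`.

----
-- each input line is expected to hold one JSON document, sent to {es} as-is
A = LOAD '/resources/artists.json' USING PigStorage() AS (json:chararray);
STORE A INTO 'radio/artists'
      USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');
----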
@@ -235,8 +236,9 @@ One can index the data to a different resource, depending on the 'row' being rea

 [source,sql]
 ----
 A = LOAD 'src/test/resources/media.dat' USING PigStorage()
-            AS (name:chararray, type:chararray <1>, year: chararray);
-STORE B INTO 'my-collection/{type}'<2> USING org.elasticsearch.hadoop.pig.EsStorage();
+            AS (name:chararray, type:chararray, year: chararray); <1>
+STORE B INTO 'my-collection/{type}' <2>
+      USING org.elasticsearch.hadoop.pig.EsStorage();
 ----

 <1> Tuple field used by the resource pattern. Any of the declared fields can be used.
@@ -262,8 +264,8 @@ the table declaration can be as follows:

 [source,sql]
 ----
-A = LOAD '/resources/media.json' USING PigStorage() AS (json:chararray<1>);"
-STORE B INTO 'my-collection/{media_type}'<2>
+A = LOAD '/resources/media.json' USING PigStorage() AS (json:chararray);" <1>
+STORE B INTO 'my-collection/{media_type}'<2>
      USING org.elasticsearch.hadoop.pig.EsStorage('es.input.json=true');
 ----

@@ -278,23 +280,23 @@ As you would expect, loading the data is straight forward:
 [source,sql]
 ----
 -- execute Elasticsearch query and load data into Pig
-A = LOAD 'radio/artists'<1>
-    USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'<2>);
+A = LOAD 'radio/artists'<1>
+    USING org.elasticsearch.hadoop.pig.EsStorage('es.query=?me*'); <2>
 DUMP A;
 ----

 <1> {es} resource
 <2> search query to execute

-IMPORTANT: Due to a https://issues.apache.org/jira/browse/PIG-3646[bug] in Pig, +LoadFunctions+ are not aware of any schema associated with them. This means +EsStorage+ is forced to fully the documents
+IMPORTANT: Due to a https://issues.apache.org/jira/browse/PIG-3646[bug] in Pig, +LoadFunctions+ are not aware of any schema associated with them. This means +EsStorage+ is forced to fully parse the documents
 from Elasticsearch before passing the data to Pig for projection. In practice, this has little impact as long as a document top-level fields are used; for nested fields consider extracting the values
 yourself in Pig.


 [float]
 === Reading data from {es} as JSON

-In case where the results from {es} need to be in JSON format (typically to be sent down the wire to some other system), one can instruct the {eh} to return the data as is. By setting `es.output.json` to `true`, the connector will parse the response from {es}, identify the documents and, without converting them, return their content to the user as +String/chararray+ objects.
+In the case where the results from {es} need to be in JSON format (typically to be sent down the wire to some other system), one can instruct {eh} to return the data as is. By setting `es.output.json` to `true`, the connector will parse the response from {es}, identify the documents and, without converting them, return their content to the user as +String/chararray+ objects.


 [[pig-type-conversion]]
@@ -316,7 +318,7 @@ Pig internally uses native java types for most of its types and {eh} abides to t
 | `double` | `double`
 | `float` | `float`
 | `bytearray` | `binary`
-| `tuple` | `array` or `map` (depending on <<tuple-names,this>> settings)
+| `tuple` | `array` or `map` (depending on <<tuple-names,this>> setting)
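The `es.output.json` paragraph above carries no example in this diff; a minimal sketch of what it describes might look like the following (the output path is illustrative):

----
-- with es.output.json=true, each record is a single chararray holding
-- one document returned verbatim as raw JSON
A = LOAD 'radio/artists'
    USING org.elasticsearch.hadoop.pig.EsStorage('es.output.json=true');
STORE A INTO '/tmp/artists-json' USING PigStorage();
----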