
Commit 574d168

Split large shard sizing into two challenges and added additional id types (elastic#24)
* Split large shard sizing into two challenges and added additional id types
* Updated README.md file for revised large shard sizing tests
* Updated following review.
* Updated to use default Rally refresh operation
1 parent b46584e commit 574d168

File tree

3 files changed: +92 −136 lines

README.md

Lines changed: 15 additions & 3 deletions
@@ -111,14 +111,13 @@ $ cat params-file.json
 
 ### 8) large-shard-sizing
 
-This challenge examines the performance and memory usage of large shards. It indexes data into a single shard index ~25GB at a time and runs up to a shard size of ~300GB. After every 25GB that has been indexed, select index statistics are recorded and a number of simulated Kibana dashboards are run against the index. Two indices are created and benchmarked, one with document IDs generated by Elasticsearch and one with application generated UUIDs used as document IDs.
+This challenge examines the performance and memory usage of large shards. It indexes data into a single shard index ~25GB at a time and runs up to a shard size of ~300GB. After every 25GB has been indexed, select index statistics are recorded and a number of simulated Kibana dashboards are run against the index to show how query performance varies with shard size.
 
 This challenge will show the following:
-* How index performance varies with shard size for autogenerated IDs and UUIDs
 * How dashboard query performance varies with shard size
 * How memory usage varies with shard size
 
-Note that this challenge will generate up to ~600GB of data on disk and will require additional space for merging and overhead. Make sure around 1TB of disk space is available before running this to be on the safe side.
+Note that this challenge will generate up to ~300GB of data on disk and will require additional space for merging and overhead. Make sure around 600GB of disk space is available before running this to be on the safe side.
 
 The table below shows the track parameters that can be adjusted along with default values:
 
@@ -127,6 +126,19 @@ The table below shows the track parameters that can be adjusted along with default values:
 | `bulk_indexing_clients` | Number of bulk indexing clients/connections | `int` | `32` |
 | `query_iterations` | Number of times each dashboard is simulated at each level | `int` | `10` |
 
+### 9) large-shard-id-type-evaluation
+
+This challenge examines the storage and heap usage implications of a wide variety of document ID types. It indexes data into a set of ~25GB single-shard indices, one for each type of document ID (`auto`, `uuid`, `epoch_uuid`, `sha1`, `sha256`, `sha384`, and `sha512`). For each index a refresh is then run before select index statistics are recorded.
+
+Note that this challenge will generate up to ~200GB of data on disk and will require additional space for merging and overhead. Make sure around 300GB of disk space is available before running this to be on the safe side.
+
+The table below shows the track parameters that can be adjusted along with default values:
+
+| Parameter | Explanation | Type | Default Value |
+| --------- | ----------- | ---- | ------------- |
+| `bulk_indexing_clients` | Number of bulk indexing clients/connections | `int` | `32` |
+
 ## Custom parameter sources
 
 ### elasticlogs\_bulk\_source

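To try the new challenge, a hypothetical Rally invocation could look like the line below. This is a sketch only: `--track-path`, `--challenge`, and `--track-params` are standard esrally flags, but check the exact usage against your Rally version; the local checkout path and the `bulk_indexing_clients:16` override are example values, not part of this commit.

$ esrally --track-path=./rally-eventdata-track --challenge=large-shard-id-type-evaluation --track-params="bulk_indexing_clients:16"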
eventdata/challenges/large-shard-sizing.json

Lines changed: 63 additions & 132 deletions
@@ -5,7 +5,7 @@
 
 {
   "name": "large-shard-sizing",
-  "description": "Index data into a single shard ~25Gb at a time (up to a total of ~300GB), then record index statistics and run a number of queries against the shard. IDs are based on UUIDs or autogenerated by Elasticsearch, meaning there are no conflicts.",
+  "description": "Index data into a single shard ~25Gb at a time (up to a total of ~300GB), then record index statistics and run a number of queries against the shard. IDs are autogenerated by Elasticsearch, meaning there are no conflicts.",
   "meta": {
     "client_count": {{ p_bulk_indexing_clients }},
     "benchmark_type": "large-shard-sizing",
@@ -40,34 +40,6 @@
         "index_template_name": "elasticlogs-auto"
       }
     },
-    {
-      "name": "deleteindex-elasticlogs-uuid",
-      "operation": {
-        "operation-type": "delete-index",
-        "index": "elasticlogs-uuid"
-      }
-    },
-    {
-      "name": "createindex-elasticlogs-uuid",
-      "operation": {
-        "operation-type": "createindex",
-        "index_name": "elasticlogs-uuid",
-        "index_template_body": {
-          "template": "elasticlogs-uuid",
-          "settings": {
-            "index.refresh_interval": "5s",
-            "index.codec": "best_compression",
-            "index.number_of_replicas": 0,
-            "index.number_of_shards": 1
-          },
-          "mappings":
-          {% include "mappings.json" %}
-          ,
-          "aliases": {}
-        },
-        "index_template_name": "elasticlogs-uuid"
-      }
-    },
     {% for p_multiple in range(1, 13) %}
     {% set p_size = p_multiple * 25 %}
     {
@@ -76,7 +48,8 @@
         "operation-type": "bulk",
         "param-source": "elasticlogs_bulk",
         "index": "elasticlogs-auto",
-        "bulk-size": 1000
+        "bulk-size": 1000,
+        "id_type": "auto"
       },
       "iterations": {{ p_ops_per_client }},
       "clients": {{ p_bulk_indexing_clients }},
@@ -87,11 +60,7 @@
     },
     {
       "name": "refresh-auto-{{ p_size }}",
-      "operation": {
-        "operation-type": "raw-request",
-        "method": "POST",
-        "path": "/elasticlogs-auto/_refresh"
-      },
+      "operation": "refresh",
       "iterations": 1,
       "clients": 1
     },
@@ -103,7 +72,7 @@
       },
       "meta": {
         "id_mode": "auto",
-        "shard_size": {{ p_size }}
+        "shard_size": {{ p_size }}
       }
     },
     {
@@ -181,128 +150,90 @@
       }
     },
     {% endfor %}
-    {% for p_multiple in range(1, 13) %}
-    {% set p_size = p_multiple * 25 %}
     {
-      "name": "index-append-1000-uuid-{{ p_size }}",
+      "name": "refresh-final",
+      "operation": "refresh",
+      "iterations": 1,
+      "clients": 1
+    }
+  ]
+},
+{
+  "name": "large-shard-id-type-evaluation",
+  "description": "Index data into a number of ~25Gb single shard indices with different document ID types, then record index statistics to allow size and memory usage comparisons. IDs are based on UUIDs or autogenerated by Elasticsearch, meaning there are no conflicts.",
+  "meta": {
+    "client_count": {{ p_bulk_indexing_clients }},
+    "benchmark_type": "large-shard-sizing",
+    "version": 2
+  },
+  "schedule": [
+    {% for id_type in ['auto', 'uuid', 'epoch_uuid', 'sha1', 'sha256', 'sha384', 'sha512'] %}
+    {
+      "name": "deleteindex-elasticlogs-{{ id_type }}",
       "operation": {
-        "operation-type": "bulk",
-        "param-source": "elasticlogs_bulk",
-        "index": "elasticlogs-uuid",
-        "bulk-size": 1000
-      },
-      "iterations": {{ p_ops_per_client }},
-      "clients": {{ p_bulk_indexing_clients }},
-      "meta": {
-        "id_mode": "uuid",
-        "shard_size": {{ p_size }}
+        "operation-type": "delete-index",
+        "index": "elasticlogs-{{ id_type }}"
       }
     },
     {
-      "name": "refresh-uuid-{{ p_size }}",
+      "name": "createindex-elasticlogs-{{ id_type }}",
       "operation": {
-        "operation-type": "raw-request",
-        "method": "POST",
-        "path": "/elasticlogs-uuid/_refresh"
-      },
-      "iterations": 1,
-      "clients": 1
+        "operation-type": "createindex",
+        "index_name": "elasticlogs-{{ id_type }}",
+        "index_template_body": {
+          "template": "elasticlogs-{{ id_type }}",
+          "settings": {
+            "index.refresh_interval": "5s",
+            "index.codec": "best_compression",
+            "index.number_of_replicas": 0,
+            "index.number_of_shards": 1
+          },
+          "mappings":
+          {% include "mappings.json" %}
+          ,
+          "aliases": {}
+        },
+        "index_template_name": "elasticlogs-{{ id_type }}"
+      }
     },
     {
-      "name": "indicesstats-elasticlogs-uuid-{{ p_size }}",
+      "name": "index-append-1000-{{ id_type }}",
       "operation": {
-        "operation-type": "indicesstats",
-        "index_pattern": "elasticlogs-uuid"
+        "operation-type": "bulk",
+        "param-source": "elasticlogs_bulk",
+        "index": "elasticlogs-{{ id_type }}",
+        "bulk-size": 1000,
+        "id_type": "{{ id_type }}"
      },
+      "iterations": {{ p_ops_per_client }},
+      "clients": {{ p_bulk_indexing_clients }},
       "meta": {
-        "id_mode": "uuid",
-        "shard_size": {{ p_size }}
+        "id_mode": "{{ id_type }}"
       }
     },
     {
-      "name": "fieldstats-elasticlogs-uuid-{{ p_size }}",
-      "operation": {
-        "operation-type": "fieldstats",
-        "index_pattern": "elasticlogs-uuid"
-      },
-      "warmup-iterations": 1,
-      "iterations": 1,
-      "clients": {{ p_bulk_indexing_clients }}
-    },
-    {
-      "name": "clear-caches-uuid-{{ p_size }}",
-      "operation": {
-        "operation-type": "raw-request",
-        "method": "POST",
-        "path": "/_cache/clear"
-      },
+      "name": "refresh-{{ id_type }}",
+      "operation": "refresh",
       "iterations": 1,
       "clients": 1
     },
-    {
-      "name": "kibana-content_issues-50%-uuid-{{ p_size }}",
-      "operation": {
-        "operation-type": "kibana",
-        "param-source": "elasticlogs_kibana",
-        "dashboard": "content_issues",
-        "index_pattern": "elasticlogs-uuid",
-        "query_string": ["*"],
-        "window_end": "START+50%,END",
-        "window_length": "50%"
-      },
-      "iterations": {{ p_query_iterations }},
-      "clients": 1,
-      "meta": {
-        "id_mode": "uuid",
-        "shard_size": {{ p_size }}
-      }
-    },
-    {
-      "name": "kibana-traffic-25%-uuid-{{ p_size }}",
-      "operation": {
-        "operation-type": "kibana",
-        "param-source": "elasticlogs_kibana",
-        "dashboard": "traffic",
-        "index_pattern": "elasticlogs-uuid",
-        "query_string": ["*"],
-        "window_end": "START+25%,END",
-        "window_length": "25%"
-      },
-      "iterations": {{ p_query_iterations }},
-      "clients": 1,
-      "meta": {
-        "id_mode": "uuid",
-        "shard_size": {{ p_size }}
-      }
-    },
-    {
-      "name": "kibana-discover-50%-uuid-{{ p_size }}",
+    {
+      "name": "indicesstats-elasticlogs-{{ id_type }}",
       "operation": {
-        "operation-type": "kibana",
-        "param-source": "elasticlogs_kibana",
-        "dashboard": "discover",
-        "index_pattern": "elasticlogs-uuid",
-        "query_string": ["*"],
-        "window_end": "START+50%,END",
-        "window_length": "50%"
+        "operation-type": "indicesstats",
+        "index_pattern": "elasticlogs-{{ id_type }}"
       },
-      "iterations": {{ p_query_iterations }},
-      "clients": 1,
       "meta": {
-        "id_mode": "uuid",
-        "shard_size": {{ p_size }}
+        "id_mode": "{{ id_type }}"
       }
     },
-    {% endfor %}
+    {% endfor %}
     {
       "name": "refresh-final",
-      "operation": {
-        "operation-type": "raw-request",
-        "method": "POST",
-        "path": "/elasticlogs-*/_refresh"
-      },
+      "operation": "refresh",
       "iterations": 1,
       "clients": 1
     }
   ]
 }
+
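To make the templating in the new challenge concrete, one iteration of the `id_type` loop renders to tasks such as the following for `id_type = sha1` (an illustrative expansion only; the `{{ p_ops_per_client }}` and `{{ p_bulk_indexing_clients }}` placeholders are still substituted from the track parameters at render time):

    {
      "name": "index-append-1000-sha1",
      "operation": {
        "operation-type": "bulk",
        "param-source": "elasticlogs_bulk",
        "index": "elasticlogs-sha1",
        "bulk-size": 1000,
        "id_type": "sha1"
      },
      "iterations": {{ p_ops_per_client }},
      "clients": {{ p_bulk_indexing_clients }},
      "meta": {
        "id_mode": "sha1"
      }
    },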
eventdata/parameter_sources/elasticlogs_bulk_source.py

Lines changed: 14 additions & 1 deletion
@@ -2,6 +2,7 @@
 import random
 import uuid
 import time
+import hashlib
 from eventdata.parameter_sources.randomevent import RandomEvent
 
 logger = logging.getLogger("track.eventdata")
@@ -50,6 +51,10 @@ class ElasticlogsBulkSource:
                 uuid - Assign a UUID4 id to each document.
                 epoch_uuid - Assign a UUID4 identifier prefixed with the hex representation of the current
                              timestamp.
+                sha1 - SHA1 hash of UUID in hex representation. (Note: Generating this type of id can be CPU intensive)
+                sha256 - SHA256 hash of UUID in hex representation. (Note: Generating this type of id can be CPU intensive)
+                sha384 - SHA384 hash of UUID in hex representation. (Note: Generating this type of id can be CPU intensive)
+                sha512 - SHA512 hash of UUID in hex representation. (Note: Generating this type of id can be CPU intensive)
             "id_delay_probability" - If id_type is set to `epoch_uuid`, this parameter determines the probability that the id will be set in the
                                      past. This can be used to simulate a portion of the events arriving delayed. Must be in range [0.0, 1.0].
                                      Defaults to 0.0.
@@ -67,7 +72,7 @@ def __init__(self, track, params, **kwargs):
 
         self._id_type = "auto"
         if 'id_type' in params.keys():
-            if params['id_type'] in ['auto', 'uuid', 'epoch_uuid']:
+            if params['id_type'] in ['auto', 'uuid', 'epoch_uuid', 'sha1', 'sha256', 'sha384', 'sha512']:
                 self._id_type = params['id_type']
             else:
                 logger.warning("[bulk] Invalid id_type ({}) specified. Will use default.".format(params['id_type']))
@@ -115,6 +120,14 @@ def params(self):
         else:
             if self._id_type == 'uuid':
                 docid = self.__get_uuid()
+            elif self._id_type == 'sha1':
+                docid = hashlib.sha1(self.__get_uuid().encode()).hexdigest()
+            elif self._id_type == 'sha256':
+                docid = hashlib.sha256(self.__get_uuid().encode()).hexdigest()
+            elif self._id_type == 'sha384':
+                docid = hashlib.sha384(self.__get_uuid().encode()).hexdigest()
+            elif self._id_type == 'sha512':
+                docid = hashlib.sha512(self.__get_uuid().encode()).hexdigest()
             else:
                 docid = self.__get_epoch_uuid()
 
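As a quick sense-check of why the ID type matters for storage, the hex digests differ substantially in length, and the `_id` is stored for every document. Below is a minimal standalone sketch using only the standard library; `make_id()` is a hypothetical stand-in for the parameter source's private `__get_uuid()` helper, whose implementation is not shown in this diff:

import hashlib
import uuid

def make_id():
    # Hypothetical stand-in for ElasticlogsBulkSource.__get_uuid();
    # the real helper is not part of this commit's diff.
    return str(uuid.uuid4())

for name in ("sha1", "sha256", "sha384", "sha512"):
    digest = hashlib.new(name, make_id().encode()).hexdigest()
    # Prints 40, 64, 96 and 128 hex characters respectively.
    print(name, len(digest))

A plain UUID4 string is 36 characters, so the SHA variants produce progressively longer `_id` values, which is the storage and heap trade-off the large-shard-id-type-evaluation challenge is designed to measure.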