
[ML] Memory estimates way too high for very simple analyses #1106


Closed
droberts195 opened this issue Mar 31, 2020 · 5 comments

@droberts195
Contributor

Steps to reproduce:

  1. Import the Iowa liquor sales dataset using the file data visualizer (ping me if you need the file)
  2. Create a new regression data frame analytics job to analyze it: set the training percent to 5, set the dependent variable to Sale (Dollars), and exclude every variable except Bottle Volume (ml) and Store Number from the analysis. (So effectively we're predicting one number from two others on 5% of the 380000 rows, i.e. 19000 rows.)

[Screenshot: job configuration, 2020-03-31 at 13:05:05]

  3. Start the analysis and wait for it to finish.
  4. Look at the job details. The memory limit recommended by the UI was around 1.2GB. The actual memory required was less than 12MB.

[Screenshot: job details, 2020-03-31 at 13:09:49]

Part of the problem here is elastic/kibana#60496, because the memory estimate didn't get updated when I added the exclude fields. However, a considerable part of the problem is in the C++ estimation code. If I run the estimate in dev console using the final config it's still 25 times bigger than it needs to be:

POST _ml/data_frame/analytics/_explain
{
    "source": {
      "index": [
        "iowa"
      ],
      "query": {
        "match_all": {}
      }
    },
    "analysis": {
      "regression": {
        "dependent_variable": "Sale (Dollars)",
        "prediction_field_name": "Sale (Dollars)_prediction",
        "training_percent": 5
      }
    },
    "analyzed_fields": {
      "includes": [],
      "excludes": [
        "Address",
        "Bottles Sold",
        "Category",
        "Category Name",
        "City",
        "County",
        "County Boundaries of Iowa",
        "Iowa ZIP Code Tabulation Areas",
        "Item Description",
        "Item Number",
        "Pack",
        "State Bottle Cost",
        "State Bottle Retail",
        "Store Location",
        "Store Name",
        "Iowa Watersheds (HUC 10)",
        "Iowa Watershed Sub-Basins (HUC 08)",
        "County Number",
        "Invoice/Item Number",
        "US Counties",
        "Zip Code",
        "Vendor Name",
        "Vendor Number",
        "Volume Sold (Gallons)",
        "Volume Sold (Liters)"
      ]
    }  
}

returns:

{
  "field_selection" : [
    ... blah ...
  ],
  "memory_estimation" : {
    "expected_memory_without_disk" : "306147kb",
    "expected_memory_with_disk" : "306147kb"
  }
}

And from the second screenshot you can see actual was 12322863 bytes ~= 12034kb.

This is a big problem for Cloud trials where users don't have much memory to play with, and we refuse to run an analysis if its memory estimate won't fit onto the available machine.
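The overestimation factor quoted above can be sanity-checked with a quick calculation, using the 306147kb figure from the `_explain` response and the 12322863-byte actual usage from the job stats:

```python
# Compare the _explain memory estimate with the job's actual peak usage.
estimate_bytes = 306_147 * 1024   # "expected_memory_without_disk": 306147kb
actual_bytes = 12_322_863         # peak usage reported in the job details

ratio = estimate_bytes / actual_bytes
print(f"estimate is {ratio:.1f}x actual")  # roughly 25x, as stated above
```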

@tveasey
Contributor

tveasey commented Mar 31, 2020

This is partly a known issue: we need to communicate the training percentage to the memory estimation process, since this very significantly affects the actual memory usage.
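As a rough illustration of why this matters (this is a hypothetical sketch, not the actual C++ estimator; `scale_estimate_kb` and `fixed_overhead_kb` are made-up names): if the training cost dominates, an estimate made for the full dataset could be scaled down by the fraction of rows actually used for training.

```python
def scale_estimate_kb(full_estimate_kb: float, training_percent: float,
                      fixed_overhead_kb: float = 0.0) -> float:
    """Scale the training-dependent part of a whole-dataset memory
    estimate by the fraction of rows used for training."""
    variable_kb = full_estimate_kb - fixed_overhead_kb
    return fixed_overhead_kb + variable_kb * training_percent / 100.0

# With no fixed overhead, the 306147kb estimate would shrink to ~15307kb
# at a training percent of 5.
print(round(scale_estimate_kb(306_147, 5)))
```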

@droberts195
Contributor Author

After the fix of #1111 the estimate for a training percent of 5% on the Iowa liquor sales data dropped from 306147kb to 74672kb, a great improvement.

@droberts195
Contributor Author

With a training percent of 80%, the estimate is currently 273319kb and the actual is 13109339 bytes.
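Those two numbers still imply a sizeable overestimate; a quick check:

```python
# Remaining overestimate at a training percent of 80.
estimate_bytes = 273_319 * 1024   # current estimate: 273319kb
actual_bytes = 13_109_339         # actual peak usage in bytes

print(f"estimate is still {estimate_bytes / actual_bytes:.0f}x actual")  # about 21x
```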

@tveasey
Contributor

tveasey commented May 29, 2020

We've discussed this and we're going to work on calibrating the current worst case memory estimates based on a variety of different classification and regression runs.

@tveasey
Contributor

tveasey commented Jun 24, 2020

This was fixed in #1298.

@tveasey tveasey closed this as completed Jun 24, 2020