
[ML] Memory estimates way too high for very simple analyses #1106


Closed
droberts195 opened this issue Mar 31, 2020 · 5 comments

@droberts195
Contributor

Steps to reproduce:

  1. Import the Iowa liquor sales dataset using the file data visualizer (ping me if you need the file)
  2. Create a new regression data frame analytics job to analyze it: set the training percent to 5, set the dependent variable to Sale (Dollars), and exclude every variable except Bottle Volume (ml) and Store Number from the analysis. (So effectively we're predicting one number from two others on 5% of the 380000 rows, i.e. 19000 rows.)

[Screenshot: job configuration, 2020-03-31 at 13:05:05]

  3. Start the analysis and wait for it to finish.
  4. Look at the job details. The memory limit recommended by the UI was around 1.2GB. The actual memory required was less than 12MB.

[Screenshot: job details, 2020-03-31 at 13:09:49]

Part of the problem here is elastic/kibana#60496, because the memory estimate didn't get updated when I added the exclude fields. However, a considerable part of the problem is in the C++ estimation code. If I run the estimate in dev console using the final config it's still 25 times bigger than it needs to be:

POST _ml/data_frame/analytics/_explain
{
    "source": {
      "index": [
        "iowa"
      ],
      "query": {
        "match_all": {}
      }
    },
    "analysis": {
      "regression": {
        "dependent_variable": "Sale (Dollars)",
        "prediction_field_name": "Sale (Dollars)_prediction",
        "training_percent": 5
      }
    },
    "analyzed_fields": {
      "includes": [],
      "excludes": [
        "Address",
        "Bottles Sold",
        "Category",
        "Category Name",
        "City",
        "County",
        "County Boundaries of Iowa",
        "Iowa ZIP Code Tabulation Areas",
        "Item Description",
        "Item Number",
        "Pack",
        "State Bottle Cost",
        "State Bottle Retail",
        "Store Location",
        "Store Name",
        "Iowa Watersheds (HUC 10)",
        "Iowa Watershed Sub-Basins (HUC 08)",
        "County Number",
        "Invoice/Item Number",
        "US Counties",
        "Zip Code",
        "Vendor Name",
        "Vendor Number",
        "Volume Sold (Gallons)",
        "Volume Sold (Liters)"
      ]
    }  
}

returns:

{
  "field_selection" : [
    ... blah ...
  ],
  "memory_estimation" : {
    "expected_memory_without_disk" : "306147kb",
    "expected_memory_with_disk" : "306147kb"
  }
}

And from the second screenshot you can see actual was 12322863 bytes ~= 12034kb.

This is a big problem for Cloud trials where users don't have much memory to play with, and we refuse to run an analysis if its memory estimate won't fit onto the available machine.
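The overestimation factor quoted above can be sanity-checked with a quick calculation, using the 306147kb figure from the `_explain` response and the 12322863-byte actual usage from the job stats:

```python
# Compare the _explain memory estimate with the job's actual peak usage.
estimate_bytes = 306_147 * 1024   # "expected_memory_without_disk": 306147kb
actual_bytes = 12_322_863         # peak usage reported in the job details

ratio = estimate_bytes / actual_bytes
print(f"estimate is {ratio:.1f}x actual")  # roughly 25x, as stated above
```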

@tveasey
Contributor

tveasey commented Mar 31, 2020

This is partly a known issue: we need to communicate the training percentage to the memory estimation process, since this very significantly affects the actual memory usage.
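As a rough illustration of why this matters (this is a hypothetical sketch, not the actual C++ estimator; `scale_estimate_kb` and `fixed_overhead_kb` are made-up names): if the training cost dominates, an estimate made for the full dataset could be scaled down by the fraction of rows actually used for training.

```python
def scale_estimate_kb(full_estimate_kb: float, training_percent: float,
                      fixed_overhead_kb: float = 0.0) -> float:
    """Scale the training-dependent part of a whole-dataset memory
    estimate by the fraction of rows used for training."""
    variable_kb = full_estimate_kb - fixed_overhead_kb
    return fixed_overhead_kb + variable_kb * training_percent / 100.0

# With no fixed overhead, the 306147kb estimate would shrink to ~15307kb
# at a training percent of 5.
print(round(scale_estimate_kb(306_147, 5)))
```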

@droberts195
Contributor Author

After the fix of #1111 the estimate for a training percent of 5% on the Iowa liquor sales data dropped from 306147kb to 74672kb, a great improvement.

@droberts195
Contributor Author

With a training percent of 80%, the estimate is currently 273319kb and the actual is 13109339 bytes.
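Those two numbers still imply a sizeable overestimate; a quick check:

```python
# Remaining overestimate at a training percent of 80.
estimate_bytes = 273_319 * 1024   # current estimate: 273319kb
actual_bytes = 13_109_339         # actual peak usage in bytes

print(f"estimate is still {estimate_bytes / actual_bytes:.0f}x actual")  # about 21x
```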

@tveasey
Contributor

tveasey commented May 29, 2020

We've discussed this and we're going to work on calibrating the current worst case memory estimates based on a variety of different classification and regression runs.

@tveasey
Contributor

tveasey commented Jun 24, 2020

This was fixed in #1298.

@tveasey tveasey closed this as completed Jun 24, 2020