[5pt] Document LuaJIT getmetrics C and Lua API #1597

Closed
Buristan opened this issue Oct 13, 2020 · 7 comments · Fixed by #2280
Assignees
Labels
feature A new functionality server [area] Task relates to Tarantool's server (core) functionality

Comments

@Buristan

Buristan commented Oct 13, 2020

Root: next to https://www.tarantool.io/en/doc/latest/book/app_server/luajit_memprof/

We finally added C and Lua API for LuaJIT metrics (#5187).

API

The additional header <lmisclib.h> is introduced to extend the existing LuaJIT
C API with new interfaces. The first function provided via this header is the
following:

/* API for obtaining various platform metrics. */

LUAMISC_API void luaM_metrics(lua_State *L, struct luam_Metrics *metrics);

This function fills the structure pointed to by metrics with the corresponding
metrics related to the Lua state anchored to the given coroutine L.

The struct luam_Metrics has the following definition:

struct luam_Metrics {
  /*
  ** Number of strings being interned (i.e. the string with the
  ** same payload is found, so a new one is not created/allocated).
  */
  size_t strhash_hit;
  /* Total number of strings allocations during the platform lifetime. */
  size_t strhash_miss;

  /* Amount of allocated string objects. */
  size_t gc_strnum;
  /* Amount of allocated table objects. */
  size_t gc_tabnum;
  /* Amount of allocated udata objects. */
  size_t gc_udatanum;
  /* Amount of allocated cdata objects. */
  size_t gc_cdatanum;

  /* Memory currently allocated. */
  size_t gc_total;
  /* Total amount of freed memory. */
  size_t gc_freed;
  /* Total amount of allocated memory. */
  size_t gc_allocated;

  /* Count of incremental GC steps per state. */
  size_t gc_steps_pause;
  size_t gc_steps_propagate;
  size_t gc_steps_atomic;
  size_t gc_steps_sweepstring;
  size_t gc_steps_sweep;
  size_t gc_steps_finalize;

  /*
  ** Overall number of snap restores (amount of guard assertions
  ** leading to stopping trace executions).
  */
  size_t jit_snap_restore;
  /* Overall number of abort traces. */
  size_t jit_trace_abort;
  /* Total size of all allocated machine code areas. */
  size_t jit_mcode_size;
  /* Amount of JIT traces. */
  unsigned int jit_trace_num;
};

All metrics are collected throughout the platform uptime. These metrics
increase monotonically and can overflow:

  • strhash_hit
  • strhash_miss
  • gc_freed
  • gc_allocated
  • gc_steps_pause
  • gc_steps_propagate
  • gc_steps_atomic
  • gc_steps_sweepstring
  • gc_steps_sweep
  • gc_steps_finalize
  • jit_snap_restore
  • jit_trace_abort

They are meaningful only when compared with their values from a previous
luaM_metrics() call.
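
As a minimal sketch of such a comparison (the helper name counter_delta is hypothetical and not part of <lmisclib.h>), the difference between two readings of a monotonic size_t counter can be taken with plain unsigned subtraction. Unsigned arithmetic in C is modular, so the result stays correct even if the counter overflowed once between the two readings:

```c
#include <stddef.h>
#include <stdint.h>

/*
** Hypothetical helper: difference between two readings of a
** monotonically increasing size_t counter (e.g. gc_allocated).
** Unsigned subtraction wraps modulo SIZE_MAX + 1, so a single
** overflow between the readings does not corrupt the result.
*/
static size_t counter_delta(size_t prev, size_t curr)
{
  return curr - prev;
}
```

This does not help if the counter wraps more than once between two readings, so the sampling interval still matters.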

There is also a counterpart introduced for Lua space: misc.getmetrics().
This function is just a wrapper for luaM_metrics() that returns a Lua table
with the same metrics. All returned values are numbers cast to double, so
large values can lose precision.
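
To illustrate the precision loss, here is a small sketch that assumes a 64-bit counter (uint64_t stands in for size_t, and fits_in_double is a hypothetical name). A double has a 53-bit significand, so integers above 2^53 are not all representable:

```c
#include <stdint.h>

/*
** Hypothetical helper: returns 1 if v survives a round trip
** through double unchanged, 0 otherwise.
*/
static int fits_in_double(uint64_t v)
{
  double d = (double)v;
  /*
  ** Values close to UINT64_MAX round up to 2^64, which cannot be
  ** converted back to uint64_t, so reject them before the cast.
  */
  if (d >= 18446744073709551616.0)
    return 0;
  return (uint64_t)d == v;
}
```

With this helper, 2^53 round-trips exactly while 2^53 + 1 does not, which is the kind of rounding a counter read through misc.getmetrics() may undergo once it becomes that large.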

How to use

This section gives small examples of metrics usage.

For example, strhash_miss can be plotted to track allocations of new string
objects. Suppose we add code like this:

local function sharded_storage_func(storage_name, func_name)
    return 'sharded_storage.storages.' .. storage_name .. '.' .. func_name
end

An increase in the slope of the strhash_miss curve means that, after such
changes, more new strings are allocated at runtime. Of course, the slope of
strhash_miss should be less than the slope of strhash_hit.
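
One way to compare the two slopes is to turn the deltas between two readings into a hit ratio. This is only a sketch: strhash_hit_ratio is a hypothetical helper, and its arguments are assumed to be differences between two readings of strhash_hit and strhash_miss, not the raw counters:

```c
#include <stddef.h>

/*
** Hypothetical helper: share of string lookups that found an
** already interned string, computed from the growth of
** strhash_hit and strhash_miss between two readings.
** Returns 0.0 when there were no lookups at all.
*/
static double strhash_hit_ratio(size_t hit_delta, size_t miss_delta)
{
  /* Sum in double to avoid size_t overflow of the total. */
  double lookups = (double)hit_delta + (double)miss_delta;
  if (lookups == 0.0)
    return 0.0;
  return (double)hit_delta / lookups;
}
```

A ratio above 0.5 corresponds to strhash_hit growing faster than strhash_miss over the measured interval.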

The slopes of the gc_freed and gc_allocated curves can be used to analyze the
GC pressure of your application (less is better).

We can also check some hacky optimizations with these metrics. For example,
let's assume that we have this code snippet:

local old_metrics = misc.getmetrics()
local t = {}
for i = 1, 513 do
    t[i] = i
end
local new_metrics = misc.getmetrics()
local diff = new_metrics.gc_allocated - old_metrics.gc_allocated

diff equals 18879 after running this chunk.

But if we change the table initialization to

local table_new = require "table.new"
local old_metrics = misc.getmetrics()
local t = table_new(513,0)
for i = 1, 513 do
    t[i] = i
end
local new_metrics = misc.getmetrics()
local diff = new_metrics.gc_allocated - old_metrics.gc_allocated

diff is only 5895.

The slopes of the gc_steps_* curves can be used to track GC pressure too.
Over long observation periods the gc_steps_* metrics increase periodically;
for example, a longer period between gc_steps_atomic increments is better.
The additional number of gc_steps_propagate steps within one period can be
used to indirectly estimate the number of objects. These values also
correlate with the step multiplier of the GC: the number of incremental
steps can grow while each step processes only a small number of objects. So
these metrics should be considered together with the GC setup.

The gc_*num values are useful for detecting memory leaks: the total number of
these objects should not grow nonstop (you can also track gc_total for
this). Also, jit_mcode_size can be used to track the amount of memory
allocated for the machine code of traces.
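
A crude way to express "grows nonstop" in code is to check whether every sample in a series exceeds the previous one. The sketch below is only a heuristic under that assumption; grows_nonstop is a hypothetical name, and the samples are assumed to be successive readings of one gauge such as gc_total or gc_tabnum:

```c
#include <stddef.h>

/*
** Hypothetical heuristic: returns 1 if each of the n samples is
** strictly greater than the one before it, 0 otherwise.
** Fewer than two samples are never treated as growth.
*/
static int grows_nonstop(const size_t *samples, size_t n)
{
  size_t i;
  if (n < 2)
    return 0;
  for (i = 1; i < n; i++) {
    if (samples[i] <= samples[i - 1])
      return 0;
  }
  return 1;
}
```

A healthy application usually shows a sawtooth pattern as the GC reclaims objects, so a long series for which this check holds is a reason to look closer, not proof of a leak.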

The slope of the jit_trace_abort curve shows how often a trace could not be
compiled when an attempt was made (less is better).

jit_trace_num shows how many traces have been generated (usually more is
better).

And the last one, jit_snap_restore, can be used to estimate how often LuaJIT
stops trace execution. If the slope of this metric grows after you change
old code, it can mean a performance degradation.

Assume that we have code like this:

local function foo(i)
    return i <= 5 and i or tostring(i)
end
-- the minstitch option is needed to emulate non-stitching behaviour
jit.opt.start(0, "hotloop=2", "hotexit=2", "minstitch=15")

local sum = 0
local old_metrics = misc.getmetrics()
for i = 1, 10 do
    sum = sum + foo(i)
end
local new_metrics = misc.getmetrics()
local diff = new_metrics.jit_snap_restore - old_metrics.jit_snap_restore

diff equals 3 after this chunk of code (1 side exit at the loop end and 2
side exits to the interpreter before the trace gets hot and compiled).

Now suppose we change the foo function like this:

local function foo(i)
    -- math.fmod is not yet compiled!
    return i <= 5 and i or math.fmod(i, 11)
end

diff equals 6 after the same chunk of code (1 side exit at the loop end, 2
side exits to the interpreter before the trace gets hot and compiled, and 3
side exits from the root trace that could not be compiled).

@Buristan
Author

@filonenko-mikhail @sharonovd @mtrempoltsev @rosik @orchaton @Mons @msiomkin @vasiliy-t @olegrok @knazarov @Kasen
It would be great if you could add your ideas / examples of what problems these metrics can help to solve.

olegrok added a commit to tarantool/metrics that referenced this issue Oct 14, 2020
After tarantool/tarantool#5187 tarantool
could report luajit platform metrics.

This patch exports them as default metrics.
For detailed description of each metric see
tarantool/doc#1597.

Closes #127
vasiliy-t pushed a commit to tarantool/metrics that referenced this issue Oct 20, 2020
* add luajit platform metrics

After tarantool/tarantool#5187 tarantool
could report luajit platform metrics.

This patch exports them as default metrics.
For detailed description of each metric see
tarantool/doc#1597.

Closes #127
@NickVolynkin NickVolynkin added feature A new functionality server [area] Task relates to Tarantool's server (core) functionality labels Jul 2, 2021
@NickVolynkin NickVolynkin changed the title from "Document LuaJIT getmetrics C and Lua API" to "[0pt] Document LuaJIT getmetrics C and Lua API" Jul 2, 2021
@pgulutzan
Contributor

pgulutzan commented Jul 30, 2021

@Buristan:

The Tarantool manual already has a section on LuaJIT metrics
https://www.tarantool.io/en/doc/latest/book/monitoring/metrics_reference/#luajit-metrics
and I got confused, wondering: what is the difference between the set of items currently
in the manual, and the set of items proposed for the manual? So I made two tables.

CREATE TABLE lua_current (name STRING PRIMARY KEY);
INSERT INTO lua_current VALUES
('lj_jit_snap_restore'),
('lj_jit_trace_num'),
('lj_jit_trace_abort'),
('lj_jit_mcode_size'),
('lj_strhash_hit'),
('lj_strhash_miss'),
('lj_gc_steps_atomic'),
('lj_gc_steps_sweepstring'),
('lj_gc_steps_finalize'),
('lj_gc_steps_sweep'),
('lj_gc_steps_propagate'),
('lj_gc_steps_pause'),
('lj_gc_strnum'),
('lj_gc_tabnum'),
('lj_gc_cdatanum'),
('lj_gc_udatanum'),
('lj_gc_freed'),
('lj_gc_total'),
('lj_gc_allocated');
CREATE TABLE lua_propose (name STRING PRIMARY KEY);
INSERT INTO lua_propose VALUES
('gc_freed'),
('strhash_hit'),
('gc_steps_atomic'),
('strhash_miss'),
('gc_steps_sweepstring'),
('gc_strnum'),
('gc_tabnum'),
('gc_cdatanum'),
('jit_snap_restore'),
('gc_total'),
('gc_udatanum'),
('gc_steps_finalize'),
('gc_allocated'),
('jit_trace_num'),
('gc_steps_sweep'),
('jit_trace_abort'),
('jit_mcode_size'),
('gc_steps_propagate'),
('gc_steps_pause');
SELECT a.name AS lua_current_name, b.name AS lua_propose_name
FROM lua_current a LEFT JOIN lua_propose b
ON (SUBSTR(a.name,4,100) = b.name)
ORDER BY a.name;

Result:

+-------------------------+----------------------+
| LUA_CURRENT_NAME        | LUA_PROPOSE_NAME     |
+-------------------------+----------------------+
| lj_gc_allocated         | gc_allocated         |
| lj_gc_cdatanum          | gc_cdatanum          |
| lj_gc_freed             | gc_freed             |
| lj_gc_steps_atomic      | gc_steps_atomic      |
| lj_gc_steps_finalize    | gc_steps_finalize    |
| lj_gc_steps_pause       | gc_steps_pause       |
| lj_gc_steps_propagate   | gc_steps_propagate   |
| lj_gc_steps_sweep       | gc_steps_sweep       |
| lj_gc_steps_sweepstring | gc_steps_sweepstring |
| lj_gc_strnum            | gc_strnum            |
| lj_gc_tabnum            | gc_tabnum            |
| lj_gc_total             | gc_total             |
| lj_gc_udatanum          | gc_udatanum          |
| lj_jit_mcode_size       | jit_mcode_size       |
| lj_jit_snap_restore     | jit_snap_restore     |
| lj_jit_trace_abort      | jit_trace_abort      |
| lj_jit_trace_num        | jit_trace_num        |
| lj_strhash_hit          | strhash_hit          |
| lj_strhash_miss         | strhash_miss         |
+-------------------------+----------------------+

In other words: the current and proposed items are the same, except
the current ones begin with "lj_", the proposed ones do not.
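
The SUBSTR(a.name,4,100) join above simply drops the first three characters. The same mapping can be sketched as a small function that strips the prefix only when it is actually present (strip_lj_prefix is a hypothetical name used for illustration):

```c
#include <string.h>

/*
** Hypothetical helper: maps a tarantool/metrics style name such as
** "lj_gc_total" to the misc.getmetrics() field name "gc_total".
** Names without the "lj_" prefix are returned unchanged.
** The result points into the original string; nothing is allocated.
*/
static const char *strip_lj_prefix(const char *name)
{
  static const char prefix[] = "lj_";
  size_t len = sizeof(prefix) - 1;
  return strncmp(name, prefix, len) == 0 ? name + len : name;
}
```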

Of course, I can do what's requested for this issue, and make a
new section after Application server / LuaJIT memory profiler.
However, I surely must explain what the critical difference is,
so users will know when to use the proposed and not the current.
I see that with the proposal it's easier to get old/new comparisons
(what you're calling "incremental" readings), but is that all?

@igormunkin
Contributor

@pgulutzan, thanks for the investigation; I had never seen this page. I guess this is a reference to tarantool/metrics, so I still believe we need a separate page for the so-called "open source" platform metrics, and here are the reasons I see for this:

  1. There are very poor and even odd (consider strhash* comment) descriptions for these platform metrics, and I would like to see much more verbose explanation. E.g. it would be nice to provide a link to "Understanding SNAP" page in jit_snap_restore description.
  2. At the end of the manual section, it would be great to have FAQ for metrics usage, so users can refer to this section while troubleshooting (consider the memprof FAQ).
  3. There is not a word regarding Lua C API in the current doc page.

Considering everything above, I guess the better solution is creating a separate page for LuaJIT platform metrics with the structure similar to LuaJIT memory profiler page and link the new page with the existing one with the comment about the naming with "lj_" prefix.

@pgulutzan
Contributor

@igormunkin: Apparently you are right, the Tarantool manual's "LuaJIT metrics" section
https://www.tarantool.io/en/doc/latest/book/monitoring/metrics_reference/#luajit-metrics
has the same items as the tarantool/metrics "Metrics reference"
https://github.com/tarantool/metrics/blob/master/doc/monitoring/metrics_reference.rst
and, as we've seen, those are the same items that @Buristan is adding.
Perhaps @artembo who wrote issue#1414 "Add metrics module documentation as submodule"
#1414 added the current documentation.

@igormunkin
Contributor

igormunkin commented Jul 30, 2021

@pgulutzan, strictly speaking, they are not the same. First, the tarantool/metrics module is not provided within the Tarantool binary, but the metrics implemented and described by @Buristan are. The items mentioned in the "LuaJIT metrics" section are just proxies for the metrics to be added. But you're right, the values obtained via any of the interfaces are exactly the same.

@pgulutzan
Contributor

There is now a pull request issue #2280.
I suggested @Buristan and @igormunkin and @NickVolynkin as reviewers;
you may of course remove yourselves from the suggested-reviewers list there
if you think it is not appropriate.
I did not specify a version. I tested with 2.9, but perhaps I should have said so.
I hope I have fulfilled Igor Munkin's request to make it verbose, add more links, etc.
But I did not fulfill Igor Munkin's request for a FAQ, because I think, although I did not check, that there are no frequently-asked questions yet about this specific feature.

Re the earlier discussion (above) about the feature that is so similar
https://www.tarantool.io/en/doc/latest/book/monitoring/metrics_reference/#luajit-metrics
Originally I wrote:
"Note: Although value names are similar to value names in LuaJIT metrics, and the values are exactly the same, misc.getmetrics() is slightly easier because there is no need to ‘require’ the misc module."
... But then I threw it away because the .rst file related to those LuaJIT metrics has disappeared from the documentation sources.

@igormunkin
Contributor

@pgulutzan, thanks for the changes, they are verbose, clear and just marvelous! The metrics have been on board since the 2.6 series and have not changed except for a tiny bug in the gc_cdatanum arithmetic (see tarantool/tarantool#5820, fixed in 2.6.2), so 2.9 is totally fine for testing.

Re FAQ: you're right, we don't have this section for now, but now we know where to place it. Your and @Buristan examples are fine for the start, so I hope users will start using these platform metrics for troubleshooting, so we can extend the doc with the real world examples later.

@ainoneko ainoneko self-assigned this Oct 26, 2021
@patiencedaur patiencedaur changed the title from "[0pt] Document LuaJIT getmetrics C and Lua API" to "[5pt] Document LuaJIT getmetrics C and Lua API" Nov 17, 2021
patiencedaur added a commit that referenced this issue Nov 18, 2021
Fixes #1597 

Written by Peter Gulutzan, reviewed by Igor Munkin and Sergey Kaplun

Translated by ainoneko, translation reviewed by Patience Daur

Co-authored-by: Peter Gulutzan <[email protected]>
Co-authored-by: Igor Munkin <[email protected]>
Co-authored-by: ainoneko <[email protected]>
Co-authored-by: Patience Daur <[email protected]>
5 participants