
metrics of snmalloc #409

Open
SchrodingerZhu opened this issue Oct 24, 2021 · 12 comments

@SchrodingerZhu
Contributor

SchrodingerZhu commented Oct 24, 2021

Hi,
I am implementing snmalloc support for an analytical database engine. Everything works fine and the performance is really cool, but I have a problem creating proper statistics for snmalloc:

Basically, I want something like resident memory and (de)commit information. Details like the allocation size distribution can also be helpful, but they are not essential.

So I mimicked the way snmalloc prints out its stats and wrote some code:

    {
        // Aggregate the per-allocator statistics; set(name, value) below is the
        // engine's own metric-recording helper, not part of snmalloc.
        snmalloc::Stats stats;
        snmalloc::current_alloc_pool()->aggregate_stats(stats);

        using namespace snmalloc;

        size_t current = 0;
        size_t total = 0;
        size_t max = 0;
        // Per-size-class high-water mark for large allocations, kept across calls.
        static size_t large_alloc_max[NUM_LARGE_CLASSES]{0};

        // Per-sizeclass statistics for small/medium allocations.
        for (sizeclass_t i = 0; i < NUM_SIZECLASSES; i++)
        {
            if (stats.sizeclass[i].count.is_unused())
                continue;

            stats.sizeclass[i].addToRunningAverage();

            auto size = sizeclass_to_size(i);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_current", size), stats.sizeclass[i].count.current);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_max", size), stats.sizeclass[i].count.max);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_total", size), stats.sizeclass[i].count.used);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_average_slab_usage", size), stats.sizeclass[i].online_average);
            set(fmt::format("snmalloc.bucketed_stat_size_{}_average_wasted_space", size),
                (1.0 - stats.sizeclass[i].online_average) * stats.sizeclass[i].slab_count.max);
            current += stats.sizeclass[i].count.current * size;
            total += stats.sizeclass[i].count.used * size;
            max += stats.sizeclass[i].count.max * size;
        }

        // Statistics for large allocations.
        for (uint8_t i = 0; i < NUM_LARGE_CLASSES; i++)
        {
            if ((stats.large_push_count[i] == 0) && (stats.large_pop_count[i] == 0))
                continue;

            auto size = large_sizeclass_to_size(i);
            set(fmt::format("snmalloc.large_bucketed_stat_size_{}_push_count", size), stats.large_push_count[i]);
            set(fmt::format("snmalloc.large_bucketed_stat_size_{}_pop_count", size), stats.large_pop_count[i]);
            auto large_alloc = (stats.large_pop_count[i] - stats.large_push_count[i]) * size;
            large_alloc_max[i] = std::max(large_alloc_max[i], large_alloc);
            current += large_alloc;
            total += stats.large_push_count[i] * size;
            max += large_alloc_max[i];
        }

        set("snmalloc.global_stat_remote_freed", stats.remote_freed);
        set("snmalloc.global_stat_remote_posted", stats.remote_posted);
        set("snmalloc.global_stat_remote_received", stats.remote_received);
        set("snmalloc.global_stat_superslab_pop_count", stats.superslab_pop_count);
        set("snmalloc.global_stat_superslab_push_count", stats.superslab_push_count);
        set("snmalloc.global_stat_segment_count", stats.segment_count);
        set("snmalloc.global_stat_current_size", current);
        set("snmalloc.global_stat_total_size", total);
        set("snmalloc.global_stat_max_size", max);
    }

I don't know, but maybe the above approach would create too many entries in the summary?

Also, any suggestions on creating more concise asynchronous metrics for the allocator?
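
For reference, by "async metrics" I mean sampling the allocator from a background thread on a fixed interval instead of on the hot (de)allocation path. A minimal sketch, assuming a hypothetical collect_snmalloc_metrics() helper that wraps the sampling code above:

    #include <atomic>
    #include <chrono>
    #include <thread>

    // Hypothetical wrapper around the per-sizeclass sampling code above.
    void collect_snmalloc_metrics();

    // Polls allocator statistics on a background thread so the cost stays off
    // the allocation/deallocation path.
    class MetricsPoller
    {
    public:
      explicit MetricsPoller(std::chrono::seconds interval)
      : interval_(interval), worker_([this] { run(); })
      {}

      ~MetricsPoller()
      {
        stop_.store(true, std::memory_order_relaxed);
        worker_.join();  // may wait up to one interval for the sleep to finish
      }

    private:
      void run()
      {
        while (!stop_.load(std::memory_order_relaxed))
        {
          collect_snmalloc_metrics();
          std::this_thread::sleep_for(interval_);
        }
      }

      std::chrono::seconds interval_;
      std::atomic<bool> stop_{false};
      std::thread worker_;
    };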

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Oct 24, 2021

Another thing is that, in this situation, it is probably not a good idea to print out the statistics only after a thread exits.

@SchrodingerZhu
Contributor Author

[image]

Another interesting observation: if I enable stats, the dealloc routine gets dominated by the costly running-average calculation. (I cannot provide further stack traces since the product has not been released as open source yet; sorry about that.)

@mjp41
Member

mjp41 commented Oct 25, 2021

Those statistics are pretty heavyweight and were not designed for production; they are more for working out what snmalloc is doing wrong, and they have not really been maintained. There are very coarse statistics available from:

    void get_malloc_info_v1(malloc_info_v1* stats)
    {
      auto next_memory_usage = default_memory_provider().memory_usage();
      stats->current_memory_usage = next_memory_usage.first;
      stats->peak_memory_usage = next_memory_usage.second;
    }

This might be sufficient for what you are after. This is tracked all the time and is very cheap. It was considered the bare minimum for some other services.
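
For example, wiring these counters into the kind of set(name, value) metric helper used in the first snippet might look roughly like this (a sketch only; the header name is assumed):

    // Sketch: assumes the malloc_info_v1 / get_malloc_info_v1 API quoted above.
    #include "malloc-extensions.h"  // header providing get_malloc_info_v1 (name assumed)
    #include <cstddef>

    // Engine-side metric-recording helper from the first snippet (hypothetical).
    void set(const char* name, size_t value);

    void export_coarse_snmalloc_metrics()
    {
      malloc_info_v1 info;
      get_malloc_info_v1(&info);

      // Both counters are maintained continuously by the allocator, so polling
      // them is cheap compared to the per-sizeclass statistics.
      set("snmalloc.current_memory_usage", info.current_memory_usage);
      set("snmalloc.peak_memory_usage", info.peak_memory_usage);
    }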

With the rewrite on the snmalloc2 branch, I am about to investigate statistics tracking. So if you have requirements, I will try to work them into what I build.

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Oct 25, 2021

[image]

Let me provide some records from my side. This is from an analytical database engine (a single node in this case). It took up almost all the system memory on Linux (as snmalloc won't madvise it back).
The problem is that the server itself uses mmap/mremap for large allocations to get a potential speedup from OS paging, so I am very concerned about running with this decommit behaviour in a production environment.

@mjp41
Member

mjp41 commented Oct 25, 2021

@SchrodingerZhu are you able to try #404 for your use case? It is getting pretty stable now, and should address your concern about holding on to OS memory.

What is the green line showing in the graph? RSS or Virtual memory usage?

@SchrodingerZhu
Contributor Author

According to the name of the metric, it should be RSS. htop also shows my program's memory usage as similar to the green line.

@SchrodingerZhu changed the title from "Implement async metric for snmalloc" to "metrics of snmalloc" on Oct 25, 2021
@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Oct 25, 2021

Since all of my work is currently experimental, I would like to give snmalloc 2 and #404 a try. I can also report back the changes in performance and in the metrics.

Thanks for the suggestions! The results above were still on snmalloc 1, and there was an improvement of tens of seconds on some TPC-H workloads when switching from jemalloc to snmalloc, which really astonished me. Let's see what we can get with snmalloc 2.

@SchrodingerZhu
Contributor Author

I believe #404 is working, since we can now see drops in the RSS curve.

However, in this case:
[image]

As you can see, after some peaks in the memory curve (it tried to acquire more than 169 GiB!), the stats suddenly went to zero with snmalloc 2. This means the engine was killed for OOM. Ouch, this is bad; with snmalloc 1, even though the space is not decommitted, I didn't experience OOMs.

The performance degradation of snmalloc 2 was still there: for the successful trials, I could see a 100% slowdown (from 30s to 1min) for some particular queries. I may provide some flamegraphs of the snmalloc stacks when they are ready.

@SchrodingerZhu
Contributor Author

SchrodingerZhu commented Oct 26, 2021

[image]
[image]

Oops, since I was running this on kernel 3.10, I guess madvise with MADV_DONTNEED was much heavier than I ever expected.

I think it is madvise that took all the extra running time (up to 30s for that query) in this case.
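
As a sanity check, a standalone measurement of one large madvise(MADV_DONTNEED) on the running kernel (not snmalloc code, just an illustration) could look like this:

    #include <chrono>
    #include <cstdio>
    #include <cstring>
    #include <sys/mman.h>

    int main()
    {
      constexpr size_t size = size_t(1) << 30;  // 1 GiB
      void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED)
        return 1;

      // Touch every page so the range is actually committed.
      std::memset(p, 1, size);

      auto start = std::chrono::steady_clock::now();
      madvise(p, size, MADV_DONTNEED);
      auto end = std::chrono::steady_clock::now();

      std::printf(
        "madvise(MADV_DONTNEED) on 1 GiB took %lld us\n",
        static_cast<long long>(
          std::chrono::duration_cast<std::chrono::microseconds>(end - start)
            .count()));

      munmap(p, size);
      return 0;
    }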

@mjp41
Member

mjp41 commented Oct 26, 2021

I am going to look into consolidating calls to madvise, which will hopefully reduce this cost.
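
For illustration, the general idea of consolidation (a sketch, not snmalloc's actual implementation) is to merge adjacent or overlapping pending ranges and issue one madvise per merged range:

    #include <algorithm>
    #include <cstdint>
    #include <sys/mman.h>
    #include <vector>

    struct Range
    {
      uintptr_t start;
      size_t length;
    };

    void decommit_consolidated(std::vector<Range> ranges)
    {
      std::sort(ranges.begin(), ranges.end(),
                [](const Range& a, const Range& b) { return a.start < b.start; });

      size_t i = 0;
      while (i < ranges.size())
      {
        uintptr_t start = ranges[i].start;
        uintptr_t end = start + ranges[i].length;

        // Fold in any ranges that touch or overlap the current one.
        while (i + 1 < ranges.size() && ranges[i + 1].start <= end)
        {
          end = std::max(end, ranges[i + 1].start + ranges[i + 1].length);
          ++i;
        }

        // One syscall for the whole merged range instead of one per fragment.
        madvise(reinterpret_cast<void*>(start), end - start, MADV_DONTNEED);
        ++i;
      }
    }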

So did it work in terms of reducing the memory usage, or did it regress the memory usage and hit OOM? It wasn't clear from your message.

@SchrodingerZhu
Contributor Author

  • I can see the memory usage decreasing now, so madvise is working.
  • Even with that decrease, I got an OOM regression compared with snmalloc 1.

@mjp41
Member

mjp41 commented Mar 16, 2022

@SchrodingerZhu would you be able to run this experiment again with the latest main branch? I have done a lot of work on bringing down the footprint; most examples are very close to snmalloc 1 now, so I would be interested to know whether I have fixed this.
