-
Notifications
You must be signed in to change notification settings - Fork 7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ESP-WIFI-MESH] memory leak on IDF>=5.1.4 (IDFGH-12810) #13786
Comments
@habablaxe |
I haven’t tested it on a mesh example ((
I'm using MDF(components mwifi and mupgrade only) and my code works with IDF 4.4.5, 5.0.3, 5.1, 5.1.1, 5.1.2. with uptime of more than several months I don’t think the problem is in my code, it doesn’t do anything (start mesh and debug timer) or in MDF (it’s just a wrapper), because the memory is completely returned when a node disconnects/connects |
just tested on 5.1.3, same problem:
|
I am currently observing a similar issue with our FW after updating to In our case, we observe the following:
I tried to locate where the allocation is done using
FYI @KonssnoK |
@zhangyanjiaoesp could you tell us when the log |
I'm currently using the esp-mesh ip_internal_network example on ESP-IDF 5.2 and I actually have observed this same behaviour at the non root nodes from the mesh. I'm monitoring the heap memory and using wifi_statis_dump(0xffff) as well, the nodes oftenly leaks memory, don't know exactly why. It seems that from time to time this leaked memory, is recovered, until it is not recovered anymore, then it keeps dropping until this warning "wifi: m f null" appears without signaling wifi disconnection, which is very strange, especially because it kept publishing on MQTT for a while after that (about 1 hour or even more) and after this warning the heap memory kept very constant without recoverying or dropping. |
When the null frame malloc fail, it will return |
I also tried to reproduce with the
exactly describes what I also observed. Based on the logs, my assumption would be that the allocation is somehow related to the moment where this log In the below log we also see multiple allocations of 128 bytes that seem to get released after some time all together, however, sometimes this releasing mechanism seems not to work (and then we end up seeing the
|
thanks, we wait for updates then :) |
@cbaechler @KonssnoK @habablaxe I have tested with the |
Thanks for your comment.
Does the timing match with what is reported in this ticket? Does the scanning happen every 9-10 seconds?
I understand deallocation happens after some time based when the switch process is complete. For sure this works in all nominal cases. Still, do you see any possible chance that the
I think the I did not yet reproduce However, it seems @Cirobar has seen the |
@cbaechler The scan interval has a short timer and a long timer, the short timer is random between 4.5s - 12.9s, and the long timer is random between 5min ~8min. Every three short intervals is followed by a long interval. |
@zhangyanjiaoesp would have a way to do a diff between mesh implementation in 5.1.2 and 5.1.4 ? |
Well, I've managed to retrieve a captured log, but unfortunately at the time I wasn't using wifi_statis_dump() or heap tracing. Moreover, there is additional tasks running besides the ip_internal_network example tasks, none of them causes any kind of leakage for sure. A cut of this log is attached below, since it is very extense. Anyway, the leakage can be seen as the tasks are printing the current heap memory. Also, as I've said before, the "wifi: m f null" warning happens often (not on every run of the app) and sometimes when it does actually happen, the application is not interrupted. One thing I've noticed is that when there is a very stable router or maybe the components are closer to the router, not sure yet of each one, the "wifi: m f null" may not happen at all. |
@KonssnoK I will check the difference between 5.1.2 and 5.1.4.
|
Hi @zhangyanjiaoesp, these are the changes I applied to diff --git a/examples/mesh/ip_internal_network/main/mesh_main.c b/examples/mesh/ip_internal_network/main/mesh_main.c
index efa4f185a8..bee1873031 100644
--- a/examples/mesh/ip_internal_network/main/mesh_main.c
+++ b/examples/mesh/ip_internal_network/main/mesh_main.c
@@ -17,6 +17,8 @@
#include "mesh_netif.h"
#include "driver/gpio.h"
#include "freertos/semphr.h"
+#include "esp_timer.h"
+#include "esp_debug_helpers.h"
/*******************************************************
* Macros
@@ -33,7 +35,7 @@
* Constants
*******************************************************/
static const char *MESH_TAG = "mesh_main";
-static const uint8_t MESH_ID[6] = { 0x77, 0x77, 0x77, 0x77, 0x77, 0x76};
+static const uint8_t MESH_ID[6] = { 0xCB, 0xCB, 0xCB, 0xCB, 0xCB, 0x75};
/*******************************************************
* Variable Definitions
@@ -46,6 +48,8 @@ static mesh_addr_t s_route_table[CONFIG_MESH_ROUTE_TABLE_SIZE];
static int s_route_table_size = 0;
static SemaphoreHandle_t s_route_table_lock = NULL;
static uint8_t s_mesh_tx_payload[CONFIG_MESH_ROUTE_TABLE_SIZE*6+1];
+static bool enable_hook = false;
+static esp_timer_handle_t periodic_timer;
/*******************************************************
@@ -148,6 +152,9 @@ void esp_mesh_mqtt_task(void *arg)
mesh_data_t data;
esp_err_t err;
mqtt_app_start();
+
+ enable_hook = true;
+
while (is_running) {
asprintf(&print, "layer:%d IP:" IPSTR, esp_mesh_get_layer(), IP2STR(&s_current_ip));
ESP_LOGV(MESH_TAG, "Tried to publish %s", print);
@@ -381,6 +388,21 @@ void ip_event_handler(void *arg, esp_event_base_t event_base,
esp_mesh_comm_mqtt_task_start();
}
+// called 1x per second
+static void on_timer(void* arg)
+{
+ static int count_heap_log = 0;
+ static int last_free_internal_heap = 0;
+ int free_internal_heap = esp_get_free_internal_heap_size();
+
+ if ((last_free_internal_heap != free_internal_heap) || (count_heap_log > 50)) {
+ ESP_LOGI("heap", "free_internal_heap: %d bytes, delta: %d bytes", free_internal_heap, free_internal_heap - last_free_internal_heap);
+ last_free_internal_heap = free_internal_heap;
+ count_heap_log = 0;
+ } else {
+ count_heap_log++;
+ }
+}
void app_main(void)
{
@@ -425,4 +447,36 @@ void app_main(void)
ESP_ERROR_CHECK(esp_mesh_start());
ESP_LOGI(MESH_TAG, "mesh starts successfully, heap:%" PRId32 ", %s", esp_get_free_heap_size(),
esp_mesh_is_root_fixed() ? "root fixed" : "root not fixed");
+
+ ESP_LOGI("heap", "free_internal_heap: %ld bytes before drain loop", esp_get_free_internal_heap_size());
+ while (esp_get_free_internal_heap_size() > 70000) {
+ void* mem = heap_caps_malloc(2048, MALLOC_CAP_INTERNAL);
+ if (!mem) {
+ ESP_LOGE(MESH_TAG, "Failed to alloc heap");
+ break;
+ }
+ }
+ ESP_LOGI("heap", "free_internal_heap: %ld bytes after drain loop", esp_get_free_internal_heap_size());
+
+ /* create timer to track heap */
+ const esp_timer_create_args_t periodic_timer_args = {
+ .callback = &on_timer,
+ .name = "periodic",
+ };
+
+ ESP_ERROR_CHECK(esp_timer_create(&periodic_timer_args, &periodic_timer));
+ ESP_ERROR_CHECK(esp_timer_start_periodic(periodic_timer, 1000000));
+}
+
+void esp_heap_trace_alloc_hook(void* ptr, size_t size, uint32_t caps)
+{
+ if (enable_hook) {
+ if ((caps & MALLOC_CAP_INTERNAL) != 0) {
+ if ((size == 124) || (size == 128)) {
+ ESP_EARLY_LOGW("alloc", "0x%08x allocated %d bytes", ptr, size);
+ // ESP_LOGW("alloc", "0x%08x allocated %d bytes", ptr, size);
+ esp_backtrace_print(5);
+ }
+ }
+ }
} Changes in sdkconfig are as follows, I enabled SPIRAM and configured allocations for wifi and mbedtls to happen on the external RAM in order to reduce allocations in the internal RAM:
Thank you! |
@cbaechler @KonssnoK @Cirobar @habablaxe The log The log The scanning time is 21038-19478 = 1560ms. Because the number of memory can be applied for And there is no different between 5.1.2 and 5.1.4 in handling this. I (18338) mesh_main: Received Routing table [0] 84:f7:03:80:01:88
I (18338) mesh_main: Received Routing table [1] 84:f7:03:80:01:40
I (18508) mesh_main: Tried to publish layer:2 IP:10.0.0.2
I (18518) mesh_mqtt: sent publish returned msg_id=26502
I (18798) mesh_mqtt: MQTT_EVENT_PUBLISHED, msg_id=26502
I (19448) mesh: 5157<active>parent layer:1(node), channel:5, rssi:-8, assoc:0(cnx rssi threshold:-120)my_assoc:0
>>>pm_off_channel,1317
>>>pm_send_nullfunc,997
>>>ieee80211_pm_tx_null_process,5444
>>>ieee80211_encap_null_data,5373
>>>esf_buf_alloc,274, type = 3, len = 24
>>>esf_buf_alloc_dynamic,134, type = 3, all = 124(96+24+0+4)
W (19478) alloc: 0x3fceca50 allocated 124 bytes
Backtrace: 0x403787D7:0x3FCB4F50 0x40376545:0x3FCB4F70 0x403765A5:0x3FCB4F90 0x40377295:0x3FCB4FB0 0x403773A9:0x3FCB5040 0x4037A201:0x3FCB5060 0x420BF316:0x3FCB5080 0x40385D0A:0x3FCB50C0 0x400398B5:0x3FCB50F0 |<-CORRUPTED
0x403787d7: esp_heap_trace_alloc_hook at /home/zhangyanjiao/work/code/idf1/examples/mesh/ip_internal_network/main/mesh_main.c:480 (discriminator 3)
0x40376545: heap_caps_malloc_base at /home/zhangyanjiao/work/code/idf1/components/heap/heap_caps.c:178 (discriminator 1)
0x403765a5: heap_caps_malloc at /home/zhangyanjiao/work/code/idf1/components/heap/heap_caps.c:197
0x40377295: trace_malloc at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:95
0x403773a9: __wrap_heap_caps_malloc at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:184
0x4037a201: esp_coex_common_malloc_internal_wrapper at /home/zhangyanjiao/work/code/idf1/components/esp_coex/esp32s3/esp_coex_adapter.c:128
0x420bf316: esf_buf_alloc_dynamic at ??:?
0x40385d0a: esf_buf_alloc at ??:?
0x400398b5: ic_ebuf_alloc in ROM
W (19658) free: 0x3fceca50 free
Backtrace: 0x40378819:0x3FCB51D0 0x40376792:0x3FCB51F0 0x403772DB:0x3FCB5210 0x4037735D:0x3FCB5260 0x40385739:0x3FCB5280 0x420BFB61:0x3FCB52A0 0x40385381:0x3FCB52C0 0x403808D5:0x3FCB52F0
0x40378819: esp_heap_trace_free_hook at /home/zhangyanjiao/work/code/idf1/examples/mesh/ip_internal_network/main/mesh_main.c:492 (discriminator 3)
0x40376792: heap_caps_free at /home/zhangyanjiao/work/code/idf1/components/heap/heap_caps.c:390 (discriminator 1)
0x403772db: trace_free at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:119
0x4037735d: __wrap_free at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:164
0x40385739: esf_buf_recycle at ??:?
0x420bfb61: ppProcTxDone at ??:?
0x40385381: ppTask at ??:?
0x403808d5: vPortTaskWrapper at /home/zhangyanjiao/work/code/idf1/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
I (19978) heap: free_internal_heap: 42411 bytes, delta: -36 bytes
I (20518) mesh_main: Tried to publish layer:2 IP:10.0.0.2
I (20518) mesh_main: Received Routing table [0] 84:f7:03:80:01:88
I (20518) mesh_main: Received Routing table [1] 84:f7:03:80:01:40
I (20518) mesh_mqtt: sent publish returned msg_id=2430
I (20788) mesh_mqtt: MQTT_EVENT_PUBLISHED, msg_id=2430
>>>pm_send_nullfunc,997
>>>ieee80211_pm_tx_null_process,5444
>>>ieee80211_encap_null_data,5373
>>>esf_buf_alloc,274, type = 3, len = 24
>>>esf_buf_alloc_dynamic,134, type = 3, all = 124(96+24+0+4)
W (21008) alloc: 0x3fceca50 allocated 124 bytes
Backtrace: 0x403787D7:0x3FCB4F30 0x40376545:0x3FCB4F50 0x403765A5:0x3FCB4F70 0x40377295:0x3FCB4F90 0x403773A9:0x3FCB5020 0x4037A201:0x3FCB5040 0x420BF316:0x3FCB5060 0x40385D0A:0x3FCB50A0 0x400398B5:0x3FCB50D0 |<-CORRUPTED
0x403787d7: esp_heap_trace_alloc_hook at /home/zhangyanjiao/work/code/idf1/examples/mesh/ip_internal_network/main/mesh_main.c:480 (discriminator 3)
0x40376545: heap_caps_malloc_base at /home/zhangyanjiao/work/code/idf1/components/heap/heap_caps.c:178 (discriminator 1)
0x403765a5: heap_caps_malloc at /home/zhangyanjiao/work/code/idf1/components/heap/heap_caps.c:197
0x40377295: trace_malloc at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:95
0x403773a9: __wrap_heap_caps_malloc at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:184
0x4037a201: esp_coex_common_malloc_internal_wrapper at /home/zhangyanjiao/work/code/idf1/components/esp_coex/esp32s3/esp_coex_adapter.c:128
0x420bf316: esf_buf_alloc_dynamic at ??:?
0x40385d0a: esf_buf_alloc at ??:?
0x400398b5: ic_ebuf_alloc in ROM
I (21038) mesh: 5948<scan>parent layer:1, rssi:-8, assoc:1(cnx rssi threshold:-120)
W (21038) free: 0x3fceca50 free
Backtrace: 0x40378819:0x3FCB51D0 0x40376792:0x3FCB51F0 0x403772DB:0x3FCB5210 0x4037735D:0x3FCB5260 0x40385739:0x3FCB5280 0x420BFB61:0x3FCB52A0 0x40385381:0x3FCB52C0 0x403808D5:0x3FCB52F0
0x40378819: esp_heap_trace_free_hook at /home/zhangyanjiao/work/code/idf1/examples/mesh/ip_internal_network/main/mesh_main.c:492 (discriminator 3)
0x40376792: heap_caps_free at /home/zhangyanjiao/work/code/idf1/components/heap/heap_caps.c:390 (discriminator 1)
0x403772db: trace_free at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:119
0x4037735d: __wrap_free at /home/zhangyanjiao/work/code/idf1/components/heap/include/heap_trace.inc:164
0x40385739: esf_buf_recycle at ??:?
0x420bfb61: ppProcTxDone at ??:?
0x40385381: ppTask at ??:?
0x403808d5: vPortTaskWrapper at /home/zhangyanjiao/work/code/idf1/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:162
I (21068) mesh: [SCAN][ch:5]AP:1, other(ID:0, RD:0), MAP:1, idle:0, candidate:1, root:1, topMAP:0[c:2,i:2][ac:84:c6:7e:31:2b]<weak>
I (21078) mesh: 7370[weak]try rssi_threshold:-120, backoff times:0, max:5<-78,-82,-85>
I (21088) mesh: 716[monitor]no change, parent:84:f7:03:80:01:89, rssi:-8
I (21088) mesh: 2004<arm>parent monitor, my layer:2(cap:6)(node), interval:7359ms, retries:2<>
I (22528) mesh_main: Tried to publish layer:2 IP:10.0.0.2
I (22548) mesh_main: Received Routing table [0] 84:f7:03:80:01:88
I (22548) mesh_main: Received Routing table [1] 84:f7:03:80:01:40
I (22578) mesh_mqtt: sent publish returned msg_id=16618
I (22848) mesh_mqtt: MQTT_EVENT_PUBLISHED, msg_id=16618
I (24568) mesh_main: Received Routing table [0] 84:f7:03:80:01:88
I (24568) mesh_main: Received Routing table [1] 84:f7:03:80:01:40
I (24578) mesh_main: Tried to publish layer:2 IP:10.0.0.2
I (24588) mesh_mqtt: sent publish returned msg_id=56594 |
hello @zhangyanjiaoesp , thanks for the answer. |
@zhangyanjiaoesp i'm using the same code that @cbaechler gave. you said that such allocations should be freed as soon as the packet is sent the second time, but in my logs i do not see such thing.
|
@KonssnoK you need to add a static uint32_t *alloc_ptr = NULL;
void esp_heap_trace_alloc_hook(void* ptr, size_t size, uint32_t caps)
{
if (enable_hook) {
if ((caps & MALLOC_CAP_INTERNAL) != 0) {
if ((size == 124) || (size == 128)) {
ESP_EARLY_LOGW("alloc", "0x%08x allocated %d bytes", ptr, size);
// ESP_LOGW("alloc", "0x%08x allocated %d bytes", ptr, size);
alloc_ptr = ptr;
esp_backtrace_print(15);
}
}
}
}
void esp_heap_trace_free_hook(void* ptr)
{
if (enable_hook) {
if (alloc_ptr == ptr) {
ESP_EARLY_LOGW("free", "0x%08x free", ptr);
alloc_ptr = NULL;
esp_backtrace_print(15);
}
}
} |
@zhangyanjiaoesp i do have the free hook, it's just not called. the log above starts from 49+ kB of free memory, then the 128B start consuming the memory, until at 42kB of heap 6k are released
the thing is you have at least 3785539 - 1987539 = 180000 (ms?) between them
6912 = 128*54 |
@zhangyanjiaoesp could you confirm that you also see this delay in the resources release? What is it related to? |
@KonssnoK I didn't see so long delay on my side, it may related to the test environment, do you have many WiFi packets in your testing environment?And it may related to the busy state of the wifi task (When the data is sent out, hardware will post event to wifi task, wifi task also need to handle TX/RX event). You can capture packets to see if the NULL data has been sent out. |
@KonssnoK which IDF commit do you use? we want to let mesh not send NULL data during scan, and I want to provide a wifi lib for you to test. |
Current tests are based on commit f8319f0. Today I tried to change the test environment, but I can still reproduce. I used the phone as a WIFI hotspot with only those two devkits connected to it. Also, we are able to reproduce the behavior in two geographically distinct locations (multiple 100 km apart), for this reason I think it is unrelated to the environment. |
I let the system run for ~2 hours to collect some data. The following plot shows the memory consumption over time: We can observe the following:
For reference, I also attach the raw monitor output: Also here, the setup is built with 2x ESP32-S3 DevKit-C connected to the phone hotspot (no other devices connected to the same access point). |
@cbaechler @KonssnoK please test using the following wifi lib. wifi_lib_esp32s3_5.2.zip |
Hi @zhangyanjiaoesp , i think we'll be able to test tomorrow. We'll update asap. |
@zhangyanjiaoesp thanks for the wifi lib. I gave it a quick try and was not able to reproduce the issue up to now. Actually, I do no longer see any of the 124 bytes allocations, thus I assume you are allocating this memory statically? Can you describe in more detail what the changes in the lib are? Thank you! |
@cbaechler We cancelled the sending of the null data. |
so we can confirm that the temporary memory consumption issue is created by that data 🤔 |
@KonssnoK I think the debug log in #13786 (comment) has confirm the 124 bytes allocation is the NULL data. And after modifying the wifi code, the 124 bytes allocation on longer be seen (#13786 (comment)). Is this evidence not sufficient enough? |
yes sure it's enough to know that is the allocation. My question was more about : what do we do next? :) Thanks! |
@KonssnoK The NULL data is not necessary for mesh, and we will remove it from the main branch and other release branches. |
@zhangyanjiaoesp ok thanks! |
will this already be fixed in release 5.2.2? |
@habablaxe The fix will be merged in release/v5.2, the stable version should be v5.2.3. |
@zhangyanjiaoesp |
@AxelLin The v5.1.2 also will send the null data, it's not a recent addition. |
I don't get it. |
@AxelLin @habablaxe |
So you just confirmed the memory is allocated by sending NULL data but still don't know why the memory is not freed in time in #13786 (comment). (It's probably ok if null data is not necessary for MESH. But I feel you didn't figure out the root cause for free memory in time.) |
this was the reason why I opened this issue. |
i think the point is @zhangyanjiaoesp cannot reproduce this issue in his environment (for any versions). |
When the scan starts, if the device switches to another channel, it will send NULL data to the AP. But in this mesh case, the device only scan on the current channel, so NULL data is not necessary. So my fix is to stop sending NULL data. It can be seen as a workaround. |
Answers checklist.
General issue report
MESH:
1 fixed root and 2 nodes.
code doesn't do anything, it just starts the mesh
if I compile code on IDF 5.1.2 - there are no problems and no memory leaks!
if I compile code on IDF 5.1.4, 5.2 or 5.2.1 - a problem appears, on nodes a memory leak occurs approximately every 9 seconds:
on node 1(ESP32) - 124 bytes
on node 2(ESP32S3) - 128 bytes
1 sec timer for debug log:
log for node 1:
log32-1.txt
after some time the memory returned (line 68) and the leak started again:
log32-2.txt
log without dump for node 2:
(S3 with SPIRAM), after a while I get the error:
W (1097142) wifi:m f null
logS3-1.txt
if at any time you turn off/on the power on the root(nodes are disconnected), then all memory returns on the all nodes:
logS3-2.txt
and the leakage process starts all over again.
sdkconfig32.txt
sdkconfigS3.txt
The text was updated successfully, but these errors were encountered: