sphenix-tracking-l AT lists.bnl.gov
Subject: sPHENIX tracking discussion
List archive
Re: [Sphenix-tracking-l] [EXTERNAL] Re: memory plots for 10k events
- From: Hugo Pereira Da Costa <hugo.pereira-da-costa AT cea.fr>
- To: "Osborn, Joe" <osbornjd AT ornl.gov>, Anthony Frawley <afrawley AT fsu.edu>, "sphenix-tracking-l AT lists.bnl.gov" <sphenix-tracking-l AT lists.bnl.gov>
- Subject: Re: [Sphenix-tracking-l] [EXTERNAL] Re: memory plots for 10k events
- Date: Fri, 14 Jan 2022 11:22:14 -0700
Hi Hugo,
This begs the question of what is the circumstance when two clusters with the same key are created? Why does this happen? I can try to see if I can reproduce this behavior…
Agreed.
Looking at the code, it is not obvious how this can happen,
because cluster index is always incremented when calling the
parent method, unless if there is an overflow in the clusterkey
mapping, (if you have more clusters in a given hitset than the max
number of bits assigned to the cluster index).
It is still worth adding the check though because it is a local
logical loophole.
Joe
From:
sPHENIX-tracking-l
<sphenix-tracking-l-bounces AT lists.bnl.gov> on behalf
of Hugo Pereira Da Costa via sPHENIX-tracking-l
<sphenix-tracking-l AT lists.bnl.gov>
Date: Friday, January 14, 2022 at 1:02 PM
To: Anthony Frawley <afrawley AT fsu.edu>,
sphenix-tracking-l AT lists.bnl.gov
<sphenix-tracking-l AT lists.bnl.gov>
Subject: [EXTERNAL] Re: [Sphenix-tracking-l]
memory plots for 10k events
On 1/14/22 10:40, Anthony Frawley wrote:
Hi Hugo,
The line:
if(my_data.clusterlist) my_data.clusterlist->insert(std::make_pair(ckey, clus.release()));
is just checking to see if the cluster map exists before trying to insert the cluster. Still, this raises the question of why this check is needed, since not finding a cluster map on the node tree produces a fatal error at the beginning of the process_event. Perhaps there is some subtlety of the threaded application?
Hi Tony,
Sorry for the confusion:
- the part that might lead to a leak is not "if(my_data.clusterlist)" (clusterlist = nullptr never happens, and if it were, clus.release would not be called and the unique_ptr would delete the cluster properly)
- The part that might lead to a leak is: my_data.clusterlist->insert(std::make_pair(ckey, clus.release()));
The way maps work, is that if ckey is already in the map, the pair in _not_ inserted. however, clus.release() is still called, and thus the unique_ptr does not delete the cluster even though it is not inserted in the map. Is this more clear ?
What we should do, is
- check that the insertion really happen (this can be done by checking the output of ::insert, namely: if( !my_data.clusterlist->insert(...).second ) { do_something; }, as documented here: https://en.cppreference.com/w/cpp/container/map/insert
- complain if this is not the case (because to my understanding this should not happen), and delete the cluster nonetheless.
This might need a bit of logic change (because once you have called clus.release you cannot access the cluster pointed to by the unique_ptr any more.
I can provide a patch.
Hugo
In any case, if this check fails there should be:
a) An error message
b) The pointer should be released
Tony
From: sPHENIX-tracking-l <sphenix-tracking-l-bounces AT lists.bnl.gov> on behalf of Hugo Pereira Da Costa via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov>
Sent: Friday, January 14, 2022 11:42 AM
To: sphenix-tracking-l AT lists.bnl.gov <sphenix-tracking-l AT lists.bnl.gov>
Subject: Re: [Sphenix-tracking-l] memory plots for 10k events
On 1/14/22 09:31, pinkenburg via sPHENIX-tracking-l wrote:
Hi Tony, Joe,
it's time to look at valgrind again, I haven't hunted memory leaks in a long time.
valgrind from jenkins points to an event by event leak in the TpcClusterizer.cc:
https://web.sdcc.bnl.gov/jenkins-sphenix/job/sPHENIX/job/test-default-detector-valgrind-pipeline/1180/valgrindResult/pid=41720,0x1b3fbd/
I don't understand this - uniq_ptrs should not leak memory but then there is root :) Given the number of tpc clusters we have in hijing events this might be a substantial part of the lost memory.
I think there is a similar warning coming from GenFit - but I assume once GenFit is ditched, this will go away by itself.
In terms of the abort event - these jobs run only on the new hardware since the files are in lustre. I hope by the end of today we can run on other nodes using minio to read those files. But over the weekend I'll just run one with verbosity enabled to print out the return codes. That should tell us the culprit and then we can skip to the event which triggers it for debugging (if the reason is not immediately apparent).
ChrisHi
I looked at the code pointed to warning above.
It might be a genuine leak
What happens is that at line 465 of TpcClusterizer, we call
if(my_data.clusterlist) my_data.clusterlist->insert(std::make_pair(ckey, clus.release()));
if for some reason the cluster key is already in the clusterlist map, the insertion does not happen, but the unique_ptr has been "released", so that the cluster never gets deleted nonetheless.
The process by which you can have the same cluster key inserted twice in the list is a mistery.
Hugo
On 1/14/2022 11:16 AM, Anthony Frawley wrote:
Hello Chris,
Ouch. So even after the rapid rise to 8 GB resident memory, it doubles again. Do we have any memory profiling tools?
Tony
From: sPHENIX-tracking-l <sphenix-tracking-l-bounces AT lists.bnl.gov> on behalf of pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov>
Sent: Friday, January 14, 2022 10:46 AM
To: sphenix-tracking <sphenix-tracking-l AT lists.bnl.gov>
Subject: [Sphenix-tracking-l] memory plots for 10k events
Hi folks,
I ran over 10k events from the latest production under prmon with the
current tracking (Wednesday to be specific). The plots are attached. It
doesn't seem to make a difference if the tracks are written out or not
(with output/no output) . There goes my pet theory which was based on
https://urldefense.com/v3/__https://github.com/pinkenburg/rootmemory__;!!PhOWcWs!lzlLUxlIUvOP0PdItgdBME0jGja82Pe6D-1ppFUXDVVXJSXxpROhmlRu49EBIYPT$ , though those tests were done
without reading root objects which is what our tracking does.
Our resident memory grows by quite a bit, roughly a factor of 2 (rss
memory only in PrMon_wtime_vs_rss_with_output.png). If the vmem turns
out to be a feature we need to adjust the swap space of our nodes by
quite a bit.
Another observation is that asking for 10k events results in reading
10195 events - something in our chain is discarding events at a 2% level
(returning ABORT_EVENT).
Chris
--
*************************************************************
Christopher H. Pinkenburg ; pinkenburg AT bnl.gov
; https://urldefense.com/v3/__http://www.phenix.bnl.gov/*pinkenbu__;fg!!PhOWcWs!lzlLUxlIUvOP0PdItgdBME0jGja82Pe6D-1ppFUXDVVXJSXxpROhmlRu499iM7pX$
Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000
*************************************************************
--
*************************************************************
Christopher H. Pinkenburg ; pinkenburg AT bnl.gov
; http://www.phenix.bnl.gov/~pinkenbu
Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000
*************************************************************
_______________________________________________
sPHENIX-tracking-l mailing list
sPHENIX-tracking-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/sphenix-tracking-l
-
[Sphenix-tracking-l] memory plots for 10k events,
pinkenburg, 01/14/2022
- Re: [Sphenix-tracking-l] [EXTERNAL] memory plots for 10k events, Osborn, Joe, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Anthony Frawley, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
pinkenburg, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Hugo Pereira Da Costa, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Anthony Frawley, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Hugo Pereira Da Costa, 01/14/2022
-
Re: [Sphenix-tracking-l] [EXTERNAL] Re: memory plots for 10k events,
Osborn, Joe, 01/14/2022
- Re: [Sphenix-tracking-l] [EXTERNAL] Re: memory plots for 10k events, Hugo Pereira Da Costa, 01/14/2022
-
Re: [Sphenix-tracking-l] [EXTERNAL] Re: memory plots for 10k events,
Osborn, Joe, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Hugo Pereira Da Costa, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Anthony Frawley, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Hugo Pereira Da Costa, 01/14/2022
- Re: [Sphenix-tracking-l] memory plots for 10k events, pinkenburg, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
Hugo Pereira Da Costa, 01/14/2022
-
Re: [Sphenix-tracking-l] memory plots for 10k events,
pinkenburg, 01/14/2022
Archive powered by MHonArc 2.6.24.