Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found)
- From: pinkenburg <pinkenburg AT bnl.gov>
- To: Christof Roland <christof.roland AT cern.ch>, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov>
- Subject: Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found)
- Date: Fri, 21 Jan 2022 12:42:03 -0500
Hi Christof,
too bad - but yes - that would have been too easy.
As long as there is a way to get to the allocated memory, valgrind is happy. E.g. if one keeps filling a container without ever clearing it, valgrind is happy (I can remember a case in PHENIX which was painful to track down). But it should pick up when things go out of scope.
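A toy example of that pattern (hypothetical code, not the PHENIX case itself): the container keeps every allocation reachable, so valgrind reports the memory as "still reachable" rather than "definitely lost", even though the footprint grows with every event.

#include <vector>

std::vector<int*> g_store;  // never cleared

void process_event()
{
  // Reachable via g_store, so valgrind does not flag this as lost.
  g_store.push_back(new int[1024]);
}

int main()
{
  for (int i = 0; i < 10000; ++i) process_event();
  // Growing footprint, but no leak in valgrind's definite-loss sense.
  return 0;
}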
Chris
On 1/21/2022 12:36 PM, Christof Roland wrote:
Hi Chris,
it's not that simple. There is one map per hitset, and each thread is processing one hitset. So in terms of race conditions etc. it's safe. Each thread knows the pointer to the map it is supposed to insert into, and the various threads explicitly cannot step on each other's feet. That was the whole point of splitting the whole architecture into completely independent hitsets, so this part should be thread-safe by construction.
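Schematically, the layout is like this (all names hypothetical, not the actual TpcClusterizer code):

#include <cstdint>
#include <map>
#include <thread>
#include <utility>
#include <vector>

using cluskey = std::uint64_t;
using hitkey  = std::uint32_t;

struct HitsetJob
{
  std::multimap<cluskey, hitkey> clusterhitassoc;  // owned by exactly one thread
};

int main()
{
  std::vector<HitsetJob> jobs(4);  // one job per hitset
  std::vector<std::thread> threads;
  for (auto& job : jobs)
  {
    // Each thread writes only to its own map, so the threads cannot
    // step on each other's feet and no mutex is needed.
    threads.emplace_back([&job]
    {
      for (hitkey hit = 0; hit < 100; ++hit)
      {
        job.clusterhitassoc.insert(std::make_pair(cluskey(1), hit));
      }
    });
  }
  for (auto& t : threads) t.join();
  // After joining, the per-hitset maps can be merged single-threaded.
  return 0;
}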
For completeness I tried putting the pointer to the entire container into each thread to insert, and this indeed produces random seg faults, as it is expected to do. A mutex I also tried during development, but this slows things down so much that there is almost no benefit from threading, since all threads spend their jolly time waiting for each other.
I am currently suspecting the make_pair to do something strange in this context. When adding to the map, something that would simply go out of scope doesn't anymore.
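For reference, a minimal illustration of that effect (toy code, not the clusterizer): insert copies the pair into the container, so the copy outlives the local scope.

#include <map>
#include <utility>

int main()
{
  std::multimap<int, int> assoc;
  {
    int local_hitkey = 42;                          // local variable
    assoc.insert(std::make_pair(1, local_hitkey));  // insert stores a COPY
  }                                                 // local_hitkey is gone here
  // The copied pair is still alive inside 'assoc' and is only released
  // when the map is cleared or destroyed.
  assoc.clear();
  return 0;
}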
Btw. there is no trace in valgrind of this. So this is somehow accounted for.
We could try running single-threaded, but this will obviously blow the CPU time back up to 5-10 sec per event. I will try whether this solves the problem.
In the end, for data only we don't need the hit association; this is only needed for the current reco - truth matching strategy.
cheers
Christof
On 21. Jan 2022, at 17:56, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
I suspect there is some confusion as well with the versioning and where you are using an explicit …
Hi Christof,
this might be a threading issue. I just checked - map::insert does not seem to be thread safe. Can you put a mutex around this insert and see if this makes the leak go away? Or just run single-threaded?
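For illustration, a sketch of what such a guarded insert could look like (the mutex and the helper function are hypothetical, not the actual TpcClusterizer code):

#include <map>
#include <mutex>
#include <utility>

std::mutex assoc_mutex;  // hypothetical guard for the shared container

void insert_assoc(std::multimap<int, int>& clusterhitassoc, int ckey, int hitkey)
{
  // Serialize concurrent inserts into the shared multimap.
  std::lock_guard<std::mutex> lock(assoc_mutex);
  clusterhitassoc.insert(std::make_pair(ckey, hitkey));
}

int main()
{
  std::multimap<int, int> assoc;
  insert_assoc(assoc, 1, 42);
  return 0;
}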
Chris
On 1/21/2022 10:12 AM, Christof Roland wrote:Hi again,
I am still struggling with the memory leak. This may turn out to be a little more intricate than we thought.
I get the Reset() of TrkrClusterHitAssocv3 to be executed now, but all attempts to get rid of the memory failed so far. The only way I get the memory to be wiped is if I call clusterhitassoc->clear() at the end of the processing of each hitset inside each given thread processing it.
Any attempts at clearing the memory outside the threads fail. So this is probably the place to look, but I am not sure I understand yet how this works. Each thread gets a pointer to the multimap that is supposed to store the association information. I put a print statement in the Reset() of TrkrClusterHitAssocv3 and it never shows.
std::multimap<TrkrDefs::cluskey, TrkrDefs::hitkey> *clusterhitassoc
This is then filled for each cluster with the entries from a vector containing the hitkeys constituting the cluster, explicitly:

if (my_data.do_assoc && my_data.clusterhitassoc)
{
  for (unsigned int i = 0; i < hitkeyvec.size(); i++)
  {
    my_data.clusterhitassoc->insert(std::make_pair(ckey, hitkeyvec[i]));
  }
}
No explicit memory allocation or new operators involved.
Does anybody have an idea how to work around this?
cheers
Christof
On 20. Jan 2022, at 21:00, Christof Roland via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
Hi,
I ran over 1000 - 4000 events.
I also found a loop that should clear and set to zero the multimaps used in the HitAssoc, but as of now it doesn't get called.
But at least turning the hit association on and off turns the big memory leak on and off. There is another small one from the ACTS layer association, but that one is a small fraction of a GB over 4k events. I will investigate that one once I have the hit assoc one sorted.
Btw. we may want to get rid of the hit assoc container completely at some point. For reco we don't need it, and for truth association we will probably make it obsolete as well when we rework the logic here.
cheers
Christof
On 20. Jan 2022, at 18:31, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
Hi Tony, Christof,
TrkrHitTruthAssociation has the same reset instruction. I assume that Christof is running over already produced hits files, so he would not see a memory leak from that.
I wonder if we are barking up the wrong tree here. The memory leak plot ran over 10k events. If I look at the scale of Christof's plot, it is likely from a single input file with 400 events. Maybe we are just seeing the map blowing up to accommodate the largest event it saw.
But clearing the map this way might do some good, since I think we have the problem that a map's memory footprint doesn't really shrink if you .clear() it.
Chris
On 1/20/2022 12:19 PM, Anthony Frawley wrote:
Dear Christof and Chris,
The suggestion is to create an empty map (in local scope, presumably) and use swap to exchange its contents with the persistent map. The example was:
=============
Try:
{
std::map<int,int> empty_map;
empty_map.swap(_map);
}
(At least, this is the usual way to convince a standard library container actually to release its memory.)
=============
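Applied to the container at hand, a Reset() along these lines might look like the following sketch (class and member names are stand-ins, not the actual TrkrClusterHitAssocv3 code):

#include <map>

struct ClusterHitAssoc  // hypothetical stand-in for TrkrClusterHitAssocv3
{
  std::multimap<unsigned long long, unsigned int> m_map;  // cluskey -> hitkey

  void Reset()
  {
    // Swapping with an empty local map hands the old contents to
    // empty_map, which releases them when it goes out of scope.
    // A plain m_map.clear() may keep the node memory allocated.
    std::multimap<unsigned long long, unsigned int> empty_map;
    empty_map.swap(m_map);
  }
};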
It would be worth trying this in the Reset() method of the container.

Tony

From: sPHENIX-tracking-l <sphenix-tracking-l-bounces AT lists.bnl.gov> on behalf of Christof Roland via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov>
Sent: Thursday, January 20, 2022 10:17 AM
To: pinkenburg <pinkenburg AT bnl.gov>
Cc: sphenix-tracking-l AT lists.bnl.gov <sphenix-tracking-l AT lists.bnl.gov>
Subject: Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found)
ok, I just ran a job explicitly resetting the maps in the clusterizer. This didn't change things. See below.
It's the Reset() of TrkrClusterHitAssocv3.cc. The filling of it is well contained in an if statement in the clusterizer, so I'll check there if something else is allocated in there.
What is the recommended way of resetting a map?
cheers
Christof
<prmon_clusterizer_reset.jpg>
On 20. Jan 2022, at 16:00, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
Hi Christof,
the event is cleared in a separate step after processing, by calling the Reset() method for every object under the DST node (we have some opaque foreach-type way to iterate over the node tree; what is being executed here is offline/framework/phool/PHNodeReset.cc).
I think the problem is likely in a Reset() method of the container. Which one we are talking about is opaque to me, but if I look at the containers in trackbase, they just clear their map without calling the Reset() method of their members. If a member of this map allocates memory, it leaks right there.
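For instance, a container along these lines leaks exactly that way if Reset() only clears the map (a schematic sketch, not the actual trackbase code):

#include <map>

struct Cluster
{
  int data[64];  // stands in for per-cluster allocations
};

struct Container  // schematic, not a real trackbase container
{
  std::map<int, Cluster*> m_clusters;

  void Reset()
  {
    // BUG: drops the pointers without deleting the Clusters,
    // so every event's clusters are leaked.
    m_clusters.clear();

    // A leak-free Reset() would first delete the members:
    //   for (auto& entry : m_clusters) { delete entry.second; }
    //   m_clusters.clear();
  }
};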
Chris
On 1/20/2022 7:49 AM, Christof Roland via sPHENIX-tracking-l wrote:
Hi Everybody,
I tried a few stunts to find the memory leak in the TpcClusterizer, and what finally stopped it was turning off the creation of the ClusterHitAssociation map. With this disabled the memory over 1000 events is pretty stable. See PrMon plot below.
It looks like the ClusterHitAssoc map is not properly reset. Looking in the clusterizer code, neither the cluster map nor the assoc map are reset, which makes sense since the INTT and MVTX clusterizers also write to the same maps. While the cluster map apparently gets reset properly, the assoc map doesn't. Where do we do this, actually?
Cheers
Christof
<prmon_clusterizer_nohitassoc.jpg>
--
*************************************************************
Christopher H. Pinkenburg ; pinkenburg AT bnl.gov ; http://www.phenix.bnl.gov/~pinkenbu
Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000
*************************************************************
<PrMon_wtime_vs_rss_with_output.png>
- [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Anthony Frawley, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Anthony Frawley, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022