Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found)
- From: pinkenburg <pinkenburg AT bnl.gov>
- To: Christof Roland <christof.roland AT cern.ch>, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov>
- Subject: Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found)
- Date: Fri, 21 Jan 2022 12:42:03 -0500
Hi Christof,
too bad - but yes - that would have been too easy.
As long as there is a way to get to the allocated memory, valgrind is happy. E.g. if one keeps filling a container without ever clearing it, valgrind is happy (I can remember a case in PHENIX which was painful to track down). But it should pick up when things go out of scope.
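A toy example of that pattern (hypothetical code, not the PHENIX case itself): the container keeps every allocation reachable, so valgrind reports the memory as "still reachable" rather than "definitely lost", even though the footprint grows with every event.

#include <vector>

std::vector<int*> g_store;  // never cleared

void process_event()
{
  // Reachable via g_store, so valgrind does not flag this as lost.
  g_store.push_back(new int[1024]);
}

int main()
{
  for (int i = 0; i < 10000; ++i) process_event();
  // Growing footprint, but no leak in valgrind's definite-loss sense.
  return 0;
}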
Chris
On 1/21/2022 12:36 PM, Christof Roland wrote:
Hi Chris,
it's not that simple. There is one map per hitset, and each thread is processing one hitset. So in terms of race conditions etc. it's safe. Each thread knows the pointer to the map it is supposed to insert into, and the various threads explicitly cannot step on each other's feet. That was the whole point of splitting the whole architecture into completely independent hitsets, so this part should be thread-safe by construction.
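Schematically, the layout is like this (all names hypothetical, not the actual TpcClusterizer code):

#include <cstdint>
#include <map>
#include <thread>
#include <utility>
#include <vector>

using cluskey = std::uint64_t;
using hitkey  = std::uint32_t;

struct HitsetJob
{
  std::multimap<cluskey, hitkey> clusterhitassoc;  // owned by exactly one thread
};

int main()
{
  std::vector<HitsetJob> jobs(4);  // one job per hitset
  std::vector<std::thread> threads;
  for (auto& job : jobs)
  {
    // Each thread writes only to its own map, so the threads cannot
    // step on each other's feet and no mutex is needed.
    threads.emplace_back([&job]
    {
      for (hitkey hit = 0; hit < 100; ++hit)
      {
        job.clusterhitassoc.insert(std::make_pair(cluskey(1), hit));
      }
    });
  }
  for (auto& t : threads) t.join();
  // After joining, the per-hitset maps can be merged single-threaded.
  return 0;
}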
For completeness I tried putting the pointer to the entire container into each thread to insert, and this indeed produces random seg faults, as it is expected to do. A mutex I also tried during development, but this slows things down so much that there is almost no benefit from threading, since all threads spend their jolly time waiting for each other.
I am currently suspecting the make_pair to do something strange in this context. When adding to the map, something that would simply go out of scope doesn't anymore.
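For reference, a minimal illustration of that effect (toy code, not the clusterizer): insert copies the pair into the container, so the copy outlives the local scope.

#include <map>
#include <utility>

int main()
{
  std::multimap<int, int> assoc;
  {
    int local_hitkey = 42;                          // local variable
    assoc.insert(std::make_pair(1, local_hitkey));  // insert stores a COPY
  }                                                 // local_hitkey is gone here
  // The copied pair is still alive inside 'assoc' and is only released
  // when the map is cleared or destroyed.
  assoc.clear();
  return 0;
}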
Btw. there is no trace in valgrind of this. So this is somehow accounted for.
We could try running single-threaded, but this will obviously blow the CPU time back up to 5-10 sec per event. I will try whether this solves the problem.
In the end, for data only we don't need the hit association; this is only needed for the current reco - truth matching strategy.
cheers
Christof
On 21. Jan 2022, at 17:56, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
I suspect there is some confusion as well with the versioning and where you are using an explicit …
Hi Christof,
this might be a threading issue. I just checked - map::insert does not seem to be thread safe. Can you put a mutex around this insert and see if this makes the leak go away? Or just run single-threaded?
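For illustration, a sketch of what such a guarded insert could look like (the mutex and the helper function are hypothetical, not the actual TpcClusterizer code):

#include <map>
#include <mutex>
#include <utility>

std::mutex assoc_mutex;  // hypothetical guard for the shared container

void insert_assoc(std::multimap<int, int>& clusterhitassoc, int ckey, int hitkey)
{
  // Serialize concurrent inserts into the shared multimap.
  std::lock_guard<std::mutex> lock(assoc_mutex);
  clusterhitassoc.insert(std::make_pair(ckey, hitkey));
}

int main()
{
  std::multimap<int, int> assoc;
  insert_assoc(assoc, 1, 42);
  return 0;
}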
Chris
On 1/21/2022 10:12 AM, Christof Roland wrote:Hi again,
I am still struggling with the memory leak. This may turn out to be a little more intricate than we thought.
I get the Reset() of TrkrClusterHitAssocv3 to be executed now, but all attempts to get rid of the memory failed so far. The only way I get the memory to be wiped is if I call clusterhitassoc->clear() at the end of the processing of each hitset inside each given thread processing it.
Any attempts at clearing the memory outside the threads fail. So this is probably the place to look, but I am not sure I understand yet how this works. Each thread gets a pointer to the multimap that is supposed to store the association information. I put a print statement in the Reset() of TrkrClusterHitAssocv3 and it never shows.
std::multimap<TrkrDefs::cluskey, TrkrDefs::hitkey> *clusterhitassoc
This is then filled for each cluster with the entries from a vector containing the hitkeys constituting the cluster, explicitly:

if (my_data.do_assoc && my_data.clusterhitassoc)
{
  for (unsigned int i = 0; i < hitkeyvec.size(); i++)
  {
    my_data.clusterhitassoc->insert(std::make_pair(ckey, hitkeyvec[i]));
  }
}
No explicit memory allocation or new operators involved.
Does anybody have an idea how to work around this?
cheers
Christof
On 20. Jan 2022, at 21:00, Christof Roland via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
Hi,
I ran over 1000 - 4000 events.
I also found a loop that should clear and set to zero the multimaps used in the HitAssoc, but as of now it doesn't get called.
But at least turning the hit association on and off turns the big memory leak on and off. There is another small one from the ACTS layer association, but that one is a small fraction of a GB over 4k events. I will investigate that one once I have the hit assoc one sorted.
Btw. we may want to get rid of the hit assoc container completely at some point. For reco we don't need it, and for truth association we will probably make it obsolete as well when we rework the logic here.
cheers
Christof
On 20. Jan 2022, at 18:31, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
Hi Tony, Christof,
TrkrHitTruthAssociation has the same reset instruction. I assume that Christof is running over already produced hits files, so he would not see a memory leak from that.
I wonder if we are barking up the wrong tree here. The memory leak plot ran over 10k events. If I look at the scale of Christof's plot, it is likely from a single input file with 400 events. Maybe we are just seeing the map blowing up to accommodate the largest event it saw.
But clearing the map this way might do some good, since I think we have the problem that a map's memory footprint doesn't really shrink if you .clear() it.
Chris
On 1/20/2022 12:19 PM, Anthony Frawley wrote:
Dear Christof and Chris,
The suggestion is to create an empty map (in local scope, presumably) and use swap to exchange its contents with the persistent map. The example was:
=============
Try:
{
std::map<int,int> empty_map;
empty_map.swap(_map);
}
(At least, this is the usual way to convince a standard library container actually to release its memory.)
=============
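Applied to the container at hand, a Reset() along these lines might look like the following sketch (class and member names are stand-ins, not the actual TrkrClusterHitAssocv3 code):

#include <map>

struct ClusterHitAssoc  // hypothetical stand-in for TrkrClusterHitAssocv3
{
  std::multimap<unsigned long long, unsigned int> m_map;  // cluskey -> hitkey

  void Reset()
  {
    // Swapping with an empty local map hands the old contents to
    // empty_map, which releases them when it goes out of scope.
    // A plain m_map.clear() may keep the node memory allocated.
    std::multimap<unsigned long long, unsigned int> empty_map;
    empty_map.swap(m_map);
  }
};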
It would be worth trying this in the Reset() method of the container.

Tony

From: sPHENIX-tracking-l <sphenix-tracking-l-bounces AT lists.bnl.gov> on behalf of Christof Roland via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov>
Sent: Thursday, January 20, 2022 10:17 AM
To: pinkenburg <pinkenburg AT bnl.gov>
Cc: sphenix-tracking-l AT lists.bnl.gov <sphenix-tracking-l AT lists.bnl.gov>
Subject: Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found)
ok, I just ran a job explicitly resetting the maps in the clusterizer. This didn't change things. See below.
It's the Reset() of TrkrClusterHitAssocv3.cc. The filling of it is well contained in an if statement in the clusterizer, so I'll check there if something else is allocated in there.
What is the recommended way of resetting a map?
cheers
Christof
<prmon_clusterizer_reset.jpg>
On 20. Jan 2022, at 16:00, pinkenburg via sPHENIX-tracking-l <sphenix-tracking-l AT lists.bnl.gov> wrote:
Hi Christof,
the event is cleared in a separate step after processing, by calling the Reset() method for every object under the DST node (we have some opaque foreach-type way to iterate over the node tree; what is being executed here is offline/framework/phool/PHNodeReset.cc).
I think the problem is likely in a Reset() method of the container. Which one we are talking about is opaque to me, but if I look at the containers in trackbase, they just clear their map without calling the Reset() method of their members. If a member of this map allocates memory, it leaks right there.
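For instance, a container along these lines leaks exactly that way if Reset() only clears the map (a schematic sketch, not the actual trackbase code):

#include <map>

struct Cluster
{
  int data[64];  // stands in for per-cluster allocations
};

struct Container  // schematic, not a real trackbase container
{
  std::map<int, Cluster*> m_clusters;

  void Reset()
  {
    // BUG: drops the pointers without deleting the Clusters,
    // so every event's clusters are leaked.
    m_clusters.clear();

    // A leak-free Reset() would first delete the members:
    //   for (auto& entry : m_clusters) { delete entry.second; }
    //   m_clusters.clear();
  }
};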
Chris
On 1/20/2022 7:49 AM, Christof Roland via sPHENIX-tracking-l wrote:
Hi Everybody,
I tried a few stunts to find the memory leak in the TpcClusterizer, and what finally stopped it was turning off the creation of the ClusterHitAssociation map. With this disabled the memory over 1000 events is pretty stable. See PrMon plot below.
It looks like the ClusterHitAssoc map is not properly reset. Looking in the clusterizer code, neither the cluster map nor the assoc map are reset, which makes sense since the INTT and MVTX clusterizers also write to the same maps. While the cluster map apparently gets reset properly, the assoc map doesn't. Where do we do this, actually?
Cheers
Christof
<prmon_clusterizer_nohitassoc.jpg>
--
*************************************************************
Christopher H. Pinkenburg ; pinkenburg AT bnl.gov ; http://www.phenix.bnl.gov/~pinkenbu
Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000
*************************************************************
<PrMon_wtime_vs_rss_with_output.png>
- [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Anthony Frawley, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/21/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Anthony Frawley, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), Christof Roland, 01/20/2022
  - Re: [Sphenix-tracking-l] TpcClusterizer memory leak (found), pinkenburg, 01/20/2022