sphenix-software-l AT lists.bnl.gov
Subject: sPHENIX discussion of software
List archive
- From: pinkenburg <pinkenburg AT bnl.gov>
- To: "sphenix-software-l AT lists.bnl.gov" <sphenix-software-l AT lists.bnl.gov>
- Subject: Re: [Sphenix-software-l] MDC status
- Date: Sat, 21 Nov 2020 13:00:35 -0500
Hi Tony,
The jobs died from using too much memory; 16GB looks like it does the trick. They have been running for 13 hours now and are currently sitting at 13GB. I did enable the timing printout during the resubmission. From what I gather, this batch should be done this evening. For this batch I am flying blind - the next batch prints out the event number, so one can see how far the jobs actually are.
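For reference, the timing printout is just the per-module timers Fun4All keeps; a minimal sketch of how it can be dumped at the end of a run (assuming the installed build provides Fun4AllServer::PrintTimer() - this is a sketch, not the production macro):

  // sketch only: run and dump per-module timers at the end of the job
  #include <fun4all/Fun4AllServer.h>
  R__LOAD_LIBRARY(libfun4all.so)

  void RunWithTiming(const int nevents = 100)
  {
    Fun4AllServer *se = Fun4AllServer::instance();
    // ... register input managers and modules as in the pass1 macro ...
    se->run(nevents);
    se->End();
    se->PrintTimer();  // print accumulated time per registered module
    delete se;
  }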
Chris
On 11/20/2020 9:27 PM, Anthony Frawley wrote:
Hi Chris,
Do you have the module timer on in the Fun4All macro? We need to see where the time is being used.
Tony
From: sPHENIX-software-l <sphenix-software-l-bounces AT lists.bnl.gov> on behalf of pinkenburg <pinkenburg AT bnl.gov>
Sent: Friday, November 20, 2020 7:53 PM
To: sphenix-software-l AT lists.bnl.gov <sphenix-software-l AT lists.bnl.gov>
Subject: [Sphenix-software-l] MDC status
Hi folks,
I produced 10 million minbias HIJING events (guess what, HIJING has a 0.04% chance of running into an infinite loop in the subroutine luzdis). They are located in /sphenix/sim/sim01/sphnxpro/MDC1/sHijing_HepMC/data (1000 files with 10,000 events each). sHijing now has a rudimentary command line parser, so one can set the number of events, the seed and the output filename from the command line (no more multiple xml files where one typically forgets to update the seed and ends up with identical files).
The macro and code for pass1 (G4 hits production) are ready. I enabled the BBC, Micromegas and the EPD (since the EPD is outside our regular acceptance it doesn't interfere with our baseline). The flow and Fermi motion afterburners are enabled. We only save the detector hits (no absorber hits) and the truth info. The output file size (based on 2 jobs) is 12-13GB for 100 events, and the running time is on the order of 15-20 hours. So we are looking at 100,000 condor jobs (228 CPU years of processing) and a total storage need of 1.3PB.
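For orientation, the enables above amount to something like the following in the Enable:: convention of the common subsystem macros - a sketch only, the flag names are my assumption, and the authoritative settings are in the pass1 macro in the MDC1 repo:

  // sketch only - see the pass1 macro in MDC1 for the real settings
  #include <GlobalVariables.C>

  void Pass1Enables()
  {
    Enable::BBC = true;         // bbc on
    Enable::MICROMEGAS = true;  // micromegas on
    Enable::EPD = true;         // epd on (outside our regular acceptance)
    Enable::ABSORBER = false;   // save detector hits only, no absorber hits
    // the flow and Fermi motion afterburners are switched on for the HepMC input
  }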
Running at this scale wouldn't be a big problem under normal circumstances, but the memory consumption of each job is 20GB, and this drastically reduces the number of simultaneous jobs we can run. Our memory is allocated in quanta of 2GB, so these jobs will only start if a machine has 10 idle cores when the scheduler checks a node. The farm is always busy, so the chances of this happening are limited. It will get better with time, since one of our jobs quitting frees up 10 cores and the next job just takes over. The other problem is that we have a lot of old hardware where 10 cores are a substantial fraction of a node (I don't think we have nodes with fewer than 10 cores - they are not that old). Basically, our throughput is hard to predict, so I just submitted 5000 jobs and threw all our condor slots into the sphenix queue to see what we can get out of RCF. But I am not terribly optimistic that 10 million events will be possible in a month's time if we cannot reduce the memory consumption.
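For reference, the estimates above are just the per-job figures from the two test jobs multiplied out; a quick back-of-the-envelope check:

  // back-of-the-envelope check of the production estimates quoted above
  #include <cstdio>

  int main()
  {
    const double total_events   = 10e6;  // 10 million minbias events
    const double events_per_job = 100;   // events per condor job
    const double hours_per_job  = 20;    // upper end of 15-20 hours/job
    const double gb_per_job     = 13;    // upper end of 12-13GB output/job
    const double job_memory_gb  = 20;    // observed memory per job
    const double mem_quantum_gb = 2;     // our memory quantum per core

    const double njobs     = total_events / events_per_job;          // 100,000 jobs
    const double cpu_years = njobs * hours_per_job / (24 * 365.25);  // ~228 cpu years
    const double petabytes = njobs * gb_per_job / 1e6;               // ~1.3PB
    const double cores     = job_memory_gb / mem_quantum_gb;         // 10 idle cores per job

    std::printf("%g jobs, %.0f cpu years, %.1f PB, %g cores/job\n",
                njobs, cpu_years, petabytes, cores);
    return 0;
  }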
Anyway - given that our G4 code has been stable for a long time, we can likely use those files for the MDC, and I would like to get the production going as soon as possible. From a test I have two g4hits files available under
/sphenix/sim/sim01/sphnxpro/MDC1/sHijing_HepMC/G4Hits
Please have a look at them. They lack the flow and Fermi motion afterburners but are otherwise identical to what is being run right now.
The first version of the reconstruction pass macro (pass2), which can run on the DSTs produced by pass1, is also running. There will definitely be some tuning and changes in the reconstruction code, but once the HIJING production spits out some hits files we can run this to produce input for the topical groups (and anyone who wants to analyze it). The processing of the two hits files mentioned above is ongoing; the DSTs will be written (in the hope that their memory stays more or less at the 6GB where it is right now) to:
/sphenix/sim/sim01/sphnxpro/MDC1/sHijing_HepMC/DST
Give it till tomorrow before you look. Those jobs run more than just the tracking and seem to take about 10 minutes/event (based on 2 events), i.e. ~15 hours for those 100 events. On the positive side, the event-wise memory leaks we have been fixing over the last few days are not too critical when you only run over 100 events.
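If you want to run over these DSTs yourself before the official pass2 output shows up, the minimal Fun4All skeleton for reading a pass1 DST looks roughly like this - a sketch only, the file name is a placeholder and the real pass2 macro registers the full reconstruction chain:

  // sketch only: read a pass1 DST, no reconstruction modules registered
  #include <fun4all/Fun4AllServer.h>
  #include <fun4all/Fun4AllDstInputManager.h>
  #include <string>
  R__LOAD_LIBRARY(libfun4all.so)

  void ReadPass1DST(const int nevents = 10,
                    const std::string &infile = "g4hits_placeholder.root")
  {
    Fun4AllServer *se = Fun4AllServer::instance();

    Fun4AllDstInputManager *in = new Fun4AllDstInputManager("DSTin");
    in->fileopen(infile);
    se->registerInputManager(in);

    // ... register the reconstruction modules here, as in the pass2 macro ...

    se->run(nevents);
    se->End();
    delete se;
  }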
Just a reminder, the MDC git repo is
https://github.com/sPHENIX-Collaboration/MDC1
The production Fun4All macros (pass1 and pass2) are located in the macros/detectors/sPHENIX directory:
https://github.com/sPHENIX-Collaboration/MDC1/tree/main/macros/detectors/sPHENIX
They call the common subsystem macros, so they stay in sync with our latest and greatest (until we tag the show and make a production build). Feel free to submit PRs with changes (or let me know).
Have a good weekend,
Chris
--
*************************************************************
Christopher H. Pinkenburg ; pinkenburg AT bnl.gov
; http://www.phenix.bnl.gov/~pinkenbu
Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000
*************************************************************
_______________________________________________
sPHENIX-software-l mailing list
sPHENIX-software-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/sphenix-software-l