Re: [Atlas-connect-l] Held ATLAS Connect jobs OOMing workers
- From: Matt LeBlanc <matt.leblanc AT cern.ch>
- To: Lincoln Bryant <lincolnb AT uchicago.edu>
- Cc: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] Held ATLAS Connect jobs OOMing workers
- Date: Fri, 27 Mar 2020 12:12:58 +0100
Hi Lincoln,
It looks like there can be connectivity issues when streaming the input files to the condor workers via XRootD, and when that happens a significant memory leak develops while they are processed. For most of the inputs things run smoothly, but I guess this is why one or two of the jobs "stick" and take much longer to finish -- I think those are the problematic jobs you saw. If I process the input files for these pathological jobs interactively, everything runs well and the memory usage stays under control.
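Roughly the kind of standalone check I mean is the following; this is just a sketch, and the URL and tree name below are placeholders rather than the real dataset:

// Sketch of an interactive sanity check on a single input file.
// The URL and the tree name "tree" are placeholders, not the real names.
#include "TFile.h"
#include "TTree.h"
#include <iostream>
#include <memory>

void checkInput() {
  std::unique_ptr<TFile> f(TFile::Open(
      "root://faxbox.usatlas.org//path/to/input.root"));
  if (!f || f->IsZombie()) {
    std::cerr << "could not open input\n";
    return;
  }
  auto* tree = dynamic_cast<TTree*>(f->Get("tree"));  // placeholder tree name
  if (!tree) {
    std::cerr << "tree not found\n";
    return;
  }
  // Touch every entry once; TTree::GetEntry returns <= 0 when the read
  // fails or the entry does not exist, so a clean pass means the file
  // streams fine from here.
  for (Long64_t i = 0, n = tree->GetEntries(); i < n; ++i) {
    if (tree->GetEntry(i) <= 0) {
      std::cerr << "bad read at entry " << i << "\n";
      return;
    }
  }
  std::cout << "all entries read cleanly\n";
}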
I'm not sure what the best course of action is here -- it would be better if the jobs failed outright in these circumstances, but they seem to keep running instead. I was perhaps being a little aggressive in trying to process the same input files multiple times from different jobs, which may simply be asking too much of the I/O. I can run a single set of test jobs over the files later today and watch for any that break in this way.
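One option might be to promote ROOT errors to fatal before the event loop starts, so a job dies at the first bad read instead of limping along -- roughly along these lines (just a sketch; where exactly it would be called is an open question):

// Sketch: abort the worker on the first ROOT Error-level message, so a
// dropped XRootD connection (the TNetXNGFile::ReadBuffers / TBranch::GetBasket
// errors in the log below) kills the job instead of leaving it leaking.
// This would need to run once, before the event loop starts.
#include "TError.h"

void configureFailFast() {
  gErrorAbortLevel = kError;  // any Error or worse now terminates the process
}

That is a blunt instrument, since any Error-level message would kill the job, but for these inputs a hard failure plus a retry is probably better than a slow leak that takes a worker node down.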
Best,
Matt
---
JetHistoAlgo :: execute() Processing event 38000.
MC DSID 407343 MCWeight 734.112 XS*kFactor*FE: 1.83714e-08
Pile-up info mu :48.5 NPV: 24
GenFiltHT: 1086.48 GeV
Applying selections!
NMu: 0 FAIL
MET: 139.505 GeV
MET+MTW: 138.505 GeV
dRb: 0.502649
Leading jet pT: 448.482 GeV SEL
Number of b-jets: 3
dR(0.4,1.0): 0.171958
dR(b,1.0): 0.171958
TNetXNGFile::ReadBuffers ERROR [FATAL] Connection error
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:406716\
0606, branch:tvar_event_mcEventWeight, entry:38900, badread=1, nerrors=1, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38901, badread=0, nerrors=2, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38902, badread=0, nerrors=3, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38903, badread=0, nerrors=4, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38904, badread=0, nerrors=5, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38905, badread=0, nerrors=6, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38906, badread=0, nerrors=7, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38907, badread=0, nerrors=8, basketnumber=389
TBranch::GetBasket ERROR File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14\
_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, bra\
nch:tvar_event_mcEventWeight, entry:38908, badread=0, nerrors=9, basketnumber=389
file probably overwritten: stopping reporting error messages
===>File is more than 2 Gigabytes
JetHistoAlgo :: execute() Processing event 39000.
MC DSID 407343 MCWeight 735.703 XS*kFactor*FE: 1.83714e-08
Pile-up info mu :18.49 NPV: 12
GenFiltHT: 1123.23 GeV
Applying selections!
NMu: 1
mu pt, eta: 64.8302 1.12432
MET: 63.3693 GeV
MET+MTW: 95.2342 GeV
dRb: 0.502649
Leading jet pT: 491.644 GeV SEL
Number of b-jets: 1
dR(0.4,1.0): 0.355497
dR(b,1.0): 0.478931
TNetXNGFile::TNetXNGFile ERROR The remote file is not open
TNetXNGFile::TNetXNGFile ERROR The remote file is not open
TNetXNGFile::TNetXNGFile ERROR The remote file is not open
TNetXNGFile::TNetXNGFile ERROR The remote file is not open
TBasket::Streamer ERROR The value of fKeylen is incorrect (-18480) ; trying to recover by setting it to zero
TBasket::Streamer ERROR The value of fNbytes is incorrect (-941449880) ; trying to recover by setting it to zero
TBasket::TBasket::Stre... ERROR The value of fNevBufSize (-33336326) or fIOBits (209) is incorrect ; setting the buffer to a zombie.
TBasket::Streamer ERROR The value of fKeylen is incorrect (-18480) ; trying to recover by setting it to zero
On Thu, Mar 26, 2020 at 7:23 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
Hi Lincoln,
Sorry about that! Those are a new analysis, so there might be some hidden wrinkles to smooth out. I will kill them and look into this.
Cheers,
Matt
--
On Thu, Mar 26, 2020 at 19:14 Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Hi Matt,
Apologies, but I've held your jobs on ATLAS Connect because they seem to be OOMing our worker nodes.
From the ps tree of a worker:
ruc.mwt2 198552 9.1 82.7 3956304672 162588084 ? Rl 12:47 1:59 | | \_ eventloop_batch_worker 96 ./config.root
That's roughly 150 GiB of RSS for that job.
We have limited access to the machine room right now due to the COVID-19-related lockdown at the University, so I immediately held all of your jobs to prevent any other workers from rebooting.
Could you look into the memory utilization when you get a chance?
Thanks,
Lincoln
--
Matt LeBlanc
University of Arizona
Office: 40/1-C11 (CERN)
https://cern.ch/mleblanc/