atlas-connect-l - Re: [Atlas-connect-l] Held ATLAS Connect jobs OOMing workers

  • From: Matt LeBlanc <matt.leblanc AT cern.ch>
  • To: Lincoln Bryant <lincolnb AT uchicago.edu>
  • Cc: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] Held ATLAS Connect jobs OOMing workers
  • Date: Fri, 27 Mar 2020 12:12:58 +0100

Hi Lincoln,

It looks like there can be connectivity issues when streaming the input files to the Condor jobs over XRootD, which leads to a significant memory leak while they are being processed. Most of the inputs run smoothly, but I suspect this is why one or two of the jobs "stick" and take much longer to finish; those are probably the problematic jobs you saw. If I process the input files for these pathological jobs interactively, everything runs fine and the memory usage stays under control.
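
To guard against this, I'm considering adding a hard failure at file-open time -- roughly the following (just a sketch, not what the code currently does; openInput is a stand-in name):

#include <memory>
#include <stdexcept>
#include <string>
#include "TFile.h"

// Open an input over XRootD and refuse to continue if the open fails,
// so a bad connection turns into a failed job rather than a silent leak.
std::unique_ptr<TFile> openInput(const std::string& url) {
  std::unique_ptr<TFile> f(TFile::Open(url.c_str(), "READ"));
  if (!f || f->IsZombie()) {
    throw std::runtime_error("Could not open " + url);
  }
  return f;
}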

I'm not sure what the best course of action is here -- it would be better if the jobs simply failed in these circumstances, but they seem to keep running instead. I was perhaps being a little aggressive in trying to process the same input files multiple times from different jobs, which may be asking too much of the I/O. I can run a single set of test jobs over the files later today and watch for any that break in this way.
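
Concretely, the kind of check I have in mind looks something like this (a sketch only; "tree" stands in for the input chain). TTree::GetEntry() returns -1 on an I/O error, so the event loop can bail out as soon as a read goes bad instead of running on:

#include <cstdlib>
#include <iostream>
#include "TTree.h"

// Abort on read errors instead of continuing with corrupted input.
void processAll(TTree* tree) {
  const Long64_t nEntries = tree->GetEntries();
  for (Long64_t i = 0; i < nEntries; ++i) {
    const Int_t nBytes = tree->GetEntry(i);
    if (nBytes <= 0) {
      // GetEntry() returns -1 on an I/O error (0 if the entry is missing);
      // exit non-zero so the batch system marks the job as failed.
      std::cerr << "Read error at entry " << i << ", aborting." << std::endl;
      std::exit(EXIT_FAILURE);
    }
    // ... selections and histogram filling would go here ...
  }
}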

Best,
Matt


---

JetHistoAlgo :: execute()       Processing event 38000.
MC              DSID    407343  MCWeight        734.112 XS*kFactor*FE:  1.83714e-08
Pile-up info    mu :48.5        NPV: 24
GenFiltHT:      1086.48 GeV

Applying selections!

        NMu: 0  FAIL
        MET: 139.505 GeV
        MET+MTW: 138.505 GeV

        dRb: 0.502649
        Leading jet pT: 448.482 GeV     SEL

        Number of b-jets: 3
        dR(0.4,1.0): 0.171958
        dR(b,1.0): 0.171958
TNetXNGFile::ReadBuffers  ERROR   [FATAL] Connection error
TBranch::GetBasket        ERROR   File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:4067160606, branch:tvar_event_mcEventWeight, entry:38900, badread=1, nerrors=1, basketnumber=389
TBranch::GetBasket        ERROR   File: root://faxbox.usatlas.org//faxbox2/user/mleblanc/mtop/MTOPv12/user.mleblanc.407343.PhPy8EG_A14_ttbarHT1k_1k5_hdamp258p75_nonallhad.deriv.MTOPv12_tree_pyconfig.root/user.mleblanc.20823890._000125.tree_pyconfig.root at byte:0, branch:tvar_event_mcEventWeight, entry:38901, badread=0, nerrors=2, basketnumber=389
[... same TBranch::GetBasket error repeated for entries 38902-38908 (nerrors=3..9) ...]
 file probably overwritten: stopping reporting error messages
===>File is more than 2 Gigabytes
JetHistoAlgo :: execute()       Processing event 39000.
MC              DSID    407343  MCWeight        735.703 XS*kFactor*FE:  1.83714e-08
Pile-up info    mu :18.49       NPV: 12
GenFiltHT:      1123.23 GeV

Applying selections!

        NMu: 1
        mu pt, eta: 64.8302     1.12432
        MET: 63.3693 GeV
        MET+MTW: 95.2342 GeV

        dRb: 0.502649
        Leading jet pT: 491.644 GeV     SEL

        Number of b-jets: 1
        dR(0.4,1.0): 0.355497
        dR(b,1.0): 0.478931
TNetXNGFile::TNetXNGFile  ERROR   The remote file is not open
TNetXNGFile::TNetXNGFile  ERROR   The remote file is not open
TNetXNGFile::TNetXNGFile  ERROR   The remote file is not open
TNetXNGFile::TNetXNGFile  ERROR   The remote file is not open
TBasket::Streamer         ERROR   The value of fKeylen is incorrect (-18480) ; trying to recover by setting it to zero
TBasket::Streamer         ERROR   The value of fNbytes is incorrect (-941449880) ; trying to recover by setting it to zero
TBasket::TBasket::Stre... ERROR   The value of fNevBufSize (-33336326) or fIOBits (209) is incorrect ; setting the buffer to a zombie.
TBasket::Streamer         ERROR   The value of fKeylen is incorrect (-18480) ; trying to recover by setting it to zero




On Thu, Mar 26, 2020 at 7:23 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
Hi Lincoln,

Sorry about that! Those jobs are from a new analysis, so there may be some hidden wrinkles to smooth out. I will kill them and look into this.

Cheers,
Matt

On Thu, Mar 26, 2020 at 19:14 Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Hi Matt,

Apologies, but I've held your jobs on ATLAS Connect because they seem to be OOMing our worker nodes.

From the PS tree of a worker:
ruc.mwt2 198552  9.1 82.7 3956304672 162588084 ? Rl  12:47   1:59      |                   |       \_ eventloop_batch_worker 96 ./config.root

that's roughly 150 GiB of RSS for that job.

We have limited access to the machine room right now due to the COVID-19-related lockdown at the University, so I immediately held all of your jobs to prevent any other workers from rebooting.

Could you look into the memory utilization when you get a chance?
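
One easy way to see it from the job side (just a sketch, nothing that's in your setup now) would be to have the job print its own VmRSS from /proc/self/status every few thousand events:

#include <fstream>
#include <iostream>
#include <string>

// Print the job's resident set size so a runaway shows up in the job log.
void logResidentMemory() {
  std::ifstream status("/proc/self/status");
  std::string line;
  while (std::getline(status, line)) {
    if (line.rfind("VmRSS:", 0) == 0) {   // e.g. "VmRSS:  162588084 kB"
      std::cout << "[mem-monitor] " << line << std::endl;
      break;
    }
  }
}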

Thanks,
Lincoln
--
Matt LeBlanc
University of Arizona
Office: 40/1-C11 (CERN)
https://cern.ch/mleblanc/


--
Matt LeBlanc
University of Arizona
Office: 40/1-C11 (CERN)
https://cern.ch/mleblanc/


