sphenix-software-l - [Sphenix-software-l] condor problems

sphenix-software-l AT lists.bnl.gov

Subject: sPHENIX discussion of software

  • From: pinkenburg <pinkenburg AT bnl.gov>
  • To: "sphenix-software-l AT lists.bnl.gov" <sphenix-software-l AT lists.bnl.gov>
  • Subject: [Sphenix-software-l] condor problems
  • Date: Wed, 5 Mar 2025 23:17:04 -0500

Hi folks,

As of a few days ago, our farm nodes have been crashing seemingly at random, which kills all condor jobs running on them. Depending on your job files, you might find this in your condor logs:

UserLog = "/tmp/fm_0_20/pass1/run27/condor-0000000027-004964.log"
VacateReason = "Job disconnected too long: JobLeaseDuration (3600 seconds) expired"

condor has no idea that the node crashed (the node does reboot, but condor isn't restarted). It tries to reconnect, then times out and puts the job (which is long gone) on hold.
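
To check whether your own jobs were hit, something like the following could be used to scan log files for the VacateReason line quoted above. This is only a sketch: it assumes the line shows up in your logs in exactly the quoted form, and the function names and the "*.log" file pattern are made up for illustration, not part of condor or our software.

```python
import re
from pathlib import Path

# Pattern modelled on the VacateReason line quoted in this mail (assumption:
# your logs contain it in the same 'VacateReason = "..."' form).
LEASE_EXPIRED = re.compile(
    r'^\s*VacateReason\s*=\s*"(?P<reason>[^"]*JobLeaseDuration[^"]*expired[^"]*)"'
)


def find_lease_expirations(text):
    """Return the VacateReason strings that mention an expired JobLeaseDuration."""
    reasons = []
    for line in text.splitlines():
        match = LEASE_EXPIRED.match(line)
        if match:
            reasons.append(match.group("reason"))
    return reasons


def scan_logs(log_dir, pattern="*.log"):
    """Map each matching log file under log_dir to the reasons found in it."""
    hits = {}
    for path in sorted(Path(log_dir).rglob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it
        reasons = find_lease_expirations(path.read_text(errors="replace"))
        if reasons:
            hits[str(path)] = reasons
    return hits
```

Calling scan_logs() on the directory where your condor logs end up returns the files that contain the message; the corresponding jobs are the ones that would need to be restarted.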

This is being looked at very actively, but it's a complete mystery so far; the syslogs just say kernel panic. We started with 640 nodes this afternoon (yes, we have our 60k condor slots back) and are down to 634 by now. With 100 cores per node, those 6 lost nodes amount to 600 condor jobs. Restarting the jobs is the only thing one can do.

I'll keep you posted

Chris

--
*************************************************************

Christopher H. Pinkenburg ; pinkenburg AT bnl.gov
; http://www.phenix.bnl.gov/~pinkenbu

Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000

*************************************************************


