sphenix-software-l AT lists.bnl.gov
Subject: sPHENIX discussion of software
[Sphenix-software-l] be mindful with your condor jobs
- From: pinkenburg <pinkenburg AT bnl.gov>
- To: "sphenix-software-l AT lists.bnl.gov" <sphenix-software-l AT lists.bnl.gov>
- Subject: [Sphenix-software-l] be mindful with your condor jobs
- Date: Tue, 18 Jul 2023 10:36:55 -0400
Hi folks,
you probably saw the mails about unresponsive nodes and mmfsd taking tons of CPU.
One way to cause this is running (lots of) wrongly configured condor jobs. There is a huge difference in the impact on our infrastructure between running a single condor job and running thousands of them. Condor log, error, and output files cause problems when you run a lot of jobs: they are written to continuously (in small batches), which keeps gpfs or lustre busy, and those file systems are set up to deal efficiently with large files.
Please have a look at our documentation (especially the condor job file):
https://wiki.sphenix.bnl.gov/index.php/Condor
and stick to it. Suggestions for improvement are welcome, but all of the entries exist for a reason, so please don't just write your own job file from scratch.
The condor log file has to go to /tmp, which keeps it on the submission host. Dumping huge output into the condor output and error files is also counterproductive (refrain from writing something for every event). Using the same output filenames for all jobs (no distinction via e.g. the process id) wreaks havoc: thousands of jobs try to write to the same file, and I am not sure how this is even handled in the backend.
Shipping input or output files via condor is not needed: everything is locally available, and the transfer taxes the caching on the submission machine (up to bringing condor down). The transfer mechanism can be used to move files to/from the local condor_scratch dir if you want to (the sim production just does a cp as part of its jobs). There is a section on this in the wiki, but never use it to transfer files from gpfs to gpfs (or lustre); that just adds I/O load.
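As a rough illustration of the points above (this is not the job file from the wiki, which remains the reference), here is a small Python sketch that renders a submit description with the log in /tmp, per-process output and error names, and condor file transfer switched off. The function name render_submit, the script name run_job.sh, the output directory, and the user name are all made up for the example; only the submit commands themselves (Log, Output, Error, should_transfer_files, Queue) and the $(Cluster)/$(Process) macros are standard HTCondor.

```python
def render_submit(executable: str, out_dir: str, user: str, n_jobs: int) -> str:
    """Return an HTCondor submit description as text (illustrative sketch only)."""
    lines = [
        "Universe = vanilla",
        f"Executable = {executable}",
        # Pass the process id to the job so it can name its own data files.
        "Arguments = $(Process)",
        # The condor log goes to /tmp, which keeps it on the submission host
        # and off gpfs/lustre.
        f"Log = /tmp/{user}_condor_$(Cluster)_$(Process).log",
        # One output/error file per job, distinguished by the process id,
        # so thousands of jobs never write to the same file.
        f"Output = {out_dir}/job_$(Process).out",
        f"Error = {out_dir}/job_$(Process).err",
        # Everything is locally available, so do not ship files via condor.
        "should_transfer_files = NO",
        f"Queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values; adjust to your own setup and compare with the wiki.
    print(render_submit("run_job.sh", "/path/to/output", "myuser", 1000))
```

The job itself should still keep its stdout/stderr small (for example, print a progress line every few thousand events rather than every event), since the per-job file names only avoid collisions and do not reduce the write load.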
Thanks,
Chris
--
*************************************************************
Christopher H. Pinkenburg ; pinkenburg AT bnl.gov
; http://www.phenix.bnl.gov/~pinkenbu
Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000
*************************************************************