atlas-connect-l AT lists.bnl.gov
Subject: Atlas-connect-l mailing list
Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
- From: Shawn McKee <smckee AT umich.edu>
- To: Rob Gardner <rwg AT hep.uchicago.edu>
- Cc: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>, Bob Ball <ball AT umich.edu>
- Subject: Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
- Date: Tue, 28 Jan 2014 21:13:18 -0500
Hi Rob,
Glad to see things are working. Let us know if you see any issues with AGLT2.
Thanks,
Shawn
On Tue, Jan 28, 2014 at 9:09 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
All,

just sending along a scale test - 30k jobs submitted. (We probably need to add the rcc.uchicago.edu pool to cycle server.)

Current snapshot,

[rwg@login ~]$ condor_q | grep R | wc
2462 29540 201882
[rwg@login ~]$

which seems to be the current “reach” of Atlas Connect, given our priority settings presently. I can’t wait until we get stampede connected.

The only snag I hit was the number of log files written to my home directory (I was asking for 90k files in a single directory). Once I reduced it to 30k, all proceeded smoothly.

Some plots with the distribution:

This shows we’re getting about 280 jobs going on AGLT2 at the moment, not bad.

Both UC3 and Fresno have topped out:

On Jan 28, 2014, at 6:59 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

back?

[rwg@login connect-quickstart]$ condor_q | grep R | wc
1338 16052 107396
[rwg@login connect-quickstart]$

On Jan 28, 2014, at 6:58 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

Well, I spoke too soon. Now I’m getting warnings sent to my shell, such as the below.

WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18589 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18499 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17995 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17986 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17977 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17968 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17959 is not writable by condor.

So - I’m obviously writing too many files to my home directory. I changed my submit file to have the jobs share err and log files.

Resubmitting 30k jobs.

Hopefully not making too much of a mess.

Ah, I have broken it:

[rwg@login connect-quickstart]$ condor_q
-- Failed to fetch ads from: <192.170.227.199:55997> : login.atlas.ci-connect.net
CEDAR:6001:Failed to connect to <192.170.227.199:55997>

- Rob

On Jan 28, 2014, at 6:48 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

Lincoln,

I might have broken AC. I was testing my luck - submitted 30k jobs, attempted to see what I might get.

At some point, during the submission, I got a bunch of error messages - could not write to the home directory err, out, and log files.

I removed all the jobs.

Re-submitted 20k jobs, this time without errors.

------------
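A note on the log-file snag in the thread: besides sharing err and log files across jobs, another possible way to avoid tens of thousands of files in one directory is to spread the per-job files over numbered subdirectories. The following is only a rough sketch of that idea, not what was actually done here. The function names, the bucket size of 1000, and the reading of "184-18589" as cluster 184, process 18589 are all assumptions made for illustration; the subdirectories would presumably need to exist before the jobs are submitted.

```python
from pathlib import PurePosixPath


def bucketed_log_path(log_dir, cluster, proc, kind="out", bucket_size=1000):
    """Return a per-job log path placed in a numbered subdirectory.

    Hypothetical helper: jobs are grouped by proc // bucket_size, so no
    single directory holds more than bucket_size files of each kind.
    The file name pattern job.<kind>.<cluster>-<proc> mirrors the names
    seen in the warnings; interpreting the two numbers as cluster and
    process IDs is an assumption.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    bucket = proc // bucket_size
    name = f"job.{kind}.{cluster}-{proc}"
    return PurePosixPath(log_dir) / f"{bucket:03d}" / name


def files_per_directory(n_jobs, kinds=("out", "err", "log"), bucket_size=1000):
    """Worst-case number of files that end up in one bucket directory."""
    return min(n_jobs, bucket_size) * len(kinds)


if __name__ == "__main__":
    # One of the file names from the warnings, re-homed into a bucket.
    print(bucketed_log_path("/home/rwg/connect-quickstart/log", 184, 18589))
    # 30k jobs with out/err/log each: flat layout vs. bucketed layout.
    print(30000 * 3, files_per_directory(30000))
```

With these assumed settings, the 30k-job test would put at most 3000 files in any one directory instead of 90k in a single one.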