atlas-connect-l AT lists.bnl.gov
Subject: Atlas-connect-l mailing list
Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
- From: Rob Gardner <rwg AT hep.uchicago.edu>
- To: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
- Cc: Shawn McKee <smckee AT umich.edu>, Bob Ball <ball AT umich.edu>
- Subject: Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
- Date: Tue, 28 Jan 2014 20:09:44 -0600
All,
just sending along a scale test - 30k jobs submitted. (We probably need to add the rcc.uchicago.edu pool to cycle server.)
Current snapshot,
[rwg@login ~]$ condor_q | grep R | wc
2462 29540 201882
[rwg@login ~]$
which seems to be the current “reach” of Atlas Connect, given our present priority settings. I can’t wait until we get Stampede connected.
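A note on the count above: `grep R` matches any line that contains a capital R anywhere (owner, command name, and so on), so the figure is approximate. A slightly stricter sketch follows. It assumes the classic `condor_q` table layout, in which the job state (`ST`) is the sixth whitespace-separated field; `count_running` is a hypothetical helper name, not part of HTCondor.

```shell
#!/bin/sh
# count_running: read condor_q-style table output on stdin and count the
# rows whose state column is exactly "R".
# Assumption: columns are ID OWNER SUBMITTED(date time) RUN_TIME ST ...,
# which makes ST field 6. Adjust the field number if your condor_q
# prints a different layout.
count_running() {
    awk '$6 == "R" { n++ } END { print n + 0 }'
}

# Usage (on a submit host):
#   condor_q | count_running
```

Asking the scheduler directly, for example with `condor_q -constraint 'JobStatus == 2'`, should select the same running jobs without relying on the table layout.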
The only snag I hit was the number of log files written to my home directory (I was asking for 90k files in a single directory). Once I reduced it to 30k, all proceeded smoothly.
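One way around the single-directory limit is to spread the per-job files over subdirectories. The sketch below is only an illustration: the `log/NNN` layout, the bucket size of 1000, and the helper names `bucket_for` and `make_buckets` are assumptions, not what was actually used in this test.

```shell
#!/bin/sh
# Spread per-job output over subdirectories so that no single directory
# has to hold tens of thousands of files.
# All names and sizes here are illustrative assumptions.

BUCKET_SIZE=1000   # jobs per subdirectory (hypothetical choice)

# bucket_for PROC: print the zero-padded bucket name for a job index.
bucket_for() {
    printf '%03d\n' $(( $1 / BUCKET_SIZE ))
}

# make_buckets BASE NJOBS: create BASE/000, BASE/001, ... for NJOBS jobs.
make_buckets() {
    base=$1
    njobs=$2
    last=$(( (njobs - 1) / BUCKET_SIZE ))
    i=0
    while [ "$i" -le "$last" ]; do
        mkdir -p "$base/$(printf '%03d' "$i")"
        i=$(( i + 1 ))
    done
}

# Usage:
#   make_buckets log 30000        # creates log/000 ... log/029
#   bucket_for 18589              # prints 018
```

The submit file (or a wrapper that generates it) would then need to point each job's out, err, and log paths at its bucket; how best to do that depends on the HTCondor version and is not shown here.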
Some plots with the distribution:

This shows we’re getting about 280 jobs going on AGLT2 at the moment, not bad.

Both UC3 and Fresno have topped out:


On Jan 28, 2014, at 6:59 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
back?

[rwg@login connect-quickstart]$ condor_q | grep R | wc
1338 16052 107396
[rwg@login connect-quickstart]$

On Jan 28, 2014, at 6:58 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

Well, I spoke too soon. Now I’m getting warnings sent to my shell, such as the below.

WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18589 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18499 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17995 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17986 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17977 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17968 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17959 is not writable by condor.

So - I’m obviously writing too many files to my home directory. I changed my submit file to have the jobs share err and log files.

Resubmitting 30k jobs.

Hopefully not making too much a mess.

Ah, I have broken it:

[rwg@login connect-quickstart]$ condor_q
-- Failed to fetch ads from: <192.170.227.199:55997> : login.atlas.ci-connect.net
CEDAR:6001:Failed to connect to <192.170.227.199:55997>

- Rob

On Jan 28, 2014, at 6:48 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

Lincoln,

I might have broken AC. I was testing my luck - submitted 30k jobs, attempted to see what I might get.

At some point, during the submission, I got a bunch of error messages - could not write to the home directory err, out, and log files.

I removed all the jobs.

Re-submitted 20k jobs, this time without errors.
---
Rob Gardner • Twitter: @rwg • Skype: rwg773 • g+: rob.rwg • +1 312-804-0859 • University of Chicago