atlas-connect-l AT lists.bnl.gov
Subject: Atlas-connect-l mailing list
Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
- From: Rob Gardner <rwg AT hep.uchicago.edu>
- To: Bob Ball <ball AT umich.edu>
- Cc: Shawn McKee <smckee AT umich.edu>, atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
- Date: Wed, 29 Jan 2014 10:22:21 -0600
Hi Bob,
Looks like we got 5324 jobs through aglt2, or 53k CPU-minutes.
- Rob
[rwg@login log]$ cat job.out.* | grep aglt2 | wc
5324 26620 326072
[rwg@login log]$ cat job.out.* | grep fresno | wc
7040 37440 558440
[rwg@login log]$ cat job.out.* | grep golub | wc
6304 31520 359328
[rwg@login log]$ cat job.out.* | grep taub | wc
0 0 0
[rwg@login log]$ cat job.out.* | grep uct2 | wc
11450 57250 709900
[rwg@login log]$ cat job.out.* | grep iut2 | wc
7452 37260 454572
[rwg@login log]$ cat job.out.* | grep uc3 | wc
10080 65520 559440
[rwg@login log]$ cat job.out.* | grep midway | wc
15870 79350 921333
[rwg@login log]$ cat job.out.* | grep complete | wc
30000 60000 540000
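
A note on the tallies above: the same per-site count can be produced in one loop. This is only a sketch. The function name `count_by_site` and the log directory argument are made up for illustration; the site keywords are the ones grepped for above. It counts matching lines, which equals jobs only if each job writes exactly one line naming its site. The per-site line counts above add up to 63,520 against 30,000 "complete" lines, so some care is needed when reading them as job counts.

```shell
# Tally lines in job.out.* that mention each site keyword.
# count_by_site LOGDIR KEYWORD... (hypothetical helper, not from the thread)
count_by_site() {
    logdir=$1
    shift
    for site in "$@"; do
        # grep -c prints the number of matching lines; it exits non-zero
        # when that number is 0, hence the "|| true".
        n=$(cat "$logdir"/job.out.* 2>/dev/null | grep -c -- "$site" || true)
        printf '%s %s\n' "$site" "$n"
    done
}
```

Usage, run from the directory above the log files: `count_by_site log aglt2 fresno golub taub uct2 iut2 uc3 midway complete`.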
On Jan 29, 2014, at 9:58 AM, Bob Ball <ball AT umich.edu> wrote:
Looks like AGLT2 peaked at around 380 out of a possible max allocation of 500 before the supply of jobs ran out.
bob
On 1/28/2014 9:13 PM, Shawn McKee wrote:
Hi Rob,
Glad to see things are working. Let us know if you see any issues with AGLT2.
Thanks,
Shawn
On Tue, Jan 28, 2014 at 9:09 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
All,
Just sending along a scale test: 30k jobs submitted. (We probably need to add the rcc.uchicago.edu pool to the cycle server.)
Current snapshot,
[rwg@login ~]$ condor_q | grep R | wc
2462 29540 201882
which seems to be the current “reach” of Atlas Connect, given our present priority settings. I can’t wait until we get Stampede connected.
The only snag I hit was the number of log files written to my home directory (I was asking for 90k files in a single directory). Once I reduced it to 30k, all proceeded smoothly.
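
For reference, the 90k figure is presumably three files (out, err, log) for each of the 30k jobs, so sharing the err and log files leaves one out file per job. A minimal sketch of a submit file with that layout, written out from the shell, might look like the following. The helper name `write_submit_file`, the executable `job.sh`, and the shared file names are assumptions and not taken from the actual submit file; only the `job.out.<cluster>-<process>` naming follows the warnings quoted further down.

```shell
# Write a hypothetical HTCondor submit file in which every job shares one
# error file and one log file but keeps its own output file.
write_submit_file() {
    # Quoted heredoc so the shell leaves $(Cluster) and $(Process) alone.
    cat > "$1" <<'EOF'
universe   = vanilla
# hypothetical job script, not from the thread
executable = job.sh
# one output file per job: 30k files for 30k jobs
output     = log/job.out.$(Cluster)-$(Process)
# shared by all jobs in the cluster
error      = log/job.err
log        = log/job.log
queue 30000
EOF
}
```

Usage: `write_submit_file scale-test.submit && condor_submit scale-test.submit`. With a shared error file, stderr from different jobs is not kept apart, which is acceptable for a scale test but probably not for production work.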
Some plots with the distribution:
<Mail Attachment.png>
This shows we’re getting about 280 jobs running on AGLT2 at the moment, not bad.
<Mail Attachment.png>
Both UC3 and Fresno have topped out:
<Mail Attachment.png>
<Mail Attachment.png>
On Jan 28, 2014, at 6:59 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
back?
[rwg@login connect-quickstart]$ condor_q | grep R | wc
1338 16052 107396
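
A note on `condor_q | grep R | wc`: it counts every line containing a capital R, which can include the header line (OWNER, RUN_TIME), so the figure is approximate. A stricter filter could look like the sketch below. It assumes the classic one-line-per-job `condor_q` layout with the ST (status) column as the sixth field; that layout, and the helper name `count_running`, are assumptions here.

```shell
# Count running jobs from per-job condor_q output read on stdin.
# Assumes lines such as:
#   184.0  rwg  1/28 18:48  0+00:01:02 R  0  0.0 job.sh
# where field 6 is the job status (layout is an assumption).
count_running() {
    awk '$6 == "R" { n++ } END { print n + 0 }'
}
```

Usage: `condor_q | count_running`. Where the installed version supports it, `condor_q -constraint 'JobStatus == 2'` selects running jobs on the server side.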
On Jan 28, 2014, at 6:58 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
Well, I spoke too soon. Now I’m getting warnings sent to my shell, such as the below.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18589 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18499 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17995 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17986 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17977 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17968 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17959 is not writable by condor.
So - I’m obviously writing too many files to my home directory. I changed my submit file to have the jobs share err and log files.
Resubmitting 30k jobs.
Hopefully not making too much of a mess.
Ah, I have broken it:
[rwg@login connect-quickstart]$ condor_q
-- Failed to fetch ads from: <192.170.227.199:55997> : login.atlas.ci-connect.net
CEDAR:6001:Failed to connect to <192.170.227.199:55997>
- Rob
On Jan 28, 2014, at 6:48 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
Lincoln,
I might have broken AC. I was testing my luck - submitted 30k jobs, attempted to see what I might get.
At some point during the submission I got a bunch of error messages: the err, out, and log files could not be written to the home directory.
I removed all the jobs.
Re-submitted 20k jobs, this time without errors.
---
Rob Gardner • Twitter: @rwg • Skype: rwg773 • g+: rob.rwg • +1 312-804-0859 • University of Chicago