
atlas-connect-l - Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch

atlas-connect-l AT lists.bnl.gov

Subject: Atlas-connect-l mailing list

  • From: Bob Ball <ball AT umich.edu>
  • To: Shawn McKee <smckee AT umich.edu>, Rob Gardner <rwg AT hep.uchicago.edu>
  • Cc: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] Another heads up - 30k job submission failed; resubmitted a 20k batch
  • Date: Wed, 29 Jan 2014 10:58:56 -0500

Looks like AGLT2 peaked at around 380 out of a possible max allocation of 500 before the supply of jobs ran out.

bob

On 1/28/2014 9:13 PM, Shawn McKee wrote:
Hi Rob,

Glad to see things are working. Let us know if you see any issues with AGLT2.

Thanks,

Shawn


On Tue, Jan 28, 2014 at 9:09 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
All, 

Just sending along a scale test: 30k jobs submitted. (We probably need to add the rcc.uchicago.edu pool to cycle server.)

Current snapshot,

[rwg@login ~]$ condor_q | grep R | wc
   2462   29540  201882
[rwg@login ~]$ 

which seems to be the current “reach” of Atlas Connect, given our present priority settings. I can’t wait until we get Stampede connected.
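A note on the snapshot above: `grep R` matches any line that contains a capital R, which can include header lines or job names, so the first `wc` number (2462) is best read as an approximate count of running jobs. A minimal sketch of a stricter count follows, assuming it runs on a host where `condor_q` is available. The helper names are hypothetical; `JobStatus == 2` is HTCondor's code for a running job.

```python
import subprocess

# HTCondor encodes job state in the JobStatus attribute; 2 means "running".
RUNNING = 2


def running_jobs_command():
    """Build a condor_q call that prints one line per running job."""
    return [
        "condor_q",
        "-constraint", f"JobStatus == {RUNNING}",
        "-format", "%d\n", "ClusterId",
    ]


def count_lines(text):
    """Count non-empty lines, i.e. one per matched job."""
    return sum(1 for line in text.splitlines() if line.strip())


def count_running_jobs():
    """Run condor_q (requires an HTCondor submit host) and count running jobs."""
    result = subprocess.run(
        running_jobs_command(), capture_output=True, text=True, check=True
    )
    return count_lines(result.stdout)
```

Calling `count_running_jobs()` on the login node would then return the number of jobs in the running state, without picking up unrelated lines.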

The only snag I hit was the number of log files written to my home directory (I was asking for 90k files in a single directory).  Once I reduced it to 30k, all proceeded smoothly.
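The file counts quoted here follow from simple arithmetic, sketched below. It assumes each job originally wrote its own out, err, and log file (the three file types named earlier in this thread), and that after the change only the out file stayed per-job. The function name is hypothetical.

```python
def files_in_log_dir(n_jobs, per_job_streams, shared_streams):
    """Files created in one directory: one per job for each per-job stream,
    plus a single file for each stream shared by all jobs."""
    return n_jobs * per_job_streams + shared_streams


# Separate out, err, and log files for each of 30,000 jobs.
before = files_in_log_dir(30_000, per_job_streams=3, shared_streams=0)

# Per-job out files only; err and log shared by all jobs.
after = files_in_log_dir(30_000, per_job_streams=1, shared_streams=2)

print(before, after)  # 90000 30002
```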

Some plots with the distribution:



This shows we’re getting about 280 jobs going on AGLT2 at the moment, not bad.




Both UC3 and Fresno have topped out:







On Jan 28, 2014, at 6:59 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

back?


[rwg@login connect-quickstart]$ condor_q | grep R | wc
   1338   16052  107396
[rwg@login connect-quickstart]$ 




On Jan 28, 2014, at 6:58 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

Well, I spoke too soon. Now I’m getting warnings sent to my shell, such as those below.


WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18589 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-18499 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17995 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17986 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17977 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17968 is not writable by condor.
WARNING: File /home/rwg/connect-quickstart/log/job.out.184-17959 is not writable by condor.


So, I’m obviously writing too many files to my home directory. I changed my submit file to have the jobs share err and log files.
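The actual submit file is not shown in this thread, so the following is only a sketch of what such a change might look like: a small generator that emits an HTCondor submit description with one stdout file per job and a single shared stderr file and event log. The executable name, the `job.err` and `job.log` file names, and the function itself are assumptions; the per-job naming pattern is inferred from the file names in the warnings.

```python
def make_submit_description(executable, n_jobs, log_dir="log"):
    """Return an HTCondor submit description as text.

    Only stdout is written per job; stderr and the event log are shared.
    """
    lines = [
        f"executable = {executable}",
        # One stdout file per job, named like the files in the warnings.
        f"output = {log_dir}/job.out.$(Cluster)-$(Process)",
        # A single stderr file and a single event log for all jobs.
        f"error = {log_dir}/job.err",
        f"log = {log_dir}/job.log",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical executable name, for illustration only.
    print(make_submit_description("job.sh", 30000))
```

One trade-off worth keeping in mind: a shared event log is a common HTCondor pattern, but a shared stderr file can be overwritten as different jobs finish, so it suits scale tests better than debugging.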

Resubmitting 30k jobs.

Hopefully not making too much of a mess.

Ah, I have broken it:


[rwg@login connect-quickstart]$ condor_q 

-- Failed to fetch ads from: <192.170.227.199:55997> : login.atlas.ci-connect.net
CEDAR:6001:Failed to connect to <192.170.227.199:55997>


- Rob




On Jan 28, 2014, at 6:48 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:

Lincoln,

I might have broken AC.  I was testing my luck - submitted 30k jobs, attempted to see what I might get.

At some point, during the submission, I got a bunch of error messages - could not write to the home directory err, out, and log files.

I removed all the jobs.

Re-submitted 20k jobs, this time without errors.

---
Rob Gardner • Twitter: @rwg • Skype: rwg773 • g+: rob.rwg • +1 312-804-0859 • University of Chicago





Attachment: pngiKjZhdYqGP.png
Description: PNG image

Attachment: pngcmIUmS6bFc.png
Description: PNG image

Attachment: png01N1S4ZDvf.png
Description: PNG image

Attachment: pngIMDFfzGl6T.png
Description: PNG image



