atlas-connect-l - Re: [Atlas-connect-l] No open nodes on ATLAS Connect condor cluster

atlas-connect-l AT lists.bnl.gov

Subject: Atlas-connect-l mailing list

  • From: Matthew Epland <matthewepland AT gmail.com>
  • To: Lincoln Bryant <lincolnb AT uchicago.edu>
  • Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] No open nodes on ATLAS Connect condor cluster
  • Date: Wed, 2 Jan 2019 15:34:40 -0500

Hi Lincoln,

It worked alright yesterday, but I just launched some more jobs and they appear to be stuck in idle again. Could you take a look at them?

Thanks,
Matt

On Tue, Jan 1, 2019 at 12:41 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Great, though I think there's still a bug that I am investigating. I think when we updated the version of Condor, it broke the script that monitors the job queue.

I'll let you know when I've found out more. Until then I'll keep my eyes peeled and make sure your jobs are running.

--Lincoln


On 1/1/2019 11:17 AM, Matthew Epland wrote:
Hi Lincoln,

The new jobs all appear to be running now! Must have been some bug in the scheduler / VM creation script.

Thanks,
Matt

On Tue, Jan 1, 2019 at 12:14 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Please go ahead and I'll keep an eye on things / manually add more workers if need be.

--Lincoln


On 1/1/2019 11:11 AM, Matthew Epland wrote:
Hi Lincoln,

Thank you for looking into this! Unfortunately I need to run these 135 fits multiple times, making changes by hand between each iteration, so I can't group them up into a larger job. Something might be wonky with the scheduler though, as all my jobs from yesterday have completed and condor_q shows no jobs, but grafana still says I have 71 idle. I'm ready to submit another batch of 135. Please let me know when it would be a good time to try it again.

Thanks,
Matt

On Tue, Jan 1, 2019 at 12:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
On 1/1/2019 10:57 AM, Lincoln Bryant wrote:
On 12/31/2018 3:48 PM, Matthew Epland wrote:
Hello,

It appears that rnarayan is using all 1000 available worker nodes on the ATLAS Connect condor cluster. I need to run 135 jobs, frequently, for the next few days in order to get an urgent result out. Would it be possible to add more worker VMs? Is there any user priority system so one person doesn't tie up the whole system?

Thanks,
Matt

--
Matthew Epland
651.773.9352
Hi Matthew,

Let me look into this. There should be more slots spinning up as a response to demand.

Cheers,
Lincoln


_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l

_______________________________________________
ATLAS Midwest Tier2 mailing list
http://mwt2.usatlasfacility.org
OK, it looks like your jobs did sit in the queue for quite a long time, which is unusual.

I am looking into where this 1000 slot limit may be coming from. MWT2 is actually largely idle / backfill right now so there should be plenty of resources.

Is it possible to batch more than 135 at a time, or to do multiple batches concurrently? That will help.

Thanks,
Lincoln




