atlas-connect-l AT lists.bnl.gov
Subject: Atlas-connect-l mailing list
List archive
Re: [Atlas-connect-l] No open nodes on ATLAS Connect condor cluster
- From: Matthew Epland <matthewepland AT gmail.com>
- To: Lincoln Bryant <lincolnb AT uchicago.edu>, <jlstephen AT uchicago.edu>
- Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] No open nodes on ATLAS Connect condor cluster
- Date: Thu, 3 Jan 2019 12:45:26 -0500
Hi guys,
I was able to run one set of jobs earlier today, but now I am intermittently unable to connect to the cluster (see below), and the dashboard is getting crazy spiky. Any ideas on what is causing the instability?
Thanks,
Matt
[mepland@login MBJ_HistFitter]$ condor_q
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
mepland CMD: condor_job.sh 1/3 11:40 _ _ 135 135 433640.0-134
135 jobs; 0 completed, 0 removed, 135 idle, 0 running, 0 held, 0 suspended
[mepland@login MBJ_HistFitter]$ condor_q
-- Failed to fetch ads from: <192.170.231.50:9618?addrs=192.170.231.50-9618+[--1]-9618&noUDP&sock=3069740_33d9_3> : login.usatlas.org
SECMAN:2007:Failed to end classad message.
[mepland@login MBJ_HistFitter]$ condor_q
-- Failed to fetch ads from: <192.170.231.50:9618?addrs=192.170.231.50-9618+[--1]-9618&noUDP&sock=3069740_33d9_3> : login.usatlas.org
SECMAN:2007:Failed to end classad message.
[mepland@login MBJ_HistFitter]$ condor_q
-- Failed to fetch ads from: <192.170.231.50:9618?addrs=192.170.231.50-9618+[--1]-9618&noUDP&sock=3069740_33d9_3> : login.usatlas.org
SECMAN:2007:Failed to end classad message.
[mepland@login MBJ_HistFitter]$ condor_q
-- Schedd: login.usatlas.org : <192.170.231.50:9618?... @ 01/03/19 11:43:46
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE TOTAL JOB_IDS
mepland CMD: condor_job.sh 1/3 11:40 _ 121 14 135 433640.0-134
135 jobs; 0 completed, 0 removed, 14 idle, 121 running, 0 held, 0 suspended
On Wed, Jan 2, 2019 at 3:34 PM Matthew Epland <matthewepland AT gmail.com> wrote:
Hi Lincoln,
It worked alright yesterday, but I just launched some more jobs and they appear to be stuck in idle again. Could you take a look at them?
Thanks,
Matt
On Tue, Jan 1, 2019 at 12:41 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Great - though I think there's still a bug that I am investigating. I think when we updated the version of Condor, it broke the script that monitors the job queue.
I'll let you know when I've found out more. Until then I'll keep my eyes peeled and make sure your jobs are running.
--Lincoln
On 1/1/2019 11:17 AM, Matthew Epland wrote:
Hi Lincoln,
The new jobs all appear to be running now! Must have been some bug in the scheduler / VM creation script.
Thanks,
Matt
On Tue, Jan 1, 2019 at 12:14 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Please go ahead and I'll keep an eye on things / manually add more workers if need be.
--Lincoln
On 1/1/2019 11:11 AM, Matthew Epland wrote:
Hi Lincoln,
Thank you for looking into this! Unfortunately I need to run these 135 fits multiple times, making changes by hand between each iteration, so I can't group them up into a larger job. Something might be wonky with the scheduler though, as all my jobs from yesterday have completed and condor_q shows no jobs, but grafana still says I have 71 idle. I'm ready to submit another batch of 135. Please let me know when it would be a good time to try it again.
Thanks,
Matt
On Tue, Jan 1, 2019 at 12:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
OK, it looks like your jobs did sit in queue for quite a long time which is unusual.
I am looking into where this 1000 slot limit may be coming from. MWT2 is actually largely idle / backfill right now so there should be plenty of resources.
Is it possible to batch more than 135 at a time, or to do multiple batches concurrently? That will help.
Thanks,
Lincoln
On 1/1/2019 10:57 AM, Lincoln Bryant wrote:
Hi Matthew,
Let me look into this. There should be more slots spinning up as a response to demand.
Cheers,
Lincoln
On 12/31/2018 3:48 PM, Matthew Epland wrote:
Hello,
It appears that rnarayan is using all 1000 available worker nodes on the ATLAS Connect condor cluster. I need to run 135 jobs, frequently, for the next few days in order to get an urgent result out. Would it be possible to add more worker VMs? Is there any user priority system so one person doesn't tie up the whole system?
Thanks,
Matt
--
Matthew Epland
651.773.9352
_______________________________________________ Atlas-connect-l mailing list Atlas-connect-l AT lists.bnl.gov https://lists.bnl.gov/mailman/listinfo/atlas-connect-l
_______________________________________________ ATLAS Midwest Tier2 mailing list http://mwt2.usatlasfacility.org