atlas-connect-l - Re: [Atlas-connect-l] login.usatlas.org down?

  • From: Lincoln Bryant <lincolnb AT uchicago.edu>
  • To: Matthew Epland <matthew.epland AT cern.ch>
  • Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] login.usatlas.org down?
  • Date: Sun, 13 Jan 2019 17:50:58 +0000

Hi folks,

It looks like something is generating a lot of file I/O, either through /faxbox or /home, when someone's jobs are running. This causes Condor to block on the I/O, so condor_q will not respond. I rebooted the node and added an additional scratch disk.

If you are using any of Condor's file transfer mechanisms (e.g. transfer_input_files / transfer_output_files), could I ask you to copy your files to /scratch first and have Condor pick up the jobs from there? This storage is local to the machine. I just created a 2TB volume here - let me know if more is needed. This should avoid some of the network I/O that might be causing things to hang.
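For example, something along these lines should work (run_analysis.sh and input.root are just placeholders for your own executable and data - adjust to taste):

mkdir -p /scratch/$USER
cp /faxbox/user/$USER/input.root /scratch/$USER/   # stage input onto the local scratch disk
cat > job.sub <<EOF
executable              = run_analysis.sh
transfer_input_files    = /scratch/$USER/input.root
transfer_output_files   = output.root
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
output = job.out
error  = job.err
log    = job.log
queue
EOF
condor_submit job.sub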

Alternatively, you can use Faxbox via its XRootD door if you are using xrdcp.
e.g. root://faxbox.usatlas.org//faxbox2/user/lincolnb/ maps to /faxbox/user/lincolnb on the login node.
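For example, to copy a file to and from Faxbox through the door (myfile.root is a placeholder, and substitute your own username for mine):

xrdcp myfile.root root://faxbox.usatlas.org//faxbox2/user/lincolnb/myfile.root
xrdcp root://faxbox.usatlas.org//faxbox2/user/lincolnb/myfile.root /scratch/$USER/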

Let me know if you have any questions.

Thanks,
Lincoln

On 1/13/2019 10:25 AM, Matthew Epland wrote:
Hi guys,

It looks like something broke with the condor system again this morning; everything went offline around 8 AM CST according to grafana. I can't connect to the cluster when running condor_q.

Yesterday the login machine was acting strangely as well. It hung while sourcing ~/.bashrc on login, and I couldn't rsync or scp files from remote machines - though I could rsync them out from a session on login. Both issues appear to have sorted themselves out, however.

Thanks,
Matt Epland

On Mon, Jan 7, 2019 at 11:36 AM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Ah, no, that's me restarting Condor.

Anyhow, I found a firewall issue that is now resolved.

Can you try submitting again?

Thanks,
Lincoln


On Mon, 2019-01-07 at 17:33 +0100, Matt LeBlanc wrote:
> Hi Lincoln,
>
> It looks like I may have broken the condor daemon?
>
> login:~ mleblanc$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd is not
> running, the Condor system will not be able to find an address and port to
> connect to and satisfy this request. Please make sure the Condor daemons are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
>
> Cheers,
> Matt
>
> On Mon, Jan 7, 2019 at 5:10 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
> > Hi Lincoln,
> >
> > No problem! I'll empty it, and start a few at a time in a little
> > while from now.
> >
> > Cheers,
> > Matt
> >
> > On Mon, Jan 7, 2019 at 5:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > > Matt,
> > >
> > > I found at least one problem that is fixed now. However, there are a
> > > _lot_ of jobs in the queue (70k). Is there any way you could temporarily
> > > remove some of the queued jobs and reduce it to, say, <20k or <10k? I
> > > think our glidein system is timing out trying to query the condor queue
> > > for your jobs.
> > >
> > > Thanks,
> > > Lincoln
> > >
> > > On Mon, 2019-01-07 at 14:21 +0000, Lincoln Bryant wrote:
> > > > Hi Matt,
> > > > 
> > > > Will take a look.
> > > > 
> > > > --Lincoln
> > > > 
> > > > On 1/7/2019 4:24 AM, Matt LeBlanc wrote:
> > > > > Hi Lincoln,
> > > > > 
> > > > > All of my open sessions crashed a few minutes ago similarly to how
> > > > > they broke on Saturday. I am able to log in already, though my
> > > > > condor jobs appear to be stuck idle in the queue.
> > > > > 
> > > > > Cheers,
> > > > > Matt
> > > > > 
> > > > > On Sat, Jan 5, 2019 at 8:08 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > On 1/4/2019 4:36 PM, Matthew Epland wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > Is login.usatlas.org down? My ssh connection just broke and I
> > > > > > > cannot reconnect.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Matt
> > > > > > >
> > > > > > Hi Matt,
> > > > > > 
> > > > > > We had a hypervisor issue at UChicago. You should be able to log in
> > > > > > again. I am working on restoring Condor services now.
> > > > > > 
> > > > > > --Lincoln
> > > > > > 
> > > > > > 
> > >
> >
> >
> > -- 
> > Matt LeBlanc
> > University of Arizona
> > Office: 40/1-C11 (CERN)
> > https://cern.ch/mleblanc/
> >
>
>


--
Matthew Epland
Duke University Department of Physics



