atlas-connect-l - Re: [Atlas-connect-l] login.usatlas.org down?

  • From: Matt LeBlanc <matt.leblanc AT cern.ch>
  • To: Matthew Epland <matthew.epland AT cern.ch>
  • Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] login.usatlas.org down?
  • Date: Mon, 21 Jan 2019 20:46:38 +0100

Hi all,

I've had ~8 jobs stuck in idle for >12 hours now as well. They flip between R and I, actually, but never finish ...

Cheers,
Matt
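
(Aside: HTCondor can usually say why jobs are sitting idle. A minimal sketch of the standard queries, with the cluster id and username below used purely as illustrative values taken from this thread:)

    # ask the schedd why a cluster is not matching/starting (cluster id is illustrative)
    condor_q -better-analyze 434109
    # print status and any hold reason for one submitter's jobs
    condor_q -submitter mleblanc -af ClusterId ProcId JobStatus HoldReason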

On Mon, Jan 21, 2019 at 4:39 PM Matthew Epland <matthew.epland AT cern.ch> wrote:
Hi Lincoln,

My jobs this morning have been stuck in idle for a while now. Could you please check the cluster? If all goes well, today (and maybe tomorrow) should be the last day I am running things, as I unblind!

Thanks,
Matt Epland

On Mon, Jan 14, 2019 at 10:49 AM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
CVMFS looks OK to me now - clearly we're having a litany of
intermittent problems with this host.

I will investigate today and see if there are any underlying problems.

--Lincoln
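
(Aside: a couple of standard checks for a wedged /cvmfs mount, shown here only as a sketch of the usual diagnostics:)

    # ask the CVMFS client to probe the ATLAS repository
    cvmfs_config probe atlas.cern.ch
    # the script behind setupATLAS should be visible if the mount is healthy
    ls /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase/user/atlasLocalSetup.sh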

On Mon, 2019-01-14 at 10:16 +0100, Matt LeBlanc wrote:
> Sorry -- what I should have said is, I'm unable to run setupATLAS
> this morning. I guess the /cvmfs mount is broken?
>
> Cheers,
> Matt
>
> On Mon, Jan 14, 2019 at 10:15 AM Matt LeBlanc <matt.leblanc AT cern.ch>
> wrote:
> > Hi all,
> >
> > I'm unable to log in this morning. Things hang after
> >
> > mlb-macbookpro:~ mleblanc$ usc
> > Last login: Sun Jan 13 15:44:37 2019 from x
> > Welcome to ATLAS Connect.
> >
> > For registration or login problems:  support AT connect.usatlas.org
> > ATLAS Connect user forum:  atlas-connect-l AT lists.bnl.gov
> > For additional documentation and examples: http://connect.usatlas.org/
> >
> > Cheers,
> > Matt
> >
> >
> > On Sun, Jan 13, 2019 at 11:08 PM Matthew Epland <matthew.epland@cern.ch> wrote:
> > > Hi Lincoln,
> > >
> > > I tried moving my input data files to /scratch:
> > >
> > > [mepland@login logs]$ ls /scratch/input_files/
> > > Bkg_21.2.55_mc16a.diboson_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16a.singletop_hybrid_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16a.topEW_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16a.ttbar_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16a.W_jets_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16a.Z_jets_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16d.diboson_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16d.singletop_hybrid_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16d.topEW_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16d.W_jets_preprocessed_with_predictions.root
> > > Bkg_21.2.55_mc16d.Z_jets_preprocessed_with_predictions.root
> > > Data_21.2.55_20152016_preprocessed_with_predictions.root
> > > Data_21.2.55_2017_preprocessed_with_predictions.root
> > > Sig_21.2.55_mc16a.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
> > > Sig_21.2.55_mc16d.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
> > >
> > > but the condor worker VMs aren't picking them up
> > >
> > > <ERROR> PrepareHistos: input file /scratch/input_files/Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root does not exist - cannot load ttbar_nominal from it
> > >
> > > I can stick with my present method of copying the files over directly
> > > in a tarball, but it does take a while to get them running. Any
> > > suggestions?
> > >
> > > Thanks,
> > > Matt
> > >
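
(Aside: this is roughly what the transfer_input_files approach Lincoln describes further down would look like in a submit file; run_fit.sh is a hypothetical wrapper and only one input file is shown. The point is that /scratch lives on the login node, so the worker only sees the transferred copy under its base name in the job sandbox, not under the /scratch path.)

    # sketch of a submit file that ships a /scratch input along with the job
    universe                = vanilla
    executable              = run_fit.sh   # hypothetical wrapper script
    arguments               = Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
    transfer_input_files    = /scratch/input_files/Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    output                  = job_$(Cluster)_$(Process).out
    error                   = job_$(Cluster)_$(Process).err
    log                     = job_$(Cluster).log
    queue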
> > > On Sun, Jan 13, 2019 at 1:43 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
> > > > Hi all,
> > > >
> > > > I notice that my 'watch condor_q' breaks whenever I am *submitting*
> > > > jobs to the condor queue, i.e. while the dots are being printed out
> > > > in messages such as
> > > >
> > > > INFO:root:creating driver
> > > > Submitting
> > > > job(s).........................................................
> > > > ...............................................................
> > > > .....................
> > > > 141 job(s) submitted to cluster 434109.
> > > >
> > > > So perhaps it is related to the condor driver sending inputs to
> > > > the nodes?
> > > >
> > > > I generally access input files via the XRootD door, so I guess
> > > > it isn't the staging of input files themselves, but rather the
> > > > analysis code ...
> > > >
> > > > Cheers,
> > > > Matt
> > > >
> > > > On Sun, Jan 13, 2019 at 6:50 PM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > Hi folks,
> > > > >
> > > > > It looks like something is causing a lot of file I/O, either
> > > > > through /faxbox or /home when someone's jobs are running.
> > > > > This is causing Condor to block on the I/O, so condor_q will
> > > > > not respond. I rebooted the node and added an additional
> > > > > scratch disk.
> > > > >
> > > > > Could I ask you to copy files to /scratch first, and have Condor
> > > > > pick up the jobs from there if you are using any kind of Condor
> > > > > file transfer mechanism (e.g. transfer_input_files /
> > > > > transfer_output_files)? This storage is local to the machine. I
> > > > > just created a 2TB volume here - let me know if more is needed.
> > > > > This should avoid some network I/O that might be causing things
> > > > > to hang.
> > > > >
> > > > > Alternatively you can use Faxbox via its XRootD door if you
> > > > > are using xrdcp.
> > > > > e.g.  root://faxbox.usatlas.org//faxbox2/user/lincolnb/ maps
> > > > > to /faxbox/user/lincolnb on the login node.
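
(Aside: an illustrative xrdcp invocation for the Faxbox XRootD door described above; the file name is made up:)

    # copy a file out through the XRootD door
    xrdcp root://faxbox.usatlas.org//faxbox2/user/lincolnb/some_input.root .
    # the same file appears as /faxbox/user/lincolnb/some_input.root on the login node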
> > > > >
> > > > > Let me know if you have any questions.
> > > > >
> > > > > Thanks,
> > > > > Lincoln
> > > > >
> > > > > On 1/13/2019 10:25 AM, Matthew Epland wrote:
> > > > > > Hi guys,
> > > > > >
> > > > > > > It looks like something broke with the condor system again
> > > > > > > this morning; everything went offline around 8 AM CST
> > > > > > > according to Grafana. I can't connect to the cluster when
> > > > > > > running condor_q.
> > > > > > >
> > > > > > > Yesterday the login machine was acting strange as well. It
> > > > > > > hung when trying to source ~/.bashrc at login, and I
> > > > > > > couldn't rsync or scp files from remote machines - though I
> > > > > > > could rsync them out from a session on login. Both issues
> > > > > > > appear to have sorted themselves out, however.
> > > > > >
> > > > > > Thanks,
> > > > > > Matt Epland
> > > > > >
> > > > > > On Mon, Jan 7, 2019 at 11:36 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > Ah, no that's me restarting Condor.
> > > > > > >
> > > > > > > Anyhow, I found a firewall issue that is now resolved.
> > > > > > >
> > > > > > > Can you try submitting again?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Lincoln
> > > > > > >
> > > > > > >
> > > > > > > On Mon, 2019-01-07 at 17:33 +0100, Matt LeBlanc wrote:
> > > > > > > > Hi Lincoln,
> > > > > > > > 
> > > > > > > > It looks like I may have broken the condor daemon?
> > > > > > > > 
> > > > > > > > login:~ mleblanc$ condor_q
> > > > > > > > Error:
> > > > > > > > 
> > > > > > > > Extra Info: You probably saw this error because the condor_schedd
> > > > > > > > is not running on the machine you are trying to query. If the
> > > > > > > > condor_schedd is not running, the Condor system will not be able to
> > > > > > > > find an address and port to connect to and satisfy this request.
> > > > > > > > Please make sure the Condor daemons are running and try again.
> > > > > > > > 
> > > > > > > > Extra Info: If the condor_schedd is running on the machine you are
> > > > > > > > trying to query and you still see the error, the most likely cause
> > > > > > > > is that you have setup a personal Condor, you have not defined
> > > > > > > > SCHEDD_NAME in your condor_config file, and something is wrong with
> > > > > > > > your SCHEDD_ADDRESS_FILE setting. You must define either or both of
> > > > > > > > those settings in your config file, or you must use the -name option
> > > > > > > > to condor_q. Please see the Condor manual for details on SCHEDD_NAME
> > > > > > > > and SCHEDD_ADDRESS_FILE.
> > > > > > > > 
> > > > > > > > Cheers,
> > > > > > > > Matt
> > > > > > > > 
> > > > > > > > On Mon, Jan 7, 2019 at 5:10 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
> > > > > > > > > Hi Lincoln,
> > > > > > > > > 
> > > > > > > > > No problem! I'll empty it, and start a few at a time in a
> > > > > > > > > little while from now.
> > > > > > > > > 
> > > > > > > > > Cheers,
> > > > > > > > > Matt
> > > > > > > > > 
> > > > > > > > > On Mon, Jan 7, 2019 at 5:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > > > > > > > > > Matt,
> > > > > > > > > > 
> > > > > > > > > > I found at least one problem that is fixed now. However there
> > > > > > > > > > are a _lot_ of jobs in queue (70k). Any way you could temporarily
> > > > > > > > > > remove some of the queued jobs and reduce it to say <20k or <10k?
> > > > > > > > > > I think our glidein system is timing out trying to query the
> > > > > > > > > > condor queue for your jobs.
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Lincoln
> > > > > > > > > > 
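
(Aside: one way to trim a large queue is condor_rm with a constraint; the cluster id and ProcId cut below are placeholders, and JobStatus == 1 selects idle jobs:)

    # remove idle jobs with ProcId >= 10000 from one (hypothetical) cluster
    condor_rm -constraint 'ClusterId == 434109 && JobStatus == 1 && ProcId >= 10000'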
> > > > > > > > > > On Mon, 2019-01-07 at 14:21 +0000, Lincoln Bryant wrote:
> > > > > > > > > > > Hi Matt,
> > > > > > > > > > > 
> > > > > > > > > > > Will take a look.
> > > > > > > > > > > 
> > > > > > > > > > > --Lincoln
> > > > > > > > > > > 
> > > > > > > > > > > On 1/7/2019 4:24 AM, Matt LeBlanc wrote:
> > > > > > > > > > > > Hi Lincoln,
> > > > > > > > > > > > 
> > > > > > > > > > > > All of my open sessions crashed a few minutes ago similarly
> > > > > > > > > > > > to how they broke on Saturday. I am able to log in already,
> > > > > > > > > > > > though my condor jobs appear to be stuck idle in the queue.
> > > > > > > > > > > > 
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Matt
> > > > > > > > > > > > 
> > > > > > > > > > > > On Sat, Jan 5, 2019 at 8:08 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > > > > > > > On 1/4/2019 4:36 PM, Matthew Epland wrote:
> > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Is login.usatlas.org down? My ssh connection just broke
> > > > > > > > > > > > > > and I cannot reconnect.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Matt,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > We had a hypervisor issue at UChicago. You should be able
> > > > > > > > > > > > > to log in again. I am working on restoring Condor services
> > > > > > > > > > > > > now.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > --Lincoln
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > >
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > -- 
> > > > > > > > > Matt LeBlanc
> > > > > > > > > University of Arizona
> > > > > > > > > Office: 40/1-C11 (CERN)
> > > > > > > > > https://cern.ch/mleblanc/
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > >
> > > > > >
> > > > > >
> > > > > > -- 
> > > > > > Matthew Epland
> > > > > > Duke University Department of Physics
> > > > > > matthew.epland AT cern.ch
> > > > > >
> > > > >  
> > > > >
> > > >
> > > >
> > > > -- 
> > > > Matt LeBlanc
> > > > University of Arizona
> > > > Office: 40/1-C11 (CERN)
> > > > https://cern.ch/mleblanc/
> > > >
> > >
> > >
> > > -- 
> > > Matthew Epland
> > > Duke University Department of Physics
> > > matthew.epland AT cern.ch
> > >
> > >
> >
> >
> > -- 
> > Matt LeBlanc
> > University of Arizona
> > Office: 40/1-C11 (CERN)
> > https://cern.ch/mleblanc/
> >
>
>


--
Matthew Epland
Duke University Department of Physics
_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l


--
Matt LeBlanc
University of Arizona
Office: 40/1-C11 (CERN)
https://cern.ch/mleblanc/


