- From: Lincoln Bryant <lincolnb AT uchicago.edu>
- To: Matthew Epland <matthew.epland AT cern.ch>, "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] login.usatlas.org down?
- Date: Mon, 21 Jan 2019 21:17:51 +0000
Hi Matt,
How much disk space do you expect your jobs to take? Condor seems to
think they require quite a lot, which is why I think it's not matching.
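One thing worth checking is the request_disk line in your submit file -
a rough sketch (the value here is just a placeholder):

  request_disk = 2 GB

If that is much larger than what the jobs actually use, shrinking it
should let them match.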
--Lincoln
On Mon, 2019-01-21 at 10:39 -0500, Matthew Epland wrote:
> Hi Lincoln,
>
> My jobs this morning have been stuck in idle for a while now. Could
> you please check the cluster? If all goes well, today (and maybe
> tomorrow) should be the last day I am running things, as I unblind!
>
> Thanks,
> Matt Epland
>
> > On Mon, Jan 14, 2019 at 10:49 AM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > CVMFS looks OK to me now - clearly we're having a litany of
> > intermittent problems with this host...
> >
> > I will investigate today and see if there are any underlying
> > problems.
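> >
> > In the meantime, a quick sanity check from the login node is something
> > like:
> >
> > cvmfs_config probe
> > ls /cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase
> >
> > (commands and path above are the usual ones; adjust as needed).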
> >
> > --Lincoln
> >
> > On Mon, 2019-01-14 at 10:16 +0100, Matt LeBlanc wrote:
> > > Sorry -- what I should have said is, I'm unable to run setupATLAS
> > > this morning. I guess the /cvmfs mount is broken?
> > >
> > > Cheers,
> > > Matt
> > >
> > > On Mon, Jan 14, 2019 at 10:15 AM Matt LeBlanc <matt.leblanc@cern.ch> wrote:
> > > > Hi all,
> > > >
> > > > I'm unable to log in this morning. Things hang after
> > > >
> > > > mlb-macbookpro:~ mleblanc$ usc
> > > > Last login: Sun Jan 13 15:44:37 2019 from x
> > > > Welcome to ATLAS Connect.
> > > >
> > > > For registration or login problems: support AT connect.usatlas.org
> > > > ATLAS Connect user forum: atlas-connect-l AT lists.bnl.gov
> > > > For additional documentation and examples: http://connect.usatlas.org/
> > > >
> > > > Cheers,
> > > > Matt
> > > >
> > > >
> > > > On Sun, Jan 13, 2019 at 11:08 PM Matthew Epland <matthew.epland@cern.ch> wrote:
> > > > > Hi Lincoln,
> > > > >
> > > > > I tried moving my input data files to /scratch:
> > > > >
> > > > > [mepland@login logs]$ ls /scratch/input_files/
> > > > > Bkg_21.2.55_mc16a.diboson_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16a.singletop_hybrid_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16a.topEW_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16a.ttbar_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16a.W_jets_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16a.Z_jets_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16d.diboson_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16d.singletop_hybrid_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16d.topEW_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16d.W_jets_preprocessed_with_predictions.root
> > > > > Bkg_21.2.55_mc16d.Z_jets_preprocessed_with_predictions.root
> > > > > Data_21.2.55_20152016_preprocessed_with_predictions.root
> > > > > Data_21.2.55_2017_preprocessed_with_predictions.root
> > > > > Sig_21.2.55_mc16a.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
> > > > > Sig_21.2.55_mc16d.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
> > > > >
> > > > > but the condor worker VMs aren't picking them up
> > > > >
> > > > > <ERROR> PrepareHistos: input file
> > > > > /scratch/input_files/Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
> > > > > does not exist - cannot load ttbar_nominal from it
> > > > >
> > > > > I can stick with my present method of copying the files over
> > > > > directly in a tarball, but it does take a while to get them
> > > > > running. Any suggestions?
> > > > >
> > > > > Thanks,
> > > > > Matt
> > > > >
> > > > > On Sun, Jan 13, 2019 at 1:43 PM Matt LeBlanc <matt.leblanc@cern.ch> wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I notice that my 'watch condor_q' breaks whenever I am
> > > > > > *submitting* jobs to the condor queue, i.e. when the dots are
> > > > > > being printed out in messages such as
> > > > > >
> > > > > > INFO:root:creating driver
> > > > > > Submitting job(s).............................................................................................................................................
> > > > > > 141 job(s) submitted to cluster 434109.
> > > > > >
> > > > > > So perhaps it is related to the condor driver sending inputs
> > > > > > to the nodes?
> > > > > >
> > > > > > I generally access input files via the XRootD door, so I guess
> > > > > > it isn't the staging of input files themselves, but rather the
> > > > > > analysis code ...
> > > > > >
> > > > > > Cheers,
> > > > > > Matt
> > > > > >
> > > > > > On Sun, Jan 13, 2019 at 6:50 PM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > Hi folks,
> > > > > > >
> > > > > > > It looks like something is causing a lot of file I/O, either
> > > > > > > through /faxbox or /home when someone's jobs are running.
> > > > > > > This is causing Condor to block on the I/O, so condor_q will
> > > > > > > not respond. I rebooted the node and added an additional
> > > > > > > scratch disk.
> > > > > > >
> > > > > > > Could I ask you to copy files to /scratch first, and have
> > > > > > > Condor pick up the jobs from there if you are using any kind
> > > > > > > of condor file transfer mechanism (e.g. transfer_input_files /
> > > > > > > transfer_output_files)? This storage is local to the machine.
> > > > > > > I just created a 2TB volume here - let me know if more is
> > > > > > > needed. This should avoid some network I/O that might be
> > > > > > > causing things to hang.
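> > > > > > >
> > > > > > > A rough submit-file sketch (executable and file names below
> > > > > > > are just placeholders):
> > > > > > >
> > > > > > > # tell Condor to ship the input from /scratch on the login node
> > > > > > > executable              = run_analysis.sh
> > > > > > > transfer_input_files    = /scratch/input_files/my_input.root
> > > > > > > should_transfer_files   = YES
> > > > > > > when_to_transfer_output = ON_EXIT
> > > > > > > request_disk            = 2 GB
> > > > > > > queue
> > > > > > >
> > > > > > > Condor copies anything in transfer_input_files into the job's
> > > > > > > working directory on the worker, so the job should open it by
> > > > > > > its base name there rather than by the /scratch path.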
> > > > > > >
> > > > > > > Alternatively you can use Faxbox via its XRootD door if you
> > > > > > > are using xrdcp, e.g.
> > > > > > > root://faxbox.usatlas.org//faxbox2/user/lincolnb/ maps to
> > > > > > > /faxbox/user/lincolnb on the login node.
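> > > > > > >
> > > > > > > For example (substituting your own username and file names):
> > > > > > >
> > > > > > > # pull an input from Faxbox into the job's working directory
> > > > > > > xrdcp root://faxbox.usatlas.org//faxbox2/user/<user>/input.root .
> > > > > > > # push an output file back to Faxbox
> > > > > > > xrdcp output.root root://faxbox.usatlas.org//faxbox2/user/<user>/output.root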
> > > > > > >
> > > > > > > Let me know if you have any questions.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Lincoln
> > > > > > >
> > > > > > > On 1/13/2019 10:25 AM, Matthew Epland wrote:
> > > > > > > > Hi guys,
> > > > > > > >
> > > > > > > > It looks like something broke with the condor system again
> > > > > > > > this morning; everything went offline around 8 AM CST
> > > > > > > > according to grafana. I can't connect to the cluster when
> > > > > > > > running condor_q.
> > > > > > > >
> > > > > > > > Yesterday the login machine was acting strange as well. It
> > > > > > > > hung when trying to source ~/.bashrc at login, and I
> > > > > > > > couldn't rsync or scp files from remote machines - though I
> > > > > > > > could rsync them out from a session on login. Both issues
> > > > > > > > appear to have sorted themselves out, however.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Matt Epland
> > > > > > > >
> > > > > > > > On Mon, Jan 7, 2019 at 11:36 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > > > Ah, no that's me restarting Condor.
> > > > > > > > >
> > > > > > > > > Anyhow, I found a firewall issue that is now resolved.
> > > > > > > > >
> > > > > > > > > Can you try submitting again?
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Lincoln
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 2019-01-07 at 17:33 +0100, Matt LeBlanc wrote:
> > > > > > > > > > Hi Lincoln,
> > > > > > > > > >
> > > > > > > > > > It looks like I may have broken the condor daemon?
> > > > > > > > > >
> > > > > > > > > > login:~ mleblanc$ condor_q
> > > > > > > > > > Error:
> > > > > > > > > >
> > > > > > > > > > Extra Info: You probably saw this error because the
> > > > > > > > > > condor_schedd is not running on the machine you are
> > > > > > > > > > trying to query. If the condor_schedd is not running,
> > > > > > > > > > the Condor system will not be able to find an address
> > > > > > > > > > and port to connect to and satisfy this request. Please
> > > > > > > > > > make sure the Condor daemons are running and try again.
> > > > > > > > > >
> > > > > > > > > > Extra Info: If the condor_schedd is running on the
> > > > > > > > > > machine you are trying to query and you still see the
> > > > > > > > > > error, the most likely cause is that you have setup a
> > > > > > > > > > personal Condor, you have not defined SCHEDD_NAME in
> > > > > > > > > > your condor_config file, and something is wrong with
> > > > > > > > > > your SCHEDD_ADDRESS_FILE setting. You must define either
> > > > > > > > > > or both of those settings in your config file, or you
> > > > > > > > > > must use the -name option to condor_q. Please see the
> > > > > > > > > > Condor manual for details on SCHEDD_NAME and
> > > > > > > > > > SCHEDD_ADDRESS_FILE.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Matt
> > > > > > > > > >
> > > > > > > > > > On Mon, Jan 7, 2019 at 5:10 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
> > > > > > > > > > > Hi Lincoln,
> > > > > > > > > > >
> > > > > > > > > > > No problem! I'll empty it, and start a few at a time
> > > > > > > > > > > in a little while from now.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Matt
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Jan 7, 2019 at 5:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > > > > > > > > > > > Matt,
> > > > > > > > > > > >
> > > > > > > > > > > > I found at least one problem that is fixed now.
> > > > > > > > > > > > However, there are a _lot_ of jobs in the queue
> > > > > > > > > > > > (70k). Any way you could temporarily remove some of
> > > > > > > > > > > > the queued jobs and reduce it to, say, <20k or <10k?
> > > > > > > > > > > > I think our glidein system is timing out trying to
> > > > > > > > > > > > query the condor queue for your jobs.
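> > > > > > > > > > > >
> > > > > > > > > > > > For example (cluster/process numbers below are just
> > > > > > > > > > > > placeholders):
> > > > > > > > > > > >
> > > > > > > > > > > > # remove an entire submitted cluster
> > > > > > > > > > > > condor_rm 123456
> > > > > > > > > > > > # or thin out part of one by process number
> > > > > > > > > > > > condor_rm -constraint 'ClusterId == 123456 && ProcId >= 10000'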
> > > > > > > > > > > >
> > > > > > > > > > > > Thanks,
> > > > > > > > > > > > Lincoln
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 2019-01-07 at 14:21 +0000, Lincoln Bryant wrote:
> > > > > > > > > > > > > Hi Matt,
> > > > > > > > > > > > >
> > > > > > > > > > > > > Will take a look.
> > > > > > > > > > > > >
> > > > > > > > > > > > > --Lincoln
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 1/7/2019 4:24 AM, Matt LeBlanc wrote:
> > > > > > > > > > > > > > Hi Lincoln,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All of my open sessions crashed a few minutes
> > > > > > > > > > > > > > ago, similarly to how they broke on Saturday. I
> > > > > > > > > > > > > > am able to log in already, though my condor jobs
> > > > > > > > > > > > > > appear to be stuck idle in the queue.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Jan 5, 2019 at 8:08 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > > > > > > > > > On 1/4/2019 4:36 PM, Matthew Epland wrote:
> > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Is login.usatlas.org down? My ssh connection
> > > > > > > > > > > > > > > > just broke and I cannot reconnect.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi Matt,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > We had a hypervisor issue at UChicago. You
> > > > > > > > > > > > > > > should be able to log in again. I am working
> > > > > > > > > > > > > > > on restoring Condor services now.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --Lincoln
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Matt LeBlanc
> > > > > > > > > > > University of Arizona
> > > > > > > > > > > Office: 40/1-C11 (CERN)
> > > > > > > > > > > https://cern.ch/mleblanc/
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Matthew Epland
> > > > > > > > Duke University Department of Physics
> > > > > > > > matthew.epland AT cern.ch
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Matt LeBlanc
> > > > > > University of Arizona
> > > > > > Office: 40/1-C11 (CERN)
> > > > > > https://cern.ch/mleblanc/
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Matthew Epland
> > > > > Duke University Department of Physics
> > > > > matthew.epland AT cern.ch
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Matt LeBlanc
> > > > University of Arizona
> > > > Office: 40/1-C11 (CERN)
> > > > https://cern.ch/mleblanc/
> > > >
> > >
> > >
> >
>
>
> --
> Matthew Epland
> Duke University Department of Physics
> matthew.epland AT cern.ch
>