
atlas-connect-l - Re: [Atlas-connect-l] login.usatlas.org down?

  • From: Lincoln Bryant <lincolnb AT uchicago.edu>
  • To: Matt LeBlanc <matt.leblanc AT cern.ch>, Matthew Epland <matthew.epland AT cern.ch>
  • Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] login.usatlas.org down?
  • Date: Mon, 21 Jan 2019 20:29:12 +0000

Hi Matt,

In your case, it looks like they're using a _lot_ of RAM:

444191.130 mleblanc        1/20 17:50   0+01:46:50 I  0   46387.0 run 130
444194.240 mleblanc        1/20 17:51   0+01:26:58 I  0   48829.0 run 240
444195.153 mleblanc        1/20 17:52   0+01:18:29 I  0   34180.0 run 153
444196.75  mleblanc        1/20 17:52   0+01:20:30 I  0   46387.0 run 75
444196.89  mleblanc        1/20 17:52   0+01:45:21 I  0   46387.0 run 89
444197.52  mleblanc        1/20 17:52   0+01:11:28 I  0   73243.0 run 52

The second-to-last column is the 'SIZE' value, which shows how much
memory (in megabytes) your job used the last time it ran.

Condor won't reschedule your jobs because their "RequestMemory" is
larger than the memory available on any node.
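
If you want to double-check what is blocking the match, something like

    condor_q -better-analyze 444191.130

(using any of the job IDs above) should report how many slots, if any,
can satisfy the job's memory request.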

If you think they might be OK after resubmitting, you can use
"condor_qedit" to set the "RequestMemory" attribute to something
reasonable in megabytes (e.g. 2000).
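
For example, something along these lines should work, repeated for each
of the idle jobs above:

    condor_qedit 444191.130 RequestMemory 2000
    condor_qedit 444194.240 RequestMemory 2000

and you can check the new value afterwards with
"condor_q -l 444191.130 | grep RequestMemory".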

--Lincoln

On Mon, 2019-01-21 at 20:46 +0100, Matt LeBlanc wrote:
> Hi all,
>
> I've had ~8 jobs stuck in idle for >12 hours now as well. They flip
> between R and I, actually, but never finish ...
>
> Cheers,
> Matt
>
> On Mon, Jan 21, 2019 at 4:39 PM Matthew Epland <matthew.epland AT cern.ch> wrote:
> > Hi Lincoln,
> >
> > My jobs this morning have been stuck in idle for a while now. Could
> > you please check the cluster? If all goes well, today (and maybe
> > tomorrow) should be the last day I am running things, as I unblind!
> >
> > Thanks,
> > Matt Epland
> >
> > On Mon, Jan 14, 2019 at 10:49 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > CVMFS looks ok to me now - clearly we're having a litany of
> > > intermittent problems with this host.. 
> > >
> > > I will investigate today and see if there are any underlying
> > > problems.
> > >
> > > --Lincoln
> > >
> > > On Mon, 2019-01-14 at 10:16 +0100, Matt LeBlanc wrote:
> > > > Sorry -- what I should have said is, I'm unable to run setupATLAS
> > > > this morning. I guess the /cvmfs mount is broken?
> > > > 
> > > > Cheers,
> > > > Matt
> > > > 
> > > > On Mon, Jan 14, 2019 at 10:15 AM Matt LeBlanc <matt.leblanc@cern.ch> wrote:
> > > > > Hi all,
> > > > > 
> > > > > I'm unable to log in this morning. Things hang after
> > > > > 
> > > > > mlb-macbookpro:~ mleblanc$ usc
> > > > > Last login: Sun Jan 13 15:44:37 2019 from x
> > > > > Welcome to ATLAS Connect.
> > > > > 
> > > > > For registration or login problems:  support AT connect.usatlas.org
> > > > > ATLAS Connect user forum:  atlas-connect-l AT lists.bnl.gov
> > > > > For additional documentation and examples: http://connect.usatlas.org/
> > > > > 
> > > > > Cheers,
> > > > > Matt
> > > > > 
> > > > > 
> > > > > On Sun, Jan 13, 2019 at 11:08 PM Matthew Epland <matthew.epland@cern.ch> wrote:
> > > > > > Hi Lincoln,
> > > > > > 
> > > > > > I tried moving my input data files to /scratch:
> > > > > > 
> > > > > > [mepland@login logs]$ ls /scratch/input_files/
> > > > > > Bkg_21.2.55_mc16a.diboson_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16a.singletop_hybrid_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16a.topEW_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16a.ttbar_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16a.W_jets_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16a.Z_jets_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16d.diboson_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16d.singletop_hybrid_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16d.topEW_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16d.W_jets_preprocessed_with_predictions.root
> > > > > > Bkg_21.2.55_mc16d.Z_jets_preprocessed_with_predictions.root
> > > > > > Data_21.2.55_20152016_preprocessed_with_predictions.root
> > > > > > Data_21.2.55_2017_preprocessed_with_predictions.root
> > > > > > Sig_21.2.55_mc16a.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
> > > > > > Sig_21.2.55_mc16d.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
> > > > > > 
> > > > > > but the condor worker VMs aren't picking them up
> > > > > > 
> > > > > > <ERROR> PrepareHistos: input file
> > > > > > /scratch/input_files/Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
> > > > > > does not exist - cannot load ttbar_nominal from it
> > > > > > 
> > > > > > I can stick with my present method of copying the files over
> > > > > > directly in tar, but it does take awhile to get them running.
> > > > > > Any suggestions?
> > > > > > 
> > > > > > Thanks,
> > > > > > Matt
> > > > > > 
> > > > > > On Sun, Jan 13, 2019 at 1:43 PM Matt LeBlanc <matt.leblanc@cern.ch> wrote:
> > > > > > > Hi all,
> > > > > > > 
> > > > > > > I notice that my 'watch condor_q' breaks whenever I am
> > > > > > > *submitting* jobs to the condor queue. i.e. when the dots are
> > > > > > > being printed out in messages such as
> > > > > > > 
> > > > > > > INFO:root:creating driver
> > > > > > > Submitting job(s).............................................................................................................................................
> > > > > > > 141 job(s) submitted to cluster 434109.
> > > > > > > 
> > > > > > > So perhaps it is related to the condor driver sending inputs to
> > > > > > > the nodes?
> > > > > > > 
> > > > > > > I generally access input files via the XRootD door, so I guess
> > > > > > > it isn't the staging of input files themselves, but rather the
> > > > > > > analysis code ...
> > > > > > > 
> > > > > > > Cheers,
> > > > > > > Matt
> > > > > > > 
> > > > > > > On Sun, Jan 13, 2019 at 6:50 PM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > > Hi folks,
> > > > > > > > 
> > > > > > > > It looks like something is causing a lot of file I/O, either
> > > > > > > > through /faxbox or /home when someone's jobs are running.
> > > > > > > > This is causing Condor to block on the I/O, so condor_q will
> > > > > > > > not respond. I rebooted the node and added an additional
> > > > > > > > scratch disk.
> > > > > > > > 
> > > > > > > > Could I ask you to copy files to /scratch first, and have
> > > > > > > > Condor pick up the jobs from there if you are using any kind
> > > > > > > > of condor file transfer mechanisms (e.g. transfer_input_files
> > > > > > > > / transfer_output_files)? This storage is local to the
> > > > > > > > machine. I just created a 2TB volume here - let me know if
> > > > > > > > more is needed. This should avoid some network I/O that might
> > > > > > > > be causing things to hang.
> > > > > > > > 
> > > > > > > > Alternatively you can use Faxbox via its XRootD door if you
> > > > > > > > are using xrdcp.
> > > > > > > > e.g. root://faxbox.usatlas.org//faxbox2/user/lincolnb/ maps
> > > > > > > > to /faxbox/user/lincolnb on the login node.
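> > > > > > > > 
> > > > > > > > For example, a submit file along these lines should do it
> > > > > > > > (the input filename here is just a placeholder for one of
> > > > > > > > your own files under /scratch):
> > > > > > > > 
> > > > > > > >   should_transfer_files   = YES
> > > > > > > >   when_to_transfer_output = ON_EXIT
> > > > > > > >   transfer_input_files    = /scratch/input_files/myinput.root
> > > > > > > > 
> > > > > > > > and pulling a file from Faxbox inside a job would look
> > > > > > > > something like (copying into the job's working directory):
> > > > > > > > 
> > > > > > > >   xrdcp root://faxbox.usatlas.org//faxbox2/user/lincolnb/myfile.root .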
> > > > > > > > 
> > > > > > > > Let me know if you have any questions.
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > Lincoln
> > > > > > > > 
> > > > > > > > On 1/13/2019 10:25 AM, Matthew Epland wrote:
> > > > > > > > > Hi guys,
> > > > > > > > > 
> > > > > > > > > It looks like something broke with the condor system again
> > > > > > > > > this morning; everything went offline around 8 AM CST
> > > > > > > > > according to Grafana. I can't connect to the cluster when
> > > > > > > > > running condor_q.
> > > > > > > > > 
> > > > > > > > > Yesterday the login machine was acting strange as well. It
> > > > > > > > > hung when trying to source ~/.bashrc on log in, and I
> > > > > > > > > couldn't rsync or scp files from remote machines - though I
> > > > > > > > > could rsync them out from a session on login. Both issues
> > > > > > > > > appear to have sorted themselves out however.
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Matt Epland
> > > > > > > > > 
> > > > > > > > > On Mon, Jan 7, 2019 at 11:36 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > > > > Ah, no that's me restarting Condor.
> > > > > > > > > > 
> > > > > > > > > > Anyhow, I found a firewall issue that is now resolved.
> > > > > > > > > > 
> > > > > > > > > > Can you try submitting again?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > > Lincoln
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > On Mon, 2019-01-07 at 17:33 +0100, Matt LeBlanc wrote:
> > > > > > > > > > > Hi Lincoln,
> > > > > > > > > > > 
> > > > > > > > > > > It looks like I may have broken the condor daemon?
> > > > > > > > > > > 
> > > > > > > > > > > login:~ mleblanc$ condor_q
> > > > > > > > > > > Error:
> > > > > > > > > > > 
> > > > > > > > > > > Extra Info: You probably saw this error because the
> > > > > > > > > > > condor_schedd is not running on the machine you are
> > > > > > > > > > > trying to query. If the condor_schedd is not running,
> > > > > > > > > > > the Condor system will not be able to find an address
> > > > > > > > > > > and port to connect to and satisfy this request. Please
> > > > > > > > > > > make sure the Condor daemons are running and try again.
> > > > > > > > > > > 
> > > > > > > > > > > Extra Info: If the condor_schedd is running on the
> > > > > > > > > > > machine you are trying to query and you still see the
> > > > > > > > > > > error, the most likely cause is that you have setup a
> > > > > > > > > > > personal Condor, you have not defined SCHEDD_NAME in
> > > > > > > > > > > your condor_config file, and something is wrong with
> > > > > > > > > > > your SCHEDD_ADDRESS_FILE setting. You must define either
> > > > > > > > > > > or both of those settings in your config file, or you
> > > > > > > > > > > must use the -name option to condor_q. Please see the
> > > > > > > > > > > Condor manual for details on SCHEDD_NAME and
> > > > > > > > > > > SCHEDD_ADDRESS_FILE.
> > > > > > > > > > > 
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Matt
> > > > > > > > > > > 
> > > > > > > > > > > On Mon, Jan 7, 2019 at 5:10 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
> > > > > > > > > > > > Hi Lincoln,
> > > > > > > > > > > > 
> > > > > > > > > > > > No problem! I'll empty it, and start a few at a time
> > > > > > > > > > > > in a little while from now.
> > > > > > > > > > > > 
> > > > > > > > > > > > Cheers,
> > > > > > > > > > > > Matt
> > > > > > > > > > > > 
> > > > > > > > > > > > On Mon, Jan 7, 2019 at 5:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > > > > > > > > > > > > Matt,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I found at least one problem that is fixed now.
> > > > > > > > > > > > > However there are a _lot_ of jobs in queue (70k).
> > > > > > > > > > > > > Any way you could temporarily remove some of the
> > > > > > > > > > > > > queued jobs and reduce it to say <20k or <10k? I
> > > > > > > > > > > > > think our glidein system is timing out trying to
> > > > > > > > > > > > > query the condor queue for your jobs.
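> > > > > > > > > > > > > 
> > > > > > > > > > > > > For instance, condor_rm takes a cluster ID, so
> > > > > > > > > > > > > something like
> > > > > > > > > > > > > 
> > > > > > > > > > > > >   condor_rm <cluster_id>
> > > > > > > > > > > > > 
> > > > > > > > > > > > > (with a cluster ID taken from your condor_q output)
> > > > > > > > > > > > > should drop a whole submission at a time until the
> > > > > > > > > > > > > queue is down to a reasonable size.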
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Lincoln
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Mon, 2019-01-07 at 14:21 +0000, Lincoln Bryant wrote:
> > > > > > > > > > > > > > Hi Matt,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Will take a look.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > --Lincoln
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On 1/7/2019 4:24 AM, Matt LeBlanc wrote:
> > > > > > > > > > > > > > > Hi Lincoln,
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > All of my open sessions crashed a few minutes
> > > > > > > > > > > > > > > ago similarly to how they broke on Saturday. I
> > > > > > > > > > > > > > > am able to log in already, though my condor jobs
> > > > > > > > > > > > > > > appear to be stuck idle in the queue.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Cheers,
> > > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On Sat, Jan 5, 2019 at 8:08 AM Lincoln Bryant <lincolnb@uchicago.edu> wrote:
> > > > > > > > > > > > > > > > On 1/4/2019 4:36 PM, Matthew Epland wrote:
> > > > > > > > > > > > > > > > > Hello,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Is login.usatlas.org down? My ssh connection
> > > > > > > > > > > > > > > > > just broke and I can not reconnect.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Matt
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Matt,
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > We had a hypervisor issue at UChicago. You
> > > > > > > > > > > > > > > > should be able to log in again. I am working
> > > > > > > > > > > > > > > > on restoring Condor services now.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > --Lincoln
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > -- 
> > > > > > > > > > > > Matt LeBlanc
> > > > > > > > > > > > University of Arizona
> > > > > > > > > > > > Office: 40/1-C11 (CERN)
> > > > > > > > > > > > https://cern.ch/mleblanc/
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > -- 
> > > > > > > > > Matthew Epland
> > > > > > > > > Duke University Department of Physics
> > > > > > > > > matthew.epland AT cern.ch
> > > > > > > > > 
> > > > > > > >  
> > > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > Matt LeBlanc
> > > > > > > University of Arizona
> > > > > > > Office: 40/1-C11 (CERN)
> > > > > > > https://cern.ch/mleblanc/
> > > > > > > 
> > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Matthew Epland
> > > > > > Duke University Department of Physics
> > > > > > matthew.epland AT cern.ch
> > > > > > 
> > > > > > 
> > > > > 
> > > > > 
> > > > > -- 
> > > > > Matt LeBlanc
> > > > > University of Arizona
> > > > > Office: 40/1-C11 (CERN)
> > > > > https://cern.ch/mleblanc/
> > > > > 
> > > > 
> > > > 
> > >
> >
> >
> > -- 
> > Matthew Epland
> > Duke University Department of Physics
> > matthew.epland AT cern.ch
> >
> > _______________________________________________
> > Atlas-connect-l mailing list
> > Atlas-connect-l AT lists.bnl.gov
> > https://lists.bnl.gov/mailman/listinfo/atlas-connect-l
>
>


