atlas-connect-l AT lists.bnl.gov
Subject: Atlas-connect-l mailing list
- From: Matt LeBlanc <matt.leblanc AT cern.ch>
- To: Matthew Epland <matthew.epland AT cern.ch>
- Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] login.usatlas.org down?
- Date: Mon, 14 Jan 2019 10:15:15 +0100
Hi all,
I'm unable to log in this morning. Things hang after
mlb-macbookpro:~ mleblanc$ usc
Last login: Sun Jan 13 15:44:37 2019 from x
Welcome to ATLAS Connect.
For registration or login problems: support AT connect.usatlas.org
ATLAS Connect user forum: atlas-connect-l AT lists.bnl.gov
For additional documentation and examples: http://connect.usatlas.org/
Cheers,
Matt
On Sun, Jan 13, 2019 at 11:08 PM Matthew Epland <matthew.epland AT cern.ch> wrote:
Hi Lincoln,
I tried moving my input data files to /scratch:
[mepland@login logs]$ ls /scratch/input_files/
Bkg_21.2.55_mc16a.diboson_preprocessed_with_predictions.root
Bkg_21.2.55_mc16a.singletop_hybrid_preprocessed_with_predictions.root
Bkg_21.2.55_mc16a.topEW_preprocessed_with_predictions.root
Bkg_21.2.55_mc16a.ttbar_preprocessed_with_predictions.root
Bkg_21.2.55_mc16a.W_jets_preprocessed_with_predictions.root
Bkg_21.2.55_mc16a.Z_jets_preprocessed_with_predictions.root
Bkg_21.2.55_mc16d.diboson_preprocessed_with_predictions.root
Bkg_21.2.55_mc16d.singletop_hybrid_preprocessed_with_predictions.root
Bkg_21.2.55_mc16d.topEW_preprocessed_with_predictions.root
Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root
Bkg_21.2.55_mc16d.W_jets_preprocessed_with_predictions.root
Bkg_21.2.55_mc16d.Z_jets_preprocessed_with_predictions.root
Data_21.2.55_20152016_preprocessed_with_predictions.root
Data_21.2.55_2017_preprocessed_with_predictions.root
Sig_21.2.55_mc16a.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
Sig_21.2.55_mc16d.Gtt_preprocessed_with_predictions_with_fake_high_masses.root
but the condor worker VMs aren't picking them up
<ERROR> PrepareHistos: input file /scratch/input_files/Bkg_21.2.55_mc16d.ttbar_preprocessed_with_predictions.root does not exist - cannot load ttbar_nominal from it
I can stick with my present method of copying the files over directly in tar, but it does take awhile to get them running. Any suggestions?
Thanks,
Matt
On Sun, Jan 13, 2019 at 1:43 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
Hi all,
I notice that my 'watch condor_q' breaks whenever I am *submitting* jobs to the condor queue. i.e. when the dots are being printed out in messages such as
INFO:root:creating driver
Submitting job(s).............................................................................................................................................
141 job(s) submitted to cluster 434109.
So perhaps it is related to the condor driver sending inputs to the nodes?
I generally access input files via the XRootD door, so I guess it isn't the staging of input files themselves, but rather the analysis code ...
Cheers,
Matt
On Sun, Jan 13, 2019 at 6:50 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Hi folks,
It looks like something is causing a lot of file I/O, either through /faxbox or /home when someone's jobs are running. This is causing Condor to block on the I/O, so condor_q will not respond. I rebooted the node and added an additional scratch disk.
Could I ask you to first copy files to /scratch, and have Condor pick them up from there, if you are using any kind of Condor file transfer mechanism (e.g. transfer_input_files / transfer_output_files)? This storage is local to the machine. I just created a 2TB volume here - let me know if more is needed. This should avoid some network I/O that might be causing things to hang.
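A minimal sketch of the staging approach described above, in Python. The helper names (stage_inputs, submit_description), the run_analysis.sh executable, and the paths in the usage comment are hypothetical; should_transfer_files, when_to_transfer_output, and transfer_input_files are standard HTCondor submit commands. Files shipped with transfer_input_files land in the job's working directory on the worker, so the job would normally refer to them by base name rather than by a /scratch path.

```python
import shutil
from pathlib import Path


def stage_inputs(files, staging_dir):
    """Copy input files into a local staging directory; return the staged paths."""
    staging = Path(staging_dir)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for f in files:
        dest = staging / Path(f).name
        shutil.copy2(f, dest)  # copy file contents and metadata to local disk
        staged.append(str(dest))
    return staged


def submit_description(executable, staged_files):
    """Build a submit description that ships the staged files with each job."""
    lines = [
        f"executable = {executable}",
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        # Condor reads these from the submit host and copies them to the worker.
        "transfer_input_files = " + ", ".join(staged_files),
        "queue",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical usage on the login node:
#   staged = stage_inputs(["/home/user/inputs/example.root"], "/scratch/input_files")
#   Path("job.sub").write_text(submit_description("run_analysis.sh", staged))
#   then: condor_submit job.sub
```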
Alternatively you can use Faxbox via its XRootD door if you are using xrdcp.
e.g. root://faxbox.usatlas.org//faxbox2/user/lincolnb/ maps to /faxbox/user/lincolnb on the login node.
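As a rough illustration of that mapping, the sketch below rewrites a login-node /faxbox path into the corresponding XRootD URL. It generalizes from the single example above, so the prefix rule is an assumption, and the function name faxbox_to_xrootd is hypothetical.

```python
# Prefixes taken from the one example in this message; assumed to hold for all users.
FAXBOX_LOCAL_PREFIX = "/faxbox/"
FAXBOX_XROOTD_PREFIX = "root://faxbox.usatlas.org//faxbox2/"


def faxbox_to_xrootd(local_path):
    """Translate a /faxbox/... path on the login node into an XRootD URL."""
    if not local_path.startswith(FAXBOX_LOCAL_PREFIX):
        raise ValueError(f"not a /faxbox path: {local_path}")
    return FAXBOX_XROOTD_PREFIX + local_path[len(FAXBOX_LOCAL_PREFIX):]


if __name__ == "__main__":
    print(faxbox_to_xrootd("/faxbox/user/lincolnb"))
```

The resulting URL could then be passed to xrdcp from inside a job, so the job reads through the XRootD door instead of the shared mount.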
Let me know if you have any questions.
Thanks,
Lincoln
On 1/13/2019 10:25 AM, Matthew Epland wrote:
Hi guys,
It looks like something broke with the condor system again this morning; everything went offline around 8 AM CST according to Grafana. I can't connect to the cluster when running condor_q.
Yesterday the login machine was acting strange as well. It hung when trying to source ~/.bashrc on login, and I couldn't rsync or scp files from remote machines, though I could rsync them out from a session on login. Both issues appear to have sorted themselves out, however.
Thanks,
Matt Epland
On Mon, Jan 7, 2019 at 11:36 AM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
Ah, no that's me restarting Condor.
Anyhow, I found a firewall issue that is now resolved.
Can you try submitting again?
Thanks,
Lincoln
On Mon, 2019-01-07 at 17:33 +0100, Matt LeBlanc wrote:
> Hi Lincoln,
>
> It looks like I may have broken the condor daemon?
>
> login:~ mleblanc$ condor_q
> Error:
>
> Extra Info: You probably saw this error because the condor_schedd is not
> running on the machine you are trying to query. If the condor_schedd is not
> running, the Condor system will not be able to find an address and port to
> connect to and satisfy this request. Please make sure the Condor daemons are
> running and try again.
>
> Extra Info: If the condor_schedd is running on the machine you are trying to
> query and you still see the error, the most likely cause is that you have
> setup a personal Condor, you have not defined SCHEDD_NAME in your
> condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE
> setting. You must define either or both of those settings in your config
> file, or you must use the -name option to condor_q. Please see the Condor
> manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.
>
> Cheers,
> Matt
>
> On Mon, Jan 7, 2019 at 5:10 PM Matt LeBlanc <matt.leblanc AT cern.ch> wrote:
> > Hi Lincoln,
> >
> > No problem! I'll empty it, and start a few at a time in a little
> > while from now.
> >
> > Cheers,
> > Matt
> >
> > On Mon, Jan 7, 2019 at 5:06 PM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > > Matt,
> > >
> > > I found at least one problem that is fixed now. However there are a
> > > _lot_ of jobs in queue (70k). Any way you could temporarily remove some
> > > of the queued jobs and reduce it to say <20k or <10k? I think our
> > > glidein system is timing out trying to query the condor queue for your
> > > jobs.
> > >
> > > Thanks,
> > > Lincoln
> > >
> > > On Mon, 2019-01-07 at 14:21 +0000, Lincoln Bryant wrote:
> > > > Hi Matt,
> > > >
> > > > Will take a look.
> > > >
> > > > --Lincoln
> > > >
> > > > On 1/7/2019 4:24 AM, Matt LeBlanc wrote:
> > > > > Hi Lincoln,
> > > > >
> > > > > > All of my open sessions crashed a few minutes ago similarly to how
> > > > > > they broke on Saturday. I am able to log in already, though my
> > > > > > condor jobs appear to be stuck idle in the queue.
> > > > >
> > > > > Cheers,
> > > > > Matt
> > > > >
> > > > > > On Sat, Jan 5, 2019 at 8:08 AM Lincoln Bryant <lincolnb AT uchicago.edu> wrote:
> > > > > > On 1/4/2019 4:36 PM, Matthew Epland wrote:
> > > > > > > Hello,
> > > > > > >
> > > > > > > > Is login.usatlas.org down? My ssh connection just broke and I
> > > > > > > > can not reconnect.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Matt
> > > > > > >
> > > > > > Hi Matt,
> > > > > >
> > > > > > > We had a hypervisor issue at UChicago. You should be able to login
> > > > > > > again. I am working on restoring Condor services now.
> > > > > >
> > > > > > --Lincoln
> > > > > >
> > > > > >
> > > > > > _______________________________________________
> > > > > > Atlas-connect-l mailing list
> > > > > > Atlas-connect-l AT lists.bnl.gov
> > > > > > https://lists.bnl.gov/mailman/listinfo/atlas-connect-l
> > > > > >
> > > > >
> > > > >
> > > > > _______________________________________________
> > > > > ATLAS Midwest Tier2 mailing list
> > > > > http://mwt2.usatlasfacility.org
> > >
> >
> >
> > --
> > Matt LeBlanc
> > University of Arizona
> > Office: 40/1-C11 (CERN)
> > https://cern.ch/mleblanc/
> >
>
>
--
Matthew Epland
Duke University Department of Physics
--
Matt LeBlanc
University of Arizona
Office: 40/1-C11 (CERN)
https://cern.ch/mleblanc/
--
Matthew Epland
Duke University Department of Physics
--
Matt LeBlanc
University of Arizona
Office: 40/1-C11 (CERN)
https://cern.ch/mleblanc/