Re: [Atlas-connect-l] ATLAS Connect meeting Monday
- From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa AT gmail.com>
- To: David Lesny <ddl AT illinois.edu>
- Cc: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] ATLAS Connect meeting Monday
- Date: Mon, 18 Aug 2014 12:28:11 -0700
Hi Dave,
I have been observing this kind of trend for a while (more than a month). Those jobs were not submitted by me but by someone from "Atlasconnect", and I see that the jobs constantly went into the held state. We removed many in the past, as you did now. All I want to say is that this can't be a transient issue.
Now I see your jobs running correctly. Since both you and I can submit jobs successfully, this points more towards some particular user asking for a hard-coded path in their job which Condor doesn't understand.
Anyway, I will now monitor the jobs afresh and keep you posted. The problem is that I can find the name of the submitter whose jobs are in the "running" state by logging into the node, but I don't know if I can figure out whose held jobs those are.
It would be great if we could get lots of jobs from atlasconnect so that I can monitor efficiently.
thanks
--Harinder
On Mon, Aug 18, 2014 at 10:35 AM, David Lesny <ddl AT illinois.edu> wrote:
I saw the jobs in hold on your end and cleaned them out.
I submitted a bunch of jobs from my end to your site
and they seemed to run without a problem.
Could there have been some type of transient problem at your end?
Perhaps a worker node with a bad NFS mount of the home areas?
Right now I have about 45 jobs running on your system.
dave
On 8/18/2014 12:23 PM, Dr. Harinder Singh Bawa wrote:
Hi Dave,
As discussed, I have seen lots of jobs going into the held state over the last 3-4 weeks. Error code 14 (HoldReasonCode) indicates that Condor cannot access the initial working directory for the job.
See for example:
condor_q -global -l|grep HoldReason
[bawa@t3nfs ~]$ condor_q -global 104266.0 -l |grep HoldReason
HoldReasonSubCode = 2
HoldReason = "Cannot access initial working directory /nfs/t3nfs_common/home/fresnoatlas/bosco/rccf-atlas.ci-connect.net/fresnostate/sandbox/a6c3/a6c34aed/rccf-atlas.ci-connect.net_11018_rccf-atlas.ci-connect.net#40710.0#1408333977: No such file or directory"
HoldReasonCode = 14
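A minimal Python sketch of how the attributes above could be pulled out of saved `condor_q -l` output, and how the sandbox directory name might hint at the remote job. The function names are hypothetical. Reading the last path component as `<prefix>#<remote cluster.proc>#<epoch seconds>` is an assumption inferred from this single example, not documented BOSCO behavior.

```python
import re
from datetime import datetime, timezone

# Sample text copied from the condor_q output quoted in this message.
SAMPLE = '''HoldReasonSubCode = 2
HoldReason = "Cannot access initial working directory /nfs/t3nfs_common/home/fresnoatlas/bosco/rccf-atlas.ci-connect.net/fresnostate/sandbox/a6c3/a6c34aed/rccf-atlas.ci-connect.net_11018_rccf-atlas.ci-connect.net#40710.0#1408333977: No such file or directory"
HoldReasonCode = 14
'''


def parse_classad_lines(text):
    """Turn 'Name = value' lines into a dict of strings (quotes stripped)."""
    attrs = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" = ")
        if not sep:
            continue  # not an attribute line
        attrs[name.strip()] = value.strip().strip('"')
    return attrs


def iwd_from_hold_reason(reason):
    """Extract the directory path from a code-14 style hold reason, or None."""
    m = re.search(r"Cannot access initial working directory (\S+): ", reason)
    return m.group(1) if m else None


def split_sandbox_name(path):
    """Split the last path component on '#'.

    ASSUMPTION: the three parts are a host prefix, a job id on the
    submitting side, and a Unix timestamp. This is a guess from one sample.
    """
    parts = path.rsplit("/", 1)[-1].split("#")
    if len(parts) != 3:
        raise ValueError("unexpected sandbox name: %r" % path)
    prefix, job_id, epoch = parts
    when = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return prefix, job_id, when


if __name__ == "__main__":
    attrs = parse_classad_lines(SAMPLE)
    path = iwd_from_hold_reason(attrs["HoldReason"])
    prefix, job_id, when = split_sandbox_name(path)
    print(attrs["HoldReasonCode"], job_id, when.isoformat())
```

If the assumption holds, the middle field (40710.0 here) would be the job id on the rccf-atlas.ci-connect.net side, which may let the ATLAS Connect admins look up the original submitter even though the local owner is always fresnoatlas. The trailing number decodes to 18 Aug 2014 03:52:57 UTC, which is 17 Aug 20:52:57 in US Pacific daylight time and so is consistent with the 8/17 20:53 submit time shown for job 104266.0.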
Here is the list of jobs currently in the held state:
[bawa@t3nfs ~]$ condor_q -global
-- Schedd: t3head.atlas.csufresno.edu : <129.8.242.180:9840?CCBID=129.8.242.180:9618%3fPrivNet%3datlas.csufresno.edu#234908&PrivAddr=%3c192.168.100.1:9840%3e&PrivNet=atlas.csufresno.edu>
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
103390.0 fresnoatlas 8/14 15:42 0+00:00:00 H 0 0.0 condor_exec.exe -d
103391.0 fresnoatlas 8/14 15:42 0+00:00:01 H 0 0.0 condor_exec.exe -d
103394.0 fresnoatlas 8/14 15:42 0+00:00:01 H 0 0.0 condor_exec.exe -d
103420.0 fresnoatlas 8/14 16:17 0+00:00:04 H 0 0.0 condor_exec.exe -d
103423.0 fresnoatlas 8/14 16:17 0+00:00:00 H 0 0.0 condor_exec.exe -d
103424.0 fresnoatlas 8/14 16:17 0+00:00:03 H 0 0.0 condor_exec.exe -d
103433.0 fresnoatlas 8/14 16:17 0+00:00:01 H 0 0.0 condor_exec.exe -d
103538.0 fresnoatlas 8/14 18:08 0+00:00:08 H 0 0.0 condor_exec.exe -d
103539.0 fresnoatlas 8/14 18:08 0+00:00:01 H 0 0.0 condor_exec.exe -d
103540.0 fresnoatlas 8/14 18:08 0+00:00:02 H 0 0.0 condor_exec.exe -d
103541.0 fresnoatlas 8/14 18:08 0+00:00:04 H 0 0.0 condor_exec.exe -d
103542.0 fresnoatlas 8/14 18:08 0+00:00:02 H 0 0.0 condor_exec.exe -d
103553.0 fresnoatlas 8/14 18:09 0+00:00:20 H 0 0.0 condor_exec.exe -d
103554.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103555.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103556.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103557.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103558.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103559.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103560.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103561.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103562.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103563.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103564.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103565.0 fresnoatlas 8/14 18:09 0+00:00:00 H 0 0.0 condor_exec.exe -d
103575.0 fresnoatlas 8/14 19:12 0+00:00:00 H 0 0.0 condor_exec.exe -d
103576.0 fresnoatlas 8/14 19:12 0+00:00:00 H 0 0.0 condor_exec.exe -d
103577.0 fresnoatlas 8/14 19:12 0+00:00:00 H 0 0.0 condor_exec.exe -d
103578.0 fresnoatlas 8/14 19:12 0+00:00:00 H 0 0.0 condor_exec.exe -d
103579.0 fresnoatlas 8/14 19:12 0+00:00:00 H 0 0.0 condor_exec.exe -d
103580.0 fresnoatlas 8/14 19:13 0+00:00:00 H 0 0.0 condor_exec.exe -d
103607.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103608.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103609.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103610.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103611.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103612.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103613.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103614.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103615.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103616.0 fresnoatlas 8/14 19:36 0+00:00:00 H 0 0.0 condor_exec.exe -d
103699.0 fresnoatlas 8/14 22:03 0+00:00:00 H 0 0.0 condor_exec.exe -d
103709.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103710.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103711.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103712.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103713.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103714.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103715.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103716.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103717.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103718.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103719.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103720.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103721.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103722.0 fresnoatlas 8/14 22:05 0+00:00:00 H 0 0.0 condor_exec.exe -d
103723.0 fresnoatlas 8/14 22:06 0+00:00:00 H 0 0.0 condor_exec.exe -d
103874.0 fresnoatlas 8/15 03:35 0+00:00:00 H 0 0.0 condor_exec.exe -d
103875.0 fresnoatlas 8/15 03:35 0+00:00:00 H 0 0.0 condor_exec.exe -d
103876.0 fresnoatlas 8/15 03:35 0+00:00:00 H 0 0.0 condor_exec.exe -d
103877.0 fresnoatlas 8/15 03:35 0+00:00:00 H 0 0.0 condor_exec.exe -d
104265.0 fresnoatlas 8/17 20:53 0+00:00:00 H 0 0.0 condor_exec.exe -d
104266.0 fresnoatlas 8/17 20:53 0+00:00:00 H 0 0.0 condor_exec.exe -d
104267.0 fresnoatlas 8/17 20:53 0+00:00:00 H 0 0.0 condor_exec.exe -d
104268.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104269.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104270.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104271.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104272.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104273.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104274.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104275.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104276.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104277.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104278.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104279.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104280.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104281.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104282.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104283.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104284.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
104285.0 fresnoatlas 8/17 20:54 0+00:00:00 H 0 0.0 condor_exec.exe -d
82 jobs; 0 idle, 0 running, 82 held
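To see whether the held jobs arrive in bursts, a listing like the one above could be tallied by owner and submit date. The following is a rough Python sketch that parses saved `condor_q` table text. The function name and the row pattern are hypothetical and assume the column layout shown here (ID, OWNER, SUBMITTED date and time, RUN_TIME, ST, ...).

```python
import re
from collections import Counter

# A few rows copied from the listing above, plus the header and summary
# lines to show that non-job lines are skipped.
SAMPLE = """ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
103390.0 fresnoatlas 8/14 15:42 0+00:00:00 H 0 0.0 condor_exec.exe -d
103874.0 fresnoatlas 8/15 03:35 0+00:00:00 H 0 0.0 condor_exec.exe -d
104265.0 fresnoatlas 8/17 20:53 0+00:00:00 H 0 0.0 condor_exec.exe -d
104266.0 fresnoatlas 8/17 20:53 0+00:00:00 H 0 0.0 condor_exec.exe -d
82 jobs; 0 idle, 0 running, 82 held
"""

# ASSUMPTION: columns are whitespace-separated in the order shown above.
ROW = re.compile(
    r"^(?P<job>\d+\.\d+)\s+(?P<owner>\S+)\s+(?P<date>\d+/\d+)\s+"
    r"(?P<time>\d+:\d+)\s+(?P<run>\S+)\s+(?P<st>\S+)\s+"
)


def held_by_owner_and_date(text):
    """Count rows whose ST column is 'H', keyed by (owner, submit date)."""
    counts = Counter()
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m and m.group("st") == "H":
            counts[(m.group("owner"), m.group("date"))] += 1
    return counts


if __name__ == "__main__":
    for (owner, date), n in sorted(held_by_owner_and_date(SAMPLE).items()):
        print(owner, date, n)
```

Counting the rows of the full listing by hand gives 57 held jobs submitted on 8/14, 4 on 8/15 and 21 on 8/17, which adds up to the 82 reported in the summary line. Because every row is owned by the local fresnoatlas account, this tally shows timing only and does not identify the remote submitter.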
Let me know if I can debug further or if you have any suggestions on what to look for.
Thanks
Harinder
On Fri, Aug 15, 2014 at 2:06 PM, Rob Gardner <rwg AT hep.uchicago.edu> wrote:
Folks,
We haven’t had a meeting in a while. Let’s sync up this Monday, which will be good to do in advance of the LBNL meeting.
Monday, 18 August 2014 from 11:30 to 12:30 (US/Central)
Agenda (for discussion):
- I will post some slides for the LBNL meeting that we can review.
- The portableCVMFS solution
- The replicated Stratum 1 solution
- Status of unit tests (Jenkins)
- Any updates to the tutorials or github coming?
Thanks
Rob
_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l
--
David Lesny
Senior Research Physicist (Retired)
High Energy Physics
University of Illinois at Urbana-Champaign
Office: 217-333-4972 | Fax: 217-333-4990
Skype: ddlesny | mwt2-ddlesny