- From: Jay Fowler <fowler AT csufresno.edu>
- To: Lincoln Bryant <lincolnb AT uchicago.edu>
- Cc: atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
- Subject: Re: [Atlas-connect-l] Problem at Fresno
- Date: Thu, 10 Apr 2014 16:03:09 -0400 (EDT)
I'll continue to review our environment, but I'm not sure space was the issue; it could still be something else. The filesystem Condor uses to store job data (IWD, output, and error files) has about 200 GB free.
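(A minimal sketch of this kind of free-space check; the directory list is an assumption and the real locations come from "condor_config_val SPOOL" / "condor_config_val EXECUTE" on each node:)
==========
import os

# Hedged sketch: report free space on the filesystems Condor writes to.
# The paths below are assumptions, not our actual configuration.
CANDIDATE_DIRS = [
    "/var/lib/condor/spool",    # assumed SPOOL on the head node
    "/var/lib/condor/execute",  # assumed EXECUTE (worker sandboxes)
    "/nfs/t3nfs_common/home",   # filesystem holding IWD/output/error files
]

for path in CANDIDATE_DIRS:
    if not os.path.isdir(path):
        print("%-30s (missing)" % path)
        continue
    st = os.statvfs(path)
    free_gb = st.f_bavail * st.f_frsize / float(1024 ** 3)
    print("%-30s %8.1f GB free" % (path, free_gb))
==========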
If I am tracing this out correctly, one job ending around 13:00 Central (we are 2 hours behind) on 04/07 shows it being accepted and handed off to a worker node. The worker node processes it for a while and exits with a status of zero. Other job submissions around this time exited with the same status, so I am assuming the work completed successfully. Feel free to comment further.
On Apr 7 our Condor scheduler processed 217 jobs, all of which exited with status 0, e.g.:
ShadowLog:04/07/14 11:15:27 (45662.0) (26949): Job 45662.0 terminated: exited with status 0
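(A minimal sketch of how such a tally can be reproduced from the ShadowLog; the log path and date are taken from the excerpts below, and the stock log line format is an assumption:)
==========
import re

# Hedged sketch: count Condor job terminations per exit status for one day,
# based on ShadowLog lines like
#   04/07/14 11:15:27 (45662.0) (26949): Job 45662.0 terminated: exited with status 0
pattern = re.compile(r"terminated: exited with status (\d+)")

counts = {}
with open("/var/log/condor/ShadowLog") as log:
    for line in log:
        if not line.startswith("04/07/14"):
            continue
        m = pattern.search(line)
        if m:
            status = int(m.group(1))
            counts[status] = counts.get(status, 0) + 1

for status in sorted(counts):
    print("status %d: %d jobs" % (status, counts[status]))
==========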
Jay
Condor Head Node - /var/log/condor/ShadowLog:
==========
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
04/07/14 09:19:54 ** /usr/sbin/condor_shadow
04/07/14 09:19:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
04/07/14 09:19:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
04/07/14 09:19:54 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/07/14 09:19:54 ** $CondorPlatform: x86_64_rhap_5 $
04/07/14 09:19:54 ** PID = 16737
04/07/14 09:19:54 ** Log last touched 4/7 09:19:54
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 Using config source: /etc/condor/condor_config
04/07/14 09:19:54 Using local config sources:
04/07/14 09:19:54 /etc/condor-etc/condor_config.cluster
04/07/14 09:19:54 /etc/condor-etc/condor_config.head.local
04/07/14 09:19:54 DaemonCore: command socket at <.....headnodeIP......:9849?PrivAddr=%3c192.168.100.1:9849%3e&PrivNet=.......our_domain.......&noUDP>
04/07/14 09:19:54 DaemonCore: private command socket at <192.168.100.1:9849>
04/07/14 09:19:54 Setting maximum accepts per cycle 4.
04/07/14 09:19:54 Setting maximum accepts per cycle 4.
04/07/14 09:19:54 (45568.0) (16737): Request to run on slot6 AT 192.168.100.121 <192.168.100.121:9082?PrivNet=.......our_domain.......> was ACCEPTED
...
04/07/14 10:50:04 (45568.0) (16737): Job 45568.0 terminated: exited with status 0
04/07/14 10:50:04 (45568.0) (16737): **** condor_shadow (condor_SHADOW) pid 16737 EXITING WITH STATUS 100
==========
Condor Worker Node - /var/log/condor/StartLog:
==========
04/07/14 09:19:54 slot6: Got activate_claim request from shadow (192.168.100.1)
04/07/14 09:19:54 slot6: Remote job ID is 45568.0
04/07/14 09:19:54 slot6: Got universe "VANILLA" (5) from request classad
04/07/14 09:19:54 slot6: State change: claim-activation protocol successful
04/07/14 09:19:54 slot6: Changing activity: Idle -> Busy
04/07/14 09:19:54 slot6: match_info called
04/07/14 10:50:04 slot6: Called deactivate_claim_forcibly()
04/07/14 10:50:04 Starter pid 24344 exited with status 0
04/07/14 10:50:04 slot6: State change: starter exited
04/07/14 10:50:04 slot6: Changing activity: Busy -> Idle
04/07/14 10:50:04 slot6: State change: received RELEASE_CLAIM command
04/07/14 10:50:04 slot6: Changing state and activity: Claimed/Idle -> Preempting/Vacating
04/07/14 10:50:04 slot6: State change: No preempting claim, returning to owner
04/07/14 10:50:04 slot6: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/07/14 10:50:04 slot6: State change: IS_OWNER is false
04/07/14 10:50:04 slot6: Changing state: Owner -> Unclaimed
==========
Condor Worker Node - /var/log/condor/StarterLog.slot6:
==========
04/07/14 09:19:54 Setting maximum accepts per cycle 8.
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 ** condor_starter (CONDOR_STARTER) STARTING UP
04/07/14 09:19:54 ** /usr/sbin/condor_starter
04/07/14 09:19:54 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/07/14 09:19:54 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/07/14 09:19:54 ** $CondorVersion: 7.8.8 Nov 22 2013 $
04/07/14 09:19:54 ** $CondorPlatform: X86_64-CentOS_6.4 $
04/07/14 09:19:54 ** PID = 24344
04/07/14 09:19:54 ** Log last touched 4/7 00:59:57
04/07/14 09:19:54 ******************************************************
...
...
04/07/14 09:19:54 Done setting resource limits
04/07/14 09:19:54 Job 45568.0 set to execute immediately
04/07/14 09:19:54 Starting a VANILLA universe job with ID: 45568.0
04/07/14 09:19:54 IWD: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/....uchicago.edu#2756.0#1396887566
04/07/14 09:19:54 Output file: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/......uchicago.edu#2756.0#1396887566/_condor_stdout
04/07/14 09:19:54 Error file: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/.....uchicago.edu#2756.0#1396887566/_condor_stderr
04/07/14 09:19:54 About to exec /home/....../sandbox/8336/833633df/.....uchicago.edu#2756.0#1396887566/condor_exec.exe -dyn -f
04/07/14 09:19:54 Running job as user xxxxxx
04/07/14 09:19:54 Create_Process succeeded, pid=24355
04/07/14 10:50:04 Process exited, pid=24355, status=0
04/07/14 10:50:04 Got SIGQUIT. Performing fast shutdown.
04/07/14 10:50:04 ShutdownFast all jobs.
04/07/14 10:50:04 **** condor_starter (condor_STARTER) pid 24344 EXITING WITH STATUS 0
==========
From: "Lincoln Bryant" <lincolnb AT uchicago.edu>
To: "Jay Fowler" <fowler AT csufresno.edu>
Cc: "Rob Gardner" <rwg AT hep.uchicago.edu>, "atlas-connect-l" <atlas-connect-l AT lists.bnl.gov>
Sent: Thursday, April 10, 2014 10:21:30 AM
Subject: Re: [Atlas-connect-l] Problem at Fresno
Hi Jay,

It looks like the aforementioned user's jobs may have gone through around 1 PM Central on April 7th.

Cheers,
Lincoln

On Apr 10, 2014, at 12:15 PM, Jay Fowler wrote:
It looks like the nodes have plenty of space today, but I cannot rule anything out just yet. Knowing when the job was submitted would help us narrow this down; any chance we could get a time frame?
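(One concrete way to recover the submission time from the scheduler side is condor_history; a hedged sketch using the cluster ID from the logs above and the standard QDate/CompletionDate ClassAd attributes:)
==========
import subprocess
import time

# Hedged sketch: pull submission and completion times for one job out of
# condor_history. 45568.0 is the job ID from the ShadowLog excerpt above;
# QDate and CompletionDate are standard ClassAd attributes (Unix epoch).
out = subprocess.check_output(
    ["condor_history", "-l", "45568.0"], universal_newlines=True)

for line in out.splitlines():
    if line.startswith(("QDate", "CompletionDate")):
        name, _, value = line.partition(" = ")
        print("%s: %s" % (name, time.ctime(int(value))))
==========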
Thanks,
Jay

From: "Rob Gardner" <rwg AT hep.uchicago.edu>
To: "Dr. Harinder Singh Bawa" <harinder.singh.bawa AT gmail.com>
Cc: "atlas-connect-l" <atlas-connect-l AT lists.bnl.gov>
Sent: Wednesday, April 9, 2014 4:57:12 PM
Subject: [Atlas-connect-l] Problem at Fresno

Harinder,

We got a report from a user:

"Another problem I am seeing is that some of the Fresno sites cause my jobs to fail because of low disk space. Could you let the admins there know about this? I expect the size of each job to be ~500 MB, which includes the code and the output (generated MadGraph events). So I don't think I'm filling it up."

Could you take a look and report back?

Thanks

---
Rob Gardner • Twitter: @rwg • Skype: rwg773 • g+: rob.rwg • +1 312-804-0859 • University of Chicago
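(To check the user's ~500 MB-per-job estimate against what actually lands on disk, a hedged sketch that totals one sandbox directory; the path is hypothetical and would come from the IWD/Output/Error lines in the StarterLog above:)
==========
import os

# Hedged sketch: total the on-disk size of one job sandbox to compare
# against the user's ~500 MB estimate. The path is hypothetical; real
# sandbox locations appear in the StarterLog IWD/Output/Error lines.
def sandbox_size(path):
    total = 0
    for root, dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file removed while walking; skip it
    return total

sandbox = "/var/lib/condor/execute/dir_24355"  # hypothetical sandbox path
print("%.1f MB" % (sandbox_size(sandbox) / float(1024 ** 2)))
==========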
_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l
- [Atlas-connect-l] Problem at Fresno, Rob Gardner, 04/09/2014
- Re: [Atlas-connect-l] Problem at Fresno, Jay Fowler, 04/10/2014
- Re: [Atlas-connect-l] Problem at Fresno, Lincoln Bryant, 04/10/2014
- Re: [Atlas-connect-l] Problem at Fresno, Jay Fowler, 04/10/2014
- Re: [Atlas-connect-l] Problem at Fresno, Peter Onyisi, 04/26/2014
- Re: [Atlas-connect-l] Problem at Fresno, Dr. Harinder Singh Bawa, 04/26/2014
- Re: [Atlas-connect-l] Problem at Fresno, Dr. Harinder Singh Bawa, 04/26/2014
- Re: [Atlas-connect-l] Problem at Fresno, Peter Onyisi, 04/26/2014
- Re: [Atlas-connect-l] Problem at Fresno, Jay Fowler, 04/10/2014
- Re: [Atlas-connect-l] Problem at Fresno, Lincoln Bryant, 04/10/2014
- Re: [Atlas-connect-l] Problem at Fresno, Jay Fowler, 04/10/2014