
atlas-connect-l - Re: [Atlas-connect-l] Problem at Fresno

atlas-connect-l AT lists.bnl.gov

  • From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa AT gmail.com>
  • To: Peter Onyisi <ponyisi AT utexas.edu>
  • Cc: Lincoln Bryant <lincolnb AT uchicago.edu>, atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] Problem at Fresno
  • Date: Sat, 26 Apr 2014 11:52:01 -0700

Hi Jay,

My observation:
============
After seeing the message, I checked pt3wrk8 and there was enough space in its /tmp directory (2.2G available). But a while later, when I tried to submit my jobs there, I saw /tmp usage fluctuate a lot, jumping from 75% to 95% and then 98%.


[bawa@pt3wrk8 ~]$ df -mh
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sdb2                      50G  2.5G   45G   6% /
tmpfs                          24G     0   24G   0% /dev/shm
/dev/sdb1                     504M   72M  407M  16% /boot
/dev/mapper/vgsys-disk        616G  198M  609G   1% /disk
/dev/mapper/vgsys-opt          50G  180M   47G   1% /opt
/dev/mapper/vgsys-tmp         9.9G  7.2G  2.2G  77% /tmp
/dev/mapper/vgsys-var          50G  579M   47G   2% /var
/dev/mapper/vgsys-condor_lib  9.9G  151M  9.2G   2% /var/lib/condor
/dev/mapper/vgsys-condor_log   25G  209M   24G   1% /var/log/condor
/dev/mapper/vgsys-cvmfs        99G  1.8G   96G   2% /var/cache/cvmfs2
pt3head:/etc/condor-etc       9.7G  151M  9.1G   2% /nfs/t3head/condor-etc
pt3nfs:/NFSv3exports/home     243G   34G  197G  15% /nfs/t3nfs_common/home
pt3head:/xdata                3.4T  485G  2.7T  16% /nfs/t3head/xdata

After a while:


[bawa@pt3wrk8 ~]$ df -mh
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sdb2                      50G  2.5G   45G   6% /
tmpfs                          24G     0   24G   0% /dev/shm
/dev/sdb1                     504M   72M  407M  16% /boot
/dev/mapper/vgsys-disk        616G  198M  609G   1% /disk
/dev/mapper/vgsys-opt          50G  180M   47G   1% /opt
/dev/mapper/vgsys-tmp         9.9G  8.9G  484M  95% /tmp

pt3head:/xdata                3.4T  485G  2.7T  16% /nfs/t3head/xdata
cvmfs2                         79G  1.7G   77G   3% /cvmfs/atlas.cern.ch
cvmfs2                         79G  1.7G   77G   3% /cvmfs/atlas-condb.cern.ch
cvmfs2                         79G  1.7G   77G   3% /cvmfs/sft.cern.ch
cvmfs2                         79G  1.7G   77G   3% /cvmfs/atlas-nightlies.cern.ch
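
One way to catch the fluctuation in the act is to poll the volume and the largest files in /tmp together. This is only a rough sketch for anyone reproducing the problem; the 30-second interval is an arbitrary choice, not something used in this thread:

# refresh every 30 s: volume usage plus the largest entries in /tmp
watch -n 30 'df -h /tmp; ls -lhS /tmp | head'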



I see two files taking a lot of disk space (ls -l /tmp/):

-rw-------.  1 cvmfs       cvmfs        460586494 Apr 26 11:38 cvmfs.log.cachemgr
-rw-------.  1 cvmfs       cvmfs       7096647168 Apr 26 11:38 cvmfs.log

I don't see this on any other node.
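
A cvmfs.log of that size usually points to the CVMFS debug log being switched on for this one node (CVMFS_DEBUGLOG produces per-process files such as the .cachemgr one above). A rough sketch of how one might confirm and clean this up, assuming the standard client configuration layout; none of this is confirmed in the thread:

# check whether debug logging is enabled anywhere in the client config
grep -r CVMFS_DEBUGLOG /etc/cvmfs/

# if it is, comment it out (or point it at a larger volume), then reload
sudo cvmfs_config reload

# truncating (rather than deleting) frees the space even while the file is held open
sudo truncate -s 0 /tmp/cvmfs.log /tmp/cvmfs.log.cachemgr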

Harinder





On Sat, Apr 26, 2014 at 12:08 AM, Peter Onyisi <ponyisi AT utexas.edu> wrote:
Hi all:

Just as a datapoint, I'm getting a lot of jobs held due to disk space issues now, e.g.:

000 (18145.092.000) 04/26 02:02:51 Job submitted from host: <192.170.227.199:39142>
...
007 (18145.092.000) 04/26 02:06:03 Shadow exception!
        Error from 4575 AT pt3wrk8.atlas.csufresno.edu: STARTER at 192.168.100.108 failed to write to file /tmp/rcc_fresnostate/rcc.ecoWwIfWOI/execute.192.168.100.108-4575/dir_6945/999999_sherpa_4l_j.tgz: (errno 28) No space left on device
        0  -  Run Bytes Sent By Job
        121259304  -  Run Bytes Received By Job
...
012 (18145.092.000) 04/26 02:06:03 Job was held.
        Error from 4575 AT pt3wrk8.atlas.csufresno.edu: STARTER at 192.168.100.108 failed to write to file /tmp/rcc_fresnostate/rcc.ecoWwIfWOI/execute.192.168.100.108-4575/dir_6945/999999_sherpa_4l_j.tgz: (errno 28) No space left on device
        Code 12 Subcode 28
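
errno 28 is ENOSPC, i.e. plain disk exhaustion on the worker's /tmp. Once the space on pt3wrk8 is recovered, the held jobs can be inspected and released; a minimal sketch with standard HTCondor tools:

# list held jobs together with the hold reason condor recorded
condor_q -hold

# after the disk is cleaned up, release this user's held jobs
condor_release -all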



On Thu, Apr 10, 2014 at 3:03 PM, Jay Fowler <fowler AT csufresno.edu> wrote:

I'll continue to review our environment, but I'm not sure space was the issue; it could still be something else. The filesystem Condor uses to store data (IWD, output, and error files) has about 200 GB free.

If I am tracing this out correctly, one job ending around 13:00 Central (we are 2 hours behind) on 04/07 shows it being accepted and passed off to a worker node. The worker node processes it for a while and then exits with a status of zero. Other job submissions around that time exited with the same status, so I am assuming the work completed successfully. Feel free to comment further.

On Apr 7 our Condor scheduler processed 217 jobs, all of which exited with a status of 0, like:
  ShadowLog:04/07/14 11:15:27 (45662.0) (26949): Job 45662.0 terminated: exited with status 0
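
For what it's worth, a quick way to tally that day's exit statuses straight from the log, assuming the line format shown above:

# count jobs per exit status for 04/07; the status is the last field on the line
grep "^04/07/14" /var/log/condor/ShadowLog | grep "terminated: exited with status" | awk '{print $NF}' | sort | uniq -c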


Jay


Condor Head node - /var/log/condor/ShadowLog:
==========
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
04/07/14 09:19:54 ** /usr/sbin/condor_shadow
04/07/14 09:19:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
04/07/14 09:19:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
04/07/14 09:19:54 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/07/14 09:19:54 ** $CondorPlatform: x86_64_rhap_5 $
04/07/14 09:19:54 ** PID = 16737
04/07/14 09:19:54 ** Log last touched 4/7 09:19:54
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 Using config source: /etc/condor/condor_config
04/07/14 09:19:54 Using local config sources:
04/07/14 09:19:54    /etc/condor-etc/condor_config.cluster
04/07/14 09:19:54    /etc/condor-etc/condor_config.head.local
04/07/14 09:19:54 DaemonCore: command socket at <.....headnodeIP......:9849?PrivAddr=%3c192.168.100.1:9849%3e&PrivNet=.......our_domain.......&noUDP>
04/07/14 09:19:54 DaemonCore: private command socket at <192.168.100.1:9849>
04/07/14 09:19:54 Setting maximum accepts per cycle 4.
04/07/14 09:19:54 Setting maximum accepts per cycle 4.

04/07/14 09:19:54 (45568.0) (16737): Request to run on slot6 AT 192.168.100.121 <192.168.100.121:9082?PrivNet=.......our_domain.......> was ACCEPTED
...
04/07/14 10:50:04 (45568.0) (16737): Job 45568.0 terminated: exited with status 0
04/07/14 10:50:04 (45568.0) (16737): **** condor_shadow (condor_SHADOW) pid 16737 EXITING WITH STATUS 100
==========


Condor Worker Node - /var/log/condor/StartLog
==========
04/07/14 09:19:54 slot6: Got activate_claim request from shadow (192.168.100.1)
04/07/14 09:19:54 slot6: Remote job ID is 45568.0
04/07/14 09:19:54 slot6: Got universe "VANILLA" (5) from request classad
04/07/14 09:19:54 slot6: State change: claim-activation protocol successful
04/07/14 09:19:54 slot6: Changing activity: Idle -> Busy
04/07/14 09:19:54 slot6: match_info called
04/07/14 10:50:04 slot6: Called deactivate_claim_forcibly()
04/07/14 10:50:04 Starter pid 24344 exited with status 0
04/07/14 10:50:04 slot6: State change: starter exited
04/07/14 10:50:04 slot6: Changing activity: Busy -> Idle
04/07/14 10:50:04 slot6: State change: received RELEASE_CLAIM command
04/07/14 10:50:04 slot6: Changing state and activity: Claimed/Idle -> Preempting/Vacating
04/07/14 10:50:04 slot6: State change: No preempting claim, returning to owner
04/07/14 10:50:04 slot6: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/07/14 10:50:04 slot6: State change: IS_OWNER is false
04/07/14 10:50:04 slot6: Changing state: Owner -> Unclaimed
==========


Condor Worker Node - /var/log/condor/StarterLog.slot6
==========
04/07/14 09:19:54 Setting maximum accepts per cycle 8.
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 ** condor_starter (CONDOR_STARTER) STARTING UP
04/07/14 09:19:54 ** /usr/sbin/condor_starter
04/07/14 09:19:54 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/07/14 09:19:54 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/07/14 09:19:54 ** $CondorVersion: 7.8.8 Nov 22 2013 $
04/07/14 09:19:54 ** $CondorPlatform: X86_64-CentOS_6.4 $
04/07/14 09:19:54 ** PID = 24344
04/07/14 09:19:54 ** Log last touched 4/7 00:59:57
04/07/14 09:19:54 ******************************************************
...
...
04/07/14 09:19:54 Done setting resource limits
04/07/14 09:19:54 Job 45568.0 set to execute immediately
04/07/14 09:19:54 Starting a VANILLA universe job with ID: 45568.0
04/07/14 09:19:54 IWD: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/....uchicago.edu#2756.0#1396887566
04/07/14 09:19:54 Output file: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/......uchicago.edu#2756.0#1396887566/_condor_stdout
04/07/14 09:19:54 Error file: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/.....uchicago.edu#2756.0#1396887566/_condor_stderr
04/07/14 09:19:54 About to exec /home/....../sandbox/8336/833633df/.....uchicago.edu#2756.0#1396887566/condor_exec.exe -dyn -f
04/07/14 09:19:54 Running job as user xxxxxx
04/07/14 09:19:54 Create_Process succeeded, pid=24355
04/07/14 10:50:04 Process exited, pid=24355, status=0
04/07/14 10:50:04 Got SIGQUIT.  Performing fast shutdown.
04/07/14 10:50:04 ShutdownFast all jobs.
04/07/14 10:50:04 **** condor_starter (condor_STARTER) pid 24344 EXITING WITH STATUS 0
==========
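
For reference, the three excerpts above can be stitched together for a single job by grepping on the job ID (and, on the worker, the starter PID); a rough sketch, assuming the default log locations shown here:

# head node: the shadow's view of job 45568.0
grep 45568.0 /var/log/condor/ShadowLog

# worker node: slot activity and the starter for the same job
grep 45568.0 /var/log/condor/StartLog
grep -e 45568.0 -e pid=24355 /var/log/condor/StarterLog.slot6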






From: "Lincoln Bryant" <lincolnb AT uchicago.edu>
To: "Jay Fowler" <fowler AT csufresno.edu>
Cc: "Rob Gardner" <rwg AT hep.uchicago.edu>, "atlas-connect-l" <atlas-connect-l AT lists.bnl.gov>
Sent: Thursday, April 10, 2014 10:21:30 AM
Subject: Re: [Atlas-connect-l] Problem at Fresno


Hi Jay,

It looks like the aforementioned user's jobs may have gone through around 1 PM Central on April 7th.

Cheers,
Lincoln

On Apr 10, 2014, at 12:15 PM, Jay Fowler wrote:


It looks like the nodes have plenty of space today, but I cannot rule anything out just yet. Knowing when the job was submitted would help us identify the issue; any chance we could get a time frame?
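
If the submit-side cluster/proc ID is known, the submission time should be recoverable from condor_history on the submit host; a sketch under that assumption (12345.0 below is purely a placeholder, not a real job ID from this thread):

# print submission and completion times (epoch seconds) for one job
condor_history -format "submitted=%d " QDate -format "completed=%d\n" CompletionDate 12345.0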

Thanks,

Jay


From: "Rob Gardner" <rwg AT hep.uchicago.edu>
To: "Dr. Harinder Singh Bawa" <harinder.singh.bawa AT gmail.com>
Cc: "atlas-connect-l" <atlas-connect-l AT lists.bnl.gov>
Sent: Wednesday, April 9, 2014 4:57:12 PM
Subject: [Atlas-connect-l] Problem at Fresno

Harinder, 

We got a report from a user:

Another problem I am seeing is that some of the Fresno sites cause my jobs to fail because of low disk space. Could you let the admins there know about this? I expect each job to be ~500 MB, which includes the code and the output (generated MadGraph events), so I don't think I'm filling it up.
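
As a rough back-of-the-envelope check against the df numbers Harinder posted above: /tmp on pt3wrk8 is a 9.9 GB volume with only ~2.2 GB free once the ~7 GB cvmfs.log is in place, so at ~500 MB per job roughly four concurrent jobs (4 × 0.5 GB ≈ 2 GB) are enough to hit "No space left on device" even if no single job is oversized.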

Could you take a look and report back?

Thanks

---
Rob Gardner • Twitter: @rwg • Skype: rwg773 • g+: rob.rwg • +1 312-804-0859 • University of Chicago





--
Peter Onyisi   |
CERN: 4-R-028  | Department of Physics
UT: RLM 10.211 | University of Texas at Austin


_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l




--
Dr. Harinder Singh Bawa
California State University, Fresno



