
atlas-connect-l - Re: [Atlas-connect-l] Problem at Fresno

atlas-connect-l AT lists.bnl.gov

  • From: "Dr. Harinder Singh Bawa" <harinder.singh.bawa AT gmail.com>
  • To: Peter Onyisi <ponyisi AT utexas.edu>
  • Cc: Lincoln Bryant <lincolnb AT uchicago.edu>, atlas-connect-l <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] Problem at Fresno
  • Date: Sat, 26 Apr 2014 11:52:01 -0700

Hi Jay,

My observation:
============
After seeing the message, I checked pt3wrk8 and there was enough space in its /tmp directory (2.2G available). But a while later, when I tried to submit my jobs there, I saw /tmp usage fluctuate a lot, jumping from 75% to 95% and then 98%.


[bawa@pt3wrk8 ~]$ df -mh
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sdb2                      50G  2.5G   45G   6% /
tmpfs                          24G     0   24G   0% /dev/shm
/dev/sdb1                     504M   72M  407M  16% /boot
/dev/mapper/vgsys-disk        616G  198M  609G   1% /disk
/dev/mapper/vgsys-opt          50G  180M   47G   1% /opt
/dev/mapper/vgsys-tmp         9.9G  7.2G  2.2G  77% /tmp
/dev/mapper/vgsys-var          50G  579M   47G   2% /var
/dev/mapper/vgsys-condor_lib  9.9G  151M  9.2G   2% /var/lib/condor
/dev/mapper/vgsys-condor_log   25G  209M   24G   1% /var/log/condor
/dev/mapper/vgsys-cvmfs        99G  1.8G   96G   2% /var/cache/cvmfs2
pt3head:/etc/condor-etc       9.7G  151M  9.1G   2% /nfs/t3head/condor-etc
pt3nfs:/NFSv3exports/home     243G   34G  197G  15% /nfs/t3nfs_common/home
pt3head:/xdata                3.4T  485G  2.7T  16% /nfs/t3head/xdata

After a while:


[bawa@pt3wrk8 ~]$ df -mh
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sdb2                      50G  2.5G   45G   6% /
tmpfs                          24G     0   24G   0% /dev/shm
/dev/sdb1                     504M   72M  407M  16% /boot
/dev/mapper/vgsys-disk        616G  198M  609G   1% /disk
/dev/mapper/vgsys-opt          50G  180M   47G   1% /opt
/dev/mapper/vgsys-tmp         9.9G  8.9G  484M  95% /tmp

pt3head:/xdata                3.4T  485G  2.7T  16% /nfs/t3head/xdata
cvmfs2                         79G  1.7G   77G   3% /cvmfs/atlas.cern.ch
cvmfs2                         79G  1.7G   77G   3% /cvmfs/atlas-condb.cern.ch
cvmfs2                         79G  1.7G   77G   3% /cvmfs/sft.cern.ch
cvmfs2                         79G  1.7G   77G   3% /cvmfs/atlas-nightlies.cern.ch
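
One way to catch the fluctuation in the act is to poll the volume and the largest files in /tmp together. This is only a rough sketch for anyone reproducing the problem; the 30-second interval is an arbitrary choice, not something used in this thread:

# refresh every 30 s: volume usage plus the largest entries in /tmp
watch -n 30 'df -h /tmp; ls -lhS /tmp | head'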



I see two files taking a lot of disk space (ls -l /tmp/):

-rw-------.  1 cvmfs       cvmfs        460586494 Apr 26 11:38 cvmfs.log.cachemgr
-rw-------.  1 cvmfs       cvmfs       7096647168 Apr 26 11:38 cvmfs.log

I don't see this on any other node.
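
A cvmfs.log of that size usually points to the CVMFS debug log being switched on for this one node (CVMFS_DEBUGLOG produces per-process files such as the .cachemgr one above). A rough sketch of how one might confirm and clean this up, assuming the standard client configuration layout; none of this is confirmed in the thread:

# check whether debug logging is enabled anywhere in the client config
grep -r CVMFS_DEBUGLOG /etc/cvmfs/

# if it is, comment it out (or point it at a larger volume), then reload
sudo cvmfs_config reload

# truncating (rather than deleting) frees the space even while the file is held open
sudo truncate -s 0 /tmp/cvmfs.log /tmp/cvmfs.log.cachemgr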

Harinder





On Sat, Apr 26, 2014 at 12:08 AM, Peter Onyisi <ponyisi AT utexas.edu> wrote:
Hi all:

Just as a datapoint, I'm getting a lot of jobs held due to disk space issues now, e.g.:

000 (18145.092.000) 04/26 02:02:51 Job submitted from host: <192.170.227.199:39142>
...
007 (18145.092.000) 04/26 02:06:03 Shadow exception!
        Error from 4575 AT pt3wrk8.atlas.csufresno.edu: STARTER at 192.168.100.108 failed to write to file /tmp/rcc_fresnostate/rcc.ecoWwIfWOI/execute.192.168.100.108-4575/dir_6945/999999_sherpa_4l_j.tgz: (errno 28) No space left on device
        0  -  Run Bytes Sent By Job
        121259304  -  Run Bytes Received By Job
...
012 (18145.092.000) 04/26 02:06:03 Job was held.
        Error from 4575 AT pt3wrk8.atlas.csufresno.edu: STARTER at 192.168.100.108 failed to write to file /tmp/rcc_fresnostate/rcc.ecoWwIfWOI/execute.192.168.100.108-4575/dir_6945/999999_sherpa_4l_j.tgz: (errno 28) No space left on device
        Code 12 Subcode 28
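
errno 28 is ENOSPC, i.e. plain disk exhaustion on the worker's /tmp. Once the space on pt3wrk8 is recovered, the held jobs can be inspected and released; a minimal sketch with standard HTCondor tools:

# list held jobs together with the hold reason condor recorded
condor_q -hold

# after the disk is cleaned up, release this user's held jobs
condor_release -all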



On Thu, Apr 10, 2014 at 3:03 PM, Jay Fowler <fowler AT csufresno.edu> wrote:

I'll continue to review our environment, but I'm not sure space was the issue; it could still be something else. The filesystem Condor uses to store data (IWD, output, and error files) has about 200 GB free.

If I am tracing this out correctly, one job ending around 13:00 Central (we are 2 hours behind) on 04/07 shows it being accepted and passed off to a worker node. The worker node processes it for a while and then exits with a status of zero. Other job submissions around that time exited with the same status, so I am assuming the work completed successfully. Feel free to comment further.

On Apr 7 our Condor scheduler processed 217 jobs, all of which exited with a status of 0, like:
  ShadowLog:04/07/14 11:15:27 (45662.0) (26949): Job 45662.0 terminated: exited with status 0
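
For what it's worth, a quick way to tally that day's exit statuses straight from the log, assuming the line format shown above:

# count jobs per exit status for 04/07; the status is the last field on the line
grep "^04/07/14" /var/log/condor/ShadowLog | grep "terminated: exited with status" | awk '{print $NF}' | sort | uniq -c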


Jay


Condor Head node - /var/log/condor/ShadowLog:
==========
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 ** condor_shadow (CONDOR_SHADOW) STARTING UP
04/07/14 09:19:54 ** /usr/sbin/condor_shadow
04/07/14 09:19:54 ** SubsystemInfo: name=SHADOW type=SHADOW(6) class=DAEMON(1)
04/07/14 09:19:54 ** Configuration: subsystem:SHADOW local:<NONE> class:DAEMON
04/07/14 09:19:54 ** $CondorVersion: 7.6.6 Jan 17 2012 BuildID: 401976 $
04/07/14 09:19:54 ** $CondorPlatform: x86_64_rhap_5 $
04/07/14 09:19:54 ** PID = 16737
04/07/14 09:19:54 ** Log last touched 4/7 09:19:54
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 Using config source: /etc/condor/condor_config
04/07/14 09:19:54 Using local config sources:
04/07/14 09:19:54    /etc/condor-etc/condor_config.cluster
04/07/14 09:19:54    /etc/condor-etc/condor_config.head.local
04/07/14 09:19:54 DaemonCore: command socket at <.....headnodeIP......:9849?PrivAddr=%3c192.168.100.1:9849%3e&PrivNet=.......our_domain.......&noUDP>
04/07/14 09:19:54 DaemonCore: private command socket at <192.168.100.1:9849>
04/07/14 09:19:54 Setting maximum accepts per cycle 4.
04/07/14 09:19:54 Setting maximum accepts per cycle 4.

04/07/14 09:19:54 (45568.0) (16737): Request to run on slot6 AT 192.168.100.121 <192.168.100.121:9082?PrivNet=.......our_domain.......> was ACCEPTED
...
04/07/14 10:50:04 (45568.0) (16737): Job 45568.0 terminated: exited with status 0
04/07/14 10:50:04 (45568.0) (16737): **** condor_shadow (condor_SHADOW) pid 16737 EXITING WITH STATUS 100
==========


Condor Worker Node - /var/log/condor/StartLog
==========
04/07/14 09:19:54 slot6: Got activate_claim request from shadow (192.168.100.1)
04/07/14 09:19:54 slot6: Remote job ID is 45568.0
04/07/14 09:19:54 slot6: Got universe "VANILLA" (5) from request classad
04/07/14 09:19:54 slot6: State change: claim-activation protocol successful
04/07/14 09:19:54 slot6: Changing activity: Idle -> Busy
04/07/14 09:19:54 slot6: match_info called
04/07/14 10:50:04 slot6: Called deactivate_claim_forcibly()
04/07/14 10:50:04 Starter pid 24344 exited with status 0
04/07/14 10:50:04 slot6: State change: starter exited
04/07/14 10:50:04 slot6: Changing activity: Busy -> Idle
04/07/14 10:50:04 slot6: State change: received RELEASE_CLAIM command
04/07/14 10:50:04 slot6: Changing state and activity: Claimed/Idle -> Preempting/Vacating
04/07/14 10:50:04 slot6: State change: No preempting claim, returning to owner
04/07/14 10:50:04 slot6: Changing state and activity: Preempting/Vacating -> Owner/Idle
04/07/14 10:50:04 slot6: State change: IS_OWNER is false
04/07/14 10:50:04 slot6: Changing state: Owner -> Unclaimed
==========


Condor Worker Node - /var/log/condor/StarterLog.slot6
==========
04/07/14 09:19:54 Setting maximum accepts per cycle 8.
04/07/14 09:19:54 ******************************************************
04/07/14 09:19:54 ** condor_starter (CONDOR_STARTER) STARTING UP
04/07/14 09:19:54 ** /usr/sbin/condor_starter
04/07/14 09:19:54 ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
04/07/14 09:19:54 ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
04/07/14 09:19:54 ** $CondorVersion: 7.8.8 Nov 22 2013 $
04/07/14 09:19:54 ** $CondorPlatform: X86_64-CentOS_6.4 $
04/07/14 09:19:54 ** PID = 24344
04/07/14 09:19:54 ** Log last touched 4/7 00:59:57
04/07/14 09:19:54 ******************************************************
...
...
04/07/14 09:19:54 Done setting resource limits
04/07/14 09:19:54 Job 45568.0 set to execute immediately
04/07/14 09:19:54 Starting a VANILLA universe job with ID: 45568.0
04/07/14 09:19:54 IWD: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/....uchicago.edu#2756.0#1396887566
04/07/14 09:19:54 Output file: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/......uchicago.edu#2756.0#1396887566/_condor_stdout
04/07/14 09:19:54 Error file: /nfs/t3nfs_common/home/....../sandbox/8336/833633df/.....uchicago.edu#2756.0#1396887566/_condor_stderr
04/07/14 09:19:54 About to exec /home/....../sandbox/8336/833633df/.....uchicago.edu#2756.0#1396887566/condor_exec.exe -dyn -f
04/07/14 09:19:54 Running job as user xxxxxx
04/07/14 09:19:54 Create_Process succeeded, pid=24355
04/07/14 10:50:04 Process exited, pid=24355, status=0
04/07/14 10:50:04 Got SIGQUIT.  Performing fast shutdown.
04/07/14 10:50:04 ShutdownFast all jobs.
04/07/14 10:50:04 **** condor_starter (condor_STARTER) pid 24344 EXITING WITH STATUS 0
==========
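
For reference, the three excerpts above can be stitched together for a single job by grepping on the job ID (and, on the worker, the starter PID); a rough sketch, assuming the default log locations shown here:

# head node: the shadow's view of job 45568.0
grep 45568.0 /var/log/condor/ShadowLog

# worker node: slot activity and the starter for the same job
grep 45568.0 /var/log/condor/StartLog
grep -e 45568.0 -e pid=24355 /var/log/condor/StarterLog.slot6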






From: "Lincoln Bryant" <lincolnb AT uchicago.edu>
To: "Jay Fowler" <fowler AT csufresno.edu>
Cc: "Rob Gardner" <rwg AT hep.uchicago.edu>, "atlas-connect-l" <atlas-connect-l AT lists.bnl.gov>
Sent: Thursday, April 10, 2014 10:21:30 AM
Subject: Re: [Atlas-connect-l] Problem at Fresno


Hi Jay,

It looks like the aforementioned user's jobs may have gone through around 1 PM Central on April 7th.

Cheers,
Lincoln

On Apr 10, 2014, at 12:15 PM, Jay Fowler wrote:


It looks like the nodes have plenty of space today, but I cannot rule anything out just yet. Knowing when the job was submitted would help us identify the issue; any chance we could get a time frame?
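
If the submit-side cluster/proc ID is known, the submission time should be recoverable from condor_history on the submit host; a sketch under that assumption (12345.0 below is purely a placeholder, not a real job ID from this thread):

# print submission and completion times (epoch seconds) for one job
condor_history -format "submitted=%d " QDate -format "completed=%d\n" CompletionDate 12345.0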

Thanks,

Jay


From: "Rob Gardner" <rwg AT hep.uchicago.edu>
To: "Dr. Harinder Singh Bawa" <harinder.singh.bawa AT gmail.com>
Cc: "atlas-connect-l" <atlas-connect-l AT lists.bnl.gov>
Sent: Wednesday, April 9, 2014 4:57:12 PM
Subject: [Atlas-connect-l] Problem at Fresno

Harinder, 

We got a report from a user:

Another problem I am seeing is that some of the Fresno sites cause my jobs to fail because of low disk space. Could you let the admins there know about this? I expect each job to be ~500 MB, which includes the code and the output (generated MadGraph events), so I don't think I'm filling it up.
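
As a rough back-of-the-envelope check against the df numbers Harinder posted above: /tmp on pt3wrk8 is a 9.9 GB volume with only ~2.2 GB free once the ~7 GB cvmfs.log is in place, so at ~500 MB per job roughly four concurrent jobs (4 × 0.5 GB ≈ 2 GB) are enough to hit "No space left on device" even if no single job is oversized.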

Could you take a look and report back?

Thanks

---
Rob Gardner • Twitter: @rwg • Skype: rwg773 • g+: rob.rwg • +1 312-804-0859 • University of Chicago





--
Peter Onyisi   |
CERN: 4-R-028  | Department of Physics
UT: RLM 10.211 | University of Texas at Austin


_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l




--
Dr. Harinder Singh Bawa
California State University, Fresno



