
atlas-connect-l - Re: [Atlas-connect-l] Failed jobs due to xrootd issues?

atlas-connect-l AT lists.bnl.gov

Subject: Atlas-connect-l mailing list

  • From: Ilija Vukotic <ivukotic AT uchicago.edu>
  • To: Christopher Meyer <chris.meyer AT cern.ch>
  • Cc: "atlas-connect-l AT lists.bnl.gov" <atlas-connect-l AT lists.bnl.gov>
  • Subject: Re: [Atlas-connect-l] Failed jobs due to xrootd issues?
  • Date: Fri, 12 Feb 2016 18:10:16 +0000

Hi,

There is a networking issue at MWT2, so it will affect xrdcp too. We are trying to understand it, but progress is slow because Lincoln is on vacation.
If you use DAGMan, it should resend the job. We are still not sure whether there is a pattern in which nodes cannot access the data, and whether those nodes fail all the time or the problem “moves”.

Ilija




On Feb 12, 2016, at 12:00 , Christopher Meyer <chris.meyer AT cern.ch> wrote:

Hello again,

To work around this issue, is there a better way for me to submit jobs? For example, xrdcp the file locally, and then run? Or is it possible to send a return code to condor requesting that the job be retried on a different node?
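
[Editorial sketch, not part of the original message.] One way the "xrdcp the file locally, and then run" idea could look is a small job wrapper that retries the copy a few times before giving up. The helper names `fetch_with_retries` and `xrdcp_once`, the attempt count, and the delay are all hypothetical choices for illustration; only the `xrdcp` command itself comes from the thread, and `-f` is used here on the assumption that overwriting a partial file from a failed attempt is wanted.

```python
import subprocess
import time


def fetch_with_retries(copy_once, attempts=3, delay_s=30.0):
    """Call copy_once() until it returns exit code 0 or attempts run out.

    copy_once is any zero-argument callable returning an integer exit code,
    which keeps the retry logic independent of xrdcp itself.
    """
    for attempt in range(1, attempts + 1):
        if copy_once() == 0:
            return True
        if attempt < attempts:
            # Wait before the next try in case the network problem is transient.
            time.sleep(delay_s)
    return False


def xrdcp_once(src, dst):
    """Run a single xrdcp copy and return its exit code (hypothetical helper)."""
    try:
        # -f overwrites a partial file left behind by an earlier failed attempt.
        return subprocess.run(["xrdcp", "-f", src, dst]).returncode
    except FileNotFoundError:
        # xrdcp is not installed on this node; 127 mirrors the shell convention.
        return 127
```

A job script might call `fetch_with_retries(lambda: xrdcp_once(src, "input.root"))` with its own `root://` path as `src`, and exit with a non-zero status if the result is `False`, so that whatever retry mechanism the submission uses sees the job as failed. Whether a local copy actually avoids the failures depends on the underlying network problem.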

Thanks,
Chris


On Thu, Feb 11, 2016, 10:27 Christopher Meyer <chris.meyer AT cern.ch> wrote:
Hi Ilija,

Thanks for taking a look. I submitted another set of jobs last night, and this morning it looks like ~15% failed with the same issues as above.

Cheers,
Chris

On Wed, Feb 10, 2016 at 9:50 PM Ilija Vukotic <ivukotic AT uchicago.edu> wrote:
Hi Chris,

Now that you have excluded the possibility that the files are actually corrupted, it must be a networking issue. We will test again from these hosts to check whether this was a transient issue.

Thanks for reporting,
Ilija

On Feb 10, 2016, at 14:06 , Christopher Meyer <chris.meyer AT cern.ch> wrote:

Dear Experts,

A few days ago I submitted a large number of jobs using condor through ATLAS connect. However, a number of them failed with errors like those in the log file I've attached. I've run on some of these files directly from login.usatlas.org (using the root:// path) and everything works fine.

There was only one of the first type, which failed on:

The second type failed on these nodes:

Does anyone have an idea what might be going wrong? Or if there's something I can do to protect against this?

Thanks!
Chris
<crash_type1.txt><crash_type2.txt>
_______________________________________________
Atlas-connect-l mailing list
Atlas-connect-l AT lists.bnl.gov
https://lists.bnl.gov/mailman/listinfo/atlas-connect-l

_______________________________________________
ATLAS Midwest Tier2 mailing list
http://mwt2.usatlasfacility.org



