phys-npps-mgmt-l - Re: [Phys-npps-mgmt-l] Rucio for ePIC

  • From: pinkenburg <pinkenburg AT bnl.gov>
  • To: phys-npps-mgmt-l AT lists.bnl.gov
  • Subject: Re: [Phys-npps-mgmt-l] Rucio for ePIC
  • Date: Fri, 28 Apr 2023 17:50:13 -0400

Actually, most of the job failures I see are from STAR nodes. The reason is always the same: xrootd completely hogs the node, and the internal disk becomes completely unresponsive (a simple md5 of a 4 GB file takes hours). It also tends to block the network, so jobs die because they cannot communicate. I think Chris runs some cgroup magic that made this less of a problem. dCache doesn't do that, and I have the same gripe with the central storage. But on xrootd I'm with SDCC, since it really impacts the compute nodes (and in a shared pool, everybody gets hit).
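(The "cgroup magic" mentioned above isn't spelled out anywhere in this thread. As a rough sketch of the idea only, a per-service cap on disk bandwidth and CPU could be expressed through the cgroup v2 interface as below; the group name, device numbers, limits, and PID are illustrative assumptions, not the actual SDCC configuration.)

    # Sketch only: cap an xrootd server's disk bandwidth, IOPS, and CPU with
    # cgroup v2 so it cannot starve batch jobs on a shared node. All values
    # (group name, device major:minor, limits, PID) are placeholders.
    # Requires root, a cgroup v2 host, and the "io"/"cpu" controllers enabled
    # in the parent group's cgroup.subtree_control.
    import os

    CGROUP = "/sys/fs/cgroup/xrootd"   # assumed group name
    DISK = "8:0"                       # assumed major:minor of the data disk

    os.makedirs(CGROUP, exist_ok=True)

    # Limit the data disk to ~200 MB/s reads, ~100 MB/s writes, bounded IOPS.
    with open(os.path.join(CGROUP, "io.max"), "w") as f:
        f.write(f"{DISK} rbps=209715200 wbps=104857600 riops=2000 wiops=1000\n")

    # Limit CPU to the equivalent of 4 cores (quota/period in microseconds).
    with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
        f.write("400000 100000\n")

    # Move the running xrootd process into the group (PID is a placeholder).
    with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
        f.write("12345")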

Chris


On 4/28/2023 5:44 PM, Van Buren, Gene via Phys-npps-mgmt-l wrote:
Right....whoever is pushing for the large servers is pushing to negate this
benefit of xrootd. I've left the fight to Jerome, which seems to make it an
internal SDCC conflict.

-Gene

On Apr 28, 2023, at 5:42 PM, Torre Wenaus <wenaus AT gmail.com> wrote:

Thanks Gene. I suppose this applies to any data access layer sitting over
large servers, but with xrootd you have other options.
Torre

On Fri, Apr 28, 2023 at 5:39 PM Van Buren, Gene <gene AT bnl.gov> wrote:
Hi, all

I'm unfamiliar with a lot of these arguments, but STAR makes a lot of use of
xrootd and I figured I'd share one of the xrootd practices of SDCC that STAR
isn't thrilled about (and Jerome is with me on this)....

STAR has had a mix of distributed computing nodes in xrootd plus a few larger
dedicated xrootd servers. The larger servers are probably easier to maintain,
but carry two main risks for us:
1) When a larger server node is having issues (e.g. network load or other
slowness), many users and their jobs suffer instead of a few.
2) When a larger server loses data (fails, node goes down), a lot of data is
lost instead of a little.

The SDCC is unfortunately pushing us towards the larger servers moving
forward.

-Gene

On Apr 28, 2023, at 12:20 PM, Torre Wenaus via Phys-npps-mgmt-l
<phys-npps-mgmt-l AT lists.bnl.gov> wrote:

Ah, I forgot that one on the list I'm assembling; thanks, Brett.

On Fri, Apr 28, 2023 at 12:19 PM Brett Viren <bv AT bnl.gov> wrote:
Thanks Kolja,

I wonder if the S3 solution also provides the "read-ahead" feature that
makes XrootD so effective at latency hiding.

-Brett.
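(To make "latency hiding" concrete: an XRootD client can keep the wide-area pipe busy by requesting the next block asynchronously while the current one is still being processed. Below is a minimal sketch of that idea using the XRootD Python bindings; the URL, block size, and process() are placeholders, and this illustrates the concept rather than the client's built-in read-ahead implementation.)

    # Illustration of latency hiding via manual read-ahead with the XRootD
    # Python bindings (pyxrootd): overlap the network round trip for the next
    # block with processing of the current one. URL, block size, and process()
    # are placeholders; this is not XRootD's internal read-ahead code.
    import threading
    from XRootD import client
    from XRootD.client.flags import OpenFlags

    URL = "root://xrootd.example.org//store/data/sample.dat"  # placeholder
    BLOCK = 8 * 1024 * 1024                                   # 8 MB per request

    def process(buf):
        pass  # placeholder for the real per-block work

    class AsyncRead:
        """Issue an asynchronous read; wait() blocks until the data arrives."""
        def __init__(self, f, offset, size):
            self._done = threading.Event()
            self._data = None
            f.read(offset, size, callback=self._cb)

        def _cb(self, status, data, hostlist):
            self._data = data
            self._done.set()

        def wait(self):
            self._done.wait()
            return self._data

    f = client.File()
    status, _ = f.open(URL, OpenFlags.READ)
    assert status.ok, status.message

    status, info = f.stat()
    size = info.size

    offset = 0
    pending = AsyncRead(f, offset, BLOCK)          # prefetch the first block
    while offset < size:
        data = pending.wait()                      # data for the current block
        nxt = offset + BLOCK
        if nxt < size:
            pending = AsyncRead(f, nxt, BLOCK)     # start fetching the next block
        process(data)                              # overlap work with transfer
        offset = nxt

    f.close()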

Kolja Kauder <kkauder AT gmail.com> writes:

I _think_ Jerome essentially was just intrigued by the new and shiny S3 at the beginning. Work on adding an XrootD layer did happen but stalled for reasons I'm not privy to and was put "on pause" in July '22. Importantly, the following is an inference, not a thing I've actually heard, but I strongly suspect that SDCC will point out that S3 has been working well for ECCE, ATHENA, and now ePIC, so why divert resources from that?

On Fri, Apr 28, 2023 at 8:44 AM Torre Wenaus via Phys-npps-mgmt-l
<phys-npps-mgmt-l AT lists.bnl.gov> wrote:

Thanks a lot Brett. Kolja is well versed in the (Jerome) arguments and
can comment.
Torre
On Fri, Apr 28, 2023 at 8:42 AM Brett Viren <bv AT bnl.gov> wrote:
Hi Torre,

Mostly on XRootD:

DUNE @ SDCC is small fry in comparison to EIC @ SDCC, both in resource needs and in that BNL is not the host, but XRootD and Rucio are needed in DUNE as well. Maybe this "synergy" is somehow useful in your arguments.

I don't know the backstory, but it seems really weird that XRootD is a sticking point. At least I fail to see any effort/resource argument against it. Here's why:

Back when LBNE (pre-DUNE) and Daya Bay had RACF nodes, Ofer gave us great service to set up and operate XRootD servers and clients. These groups were in the "free tier" of RACF service with only about a dozen nodes each. I think the effort to scale from that to what EIC needs would be rather less than linear in the number of nodes. We also had the extra complication that the XRootD storage nodes doubled as batch or interactive nodes; I expect keeping these roles separate would make for an even easier provision. And given that EIC is not in the "free tier", it seems historically inconsistent for the XRootD request to be denied.

It may be useful to know SDCC's arguments for refusing the request. Maybe they can be included in what you are assembling.

-Brett.
Torre Wenaus via Phys-npps-mgmt-l <phys-npps-mgmt-l AT lists.bnl.gov>
writes:
> Hi all,
> In addition to getting past the BNL embarrassment of not offering XRootD to EIC despite years of requests, I'd like to explore ways NPPS could help ePIC with a further and much more impactful change in their data management, adopting Rucio. JLab is working on this now. At BNL it sits (rightly) in the wings while we focus on sPHENIX. If people have thoughts on how we can help, including leveraging sPHENIX work without impeding it, I'd like to hear them.
> Torre
>
> --
> -- Torre Wenaus, BNL NPPS Group, ATLAS Experiment
> -- BNL 510A 1-222 | 631-681-7892


--
________________________
Kolja Kauder, Ph.D.
NPPS, EIC
Brookhaven National Lab, Upton, NY
+1 (631) 344-5935
he/him/his
________________________




--
*************************************************************

Christopher H. Pinkenburg ; pinkenburg AT bnl.gov
; http://www.phenix.bnl.gov/~pinkenbu

Brookhaven National Laboratory ; phone: (631) 344-5692
Physics Department Bldg 510 C ; fax: (631) 344-3253
Upton, NY 11973-5000

*************************************************************




