Distributed Analysis Challenge

Resources

Contacts

(To reach any of the contacts below, please email atlas-canada-tier2-ops@cern.ch.)

  • Cloud: Asoka De Silva
  • Sites
    • Alberta: Bryan Caron
    • SFU: Sergey Chelsky
    • Toronto: Leslie Groer
    • TRIUMF: Reda Tafirout
    • Victoria: Ashok Agarwal

CERN (Results, Details, etc)

Ganglia plots for CA cloud

Challenges

  • For November/December 2008: the tests submit 1 job per dataset matching the dataset pattern described (and listed in order of preference) in the links below. They run on release 14.2.20. Output is expected to be ~ 1.1 GB per site for a full test. These are Ganga (WMS) tests only, not Panda.
  • Note that old tests are moved to the archive, so you will need the test numbers below to look for them.

Short Tests

Input dataset details are in the links. Pie chart color code: blue=ok, green=running, red=fail. Note that the page is not updated after the "End Time", so any jobs not run will remain in the submitted state.

2008

2009

  • 13 Jan 09 (5 jobs/site to many clouds) (test 105) This is a FileStager test.
  • 30 Jan 09 (5 jobs/site to many clouds) (test 127)
  • 3 Apr 09 (20 jobs/site except TO) (test 213)
  • 3 Apr 09 (20 jobs/site except TO) (test 215) This is a FileStager test.

Full Scale Tests

Dec 2008

  • 03 Dec 08 (test 57) (completed)
  • 09 Dec 08 (test 68) (completed)
    • Jobs submitted: TRIUMF: 500, Alberta: 151, SFU: 200, Toronto: 126, Victoria: 85.
    • Issues:
      • SFU: Same issue as in previous test. Still under investigation.
      • TRIUMF: About 300 jobs hit one pool node simultaneously when they were submitted, and there were failures, even after retries, to access input files. The reason is being investigated.
Feb 2009

  • 10 Feb 09 (test 132) (completed)
    • Jobs (normal access, not FileStager) submitted: TRIUMF and SFU: 200 each, Alberta: 182, Toronto: 157, Victoria: 141. (Probably some of the datasets that we thought were there were not closed.)
    • Issues:
      • SFU: all jobs failed; the BDII published "GlueCEPolicyMaxCPUTime: 0" whereas the stress test requires a minimum of 1440 s. Fixed Feb 12, 2009. (A toy sketch of this matchmaking check follows below, after the February tests.)
      • Alberta: The reason for the 53% failure rate is not understood. Jobs were aborted with the following reason:
        Logged Reason(s):
        - Got a job held event, reason: Unspecified gridmanager error
        - Job got an error while in the CondorG queue
        The error is explained on this GOC site. The problem was probably a bad node, which has been tracked down and removed from the list of WNs.
      • TRIUMF: There was a false-positive temperature alarm which forced WNs to shut down. Apparently the stress-test jobs were not affected, despite what the following plot shows. (Note: jobs were not resubmitted by the WMS or HammerCloud, so what caused the blue bars to be restored? Could be a stale qstat result, which is used for the plots.)
        TRIUMF PBS Queue

  • 13 Feb 09 (test 140) (in progress)
    • Jobs (normal access, not FileStager) submitted to Alberta (182 jobs) and SFU (200 jobs).
    • Issues:
      • 160 jobs failed at SFU on the following nodes:
        s16 - 60 jobs
        s17 - 24 jobs
        s21 - 32 jobs
        s30 - 43 jobs
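
The SFU failure in test 132 above came down to requirement matchmaking: the WMS will not send a job to a CE whose published GlueCEPolicyMaxCPUTime is below the job's requested CPU time, so a site publishing 0 gets every job rejected. Below is a minimal, purely illustrative Python sketch of that check; the CE names and published values are made up, and the real matching is of course done by the WMS against the BDII, not by a script like this.

```python
# Toy illustration of the matchmaking failure seen at SFU in test 132:
# a CE publishing GlueCEPolicyMaxCPUTime below the job requirement is rejected.
# CE names and published values below are hypothetical, not real BDII contents.

REQUIRED_MAX_CPU_TIME = 1440  # minimum quoted above for the stress-test jobs

published = {
    "ce.triumf.example:2119/jobmanager-lcgpbs-atlas": 2880,
    "ce.sfu.example:2119/jobmanager-lcgpbs-atlas": 0,       # what SFU was publishing
    "ce.ualberta.example:2119/jobmanager-lcgpbs-atlas": 4320,
}

for ce, max_cpu in sorted(published.items()):
    if max_cpu >= REQUIRED_MAX_CPU_TIME:
        print(f"{ce}: matched (publishes {max_cpu})")
    else:
        print(f"{ce}: rejected (publishes {max_cpu} < required {REQUIRED_MAX_CPU_TIME})")
```
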
Mar 2009

  • 3 Mar 09 (test 171) (completed)
    • 200 jobs (TRIUMF, SFU), 178 jobs (Alberta), 162 jobs (Toronto) 142 jobs (Victoria) with FileStager.
    • All jobs to SFU failed. The reason is still under investigation (although SFU was on the Ganga blacklist; also, dq2 0.1.26 was missing at the site).
    • Toronto is experiencing SE issues.
    • Alberta had unscheduled downtime at 04-03-2009 03:19 UTC; jobs did not fail, presumably because they were still in the queue waiting to run.
    • The HammerCloud web site does not seem to be updating properly; waiting to hear from the experts.
  • 5 Mar 09 (test 177) (completed)
    • 200 jobs to SFU only (the site is no longer on the blacklist and DQ2 0.1.26 is now installed).
April 2009
  • 7 Apr 09 (test 214) (completed)
    • FileStager jobs per site, except for Toronto: TRIUMF and SFU: 200 each, Alberta: 189, Victoria: 148.
    • Slow processing at SFU: the SE was overwhelmed; a move to a new SE is planned soon.
    • O(1 job) was slow at UVic with its output; this was also noticed in the last test (171).
  • 9 Apr 09 (test 221) (completed)
    • Normal jobs (no FileStager) per site, except for Toronto: TRIUMF and SFU: 200 each, Alberta: 189, Victoria: 148.
    • A cooling issue at TRIUMF forced the shutdown of some WNs, but this mostly seems to have happened after the HammerCloud tests finished running. (2 jobs were still in the running state after the window closed.)
    • Some jobs are still in the running state at SFU (60 running + 6 submitted), possibly because of a fairshare issue (now competing with WestGrid users for resources).
    • Some jobs are still in the running state at Alberta (41 submitted, 1 running). This could be a fairshare issue, competing with production jobs (the number of job slots for production was increased).
Comparisons of tests with FileStager (FS) (214) and without (221): see HC_214vs221.pdf
  • CPU/Wallclock: SFU and Alberta show higher efficiency with FS, while Victoria is better without FS. (TRIUMF also seems to skew towards higher efficiency without FS.)
  • Similar behavior is seen for events/sec: SFU and Alberta are better with FS; Victoria and TRIUMF (marginally?) are better without FS.
  • Preparing inputs: an improvement with FS is seen in these plots; SFU is striking. (A sketch of how these per-job metrics are computed follows below.)
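
For reference, the quantities compared above are simple per-job ratios. The sketch below shows how they could be computed from per-job timing records; the field names (cpu_time, wall_time, athena_wall_time, n_events) and the sample numbers are assumptions for illustration, not the actual HammerCloud schema or data.

```python
# Illustrative sketch of the per-job metrics compared above.
# Field names and sample numbers are assumptions, not the HammerCloud schema.

def cpu_wall_efficiency(cpu_time, wall_time):
    """CPU/Wallclock: fraction of the wall-clock time actually spent on CPU."""
    return cpu_time / wall_time if wall_time > 0 else 0.0

def event_rate(n_events, wall_time):
    """Events/sec over the whole job wall-clock time."""
    return n_events / wall_time if wall_time > 0 else 0.0

def events_per_athena_sec(n_events, athena_wall_time):
    """Events/Athena: rate using only the Athena wall-clock time (see the Aug 2009 notes)."""
    return n_events / athena_wall_time if athena_wall_time > 0 else 0.0

def mean(values):
    return sum(values) / len(values)

# Hypothetical per-job records for one site, with and without FileStager.
jobs_fs = [{"cpu_time": 4100, "wall_time": 5000, "n_events": 5000},
           {"cpu_time": 3900, "wall_time": 5200, "n_events": 5000}]
jobs_nofs = [{"cpu_time": 2500, "wall_time": 7800, "n_events": 5000},
             {"cpu_time": 2600, "wall_time": 8100, "n_events": 5000}]

for label, jobs in (("FS", jobs_fs), ("no FS", jobs_nofs)):
    eff = mean([cpu_wall_efficiency(j["cpu_time"], j["wall_time"]) for j in jobs])
    rate = mean([event_rate(j["n_events"], j["wall_time"]) for j in jobs])
    print(f"{label}: mean CPU/Wall = {eff:.2f}, mean events/sec = {rate:.1f}")
```
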
May 2009

  • 5 May 09 (test 277) (completed)
    • 200 (TRIUMF, SFU), 190 (Alberta), 173 (Toronto) and 148 (Victoria) jobs (no filestager).
    • Note that TRIUMF is still ramping up after some blades were turned off for cooling reasons (~40% of capacity at 10:00 PST, with more coming up); TRIUMF was back at 100% by ~11:00 PST.
    • SFU: (from Sergey) 10 hammercloud test jobs have failed at SFU recently because of the problem with our file storage (gridstore).
      NFS server crashed on gridstore which made the Atlas installation area not available on all WNs. 10 jobs were running at this time and they failed.
    • Toronto: The passwd/shadow/group files were incorrectly added to the head/compute nodes, so new virtual grid accounts were not being accepted by PBS. This has now been fixed, but it is outside the test 277 window so the jobs will not run.
    • Results: please compare with last month (test 221) and note the changes.
August 2009

(No tests in June/July because of Step09.)

  • 100 Panda and 100 Ganga jobs to each site:
  • Toronto: experiencing SE issues. Site admin: "I restarted the SRM daemons and I think transfers should be occurring again. I did not turn up a specific error condition or cause yet." at 12:00 PST (19 Aug).
  • SFU: datasets which are registered but are not in pool are being removed by Stephane.
  • Victoria: I reported SE issues from siteTests and past 24 h of jobs. Site admin: "There are too many jobs with lcg-cp and saturating the network. I have reduced the limit." at 13:42 PST (19 Aug). Clarification: "single analysis user for atlaspt4 was misconfigured to take maximum 100 jobs. Drew has corrected to maximum 40."
  • Alberta: Site admin: "I restarted the problematic pools and the dcap services look fine now" at around 13:36 PST (19 Aug) to resolve dcap issue flushed out by siteTest job.
  • Test completed.
    • It looks like more than 100 jobs per site ran for each test, so there are jobs in the submitted state which may skew the efficiency plots. Note also that Panda jobs were split.
    • CPU/Walltime looks better for Panda than for Ganga at TRIUMF. (Keep in mind that these tests ran at the same time.)
    • New plot: Events/Athena (rate using Athena wall clock time) indicates Panda is more efficient in processing events at all sites (see below).
    • Alberta and Victoria seem to have long times for preparing inputs in Panda, but these were probably related to the SE issues listed above. (This is not seen in Ganga, but Ganga uses dcap, unlike Panda, which does a copy.)
    • The Athena running-time comparisons are quite different for Panda vs Ganga, hence the better Events/Athena rate. The Panda jobs copy the AOD files over, whereas the Ganga ones access them with dcap (checked for TRIUMF's HC test).
    • Output storage time is much longer for Ganga than for Panda ... There is an error (checked Alberta and TRIUMF jobs): a permission problem ("ERROR during execution of lfc-mkdir -p /grid/atlas/user09/...") caused the Ganga jobs to try all CA sites before finally using CERN to store the output. I believe this is being fixed by Di this morning. (A sketch of how to check this by hand follows below.)
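
As a follow-up to the lfc-mkdir permission problem above, a quick way to verify the fix is to try creating a test directory in the user area on the LFC by hand. The sketch below wraps the same lfc-mkdir command from Python; the LFC host and test path are placeholders (not the real values from the failing jobs), and it assumes a grid environment with lfc-mkdir available and a valid proxy.

```python
# Sketch: manually check whether a directory can be created on the LFC,
# mimicking the "lfc-mkdir -p ..." step that failed in the Ganga jobs.
# The LFC host and test path below are placeholders, not the real values.
import os
import subprocess

env = dict(os.environ, LFC_HOST="lfc.example.org")       # placeholder LFC host
test_path = "/grid/atlas/users/someuser/hc_perm_test"    # hypothetical test directory

result = subprocess.run(
    ["lfc-mkdir", "-p", test_path],
    env=env,
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    print(f"OK: created {test_path}")
else:
    print(f"lfc-mkdir failed (permission problem?): {result.stderr.strip()}")
```
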
October 2009
Some tests to understand which file access method is best. The results will eventually be tabulated on this HC Cloud Data Access page.
  • 15 Oct Panda tests 687
  • 16 Oct Ganga tests (FileStager) 697
  • 19 Oct Ganga Tests (DQ2_LOCAL) 703
  • General observations: (keep in mind that Panda does a dccp copy of the AODs; Ganga FS stages the next input file while the first is being processed, whereas Ganga Local uses dcap access on our sites; a minimal sketch of the FileStager overlap idea follows after this list). I am also not focusing here on why jobs failed, except for brief remarks, since failures have been reported daily to the sites. Click on the PDF files below for side-by-side comparisons of the tests.
    • TRIUMF: TRIUMF.pdf
      • Ganga FS vs Local: this one is striking; FileStager has much better performance than dcap access - the CPU/Wall mean was 83 vs 31, and the event rate was also better (mean 20 vs 7). (Note: dcap access from test 703 yesterday hit some pool nodes hard at TRIUMF.)
      • Panda vs Ganga (FS): Panda is not too far behind.
    • Alberta: Alberta.pdf
      • Hard to say whether Ganga FS or Local is better here. Local (dcap) seems to have less spread.
      • Panda vs Ganga: CPU/Wall and event rate look slightly better with Panda.
      • Ganga seems to take longer to store output than Panda - why?
    • SFU: SFU.pdf
      • 24% of jobs failed for Ganga (FS) test 697.
      • Comparing Ganga FS (697) and Local (703), local access (dcap) looks better in terms of both less spread and events processed. For FS, output times were long, but that could be related to site issues?
      • Comparing Ganga vs Panda, Panda looks better than Ganga Local (dccp vs dcap) in event rates.
      • Ganga also takes slightly longer to store output than Panda; we should understand why we see this at the Tier-2s.
    • Toronto: Toronto.pdf
      • 25% of Panda, 93% of Ganga FileStager, and 9% of Ganga Local jobs failed.
      • The FileStager results and the Ganga-Ganga comparisons are being ignored - too few statistics (only 19 FS jobs).
        • They suggest dcap access is perhaps better, but I recommend we redo this test if we want to rely on the results.
      • Ganga (Local) vs Panda: Panda has a better event rate (4.9 vs 2.4).
      • Something looks odd about the way the means are calculated for CPU/Wallclock. A bug in HC?
    • Victoria: Victoria.pdf
      • 14% of Panda jobs and 18% of Ganga FS jobs failed.
      • Ganga FS versus Local: FS is better in CPU/Wall (57 vs 40) and event rate (7.7 vs 4.6).
      • Panda vs Ganga (FS): hard to say which is doing better.
      • Very long output storage times were seen with Panda (SE issues reported to the site).
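
As promised in the general observations above, here is a minimal sketch of the overlap idea behind FileStager: while one input file is being processed, the next one is already being copied to local scratch in the background. This is not the Ganga FileStager code; copy_to_scratch and process_file are stand-ins for the real transfer (dccp/lcg-cp) and the Athena job, and the input file names are made up.

```python
# Minimal sketch of the FileStager overlap idea (not the actual Ganga implementation).
# copy_to_scratch and process_file are stand-ins for the real transfer and Athena.
from concurrent.futures import ThreadPoolExecutor
import time

def copy_to_scratch(remote_path):
    """Stand-in for dccp/lcg-cp: copy one input file to local scratch."""
    local_path = "/tmp/" + remote_path.split("/")[-1]
    time.sleep(1)                      # pretend the transfer takes a while
    return local_path

def process_file(local_path):
    """Stand-in for the Athena job reading one staged input file."""
    time.sleep(2)

input_files = ["srm://site/atlas/aod.0001.pool.root",
               "srm://site/atlas/aod.0002.pool.root",
               "srm://site/atlas/aod.0003.pool.root"]

with ThreadPoolExecutor(max_workers=1) as stager:
    next_copy = stager.submit(copy_to_scratch, input_files[0])
    for i, _ in enumerate(input_files):
        local = next_copy.result()                 # wait for the current file to arrive
        if i + 1 < len(input_files):
            next_copy = stager.submit(copy_to_scratch, input_files[i + 1])  # prefetch next
        process_file(local)                        # process while the next file copies
```

With dcap (DQ2_LOCAL) the job instead reads directly from the pool nodes over the network, which is presumably why a slow or overloaded pool node shows up immediately in the CPU/Wall numbers, as seen in test 703 at TRIUMF.
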

Summary for the DA team

-- AsokaDeSilva - 28 Nov 2008

Topic attachments
Attachment           Size     Date              Who           Comment
Alberta.pdf          241.1 K  2009-10-20 22:01  AsokaDeSilva
HC_214vs221.pdf      271.1 K  2009-04-14 17:53  AsokaDeSilva
SFU.pdf              242.5 K  2009-10-20 22:01  AsokaDeSilva
TRIUMF.pdf           246.5 K  2009-10-20 22:02  AsokaDeSilva
Toronto.pdf          243.3 K  2009-10-20 22:02  AsokaDeSilva
Victoria.pdf         245.5 K  2009-10-20 22:02  AsokaDeSilva
ce1-day.php.png.txt  36.0 K   2009-02-12 17:53  AsokaDeSilva  TRIUMF PBS Queue