Tier-1 and Tier-2 Weekly Report 2019-01-29

General News

  • ADC Technical Coordination Meeting [20181029]
    • HTTP and xrootd third-party copy (TPC) experience - both x509- and token-based (see the sketch after this list)
    • CERN tape archive tests
  • ADC Weekly Meeting [20190129]
    • Group Derivation Production
      • New cache built last week
      • Have list of missing DAODs based on AMI records - working on recovery
    • MC Production
      • Stable week
      • Currently running ~100k derivation, ~100k simulation, and ~60k reconstruction jobs, with very little event generation
      • Another ~30k jobs from BOINC/HPC
      • Stuck AODmerge jobs being discussed
      • Many older requests are getting done, mostly tasks less than 90 days old
      • Simulation backlog down to 890M events (210M after removing the paused AF2 top-mass requests)
      • More requests are in the pipeline (30 already approved)
    • Data17 Reprocessing
      • Heavy-ion (HI) 2015-16 data, release 21
        • physics_UPC was the highest priority and is now finished
        • physics_HardProbes is now running (though many jobs are still queued)
        • physics_MinBias and physics_MinBiasOverlay have been submitted with lower priority
      • Reprocessing of the full data12 physics_Bphysics stream has started
      • Just started on the data18_hi calibration_CCPEB and calibration_PCPEB streams from T0; assessing T0 vs. grid performance
      • Numerous other tasks underway or in the pipeline
    • CRC/ADCoS Report
      • Using >20k slots at T0 most of the time
      • Some ptag issues on DPD production tasks causing 100% failures
      • Looking at ways to speed up analysis jobs
        • Allow data to be moved from a busy site to a less occupied one
          • Tried increasing the fraction of analysis jobs allowed to trigger data movement from 20% to 30%
          • The analysis transfer rate increased from 3 GB/s to 8 GB/s, which impacted production
          • A brokering issue was found and corrected; the allowed fraction was decreased back to 20% and the rate stabilised at 5 GB/s
        • Some work is still needed to find a good equilibrium
      • In the last few weeks, up to 10 SCRATCHDISK areas have been filling up every day
        • Deletion campaigns are run regularly to clean up old files
        • Sites are requested to adjust their scratch disk size: not too large (to avoid wasting resources) and not too small (so analysis jobs can still run) (see https://twiki.cern.ch/twiki/bin/view/AtlasComputing/StorageSetUp#The_ATLASSCRATCHDISK )
        • Deletion watermark modified: from 50% to 25% below the size of the storage, with a maximum of 30 TB for big sites (a sketch of this rule follows after this list)
        • Cleaning is also done when scratch disks fill up with large DAODs placed there directly by users (user transfers) rather than produced by analysis jobs
      • Using the new Grafana monitoring, but still debugging and finding discrepancies
    • Optimize Resource Usage and Operation of lightweight Grid sites
      • Idea to redirect funds from storage infrastructure to CPU for sites with <0.5 PB of storage, so they can support 200-1000 cores
      • Test low-I/O jobs to minimize stress on the remote SE and network
      • Such sites can get away with just a SCRATCHDISK
      • A new switcher3 is needed
      • Beyond 2019, cache sites may be considered, allowing more network load as applicable
  • CA Cloud Squad Actions
  • Open Issues and Ongoing Discussions
  • WLCG and General Computing News
    • Security News
    • gLite/EMI News
  • ROC_Canada News
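
A minimal sketch to go with the HTTP/xrootd third-party copy (TPC) item above: it uses the gfal2 Python bindings to request a single copy directly between two storage endpoints. This assumes gfal2-python is installed and that x509 proxy or token credentials are already set up in the environment; the endpoint URLs and paths are placeholders, not real ATLAS sites, and this is not the official test procedure used in the ADC exercises.

    import gfal2

    # Placeholder source and destination (assumed endpoints, not real sites).
    SRC = "root://source-se.example.org:1094//atlas/scratchdisk/user/test.DAOD.root"
    DST = "davs://dest-se.example.org:443//atlas/scratchdisk/user/test.DAOD.root"

    ctx = gfal2.creat_context()          # gfal2 context (note the spelling 'creat')
    params = ctx.transfer_parameters()   # per-transfer options
    params.overwrite = True              # replace an existing destination file
    params.checksum_check = True         # verify checksums after the copy
    params.timeout = 3600                # seconds allowed for the transfer

    # filecopy() delegates to a third-party copy between the two remote
    # endpoints when both protocols support it (HTTP/davs or xrootd TPC),
    # using whatever x509/token credentials are found in the environment.
    ctx.filecopy(params, SRC, DST)
    print("copy finished:", DST)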

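The SCRATCHDISK watermark change above amounts to a simple sizing rule. The short sketch below only illustrates that rule as stated (keep 25% of the storage size free, capped at 30 TB for big sites); the helper name and the example endpoint sizes are hypothetical and not part of any ATLAS tool.

    # Illustrative sketch of the modified SCRATCHDISK deletion watermark:
    # keep 25% of the storage size free, but never more than 30 TB.
    TB = 1e12  # work in bytes, decimal terabytes

    def scratchdisk_min_free(total_size_bytes: float) -> float:
        """Free space the cleanup tries to maintain on a SCRATCHDISK."""
        return min(0.25 * total_size_bytes, 30 * TB)

    # Hypothetical endpoints, just to show the small-site vs. big-site cap.
    for name, size_tb in [("SMALL-SITE_SCRATCHDISK", 40),
                          ("BIG-SITE_SCRATCHDISK", 400)]:
        target = scratchdisk_min_free(size_tb * TB)
        print(f"{name}: keep {target / TB:.0f} TB free out of {size_tb} TB")
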
ABCD Determination and Monitoring

SSB Results

SAM Results (based on VO tests)

APEL Accounting

APEL Synchronization Test Links

Production and Usage Status

-- Leslie Groer - 2019-01-29

Comments
