Datasets

This section summarizes how to search for datasets, subscribe them to another site, or download them to your desktop. Please see the General Concepts section first; it explains the terminology used here: dataset, container, replica, etc.

Resources for all sections:

  1. DQ2Clients How-to (recommended reading!)

Finding Data

Resources:

  1. Dataset Searching and Information (Emphasis on datasets in the Canadian cloud)
  2. Panda Monitor (see left frame for search, browse links)
  3. ATLAS Metadata Interface (AMI)
  4. DDM Catalog Browser (Look for datasets filtered by location)

If you know the names of the datasets, the easiest approach is to use the DQ2Clients tools (eg. dq2-ls, dq2-list-dataset-replicas-container) to look for replicas. If you do not know the names, the next easiest avenue is the Panda Monitor page, which searches all ATLAS datasets, including data that are known to be bad. AMI is more powerful but also a little more complicated to use. The ATLAS Metadata Interface is intended to catalog metadata and is very useful when you need more information on an individual dataset: it does not record where data are stored, but it does describe dataset make-up, quality, ownership, etc. (By default, it only displays information on good datasets.) Information on datasets in the CA cloud is also available, and you can see some examples of ways to search for datasets.
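For example, assuming you know at least part of a dataset name, a quick command-line search might look like the sketch below; the angle-bracket placeholders are illustrative, and you should check each command's --help for the options available in your DQ2Clients version.

List datasets matching a name pattern (wildcards are allowed):

dq2-ls <pattern>*

List the sites holding replicas of each dataset in a container (container names end with a "/"):

dq2-list-dataset-replicas-container <containername>/

Show the metadata (state, owner, size, etc.) of an individual dataset:

dq2-get-metadata <datasetname>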

Note that physics data are now distributed as containers (of datasets).

Downloading Data to Desktop

You can use the DQ2Clients tools (eg. dq2-get) to download datasets, or a subset of their files, to your local desktop disk. This is useful for testing your code prior to submitting jobs to the grid. Note that this should not be used to run full-scale analysis jobs for anything except DPDs. (ATLAS can throttle your bandwidth for data access if this is abused - eg. more than 10 GB/day.)
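As a sketch of what this might look like (the -n option for fetching only a few randomly chosen files is an assumption to verify with dq2-get --help on your installation):

Download a complete dataset to the current directory:

dq2-get <datasetname>

Download only a small test sample, eg. 2 files:

dq2-get -n 2 <datasetname>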

Subscribing Data to Other Sites

References:

  1. DDM Documentation

Sometimes you may want to subscribe a dataset so that it is replicated at another location where your job will run. (You should have a good reason to want to do this - analysis jobs go to run where the data live, not the other way around!)

Another situation arises when you have run your job on the grid and stored your output dataset at a remote site, but in a scratch area, and now want to store it in a more permanent place on the grid.

A third reason is that the datasets are on tape at a Tier-1 and so you will need to subscribe them to disk before running your jobs.
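One way to check where the replicas of a dataset currently live, and hence whether they are only on tape, is sketched below (the output format depends on your DQ2Clients version):

dq2-list-dataset-replicas <datasetname>

If the replicas are listed only at space tokens whose names end in TAPE (eg. <T1-site>_DATATAPE), you will need a subscription to disk before running jobs.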

This section will help you accomplish this task. Note that, in general, ATLAS will allow small amounts of data to be subscribed without much fuss; however, if the size is large or the data are on tape, please remember to enter your reasons on the subscription web page.

Note that:

  • Only closed or frozen datasets can be subscribed (dq2-get-metadata <name> has a state field which shows "frozen" or "closed" if it is).
  • Do not use DQ2 to subscribe; use the web interface instead. You will first need to register via the registration link on that page, and you will have to select a destination (see the sub-sections below).

The sub-sections below outline the process of dataset subscription:

How to Freeze / Close Own Dataset

You can only do this on a dataset which you created. The command to freeze a dataset is
dq2-freeze-dataset <name>
while the command to close a dataset is
dq2-close-dataset <name>
and you can see its state field change to "frozen" or "closed" by typing
dq2-get-metadata <name>
(Add --help to any DQ2 command to see all of its options.)

Using Containers

If you have a number of frozen datasets which are related, you may wish to perform some bulk action on them. In this case, it may be helpful to define a container and place the datasets inside it. A couple of example applications:

  • You wish to copy all related datasets to a certain site. One easy way to do this is to define a dataset container, put the datasets inside it, and then subscribe the container to the remote site. This saves many separate dataset subscriptions. Hint: you can keep updating the container with more datasets later. If you want them to also be subscribed automatically, make the subscription "periodic".
  • You wish to run a job over all of the related datasets. Put them in a container and simply use the container name as job input (eg. --inputDS=<containername>)

Here is how you define a container and add datasets to it:

Create the container:

dq2-register-container user10.<username>.<my-container-name>/

Add datasets to it:

dq2-register-datasets-container user10.<username>.<my-container-name>/  user09.<username>.<DATASET1>
dq2-register-datasets-container user10.<username>.<my-container-name>/  user09.<username>.<DATASET2>

Note that the dq2-register-datasets-container command can take a large number of new datasets on a single line. So, if you have 10 names to add, just put them one after another (separated by spaces) in the same command.
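For example, a hypothetical bulk registration followed by a check of the container contents might look like this (dq2-list-datasets-container lists the datasets a container holds; as always, verify the options with --help):

dq2-register-datasets-container user10.<username>.<my-container-name>/  user09.<username>.<DATASET1> user09.<username>.<DATASET2> user09.<username>.<DATASET3>

dq2-list-datasets-container user10.<username>.<my-container-name>/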

Destination

Data on the grid are stored in areas designated by "Space Tokens". For example, jobs submitted by Panda or Ganga by default write to <site>-LCG2_SCRATCHDISK which, as the name implies, is a scratch area (ie, data will be periodically deleted from this area by ATLAS.) Each site has a <site>-LCG2_LOCALGROUPDISK, controlled by ATLAS Canada, where deletions are not automatic and where you can store your output files for a longer period.

  • If you run at a Canadian site, you can configure your job to send the output directly to that site's <site>-LCG2_LOCALGROUPDISK instead of <site>-LCG2_SCRATCHDISK (see the sketch after this list).
  • If you are subscribing data from outside the Canadian cloud, you can subscribe the data to <site>-LCG2_LOCALGROUPDISK on the web form (next section).
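For the first bullet above, a pathena submission that writes its output directly to the local group disk might look like the sketch below. The --destSE option name and the exact space-token name are assumptions to verify against pathena --help and your site's configuration; the dataset and job-option names are placeholders.

pathena --inDS <inputdataset> --outDS user10.<username>.<my-output-dataset> --destSE <site>-LCG2_LOCALGROUPDISK <my-job-options>.py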

Subscribing

Resources:

  1. Subscription page

Monitoring

Resources:

  1. Requests status page

Result

Resources:

  1. Savannah bug report page

Suppose you have subscribed a frozen dataset from one site to another. You will receive an email confirming your subscription. At this stage the request still needs to be approved; once that is done, ATLAS will replicate the data to the destination you specified. Once the transfer is successful, you will receive another email confirming that it is done.

If the subscription fails, you will also receive an email. To pursue the reason for the failure,

  • Please respond to the person listed in the email as the contact.
  • If that does not produce a response after 4 days, please file a Savannah bug report.

Deleting Own Datasets

From DQ2Clients 0.1.32 onwards, users are able to delete their own datasets; for more information, please see this link.
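A minimal sketch, assuming the dq2-delete-datasets command provided by these client versions (check its --help before use; deletion is irreversible and the dataset name below is a placeholder):

dq2-delete-datasets user09.<username>.<mydataset>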

-- AsokaDeSilva - 09 Mar 2009
