Data Management

Introduction

The final topic is Data Management. Many of the programs run on the grid today analyse large volumes of data, so the EGEE middleware includes a number of services dedicated to handling data. The key issue to remember is that moving large volumes of data is slow and can be financially expensive. The solution is to keep copies of the data at key locations on the grid, and then to choose the CE that matches the requirements of the job and is closest (in grid terms) to the data.

Clearly this requires a specialised way of handling the data: you cannot be expected to know where each copy of the data is, or how to choose the closest appropriate CE. The middleware therefore addresses a file in several different ways. Each piece of data is allocated a Globally Unique Identifier (GUID), which is used to track the data. Several Storage Elements (SEs) on the grid may each keep a copy of the data, and each copy is accessed using a Storage URL (SURL), which contains the name of the SE and the path to the data on that SE. Neither a GUID nor a SURL is very user friendly, so a third form of address is used, the Logical File Name (LFN). The LFN can be whatever name makes sense for the data, provided it is unique; it is comparable to an alias for the data. This can be summarized by the following diagram:

The actual middleware components involved are the Replica Metadata Catalog (RMC), which maps the LFNs to the GUID, and the Replica Location Service (RLS), which maps the GUID to the SURLs.

For the purposes of this tutorial, a file has already been registered with the above services under the name "TutorialLFN.mpg". For example, the LFN "lfn:TutorialLFN.mpg" could have a GUID of "guid:efc34e09-1239-11d9-80e8-e10f1b1a805c" and a SURL of "sfn://grid009.ct.infn.it/flatfiles/SE00/gilda/generated/2004-09-29/fileec064608-1239-11d9-80e8-e10f1b1a805c".
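To make the two-step addressing concrete, the following is a toy model of the lookups, using only the example values quoted above. The function names are hypothetical stand-ins: on the grid the first lookup is the job of the RMC and the second that of the RLS, both of which are remote services rather than shell functions.

```shell
#!/bin/sh
# Illustrative only: a toy model of the two catalogue lookups.
# The function names are hypothetical; the values are the tutorial's examples.

# First lookup (the RMC's role): LFN -> GUID
lfn_to_guid() {
    case "$1" in
        lfn:TutorialLFN.mpg)
            echo "guid:efc34e09-1239-11d9-80e8-e10f1b1a805c" ;;
        *)
            return 1 ;;   # unknown LFN
    esac
}

# Second lookup (the RLS's role): GUID -> SURLs, one per line
guid_to_surls() {
    case "$1" in
        guid:efc34e09-1239-11d9-80e8-e10f1b1a805c)
            echo "sfn://grid009.ct.infn.it/flatfiles/SE00/gilda/generated/2004-09-29/fileec064608-1239-11d9-80e8-e10f1b1a805c" ;;
        *)
            return 1 ;;   # unknown GUID
    esac
}

# Resolve an LFN to its replicas by chaining the two lookups.
resolve_lfn() {
    guid=$(lfn_to_guid "$1") || return 1
    guid_to_surls "$guid"
}

resolve_lfn "lfn:TutorialLFN.mpg"
```

The point of the indirection is that the user only ever handles the LFN; replicas can be added or removed on SEs without the name changing.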

Exercise 5

Modify your previous jdl file to include the parameter InputData with a value of lfn:TutorialLFN.mpg and the parameter DataAccessProtocol with the values file, gridftp and rfio. Run the job and again observe where the Resource Broker (RB) is able to match the job's requirements. These parameters specify that the job needs to run on a CE that is close to the specified LFN. The access protocols listed are the means by which you wish to access the data. When you submit your job, the RB matches those CEs that advertise (via the Information Service) support for at least one of these protocols. You may wish to experiment to find out whether all of these protocols are actually supported on gilda.
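As a rough sketch of what the modification might look like, the snippet below appends the two attributes to a JDL file. The file name exercise5.jdl is hypothetical, and the list notation with braces is an assumption based on the usual JDL syntax for multi-valued attributes; it also assumes the JDL file is not wrapped in square brackets. Check the result against your own file from the previous exercise.

```shell
#!/bin/sh
# Hypothetical file name: substitute the JDL file from the previous exercise.
JDL_FILE="exercise5.jdl"

# Append the two data-related attributes described in the exercise.
# The {"...", "..."} list notation is assumed, not taken from the tutorial.
cat >> "$JDL_FILE" <<'EOF'
InputData = {"lfn:TutorialLFN.mpg"};
DataAccessProtocol = {"file", "gridftp", "rfio"};
EOF
```

Removing one protocol at a time from the DataAccessProtocol list and resubmitting is one way to test which protocols the CEs actually advertise.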

Whilst the previous example demonstrated how to ensure the job runs 'near' the data, more sophisticated data handling tasks are commonly needed. The easiest way to perform them is to use the edg-rm command from within a shell script. The edg-rm command is an alias for edg-replica-manager and is used to talk to the replica manager (edg is the European DataGrid project, one of the ancestors of EGEE). Within your account there is a simple example of a shell script, "runshell.sh", which calls the edg-rm command. An associated jdl file is also included ("runshell.jdl"). These may be used as the basis for the following exercise. If you run this example as it stands, you will find that it prints a list of all the known information about the gilda VO.

Exercise 6

Modify "runshell.sh" and "runshell.jdl" so that you can see the contents of TutorialLFN.mpg. To copy the file you should use the command

edg-rm --vo gilda copyFile <LFNNAME> <LOCALNAME>

where <LFNNAME> and <LOCALNAME> should be replaced appropriately. Make sure <LOCALNAME> contains the full path to the file (e.g. "file:///home/genius01/myfile.mpg"). The immediate question is then what the full path to the file is. When the job runs on the WN it does not run under your username; instead it runs under a generic username for your VO. This means that when you write your shell script you cannot know which local directory you will be able to write to. The one thing that is known is that the shell script will be run from a directory you can write to, so use the environment variable $PWD in your path.
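The path construction described above can be sketched as follows. The local file name myfile.mpg is a hypothetical choice, and the check for edg-rm is an addition for illustration so that the script can be tried off the grid; it is not part of runshell.sh.

```shell
#!/bin/sh
# Sketch of a job script for this exercise.
# LOCAL_NAME is a hypothetical file name chosen for illustration.
LFN="lfn:TutorialLFN.mpg"
LOCAL_NAME="myfile.mpg"

# $PWD is the directory the script is run from, which is writable on the WN.
# It begins with "/", so DEST has the form file:///path/to/myfile.mpg.
DEST="file://$PWD/$LOCAL_NAME"

if command -v edg-rm >/dev/null 2>&1; then
    # On the grid: copy the file into the working directory and list it.
    edg-rm --vo gilda copyFile "$LFN" "$DEST"
    ls -l "$PWD/$LOCAL_NAME"
else
    # Off the grid (for example when testing locally): only show the command.
    echo "edg-rm --vo gilda copyFile $LFN $DEST"
fi
```

Because the destination is built from $PWD at run time, the same script works whichever generic account and working directory the job is given.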

This is the end of the tutorial.
