EPP Grid - Software Project Proposal



Software Project Proposal

Grid Sandbox Tool

DISCONTINUED: This proposal was written early in 2001 and no further investigation has taken place as similar functionality is available in "3rd generation" middleware such as LCG.


Overview

In normal operation of the grid, a job may require access to many resources. These resources include the application required to execute the job, the data (if any) on which the job will run, auxiliary files that may be required for settings or other purposes, and hardware resources such as temporary disk space and CPU. The term "sandbox" has been used in the context of the grid to describe a temporary work space created on a computing resource that encompasses disk space, auxiliary files, and possibly the job application itself. As yet, this has been an abstract concept and has only been used descriptively. This software project will attempt to solidify the concept of an application sandbox and provide tools for its implementation and use.

Design

The central feature of the "sandbox" will be a configuration file. Due to the disparate nature of resources on the grid, this configuration file and all files referenced within it may be accessed across the grid, either via low-level remote access (e.g. GSIFTP, GASS, or HTTP) or as files within a virtual file system (such as a Replica Catalog). The configuration file will be divided into four parts: environment variables, a temporary file structure, a begin script, and an end script. Processing follows a fixed order: the environment variables are set first; the temporary file structure is then created (discussed in detail below); the begin script is run before or at the start of the job application; the application executes with the configured environment and within the temporary file structure; and finally the end script is run at the end of the job.
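The order of operations described above can be sketched as follows. This is only an illustration in Python under stated assumptions: the function name run_in_sandbox and its parameters are hypothetical, the begin and end scripts are assumed to be POSIX shell snippets, and no files are downloaded.

```python
import os
import shutil
import subprocess
import tempfile


def run_in_sandbox(env_vars, dirs, begin_script, command, end_script):
    """Run `command` (an argv list) using the sandbox order of operations."""
    env = dict(os.environ)
    env.update(env_vars)                        # 1. environment variables
    root = tempfile.mkdtemp(prefix="sandbox.")  # 2. temporary file structure
    try:
        for d in dirs:
            os.makedirs(os.path.join(root, d.lstrip("/")), exist_ok=True)
        if begin_script:                        # 3. begin script
            subprocess.run(begin_script, shell=True, cwd=root, env=env, check=True)
        result = subprocess.run(command, cwd=root, env=env)  # 4. application
        if end_script:                          # 5. end script
            subprocess.run(end_script, shell=True, cwd=root, env=env, check=True)
        return result.returncode
    finally:
        # The temporary structure is destroyed after the end script has run.
        shutil.rmtree(root, ignore_errors=True)
```

The function returns the exit code of the application, so a wrapper script could pass it on to the batch system unchanged.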

The sandbox tool may be accessed using one of two methods. The first is by way of command-line tools. These may be placed at the start and end of a script (e.g. gridsandbox-start and gridsandbox-end) or called manually by a user. The gridsandbox-start command is passed the location (URL) of the configuration file either as an argument or as an environment variable. The second method is to embed sandbox API calls in an application. API calls may be placed at the start and end of the program to enable the sandbox (e.g. gsbstart and gsbend). Alternatively, the first call to any sandbox API function will automatically invoke gsbstart if it has not already been called.

The temporary file structure will be constructed within the local file system in a unique location for each sandbox (e.g. /tmp/sandbox.user.uniqueid). The temporary file structure will be destroyed during the end routine (either gridsandbox-end or gsbend) after any specified end script has run.
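One way to obtain a unique location of the form sandbox.user.uniqueid is to let the operating system pick the unique suffix. The sketch below assumes Python's tempfile module; the function names create_box and destroy_box are hypothetical, and the name check in destroy_box is an added safety measure, not part of the proposal.

```python
import getpass
import os
import shutil
import tempfile


def create_box(base_dir=None, user=None):
    """Create a unique directory named sandbox.<user>.<uniqueid>."""
    user = user or getpass.getuser()
    # mkdtemp appends a random suffix and creates the directory atomically,
    # readable and writable only by the creating user.
    return tempfile.mkdtemp(prefix=f"sandbox.{user}.", dir=base_dir)


def destroy_box(path):
    """Remove a sandbox directory, refusing anything not named sandbox.*"""
    name = os.path.basename(os.path.normpath(path))
    if not name.startswith("sandbox."):
        raise ValueError(f"not a sandbox directory: {path!r}")
    shutil.rmtree(path)
```

When base_dir is omitted, the system default temporary directory is used, which is typically /tmp on Unix systems.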

Files within the sandbox can be accessed/identified via a filename like "sandbox:???", or via a relative path if the current working directory is within the sandbox, or via a full path that specifies a file within the sandbox temporary directory.
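The three ways of naming a sandbox file can be reduced to a single check, sketched here in Python. The function name resolve and its arguments are hypothetical, and treating "sandbox:" as a prefix followed by a path relative to the sandbox root is an assumption based on the "sandbox:filename" form used in the configuration file.

```python
import os

SANDBOX_PREFIX = "sandbox:"


def resolve(path, sandbox_root, cwd=None):
    """Return the absolute path of a sandbox file, or None if it lies outside."""
    root = os.path.realpath(sandbox_root)
    if path.startswith(SANDBOX_PREFIX):
        # "sandbox:" names are taken relative to the sandbox root.
        candidate = os.path.join(root, path[len(SANDBOX_PREFIX):].lstrip("/"))
    elif os.path.isabs(path):
        candidate = path
    else:
        # Relative paths count only if the working directory is in the sandbox.
        candidate = os.path.join(cwd or os.getcwd(), path)
    candidate = os.path.realpath(candidate)
    if candidate == root or candidate.startswith(root + os.sep):
        return candidate
    return None
```

Normalising with realpath before the comparison also rejects names such as "sandbox:../other" that would otherwise escape the temporary directory.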

Specific Features

1.0 API Features
1.1 gsbstart()

This function should be called by the user at the start of the application, but it will be called from any other API call if that has not been done already. This function loads and interprets the configuration file. It creates a temporary directory in which to store files (e.g. /tmp/sandbox.user.uniqueid). All files that are required immediately are downloaded. It then runs the configuration file's begin-script before returning. (All of this will be skipped if the "gridsandbox-start" command-line tool has already been run.)

1.2 gsbend()

This function is called by the user before exiting the application. It stops any file downloads still in progress, runs the configuration file's end-script, destroys the sandbox temporary directory, and returns.

1.3 gsbfopen()

This function is called by the user in place of the fopen() function. If the file is opened for reading and lies within the sandbox (i.e. it is not an attempt to access a file outside the sandbox), the function ensures the file has been downloaded first and then returns an open handle to it. If the file is opened for writing and lies within the sandbox, it is added to the sandbox and opened for writing. A file is treated as being within the sandbox if it is named with the prefix "sandbox:", by a relative path while the current working directory is within the sandbox, or by a full path that specifies a file within the sandbox temporary directory.
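The behaviour described for gsbfopen() can be modelled in a few lines. This is a Python sketch rather than the proposed C API: gsb_open is a hypothetical name, ensure_downloaded is a hypothetical callback standing in for the download machinery, and raising an error for a "sandbox:" name that escapes the sandbox is an assumption.

```python
import os

SANDBOX_PREFIX = "sandbox:"


def gsb_open(path, mode, sandbox_root, ensure_downloaded):
    """Open `path`, staging it first if it is a sandbox file opened for reading."""
    root = os.path.realpath(sandbox_root)
    if path.startswith(SANDBOX_PREFIX):
        target = os.path.join(root, path[len(SANDBOX_PREFIX):].lstrip("/"))
    else:
        target = os.path.abspath(path)  # relative paths resolve against the cwd
    target = os.path.realpath(target)
    inside = target == root or target.startswith(root + os.sep)
    if not inside:
        if path.startswith(SANDBOX_PREFIX):
            raise ValueError(f"path escapes the sandbox: {path!r}")
        return open(path, mode)         # outside the sandbox: ordinary open
    if mode.startswith("r"):
        ensure_downloaded(target)       # block until the file is present
    else:
        # Writing adds the file to the sandbox, creating directories as needed.
        os.makedirs(os.path.dirname(target), exist_ok=True)
    return open(target, mode)
```

Files outside the sandbox fall through to the ordinary open call, so code using the wrapper behaves as before when no sandbox file is involved.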

1.4 gsbfread() gsbfseek() gsbftell() gsbfrewind() gsbfgetpos() gsbfsetpos() gsbfscanf() gsbselect() gsbfgets() gsbfgetc() gsbferror() gsbfeof()

These functions behave exactly like their standard counterparts (fread(), fseek(), and so on). Further functionality may be added later.

1.5 gsbfwrite() gsbfprintf() gsbfputs() gsbfputc()

These functions behave exactly like their standard counterparts (fwrite(), fprintf(), and so on). Further functionality may be added later.

1.6 gsbfclose()

This function is called by the user in place of the fclose() function. It behaves exactly like its standard counterpart. Further functionality may be added later.

1.7 gsbchdir() gsbmkdir() gsbdiskfree()

These functions behave like their standard counterparts, except that they also accept sandbox path names (e.g. "sandbox:").

2.0 Internal Features

These features are for internal use by the sandbox tools. They are not to be called by user code but are central to the operation of the sandbox tools.

2.1 gsb_createbox()

This function will be used internally to create a temporary directory for the sandbox files.

2.2 gsb_destroybox()

This function will be used internally to destroy the temporary directory.

2.3 gsb_addfile()

This function will be used to register a file with the sandbox. This file may be loaded straight away or downloaded later.

2.4 gsb_loadfile()

This function will be used internally to download a single file if it is required and has not already been downloaded. Load methods that will be supported are FTP, GSIFTP, GASS, HTTP, and potentially REPCAT (replica catalog).
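Choosing a load method from the source URL could be handled with a small dispatch table, sketched here in Python. The names load_file and LOADERS are hypothetical, the scheme strings "gsiftp" and "gass" are assumptions about how such URLs would be written, and only HTTP and FTP are actually fetched (through the standard urllib module); the grid protocols are left as placeholders.

```python
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def _fetch_with_urllib(url, dest):
    urlretrieve(url, dest)  # the standard library handles http:// and ftp://


def _not_available(url, dest):
    # Placeholder: GSIFTP and GASS would need grid middleware bindings.
    raise NotImplementedError(f"no loader implemented for {url!r}")


LOADERS = {
    "http": _fetch_with_urllib,
    "ftp": _fetch_with_urllib,
    "gsiftp": _not_available,
    "gass": _not_available,
}


def load_file(url, dest, already_loaded=False, loaders=LOADERS):
    """Download `url` to `dest` unless it is already present; return True if fetched."""
    if already_loaded:
        return False
    scheme = urlsplit(url).scheme.lower()
    if scheme not in loaders:
        raise ValueError(f"unsupported load method: {scheme!r}")
    loaders[scheme](url, dest)
    return True
```

Keeping the table separate from load_file means a further method such as a replica catalog lookup could be added without touching the calling code.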

2.5 gsb_filestate()

There is a need to maintain status information about each file. This function is used to determine if a file has been downloaded, is currently downloading, or needs to be downloaded. This information will be stored in the temporary sandbox directory in a hidden file.
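A hidden status file of this kind might look like the following Python sketch. The file name .gsb_state, the JSON format, the three state names, and the function names are all assumptions made for illustration; the proposal does not specify them.

```python
import json
import os

STATE_FILE = ".gsb_state"  # hypothetical hidden file inside the sandbox directory
VALID_STATES = ("needed", "downloading", "downloaded")


def _read_states(box):
    try:
        with open(os.path.join(box, STATE_FILE)) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def set_file_state(box, name, state):
    """Record the download state of `name` in the hidden state file."""
    if state not in VALID_STATES:
        raise ValueError(f"unknown state: {state!r}")
    states = _read_states(box)
    states[name] = state
    tmp = os.path.join(box, STATE_FILE + ".tmp")
    with open(tmp, "w") as fh:
        json.dump(states, fh)
    # Replace the old file in one step so readers never see a partial write.
    os.replace(tmp, os.path.join(box, STATE_FILE))


def file_state(box, name):
    """Return the recorded state; files never recorded still need downloading."""
    return _read_states(box).get(name, "needed")
```

A real implementation would also need locking if the command-line tools and the application can update the file at the same time.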

3.0 Command-line Features

3.1 gridsandbox-start

This command loads the configuration file, sets up the environment and the temporary sandbox directory, downloads the necessary files, and runs the configuration begin-script.

3.2 gridsandbox-end

This command will run the configuration end-script and then delete the temporary sandbox directory. (All of this will be skipped if the "gsbend()" API function has already been called by the application.)

4.0 Sandbox Configuration File

The format of the configuration file is outlined here:

# some comment ignored by sandbox
ENVIRONMENT {
   VARIABLE=value
   VARIABLE2=another value
   VARIABLE3=sandbox:filename
}
FILES {
   # some comment ignored by sandbox
   CWD=start directory [default sandbox root sandbox:/]
   FILE=/dir/filename  
   FILE=/mydir/otherdir/myfile.conf http://host/myfile.conf stage
   DIR=/dir/createdir
   DIR=/somedir
}
BEGIN {
   <some script to run before execution of application>
}
END {
   <some script to run after end of execution>
}

The last field of a FILE entry that names a source URL (as in the example entry ending in "stage" above) is the "stagetype". It may be "stage" for files that must be downloaded immediately, before the application executes; "auto" for files that should only be downloaded if they are accessed; or "never" for files that should not be downloaded (this simply disables the download of a file).
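A reader for this format could be as small as the Python sketch below. It rests on several assumptions not stated in the proposal: each section opens with "NAME {" on one line and closes with "}" on a line of its own, a FILE value is split on whitespace into path, source URL, and stagetype, and the function name parse_config is hypothetical. A script containing a bare "}" line would end its section early.

```python
def parse_config(text):
    """Parse a sandbox configuration string into a dictionary."""
    config = {"ENVIRONMENT": {}, "FILES": {"CWD": None, "FILE": [], "DIR": []},
              "BEGIN": "", "END": ""}
    section, script = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if section is None:
            if not line or line.startswith("#"):
                continue
            if line.endswith("{") and line[:-1].strip() in config:
                section, script = line[:-1].strip(), []
                continue
            raise ValueError(f"unexpected line outside a section: {raw!r}")
        if line == "}":
            if section in ("BEGIN", "END"):
                config[section] = "\n".join(script)
            section = None
            continue
        if section in ("BEGIN", "END"):
            script.append(line)          # script text is kept verbatim
            continue
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep:
            raise ValueError(f"expected KEY=value: {raw!r}")
        if section == "ENVIRONMENT":
            config["ENVIRONMENT"][key] = value   # values may contain spaces
        elif key == "CWD":
            config["FILES"]["CWD"] = value
        elif key == "DIR":
            config["FILES"]["DIR"].append(value)
        elif key == "FILE":
            fields = value.split() + [None, None, None]
            path, url, stagetype = fields[:3]
            config["FILES"]["FILE"].append(
                {"path": path, "url": url, "stagetype": stagetype})
        else:
            raise ValueError(f"unknown FILES key: {key!r}")
    if section is not None:
        raise ValueError(f"section {section} is not closed")
    return config
```

Entries that give only a path are returned with url and stagetype set to None, leaving the choice of default to the caller.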

Requirements

Calls to the sandbox API within a user application should work transparently if either no sandbox is present or the user attempts to access a file outside the sandbox. In this way, using sandbox API functions in place of the normal file functions will not break the original functionality of the user's application.

Future Development

The following features may be incorporated at a future date:

  • Ability to copy whole FTP or HTTP directories of files specified in the configuration file.
  • Ability to limit the total size of files within the sandbox to some maximum value specified in the configuration file.
  • Ability to specify some files as stagetype "link" within the configuration file. This does not download the file but actually reads the file directly from its remote location when accessed. We may use grid GASS functions to incorporate this.
  • Ability to specify some files as stagetype "async" within the configuration file. This starts download immediately but in the background. Checking must be performed to determine if the file download has finished before it is accessed by the application.
  • Ability to specify some files as stagetype "autoasync" within the configuration file. This does not start download immediately but only once accessed. The download, once accessed, begins in the background. Checking must be performed to determine if the file download has finished before a read operation.
  • Access to partially downloaded files. This means that an open or read operation will not block until the whole file is downloaded. It will block only until enough of the file has been downloaded to satisfy the next read operation.
  • A sandbox cache that stores frequently downloaded or accessed files. If the same file is used again on the system within a short period of time (to be specified in local system configuration file) the file is obtained from the cache and not downloaded.
  • Segmented download of files. To enable faster transfer, files over a certain size can be downloaded in parts in parallel.
  • Random partial file download if seek is used to access a file. If the random access routine seek is used to access a file, we can skip the portion of the file that is currently downloading and download only the portion at the seek position required for reading. This means we need to keep track of which parts of a file have been downloaded. (This feature may be complementary to the "segmented download of files" feature.)
  • Extraction of files from TAR-GZ archives.
  • Because a user may have omitted a call to the end routine, the start routine (either gridsandbox-start or gsbstart) may destroy older temporary file structures in an attempt at self clean-up. The minimum age of temporary file structures eligible for clean-up will be set in a local system configuration file. (This local configuration file may contain other settings for the local system's use of the sandbox tools.)
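The self clean-up described in the last item above could work along these lines. This Python sketch assumes that sandbox directories are named sandbox.* inside a common base directory and that age is judged by the directory's modification time; the function name clean_stale_boxes and its parameters are hypothetical.

```python
import os
import shutil
import time


def clean_stale_boxes(base_dir, min_age_seconds, now=None):
    """Remove sandbox.* directories at least `min_age_seconds` old."""
    now = time.time() if now is None else now
    removed = []
    # Take a snapshot of the listing first because entries are deleted below.
    for entry in list(os.scandir(base_dir)):
        if not entry.name.startswith("sandbox."):
            continue
        if not entry.is_dir(follow_symlinks=False):
            continue
        age = now - entry.stat(follow_symlinks=False).st_mtime
        if age >= min_age_seconds:
            shutil.rmtree(entry.path, ignore_errors=True)
            removed.append(entry.name)
    return sorted(removed)
```

A production version would probably also check that each directory belongs to the calling user before deleting it.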

EPPGrid.SoftwareSandbox moved from EPP.GridToolsSandbox on 15 Feb 2005 - 04:15 by Main.LyleWinton
Revision:  r2 - 15 Feb 2005 - LyleWinton
Authorised by:  Geoff Taylor (G.Taylor @ physics.unimelb.edu.au)
Copyright © 2000-2009 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.