EPP Grid - Belle Metadata Proposal


Belle Metadata Proposal

CCLRC's Scientific Meta-data Model (CSMDM)

This is an investigation of the use of the CCLRC's scientific metadata model, CSMDM, in the context of the Belle collaboration and experimental data.

CSMDM Terms

I will refer to the following terms often throughout this document:
Term              Abbrev  Meaning
Study             S       Complex object within the model
Investigation     I       Complex object within the model
DataHolding       DH      Complex object within the model
DataCollection    DC      Complex object within the model
AtomicDataObject  ADO     Complex object within the model

Basics of CSMDM

The following basic rules apply to the model:
  • DH/DC/ADO can have one or more Location objects.
  • DH records have a parent ID (InvestigationID).
  • I can only have one DH
  • DH can only belong to one I
  • DH can contain 0 or many DCs
    • eg. raw, intermediate, final collections
    • when data is reprocessed, new final collections emerge (open question: what happens to the old ones?)
  • DC can contain DCs
  • DHs and DCs can both contain DataDescriptions, which can have parameter values
  • "Fixed" parameters are set conditions, "variable" parameters are derived or measured. Fixed parameters are either experimental conditions or input data.
  • In DH/DC/ADOs, RelatedReference objects refer to data that these objects depend on, or that was used as input to create them.
  • In DataDescription a Parameter can contain 0 or many Parameters
  • In DataDescription a Parameter may have a ParamValue, a Range, both (if value specified with error), or neither (to indicate a description of the data, eg. Ntuple parameters)
  • A RelatedReference object allows you to specify relationships between DC.
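
The containment rules above can be pictured as a few nested record types. The following is a minimal Python sketch, not part of the CSMDM itself: the class and field names are assumptions chosen for illustration, and only the rules listed above (nested Parameters, optional ParamValue and Range, DCs containing DCs, one DH per Investigation) are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Parameter:
    name: str
    value: Optional[str] = None               # ParamValue, if any
    range: Optional[Tuple[str, str]] = None   # Range as (lower, upper), if any
    children: List["Parameter"] = field(default_factory=list)  # 0 or many

    def kind(self) -> str:
        """Classify the parameter by which of value/range are present."""
        if self.value is not None and self.range is not None:
            return "value with error"
        if self.value is not None:
            return "value"
        if self.range is not None:
            return "range"
        return "description only"


@dataclass
class DataCollection:
    name: str
    parameters: List[Parameter] = field(default_factory=list)
    collections: List["DataCollection"] = field(default_factory=list)  # DC can contain DCs


@dataclass
class DataHolding:
    investigation_id: str  # parent ID; a DH belongs to exactly one Investigation
    collections: List[DataCollection] = field(default_factory=list)  # 0 or many DCs


# An Ntuple is described by parameters that carry neither value nor range.
ntuple = Parameter("ntuple: all", children=[Parameter("M_bc"), Parameter("delta_E")])
# A histogram axis is described by a parameter with a range only.
axis = Parameter("delta_E", range=("-0.06", "0.06"))
print(ntuple.kind(), "/", axis.kind())
```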

Belle Recommendations

The following is a list of notes and proposals:
  • A "Study" element is effectively a first level directory structure.
    • Proposal: Any domain of responsibility (for data or a paper) is a "Study". eg. Collaboration (collaborative data), Working group (specific data and papers), Individual (specific data or papers).
  • Any logical collection of data can be represented as an "Investigation" element with a "DataHolding" element.
    • Proposal: Any logical collection of data and papers is an "Investigation". eg. Specific production, specific working group study, work towards a paper/thesis.
  • Any logical or physical collection of data can be a "DataCollection" element.
    • Proposals:
    • The smallest collection is a group or batch of output at one physical location or path. This is expected to be a physical collection of similar data.
    • Any "DataCollection" can be a virtual collection of other collections.
    • Any logical collection of data related to a specific task like a production run will likely be a virtual DataCollection of collections.
  • A "RelatedReference" could be of type "Derived From", "Prior Study" or "Used By".
    • Proposals:
    • RelatedReferences should only be used within DataCollections.
    • "Derived From" specifies input data (with Direction=From), such as the generator input used to create this MDST, or the MDST/skim used to create this NTuple/Histo.
    • "Prior Study" specifies a new version (with Direction=Peer), such as a reprocessed MDST.
    • "Used By" is probably not necessary, but indicates further processing (with Direction=To), such as the final MDST created from this generator data.
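
As an illustration of the proposal above, each reference type implies one direction, so the pairing can be held in a small lookup table. This is only a sketch: the function name and the dictionary layout are hypothetical, while the type names, directions and the collection name are taken from this document.

```python
# Direction implied by each RelatedReference type, as proposed in the list above.
DIRECTION_BY_TYPE = {
    "Derived From": "From",
    "Prior Study": "Peer",
    "Used By": "To",
}


def make_related_reference(ref_type, data_collection_name):
    """Return a plain dict describing a RelatedReference between DataCollections."""
    if ref_type not in DIRECTION_BY_TYPE:
        raise ValueError(f"unsupported RelatedReference type: {ref_type!r}")
    return {
        "Type": ref_type,
        "Direction": DIRECTION_BY_TYPE[ref_type],
        # The proposal restricts RelatedReferences to DataCollections.
        "ReferredToItem": "DataCollection",
        "DataCollection": data_collection_name,
    }


ref = make_related_reference("Derived From", "Skim_XYZ_CharmMC_e000015_data")
print(ref["Type"], "->", ref["Direction"])
```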

The structure of the CSMDM must be interpreted within the context of the Belle experiment. The following is the basic structure within the Belle context in parent-child order:

  • Policy - unknown (unused)
  • Programme - "Belle Experiment on KEK B factory" (unused)
  • Study - collaboration itself, working groups, and individual research.
    eg. One for the general "Belle production", one for each working group or activity, and one for each Belle member. (There is always a 1-1 relationship between MetadataRecord and Study objects.)
  • Investigation - logical collection of data and papers.
    eg. Specific production/skim, specific working group study, work towards a paper/thesis. (There is always a 1-1 relationship between Investigation and DataHolding objects.)
  • DataCollection - any further logical/physical collection of data.
    eg. Physical collections might include specific production log files, data for various experimental periods (e000025), or a previous version of the data. Logical collections might include all current collections associated with a specific skim (eg. skim1-e000015, skim1-e000017, ...).

Investigations come in 4 categories: Experiment, Simulation, Measurement, and Other. Categorisation is left up to the Investigation owner; however, the following might be considered:

  • Experiment -
    For the "Belle production" study these might include run periods (eg. e000025) with data release versions and production skims. For working groups these might include skims. For individuals these might include individual skims.
  • Simulation -
    For the "Belle production" study these might be broken into run periods, or data releases (tagged), or event production types. For individuals these might include personal production runs.
  • Measurement -
    For the "Belle production" study these might be for beam measurements and conditions. For working groups or individuals these might include specific papers or work towards results.

Most day to day operations within Belle will probably involve creation and management of information at the level of DataCollection and AtomicDataObject. It would be cumbersome to have to create/manage whole CSMDM objects for all daily operations. Some sections will likely change with less frequency. Basic information in objects such as StudyPerson, Study and Investigation, will change very little during the lifetime of projects once established. It is recommended that Belle break the structure of CSMDM into independent sections (still inter-linked) to reflect logically grouped information that might be managed with similar frequency. These sections will be referred to as "domains". 4 domains can be identified, each representing sub-documents in XML or starting points in the structure tree. The higher level domains will not include information from the low levels but will include enough information to link in the lower level domains and reconstruct the whole CSMDM structure.

  • Domains:
    • MetadataRecord (top level, includes Topic and Study information)
    • StudyPerson (personal information, address etc.)
    • Investigation
    • DataCollection

There are several levels of conformance available within the CSMDM:

Level  Included
L1     DC study and investigation info
L2     also data description and location
L3     also related material, access, and data collection info
L4     also atomic data objects

Level 4 includes the referencing of individual HBOOK/ROOT files and their contents. However, logical collections of information that do not specify individual files may be useful (eg. production data collections or skims). To allow this, a lower level of conformance, level 3, is necessary.

Belle Use Cases

Here are a number of use cases which may be instructive to explore.

Generic Data Collection use case

The CSMDM may be used as a virtual navigation tree for data collections. The structure might be navigated in this way:

Study
Study
Study
  Investigation
  Investigation
    DataCollection
    DataCollection
      DataCollection
        DataCollection
  Investigation
  ...

As will be shown in the examples, optional meta-data can be attached to the tree structure and all elements within. This could be used to aid navigation or searching of the tree.

DataCollections could be stored as stand-alone objects (see Database Structure). Collections can exist in multiple locations within the tree, similar to symbolic links. DataCollections that have no parent Investigation are not presented to users navigating the tree.

Manual Collection Process

  1. Create DataCollection stand-alone template object(s) in XML format
  2. Modify to provide DataName, Status, Description, Parameters, Software, Locator, RelatedReferences for each collection
    • Locator may take the form of an SRB wildcard pattern such as srb:/zone/collection/*
  3. Ingest DataCollection stand-alone object(s)
  4. Include/attach to parent Investigation or DataCollection
    • A parent object could be specified upon template creation and its ID built into the template. Attachment could then be automatic on ingestion.
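
Steps 1 and 2 of the manual process could be sketched with Python's standard xml.etree.ElementTree module. The function name collection_template is hypothetical, and the element names follow the DataCollection examples and breakdown elsewhere in this document rather than a validated schema; the values are taken from the collection example below.

```python
import xml.etree.ElementTree as ET


def collection_template(data_name, description, locator, status="Complete"):
    """Build a stand-alone DataCollection element ready to be edited and ingested."""
    dc = ET.Element("DataCollection")
    desc = ET.SubElement(dc, "DataDescription")
    ET.SubElement(desc, "DataName").text = data_name
    ET.SubElement(desc, "TypeOfData").text = "Collection"
    ET.SubElement(desc, "Status").text = status
    logical = ET.SubElement(desc, "LogicalDescription")
    ET.SubElement(logical, "Description").text = description
    loc = ET.SubElement(dc, "DataCollectionLocator")
    # The locator repeats the DataName, as in the examples in this document.
    ET.SubElement(loc, "DataName").text = data_name
    ET.SubElement(loc, "Locator", pathtype="absolute").text = locator
    return dc


template = collection_template(
    "ddks_skim_20060514",
    "Skim from latest bugfix in code.",
    "srb:/kekb/belle/production/skim_xyz/charm/e000015/data/*.mdst*",
)
print(ET.tostring(template, encoding="unicode"))
```

Parameters, Software and RelatedReferences are left out here to keep the sketch short; they would be added to the same element before ingestion.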

Automated Collection and ADO Process

  1. Create DataCollection stand-alone template object(s) in XML format
  2. Modify to provide DataName, Status, Description, Parameters, Software, Locator, RelatedReferences
  3. Automatic generation of AtomicDataObjects for inclusion in above template object(s).
    • Possible automated creation of ADO DataDescription Parameters and Software.
    • Idea: Create a tool that takes a local/remote data object, an absolute remote data location, auxiliary information sources (BASF log file), identifies the file type (HBOOK,ROOT,MDST,GEN,LOG), extracts information from the local file (if necessary), extracts information from auxiliary sources (if necessary), and generates an ADO with DataDescription and Parameters. (See JHOVE project.)
  4. Merging of generated ADOs with their related template object.
    • Possible check for aggregate parameters. If the same Parameter object is specified on every ADO, it could be moved to the DataCollection level.
  5. Ingest DataCollection stand-alone object(s)
  6. Include/attach to parent Investigation or DataCollection

From the point of view of a system user, many of these steps could be integrated. For example, the last 3 steps could be part of a single merge-and-ingest operation.
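
The aggregate parameter check in step 4 could work roughly as follows. This is a sketch under simplifying assumptions: each ADO's parameters are represented as a flat name-to-value dictionary, the function name is hypothetical, and the second ADO's events_processed value is made up for illustration.

```python
def hoist_common_parameters(ado_params):
    """Split per-ADO parameters into those shared by every ADO and the rest.

    ado_params is a list of dicts (parameter name -> value), one per ADO.
    Returns (common, remaining), where common belongs at the DataCollection
    level and remaining holds what is left on each ADO.
    """
    if not ado_params:
        return {}, []
    first, others = ado_params[0], ado_params[1:]
    common = {
        name: value
        for name, value in first.items()
        if all(name in p and p[name] == value for p in others)
    }
    remaining = [
        {name: value for name, value in p.items() if name not in common}
        for p in ado_params
    ]
    return common, remaining


ados = [
    {"datatype": "simulation", "experiment": "e000015", "events_processed": "100000"},
    {"datatype": "simulation", "experiment": "e000015", "events_processed": "50000"},
]
common, remaining = hoist_common_parameters(ados)
print(common)
print(remaining)
```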

Collection example

   $  meta-template list
   $  meta-template --interactive create collection-analysis-data > mycoll1.xml
   Parent Investigation or Collection (default=none):  mydata
   Found specified collection with write access:
     Study:  Martin Sevior, Belle investigations
     Investigation:  B -> D* D* Ks paper 2006
     DataCollection:  mydata
   Use this collection?  yes
   DataName:  ddks_skim_20060514
   Status (default=complete):
   Description:  Skim from latest bugfix in code.
   Parameters:  datatype=simulation
   Parameters:  events_processed=100000
   Parameters:  event_type=charm
   Parameters:  skim_type=martin's ddks skim
   Parameters:  experiment=e000015
   Parameters:
   Software:  BASF
     Version:  b20030807_1600
   Locator:  srb:/kekb/belle/production/skim_xyz/charm/e000015/data/*.mdst*
   RelatedReferences:  Skim_XYZ_CharmMC_e000015_data
   Found specified collection:
     Study:  Belle Production
     Investigation:  Skim_XYZ
     DataCollection:  Skim_XYZ_CharmMC_e000015_data
   Use this collection?  yes
   $  meta-ingest mycoll1.xml

Collection and ADO example

   $  meta-template list
   $  meta-template --interactive create collection-hbook > mycoll2.xml
   Parent Investigation or Collection (default=none):  myhistos
   Found Collection with write access:
     Study:  Martin Sevior, Belle investigations
     Investigation:  B -> D* D* Ks paper 2006
     DataCollection:  myhistos
   Use this collection?  yes
   DataName:  histos_200606
   Status (default=complete):
   Description:  Analysis histos from latest bugfix in code.
   Parameters:  
   Software:
   Locator:
   RelatedReferences:  Skim_XYZ_CharmMC_e000015_data
   $  meta-guess ./test1.hbook --location=srb:guid:anusf:148860 --aux=./test.log > mydata1.xml
   Determined...
   FileType=hbook
   DETERMINED PARAMETERS...
   datatype=simulation
   experiment=e000015
   events_processed=100000
   histogram=M_bc good
     M_bc=5.2 to ...
   histogram=delta_E good
     delta_E=-0.06 to 0.06
   ntuple=all
     entries=19232
     M_bc variable
     delta_E variable
   DETERMINED SOFTWARE...
   ProgramName=BASF
   Version=b20020424_1007
   ProgramName=ddks_ana.so
   FINISHED.
   $  meta-guess ./test2.hbook --location=srb:guid:anusf:148860 --aux=./test.log > mydata2.xml
   $  meta-ingest mycoll2.xml mydata1.xml mydata2.xml

Histogram use case

This is very similar to the generic data collection use case. The main difference is that Histograms and Ntuples inside ROOT/HBOOK files can be described using Parameter objects.

Histograms and NTuples can be expressed as Parameter objects. Each Histogram/NTuple object can have further sub parameters describing the contents of the object. Histogram axes can be expressed as Parameter objects with a range. NTuple variables can be expressed as Parameter objects without a value or range.
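
The encoding described above might be generated as in the following sketch, which uses Python's standard xml.etree.ElementTree module. The helper names (parameter, histogram, ntuple) are hypothetical; the element names and the "histogram:" / "ntuple:" naming prefixes follow the examples in this document.

```python
import xml.etree.ElementTree as ET


def parameter(name, derivation="measured", unit=None, limits=None):
    """Build a Parameter element; limits is an optional (lower, upper) pair."""
    param = ET.Element("Parameter")
    ET.SubElement(param, "ParamName").text = name
    ET.SubElement(param, "Derivation").text = derivation
    if unit is not None:
        units = ET.SubElement(param, "Units")
        ET.SubElement(units, "UnitName").text = unit
    if limits is not None:
        rng = ET.SubElement(param, "Range")
        for bound, value in zip(("lower", "upper"), limits):
            ET.SubElement(rng, "Limit", bound=bound).text = str(value)
    return param


def histogram(title, axes):
    """A histogram is a Parameter whose sub parameters are its axes (with ranges)."""
    hist = parameter(f"histogram: {title}")
    hist.extend(axes)
    return hist


def ntuple(title, variables):
    """An NTuple is a Parameter whose sub parameters have neither value nor range."""
    nt = parameter(f"ntuple: {title}")
    nt.extend([parameter(v) for v in variables])
    return nt


hist = histogram("delta E", [parameter("delta E", unit="GeV", limits=(-0.06, 0.06))])
nt = ntuple("all", ["M_bc", "delta_E"])
print(ET.tostring(hist, encoding="unicode"))
print(ET.tostring(nt, encoding="unicode"))
```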

In the case of a production run that generates multiple histograms it may be desirable for histogram information to be omitted from the description.

Paper/Document use case

A paper and related files can be held in a DataCollection object. This may contain data specific to its production, and/or may contain "RelatedReference" objects pointing to DataCollections that were used in its production.

It is recommended that a paper is structured as a DataCollection with AtomicDataObjects for various paper formats (PDF, PS, DOC), with a sub DataCollection for source files (TEX, BIB, STY, EPS), and DataCollections/RelatedReference for data files (HBOOK, ROOT, MDST).

Open Access and Published Data

The aim of storing digital assets in a common format such as CSMDM is to be able to exchange data. Meta-data descriptions of digital assets (theses, papers and data) are typically kept in institutional digital repositories. Such repositories are generally "open access" and indexed by meta repositories (eg. arXiv, Google scholar).

It should be recognised that while the meta-data is open access the data itself need not be. Papers may have associated publisher copyright and research data may have even further restrictions. CSMDM attempts to take this into account via the AccessConditions Structure which includes such situations as "data available on application".

The significant difference between internal and published Belle data is probably access and ownership. Published Belle data is owned by the collaboration and will have a large author (StudyPerson) list. Internal data will be owned by an individual or working group and is typically a work in progress. Published data must remain static, whereas internal data may be quite dynamic. To satisfy these two types of records I believe two separate systems are required. The first is a dynamic meta-data system for internal data, the second a static system for published data. Many institutional digital repositories support static, published meta-data and data, so such systems are readily available (ARROW). This proposal will focus on the design of an internal dynamic meta-data system, with a view to migrating completed/published works to the usual static digital repository.


Martin's Ideas

  • Possibly best to operate at histogram level as this is required to reproduce results in paper.
  • Wants to work backwards from the histogram and what would be useful for theorists to extract from histos, or what experimentalists would like to extract from theorists.
    • Paper. Root file. Histo description format.


Implementation:

Database Structure

In practice we will not deal with the whole of the CSMDM structure at once. Typically only sections of the structure will need to be created, updated, moved and linked. To facilitate this we will break the structure into a number of "domains". There will be 4 domains in total, each representing sub-documents in XML or starting points in the structure tree. The higher level domains will not include information from the lower levels but will include enough information to pull in the lower level domains and reconstruct the whole CSMDM structure.
  • Domains:
    • StudyPerson
    • DataCollection
    • Investigation
    • MetadataRecord (including Topic and Study)
  • How do we link these domains to create the CSMDM structure?
    • MetadataRecord is the top level domain. All CSMDM documents are constructed from these records.
    • Many objects within the model contain an ID attribute. Empty/blank objects can be placed in parent domains with specified ID indicating that this document is linked/inserted here.
      eg. <DataCollection dataid="ID004324"/>
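
Reconstructing a whole document from linked domains could then be a recursive substitution of placeholders. The sketch below assumes the stored sub-documents are available in a dictionary keyed by ID (the store argument and the function names are hypothetical), uses Python's standard xml.etree.ElementTree module, and does not guard against circular links.

```python
import copy
import xml.etree.ElementTree as ET

# Tag of each linkable domain and the attribute that carries its ID.
ID_ATTRIBUTE = {"Investigation": "InvestigationID", "DataCollection": "dataid"}


def is_placeholder(element):
    """True for an empty element of a linkable domain (it carries only an ID)."""
    return (
        element.tag in ID_ATTRIBUTE
        and len(element) == 0
        and not (element.text or "").strip()
    )


def resolve(element, store):
    """Replace placeholders under element with copies of the stored sub-documents."""
    for index, child in enumerate(list(element)):
        if is_placeholder(child):
            key = child.get(ID_ATTRIBUTE[child.tag])
            if key in store:
                full = copy.deepcopy(store[key])  # keep the stored copy untouched
                full.tail = child.tail
                element.remove(child)
                element.insert(index, full)
                child = full
        resolve(child, store)  # linked documents may contain further placeholders
    return element


holding = ET.fromstring(
    '<DataHolding InvestigationID="I1"><DataCollection dataid="ID004324"/></DataHolding>'
)
store = {
    "ID004324": ET.fromstring(
        '<DataCollection dataid="ID004324">'
        "<DataDescription><DataName>mydata</DataName></DataDescription>"
        "</DataCollection>"
    )
}
resolve(holding, store)
print(ET.tostring(holding, encoding="unicode"))
```

Placeholders whose ID is not found in the store are left in place, so a partially reconstructed document still shows where the missing links are.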

CSMDM domain entity relationship diagram

Here is a breakdown of the domains including what sub-documents are included within the domain (via the symbol -> ), possible string values for simple objects (eg. TypeOfData ~ ...), what key to use for the domain records (key=), possibly useful DB indices for searching (index=), and some structural information regarding the sub-documents ([]=attribute, ?=optional, *=0 or more, +=1 or more). In most cases complex objects (eg. Topic, Subjects) are broken down further. Objects that are not broken down (eg. Discipline, StudyName) are typically simple text or enumerated value objects. This is somewhat cut-down for the Belle context:

  • MetadataRecord -> [MetadataID], [Facility], Topic, Study, AccessConditions?, RelatedPublications*
    • key=MetadataID (use PURL)
    • index=Study[StudyID]
    • index=Study/Investigation[InvestigationID]
    • Topic -> Keywords*, Subjects+
      • Keywords -> Discipline, KeywordSource?, Keyword+
      • Subjects -> Discipline, SubjectSource?, Subject
        • Subject -> SubjectName, Subject?
    • Study -> [StudyID], StudyName, StudyInstitution*, StudyPerson+, StudyInformation, Notes?, RelatedReference*, Investigation+
      • StudyInstitution -> Name?, Role?
        • Name -> [InstitutionID]?, [institutiontype]
          • institutiontype ~ academic, research, government, military, commercial, nonprofit, other
      • StudyInformation -> Funding?, TimePeriod, Purpose, StudyStatus, Resources*
        • TimePeriod -> StartDate?, EndDate?
          • StartDate/EndDate -> Date, Time?
        • Purpose -> Abstract?
    • AccessConditions -> [acsystem]
      • acsystem ~ On Application, Digital Access Control System, Other
    • PublicationType -> PublicationName, Author+, Identifier*, URI*
      • Author -> [authortype]
        • authortype ~ primary, co, other
      • Identifier -> [identsystem]
  • StudyPerson -> Name, InstitutionAffiliatedTo?, ContactDetails+, RoleInStudy, RoleInInstitution?
    • key=ID
    • index=Name/Surname
    • Name -> Surname, MiddleInitials?, Forename, Title?
      • Title ~ professor, Professor, Prof, doctor, Doctor, Dr, Mr, Mrs, Ms, other
    • ContactDetails -> Address, DirectLine?, Switchboard, Fax?, Email?, WebPage?
      • Address -> AddressLine1, AddressLine2?, AddressLine3?, AddressLine4?, Town, Region?, Postcode?, Country
        • Country -> [countryabbrev]?
    • RoleInStudy ~ Post Doctoral Research Assistant, pdra, PDRA, PI, Principal Investigator, Co-Investigator, Data Holder, Data Manager, Other
    • RoleInInstitution ~ Professor, Senior Lecturer, Lecturer, PDRA, Post Doctoral Research Assistant, PG, Post Graduate, Undergraduate, other
  • Investigation -> [InvestigationID], Name, InvestigationType, Abstract, Resources*, DataHolding?
    • key=InvestigationID
    • index=DataHolding/DataCollection[dataid]
    • InvestigationType ~ Experiment, experiment, Measurement, measurement, Simulation, simulation, other
    • DataHolding -> [InvestigationID], DataDescription, DataHoldingLocator, RelatedReference*, DataCollection*, AtomicDataObject*
      • DataHoldingLocator -> same as for DataCollectionLocator
  • DataCollection -> DataDescription, DataCollectionLocator?, RelatedReference*, AtomicDataObject*, DataCollection*
    • key=dataid
    • index=DataDescription/Status
    • DataDescription -> DataName, TypeOfData?, Status?, DataTopic?, LogicalDescription?, Software?
      • TypeOfData ~ Collection, File, BLOB, Database Select, Named Select, other
      • LogicalDescription -> Parameter*, TimePeriod?, Description?, FacilityUsed?
        • Parameter -> ParamName, Derivation, Units?, ParamValue?, Range?, Parameter*
          • Derivation ~ condition (experimental condition), measured (in data set), calculated (derived from data), environment (eg. compiler version), other
          • Units -> UnitName?, UnitAcronym?, UnitSystem?, UnitFormat?
          • Range -> Limit+, MarginOfError?
            • Limit -> [bound]?
        • FacilityUsed -> FacilityName, Resource*
      • Software -> Production*, Analysis*, Conversion*, Visualisation*, MultiPurpose*, other*
        • all software sub-docs -> LongName?, ProgramName, Version, URI, OperatingSystem?, OperatingSystemVersion?, Architecture?
    • DataCollectionLocator -> DataName (=DataCollection.DataDescription.DataName), Locator*
      • Locator -> [pathtype]
        • pathtype ~ absolute, file_absolute, relative, file_relative, database, other
    • AtomicDataObject -> [dataid], DataDescription, ADOLocator*, RelatedReference*
      • ADOLocator -> Locator*, AccessMethod*, Size?, offset?, length?
        • Locator -> [pathtype]
        • AccessMethod -> [authenticationtype]
        • NOTES: Use either <ADOLocator xsi:type="FileADOL"> or <ADOLocator xsi:type="SelectNamedADOL"> for additional fields (see docs)
    • RelatedReference -> Type, Direction?, ReferredToItem, Method, ReferenceLocation+
      • Type ~ Derived, Used By, Prior Study, Follow On Study, Parent Study
      • Direction ~ From, To, Peer
      • ReferredToItem ~ Study, Investigation, DataCollection, AtomicDataObject
      • ReferenceLocation -> Server?, Port?, Service?, Archive, ArchiveId?, StudyName, StudyId?, InvestigationName?, InvestigationId?, DataCollection? (=DataCollection.DataDescription.DataName), DataCollectionId? (=DataCollection[dataid]), ADOName?, ADOId?, Locator* (physical data location)
        • Locator -> [pathtype]
        • NOTES: specify DataCollection or ADOName in preference to URIs in Locator objects. If collections or files are stored later under the same name they can be found. Location should be stored within DataCollection or ADO, not within RelatedReference.

To simplify the structure without compromising the schema we propose a number of restrictions:

  • ReferenceLocation.Locator should only be used to reference data not in an existing DataCollection. DataCollectionId should be used in preference.
  • DataHoldingLocator vs DataCollectionLocator ???

The following information will also be held for each domain:

  • DocumentType (eg. Investigation ; example followed through next points)
  • DocumentIdentifier (eg. InvestigationID)
  • ParentDomains zero or more (eg. MetadataRecord)
  • ParentXpath for each ParentDomain, one or more (eg. Study/Investigation[InvestigationID=%ID%] )
  • WholeSubDocuments zero or more (eg. Name, InvestigationType, Abstract, Resources, DataHolding, DataDescription, RelatedReference)
    • Used for searching, as we may need to determine in which domains complete sub-documents reside.
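
A ParentXpath template such as the one above could be expanded and used to find where a sub-document is linked into its parent. In this sketch the translation to the attribute-predicate syntax understood by Python's xml.etree.ElementTree (for example [@InvestigationID='...']) is an assumption about how the template is meant to be read, and the function name is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def to_elementtree_path(parent_xpath, doc_id):
    """Expand %ID% and rewrite [Attr=%ID%] as an ElementTree attribute predicate."""
    return re.sub(
        r"\[(\w+)=%ID%\]",
        lambda match: f"[@{match.group(1)}='{doc_id}']",
        parent_xpath,
    )


record = ET.fromstring(
    "<MetadataRecord><Study>"
    '<Investigation InvestigationID="urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C"/>'
    "</Study></MetadataRecord>"
)
path = to_elementtree_path(
    "Study/Investigation[InvestigationID=%ID%]",
    "urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C",
)
placeholder = record.find(path)
print(path)
print(placeholder is not None)
```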

Schema Problems

StudyInstitution, StudyPerson:
An institution or person cannot have more than one role in a study or institution.

StudyInformation:
A study cannot have more than one source of funding.

ADOLocator:
Casting an ADOLocator object as xsi:type="SelectNamedADOL" allows you to specify <DataFormat formatsystem="MIME">, but there is no way to specify a MIME type for FileADOL.

Software:
It is impossible to specify the various modes of operation for some software. Most high energy physics software is multi-purpose; that is, a specific application or framework can be used for various levels of production and analysis. The classifications are too limited to allow specification or further description of processes such as skimming. The "other" software classification provides no way of stating what the other value is.

RelatedReference:
"Type", "Direction" and "ReferredToItem" are not specified within the schema.

Examples

The following is an example of the domain components associated with a paper and related data.


Study/MetadataRecord parent component used throughout this example...
<MetadataRecord MetadataID="KEK_Belle:urn:uuid:464220f8-d8f8-1028-b3a1-000E35A1F66C" Facility="KEK">
<Topic>
  <Keywords><Discipline>Physics</Discipline>
    <Keyword>KEK</Keyword>
    <Keyword>Belle</Keyword>
    <Keyword>B meson</Keyword>
  </Keywords>
  <Subjects><Discipline>Physics</Discipline>
    <SubjectSource>http://epp.ph.unimelb.edu.au/EPPGrid/MetaDataSubjectList</SubjectSource>
    <Subject><SubjectName>Experimental High Energy</SubjectName>
      <Subject><SubjectName>CP violation</SubjectName>
        <Subject><SubjectName>B physics</SubjectName>
        </Subject>
      </Subject>
    </Subject>
  </Subjects>
</Topic>
<Study StudyID="urn:uuid:464220f8-d8f8-1028-b3a1-000E35A1F66C">
  <StudyName>Martin Sevior, Belle investigations</StudyName>
  <StudyInstitution>
    <Name institutionID="KEK_Belle" institutiontype="research">The Belle Collaboration</Name>
    <Role>Experimental Collaboration</Role>
  </StudyInstitution>
  <StudyPerson>
    <Name><Surname>Sevior</Surname><Forename>Martin</Forename></Name>
    <ContactDetails>
      <Address>
      <AddressLine1>School of Physics</AddressLine1>
      <AddressLine2>The University of Melbourne</AddressLine2>
      <Town>Melbourne</Town>
      <Region>VIC</Region>
      <Postcode>3010</Postcode>
      <Country>Australia</Country>
      </Address>
      <Switchboard>(+61 3) 8344 4000</Switchboard>
      <Email>msevior@physics.unimelb.edu.au</Email>
      <WebPage>http://www.ph.unimelb.edu.au/~msevior/</WebPage>
    </ContactDetails>
    <RoleInStudy>Principal Investigator</RoleInStudy>
    <RoleInInstitution>Senior Lecturer</RoleInInstitution>
  </StudyPerson>
  <StudyInformation>
    <Funding>Australian Research Council</Funding>
    <TimePeriod>
      <StartDate><Date>2006-06-01</Date></StartDate>
    </TimePeriod>
    <Purpose>
      <Abstract>Martin Sevior's personal studies on the Belle experiment
      </Abstract>
    </Purpose>
    <StudyStatus>Internal</StudyStatus>
  </StudyInformation>
  <Investigation InvestigationID="urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C" />
  <Investigation InvestigationID="urn:uuid:f878e19a-de51-1028-bad9-000E35A1F66C" />
  <Investigation InvestigationID="urn:uuid:f8fc18bc-de51-1028-bad9-000E35A1F66C" />
</Study>
<AccessConditions acsystem="On Application">
  The user must contact Martin Sevior to obtain access to this data.
</AccessConditions>
</MetadataRecord>


Investigation parent component used throughout this example...
<Investigation InvestigationID="urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C">
  <Name>B -> D* D* Ks paper 2006</Name>
  <InvestigationType>Measurement</InvestigationType>
  <Abstract>
    Measurement of the B -> D* D* Ks branching fraction towards production
    of paper in 2006.
  </Abstract>
  <DataHolding InvestigationID="urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C">
    <!-- In this context DataHolding is not really used to hold data,
         only a place holder for collections.
         DataName can be anything so suggest using InvestigationID.
         DataHoldingLocator is also not used, so might as well point
         to absolute path of home directory, or could be pathtype="other".
    -->
    <DataDescription>
      <DataName>urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C</DataName>
    </DataDescription>
    <DataHoldingLocator>
      <DataName>urn:uuid:e65462c8-de51-1028-bad9-000E35A1F66C</DataName>
      <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf</Locator>
    </DataHoldingLocator>
    <DataCollection dataid="urn:uuid:4ef86ac0-de5b-1028-bad9-000E35A1F66C"/>
    <DataCollection dataid="urn:uuid:921253b4-de5b-1028-bad9-000E35A1F66C"/>
    <DataCollection dataid="urn:uuid:928f5012-de5b-1028-bad9-000E35A1F66C"/>
  </DataHolding>
</Investigation>


Top level data collection relating to a published paper...
<DataCollection dataid="urn:uuid:4ef86ac0-de5b-1028-bad9-000E35A1F66C">
  <DataDescription>
    <DataName>full paper formats</DataName>
    <TypeOfData>Collection</TypeOfData>
    <Status>Complete</Status>
    <LogicalDescription>
      <Description>All full paper formats are placed here.</Description>
    </LogicalDescription>
  </DataDescription>
  <DataCollectionLocator>
    <DataName>full paper formats</DataName>
    <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf/paper06/</Locator>
  </DataCollectionLocator>
  <AtomicDataObject dataid="urn:uuid:2eb4eff6-de5c-1028-bad9-000E35A1F66C">
    <DataDescription>
      <DataName>mypaper.ps</DataName>
      <TypeOfData>File</TypeOfData>
      <Status>Complete</Status>
      <LogicalDescription>
        <Description>Postscript version reviewed by collaboration.</Description>
      </LogicalDescription>
    </DataDescription>
    <ADOLocator xsi:type="FileADOL">
      <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf/paper06/mypaper.ps</Locator>
      <FileType>application/postscript</FileType>
    </ADOLocator>
  </AtomicDataObject>
  <AtomicDataObject dataid="urn:uuid:c0e745c6-de5d-1028-bad9-000E35A1F66C">
    <DataDescription>
      <DataName>mypaper.pdf</DataName>
      <TypeOfData>File</TypeOfData>
      <Status>Complete</Status>
      <LogicalDescription>
        <Description>PDF submitted to journal.</Description>
      </LogicalDescription>
    </DataDescription>
    <ADOLocator xsi:type="FileADOL">
      <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf/paper06/mypaper.pdf</Locator>
      <FileType>application/pdf</FileType>
    </ADOLocator>
  </AtomicDataObject>
</DataCollection>


Top level data collection of histograms relating to the above Investigation, one histogram referencing the production skim from which it was derived...
<DataCollection dataid="urn:uuid:921253b4-de5b-1028-bad9-000E35A1F66C">
  <DataDescription>
    <DataName>associated histograms</DataName>
    <TypeOfData>Collection</TypeOfData>
    <Status>Complete</Status>
    <LogicalDescription>
      <Description>Histograms are placed here.</Description>
    </LogicalDescription>
  </DataDescription>
  <DataCollectionLocator>
    <DataName>associated histograms</DataName>
    <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf/paper06/</Locator>
  </DataCollectionLocator>
  <AtomicDataObject dataid="urn:uuid:fe376c28-de60-1028-bad9-000E35A1F66C">
    <DataDescription>
      <DataName>analysis.hbook</DataName>
      <TypeOfData>File</TypeOfData>
      <Status>Complete</Status>
      <LogicalDescription>
        <Description>Histograms including background and signal.</Description>
        <Parameter>
          <ParamName>datatype</ParamName>
          <Derivation>environment</Derivation>
          <ParamValue>experimental</ParamValue>
        </Parameter>
        <Parameter>
          <ParamName>experiment</ParamName>
          <Derivation>condition</Derivation>
          <Range>
            <Limit bound="lower">e000015</Limit>
            <Limit bound="upper">e000025</Limit>
          </Range>
        </Parameter>
        <Parameter>
          <ParamName>cut: M_bc</ParamName>
          <Derivation>measured</Derivation>
          <Units><UnitName>GeV/c^2</UnitName></Units>
          <Range>
            <Limit bound="lower">5.2</Limit>
          </Range>
        </Parameter>
        <Parameter>
          <ParamName>histogram: M_bc good</ParamName>
          <Derivation>measured</Derivation>
          <Parameter>
            <ParamName>M_bc</ParamName>
            <Derivation>measured</Derivation>
            <Units><UnitName>GeV/c^2</UnitName></Units>
            <Range>
              <Limit bound="lower">5.2</Limit>
              <Limit bound="upper">10.0</Limit>
            </Range>
          </Parameter>
        </Parameter>
        <Parameter>
          <ParamName>histogram: delta E</ParamName>
          <Derivation>measured</Derivation>
          <Parameter>
            <ParamName>delta E</ParamName>
            <Derivation>measured</Derivation>
            <Units><UnitName>GeV</UnitName></Units>
            <Range>
              <Limit bound="lower">-0.06</Limit>
              <Limit bound="upper">0.06</Limit>
            </Range>
          </Parameter>
        </Parameter>
      </LogicalDescription>
      <Software>
         <Production>
            <ProgramName>BASF</ProgramName>
            <Version>b20030807_1600</Version>
            <OperatingSystem>Solaris</OperatingSystem>
            <OperatingSystemVersion>10</OperatingSystemVersion>
            <Architecture>SPARC</Architecture>
            <URI>?</URI>
         </Production>
         <Analysis>
            <ProgramName>BASF</ProgramName>
            <Version>b20030807_1600</Version>
            <OperatingSystem>GNU/Linux</OperatingSystem>
            <OperatingSystemVersion>Debian testing/unstable</OperatingSystemVersion>
            <Architecture>i686</Architecture>
            <URI>?</URI>
         </Analysis>
         <Analysis>
            <ProgramName>Martin's DDKs analysis</ProgramName>
            <Version>v1r8</Version>
            <OperatingSystem>GNU/Linux</OperatingSystem>
            <OperatingSystemVersion>Debian testing/unstable</OperatingSystemVersion>
            <Architecture>i686</Architecture>
            <URI>cvs:pserver:roberts.ph.unimelb.edu.au:/usr/local/cvsroot:analysis?tag=v1r8</URI>
         </Analysis>
      </Software>
    </DataDescription>
    <ADOLocator xsi:type="FileADOL">
      <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf/data/ddks/analysis.hbook</Locator>
      <FileType>hbook</FileType>
    </ADOLocator>
  </AtomicDataObject>
  <AtomicDataObject dataid="urn:uuid:fe376c28-de60-1028-bad9-000E35A1F66C">
    <DataDescription>
      <DataName>mcanalysis.hbook</DataName>
      <TypeOfData>File</TypeOfData>
      <Status>Complete</Status>
      <LogicalDescription>
        <Description>Histograms including background and signal.</Description>
        <Parameter>
          <ParamName>datatype</ParamName>
          <Derivation>environment</Derivation>
          <ParamValue>simulation</ParamValue>
        </Parameter>
        <Parameter>
          <ParamName>D*+ mass</ParamName>
          <Derivation>environment</Derivation>
          <Units><UnitName>GeV</UnitName></Units>
          <ParamValue>2.010</ParamValue>
        </Parameter>
        <Parameter>
          <ParamName>experiment</ParamName>
          <Derivation>condition</Derivation>
          <Range>
            <Limit bound="lower">e000015</Limit>
            <Limit bound="upper">e000025</Limit>
          </Range>
        </Parameter>
        <Parameter>
          <ParamName>cut: M_bc</ParamName>
          <Derivation>measured</Derivation>
          <Units><UnitName>GeV/c^2</UnitName></Units>
          <Range>
            <Limit bound="lower">5.2</Limit>
          </Range>
        </Parameter>
        <Parameter>
          <ParamName>ntuple: all</ParamName>
          <Derivation>measured</Derivation>
          <Parameter><ParamName>delta_E</ParamName><Derivation>measured</Derivation></Parameter>
          <Parameter><ParamName>M_bc</ParamName><Derivation>measured</Derivation></Parameter>
          <Parameter><ParamName>good</ParamName><Derivation>measured</Derivation></Parameter>
          <Parameter><ParamName>vfit</ParamName><Derivation>measured</Derivation></Parameter>
        </Parameter>
      </LogicalDescription>
      <Software>
         <Production>
            <ProgramName>BASF</ProgramName>
            <Version>b20030807_1600</Version>
            <OperatingSystem>Solaris</OperatingSystem>
            <OperatingSystemVersion>10</OperatingSystemVersion>
            <Architecture>SPARC</Architecture>
            <URI>?</URI>
         </Production>
         <Analysis>
            <ProgramName>BASF</ProgramName>
            <Version>b20030807_1600</Version>
            <OperatingSystem>GNU/Linux</OperatingSystem>
            <OperatingSystemVersion>Debian testing/unstable</OperatingSystemVersion>
            <Architecture>i686</Architecture>
            <URI>?</URI>
         </Analysis>
         <Analysis>
            <ProgramName>Martin's DDKs analysis</ProgramName>
            <Version>v1r8</Version>
            <OperatingSystem>GNU/Linux</OperatingSystem>
            <OperatingSystemVersion>Debian testing/unstable</OperatingSystemVersion>
            <Architecture>i686</Architecture>
            <URI>cvs:pserver:roberts.ph.unimelb.edu.au:/usr/local/cvsroot:analysis?tag=v1r8</URI>
         </Analysis>
      </Software>
    </DataDescription>
    <ADOLocator xsi:type="FileADOL">
      <Locator pathtype="absolute">srb:/anusf/home/msevior.anusf/data/ddks/mcanalysis.hbook</Locator>
      <FileType>hbook</FileType>
    </ADOLocator>
    <RelatedReference>
      <Type>Derived</Type>
      <Direction>From</Direction>
      <ReferredToItem>Collection</ReferredToItem>
      <Method>BASF Analysis</Method>
      <ReferenceLocation>
        <Archive>KEK_Belle</Archive>
        <StudyName>Belle Production</StudyName>
        <InvestigationName>Skim XYZ Charm MC</InvestigationName>
        <DataCollection>Skim_XYZ_CharmMC_e000015_data</DataCollection>
        <DataCollectionId>urn:uuid:0c25d488-e3d8-1028-994a-000E35A1F66C</DataCollectionId>
      </ReferenceLocation>
    </RelatedReference>
  </AtomicDataObject>
</DataCollection>
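
The Parameter elements in the record above take several shapes: a plain ParamValue, a Range with one or both limits, or nested Parameters describing a histogram. As a rough sketch of how a client might read them, the following Python flattens each Parameter into a dict. The function name read_parameter and the dict keys are illustrative assumptions and are not defined by CSMDM; the XML is a trimmed copy of the record.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the LogicalDescription above (three Parameter shapes).
PARAM_XML = """
<LogicalDescription>
  <Parameter>
    <ParamName>datatype</ParamName>
    <Derivation>environment</Derivation>
    <ParamValue>experimental</ParamValue>
  </Parameter>
  <Parameter>
    <ParamName>cut: M_bc</ParamName>
    <Derivation>measured</Derivation>
    <Range><Limit bound="lower">5.2</Limit></Range>
  </Parameter>
  <Parameter>
    <ParamName>histogram: delta E</ParamName>
    <Derivation>measured</Derivation>
    <Parameter>
      <ParamName>delta E</ParamName>
      <Derivation>measured</Derivation>
      <Range>
        <Limit bound="lower">-0.06</Limit>
        <Limit bound="upper">0.06</Limit>
      </Range>
    </Parameter>
  </Parameter>
</LogicalDescription>
"""


def read_parameter(elem):
    """Convert one Parameter element (and any nested Parameters) to a dict."""
    limits = {lim.get("bound"): lim.text for lim in elem.iterfind("Range/Limit")}
    return {
        "name": elem.findtext("ParamName"),
        "derivation": elem.findtext("Derivation"),
        "value": elem.findtext("ParamValue"),  # None when no ParamValue is given
        "lower": limits.get("lower"),          # None when the bound is open
        "upper": limits.get("upper"),
        "children": [read_parameter(p) for p in elem.findall("Parameter")],
    }


desc = ET.fromstring(PARAM_XML)
params = [read_parameter(p) for p in desc.findall("Parameter")]
for p in params:
    print(p["name"], p["value"], p["lower"], p["upper"], len(p["children"]))
```

Values are kept as strings because the record mixes numeric limits (5.2) with identifiers (e000015); any conversion would be a per-parameter decision.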


This is the production skim collection referenced by the histogram ADO above, shown with a full description. It contains three MDST files, each with a minimal description...
<DataCollection dataid="urn:uuid:0c25d488-e3d8-1028-994a-000E35A1F66C">
  <DataDescription>
    <DataName>Skim_XYZ_CharmMC_e000015_data</DataName>
    <TypeOfData>Collection</TypeOfData>
    <Status>Complete</Status>
    <LogicalDescription>
      <Description>Production skim of Charm MC for e000015</Description>
      <Parameter>
        <ParamName>datatype</ParamName>
        <Derivation>environment</Derivation>
        <ParamValue>simulation</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>experiment</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>e000015</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>skim_type</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>XYZ</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>event_type</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>charm</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>add_module.main</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>cdctable,genunpak,bpsmear,gsim,acc_mc,calsvd,addbg,...</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>bpsmear.ip_nominal_x</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>0.31741370E-01</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>bpsmear.sigma_ip_x</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>0.70936009E-02</ParamValue>
      </Parameter>
      <Parameter>
        <ParamName>decay.dec</ParamName>
        <Derivation>condition</Derivation>
        <ParamValue>
#
# DECAY.DEC,v 1.3 2002/04/17 06:25:06 itoh Exp
#
#   Revision 1.3  2002/04/17 06:25:06  itoh
#   Updated by Kakuno-san
#
#   Revision 1.2  2002/04/17 04:04:00  itoh
#   Updated by Kakuno-san. Decay tables are now compatible with current QQ tables.
#
#
#Define the B0B0bar mass difference
Define dm 0.472e12
Define dgamma 0
Define qoverp 1
Define phaseqoverp 0
# define the values of the CKM angles  (alpha=70, beta=40)
Define alpha 1.365
Define beta  0.39
Define gamma 1.387
Define twoBetaPlusGamma  2.167
Define betaPlusHalfGamma 1.0835
...
# use new VSS_BMIX mixing decay model (DK,28-Oct-1999)
Decay Upsilon(4S)
0.50000 B+  B-                          VSS;
0.49913 B0  anti-B0                     VSS_BMIX dm;
0.00000         Upsilon(2S)  pi+  pi-   PHSP;
0.00000         Upsilon(2S)  pi0  pi0   PHSP;
# V-> gamma S    Partial wave (L,S)=(0,0)
0.00007         gamma chi_b0(3P)  HELAMP 1. 0. 1. 0.;
...
        </ParamValue>
      </Parameter>
    </LogicalDescription>
    <Software>
       <Production>
          <!-- this is the original generation? -->
          <ProgramName>BASF</ProgramName>
          <Version>b20030807_1600</Version>
          <OperatingSystem>Solaris</OperatingSystem>
          <OperatingSystemVersion>10</OperatingSystemVersion>
          <Architecture>SPARC</Architecture>
          <URI>?</URI>
       </Production>
       <Production>
          <!-- this is the skim, not sure how to specify this -->
          <ProgramName>BASF</ProgramName>
          <Version>b20030807_1600</Version>
          <OperatingSystem>GNU/Linux</OperatingSystem>
          <OperatingSystemVersion>Debian testing/unstable</OperatingSystemVersion>
          <Architecture>i686</Architecture>
          <URI>?</URI>
       </Production>
    </Software>
  </DataDescription>
  <DataCollectionLocator>
    <DataName>Skim_XYZ_CharmMC_e000015_data</DataName>
    <Locator pathtype="absolute">srb:/kekb/belle/production/skim_xyz/charm/e000015/data/*.mdst*</Locator>
  </DataCollectionLocator>
  <AtomicDataObject dataid="urn:uuid:e6831d54-e719-1028-b34d-000E35A1F66C">
    <DataDescription><DataName>skimXYZ-charm-00-all-e000015r001598-b20030807_1600.mdst</DataName></DataDescription>
    <ADOLocator xsi:type="FileADOL"><Locator pathtype="absolute">srb:guid:kekb:148862</Locator>
      <FileType>mdst</FileType>
    </ADOLocator>
    <RelatedReference>
      <Type>Derived</Type>
      <Direction>From</Direction>
      <ReferredToItem>File</ReferredToItem>
      <Method>skim</Method>
      <ReferenceLocation>
        <Archive>KEK_Belle</Archive>
        <StudyName>unknown</StudyName>
        <ADOName>evtgen-charm-00-all-e000015r001598-b20030807_1600.mdst</ADOName>
      </ReferenceLocation>
    </RelatedReference>
  </AtomicDataObject>
  <AtomicDataObject dataid="urn:uuid:e75ee9ba-e719-1028-b34d-000E35A1F66C">
    <DataDescription><DataName>skimXYZ-charm-00-all-e000015r001599-b20030807_1600.mdst</DataName></DataDescription>
    <ADOLocator xsi:type="FileADOL"><Locator pathtype="absolute">srb:guid:kekb:148863</Locator>
      <FileType>mdst</FileType>
    </ADOLocator>
    <RelatedReference>
      <Type>Derived</Type>
      <Direction>From</Direction>
      <ReferredToItem>File</ReferredToItem>
      <Method>skim</Method>
      <ReferenceLocation>
        <Archive>KEK_Belle</Archive>
        <StudyName>unknown</StudyName>
        <ADOName>evtgen-charm-00-all-e000015r001599-b20030807_1600.mdst</ADOName>
      </ReferenceLocation>
    </RelatedReference>
  </AtomicDataObject>
  <AtomicDataObject dataid="urn:uuid:e7b58c72-e719-1028-b34d-000E35A1F66C">
    <DataDescription><DataName>skimXYZ-charm-00-all-e000015r001600-b20030807_1600.mdst</DataName></DataDescription>
    <ADOLocator xsi:type="FileADOL"><Locator pathtype="absolute">srb:guid:kekb:148864</Locator>
      <FileType>mdst</FileType>
    </ADOLocator>
    <RelatedReference>
      <Type>Derived</Type>
      <Direction>From</Direction>
      <ReferredToItem>File</ReferredToItem>
      <Method>skim</Method>
      <ReferenceLocation>
        <Archive>KEK_Belle</Archive>
        <StudyName>unknown</StudyName>
        <ADOName>evtgen-charm-00-all-e000015r001600-b20030807_1600.mdst</ADOName>
      </ReferenceLocation>
    </RelatedReference>
  </AtomicDataObject>
</DataCollection>
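
The link between the two records is the DataCollectionId in the histogram ADO's RelatedReference, which matches the dataid of this skim collection. A minimal Python sketch of that lookup follows. The Records wrapper element and the function name find_referenced_collections are assumptions made for illustration, and the XML is trimmed to the elements the lookup needs.

```python
import xml.etree.ElementTree as ET

# Trimmed records; <Records> is a hypothetical wrapper, not a CSMDM element.
RECORDS = """
<Records>
  <DataCollection dataid="urn:uuid:0c25d488-e3d8-1028-994a-000E35A1F66C">
    <DataDescription><DataName>Skim_XYZ_CharmMC_e000015_data</DataName></DataDescription>
  </DataCollection>
  <AtomicDataObject dataid="urn:uuid:fe376c28-de60-1028-bad9-000E35A1F66C">
    <DataDescription><DataName>mcanalysis.hbook</DataName></DataDescription>
    <RelatedReference>
      <Type>Derived</Type>
      <Direction>From</Direction>
      <ReferenceLocation>
        <DataCollectionId>urn:uuid:0c25d488-e3d8-1028-994a-000E35A1F66C</DataCollectionId>
      </ReferenceLocation>
    </RelatedReference>
  </AtomicDataObject>
</Records>
"""


def find_referenced_collections(root, ado):
    """Return the DataName of each DataCollection the ADO refers to by ID."""
    by_id = {dc.get("dataid"): dc for dc in root.iter("DataCollection")}
    names = []
    for ref in ado.iterfind("RelatedReference/ReferenceLocation/DataCollectionId"):
        target = by_id.get((ref.text or "").strip())
        if target is not None:
            names.append(target.findtext("DataDescription/DataName"))
    return names


root = ET.fromstring(RECORDS)
ado = root.find("AtomicDataObject")
print(find_referenced_collections(root, ado))
```

A reference whose ID is not found is silently skipped here; a real service would probably need to report it, since the referenced collection may live in another archive.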


The above skim collection may be part of a larger virtual collection like this...
<DataCollection dataid="urn:uuid:3aaede42-e3fd-1028-994a-000E35A1F66C">
  <DataDescription>
    <DataName>Skim_XYZ_CharmMC_e000015</DataName>
    <TypeOfData>Collection</TypeOfData>
    <Status>Complete</Status>
    <LogicalDescription>
      <Description>Production skim of Charm MC</Description>
    </LogicalDescription>
  </DataDescription>
  <DataCollection dataid="urn:uuid:3aaede42-e3fd-1028-994a-000E35A1F66C"/>
  <DataCollection dataid="urn:uuid:3aaede42-e3fd-1028-994a-000E35A1F66C"/>
  <DataCollection dataid="urn:uuid:3aaede42-e3fd-1028-994a-000E35A1F66C"/>
</DataCollection>


For transport or exchange the record(s) would be encoded in a CLRCMetadata document.
<?xml version="1.0" encoding="utf-8"?>
<CLRCMetadata
   xmlns="http://www.escience.clrc.ac.uk/schemas/scientific"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.escience.clrc.ac.uk/schemas/scientific http://www.escience.clrc.ac.uk/schemas/scientific/clrcmetadata.xsd">
...
</CLRCMetadata>

Back-end Components

ALERT! In progress!

The backend will be divided into two main systems: Server and Database. The server communicates with the "database" via a plugin. The database plugin interface will be simple enough to support both XML and SQL databases (even just directory and file structures). The "database" is broken into XML "domains".

  • Pluggable database connection interface
    • +getDocument(id,domain)
    • +createDocument(id,domain,document)
    • +updateDocument(id,domain,document)
    • +deleteDocument(id,domain)
    • +getDocuments(xpath,domain)
    • domain "access" is used for authentication documents
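
To make the plugin interface concrete, here is a minimal Python sketch of one possible plugin that keeps documents in memory. The class name InMemoryDatabasePlugin, the use of "AtomicDataObject" as a domain name, and the error behaviour are all assumptions; only the five method names and their arguments come from the list above. Note that Python's ElementTree supports only a subset of XPath, so getDocuments here accepts simple paths rather than full XPath expressions.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


class InMemoryDatabasePlugin:
    """Hypothetical plugin: XML documents held in a dict keyed by (domain, id)."""

    def __init__(self) -> None:
        self._docs: Dict[Tuple[str, str], str] = {}

    def getDocument(self, id: str, domain: str) -> str:
        return self._docs[(domain, id)]

    def createDocument(self, id: str, domain: str, document: str) -> None:
        if (domain, id) in self._docs:
            raise KeyError(f"{domain}/{id} already exists")
        ET.fromstring(document)  # raises ParseError if not well-formed
        self._docs[(domain, id)] = document

    def updateDocument(self, id: str, domain: str, document: str) -> None:
        if (domain, id) not in self._docs:
            raise KeyError(f"{domain}/{id} does not exist")
        ET.fromstring(document)
        self._docs[(domain, id)] = document

    def deleteDocument(self, id: str, domain: str) -> None:
        del self._docs[(domain, id)]

    def getDocuments(self, xpath: str, domain: str) -> List[str]:
        """Return every document in the domain in which the path matches."""
        return [
            text
            for (dom, _), text in self._docs.items()
            if dom == domain and ET.fromstring(text).findall(xpath)
        ]


DOC_A = ("<AtomicDataObject><DataDescription><DataName>analysis.hbook"
         "</DataName></DataDescription></AtomicDataObject>")
DOC_B = DOC_A.replace("analysis.hbook", "mcanalysis.hbook")

db = InMemoryDatabasePlugin()
db.createDocument("ado-1", "AtomicDataObject", DOC_A)
db.createDocument("ado-2", "AtomicDataObject", DOC_B)
matches = db.getDocuments("DataDescription[DataName='analysis.hbook']", "AtomicDataObject")
print(len(matches))
```

An SQL or directory-based plugin would expose the same five methods, which is the point of keeping the interface this small.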

To allow both batch and web interfaces, the "server" will provide a web service (WS) interface. A persistent URL interface must also be provided.

  • +getTemplateDocument(domain)
  • +getTemplateRecord(domain)
  • +getDocument(id,domain)
  • +getRecord(id)
    • Obtain document and children from known parent domain.
  • +createRecord(id,document)
    • Create IDs if none
    • Create ParentIDs if none
    • Break down into domains
    • Look for existing IDs
    • For all new documents, call createDocument(id,domain,document)
    • Warn user for all existing with non-null content
  • +createDocument(domain,document)
  • +updateDocument(domain,document)
  • +deleteDocument(domain,document)
  • +getDocuments(xpath,domain)
  • +getParentDocuments(id,domain)
    • must also return domain somehow
    • obtain parent documents
    • don't fill out missing children
  • +getParentRecord(id,domain)
  • basically authentication, collation, and breakdown service
  • need a root domain (for Record ID)

Front-end Components

ALERT! In progress!

Batch Interface

Batch tools must provide every capability of the web interface. In particular, batch/unattended upload of DataCollection metadata is required.

Web Interface

Where possible XSL should be leveraged. The following documents may be necessary on the front end...
  • DocumentType-view.xsl - Transform CSML to HTML for view.
  • DocumentType-list.xsl - Transform CSML to HTML list view. (1 or many docs?)
  • DocumentType-edit.xsl - Transform CSML to HTML edit form.
  • DocumentType-create.xsl - Transform CSML template to HTML for new document form.
    • DocumentType-template-blank.xml - a blank new document
    • DocumentType-template-name1.xml - a named template for new document
  • DocumentType-entry.xsl - Transform HTML form output to CSML.

Form output will be converted by the front end to simplified XML for ease of transform.

<input type="text" name="Investigation_DataHolding_DataDescription_DataName" value="My value">
... becomes ...
<Investigation_DataHolding_DataDescription_DataName>My value</Investigation_DataHolding_DataDescription_DataName>
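
A minimal Python sketch of this conversion, assuming the front end receives the form as an ordered list of (name, value) pairs and that every field name is already a valid XML element name. The FormOutput wrapper element and the function name form_to_simple_xml are hypothetical.

```python
from xml.sax.saxutils import escape


def form_to_simple_xml(fields, root="FormOutput"):
    """Wrap each (name, value) form field in an element named after the field."""
    lines = [f"<{root}>"]
    for name, value in fields:
        # escape() protects &, < and > in user-entered values
        lines.append(f"  <{name}>{escape(value)}</{name}>")
    lines.append(f"</{root}>")
    return "\n".join(lines)


fields = [("Investigation_DataHolding_DataDescription_DataName", "My value")]
print(form_to_simple_xml(fields))
```

Keeping the fields as an ordered list rather than a dict preserves repeated field names, which matters when a form allows multiple sub-documents.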

Risks and Unknowns

ALERT! In progress!

  • ReferenceLocation objects appear badly specified. The appendix documents in the specification do not conform:
    • they include "Location" objects
    • they do not include both Study and StudyID
    • they do not include the mandatory Server and Service
    • however, the spec seems to indicate it could be otherwise?
  • Can the above create/edit XSL take input variables?
    • Example 1: Number of each type of sub-documents where multiple are allowed. Links could then be created for "+1 Something" to create form entries for more.
  • Can we identify the Nth form value? We need to match sub-document values together, e.g. Parameter names and values:
    <...Parameter_ParamName>Name1</...>
    <...Parameter_ParamName>Name2</...>
    <...Parameter_ParamValue>Value1</...>
    <...Parameter_ParamValue>Value2</...>
  • How do you delete sub-documents on a webform where multiples are allowed?
    • Possibly setting one of the sub-document's key fields blank. The "entry" transform can then filter out these entries.
    • This relies on matching sub-document values together!!!
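
One way the matching and deletion questions above might be handled is to pair names and values by position and drop any pair whose name is blank. The Python sketch below assumes the simplified XML keeps form fields in submission order and that every ParamName has exactly one ParamValue. The shortened element names (the elided "..." prefix is left out) and the FormOutput wrapper are hypothetical.

```python
import xml.etree.ElementTree as ET

# The second name was blanked on the form to mark that Parameter for deletion.
SIMPLE = """
<FormOutput>
  <Parameter_ParamName>Name1</Parameter_ParamName>
  <Parameter_ParamName></Parameter_ParamName>
  <Parameter_ParamName>Name3</Parameter_ParamName>
  <Parameter_ParamValue>Value1</Parameter_ParamValue>
  <Parameter_ParamValue>Value2</Parameter_ParamValue>
  <Parameter_ParamValue>Value3</Parameter_ParamValue>
</FormOutput>
"""


def pair_parameters(root):
    """Pair the Nth name with the Nth value; skip pairs whose name is blank."""
    names = [(e.text or "").strip() for e in root.findall("Parameter_ParamName")]
    values = [(e.text or "").strip() for e in root.findall("Parameter_ParamValue")]
    if len(names) != len(values):
        raise ValueError("name/value counts differ; cannot pair by position")
    return [(n, v) for n, v in zip(names, values) if n]


print(pair_parameters(ET.fromstring(SIMPLE)))
```

Positional pairing holds only if blank fields are still submitted. Browsers do send empty text inputs, but unchecked checkboxes are omitted from the submission, so this approach would not be safe for checkbox fields.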


References

Topic attachments
  • csmdm.version-2.pdf (411.1 K, 22 May 2006 - 07:09, Main.LyleWinton) - CCLRC Scientific Metadata Model: Version 2
  • metadataERD.png (4.5 K, 22 May 2006 - 07:11, Main.LyleWinton) - CSMDM Metadata domain ERD

Revision: r8 - 27 Jun 2006 - LyleWinton
Authorised by:  Geoff Taylor (G.Taylor @ physics.unimelb.edu.au)