Metadata-Based Repositories

Carl Boettiger


One of the principle advantages in creating EML is to make it easier to find the data you need, without having to standardize all your data files themselves into one giant database. Instead, the data files can be whatever you want, provided the relevant information you might search on to discover data of interest is listed in the metadata.

To do this, we will use the dataone package to upload a private copy of our data file to the central data repository.

#install.packages("dataone", repos= c("", ""))

## or
devtools::install_github(c("ropensci/datapack", "DataONEorg/rdataone"))

Imagine we have the file paths to our .csv data file and an .xml EML file providing metadata for it:

sampleData <- system.file("extdata/sample.csv", package="dataone")
sampleEML <- system.file("extdata/sample-eml.xml", package="dataone")

To upload these to a metadata repository such as the KNB, we simply create DataObjects for both files:

dataObj <- new("DataObject", format="text/csv", file=sampleData)
metadataObj <- new("DataObject", format="eml://", file=sampleEML)

Note that optionally, new("DataObject") could have been given an id argument, which could be a (namespaced) UUID from UUIDgenerate, or a DOI from a member node (see generateIdentifier()). Since no id has been given, a UUID is automatically generated for each.

accessRules <- data.frame(subject="CN=Noam Ross A45991,O=Google,C=US,DC=cilogon,DC=org", permission="write")
dataObj <- addAccessRule(dataObj, accessRules)
metadataObj <- addAccessRule(metadataObj, accessRules)

We now want to bundle these two objects (data and metadata) into a single “data package” to be uploaded. To do so, we just create a new DataPackage object and then add the data and metadata using the addData file:

dp <- new("DataPackage")
dp <- addData(dp, dataObj, metadataObj)

This both adds the files and registers that the metadata object describes the data object.

d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2")
packageId <- uploadDataPackage(d1c, dp)

Let’s see if the ID returned for our package now appears in the DataONE index:

query(CNode("STAGING"), searchTerms = list(id = packageId))

Gimme DOI!

Note that the example would fail if run here since only the Production (PROD) environment can provide DOIs (the STAGING environment is only for tests and training examples), and only then on member nodes that offer DOIs, such as the KNB (urn:node:KNB).

cn <- CNode("STAGING")
mn <- getMNode(cn, "urn:node:mnStageUCSB2")
newid <- generateIdentifier(mn, "DOI")

We also want to update the EML file itself to use the new id as the packageId:

eml <- read_eml(sampleEML)
eml@packageId <- newid
write_eml(eml, sampleEML)

We can now use the DOI in packaging the DataObject

new("DataObject", id = newid, format = "eml://", file=sampleEML)