Validation in JSON-LD

Carl Boettiger

2018-02-12

Introduction

Schema validation is a useful and important concept in the distribution of metadata in formats such as XML and JSON, in which the standard provider creates a schema (specified as an XML Schema, XSD, for XML documents, or a JSON Schema for JSON documents). Schemas allow us to go beyond the basic notion of making sure a file is simply valid XML or valid JSON, a requirement just to be read in by any parser. By detailing how the metadata must be structured, which elements must, may, and may not be included, and what data types may be used for those elements, schemas help developers consuming the data to anticipate these details and thus build applications that know how to process them. For the data creator, validation is a convenient way to catch data input errors and ensure a consistent data structure.

Because schema validation must ensure predictable behavior without knowledge of what any specific application is going to do with the data, it tends to be very strict. A simple application may not care if certain fields are missing or if integers are mistaken for characters, while to another application these differences could lead it to throw fatal errors.

The approach of JSON-LD is less prescriptive. JSON-LD uses the notion of “framing” to let each application specify how it expects its data to be structured. JSON-LD frames allow each developer consuming the data to handle many of the same issues that schema validation has previously assured. Readers should consult the official JSON-LD framing documentation for details on this approach.

library(jsonld)
library(jsonlite)
library(magrittr)
library(codemetar)

A motivating example:

Consider the following codemeta document:

codemeta <- 
'
{
  "@context": "http://purl.org/codemeta/2.0",
  "@type": "SoftwareSourceCode",
  "name": "codemetar: Generate CodeMeta Metadata for R Packages",
  "datePublished":" 2017-05-20",
  "version": 1.0,
  "author": [
    {
      "@type": "Person",
      "givenName": "Carl",
      "familyName": "Boettiger",
      "email": "cboettig@gmail.com",
      "@id": "http://orcid.org/0000-0002-1642-628X"
    }],
  "maintainer":  {"@id": "http://orcid.org/0000-0002-1642-628X"}
}
'
jsonld::jsonld_compact(codemeta, "http://purl.org/codemeta/2.0")
{
  "@context": "http://purl.org/codemeta/2.0",
  "type": "SoftwareSourceCode",
  "author": [
    {
      "id": "http://orcid.org/0000-0002-1642-628X",
      "type": "Person",
      "email": "cboettig@gmail.com",
      "familyName": "Boettiger",
      "givenName": "Carl"
    }
  ],
  "datePublished": " 2017-05-20",
  "name": "codemetar: Generate CodeMeta Metadata for R Packages",
  "version": 1,
  "maintainer": {
    "id": "http://orcid.org/0000-0002-1642-628X"
  }
} 

Framing: subsetting data

By default, frames return all the input data, while our application may only be interested in some subset. Often it is sufficient just to ignore these additional terms: in the example above it’s just as easy for our application to work with author elements whether or not we have dropped other elements such as meta$version. To restrict a frame to returning only the nodes we explicitly mention, we can use the keyword @explicit:

frame <- '{
  "@context": "http://purl.org/codemeta/2.0",
  "@explicit": true,
  "@type": "Person",
  "givenName": {},
  "familyName": {}
}'

meta <- 
  jsonld_frame(codemeta, frame) %>%
  fromJSON(simplifyVector = FALSE) %>% 
  getElement("@graph") 

meta[[1]]
$id
[1] "http://orcid.org/0000-0002-1642-628X"

$type
[1] "Person"

$familyName
[1] "Boettiger"

$givenName
[1] "Carl"

Note that this has returned only the requested fields in the graph (along with the @id and @type, which are always included if provided, since they may be required to interpret the data properly). This frame extracts the givenName and familyName of any Person node it finds, regardless of where it occurs, while omitting the rest of the data. Since the frame requests these elements at the top level, they are returned as such, with each match a separate entry in the @graph. Our example has only one person, in meta[[1]]; had we more matches, they would appear in meta[[2]], etc. Note that these entries are returned in no particular order.
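Because each match is a separate entry in the @graph, an application will usually loop over the entries rather than index them one at a time. The following is a minimal sketch of that pattern, not part of the original analysis: the string framed is hand-written to stand in for the output of jsonld_frame(), and the second person (with an example.org identifier) is made up purely to show more than one match.

```r
library(jsonlite)

# Hand-written stand-in for the output of jsonld_frame(); the second
# person is hypothetical and only here to show multiple matches.
framed <- '{
  "@context": "http://purl.org/codemeta/2.0",
  "@graph": [
    {"id": "http://orcid.org/0000-0002-1642-628X", "type": "Person",
     "familyName": "Boettiger", "givenName": "Carl"},
    {"id": "http://example.org/people/jane", "type": "Person",
     "familyName": "Doe", "givenName": "Jane"}
  ]
}'

people <- getElement(fromJSON(framed, simplifyVector = FALSE), "@graph")

# Build one full name per matched Person node
full_names <- vapply(
  people,
  function(p) paste(p$givenName, p$familyName),
  character(1)
)

# The graph entries are unordered, so sort before relying on position
full_names <- sort(full_names)
full_names
```

Sorting (or looking entries up by id) avoids depending on the order in which the framing algorithm happens to return the matches.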

Framing: expanding node references

The same underlying data can often be expressed in different ways, particularly when dealing with nested data. Framing can be of great help here to reshape the data into the structure required by the application. For instance, it would be natural to access the email of the maintainer in the same manner we did the author, but this fails for our example as maintainer is defined only by reference to an ID:

meta <- fromJSON(codemeta, simplifyVector = FALSE) 
paste("For complaints, email", meta$maintainer$email)
[1] "For complaints, email "

We can confirm that maintainer is just an ID:

meta$maintainer
$`@id`
[1] "http://orcid.org/0000-0002-1642-628X"

We can use a frame with the special directive "@embed": "@always" to say that we want the full maintainer information embedded, and not just referred to by ID alone. Then we can subset maintainer just as we do author.

frame <- '{
  "@context": "http://purl.org/codemeta/2.0",
  "@embed": "@always"
}'


meta <- 
  jsonld_frame(codemeta, frame) %>%
  fromJSON(simplifyVector = FALSE) %>% 
  getElement("@graph") %>% getElement(1)

Now we can do

paste("For complaints, email", meta$maintainer$email)
[1] "For complaints, email cboettig@gmail.com"

and see that email has been successfully returned from the matching ID under author data.
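To make clear what the embedding does for us, here is a rough sketch of resolving the same node reference by hand, without any JSON-LD tooling. The helper resolve_ref is hypothetical (it is not part of jsonld or codemetar), the document is trimmed to the two relevant properties, and the sketch assumes the referenced node lives under author; framing needs no such assumption.

```r
library(jsonlite)

# Trimmed copy of the example document: only author and maintainer
doc <- fromJSON('{
  "author": [
    {"@type": "Person", "givenName": "Carl", "familyName": "Boettiger",
     "email": "cboettig@gmail.com",
     "@id": "http://orcid.org/0000-0002-1642-628X"}
  ],
  "maintainer": {"@id": "http://orcid.org/0000-0002-1642-628X"}
}', simplifyVector = FALSE)

# Hypothetical helper: return the first node in `nodes` whose @id matches
# the @id of `ref`, or `ref` unchanged if no node matches.
resolve_ref <- function(ref, nodes) {
  for (node in nodes) {
    if (identical(node[["@id"]], ref[["@id"]])) return(node)
  }
  ref
}

maintainer <- resolve_ref(doc$maintainer, doc$author)
maintainer$email
```

The manual version has to know where to search for the full node, which is exactly the bookkeeping that "@embed": "@always" takes off the application's hands.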

Handling unexpected types

JSON-LD routines will simply refuse to compact a property to its short term if its declared type differs from what the context expects. Here is a sample data file that declares that the buildInstructions are included as text, which differs from the context file, which explicitly states that buildInstructions should be a URL:

codemeta <- 
'
{
  "@context": "https://purl.org/codemeta/2.0",
  "@type": "SoftwareSourceCode",
  "name": "codemetar: Generate CodeMeta Metadata for R Packages",
  "buildInstructions": { 
      "@value": "Just install this package using devtools::install_github", 
      "@type": "Text"
  }
}
'

When we perform a framing or compaction operation, buildInstructions is left under the prefixed name codemeta:buildInstructions instead of being compacted to the plain term buildInstructions, because its declared type does not match the context. This means that if our application asks for meta$buildInstructions:

meta <-
  jsonld_frame(codemeta, '{"@context": "https://purl.org/codemeta/2.0"}') %>%
  fromJSON(simplifyVector = FALSE) %>%
  getElement("@graph") %>% getElement(1)

## above is same as compacting:
#jsonld_compact(codemeta, "http://purl.org/codemeta/2.0") %>% 
#  fromJSON(simplifyVector = FALSE)

meta$buildInstructions
NULL

We just get NULL, rather than some unexpected type of object (e.g. a string that is not a URL). Note that the data is not lost; it is simply kept under the prefixed name rather than the compact term:

names(meta)
[1] "id"                         "type"                      
[3] "name"                       "codemeta:buildInstructions"
meta["codemeta:buildInstructions"]
$`codemeta:buildInstructions`
$`codemeta:buildInstructions`$type
[1] "Text"

$`codemeta:buildInstructions`$`@value`
[1] "Just install this package using devtools::install_github"

Note that this behavior only happens because the data declared the "@type": "Text" explicitly. JSON-LD algorithms only believe what they are told about type and only look for consistency in declared types. If you give text but declare it as a "@type": "URL", or don’t declare the type at all, JSON-LD algorithms won’t know anything is amiss and the property will be compacted as usual.
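An application that would rather inspect a mismatched value than silently receive NULL can check the prefixed name as a fallback. The sketch below is only one possible approach: get_term is a hypothetical helper, the codemeta prefix is taken from the output shown above, and meta is a hand-written stand-in for the framed result (trimmed to the relevant fields) so that the example runs without fetching the context.

```r
# Hypothetical helper: look up `term` in a compacted node, falling back
# to the prefixed form (e.g. "codemeta:buildInstructions") when the
# property was not compacted. Returns NULL when neither is present.
get_term <- function(node, term, prefix = "codemeta") {
  value <- node[[term]]
  if (is.null(value)) {
    value <- node[[paste0(prefix, ":", term)]]
  }
  value
}

# Hand-written stand-in for the framed `meta` object shown above
meta <- list(
  type = "SoftwareSourceCode",
  name = "codemetar: Generate CodeMeta Metadata for R Packages",
  "codemeta:buildInstructions" = list(
    type = "Text",
    "@value" = "Just install this package using devtools::install_github"
  )
)

instructions <- get_term(meta, "buildInstructions")
instructions$type          # the declared type, which the application can check
instructions[["@value"]]   # the literal value that was supplied
```

Whether a value of the wrong declared type should be used at all remains the application's decision; the fallback only makes the mismatch visible.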