Translating between schema using JSON-LD

Carl Boettiger

2018-02-12

library("codemetar")
library("magrittr")
library("jsonlite")
library("jsonld")
library("httr")
library("readr")

JSON-LD transforms: Expansion and Compaction

One of the central motivations of JSON-LD is making it easy to translate between different representations of what are fundamentally the same data types. Doing so uses the two core algorithms of JSON-LD: expansion and compaction, as this excellent short video by JSON-LD creator Manu Sporny describes.

Here’s how we would use JSON-LD (from R) to translate between the two examples of JSON data from different providers as shown in the video. First, the JSON from the original provider:

ex <-
'{
"@context":{
  "shouter": "http://schema.org/name",
  "txt": "http://schema.org/commentText"
},
"shouter": "Jim",
"txt": "Hello World!"
}'

Next, we need the context of the second data provider. This will let us translate the JSON format used by provider one (“Shouttr”) to the second (“BigHash”):

bighash_context <- 
'{
"@context":{
  "user": "http://schema.org/name",
  "comment": "http://schema.org/commentText"
}
}'

With this in place, we simply expand the original JSON and then compact using the new context:

jsonld_expand(ex) %>%
  jsonld_compact(context = bighash_context)
{
  "@context": {
    "user": "http://schema.org/name",
    "comment": "http://schema.org/commentText"
  },
  "comment": "Hello World!",
  "user": "Jim"
} 
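To make the intermediate step visible, the following sketch runs the two algorithms separately and inspects what each one produces. It assumes the jsonld and jsonlite packages are installed; the names expanded_keys and compacted_list are introduced here only for illustration.

```r
library("jsonld")
library("jsonlite")

# Same Shouttr document and BigHash context as above
ex <- '{
  "@context": {
    "shouter": "http://schema.org/name",
    "txt": "http://schema.org/commentText"
  },
  "shouter": "Jim",
  "txt": "Hello World!"
}'
bighash_context <- '{
  "@context": {
    "user": "http://schema.org/name",
    "comment": "http://schema.org/commentText"
  }
}'

# Step 1: expansion replaces the provider-specific keys with full IRIs
expanded <- jsonld_expand(ex)
expanded_keys <- names(
  fromJSON(as.character(expanded), simplifyVector = FALSE)[[1]]
)

# Step 2: compaction maps those IRIs onto the second provider's terms
compacted <- jsonld_compact(expanded, context = bighash_context)
compacted_list <- fromJSON(as.character(compacted), simplifyVector = FALSE)

print(sort(expanded_keys))
print(compacted_list[c("user", "comment")])
```

The expanded form no longer mentions shouter or txt at all, only the schema.org IRIs, which is what lets any context that maps onto the same IRIs be used for compaction.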

Crosswalking

The CodeMeta Crosswalk table seeks to accomplish a very similar goal. The crosswalk table provides a human-readable mapping of different software metadata providers into the codemeta context (an extension of schema.org). For instance, we’ll read in some data from GitHub:

GitHub

Here we crosswalk the JSON data returned as the repository information from the GitHub API:

repo_info <- gh::gh("/repos/:owner/:repo", owner = "ropensci", repo = "EML")

Let’s take a look at the returned JSON data:

repo_info %>% toJSON()
{"id":[10894022],"name":["EML"],"full_name":["ropensci/EML"],"owner":{"login":["ropensci"],"id":[1200269],"avatar_url":["https://avatars0.githubusercontent.com/u/1200269?v=3"],"gravatar_id":[""],"url":["https://api.github.com/users/ropensci"],"html_url":["https://github.com/ropensci"],"followers_url":["https://api.github.com/users/ropensci/followers"],"following_url":["https://api.github.com/users/ropensci/following{/other_user}"],"gists_url":["https://api.github.com/users/ropensci/gists{/gist_id}"],"starred_url":["https://api.github.com/users/ropensci/starred{/owner}{/repo}"],"subscriptions_url":["https://api.github.com/users/ropensci/subscriptions"],"organizations_url":["https://api.github.com/users/ropensci/orgs"],"repos_url":["https://api.github.com/users/ropensci/repos"],"events_url":["https://api.github.com/users/ropensci/events{/privacy}"],"received_events_url":["https://api.github.com/users/ropensci/received_events"],"type":["Organization"],"site_admin":[false]},"private":[false],"html_url":["https://github.com/ropensci/EML"],"description":[" Ecological Metadata Language interface for R: synthesis and integration of heterogenous data"],"fork":[false],"url":["https://api.github.com/repos/ropensci/EML"],"forks_url":["https://api.github.com/repos/ropensci/EML/forks"],"keys_url":["https://api.github.com/repos/ropensci/EML/keys{/key_id}"],"collaborators_url":["https://api.github.com/repos/ropensci/EML/collaborators{/collaborator}"],"teams_url":["https://api.github.com/repos/ropensci/EML/teams"],"hooks_url":["https://api.github.com/repos/ropensci/EML/hooks"],"issue_events_url":["https://api.github.com/repos/ropensci/EML/issues/events{/number}"],"events_url":["https://api.github.com/repos/ropensci/EML/events"],"assignees_url":["https://api.github.com/repos/ropensci/EML/assignees{/user}"],"branches_url":["https://api.github.com/repos/ropensci/EML/branches{/branch}"],"tags_url":["https://api.github.com/repos/ropensci/EML/tags"],"blobs_url":["https://api.github.com/repos/ropensci/EML/git/blobs{/sha}"],"git_tags_url":["https://api.github.com/repos/ropensci/EML/git/tags{/sha}"],"git_refs_url":["https://api.github.com/repos/ropensci/EML/git/refs{/sha}"],"trees_url":["https://api.github.com/repos/ropensci/EML/git/trees{/sha}"],"statuses_url":["https://api.github.com/repos/ropensci/EML/statuses/{sha}"],"languages_url":["https://api.github.com/repos/ropensci/EML/languages"],"stargazers_url":["https://api.github.com/repos/ropensci/EML/stargazers"],"contributors_url":["https://api.github.com/repos/ropensci/EML/contributors"],"subscribers_url":["https://api.github.com/repos/ropensci/EML/subscribers"],"subscription_url":["https://api.github.com/repos/ropensci/EML/subscription"],"commits_url":["https://api.github.com/repos/ropensci/EML/commits{/sha}"],"git_commits_url":["https://api.github.com/repos/ropensci/EML/git/commits{/sha}"],"comments_url":["https://api.github.com/repos/ropensci/EML/comments{/number}"],"issue_comment_url":["https://api.github.com/repos/ropensci/EML/issues/comments{/number}"],"contents_url":["https://api.github.com/repos/ropensci/EML/contents/{+path}"],"compare_url":["https://api.github.com/repos/ropensci/EML/compare/{base}...{head}"],"merges_url":["https://api.github.com/repos/ropensci/EML/merges"],"archive_url":["https://api.github.com/repos/ropensci/EML/{archive_format}{/ref}"],"downloads_url":["https://api.github.com/repos/ropensci/EML/downloads"],"issues_url":["https://api.github.com/repos/ropensci/EML/issues{/number}"],"pulls_url":["https://api.github.com/repos/ropensci/EML/pulls{/number}"],"milestones_url":["https://api.github.com/repos/ropensci/EML/milestones{/number}"],"notifications_url":["https://api.github.com/repos/ropensci/EML/notifications{?since,all,participating}"],"labels_url":["https://api.github.com/repos/ropensci/EML/labels{/name}"],"releases_url":["https://api.github.com/repos/ropensci/EML/releases{/id}"],"deployments_url":["https://api.github.com/repos/ropensci/EML/deployments"],"created_at":["2013-06-23T23:20:03Z"],"updated_at":["2017-05-11T21:24:40Z"],"pushed_at":["2017-07-05T18:52:34Z"],"git_url":["git://github.com/ropensci/EML.git"],"ssh_url":["git@github.com:ropensci/EML.git"],"clone_url":["https://github.com/ropensci/EML.git"],"svn_url":["https://github.com/ropensci/EML"],"homepage":["https://ropensci.github.io/EML"],"size":[5094],"stargazers_count":[48],"watchers_count":[48],"language":["HTML"],"has_issues":[true],"has_projects":[true],"has_downloads":[true],"has_wiki":[true],"has_pages":[true],"forks_count":[17],"mirror_url":{},"open_issues_count":[35],"forks":[17],"open_issues":[35],"watchers":[48],"default_branch":["master"],"organization":{"login":["ropensci"],"id":[1200269],"avatar_url":["https://avatars0.githubusercontent.com/u/1200269?v=3"],"gravatar_id":[""],"url":["https://api.github.com/users/ropensci"],"html_url":["https://github.com/ropensci"],"followers_url":["https://api.github.com/users/ropensci/followers"],"following_url":["https://api.github.com/users/ropensci/following{/other_user}"],"gists_url":["https://api.github.com/users/ropensci/gists{/gist_id}"],"starred_url":["https://api.github.com/users/ropensci/starred{/owner}{/repo}"],"subscriptions_url":["https://api.github.com/users/ropensci/subscriptions"],"organizations_url":["https://api.github.com/users/ropensci/orgs"],"repos_url":["https://api.github.com/users/ropensci/repos"],"events_url":["https://api.github.com/users/ropensci/events{/privacy}"],"received_events_url":["https://api.github.com/users/ropensci/received_events"],"type":["Organization"],"site_admin":[false]},"network_count":[17],"subscribers_count":[18]}

We can crosswalk this information into codemeta simply by passing the name of the corresponding crosswalk table column ("GitHub") to the crosswalk function. This performs the same kind of expansion as before, treating the GitHub column as the context of the metadata, followed by compaction into the codemeta context. Note that terms not included in the codemeta context will be dropped:

github_meta <- crosswalk(repo_info, "GitHub")
github_meta
{
  "@context": "http://purl.org/codemeta/2.0",
  "codeRepository": "https://github.com/ropensci/EML",
  "dateCreated": "2013-06-23T23:20:03Z",
  "dateModified": "2017-05-11T21:24:40Z",
  "description": " Ecological Metadata Language interface for R: synthesis and integration of heterogenous data",
  "downloadUrl": "https://api.github.com/repos/ropensci/EML/{archive_format}{/ref}",
  "identifier": "10894022",
  "name": "ropensci/EML",
  "programmingLanguage": "https://api.github.com/repos/ropensci/EML/languages",
  "issueTracker": "https://api.github.com/repos/ropensci/EML/issues{/number}"
} 

We can verify that the result is a valid codemeta document:

codemeta_validate(github_meta)
[1] TRUE
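The expand-then-compact idea behind this crosswalk can be sketched by hand on a small scale. The example below is only an illustration under stated assumptions, not what crosswalk() does internally: github_context and target_context are hand-written stand-ins for three rows of the GitHub column and for the codemeta context, with mappings inferred from the output shown above, and the input is a trimmed-down copy of the repository data.

```r
library("jsonld")
library("jsonlite")

# A few fields from the GitHub response, as plain JSON without a context
github_like <- '{
  "html_url": "https://github.com/ropensci/EML",
  "created_at": "2013-06-23T23:20:03Z",
  "full_name": "ropensci/EML",
  "stargazers_count": 48
}'

# Assumption: hand-written stand-in for a few rows of the "GitHub" column
github_context <- '{
  "@context": {
    "html_url": "http://schema.org/codeRepository",
    "created_at": "http://schema.org/dateCreated",
    "full_name": "http://schema.org/name"
  }
}'

# Assumption: small stand-in for the codemeta context (avoids a network call)
target_context <- '{
  "@context": {
    "codeRepository": "http://schema.org/codeRepository",
    "dateCreated": "http://schema.org/dateCreated",
    "name": "http://schema.org/name"
  }
}'

# Attach the source context to the plain JSON so that it can be expanded
doc <- fromJSON(github_like, simplifyVector = FALSE)
doc[["@context"]] <- fromJSON(github_context, simplifyVector = FALSE)[["@context"]]
doc_json <- as.character(toJSON(doc, auto_unbox = TRUE))

# Expand (unmapped keys such as stargazers_count are dropped), then compact
expanded <- jsonld_expand(doc_json)
compacted <- jsonld_compact(expanded, context = target_context)
result <- fromJSON(as.character(compacted), simplifyVector = FALSE)

print(names(result))
```

Because stargazers_count has no entry in the source context, it has no IRI to expand to and does not survive expansion, which mirrors the dropped terms noted above.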

Transforming into other column schema

The transform above took plain JSON data into the codemeta standard serialization. Similarly, we can crosswalk into any of the other columns in the crosswalk table. To do so, the crosswalk function first expands any recognized properties into the codemeta JSON-LD context, just as above. Unrecognized properties are dropped, since there is no consensus context into which we can expand them. Second, the expanded terms are compacted into the new context (Zenodo, in this case). This time, any terms that are not part of the target context are kept but not compacted, since they still have meaningful identifiers (full URIs, typically URLs) associated with them:

crosswalk(repo_info, "GitHub", "Zenodo") %>%
drop_context()
{
  "relatedLink": "https://github.com/ropensci/EML",
  "schema:dateCreated": {
    "@type": "schema:Date",
    "@value": "2013-06-23T23:20:03Z"
  },
  "schema:dateModified": {
    "@type": "schema:Date",
    "@value": "2017-05-11T21:24:40Z"
  },
  "description/notes": " Ecological Metadata Language interface for R: synthesis and integration of heterogenous data",
  "schema:downloadUrl": {
    "@id": "https://api.github.com/repos/ropensci/EML/{archive_format}{/ref}"
  },
  "id": "10894022",
  "name": "ropensci/EML",
  "schema:programmingLanguage": "https://api.github.com/repos/ropensci/EML/languages",
  "codemeta:issueTracker": {
    "@id": "https://api.github.com/repos/ropensci/EML/issues{/number}"
  }
} 

Thus terms that still have an uncompacted prefix like schema: or codemeta: reflect properties that we could successfully extract from the input data but that have no corresponding properties in the Zenodo context. This is the standard behavior of the compaction algorithm: unrecognized fields are not dropped, but they are not compacted either, making them accessible only if referenced explicitly.
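This compaction behavior can be reproduced in isolation. The sketch below assumes the jsonld and jsonlite packages; partial_context is a hypothetical target context that declares the schema prefix and a term for schema:name, but deliberately has no term for commentText.

```r
library("jsonld")
library("jsonlite")

doc <- '{
  "@context": {
    "name": "http://schema.org/name",
    "commentText": "http://schema.org/commentText"
  },
  "name": "Jim",
  "commentText": "Hello World!"
}'

# Hypothetical target context: knows schema:name as "user", nothing else
partial_context <- '{
  "@context": {
    "schema": "http://schema.org/",
    "user": "schema:name"
  }
}'

compacted <- jsonld_compact(jsonld_expand(doc), context = partial_context)
result <- fromJSON(as.character(compacted), simplifyVector = FALSE)

# The unmatched property is kept under its prefixed name, not dropped
print(names(result))
```

The value is still present in the result, but it has to be looked up as result[["schema:commentText"]] rather than under a short term.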

NodeJS example

NodeJS uses a package.json format that is very similar to a simple codemeta.json file, though it is not Linked Data, as it does not declare a context. Here we crosswalk an example package.json file into the codemeta standard:

package.json <- read_json(
"https://raw.githubusercontent.com/heroku/node-js-sample/master/package.json")
package.json
$name
[1] "node-js-sample"

$version
[1] "0.2.0"

$description
[1] "A sample Node.js app using Express 4"

$main
[1] "index.js"

$scripts
$scripts$start
[1] "node index.js"


$dependencies
$dependencies$express
[1] "^4.13.3"


$engines
$engines$node
[1] "4.0.0"


$repository
$repository$type
[1] "git"

$repository$url
[1] "https://github.com/heroku/node-js-sample"


$keywords
$keywords[[1]]
[1] "node"

$keywords[[2]]
[1] "heroku"

$keywords[[3]]
[1] "express"


$author
[1] "Mark Pundsack"

$contributors
$contributors[[1]]
[1] "Zeke Sikelianos <zeke@sikelianos.com> (http://zeke.sikelianos.com)"


$license
[1] "MIT"

crosswalk(package.json, "NodeJS")
{
  "@context": "http://purl.org/codemeta/2.0",
  "codeRepository": {},
  "creator": "Mark Pundsack",
  "description": "A sample Node.js app using Express 4",
  "keywords": [
    "node",
    "heroku",
    "express"
  ],
  "license": "MIT",
  "name": "node-js-sample",
  "version": "0.2.0"
} 

Note that while nested structures per se pose no special problem, the compaction/expansion paradigm lacks a mechanism to capture differences in nesting between schema. For instance, in codemeta (i.e. in schema.org), a codeRepository is expected to be a URL, while NodeJS package.json permits the corresponding repository field to be an object node with sub-properties type and url. There is no way in JSON-LD transforms or context definitions to indicate that the url sub-property in the NodeJS case (that is, repository.url) maps to schema’s codeRepository, which is why the codeRepository in the output above is an empty object. (This same limitation is also true of the 2-dimensional table structure of the crosswalk itself, though it is important to keep in mind that this 1:1 mapping requirement is not unique to the .csv representation but is also inherent in JSON-LD contexts.)

Consequently, a thorough translation between formats that do not provide their own JSON-LD contexts will ultimately require more manual implementation, expressed in a particular programming language (e.g. R) rather than in the generic JSON-LD algorithms available in many common programming languages.
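As a minimal sketch of what such a manual step might look like in R, the hypothetical helper below (flatten_repository is not part of codemetar) hoists the nested url of a package.json repository entry so that the field becomes a plain URL string before any crosswalk is attempted:

```r
# Hypothetical helper: replace a nested repository object by its url field,
# leaving the input untouched when repository is already a plain string
flatten_repository <- function(pkg) {
  repo <- pkg[["repository"]]
  if (is.list(repo) && !is.null(repo[["url"]])) {
    pkg[["repository"]] <- repo[["url"]]
  }
  pkg
}

# Trimmed-down version of the package.json list shown above
pkg <- list(
  name = "node-js-sample",
  repository = list(
    type = "git",
    url = "https://github.com/heroku/node-js-sample"
  )
)

flat <- flatten_repository(pkg)
print(flat$repository)
```

The flattened list could then be passed on, for example as crosswalk(flat, "NodeJS"). Whether codeRepository is populated afterwards depends on how the crosswalk table maps the repository field, so that step is left as an assumption here.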