Convert entity table to machine-readable format #466

Closed
tsalo opened this issue May 10, 2020 · 19 comments · Fixed by #475
Labels
formatting (Aesthetics and formatting of the spec); schema (Issues related to the YAML schema representation of the specification. Patch version release.)

Comments

@tsalo
Member

tsalo commented May 10, 2020

I'm not sure if it's feasible, but it would be nice if the entity table was stored as a json file, in order to make it both programmatically accessible and centralized. I know that there is an equivalent file in bids-validator and pybids, but if the filename construction rules were centralized under the actual specification, then it would be much easier to update the specification across the ecosystem without having to update a range of other packages as well.

@tsalo tsalo added the formatting label (Aesthetics and formatting of the spec) May 10, 2020
@jbteves

jbteves commented May 11, 2020

I would like to second this as someone largely outside the ecosystem. I wanted to make a program which could generate filenames programmatically since we have a big, complicated study, but without a file like this it's hard to build tooling to that effect IMO.

@yarikoptic
Collaborator

  • I think YAML would be a better fit because it allows comments and is friendlier to humans (con: there is no de facto validator, so it should remain simple, and validation should probably still use a JSON schema validator)
  • It could also make the data structure useful beyond the entity table, e.g. with a listing of the corresponding terms used for a given entity in a given modality (e.g. for this effort: Repository of BIDS terms and supporting the BIDS community #423)
  • we could also use it in heudiconv to construct filenames following the standard order etc.
  • it might be worth keeping such specifications in a separate lightweight repository, which would then be either "linked" here via git submodules or included via the git subtree mechanism. That would allow similar inclusion of it into the repositories of the corresponding software projects
  • a similar desire/related discussion came up recently in Repository of BIDS terms and supporting the BIDS community #423 (comment)

@jbteves

jbteves commented May 11, 2020

  • As an outsider, .json is still preferable as more people are familiar with it and it is in line with many other choices in the specification, including the fact that metadata is stored in that same format.
  • I think Repository of BIDS terms and supporting the BIDS community #423 is a good example of why approaches like this are useful, thanks for linking that.
  • Using heudiconv in tandem with this would be wonderful
  • I have no opinions on how it's stored, provided it can easily be fetched with a git clone or curl operation.

@tsalo
Member Author

tsalo commented May 15, 2020

What about something like the following? I've formatted it as yaml because that was easier to write freehand into a file, but could easily switch to json for the real thing. The suffices are organized into groups like the entity table to keep it reasonably short, but I could drop the groups and make each suffix a dictionary under the datatypes.

entities:
  sub:
    description: Subject
    format: label
  ses:
    description: Session
    format: label
  task:
    description: Task
    format: label
  acq:
    description: Acquisition
    format: label
  ce:
    description: Contrast Enhancing Agent
    format: label
  rec:
    description: Reconstruction
    format: label
  dir:
    description: Phase-Encoding Direction
    format: label
  run:
    description: Run
    format: index
  mod:
    description: Corresponding Modality
    format: label
  echo:
    description: Echo
    format: index
  recording:
    description: Recording
    format: label
  proc:
    description: Processed (on device)
    format: label
  space:
    description: Space
    format: label
datatypes:
  anat:
    group1:
      suffices:
        - T1w
        - T2w
        - T1rho
        - T1map
        - T2map
        - T2star
        - FLAIR
        - FLASH
        - PD
        - PDmap
        - PDT2
        - inplaneT1
        - inplaneT2
        - angio
      extensions:
        - nii.gz
        - nii
        - json
      entities:
        sub: required
        ses: optional
        acq: optional
        ce: optional
        rec: optional
    group2:
      suffices:
        - defacemask
      extensions:
        - nii.gz
        - nii
        - json
      entities:
        sub: required
        ses: optional
        acq: optional
        ce: optional
        rec: optional
        mod: optional
  func:
    group1:
      suffices:
        - bold
        - cbv
        - phase
        - sbref
      extensions:
        - nii.gz
        - nii
        - json
      entities:
        sub: required
        ses: optional
        task: required
        acq: optional
        ce: optional
        rec: optional
        dir: optional
        run: optional
        echo: optional

@yarikoptic
Collaborator

@tsalo -- this looks beautiful to me!

@jbteves : I do agree that the consistency achieved by using .json is indeed a benefit. But IMHO YAML is so much nicer and more human-friendly that I simply can't resist it. It also has a feature of the XXI century: support for # comments!

Here is a JSON view of the above YAML for comparison. It is not too bad yet, but as it grows I would find it easier and easier to orient myself in YAML than in JSON, and all the clutter from everything being in "" really makes it less readable to me.
{
  "entities": {
    "task": {
      "description": "Task", 
      "format": "label"
    }, 
    "ses": {
      "description": "Session", 
      "format": "label"
    }, 
    "sub": {
      "description": "Subject", 
      "format": "label"
    }, 
    "space": {
      "description": "Space", 
      "format": "label"
    }, 
    "ce": {
      "description": "Contrast Enhancing Agent", 
      "format": "label"
    }, 
    "echo": {
      "description": "Echo", 
      "format": "index"
    }, 
    "recording": {
      "description": "Recording", 
      "format": "label"
    }, 
    "acq": {
      "description": "Acquisition", 
      "format": "label"
    }, 
    "rec": {
      "description": "Reconstruction", 
      "format": "label"
    }, 
    "run": {
      "description": "Run", 
      "format": "index"
    }, 
    "proc": {
      "description": "Processed (on device)", 
      "format": "label"
    }, 
    "dir": {
      "description": "Phase-Encoding Direction", 
      "format": "label"
    }, 
    "mod": {
      "description": "Corresponding Modality", 
      "format": "label"
    }
  }, 
  "datatypes": {
    "anat": {
      "group1": {
        "suffices": [
          "T1w", 
          "T2w", 
          "T1rho", 
          "T1map", 
          "T2map", 
          "T2star", 
          "FLAIR", 
          "FLASH", 
          "PD", 
          "PDmap", 
          "PDT2", 
          "inplaneT1", 
          "inplaneT2", 
          "angio"
        ], 
        "extensions": [
          "nii.gz", 
          "nii", 
          "json"
        ], 
        "entities": {
          "rec": "optional", 
          "acq": "optional", 
          "ses": "optional", 
          "sub": "required", 
          "ce": "optional"
        }
      }, 
      "group2": {
        "suffices": [
          "defacemask"
        ], 
        "extensions": [
          "nii.gz", 
          "nii", 
          "json"
        ], 
        "entities": {
          "acq": "optional", 
          "ses": "optional", 
          "sub": "required", 
          "rec": "optional", 
          "ce": "optional", 
          "mod": "optional"
        }
      }
    }, 
    "func": {
      "group1": {
        "suffices": [
          "bold", 
          "cbv", 
          "phase", 
          "sbref"
        ], 
        "extensions": [
          "nii.gz", 
          "nii", 
          "json"
        ], 
        "entities": {
          "task": "required", 
          "ses": "optional", 
          "sub": "required", 
          "ce": "optional", 
          "echo": "optional", 
          "acq": "optional", 
          "rec": "optional", 
          "run": "optional", 
          "dir": "optional"
        }
      }
    }
  }
}
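
Whichever serialization is chosen, the parsed structure is the same nested dictionary. The following sketch (standard library only, with a trimmed and partly hypothetical excerpt of the JSON above) shows how the required/optional rules of a group could be checked; `check_entities` is an illustrative helper, not part of any existing BIDS tool.

```python
import json

# Trimmed excerpt of the JSON above; only a few entities are kept for brevity.
SCHEMA_EXCERPT = json.loads("""
{
  "datatypes": {
    "func": {
      "group1": {
        "suffices": ["bold", "cbv", "phase", "sbref"],
        "extensions": ["nii.gz", "nii", "json"],
        "entities": {
          "sub": "required",
          "ses": "optional",
          "task": "required",
          "run": "optional"
        }
      }
    }
  }
}
""")


def check_entities(schema, datatype, suffix, entities):
    """Return a list of problems; an empty list means the entities are allowed."""
    for group in schema["datatypes"][datatype].values():
        if suffix in group["suffices"]:
            rules = group["entities"]
            missing = [k for k, v in rules.items()
                       if v == "required" and k not in entities]
            unexpected = [k for k in entities if k not in rules]
            return ([f"missing required entity: {k}" for k in missing]
                    + [f"unexpected entity: {k}" for k in unexpected])
    return [f"unknown suffix for {datatype}: {suffix}"]


print(check_entities(SCHEMA_EXCERPT, "func", "bold", {"sub": "01"}))
# prints ['missing required entity: task']
```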

@yarikoptic
Collaborator

Unrelated to this issue, just wanted to share my 1c of no value here: a not widely known fact is that YAML 1.2 is a superset of JSON (any JSON is also valid YAML). I.e. if at some point we decide "let's prepare for migration to YAML", conversion of .json into .yaml could be as easy as mv blah.json blah.yaml (although not as beneficial as proper re-serialization ;-)).

@yarikoptic
Collaborator

Sorry for spamming... but I am just too excited! Such a spec could then be used to produce most if not all of the term tables we have. It could be used to produce target filename patterns. We could even programmatically validate example filenames! It would reduce duplication and thus possible errors. Validators could avoid hardcoding, and there would be no need to change the validator upon addition of a term, entity, etc. It would also make it possible for a validator to validate against a specific version of BIDS, not just the latest!
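
To make the "validate example filenames" idea concrete, here is a small sketch under stated assumptions: `FUNC_RULES` is a hand-copied, shortened version of the func group from the proposal, and `parse_filename` and `is_valid` are hypothetical helpers. Because the rules are passed in as data, the same code could be handed the rules from any schema version. The sketch does not check entity order.

```python
# Shortened copy of the func group from the proposal (assumption: a real tool
# would load this from the schema file of the BIDS version being checked).
FUNC_RULES = {
    "suffices": ["bold", "cbv", "phase", "sbref"],
    "extensions": ["nii.gz", "nii", "json"],
    "entities": {"sub": "required", "ses": "optional",
                 "task": "required", "run": "optional"},
}


def parse_filename(name):
    """Split 'sub-01_task-rest_bold.nii.gz' into (entities, suffix, extension)."""
    stem, _, extension = name.partition(".")
    *pairs, suffix = stem.split("_")
    entities = {}
    for pair in pairs:
        key, sep, value = pair.partition("-")
        if not sep or not value:
            raise ValueError(f"not a key-value pair: {pair!r}")
        entities[key] = value
    return entities, suffix, extension


def is_valid(name, rules):
    """Check suffix, extension, required entities, and unexpected entities."""
    try:
        entities, suffix, extension = parse_filename(name)
    except ValueError:
        return False
    if suffix not in rules["suffices"] or extension not in rules["extensions"]:
        return False
    required = {k for k, v in rules["entities"].items() if v == "required"}
    return required <= set(entities) <= set(rules["entities"])


print(is_valid("sub-01_task-rest_bold.nii.gz", FUNC_RULES))  # prints True
print(is_valid("sub-01_bold.nii.gz", FUNC_RULES))            # prints False
```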

@sappelhoff
Member

Thanks for throwing in some ideas to improve the entity table @tsalo -> these are some related issues: #289 #290

re: the current proposal

@tsalo
Member Author

tsalo commented May 16, 2020

@yarikoptic I was thinking the same thing! The versioning aspect will be awesome!

@sappelhoff I agree that the file will end up being prohibitively long in its current form. What about splitting the files into the following:

  • entities: Entities, their full names, their values, and their order.
  • [modality]_[datatype]: Like the above, but only for a single datatype. Possibly with lists of metadata fields as well?
  • top_level/associated_data/any other files that should be based on bids-validator rules.

I was also a little stuck on how the json/yaml file would be rendered as a table on the site. Will whatever rendering function is used need to be in a specific language?

@yarikoptic
Collaborator

Re length: We can partition at the top level into separate files. Unfortunately YAML, like JSON, has no native include mechanism, but solutions exist that would spare us from implementing it ourselves: https://stackoverflow.com/questions/528281/how-can-i-include-a-yaml-file-inside-another . A similar approach is taken by the NWB standard, see https://github.com/NeurodataWithoutBorders/nwb-schema/blob/24fba6174ddbad171ee5bb824edfa31f86b1b16d/core/nwb.namespace.yaml which defines includes for different modalities. I am not yet sure if we want to partition by modality; I feel that we might better partition by concept/structure: entities, datatypes, terms, ... as prototyped by @tsalo.
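
One simple alternative to an include mechanism is to let the directory layout do the partitioning and merge the files at load time. The sketch below is only an illustration: the layout (`entities.json`, `datatypes/anat.json`) and the `load_schema` name are assumptions, and it reads JSON so that it runs with the standard library alone; the same directory walk would work with a YAML parser.

```python
import json
from pathlib import Path


def load_schema(root):
    """Merge every *.json file under root into one nested dictionary.

    <root>/entities.json        -> schema["entities"]
    <root>/datatypes/anat.json  -> schema["datatypes"]["anat"]
    """
    root = Path(root)
    schema = {}
    for path in sorted(root.rglob("*.json")):
        relative = path.relative_to(root).with_suffix("")
        node = schema
        # Each directory level becomes one level of nesting.
        for part in relative.parts[:-1]:
            node = node.setdefault(part, {})
        node[relative.name] = json.loads(path.read_text(encoding="utf-8"))
    return schema
```

With this approach, partitioning by concept at the top level and by datatype below it is just a matter of where the files are placed.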

@yarikoptic
Collaborator

And then partition per datatypes (modality!) ;-)

@yarikoptic
Collaborator

yarikoptic commented May 16, 2020

@tsalo we will not render this structure directly. We would code a helper tool that renders from it all the .md tables etc. to be included in the spec upon compilation.

edit 1: we could use something like https://pypi.org/project/tabulate/ to prepare such tables.
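
A minimal sketch of such a helper, using only plain string formatting so it has no dependencies (a package like tabulate could replace the formatting step). The function name, the column layout, and the upper-casing of the requirement levels are illustrative choices, not the actual rendering script.

```python
def render_entity_table(entities, groups):
    """Render one Markdown table row per suffix group.

    entities: mapping of entity key -> {"description": ...}, in column order
    groups:   list of {"suffices": [...], "entities": {key: "required"|"optional"}}
    """
    keys = list(entities)
    header = ["suffixes"] + [entities[k]["description"] for k in keys]
    rows = [header, ["---"] * len(header)]
    for group in groups:
        cells = [" ".join(group["suffices"])]
        # Entities a group does not list are left as empty cells.
        cells += [group["entities"].get(k, "").upper() for k in keys]
        rows.append(cells)
    return "\n".join("| " + " | ".join(row) + " |" for row in rows)


example_entities = {
    "sub": {"description": "Subject"},
    "task": {"description": "Task"},
}
example_groups = [
    {"suffices": ["bold", "sbref"],
     "entities": {"sub": "required", "task": "required"}},
]
print(render_entity_table(example_entities, example_groups))
```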

@yarikoptic
Collaborator

To not derail the discussion here, but to outline a possible mechanism for establishing historical versions of the schema etc. suitable for reuse by BIDS-aware tools, I have initiated https://github.com/bids-standard/bids-schema -- see its README.md, and you are welcome to open issues with questions/suggestions/notes (there is probably nothing to be contributed in PRs until we get a schema going here).

@tsalo
Member Author

tsalo commented May 17, 2020

I started working on the files in tsalo/bids-specification@ref/json-entity. The datatypes are split up in the datatypes/ folder by row in the entity table. I know that the divisions in there aren't actually the same as the datatypes, but I figured it's a good start. We can figure out how to restructure them from there (including changing how they're partitioned).

@yarikoptic If we'll be using a Python script to handle the rendering then that alleviates my concerns. Thanks!

Regarding releases, I had assumed that we'd use the releases in the specification repository, but since the specification for the yaml/json files will probably change, it makes sense to back up the schemas elsewhere and allow maintainers to adjust them as needed.

@yarikoptic
Collaborator

Yeap, that is the purpose of that bids-schema. It is also meant to be more lightweight and not carry all the bids-specification history/images etc., so it could be included in tool distributions where desired... That thought triggered the need to file bids-standard/bids-schema#1 ;-)

@yarikoptic
Collaborator

yarikoptic commented May 17, 2020

Re your branch - please place all of the produced yamls into a dedicated folder (eg schema).

Edit: I think it will be useful beyond appendices, so I would have placed it on top level in the hierarchy.

@tsalo
Member Author

tsalo commented May 17, 2020

Done!

@yarikoptic
Collaborator

Awesome! If it were a PR here, I could try my hand at an entity table generation/embedding script (unless you just do it) ;-)

@tsalo
Member Author

tsalo commented May 17, 2020

I just opened #475 as a draft PR.

@tsalo tsalo changed the title from "Convert entity table to json" to "Convert entity table to machine-readable format" Jun 13, 2020
@tsalo tsalo added the schema label (Issues related to the YAML schema representation of the specification. Patch version release.) Sep 23, 2020

4 participants