suggested mapping for NCBI databases #76

Open
bradfordcondon opened this issue Dec 12, 2018 · 16 comments

Comments

@bradfordcondon
Contributor

I'm creating a Tripal module that imports NCBI XML and creates chado records, as well as linked records.

I think it would be neat to have a recommended, agreed-upon mapping of NCBI data into Chado. This could go into the documentation/wiki. For example:
bioproject -> project
biosample -> biomaterial
assembly -> analysis of type X (note that some, e.g. @laceysanderson, might argue this should actually be a project, since it's multiple analyses?)
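
A minimal sketch of how this proposed mapping could be written down as a lookup table. The names `NCBI_TO_CHADO` and `chado_table_for` are hypothetical, and the `assembly` entry reflects only the first suggestion above, which is contested later in the thread:

```python
# Suggested NCBI database -> Chado base table, as proposed in this thread.
# This is a discussion aid, not an agreed standard.
NCBI_TO_CHADO = {
    "bioproject": "project",
    "biosample": "biomaterial",
    # Contested: an assembly could instead be modeled as a project.
    "assembly": "analysis",
}


def chado_table_for(ncbi_db: str) -> str:
    """Return the suggested Chado table for an NCBI database name."""
    key = ncbi_db.strip().lower()
    if key not in NCBI_TO_CHADO:
        raise KeyError(f"no suggested mapping for NCBI database {ncbi_db!r}")
    return NCBI_TO_CHADO[key]


print(chado_table_for("BioSample"))
```

Keeping the mapping in one table makes it easy to change a single entry (for example, `assembly`) once a recommendation is settled.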

@laceysanderson
Contributor

Yes, I would argue that 🙃

@ekcannon

I have already mapped genome metadata onto Chado, which includes GenBank BioProject and BioSample accessions. I had planned to write a Tripal module implementing this (we have a draft T2 module in use at PeanutBase and LegumeInfo, but it is not ready for prime time), but have not had time to do so. I suggest using this as a starting point so that I don't have to re-map my data:

genome_schema.pdf

Keep in mind that although it's getting better, the metadata in GenBank is still pretty sparse due to lack of researcher compliance. Loading metadata directly from GenBank should probably be viewed as a starting point, followed by hand curation.

@adf-ncgr has worked on a BioSample module (https://github.com/legumeinfo/lis_ncbi_bioproj), so it would be worth taking a look at that as well.

@bradfordcondon
Contributor Author

bradfordcondon commented Dec 12, 2018

Hi @ekcannon, thank you, I'll do my best to conform to this. Would you be able to clarify the yellow boxes? Those match up to NCBI, right? Which database does each come from? Specifically, you have "the assembly", "the assembly project", and "the master project". The data I've been importing typically has a BioProject, and an assembly that's part of that BioProject.

One of my example assemblies:
https://www.ncbi.nlm.nih.gov/assembly/GCF_001654055.1/

which is represented via the XML below. Note that I have to download the assembly summary via their FTP service to get the "Assembly method" field (thanks a lot...).
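
A rough sketch of pulling the "Assembly method" field out of the downloaded report, assuming the file begins with `# Key: value` header lines. The function name `parse_report_header` is hypothetical, and the sample text (including the `ExampleAssembler v. 1.0` value) is made up for illustration rather than copied from the real file:

```python
def parse_report_header(text: str) -> dict:
    """Collect '# Key: value' header lines into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.lstrip("#").partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Made-up sample in the assumed layout; not the real report contents.
SAMPLE = """\
# Assembly name:  ASM165405v1
# Taxid:          3981
# Assembly method: ExampleAssembler v. 1.0
#
# Sequence-Name\tSequence-Role
scaffold1\tunplaced-scaffold
"""

header = parse_report_header(SAMPLE)
print(header.get("Assembly method"))
```

Using `.get()` keeps the importer from failing when a submitter left the field out, which seems likely given how sparse the metadata can be.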

Does this assembly record correspond to the assembly project or the analysis in Chado in your scheme? Do you have other analyses in the assembly project, and if so, how do you get them from NCBI?

It looks to me like the NCBI assembly is the project, which you split into an assembly analysis and an annotation analysis, with the annotation analysis ingested from somewhere else; is that right? So I should be creating a project, and housing the assembly analysis in the project? But the NCBI assembly metadata gets associated with the analysis, not the project? At this point, isn't the NCBI assembly record still mapped to the analysis, not the assembly project?

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary assembly 20180216//EN"
    "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20180216/esummary_assembly.dtd">
<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DbBuild>Build181125-1910.1</DbBuild>
    <DocumentSummary uid="751381">
      <RsUid>4681368</RsUid>
      <GbUid>3282278</GbUid>
      <AssemblyAccession>GCF_001654055.1</AssemblyAccession>
      <LastMajorReleaseAccession>GCF_001654055.1</LastMajorReleaseAccession>
      <LatestAccession/>
      <ChainId>1654055</ChainId>
      <AssemblyName>ASM165405v1</AssemblyName>
      <UCSCName/>
      <EnsemblName/>
      <Taxid>3981</Taxid>
      <Organism>Hevea brasiliensis (rubber tree)</Organism>
      <SpeciesTaxid>3981</SpeciesTaxid>
      <SpeciesName>Hevea brasiliensis</SpeciesName>
      <AssemblyType>haploid</AssemblyType>
      <AssemblyClass>haploid</AssemblyClass>
      <AssemblyStatus>Scaffold</AssemblyStatus>
      <WGS>LVXX01</WGS>
      <GB_BioProjects>
        <Bioproj>
          <BioprojectAccn>PRJNA310386</BioprojectAccn>
          <BioprojectId>310386</BioprojectId>
        </Bioproj>
      </GB_BioProjects>
      <GB_Projects>
      </GB_Projects>
      <RS_BioProjects>
        <Bioproj>
          <BioprojectAccn>PRJNA394253</BioprojectAccn>
          <BioprojectId>394253</BioprojectId>
        </Bioproj>
      </RS_BioProjects>
      <RS_Projects>
      </RS_Projects>
      <BioSampleAccn>SAMN04451765</BioSampleAccn>
      <BioSampleId>4451765</BioSampleId>
      <Biosource>
        <InfraspeciesList>
          <Infraspecie>
            <Sub_type>cultivar</Sub_type>
            <Sub_value>reyan7-33-97</Sub_value>
          </Infraspecie>
        </InfraspeciesList>
        <Sex/>
        <Isolate/>
      </Biosource>
      <Coverage>99</Coverage>
      <PartialGenomeRepresentation>false</PartialGenomeRepresentation>
      <Primary>4681358</Primary>
      <AssemblyDescription/>
      <ReleaseLevel>Major</ReleaseLevel>
      <ReleaseType>Major</ReleaseType>
      <AsmReleaseDate_GenBank>2016/06/01 00:00</AsmReleaseDate_GenBank>
      <AsmReleaseDate_RefSeq>2017/07/14 00:00</AsmReleaseDate_RefSeq>
      <SeqReleaseDate>2016/06/01 00:00</SeqReleaseDate>
      <AsmUpdateDate>2017/07/19 00:00</AsmUpdateDate>
      <SubmissionDate>2016/06/01 00:00</SubmissionDate>
      <LastUpdateDate>2017/07/19 00:00</LastUpdateDate>
      <SubmitterOrganization>Rubber Research Institute</SubmitterOrganization>
      <RefSeq_category>representative genome</RefSeq_category>
      <AnomalousList>
      </AnomalousList>
      <ExclFromRefSeq>
      </ExclFromRefSeq>
      <PropertyList>
        <string>full-genome-representation</string>
        <string>has-chloroplast</string>
        <string>has_annotation</string>
        <string>latest</string>
        <string>latest_genbank</string>
        <string>latest_refseq</string>
        <string>refseq_has_annotation</string>
        <string>representative</string>
        <string>wgs</string>
      </PropertyList>
      <FromType/>
      <Synonym>
        <Genbank>GCA_001654055.1</Genbank>
        <RefSeq>GCF_001654055.1</RefSeq>
        <Similarity>different</Similarity>
      </Synonym>
      <ContigN50>60046</ContigN50>
      <ScaffoldN50>1281786</ScaffoldN50>
      <FtpPath_GenBank>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/654/055/GCA_001654055.1_ASM165405v1
      </FtpPath_GenBank>
      <FtpPath_RefSeq>ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/654/055/GCF_001654055.1_ASM165405v1
      </FtpPath_RefSeq>
      <FtpPath_Assembly_rpt>
        ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/654/055/GCF_001654055.1_ASM165405v1/GCF_001654055.1_ASM165405v1_assembly_report.txt
      </FtpPath_Assembly_rpt>
      <FtpPath_Stats_rpt>
        ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/654/055/GCF_001654055.1_ASM165405v1/GCF_001654055.1_ASM165405v1_assembly_stats.txt
      </FtpPath_Stats_rpt>
      <FtpPath_Regions_rpt/>
      <SortOrder>2C90016540559898</SortOrder>
      <Meta>
        <![CDATA[ <Stats> <Stat category="alt_loci_count" sequence_tag="all">0</Stat> <Stat category="chromosome_count" sequence_tag="all">0</Stat> <Stat category="contig_count" sequence_tag="all">48315</Stat> <Stat category="contig_l50" sequence_tag="all">6073</Stat> <Stat category="contig_n50" sequence_tag="all">60046</Stat> <Stat category="non_chromosome_replicon_count" sequence_tag="all">1</Stat> <Stat category="replicon_count" sequence_tag="all">1</Stat> <Stat category="scaffold_count" sequence_tag="all">7453</Stat> <Stat category="scaffold_count" sequence_tag="placed">1</Stat> <Stat category="scaffold_count" sequence_tag="unlocalized">0</Stat> <Stat category="scaffold_count" sequence_tag="unplaced">7452</Stat> <Stat category="scaffold_l50" sequence_tag="all">320</Stat> <Stat category="scaffold_n50" sequence_tag="all">1281786</Stat> <Stat category="total_length" sequence_tag="all">1373527118</Stat> <Stat category="ungapped_length" sequence_tag="all">1293730791</Stat> </Stats> <FtpSites>   <FtpPath type="Assembly_rpt">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/654/055/GCF_001654055.1_ASM165405v1/GCF_001654055.1_ASM165405v1_assembly_report.txt</FtpPath>   <FtpPath type="GenBank">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/001/654/055/GCA_001654055.1_ASM165405v1</FtpPath>   <FtpPath type="RefSeq">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/654/055/GCF_001654055.1_ASM165405v1</FtpPath>   <FtpPath type="Stats_rpt">ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/001/654/055/GCF_001654055.1_ASM165405v1/GCF_001654055.1_ASM165405v1_assembly_stats.txt</FtpPath> </FtpSites> <assembly-level>8</assembly-level> <assembly-status>Scaffold</assembly-status> <representative-status>representative genome</representative-status> <submitter-organization>Rubber Research Institute</submitter-organization>    ]]></Meta>
    </DocumentSummary>

  </DocumentSummarySet>
</eSummaryResult>

Edit: Ethy, is that the right link for the BioSample/BioProject module? It only has JavaScript code for formatting, I think.
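
As an illustration of how the eSummary XML above ties an assembly to its BioProject and BioSample records, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The embedded string is a trimmed copy of the record above, and the function and key names (`summarize_assembly`, `genbank_bioprojects`, and so on) are hypothetical:

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record above, keeping only the fields used below.
ESUMMARY = """<eSummaryResult>
  <DocumentSummarySet status="OK">
    <DocumentSummary uid="751381">
      <AssemblyAccession>GCF_001654055.1</AssemblyAccession>
      <AssemblyName>ASM165405v1</AssemblyName>
      <Organism>Hevea brasiliensis (rubber tree)</Organism>
      <GB_BioProjects>
        <Bioproj><BioprojectAccn>PRJNA310386</BioprojectAccn></Bioproj>
      </GB_BioProjects>
      <RS_BioProjects>
        <Bioproj><BioprojectAccn>PRJNA394253</BioprojectAccn></Bioproj>
      </RS_BioProjects>
      <BioSampleAccn>SAMN04451765</BioSampleAccn>
    </DocumentSummary>
  </DocumentSummarySet>
</eSummaryResult>"""


def texts(doc, path):
    """Return the text of every element under doc that matches path."""
    return [el.text for el in doc.findall(path)]


def summarize_assembly(xml_text):
    """Extract the accessions needed to link an assembly to other records."""
    root = ET.fromstring(xml_text)
    doc = root.find("DocumentSummarySet/DocumentSummary")
    return {
        "accession": doc.findtext("AssemblyAccession"),
        "name": doc.findtext("AssemblyName"),
        "organism": doc.findtext("Organism"),
        "genbank_bioprojects": texts(doc, "GB_BioProjects/Bioproj/BioprojectAccn"),
        "refseq_bioprojects": texts(doc, "RS_BioProjects/Bioproj/BioprojectAccn"),
        "biosample": doc.findtext("BioSampleAccn"),
    }


summary = summarize_assembly(ESUMMARY)
print(summary)
```

Note that the GenBank and RefSeq sides of the record point at different BioProjects, so an importer has to decide which one (or both) becomes the linked project.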

@ekcannon

Connections between the diagram labels and GenBank:

"The master project" --> No firm correspondence at GenBank though it is possible to group BioProjects together with curator assistance. I don't know what the data look like.

"The assembly project" --> BioProject + custom information for the website.

"The assembly" --> the sequence itself (typically a WGS accession) plus the Assembly accession (the accession only), along with extra information about the assembly, like quality statistics and how the assembly was constructed.

"The biosample" --> BioSample

Your question about "The assembly" vs GenBank's Assembly brings up a good point as I don't collect any information from the GenBank Assembly record aside from its accession. Expanding this part would be a good idea.
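
One way to expand it, sketched under assumptions: the quality statistics are already present in the `<Meta>` CDATA of the eSummary record posted earlier in this thread. That CDATA holds several sibling elements with no single root, so this sketch wraps it in a synthetic `<meta>` element before parsing. The function name `parse_meta_stats` is hypothetical, and the string below is a shortened excerpt of that CDATA:

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the <Meta> CDATA from the eSummary record in this thread.
META = (
    ' <Stats> <Stat category="contig_n50" sequence_tag="all">60046</Stat>'
    ' <Stat category="scaffold_count" sequence_tag="all">7453</Stat>'
    ' <Stat category="scaffold_count" sequence_tag="placed">1</Stat>'
    ' <Stat category="scaffold_n50" sequence_tag="all">1281786</Stat>'
    ' <Stat category="total_length" sequence_tag="all">1373527118</Stat> </Stats>'
    ' <assembly-status>Scaffold</assembly-status> '
)


def parse_meta_stats(meta_text):
    """Return {(category, sequence_tag): int} for every <Stat> element."""
    # The fragment has multiple top-level elements, so add a synthetic root.
    root = ET.fromstring(f"<meta>{meta_text}</meta>")
    stats = {}
    for stat in root.iter("Stat"):
        key = (stat.get("category"), stat.get("sequence_tag"))
        stats[key] = int(stat.text)
    return stats


stats = parse_meta_stats(META)
print(stats[("scaffold_n50", "all")])
```

Keying on both `category` and `sequence_tag` matters because the same category (for example `scaffold_count`) appears more than once with different tags.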

To the extent possible, I use terms from MIxS (which is being revised; https://gensc.org/mixs), but had to create several custom terms as GenBank uses a subset of the terms. Moreover, some terms have the same meaning as a MIxS term, but have a different name. GenBank is a member of the team that is revising MIxS, but the problem is likely to persist. For mapping contents of the XML you could either map to MIxS terms, or just use GenBank's. I'd suggest not making up new terms if you can avoid it.

@adf-ncgr

Hi, I'm not sure I've really digested this thread, but wanted to note regarding:

"The master project" --> No firm correspondence at GenBank though it is possible to group
BioProjects together with curator assistance.

NCBI BioProjects can be nested within "umbrella" projects; a good example is here:
https://www.ncbi.nlm.nih.gov/bioproject/353637

I'm not sure if they allow unlimited levels of hierarchy, or whether this 2-level structure is as deep as it gets; but I vaguely recall that we had worked on enabling something along these lines when we were actively pursuing development around this.

@ekcannon

You are correct, @bradfordcondon, the github link in my comment above was incorrect. Here is the Tripal module github repository:
https://github.com/ncgr/tripal_biomaterial

As you can see, it dates to quite a while ago and wasn't completed, and was written for Tripal 1, so may be of limited use.

@mestato
Contributor

mestato commented Dec 13, 2018

Sorry if I've missed previous conversations/issues about project vs analysis, but I was wondering, is there a use case for individual reference genome instances living at the project level and then creating multiple supporting analyses? What would those analyses be exactly? Or is this just based on the table definitions from chado?

@ekcannon

For my metadata, a genome assembly project typically has two analyses: the assembly and the annotation. Sometimes there are multiple annotations, and/or multiple versions of an annotation, each of which get new analysis records.
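
A small sketch of that layout, using an in-memory SQLite database as a stand-in. The table and column names loosely follow Chado's `project`, `analysis`, and `project_analysis` tables but are heavily simplified, and all record names and program names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the Chado tables; not the real schema.
conn.executescript("""
CREATE TABLE project (project_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE analysis (analysis_id INTEGER PRIMARY KEY, name TEXT NOT NULL,
                       program TEXT NOT NULL, programversion TEXT NOT NULL);
CREATE TABLE project_analysis (project_id INTEGER REFERENCES project,
                               analysis_id INTEGER REFERENCES analysis,
                               UNIQUE (project_id, analysis_id));
""")


def add_project(name):
    """Insert a project and return its id."""
    return conn.execute("INSERT INTO project (name) VALUES (?)", (name,)).lastrowid


def add_analysis(project_id, name, program, version):
    """Insert an analysis and link it to its parent project."""
    analysis_id = conn.execute(
        "INSERT INTO analysis (name, program, programversion) VALUES (?, ?, ?)",
        (name, program, version),
    ).lastrowid
    conn.execute("INSERT INTO project_analysis VALUES (?, ?)", (project_id, analysis_id))
    return analysis_id


proj = add_project("Example genome assembly project")
add_analysis(proj, "assembly v1", "example-assembler", "1.0")
add_analysis(proj, "annotation v1", "example-annotator", "1.0")
# A revised annotation gets its own analysis record under the same project.
add_analysis(proj, "annotation v2", "example-annotator", "2.0")

rows = conn.execute(
    "SELECT a.name FROM analysis a JOIN project_analysis pa USING (analysis_id) "
    "WHERE pa.project_id = ? ORDER BY a.analysis_id",
    (proj,),
).fetchall()
analysis_names = [r[0] for r in rows]
print(analysis_names)
```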

@mestato
Contributor

mestato commented Dec 13, 2018

Interesting. From the perspective of mirroring NCBI, I don't think they give annotation any sort of separate identity outside the assembly. They're housed together.

On to my next question: let's say the same group of PIs that produced the first genome version then releases an updated version, a version 2.0. Does that get a new project (which then has new child assembly and annotation analyses), or is that two new analyses housed under the same original project? If the former, are the projects grouped by a "super project" so they can be associated?

@ekcannon

Good point, @mestato. Though GenBank does tend to do its own annotation, which then lives in a separate record. For example:
https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Zea_mays/101/

Regarding your second question, if the same group produces a second version, then a second BioProject record is created, which could be linked to the first via an umbrella BioProject (optional), which apparently requires assistance from NCBI curators. See @adf-ncgr's comment above for an example.
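
To make the umbrella idea concrete, here is a minimal sketch in which each tuple stands in for one row of a project-to-project relationship table (subject project, relationship type, object project). The `part_of` type and every project name are assumptions for illustration only:

```python
# (subject project, relationship type, object project); all names hypothetical.
RELATIONSHIPS = [
    ("Example genome v1.0", "part_of", "Example umbrella project"),
    ("Example genome v2.0", "part_of", "Example umbrella project"),
    ("Unrelated genome", "part_of", "Another umbrella project"),
]


def members(umbrella, rels=RELATIONSHIPS):
    """Projects grouped directly under the given umbrella project."""
    return sorted(s for s, t, o in rels if t == "part_of" and o == umbrella)


def related_versions(project, rels=RELATIONSHIPS):
    """Other projects that share an umbrella with the given project."""
    umbrellas = {o for s, t, o in rels if t == "part_of" and s == project}
    return sorted(
        s for s, t, o in rels
        if t == "part_of" and o in umbrellas and s != project
    )


print(members("Example umbrella project"))
print(related_versions("Example genome v2.0"))
```

Because the umbrella is optional at NCBI, an importer following this sketch would need to tolerate projects with no relationship rows at all.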

@mestato
Contributor

mestato commented Dec 14, 2018

From my poking around, I think the structure at NCBI varies pretty wildly on this. That's probably because it's somewhat controlled by users: they can choose to create a new project or add an assembly to an existing project.

For example, peach:

So I think a one-to-one mapping of NCBI to Chado is feasible, but will be just as inconsistent as NCBI. This may be an individual database decision: are you the type of database that is going to suck in the NCBI data as is and get on with integrating it with new data, or are you the type of database that wants to restructure the NCBI data to be consistent? I think there's plenty of room for both in Tripal. That being said, a recommended NCBI mapping is still helpful.

(I'm not sure this comment helped move this thread forward at all... but I'm learning a lot about NCBI :)

@ekcannon

It will likely be easiest to just pull directly from NCBI. If more consistency is desired, and/or if specific databases want more thorough metadata, that can be done by hand.

An alternative just occurred to me (or optional behavior): instead of loading directly into Chado, produce a table or Excel file that can be cleaned up, filled out, et cetera, then load that. A two-step process. It's challenging to change data that's already been loaded into Chado. One advantage to this is that the table/spreadsheet can be filled out before the NCBI submission, assisting both the submission process and the data loading into Chado. This is the way MaizeGDB handles assembly metadata, though I recognize that not all databases have the luxury of controlling genome assembly submissions to NCBI.
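
A minimal sketch of that two-step idea, using CSV rather than Excel so it needs only the standard library. The column list `FIELDS` and the function names are hypothetical; the point is that the export step writes blanks for missing metadata and the load step reports which rows still need curation:

```python
import csv
import io

# Hypothetical column set for the curation table.
FIELDS = ["assembly_accession", "bioproject", "biosample", "assembly_method"]


def export_for_curation(records, handle):
    """Step 1: write NCBI-derived records, leaving missing fields blank."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for rec in records:
        writer.writerow({f: rec.get(f, "") for f in FIELDS})


def load_curated(handle):
    """Step 2: read the table back and list accessions with blank fields."""
    rows = list(csv.DictReader(handle))
    incomplete = [
        r["assembly_accession"]
        for r in rows
        if any(not (r[f] or "").strip() for f in FIELDS)
    ]
    return rows, incomplete


# Accessions taken from the eSummary record posted in this thread;
# the assembly method is deliberately missing.
buf = io.StringIO(newline="")
export_for_curation(
    [{"assembly_accession": "GCF_001654055.1",
      "bioproject": "PRJNA310386",
      "biosample": "SAMN04451765"}],
    buf,
)
buf.seek(0)
rows, incomplete = load_curated(buf)
print(incomplete)
```

A loader could refuse to write anything into Chado while `incomplete` is non-empty, or load the rows anyway and flag them for later hand curation.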

@bradfordcondon
Contributor Author

This is my wishlist:

  • We can provide a table of suggested mappings for each NCBI database. Bonus points if this includes CVterm assignments for the type_id.
  • I would like to map NCBI data into Chado as "simply" as possible. In a dream world, a Chado record will correspond 1:1 with an NCBI record, i.e., it won't be split into multiple subrecords unless there's a clear benefit to doing so.

The discussion points you all have brought up have been really interesting. My vision for my module is to totally automate ingestion of NCBI data. I can appreciate your point, Ethy, but I wonder if what you gain with that approach is worth the programming effort, versus simply creating an automated record and then choosing to edit it after it's created using the existing Tripal UI.

You can see my current mapping on the docs site. Assembly, the stickiest one for sure, is here:.

For Chado tables that don't align well with NCBI content, I guess my question is: why not, and do we want to keep it that way? Take analysis, for example. I readily admit that the actual table definition is pretty strict that this is a single program run, and that multiple analyses shouldn't be lumped together. But NCBI doesn't tell us about the intermediate analyses used to generate the assembly. We can of course create an assembly project, and then put the single assembly analysis in the project linked to it, with semi-wrong info because we're still referring to multiple programs with a single analysis. But to what end? And don't we end up with many, many projects once we do this for every NCBI data type, perhaps over multiple assembly versions? We can use the project_relationship table, but I pose the question: are we doing this to conform to Chado standards, or to actually model the data in an ideal way? Could modeling the data in a way more in line with how NCBI models it be a better approach?

@ekcannon

Nice work, Bradford!

Regarding an intermediate Excel file: use of Excel spreadsheets serves two purposes for MaizeGDB: 1) it helps us and/or contributors collect data for GenBank submissions, and 2) it enables us to collect information beyond what GenBank collects. We don't want data providers to be accessing admin pages, and even though spreadsheets are a problematic source of data errors, an Excel spreadsheet is just easier for researchers to work with. If such a feature does not exist in your proposed module, I can create the necessary scripts, though I'd have to be sure my loader matches your work.

There is also the problem that submitters rarely take the effort to fill in as much metadata as they can. If you really care about the metadata for a genome or some other dataset at GenBank, you will need to go back to the submitter to fill in the blanks.

Also, given what you are trying to accomplish, I would say generating and/or loading the data via an Excel spreadsheet should be an optional feature, if you wish to add it.

Your question about Assembly vs WGS records is pertinent. My mapping lumps them together into one analysis with lots of props, but this is probably not a great solution if you are trying to mirror GenBank data. Could the Assembly record be another project record, linked to the BioProject project record and the WGS the analysis?

Regarding the fact that multiple analyses are combined to create an assembly, it's true that GenBank doesn't (currently) collect much more than a single analysis field, though I don't think there's a character limit on how much information is provided. The MIxS standard (https://press3.mcs.anl.gov/gensc/mixs) does collect information about the multiple analyses throughout the process of generating a genome, such as: how the sample material was processed, sequencing technology, assembly method, QC methods, finishing strategy, SOP(s). I believe GenBank submission provides a somewhat hidden means of getting at a fuller list of MIxS-recommended fields through downloadable templates (https://press3.mcs.anl.gov/gensc/mixs/submit-mixs-metadata/), but even these are incomplete and I suspect they are rarely used.

@bradfordcondon
Contributor Author

bradfordcondon commented Dec 19, 2018

Re: point 1, @ekcannon, you might be interested in our Tripal HeadQuarters project (https://tripal-hq.readthedocs.io/en/latest/?badge=latest). It "wraps" content creation so that you can let users submit data into Chado, pending admin approval. Combine this with custom fields, and you can collect all the metadata you need and even store it in Chado if appropriate. I wrote a guide demonstrating the process. I'd be delighted to talk about it more, but perhaps it's off-topic for this thread.

That said, if users prefer Excel over forms then that's what matters :)

Could the Assembly record be another project record, linked to the BioProject project record and the WGS the analysis?

But is WGS a single analysis, or is it multiple analyses and therefore required to be put in a project? This is what I'm afraid of: having projects upon projects in an effort to conform to Chado.

Re: MIxS: yes, I haven't tried submitting one that way myself. From what I can gather, the user selects what "package" of metadata they want to conform to. I agree, I've yet to encounter something using the MIxS standard "in the wild".

@ekcannon

But is WGS a single analysis, or is it multiple analyses and therefore required to be put in a project? This is what I'm afraid of: having projects upon projects in an effort to conform to Chado.

It's the end result of a string of analyses. I can be convinced otherwise, but given my current understanding, it's fair to call it one analysis. (Similarly, an annotation may be a pipeline of multiple algorithms and packages and rules for selecting the final set of gene models from a pool of possibilities.)

I've yet to encounter something using the MIxS standard "in the wild".

MaizeGDB does. :-) So does GOLD (Genomes OnLine Database; https://gold.jgi.doe.gov). Incidentally, I just remembered that GOLD ingests data from GenBank, then its curators or data providers can attempt to fill out missing bits.

BTW, there's a plant extension for MIxS (https://press3.mcs.anl.gov/gensc/the-plant-specimen-contextual-data-consensus; https://academic.oup.com/gigascience/article/5/1/giw002/2756883). The original specification for MIxS was skewed toward metagenomics so many of the fields make little to no sense for plants. The plant extension needs some work, but MaizeGDB has adopted the extension as well, to the extent that we could do so. I don't know if anyone else has. Also, GenBank is represented in the MIxS working group and is committed to upholding the standard; it just may take years to make changes.

5 participants