Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

organism_analysis linker table #59

Open
bradfordcondon opened this issue Jan 29, 2018 · 21 comments
Open

organism_analysis linker table #59

bradfordcondon opened this issue Jan 29, 2018 · 21 comments

Comments

@bradfordcondon
Copy link
Contributor

I think there was some request for an organism_analysis linker table in the schema? We plan on creating one at HWG anyway.

-- ================================================
-- TABLE: organism_analysis
-- ================================================

create table organism_analysis (
       organism_analysis_id bigserial not null,
       primary key (organism_analysis_id),
       organism_id bigint not null,
       foreign key (organism_id) references organism (organism_id) on delete cascade INITIALLY DEFERRED,
       analysis_id bigint not null,
       foreign key (analysis_id) references analysis (analysis_id) on delete cascade INITIALLY DEFERRED,
       constraint organism_analysis_c1 unique (organism_id,analysis_id)
);
create index organism_analysis_idx1 on organism_analysis (organism_id);
create index organism_analysis_idx2 on organism_analysis (analysis_id);

COMMENT ON TABLE organism_analysis IS 'Linking table for associating organisms and analyses.';
@ekcannon
Copy link

It's difficult to say no to a linker table - the more the merrier - but I am curious how this would be used.

@bradfordcondon
Copy link
Contributor Author

bradfordcondon commented Jan 30, 2018

We (meaning HWG) typically want to list the analyses performed on an organism on the respective organism's page. this let's users visit that analysis to download the data or the associated entities (features, biosamples, what have you).

Edit here is an example organism. The Transcriptome and Analysis tabs make use of the proposed linker field to display all analyses associated with this organism.

@laceysanderson
Copy link
Contributor

I would worry a bit about this causing inconsistencies in the database. For example, what if your transcripts for a given analysis are for species X and the linker table points to species Y? I think this might be better implemented as a materialized view that pulls through the analysis -> feature -> organism connection. From an application standpoint this ends up being the same thing but from the database level it will help you keep consistency. Would that work for HWG @bradfordcondon?

@spficklin
Copy link
Contributor

spficklin commented Jan 30, 2018

If I remeber correctly this table came up as a suggestion during Chado v1.3 discussions and it was shot down because it breaks the highly normalized structure of Chado for the reason @laceysanderson specified.

I know of some Tripal sites that use this table but they add it as a custom table. But the MView is probably a better idea.

@ekcannon
Copy link

I have the same concerns as @laceysanderson and @spficklin, but analyses can also be attached to non-features, like phylotree records (which would typically be multi-species so an organism_analysis record may not make sense). Still, it seems it should be possible to create a view that identifies all analyses associated with an organism without adding this table.

@bradfordcondon
Copy link
Contributor Author

bradfordcondon commented Jan 30, 2018

@laceysanderson we originally had an mview that did that, but i have analyses that are not associated via features. Then, adding things to that view manually would be lost if the mview was repopulated. (not to mention writing a field that inserted into the mview broke because it didnt have a primary key...)

@ekcannon
Copy link

@bradfordcondon are these non-feature-related analyses connected to organisms by any other means (however circuitous), or only via organism_analysis? If not, what is an example?

@bradfordcondon
Copy link
Contributor Author

bradfordcondon commented Jan 30, 2018

Great question @ekcannon . Both, I'd say.

  • Transcriptomes or genomes that have no data stored in Chado. (IE an archival analysis where we only offer a zip of the assembly to download)
  • Analysis Expression - One for the expression data. Possibly one for the biomaterials as well. Both could be linked to features, via the MAGE Module...
  • Libaries: On our site, libraries are mostly archival, so no attached chado data. If we used the module properly, we could attach it via library_feature....
  • other analysis types (blast_analysis, interproscan analysis): Not sure about this one, because i dont know that the annotations keep ties to the analysis? I'd have to check for this one.

I suppose the feature-less analyses could have a single dummy feature loaded to associate them, if i, say, unpublished that entity...

@spficklin
Copy link
Contributor

Those are valid use cases. I can see now the need for the table. There can be other analyses that don't generate genomic features. I'm not sure what to do in this case. I would need someone much smarter than I am on highly normalized dbs to weigh in.

IMO I would suggest not to store dummy features...

@bradfordcondon
Copy link
Contributor Author

Is there any sane way to have both the linker table and the mview, and have the mview update the linker table without overwritting non-feature related entries? I recognize the importance of avoiding inconsistencies in the database.

@laceysanderson
Copy link
Contributor

The only way I can think of to ensure database integrity is to have a trigger on analysisfeature that ensures if there is a record in organism_analysis then the organism matches the current feature being inserted/updated. However, the performance hit during loading could be substantial since it would run a query per feature :-(

@bradfordcondon
Copy link
Contributor Author

ensures if there is a record in organism_analysis then the organism matches the current feature being inserted/updated

but what if you have multiple organisms associated with a single analysis?

FYI I am going ahead with the linker table on HWG but I won't necessarily expect it to be part of Chado.

@bradfordcondon
Copy link
Contributor Author

If Tripal users are looking for this, here is the tripal 3 module that defines the linker field to link stuff via organism_analysis.

https://github.com/statonlab/tripal_manage_analyses

@laceysanderson
Copy link
Contributor

Good point on the multiple organisms. Perhaps the trigger could check if the organism for the current feature being inserted is in the organism_analysis table and if not, adds it?

@bradfordcondon
Copy link
Contributor Author

bradfordcondon commented May 4, 2018

I imagine that how Tripal ensures the integrity is an issue for my linker module itself, so I've moved it there: statonlab/tripal_manage_analyses#32.

I dont think there's a way to force the integrity at the Chado level?

@laceysanderson
Copy link
Contributor

You can do it at the Chado level using triggers: https://www.postgresql.org/docs/9.4/static/rules-triggers.html This can be more reliable then ensuring integrity at the application level.

@bradfordcondon
Copy link
Contributor Author

Very cool thank you

@bradfordcondon
Copy link
Contributor Author

bradfordcondon commented May 9, 2018

This is still pseudocode but something like this?

Are there cases where an analysis would have features from multiple organisms? If so I'd need to check that first...

rule for insert is much simpler...

CREATE RULE organism_analysis_update AS ON UPDATE TO feature.organism_id
    DO UPDATE(organism_id), (NEW.organism_id) FROM organism_analysis WHERE organism_id = OLD.organism_id AND analysis_id = 
    (SELECT analysis_id FROM analysis A
    INNER JOIN analysisfeature AF ON A.analysis_id = AF.analysis_id
     INNER JOIN feature F ON F.feature_id = AF.feature_id
     INNER JOIN organism O ON O.organism_id = F.organism_id
    WHERE F.feature_id = NEW.feature_id);

puzzling it out here, also tried a trigger which seems more complicated... https://gist.github.com/bradfordcondon/366c3ca47e93d78badcaba2469885fd9

@laceysanderson
Copy link
Contributor

It looks like the best practices recommendation is to use triggers instead of rules "unless you really know what you are doing or the performance hit from them is unacceptable, as rules can work in unexpected ways." https://stackoverflow.com/questions/5894142/what-are-postgresql-rules-good-for

Ideally you would support cases where an analysis would have features from multiple organisms since that's supported by the database. For example, someone could have aligned an EST set against the current genome where the ESTs are from multiple organisms for comparative analysis.

I looked quickly at the trigger in the gist you linked to. It's looking good so far :-) You would select the analysis_id for the feature using

SELECT analysis_id FROM analysis A
    INNER JOIN analysisfeature AF ON A.analysis_id = AF.analysis_id
     INNER JOIN feature F ON F.feature_id = AF.feature_id
     INNER JOIN organism O ON O.organism_id = F.organism_id
WHERE F.feature_id = OLD.feature_id

@scottcain
Copy link
Member

@bradfordcondon what's the status of this? I'm not convinced it should go into 1.4, but if trigger is working well, we can at least consider it.

@bradfordcondon
Copy link
Contributor Author

the status is that people are using the tripal module to link organisms to analysis, but i never got around to figuring out how to implement lacey's trigger suggestion :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants