Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scatter over multiple items with literate names (as syntactic sugar) #279

Open
jtratner opened this issue Jan 4, 2019 · 7 comments
Open

Comments

@jtratner
Copy link
Contributor

jtratner commented Jan 4, 2019

While the pair syntax is helpful, if you have three or more arrays you're trying to zip, the syntax gets pretty messy.

For example, right now I can do:

scatter(pair in zip(zip(sample_bams, sample_metrics), sample_vcfs))) {
  call task { input: bam=pair.left.left, metric=pair.left.right, vcf=pair.right }
}

But I'd much prefer to be able to say:

scatter(bam, metric, vcf in zip(sample_bams, sample_metrics, sample_vcfs)) {
    call task{ input: bam=bam, metric=metric, vcf=vcf}
}

If the concern is scoping rules, perhaps it could be the equivalent to:

scatter(pair in zip(zip(sample_bams, sample_metrics), sample_vcfs)) {
    FIle bam = pair.left.left
    FIle metric = pair.left.right
    File vcf = pair.right
    call task { input: bam=bam, metric=metric }
}
@patmagee
Copy link
Member

patmagee commented Jan 8, 2019

I am not sure how you are generating the upstream information, but would it be possible to use a struct to encapsulate all of the information you are trying to group together?

IE:

struct SampleInfo {
    File bam
    File metric
    File vcf
}

workflow myWorkflow {
  Array[SampleInfo] sampleInfo

  scatter(info in SampleInfo){
     call task { input: bam=info.bam, metric=info.metric, vcf=info.vcf }
  }
}

@geoffjentry
Copy link
Member

Pair was never intended to be the final word. The origin of it was really the need for a tuple construct, but the driving use case at the time only needed tuple2 and the implementation of a Pair in Cromwell was a lot simpler than a arbitrarily sized tuple structure so that's what we did. We figured that we'd always have backwards compatibility if tuple was ever realized by just making Pair an alias for a tuple2

So I'd also consider if an arbitrary tuple would make sense, and if so, if there's something that's providing that's not provided by a struct (and/or just syntactic sugar over a struct)

@jtratner
Copy link
Contributor Author

Thanks for the response!

I am not sure how you are generating the upstream information, but would it be possible to use a struct to encapsulate all of the information you are trying to group together?

I'm not sure :) . Right now this is the combination of a number of calls (slightly contrived example, but gets at the root of it):

scatter(fastq in fastqs) {
    call bwa { input: fastq=fastq}
}
scatter (bam in bwa.bam) {
   call filter_bam { input:  bam=bam}
}
call calc_metrics { input: bams=filter_bam.bam}

scatter (bam, filtered_bam, metric in zip(bwa.bam, filter_bam.out_bam, calc_metrics.metric)) {
 call ...
}

The reason I wouldn't just put this all in one scatter is because I have a normalization step that takes in all fastqs (or I might want to call out to some subworkflows). Regardless I end up with a bunch of lists and I'm not sure how to convert that into structs.

Does that make sense?

So I'd also consider if an arbitrary tuple would make sense, and if so, if there's something that's providing that's not provided by a struct (and/or just syntactic sugar over a struct)

Arbitrary tuple makes sense in the sense that it's pretty natural to want to iterate over a set of arrays that are all have items in the same order (esp because different scatters still guarantee output in same order)

@patmagee
Copy link
Member

@jtratner One way you could go about doing this would be to build a series of structs or even using a Map[String,File] type, (or one struct with optional parameters). The struct approach may be slightly dependent on struct literals working according to their new definition, so I will provide an answer using maps:

# Approach using Maps which can be implemented now
scatter(fastq in fastqs) {
    call bwa { input: fastq=fastq}
    Map[String,File] intermediate_1 = {"fastq":intermediate["fastq"],"bam":bam}
}
scatter (intermediate in intermediate_1) {
   call filter_bam { input:  bam=intermediate["bam"]}
   Map[String,File] intermediate_2 =  {"fastq":intermediate["fastq"],"bam":intermediate["bam"],"filter_bam": filter_bam.bam}
}

# filter_bam is still defined so you can still call it and receive a `Arrary[File]` type
call calc_metrics { input: bams=filter_bam.bam}


scatter (intermediate in intermediate_2) {
  File bam = intermediate["bam"]
  File fastq = intermediate["fastq"]
  File filt_bam = intermediate["filter_bam"]
 call ...
}

@ghost
Copy link

ghost commented Feb 8, 2021

Is it possible to extract a key from an array of structs? I.e. get an array of File bams from the struct SampleInfo above? Or can you coerce a zip of arrays to an array of structs? Sorry just realized zip can only zip 2 arrays max. Makes sense since on files have an index and not another auxiliary file.

@jdidion
Copy link
Collaborator

jdidion commented Feb 7, 2024

I agree with @geoffjentry - Tuple would be more generally useful than Pair. We might want to look at deprecating the later in favor of the former in a future version. Then we could also 1) add tuple destructuring as a language feature, and 2) allow zip to take any number of arguments. The combination would address the OP's request.

@jdidion
Copy link
Collaborator

jdidion commented Feb 7, 2024

If we switch from Pair to Tuple I'd also get rid of the left/right naming as well. I'm ambivalent about whether to use array-style indexing vs named elements, e.g., t.0 to get the first element.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants