RFC: user defined functions #407

jdidion · 2020-10-19T12:42:38Z

jdidion
Oct 19, 2020
Maintainer

Cross-posted from #405

Currently, the WDL specification provides a small library of functions that meet the needs of many use-cases, but certainly not all of them. The ability to define new functions has been requested several times in the past. This proposal aims for an idiomatic specification of UDFs.

Example:

task call_funcs {
  input {
    File infile
  }

  command <<<
  process_yaml < ~{read_yaml(infile)} > output.txt
  >>>

  output {
    File outfile = "output.txt"
  }
}

func read_yaml {
  input {
    File infile
    String? encoding
  }

  command <<<
  import yaml
  with open("~{infile}", "r", encoding="~{encoding}") as inp:
    y = yaml.read(inp)
  y.pretty_print()
  >>>

  output {
    File outfile = read_string(stdout())
  }

  runtime {
    container: "python_with_yaml"
    interpreter: "python"
  }
}

The signature of the above function is: String read_yaml(File, String?)

User-defined functions are similar to tasks, with the following differences:

User-defined functions begin with the func keyword.
The order of the input parameters matters - the (left-to-right) function signature is the set of input parameters ordered from top-to-bottom.
There may be at most one optional input parameter (which may or may not have a default value), and it must be the last parameter in the signature.
Only a single output parameter is allowed.
Runtime attributes and Hints: TBD - if functions are executed in the same process as the calling task, runtime/hints must somehow be merged with the task's runtime/hints.
There is one function-specific runtime attribute:
- sections: the section(s) in which the function may be used; defaults to "*", may be a String or Array[String] with one of the four task sections that allow expressions (input, output, command, runtime)

Similar to structs, funcs exist in a common namespace (regardless of in which WDL file they are defined); however, funcs cannot be aliased, so there must not be any name collisions between funcs defined in different WDL files in the import tree.

Once defined, a func may be used by its (unqualified) name in any command block.

In conjunction with the proposed addition of the interpreter runtime attribute, users will be able to write functions in a variety of programming languages. This raises the question of how to support functions written in different languages, or a function written in a different language than the command block. There are a few possible solutions:

Require that the task container (or host) environment provides all of the interpreters required by all of the functions used in the command block.
Use a solution such as docker compose or docker run --link to enable the commands to access executables across containers. This means that each function would need to specify its container, and the runtime would be required to dynamically compose the container of the task and all functions used by that task.
Execute functions in their own environments, e.g. subprocesses or separate workers. This makes executing a task similar to executing a workflow.

mlin · 2020-10-20T02:05:23Z

mlin
Oct 20, 2020
Maintainer

I think the global namespace & aliasing system has some issues (related: #280) but it might be even-more problematic without the aliasing part. Then there could be WDL documents that can't be imported together.

A lot of the most useful/reusable functions might be polymorphic, any thoughts on how to deal with that in the future?

0 replies

jggatter · 2020-10-21T21:54:18Z

jggatter
Oct 21, 2020

From a user standpoint I'm thinking it may be confusing to discover a workflow where task-executed code has been tucked away inside a block of code independent of the task. While I can see value in the use case where more than one task is executing this code very frequently, I personally think the better practice is either simply writing it in the command block or importing it from a script found in the image.

I've always thought that user-defined functions would be fantastic, but I was thinking more in terms of interacting with WDL datatypes like Arrays, Objects/Structs, etc.. I've never been a fan of the philosophy that manipulations of WDL types outside of what the standard library provides have to be done within a VM. It would look very similar to this mock-up but instead the command code would have to be written in the language chosen by the execution engine.

Ultimately I'm not really opposed to the idea of using it for runtime code if people find it valuable! I also really do like the interpreter idea!

0 replies

rhpvorderman · 2020-11-10T10:12:13Z

rhpvorderman
Nov 10, 2020
Collaborator

If I may quote the zen of python:

There should be one-- and preferably only one --obvious way to do it.

Every user-specific function can also be implemented as a task. This has numerous advantages

keep language simple and accessible. This is one of the core goals of WDL.
Keep language simpler to implement. It is already very hard, due to its design choice of unloading all the complexity on the engine developers.

Less is more in this case. Speaking from my experience developing WDL workflows, I may need at most 1 or 2 tasks per workflow which is a more "function-like" operation (yaml to json conversion for instance). This can easily be achieved using a task with a python container.

So my primary question is: what can user-defined functions do what tasks can't do? If there is no value add, I prefer to keep the language simpler.

0 replies

jdidion · 2020-11-10T16:40:14Z

jdidion
Nov 10, 2020
Maintainer Author

UDFs are partly a matter of convenience, and partly a matter of optimization.

First, it should be explicitly stated that UDFs are intended for relatively simple, fast operations (similar to what I'd write using a javascript expression in CWL).

Let's take an example:

# With UDFs
task foo {
  input {
    String s
  }
  command <<<
    echo ~{myfunc(s)}
  >>>
  output {
    String out = read_string(stdout())
  }
}

function myfunc { ... }

# Without UDFs

workflow foo {
  input {
    String s
  }
  call myfunc { input: s = s }
  call foo { input: s = myfunc.out }
  output {
    String out = foo.out
  }
}

task myfunc { ... }
task foo { ... }

With UDFs there is 1) less code, and 2) a single task to run vs a workflow with subtasks. On a local machine, the runtime won't be that different because spinning up a subprocess is relatively fast. However, in the cloud, it takes up to several minutes to spin up each worker, and it seems wasteful just to execute a (presumably) trivial task like myfunc. In the first example, myfunc can be assumed to be short-running, and thus can be run in a sub-process of the task executor (although the engine could chose to execute it on a separate worker).

4 replies

rhpvorderman Nov 11, 2020
Collaborator

Well it is more like a single task vs two tasks, since that single task will be part of a workflow anyway and the myfunc task can be added as part of that workflow. There is no need to create a subworkflow. So the savings are one call statement. Which is one line in your example.

As regards to the wasteful spinning up of stuff in the cloud: you can set a hint that this job should be executed locally in the runtime section. So that solves that problem.

Adding extra syntax to save just one line just does not seem worth it to me.

EDIT: There is another argument too. All call statements are executed in bash. So if it is really a simple function it can be added as part of the call statement of mytask unless the function is used a lot among tasks, but I find that use case to be less common.

I am not denying the performance implications of running a whole other task instead of a function. On SLURM we also have to wait 5 minutes for a 2 second task to finish, however that 5 minutes is usually peanuts in the grand scheme of things (most workflows run 12 hours +). We haven't cared enough to make these tasks explicitly local even though that option is there. So that is also part of the reason that UDFs are not solving an important problem. But your mileage my vary of course. Are there any use cases where this was a big problem?

jdidion Nov 11, 2020
Maintainer Author

You are assuming the engine only supports running workflow. With miniwdl and dxCompiler you can execute a single task.

rhpvorderman Nov 11, 2020
Collaborator

I am assuming you are using a workflow engine to run workflows with multiple tasks. I only ever run tasks singularly if I want to test those specific tasks. If I want to run one program, I can just do that on the command line.

rhpvorderman Nov 11, 2020
Collaborator

To be clear: I see that UDFs have added value. But the value add is very minor. The edge cases where it applies can be circumvented by using a task that runs a proper programming language or by doing some bash magic in the task run section.

The Workflow Description Language (WDL) is a way to specify data processing workflows with a human-readable and -writeable syntax. WDL makes it straightforward to define analysis tasks, chain them together in workflows, and parallelize their execution. The language makes common patterns simple to express, while also admitting uncommon or complicated behavior; and strives to achieve portability not only across execution platforms, but also different types of users. Whether one is an analyst, a programmer, an operator of a production system, or any other sort of user, WDL should be accessible and understandable.

This is why I feel new language elements should be added sparingly. Less keywords means that it is easier to hold the entire language in your head. It will make it more easy to learn and more easy to master. I don't want WDL to become a programming language itself. I can already code in Python and Scala well enough to serve that particular need, and I would hate it if the scope of WDL would extend to the detriment of its intentional use: fast workflow writing and reading.

Are there use cases where it is impossible to write a workflow because of the lack of user-defined functions?

geoffjentry · 2020-11-11T14:52:14Z

geoffjentry
Nov 11, 2020
Maintainer

I've never really been a fan of this concept. I think it adds far too much in the way of cognitive complexity for far too little gain in terms of the boundary line between DSL and general purpose programming language.

In regards to resources and runtimes, I'd rather see things go the route of hints for local execution, bin packing, etc.

0 replies

patmagee · 2020-11-13T15:14:03Z

patmagee
Nov 13, 2020
Maintainer

While I can see the merits of this in "seemingly" granting much more flexibility to the user, I tend to agree with @geoffjentry here that the benefits do not necessarily outweigh the cognitive cost of it.

Functions must be callable both in the context of a task, as well as in the context of general expression evaluation from anywhere in the workflow. In the best case, ALL workflow engines will be able to natively run the code snippet without any additional work. However, I feel like this best case scenario is not realistic and what this ends up becoming is syntatic obfuscation on a task call. Since functions will require a runtime, usage of a function adds an implicit dependency on a "task like" construct. Ie the follwoing are functionally equivalent

task foo {
  input {
    File bar
  }
  command <<<
    echo ~{my_func(bar)
  >>>
}


func my_func {
  input {
    File inp
   }
  command <<<
    cat ~{inp}
  >>>

  output {
    String out = read_string(stdout())
  }
}

task foo {
  input {
   String bar
  }
  command <<<
   echo ~{my_func(bar)
  >>>
}


task my_task {
  input {
   File inp
  }
  command <<<
   cat ~{inp}
  >>>
  output {
   String out = read_string(stdout())
  }
}

workflow {
  call my_task
  call foo {
    input: my_task.out
  }

}

While we can "encourage" people to keep functions short and lightweight, I feel like in practice that is not a realistic goal. I dont see it long before people start using the func construct to wrap things like calls to bwa or other tools and having ad-hoc alignments from within a workflow without explicitly calling a task. If engines adopt a policy where they run the task in a local Docker you will run into SERIOUS performance issues if they need to, pull down files or run with resources. Additionally there are going to be many deployment scenarios where this is not an option. IE we use kubernetes and it would be a substantial lift to get this working

There is also the added issue that, currently since engines implement functions, they are able to optimize those functions on a per BE basis. IE you can use cloud specific API's for calculating things like file sizes, md5 sums, etc without having the overhead of needing to spin up a VM and run the task.

TLDR: I dont think this will make engine implementors or workflow creators jobs any easier

Alternative Proposal

What if instead of allowing users to completely define arbitrary behaviour, a UDF is simplified into an alias for an expression. Users would only be allowed to use fucntions currently definined in the specification, but they can have much more complex behaviour by chaining multiple expressions together.

func do_stuff {
  input {
    String message
    File f
  }
  # outputs would take the semantics of `keyword` `Type` `expression`
  output  String "The current file size for ~{basename(f)} is ~{size(f,"GB")}"
  }
}

1 reply

jdidion Nov 19, 2020
Maintainer Author

I think that might be a nice addition, but it doesn't address the main reason I want UDFs, which is to provide behavior not allowed by functions (or combinations of functions) in the spec.

jdidion · 2020-11-19T17:08:04Z

jdidion
Nov 19, 2020
Maintainer Author

Here is an example of why I'd really like to have UDFs: #423

0 replies

jdidion · 2020-11-19T17:18:11Z

jdidion
Nov 19, 2020
Maintainer Author

After thinking about this more, I can see a way to add UDFs that don't require a change to the spec but are still portable. Consider the following workflow:

workflow engine_specific_behavior {
  input {
    File f
  }
  call myfunc { input: f = f }
  output {
    File out = myfunc.out
  }
}

task myfunc {
  input {
    File f
  }
  command {
    ...do something...
  }
  output {
    File out
  }
}

The workflow engine that desires to provide a native implementation of myfunc can substitute the task call in the workflow with a function call, and other workflow engines will continue to work by calling the (perhaps less efficient) task. And if the user doesn't care about about their workflow being portable, then myfunc is just a no-op stub that's just provided to make the type-checker happy.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: user defined functions #407

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 8 comments 5 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

RFC: user defined functions #407

jdidion Oct 19, 2020 Maintainer

Replies: 8 comments · 5 replies

mlin Oct 20, 2020 Maintainer

jggatter Oct 21, 2020

rhpvorderman Nov 10, 2020 Collaborator

jdidion Nov 10, 2020 Maintainer Author

rhpvorderman Nov 11, 2020 Collaborator

jdidion Nov 11, 2020 Maintainer Author

rhpvorderman Nov 11, 2020 Collaborator

rhpvorderman Nov 11, 2020 Collaborator

geoffjentry Nov 11, 2020 Maintainer

patmagee Nov 13, 2020 Maintainer

Alternative Proposal

jdidion Nov 19, 2020 Maintainer Author

jdidion Nov 19, 2020 Maintainer Author

jdidion Nov 19, 2020 Maintainer Author

jdidion
Oct 19, 2020
Maintainer

Replies: 8 comments 5 replies

mlin
Oct 20, 2020
Maintainer

jggatter
Oct 21, 2020

rhpvorderman
Nov 10, 2020
Collaborator

jdidion
Nov 10, 2020
Maintainer Author

rhpvorderman Nov 11, 2020
Collaborator

jdidion Nov 11, 2020
Maintainer Author

rhpvorderman Nov 11, 2020
Collaborator

rhpvorderman Nov 11, 2020
Collaborator

geoffjentry
Nov 11, 2020
Maintainer

patmagee
Nov 13, 2020
Maintainer

jdidion Nov 19, 2020
Maintainer Author

jdidion
Nov 19, 2020
Maintainer Author

jdidion
Nov 19, 2020
Maintainer Author