RFC: user defined functions #407
Replies: 8 comments 5 replies
-
I think the global namespace & aliasing system has some issues (related: #280), but it might be even more problematic without the aliasing part: there could then be WDL documents that can't be imported together. A lot of the most useful/reusable functions might be polymorphic; any thoughts on how to deal with that in the future?
-
From a user standpoint, I think it may be confusing to discover a workflow where task-executed code has been tucked away inside a block of code independent of the task. While I can see value in the case where more than one task executes this code very frequently, I personally think the better practice is either simply writing it in the command block or importing it from a script found in the image. I've always thought that user-defined functions would be fantastic, but I was thinking more in terms of interacting with WDL datatypes like Arrays, Objects/Structs, etc. I've never been a fan of the philosophy that manipulations of WDL types beyond what the standard library provides have to be done within a VM. It would look very similar to this mock-up, but the command code would instead have to be written in the language chosen by the execution engine. Ultimately, I'm not really opposed to the idea of using it for runtime code if people find it valuable! I also really do like the interpreter idea!
-
If I may quote the zen of python:
Every user-specific function can also be implemented as a task, and this has numerous advantages. Less is more in this case. Speaking from my experience developing WDL workflows, I need at most 1 or 2 tasks per workflow that are more "function-like" operations (YAML-to-JSON conversion, for instance). This can easily be achieved using a task with a Python container. So my primary question is: what can user-defined functions do that tasks can't? If there is no added value, I prefer to keep the language simpler.
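For instance, the kind of "function-like" operation described above can be written as an ordinary task. The following is a sketch only; the container image, the use of PyYAML, and all names are assumptions, not part of any proposal in this thread:

```wdl
version 1.1

task yaml_to_json {
  input {
    File yaml_file
  }
  command <<<
    # One-liner conversion; assumes PyYAML is installed in the image
    python3 -c 'import json, sys, yaml; print(json.dumps(yaml.safe_load(open(sys.argv[1]))))' ~{yaml_file}
  >>>
  runtime {
    # Hypothetical image; a real one would need PyYAML pre-installed
    docker: "python:3.12-slim"
  }
  output {
    File json_file = stdout()
  }
}
```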
-
UDFs are partly a matter of convenience, and partly a matter of optimization. First, it should be explicitly stated that UDFs are intended for relatively simple, fast operations (similar to what I'd write using a JavaScript expression in CWL). Let's take an example:

```wdl
# With UDFs
task foo {
  input {
    String s
  }
  command <<<
    echo ~{myfunc(s)}
  >>>
  output {
    String out = read_string(stdout())
  }
}

function myfunc { ... }
```

```wdl
# Without UDFs
workflow foo_wf {
  input {
    String s
  }
  call myfunc { input: s = s }
  call foo { input: s = myfunc.out }
  output {
    String out = foo.out
  }
}

task myfunc { ... }
task foo { ... }
```

With UDFs there is (1) less code, and (2) a single task to run vs. a workflow with subtasks. On a local machine, the runtime won't be that different because spinning up a subprocess is relatively fast. However, in the cloud, it can take up to several minutes to spin up each worker, and it seems wasteful to do that just to execute a (presumably) trivial task like `myfunc`.
-
I've never really been a fan of this concept. I think it adds far too much cognitive complexity for far too little gain, given where it pushes the boundary between a DSL and a general-purpose programming language. With regard to resources and runtimes, I'd rather see things go the route of hints for local execution, bin packing, etc.
-
While I can see the merits of this in "seemingly" granting much more flexibility to the user, I tend to agree with @geoffjentry here that the benefits do not necessarily outweigh the cognitive cost. Functions must be callable both in the context of a task and in the context of general expression evaluation from anywhere in the workflow. In the best case, ALL workflow engines would be able to natively run the code snippet without any additional work. However, I feel like this best-case scenario is not realistic, and what this ends up becoming is syntactic obfuscation of a task call. Since functions will require a runtime, usage of a function adds an implicit dependency on a "task-like" construct, i.e. the following are functionally equivalent:

```wdl
# With a func
task foo {
  input {
    File bar
  }
  command <<<
    echo ~{my_func(bar)}
  >>>
}

func my_func {
  input {
    File inp
  }
  command <<<
    cat ~{inp}
  >>>
  output {
    String out = read_string(stdout())
  }
}
```

```wdl
# Without a func
task foo {
  input {
    String bar
  }
  command <<<
    echo ~{bar}
  >>>
}

task my_task {
  input {
    File inp
  }
  command <<<
    cat ~{inp}
  >>>
  output {
    String out = read_string(stdout())
  }
}

workflow main {
  call my_task
  call foo {
    input: bar = my_task.out
  }
}
```

While we can "encourage" people to keep functions short and lightweight, I feel like in practice that is not a realistic goal. I don't see it being long before people start using the `func` construct for work that is anything but lightweight.

There is also the added issue that, since engines currently implement the standard library functions themselves, they are able to optimize those functions on a per-backend basis, i.e. you can use cloud-specific APIs for calculating things like file sizes, md5 sums, etc. without the overhead of needing to spin up a VM and run a task.

TL;DR: I don't think this will make engine implementors' or workflow creators' jobs any easier.

**Alternative Proposal**

What if, instead of allowing users to completely define arbitrary behaviour, a UDF were simplified into an alias for an expression? Users would only be allowed to use functions currently defined in the specification, but they could achieve much more complex behaviour by chaining multiple expressions together.

```wdl
func do_stuff {
  input {
    String message
    File f
  }
  # outputs would take the semantics of `keyword` `Type` `expression`
  output String "The current file size for ~{basename(f)} is ~{size(f, "GB")}"
}
```
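Under this alternative, a call site might look like the following. This is a hypothetical sketch: the function-application syntax and the task are assumptions, since the proposal above only shows the `func` body:

```wdl
# Hypothetical use of the aliased-expression `func` above; the engine would
# inline the output expression, so no extra container or VM is needed.
task report {
  input {
    File f
  }
  command <<<
    echo ~{do_stuff("size report", f)}
  >>>
  output {
    String out = read_string(stdout())
  }
}
```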
-
Here is an example of why I'd really like to have UDFs: #423
-
After thinking about this more, I can see a way to add UDFs that requires no change to the spec but keeps workflows portable. Consider the following workflow:
A workflow engine that wants to provide a native implementation of myfunc can substitute the task call in the workflow with a function call, while other workflow engines continue to work by calling the (perhaps less efficient) task. And if the user doesn't care about their workflow being portable, then myfunc is just a no-op stub provided to make the type-checker happy.
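The workflow from the original comment was not preserved in this extraction; a minimal sketch of the stub-task idea might look like this (all names and the command body are hypothetical):

```wdl
version 1.1

# A portable stub: an engine that recognizes `myfunc` may replace this call
# with a native, in-process implementation; every other engine just runs
# the task as written.
task myfunc {
  input {
    String s
  }
  command <<<
    # Trivial stand-in logic for illustration
    echo "~{s}" | tr '[:lower:]' '[:upper:]'
  >>>
  output {
    String out = read_string(stdout())
  }
}

workflow example {
  input {
    String s
  }
  call myfunc { input: s = s }
  output {
    String out = myfunc.out
  }
}
```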
-
Cross-posted from #405
Currently, the WDL specification provides a small library of functions that meet the needs of many use-cases, but certainly not all of them. The ability to define new functions has been requested several times in the past. This proposal aims for an idiomatic specification of UDFs.
Example:
The signature of the above function is:

```
String read_yaml(File, String?)
```
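The example definition itself was lost in extraction; given the signature above and the `interpreter` attribute discussed below, a `read_yaml` func under this proposal might look something like the following. This is a hypothetical reconstruction, not the original example; the parameter names, the `interpreter` syntax, and the availability of PyYAML are all assumptions:

```wdl
func read_yaml {
  input {
    File yaml_file
    String? key
  }
  runtime {
    interpreter: "python3"
  }
  command <<<
    import json, yaml  # assumes PyYAML is available in the environment
    with open("~{yaml_file}") as fp:
        doc = yaml.safe_load(fp)
    key = "~{default="" key}"
    print(json.dumps(doc[key] if key else doc))
  >>>
  output {
    String out = read_string(stdout())
  }
}
```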
User-defined functions are similar to tasks, with the following differences:

- A function is defined using the `func` keyword.
- The order of `input` parameters matters - the (left-to-right) function signature is the set of input parameters ordered from top-to-bottom.
- Similar to `struct`s, `func`s exist in a common namespace (regardless of which WDL file they are defined in); however, `func`s cannot be aliased, so there must not be any name collisions between `func`s defined in different WDL files in the import tree.
- Once defined, a `func` may be used by its (unqualified) name in any command block.

In conjunction with the proposed addition of the `interpreter` runtime attribute, users will be able to write functions in a variety of programming languages. This raises the question of how to support functions written in different languages, or a function written in a different language than the command block. There are a few possible solutions:

- Use `docker compose` or `docker run --link` to enable the commands to access executables across containers. This means that each function would need to specify its container, and the runtime would be required to dynamically compose the container of the task and all functions used by that task.