Information passing between phases #965

Open
ittaiz opened this issue Jan 28, 2020 · 9 comments

Comments

@ittaiz
Member

ittaiz commented Jan 28, 2020

We need to define how information should be passed between phases.
The current strategy is for each phase to return a struct; following phases then get a struct with all the previous structs nested inside (keyed by phase name).
This has the following drawbacks:

  1. The contract of what a phase exposes is unclear. Especially if we want to allow people to depend on a semi-public API, it needs to be clearer what a phase returns without rummaging through its return statement.
  2. The current strategy doesn't allow following phases to override or augment the data returned by previous phases. One need is to add files to the exposed runfiles, as shown in unify all of the default info phases implementations #958. Note: the current mechanism does allow a phase that replaces another phase to override its return value by registering under the same name, but that is not enough.
    This means that to achieve this, phases need to traverse all previous structs for named attributes.

We'd like to fix both of the above. It's likely that overrides and augmentation should be achieved by applying the dependency inversion principle: downstream phases define their needs, and upstream phases expose them in whatever strategy we choose.

One option is to have each phase expose a dict of internal providers (keyed by the name of the provider). This dict would also be an input to each phase, so phases would be able to both override and augment internal providers.
Another advantage of providers is that they are recognized in the ecosystem, and tools will support them with respect to docs and such.
To the best of my understanding, providers are not much more than structs.
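To illustrate, here is a minimal sketch of the dict-of-internal-providers option in plain Python (standing in for Starlark; the phase names and the `JarsInfo` key are made up for illustration, not the real rules_scala API):

```python
# Hypothetical sketch in plain Python (standing in for Starlark); the
# phase and provider names are illustrative, not the real rules_scala API.

def phase_collect_jars(providers):
    # Expose an "internal provider" keyed by its name.
    return {"JarsInfo": {"compile_jars": ["a.jar", "b.jar"]}}

def phase_filter_jars(providers):
    # Because the accumulated dict is also an input, a later phase can
    # both read and override what an earlier phase exposed.
    jars = providers["JarsInfo"]["compile_jars"]
    return {"JarsInfo": {"compile_jars": [j for j in jars if j != "b.jar"]}}

def run_phases(phases):
    providers = {}
    for phase in phases:
        # Merging each phase's dict lets it override or augment entries.
        providers.update(phase(providers))
    return providers

result = run_phases([phase_collect_jars, phase_filter_jars])
```

Here `phase_filter_jars` overrides the `JarsInfo` entry, so any later phase reading it sees the filtered list.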

@andyscott has suggested we use well-defined dictionaries or structs.

We should discuss these options, the value each brings, and any other options people might surface.

@ittaiz
Member Author

ittaiz commented Jan 28, 2020

cc @borkaehw

@andyscott
Contributor

andyscott commented Jan 29, 2020

A few notes:

1 - We should audit the interactions between the current phases. I'm interested in:

  • what types are commonly being returned (depset of files, array of files, flags, etc.)
  • recurring field names
  • one-off dependencies between phases, i.e. "the foo phase is the only phase using the bar phase's flibbit field"

At the very least we can use this to establish and document some conventions for fields.

2 - Do we have any cases where a downstream phase needs to override something from an upstream phase? AFAIK we only have two cases of this flavor:

  • We only want one of a particular thing defined by any phase
  • Multiple phases need to add to an aggregate over the course of execution. This seems like a monoid to me.

We can support both of these cases without having overrides.
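The aggregation case above could be sketched like this (plain Python with made-up phase names; it assumes list concatenation as the combine operation, though a Starlark implementation would more likely merge depsets):

```python
# Illustrative sketch (plain Python, hypothetical names): phases
# contribute to an aggregate via an associative combine operation with
# an identity element, i.e. a monoid, so no phase needs to override
# another's result.

def combine(a, b):
    return a + b  # associative; [] is the identity element

def phase_jars(_):
    return ["lib.jar"]

def phase_sources(_):
    return ["src.jar"]

def aggregate(phases):
    acc = []  # start from the identity
    for phase in phases:
        acc = combine(acc, phase(acc))
    return acc

outputs = aggregate([phase_jars, phase_sources])
```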

@ittaiz
Member Author

ittaiz commented Jan 29, 2020

> We should audit the interactions between the current phases

Sounds good. Can you do this audit?

> We can support both of these cases without having overrides.

I guess we can; what's the con of overrides? It sounds to me like a simple and clear pattern.

Additionally, I'd love to understand what the cons of providers are, in your eyes, compared to structs/dictionaries. AFAICT providers are just that, with a bit more structure and familiarity.

@smparkes
Contributor

Moved from #958:

> I'm not convinced that providers are the way to go for passing phase information. Providers are typically used for passing information across rule boundaries: rule implementation a -- target graph --> rule implementation b. Here we are passing information between functions within the same rule implementation, i.e. phase 1 --> phase 2, all within rule implementation c. Perhaps we should just use well-structured structs or dictionaries?

I'm consciously a bystander on the phase stuff. I'm glad you all have got this (however it falls out).

FWIW, I did have similar feelings to the comment above.

This is what I understand (you all may already know this):

I got the impression providers were intended for "inter-target" communication: rule calls return provider instances, and those instances get attached to targets created by the rule call, which (1) can be type-enforced (I think via the `providers` parameter on label attrs) by rules depending on the target, and (2) can be read by things that depend on those targets, in a way that the provider type becomes the attribute/key.

This doesn't seem to relate to phases?

Putting that special behavior aside, I did explore what behavior providers have beyond what you get with structs. Once they're created, it doesn't look like much. They type to "struct" and have to_json like structs do.

They do have the provider declaration for documentation. And they do seem to enforce a little bit at construction: you can't add a field that isn't declared. You can leave out fields: that doesn't do anything that a struct constructor without the field wouldn't.
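The construction-time behavior described above can be modeled in a few lines of plain Python (a toy stand-in, not Starlark's actual provider implementation; `JarsInfo` and its fields are invented for the example):

```python
# Toy Python model of the construction-time checks described above (not
# Starlark's real provider): undeclared fields are rejected, and
# declared fields may be omitted.

def provider(fields):
    declared = set(fields)

    def ctor(**kwargs):
        undeclared = set(kwargs) - declared
        if undeclared:
            raise ValueError("undeclared fields: %s" % sorted(undeclared))
        return dict(kwargs)  # omitted fields are simply absent

    return ctor

JarsInfo = provider(fields=["compile_jars", "runtime_jars"])

info = JarsInfo(compile_jars=["a.jar"])  # runtime_jars omitted: allowed
# JarsInfo(srcs=[]) would raise, since srcs is not declared
```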

I think I'll avoid providers when I'm not using them in the rule impl/dep role. I find the value-add low compared to the confusion it causes me ("what target/rule is this related to?").

I will say that I find the current factoring of the scalac provider stuff a bit confusing.

_declare_scalac_provider runs a rule to create a target which does happen to have the provider, but what is scalac_default itself? At this point, I'm pretty comfortable saying this should be a toolchain, though I think it needs to be factored differently. Handling multiple-version toolchains will drive this, because the current global naming can't support that.

@andyscott
Contributor

> Sounds good. Can you do this audit?

Can do, but it will have to be a few weeks from now.

> I guess we can, what's the con of overrides? It sounds to me like a simple and clear pattern.

I guess I just don't see where we need to use overrides. AFAIK we either have exactly one thing we need to return (like the final executable) or we have an aggregation (like the output files). @borkaehw might be able to weigh in too, as they've pushed on the phase architecture quite a bit.

> Additionally I'd love to understand what are the cons in your eyes to providers compared to structs/dictionaries. AFAICT providers are that with a bit of more structure and familiarity.

What @smparkes said 😄

@ittaiz
Member Author

ittaiz commented Feb 21, 2020

I've been meaning to write here.
I think the cons mentioned here are low; the value-add is bringing some more types and structure into the APIs. The Bazel ecosystem currently uses providers between rules, but at the same time it doesn't use phases at all. Hopefully both will change.
The way I see it, our current architecture means that the APIs between phases are at least semi-public, if not public, and today they are too vague and string-based.
I'd like to change that.

Re overrides: I think the extra power is valuable.

@eed3si9n

As a concrete example, I'm looking to filter out some JARs after the collect_jars phase. It looks like phase_compile uses the JARs from the struct named collect_jars:

```
jars = p.collect_jars.compile_jars
rjars = p.collect_jars.transitive_runtime_jars
transitive_compile_jars = p.collect_jars.transitive_compile_jars
jars2labels = p.collect_jars.jars2labels.jars_to_labels
deps_providers = p.collect_jars.deps_providers
```

So the only way this can be customized is if I substitute the collect_jars phase?

Ideally I want the original collect_jars phase to do its thing, and then in a subsequent phase add or subtract JARs based on some inclusion or exclusion logic.
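A sketch of what such an extension point could look like, in plain Python standing in for Starlark (the phase names, jar names, and the override mechanism are all hypothetical, not the current rules_scala behavior being asked about):

```python
# Hypothetical sketch (Python stand-in for Starlark): the original
# collect_jars phase runs unchanged, then a later phase overrides the
# entry that phase_compile reads, applying an exclusion.

from types import SimpleNamespace as struct  # stand-in for Starlark struct

def phase_collect_jars(p):
    return struct(compile_jars=["scala-library.jar", "guava-28.jar"])

def phase_exclude_guava(p):
    kept = [j for j in p.collect_jars.compile_jars
            if not j.startswith("guava")]
    # Returning under the "collect_jars" key overrides the earlier
    # result, so downstream phases see the filtered list.
    return {"collect_jars": struct(compile_jars=kept)}

def run_phases(phases):
    acc = {}
    for name, phase in phases:
        result = phase(struct(**acc))
        acc.update(result if isinstance(result, dict) else {name: result})
    return struct(**acc)

p = run_phases([("collect_jars", phase_collect_jars),
                ("exclude_guava", phase_exclude_guava)])
```

With this shape, `phase_compile` would keep reading `p.collect_jars.compile_jars` and transparently see the filtered jars.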

@liucijus
Collaborator

@eed3si9n can you give more context on what you are trying to achieve that requires changing the jar list? Currently collect_jars has more than one responsibility (collecting jars according to the dependency strategy, and mapping labels to jars for the dep analyzer). Maybe it needs to be refactored into more phases.

> As a concrete example, I'm looking to filter out some JARs after collect_jars phase. It looks like phase_compile uses the JARs from the struct named collect_jars:
>
>     jars = p.collect_jars.compile_jars
>     rjars = p.collect_jars.transitive_runtime_jars
>     transitive_compile_jars = p.collect_jars.transitive_compile_jars
>     jars2labels = p.collect_jars.jars2labels.jars_to_labels
>     deps_providers = p.collect_jars.deps_providers
>
> so the only way this can be customized is if I substitute collect_jars phase?
>
> Ideally I want the original collect_jars phase to do its thing, and then in a subsequent phase I can add or subtract the JARs based on some inclusion or exclusion logic.

@eed3si9n

eed3si9n commented Jun 1, 2021

I work for Twitter, and we're trying to migrate our existing Pants codebase to Bazel, since the BUILD files are close enough. Not so surprisingly, where we run into a major difference is the timing of third-party resolution: Pants is able to perform Coursier resolution per target. I'm implementing multiversioning support so that build users can declare multiple versions of Kafka etc., and different parts of the build graph can use different versions.

There are, however, situations where the majority of users would want to use version X of something like Guava, but some need to stay behind or ahead for specific reasons. Allowing a terminal node to perform lightweight filtering can solve this sticky problem of ending up with multiple Guavas on the classpath.

Another similar use case is allowing build users to exclude specific files from deployment images. I assume this is also done to handle conflicts with data-processing runtimes such as Hadoop, or to swap out logging libraries.
