Skip to content
This repository has been archived by the owner on Sep 2, 2023. It is now read-only.

--loader hooks for require() #6

Closed
bmeck opened this issue Feb 5, 2018 · 31 comments
Closed

--loader hooks for require() #6

bmeck opened this issue Feb 5, 2018 · 31 comments

Comments

@bmeck
Copy link
Member

bmeck commented Feb 5, 2018

Spawned from a comment on per package hooks .

I'm opening this issue to discuss if we should make the Loader mechanisms we have expand to be able to intercept require() specifiers as well. I think it is doable but requires we split the resolution between sync/async forms. Looking at this as well we might want to split up the resolution of format from path resolution, but that is a lot to discuss for the general first gathering of consensus on if this should be a hook we provide.

@mcollina
Copy link
Member

mcollina commented Feb 5, 2018

I think this is a must have. We should provide a simple enough mechanism from APMs to hook in both esm and cjs.

cc @watson

@ljharb
Copy link
Member

ljharb commented Feb 5, 2018

I think if it’s important enough to warrant first-class support in ESM, it’s important enough to have it in CJS. I also think the inverse is true.

@bmeck
Copy link
Member Author

bmeck commented Feb 5, 2018

@ljharb within reason, ESM was not made with CJS interop as a first class citizen. I would prefer if something is only viable in CJS and not ESM it not be put as a priority.

@giltayar
Copy link

giltayar commented Feb 5, 2018

The migration path is CJS => ESM. Whatever needs to be implemented to ease this transition is important. In this case, implementing loader hooks will help moving code from ESM => CJS. Neat, but not terribly important, IMO.

I would prefer using our resources elsewhere.

Actually, having loader hooks only in ESM may make the transition go faster! (people will move to ESM just to have all the goodies that loader hooks will bring...)

@mcollina
Copy link
Member

mcollina commented Feb 5, 2018

I do not see a short-term future where a Node.js application will not a have a commonjs module in its dependency tree. Having support for cjs loader hooks will provide a safe transition to ESM.
Most enterprise Node.js deployments include an APM solution, which is currently based on monkey patching.
Without cjs hooks, any apm vendor should have two solutions: one for cjs and one for esm. In fact, not supporting this will slow the adoption curve.

@giltayar
Copy link

giltayar commented Feb 5, 2018

@mcollina - wouldn't the existing solutions they have work with their existing CJS modules? If they do, and they should, why bother migrating to another, albeit newer and better, solution?

Having said that, it would be nice to have a solution that works similarly and is configured similarly. But I would say that it's a nice to have, not a must have. If, in terms of code, it's not a big thing, then, yes. But if it requires significant work, I would prefer investing energy elsewhere.

Also, could you explain what "APM" means? 😳

@devsnek
Copy link
Member

devsnek commented Feb 5, 2018

@giltayar APM = Application performance management, mostly hooking into node core to time stuff and catch errors

@bmeck
Copy link
Member Author

bmeck commented Feb 5, 2018

There are also other common workflows like Mocking or testing that use similar workflows of replacing a module with a virtual/temporary one.

@ljharb
Copy link
Member

ljharb commented Feb 5, 2018

It's absolutely essential to any adoption of ESM that things like proxyquire and jest.mock and others have a way to function - iow, there needs to be a way to "mock out" a default or named import.

Supporting this natively in node for ESM is a must; if we can also support it in a similar fashion for CJS, that's very valuable for a migration path - iow, if half my modules are CJS and the other half ESM, i'd want to mock out a foo whether it's required or imported.

@bmeck
Copy link
Member Author

bmeck commented Feb 5, 2018

@ljharb that seems fine, but various things like setters/getters on the bindings are not possible still even though CJS mocking commonly does that to properties off module.exports. If we have specific cases where they don't work (like having getter/setter named exports) in both I'd prefer not to have the functionality exposed.

@ljharb
Copy link
Member

ljharb commented Feb 6, 2018

ftr, I'm totally fine with getters and setters on module.exports not working - I find that an absurd pattern and nothing like it should be possible in ESM anyways. CJS mocking is certainly achievable without getters/setters.

@bmeck bmeck mentioned this issue Feb 7, 2018
@zackschuster
Copy link
Contributor

@bmeck for my own benefit, this is similar to/drawing from npm's loaderhooks spec?

@bmeck
Copy link
Member Author

bmeck commented Feb 7, 2018

@zackschuster this existed prior to theirs so idk which should be seen as drawing from which.

@zackschuster
Copy link
Contributor

@bmeck ah, i didn't know that. it sounded similar to their proposal so i wanted to make sure 😄 thanks!

@bmeck
Copy link
Member Author

bmeck commented Feb 7, 2018

@zackschuster theirs is basically an exact replica of Node's, there are some changes to discuss around problems with the hooks and APIs. In particular Node's intentionally does not require/allow runtime mutation of the loader semantics except by the loader instrumenting it themselves.

@zackschuster
Copy link
Contributor

@bmeck ah, hence the APM references. this is making more sense to me, thank you for taking the time to explain things.

@evanplaice
Copy link

I see two glaring omission in the recommended implementation for loader hooks

A directory containing package.json will have a termination when crossing the directory.

This assumes that the package boundaries will be exposed by each import statement. ESM doesn't support this functionality due to the nature of facades.


Use Case 1:

rxjs/operators/index.js

export { flat } from './flat';
export { flatMap } from './flatMap';

rxjs/index.js

export { flat, flatMap } from 'rxjs/operators';

app.js

import { flat, flatMap } from 'rxjs';

Which would fire, entry, root, or operators hook; if more than one fire, then in what order?


Use Case 2:

At some point, ESM will have full native support in browsers and Node, eliminating the need to transpile down to ES5. Despite that fact, bundling will still be in wide use where code size and number of requests matters.

Take for instance the following article:
https://medium.com/friendship-dot-js/i-peeked-into-my-node-modules-directory-and-you-wont-believe-what-happened-next-b89f63d21558

It's not uncommon for users of NPM packages to express their disdain for the overly bloated and unnecessarily deep dependency trees that come with some packages. One solution to this approach is for a lib to provide optimized production bundles that tree shake any and all unnecessary code.

If/when ESM is reaches full compatibility between Node and browsers, it'll be more common to see FE workflows bleed into universal modules. Assuming a directory/package structure will remain static may work for CJS workflows but fails to take into consideration how ESM will be used.


I don't really have any solutions to these problems. ESM supports some features that favor new approaches that aren't backward-compatible with CJS.

From what I've seen in the front-end ecosystem, the facade design pattern provides a lot of benefits for library development (ie abstracting API from structure, hiding internals) over the traditional way of doing things (ie deep linking). It's already starting to show up in commonly used libraries.

@bmeck
Copy link
Member Author

bmeck commented Feb 10, 2018

@evanplaice I don't really understand your first point since it is laid out depending on package boundaries, but no package.json files are in your minimal case. Per execution order, it is fairly well defined in nodejs/node#18233 .

Which would fire, entry, root, or operators hook; if more than one fire, then in what order?

This appears to be having some names that are missing in the example given.

I'm going to rewrite the question as "What hooks fire when, assuming all the given files have a package boundary in the directory they are located in, and app.js is the entry point?". I'm going to put a leading / in all the paths for my clarity on specifier vs path.

  1. /package.json resolve('rxjs', '/app.js')

I assume ends up resolving to /rxjs/index.js given the example

  1. /rxjs/package.json resolve('./index.js', '/rxjs')

I assume this leaves the path unchanged. That ends up the first resolve and now the loader gets to loading rxjs/index.js.

  1. /rxjs/package.json resolve('rxjs/operators', '/rxjs/index.js')

I assume ends up resolving to /rxjs/operators/index.js given the example

  1. /rxjs/operators/package.json resolve('./index.js', '/rxjs/operators/')

I assume this leaves the path unchanged. That ends up the second resolve and now the loader gets to loading ./flat and ./flatMap.

  1. /rxjs/operators/package.json resolve('./flat', '/rxjs/operators/index.js')

If this doesn't cross a package boundary so is only handled by this loader.

  1. /rxjs/operators/package.json resolve('./flatMap', '/rxjs/operators/index.js')

If this doesn't cross a package boundary so is only handled by this loader.

I'm still unclear on the question and example, but hopefully that gives a clear idea of the order of operations.


For point 2, that article is satire but true. I'm not sure I understand how this relates to supporting --loader hooks to require though.

@mikesamuel
Copy link

I have two use cases for loader hooks that covers CJS modules and would be nice to have for ESMs.


Per "Dynamic Bundling", I'd like to be able to write (parent, child) on every load to build a module dependency graph during testing which I later prune and use to generate bundle metadata.

In production, I use the derived metadata to whitelist loads and check resource integrity.

Right now, I'm hacking updateChildren(parent, children, scan) in lib/module.js.

Both steps require some configuration, so it'd be nice to be able to provision a loader with flags or a pointer to a config file.


I'd also like to experiment with ways to allow restricting access to modules with APIs that are prone to misuse.
Sometimes we design very powerful APIs that require a certain level of expertise to use correctly,
and then wrap those in APIs meant for general use which make it harder to footgun one's self.

In other contexts, we've had success by adding static checks in linkers and build systems to enable a workflow where we carefully review and whitelist the wrapper modules to make sure they use the misuse-prone APIs correctly, then disallow direct access by other modules.

I'd like to do a POC in a node context.

I believe it would be sufficient to

  1. be notified on the end of a module load so we can lock down out-of-band loads of sensitive modules like entries in the module cache
  2. be able to veto before a loader introduces a module to a parent based on a predicate over (parent, child).

(1) might be sufficient because there are tricks we can do with proxy objects and stack introspection but each of those have their own problems.

@bmeck
Copy link
Member Author

bmeck commented Feb 22, 2018

@mikesamuel can you clarify

be able to veto before a loader introduces a module to a parent based on a predicate over (parent, child).

Say I have a loader that wants to do this behavior. It is given ~ (request_string, parent_url) and it returns (url). At what point are trying to stop propagation of the import and produce an error?

Currently loaders do not automatically ever call the "parent" loader. So if this is the only loader, it could cause import to fail by:

  1. throwing an Error manually
  2. returning an invalid URL that cannot be loaded

@bmeck
Copy link
Member Author

bmeck commented Feb 23, 2018

If we put the loaders which are already somewhat isolated into a thread we could use a single async resolve for both module systems. The hooks would need to define what mechanism is being used to load things however so that it can switch on various differences like if the specifier is a URL component for import or a path for require.

@mikesamuel
Copy link

@bmeck

Say I have a loader that wants to do this behavior. It is given ~ (request_string, parent_url) and
it returns (url). At what point are trying to stop propagation of the import and produce an error?

Currently loaders do not automatically ever call the "parent" loader. So if this is the only loader,
it could cause import to fail by:

I'm still trying to wrap my head around all the things that 6.1.2 lets you do.

How does the lack of implicit parent loader chaining affect things?
Is that a concern for vm contexts?

Not chaining on a vetoed resolve seems like it would fail safe, but if an error from a loader
causes a fallback to a sibling loader then I'd prefer option (3) below.

  1. throwing an Error manually
  2. returning an invalid URL that cannot be loaded

I would be happy with either of those or even just

  1. Substitutes a powerless module, e.g. one that exports nothing or
    a stub that is API compatible but which has no side effects.

I think any solution needs to prevent:

  • Prevent module initialization code from running
  • Log sufficient context to make it clear what module didn't load, where the resolution started, and that it was vetoed because of a whitelist constraint.

I'll defer to people who better understand the loading process on whether the runtime should cache the fact that a module load failed and which solution achieves the right balance.

Logs need not reflect multiple attempts to load the same module identifier.

@bmeck
Copy link
Member Author

bmeck commented Feb 23, 2018

@mikesamuel I think the problem might be that you linked to a spec that is no longer being developed to land in browsers or node. Node hooks are minimal and documented at https://nodejs.org/dist/latest-v9.x/docs/api/esm.html

@mikesamuel
Copy link

@bmeck Sorry, I'm an idiot.

That node hook doc makes no mention of "parent loaders." Should I be looking at your "RFC: Per Package Loader Hooks #18233"?

Loaders therefore need to have a concept of a parent loader hooks to defer to, or to ignore.

@bmeck
Copy link
Member Author

bmeck commented Feb 23, 2018

@mikesamuel you can look at that or nodejs/node#18914

@mikesamuel
Copy link

@bmeck.

Thanks. I see the loop at https://github.com/bmeck/node/blob/6b94c9ccbfe70773001260ade15a60f81d583a17/lib/internal/process/modules.js#L56-L111

It looks like each userLoader gets to specify a factory that receives the previous links in the loader chain.

IIUC, earlier resolvers get to control what URL reaches later loaders, but the last loader gets to ultimately decide whether a load resolves to anything.

An error in a resolver would cause line 71 to resolve to a broken promise, so would propagate through the await in the next one.

So it looks like either of your options below would work.

  1. throwing an Error manually
  2. returning an invalid URL that cannot be loaded

An early user loader could defeat either by conspiring with a late user loader but my threat model treats user loaders as trusted code.

@bmeck
Copy link
Member Author

bmeck commented Feb 23, 2018

@mikesamuel can you explain the early loader and late loader vector? Early loaders can already rewrite any loader being setup during bootstrapping.

@mikesamuel
Copy link

@bmeck

If an early loader can find a fixed point URL -- one that the middle loaders return unchanged -- then it can store the specifier in a side table it shares with the late loader and return the fixed point URL.

The middle loaders never see the specifier in the side table.

The late loader can then retrieve the value that the early loader squirreled away from the table.

There side table is just one kind of shared state that serves to make the communication independent of promise resolution order.

// conspiracy.js
exports const urlSideTable = new Map();
// early loader
import urlSideTable from './conspiracy.js'
let counter = 0
export async function resolve(specifier, parentModuleURL, defaultResolver) {
  const substitute = `url-that-other-loaders-will-approve-${counter++}`
  urlSideTable.set(substitute, specifier)
  return {
    url: new URL(substitute).href,
    format: 'esm'
  };
}
// late-loader.js
import urlSideTable from './conspiracy.js'
export async function resolve(specifier, parentModuleURL, defaultResolver) {
  const original = urlSideTable.get(specifier)
  return /* some computation over original */
}

@bmeck
Copy link
Member Author

bmeck commented Feb 23, 2018

@mikesamuel I've thought about completely isolating their global space before, but don't think loaders can be seen as untrusted since you need to set them via CLI flags or environment variables on startup. My suggestion is that you can have your secure loader prevent all the following loaders from sharing the global space by putting them into a vm.Context if you want.

@mikesamuel
Copy link

@bmeck Agreed.

my threat model treats user loaders as trusted code.

@bmeck
Copy link
Member Author

bmeck commented Nov 16, 2018

going to table this for now, we can come back later and reopen if desired.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants