re-entrancy, vm bugs, double spending oh my #375

phritz · 2018-04-29T07:10:47Z

We hit a few VM problems recently that rise to the level of a mini-postmortem. Please comment on anything that I missed or got wrong.

Background

Last week @acruikshank hit the following issue:

I'm trying to get the payment channel actor to send funds to the target when they call update. I can call the Send method from within the actors execution and it updates the target account actor (in memory). The problem is the execution has been triggered by the target account sending a message, so any updates to that account get blown away as we're leaving the message call: https://github.com/filecoin-project/go-filecoin/blob/80a31fc6a08ae75adbc2147b6cbb8908bbbe175e/core/processor.go#L238.

The problem was that when sending messages between actors we were saving actors in an outer context (ApplyMessage) rather than in the inner context (Send). ApplyMessage is the entrypoint and calls Send. Send is used by actors to send messages to each other. However! Actors (in particular their balance) were saved in ApplyMessage. There were two problems with this. First, we should be saving in Send so when actors transfer value in an inner call it is recorded. Second, in ApplyMessage we continue to use an original copy of the actors after calling Send. Send and downstream callers could (do) mutate the actors, so we should reload the actor state if we want to use it after calling Send. @dignifiedquire fixed the proximate issue by moving the saves in #363.

Two potentially useful bits surfaced in discussion in slack. First, since the current vm model is to save and load actors while sending messages (as opposed to accumulating changes on one instance when sending messages) I proposed that our current implementation needed to make the following guarantees with respect to re-entrancy (here updated a bit for clarity):

caller state is saved before Sending (else re-entrant calls back into caller won't see any pre-send changes made)
caller and callee state is saved in Send after transferring value and before calling further Sends (else we won't see transfers in downstream calls, and maybe not in the caller)
callee must be loaded fresh on entry into Send (else changes made earlier in the call chain might not be visible)
caller must be reloaded after a Send (else any state modified by a downstream call won't be visible)

I commented that I wasn't even sure if this was right because tricky [I'm still not sure it is right; and there might also be a smaller basis that spans the space of concerns]. I added the last bullet point a few messages after the first three so it was easy to miss, which is a contributing factor to why we ended up with the second problem below.

Also in that thread @dignifiedquire noted that there are three separate things we need to save:

nonce increment

balance changes

storage writes

These happen in three different places (ApplyMessage, Send, in the actor with WithStorage).

Dig fixed alex's problem but the fourth bullet above (caller must be reloaded after send) is still a problem. I wrote a test to explore the problem by doublespending. In doing so it uncovered another problem, which is that we were propagating a potentially stale copy of an actor from a calling context and then saving it over any values saved by downstream callers. See https://github.com/filecoin-project/go-filecoin/pull/372/files

https://youtu.be/l-O5IHVhWj0?t=1m25s

Root causes & contributing factors

Given that we are still early in development it's not surprising that there are issues. Let's shake em out and see what we can do in the future to avoid this situation again.

Root causes:

Re-entrancy is tricky, just ask the DAO
We do not have tests covering a number of important cases (specifically re-entrant calls and longer call chains across each of the nonce/balance/storage dimensions)
We do not have a description of the contract for sending messages.

Contributing factors:

Different parts of the actor state is saved in different ways in different places, so it can be hard to reason about.

If you see others please add them to this issue.

Potential for the same thing with actor memory

I'm not sure exactly what the plan is for actor memory but there is no mechanism guaranteeing that the re-entrancy guidelines [that I am not sure are correct] are followed for it. It's up to the Actor's implementer to get it right. The tooling doesn't help at the moment. For example when using the predominant pattern (WithStorage) if an actor A calls out to other actors which call back into A the changes to A earlier in the call are not visible.

Proposed Actions

Add a suite of unit-tests covering a lot more cases (to be fleshed out here I think)
Discuss and then describe the call model and what guarantees we think it makes
Ensure the model is comprehensive of actor memory
Implement any changes to the current model or the new model decided above

Potential future directions

Why and others have suggested that we move to a model more like Ethereum's where a single instance of an actor is propagated through the call chain, accumulating changes. Basically, have the vm use an actor cache. This has the massive advantage of seeming at first blush to be much easier to reason about than the load-reload model. It also could bring potential performance improvements with it.

I would to propose that we have an additional path forward that we should consider, which is to just not allow re-entrant calls at all, or not allow them by default. The history of smart contract hacks is the history of re-entrancy bugs. It's so hard to get right that I think we should consider exploring the stance that re-entrancy considered dangerous, therefore let's not allow it, or at least not by default. Yes but contracts get more complicated. In some ways: I guess they are more complicated in that they need to be refactored into more pieces, or require users to sometimes use an extra address. But also they are way less complicated if you don't have re-entrancy because you don't have to worry about mind-bending edge cases and ordering quite as much. And it's probably less complicated to implement in the vm. Plus I'm not convinced that most applications even need re-entrancy (I'm not convinced the other way either, I see it as a open question), maybe those that do would better off factored into a small number of highly specialized services. God this stick I keep poking into my eye really hurts...

In any case I would like to have the discussion about re-entrancy based on concrete evidence: let's take all of our contracts, some contracts we hear may be hard without re-entrancy, and a few popular ethereum contracts and see how they would need to change if we didn't allow re-entrancy, and use that to guide the conversation. So, proposed additional action item:

Explore whether we should disallow re-entrancy, or at least by default (eg, enable contracts to opt-in)

Edit: sorry I accidentally hit create. Since then I added the final paragraphs, proofread, and fixed some typos in the text.

dignifiedquire · 2018-04-29T11:44:27Z

some interesting reading from ethereum land to avoid reentrancy issues like the dao

Static Call: ethereum/EIPs#214

dignifiedquire · 2018-04-29T11:49:11Z

sth important to talk about is the structure of our code implemening the VM vs what sort of contracts can be written given the defined execution structure. Currently this is very conflated in the code base making it very hard to reason about things. Do independent of how we want to move forward I believe it is critical to transition the current protoype impl of the vm into one that is well structured using abstractions that make it easy to reason about such things + test for them.

whyrusleeping · 2018-04-29T12:00:26Z

@dignifiedquire specifically, what is conflated? Be as precise as possible

…

On Sun, Apr 29, 2018, 8:49 PM Friedel Ziegelmayer ***@***.***> wrote: sth important to talk about is the structure of our code implemening the VM vs what sort of contracts can be written given the defined execution structure. Currently this is very conflated in the code base making it very hard to reason about things. Do independent of how we want to move forward I believe it is critical to transition the current protoype impl of the vm into one that is well structured using abstractions that make it easy to reason about such things + test for them. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#375 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABL4HN2Ay6duprjxEUwe0jL4-HjQdZt5ks5ttai4gaJpZM4Trsmu> .

dignifiedquire · 2018-04-29T12:32:42Z

on my phone atm, will write up sth more detailed later

…

On 29. Apr 2018, 14:00 +0200, Whyrusleeping ***@***.***>, wrote: @dignifiedquire specifically, what is conflated? Be as precise as possible On Sun, Apr 29, 2018, 8:49 PM Friedel Ziegelmayer ***@***.***> wrote: > sth important to talk about is the structure of our code implemening the > VM vs what sort of contracts can be written given the defined execution > structure. Currently this is very conflated in the code base making it very > hard to reason about things. Do independent of how we want to move forward > I believe it is critical to transition the current protoype impl of the vm > into one that is well structured using abstractions that make it easy to > reason about such things + test for them. > > > — > You are receiving this because you were assigned. > Reply to this email directly, view it on GitHub > <#375 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/ABL4HN2Ay6duprjxEUwe0jL4-HjQdZt5ks5ttai4gaJpZM4Trsmu> > . > — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

acruikshank · 2018-04-30T01:42:17Z

This is me trying to make sense of the questions at hand with a few observations:

assumptions

I'm making the following assumptions about the future of the VM (which may be wrong):

All actors will only ever have 3 types of mutable state: Nonce, Balance, and Memory.
Nonce is updated by and only by sending a message to another actor.
Balance is updated by and only by receiving a message from another actor.
Memory is updated by and only by its actor executing a method as a result of receiving a message.
No actor should be able to read another actor's state except by messaging it's methods (this is not currently enforced).

store vs. cache

If these are true, we could pull the WithStorage outside of the method execution so that the VM has a reference to the memory (I think we should do this anyway). If we were to persist the memory at the beginning of each call to send and restore the persisted value (which may have changed), we should have all our bases covered. If we use actor caching, each method invocation will use the same actor, so there will never be a question about which version of an actor's memory you are using.

understanding re-entrancy

So with:

Send(A, B):
  B.Memory = Mb'
  Send(B, C):
    Send(C, B):
      f(B.Memory)
      B.Memory = Mb''
  f(B.Memory)

The first call to f() will get Mb' and the second will get Mb''. This is sane in the sense that, at each step, there is only one answer to the question of "what is B's Memory?".

There is a deeper question of whether it is sane for B to message C and have it's memory changed out from under it by the time Send returns. If the answer to this is no, then I think banning re-entrancy is the only answer.

partial ban

If we do ban re-entrancy, there's still a question about whether we ban messages that have no method. I can imagine an actor short on funds could try calling in some debts before acting. This would require a message that looks re-entrant, but doesn't actually call code that could alter memory.

whyrusleeping · 2018-04-30T06:17:06Z

It doesnt appear that anyone is actually talking about re-entrancy here. The worry seems to be cyclic actor invocations. In that case, this should be treated in the same way that it would if this were all running inside one program, and each actor were an object. all modifications to memory should persist throughout the entire call stack, even across actors. If a revert occurs (the program 'throws'), then all changes for this entire message processing event are reverted.

dignifiedquire · 2018-04-30T10:10:44Z

@phritz:

The problem was that when sending messages between actors we were saving actors in an outer context (ApplyMessage) rather than in the inner context (Send)

The issue was that we were saving in both places, first in the inner and then it would get overwritten by the outer save. In addition there was one place where we weren't saving (transfer).

We do not have tests covering a number of important cases (specifically re-entrant calls and longer call chains across each of the nonce/balance/storage dimensions)

We do have too little tests covering this issue, I will not dispute that. But, and this is very important, we are not writing and designing a user facing VM right now. So the even if our implementation allows in theory for these kinds of bugs, they are not exploitable, unless a majority of the miners suddenly adopts a new actor.

For example when using the predominant pattern (WithStorage) if an actor A calls out to other actors which call back into A the changes to A earlier in the call are not visible.

This means you are likely using WithStorage wrong. What I currently see is the whole call being covered by the scope of WithStorage's callback, when actually it should only cover a small portion, namely when needing to manipulate the storage. So what should happen is sth. like this

WithStorage(func() {
  // change things in the storage
})

// do a Send to another actor, which should be able to view changes if exposed
// though there might be a bug in the current implementation
ret, err := ctx.Send(..)

WithStorage(func() {
  // use ret to change the storage
})

It's up to the Actor's implementer to get it right

Yes, and for now I think that is fine, as we are not implementing a user facing VM.
This does not mean we shouldn't improve our tooling to make it less likely that we screw up.

Discuss and then describe the call model and what guarantees we think it makes

I don't think this should be the first step. Documenting the current model, as the current intentions are should be the first step.

Ensure the model is comprehensive of actor memory

Could you explain what you mean with this?

move to a model more like Ethereum's where a single instance of an actor is propagated through the call chain

I think this is a must, the current implementation was not meant to stay in it's form, with manual SetActor calls, it was just made so it works.

I would to propose that we have an additional path forward that we should consider, which is to just not allow re-entrant calls at all, or not allow them by default.

I think this is an important discussion to be had, but not today. Figuring the right path forward here is not going to be simple and will take time and research. We should schedule this as part of the work to be done to ship a user facing VM I would suggest.

@acruikshank:

All actors will only ever have 3 types of mutable state: Nonce, Balance, and Memory

For now, ever is a very strong word that I wouldn't use quite yet.

No actor should be able to read another actor's state except by messaging it's methods (this is not currently enforced)

This is enforced right now, in the sense that the exposed ABI does not allow you to retrieve other actors.
It is also not a 100% clear to me what the best path forward here should be.

we could pull the WithStorage outside of the method execution so that the VM

No, WithStorage is a helper method that runs inside the VM. Actors will have the freedom to define their own allocation methods and ways to use the storage. We can not and should not rely on this.

@whyrusleeping:

Yes, I agree the actual worries called reentrancy in this thread seem to focus on cyclic/recursive actor invocations, not on the actual reentrancy problem.

@Myself:

Expanding on the issues with the code that I see today.

A single package. Currently ALL actors live in the core package, as well as VM internals and things that are modeled to be inside the VM. This means there is no separation between them and actors can use private details of things outside of the VM.
Builtin actors are not actually part of the genesis/configuration of a chain and can be changed without any notice.
Serialization is not fully enforced on the VM boundaries.
No memory isolation at runtime. The full extent of this will likely be pushed until we have a real VM like WASM, but we should already model interfaces as if there was memory isolation between
- each toplevel message execution
- the VM and the rest of the program.

Moving forward:

The work that needs to be done has not really changed from my perspective. The things that need to be figured out, as described in #325 are stil pressing and have always entailed for me ensuring that the implementation of our VM is modeled according to the abstractions that the filled out design that #325 gives, is done well.

The main additional thing, as mentioned above, is that we need to document the intended execution model for V1 (initial release, only builtin actors, no user actors, no real VM).

After that we should

implement proper caching
ensure the abstractions are clearly layed out in code and enforced, fixing the issues mentioned above
ensure interfaces line up, such that the VM could be swapped and/or added, with both builtin actors running and something like WASM execution contracts (this is the prep work to make this switch later on, but also forces us to ensure concerns are separated out well)

porcuquine · 2018-04-30T19:28:52Z

The issue was that we were saving in both places, first in the inner and then it would get overwritten by the outer save. In addition there was one place where we weren't saving (transfer).

We don't have to 'save' in the inner, but if we don't, the outer save has to include the inner changes. This will happen if we're always talking to the same memory -- or if we ensure a nesting discipline which reloads after events which can change the underlying storage. The former is probably preferable, since it avoids superfluously saving. However, it assumes that both inner and outer processes have the same interpretation of the raw memory. That's not true in the very general case I think we aim for. To support that, we would need to get the save/load discipline right.

This means you are likely using WithStorage wrong. What I currently see is the whole call being covered by the scope of WithStorage's callback, when actually it should only cover a small portion, namely when needing to manipulate the storage. So what should happen is sth. like this

Specifically, there cannot be any ctx.Send(…) calls within a WithStorage function. If we do not want to enforce this informally, we will have to make sure calls to Send handle this correctly (harder).

I think this is a must, the current implementation was not meant to stay in it's form, with manual SetActor calls, it was just made so it works.

Right -- so we should probably forbid SetActor (and make it impossible to call). If we don't want to do that, then we can go the caching route. One problem with the caching route is that it patches without forcing us to address the issue more deeply. However, if we start with that solution, it would have the benefit of providing a mechanism which would allow us to detect straggling inappropriate calls once we decide to enforce a different policy.

aboodman · 2018-05-17T10:48:42Z

Here are some notes that I wrote up from our discussions in Lisbon:

https://docs.google.com/document/d/1UgVjJ5yInTuya9cAbO6lmqTqH-l1_rxRKAUOXfu-7pk/edit#

phritz · 2018-06-11T21:32:42Z

The immediate issue has been fixed and this is on everyone's minds so closing.

phritz added the discussion label Apr 29, 2018

phritz assigned dignifiedquire and whyrusleeping Apr 29, 2018

porcuquine mentioned this issue Apr 29, 2018

may'18 dev week sessions suggestion ground #328

Closed

porcuquine mentioned this issue Apr 30, 2018

[DISCUSSION -- DO NOT MERGE] Proof-of-concept fix to multispend. #376

Closed

mishmosh added this to the Sprint 8 milestone May 1, 2018

mishmosh modified the milestones: Sprint 8, Sprint 9, Sprint 10 May 11, 2018

phritz closed this as completed Jun 11, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

re-entrancy, vm bugs, double spending oh my #375

re-entrancy, vm bugs, double spending oh my #375

phritz commented Apr 29, 2018 •

edited

Loading

dignifiedquire commented Apr 29, 2018 •

edited

Loading

dignifiedquire commented Apr 29, 2018 via email

whyrusleeping commented Apr 29, 2018 via email

dignifiedquire commented Apr 29, 2018 via email

acruikshank commented Apr 30, 2018

whyrusleeping commented Apr 30, 2018

dignifiedquire commented Apr 30, 2018

porcuquine commented Apr 30, 2018

aboodman commented May 17, 2018

phritz commented Jun 11, 2018

re-entrancy, vm bugs, double spending oh my #375

re-entrancy, vm bugs, double spending oh my #375

Comments

phritz commented Apr 29, 2018 • edited Loading

Background

Root causes & contributing factors

Potential for the same thing with actor memory

Proposed Actions

Potential future directions

dignifiedquire commented Apr 29, 2018 • edited Loading

dignifiedquire commented Apr 29, 2018 via email

whyrusleeping commented Apr 29, 2018 via email

dignifiedquire commented Apr 29, 2018 via email

acruikshank commented Apr 30, 2018

assumptions

store vs. cache

understanding re-entrancy

partial ban

whyrusleeping commented Apr 30, 2018

dignifiedquire commented Apr 30, 2018

Moving forward:

porcuquine commented Apr 30, 2018

aboodman commented May 17, 2018

phritz commented Jun 11, 2018

phritz commented Apr 29, 2018 •

edited

Loading

dignifiedquire commented Apr 29, 2018 •

edited

Loading