Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contract bytecode changes vastly when independent contracts added to the compiler #14250

Closed
kuzdogan opened this issue May 19, 2023 · 6 comments
Labels
Milestone

Comments

@kuzdogan
Copy link
Member

Summary

I came across a contract that could not be verified on Sourcify because the Sourcify's compilation output bytecode is different than the author's (Hardhat). Diving deeper I've found out the difference comes to the surface when some other contracts that are unrelated to the compilation target are added to compilation.

Specifically these two standard JSON inputs yield different bytecode for the same contract CompoundLens.sol:
CompoundLens-solc-input-Sourcify-bytecode.json.txt
CompoundLens-solc-input-Hardhat-bytecode.json.txt

The only input sources differences are the following which are unrelated to CompoundLens:

[
  '@openzeppelin/contracts/interfaces/draft-IERC1822.sol',
  '@openzeppelin/contracts/proxy/ERC1967/ERC1967Proxy.sol',
  '@openzeppelin/contracts/proxy/ERC1967/ERC1967Upgrade.sol',
  '@openzeppelin/contracts/proxy/Proxy.sol',
  '@openzeppelin/contracts/proxy/beacon/IBeacon.sol',
  '@openzeppelin/contracts/proxy/transparent/ProxyAdmin.sol',
  '@openzeppelin/contracts/proxy/transparent/TransparentUpgradeableProxy.sol',
  '@openzeppelin/contracts/utils/StorageSlot.sol'
]

You'll notice the bytecodes differ even when the metadata hashes are the same. This is unexpected as the different sources listed above are not related to CompoundLens. CompoundLens already compiles without those contracts are input in CompoundLens-solc-input-Sourcify-bytecode.json but when added in CompoundLens-solc-input-Hardhat-bytecode.json the contract's bytecode changes.

To reproduce

Using Solidity version v0.8.19

Output the bytecodes

  1. Sourcify
cat CompoundLens-solc-input-Sourcify-bytecode.json.txt | solc --standard-json | jq '.contracts."contracts/Comptroller/CompoundLens.sol".CompoundLens.evm.bytecode.object' > CompoundLens-Sourcify-creation-bytecode.txt
  1. Hardhat
cat CompoundLens-solc-input-Hardhat-bytecode.json.txt | solc --standard-json | jq '.contracts."contracts/Comptroller/CompoundLens.sol".CompoundLens.evm.bytecode.object' > CompoundLens-Hardhat-creation-bytecode.txt

Compare the bytecodes:

git diff --word-diff --word-diff-regex=. CompoundLens-Sourcify-creation-bytecode.txt CompoundLens-Hardhat-creation-bytecode.txt

Background

The contracts are on this Github repo (verify branch): https://github.com/meterio/sumer-project/tree/verify

To compile (and deploy) the original sources:

git clone https://github.com/meterio/sumer-project/
cd sumer-project
rm -rf dist
yarn
yarn compile
npx hardhat dcl --rpc http://rpctest.meter.io/ --pk $PRIV_KEY

The contract is also deployed at: https://goerli.etherscan.io/address/0x46df081108b2e2FDf1bF1E84Eeb2D7ec3AdA0061

The bytecode diff between the Sourcify output and Hardhat output was not in the metadata hash or in a certain recognizable pattern for me:
image

CompoundLens-hardhat-recompiled-creation.txt
CompoundLens-recompiled-creation.txt

To generate the diff: git diff --word-diff --word-diff-regex=. CompoundLens-hardhat-recompiled-creation.txt CompoundLens-recompiled-creation.txt

What I tried to do was to start from the standard JSON input of Sourcify and try to reach the Hardhat output bytecode in iterations:
initial Sourcify JSON input: CompoundLens-solc-input.json.txt
Hardhat JSON input: CompoundLens-hardhat-solc-input.json.txt

  1. Tried using the full compilation settings from Hardhat.
    Using the same settings in Hardhat in the initial Sourcify input didn't change the bytecode.
  2. Tried using the whole sources from the Harhat input in the initial Sourcify input

Yes by using all of the sources from Hardhat, one generates the Hardhat's bytecode output.

Next, I iteratively copied sources from the Hardhat input to the Sourcify input to see adding which sources cause the change in the bytecode.

Sourcify's initial standard JSON input:
image

Hardhat's standard JSON input (clipped):
image

On each step I copied a contract that might be a potential cause of change, resolved the dependencies by also adding them, compiled the new iterated JSON input and compared the bytecodes.

Finally, I found a minimal standard JSON diff, that adding the specific sources would change the bytecode output. These are laid out in the above Summary section.

Environment

  • Compiler version: v0.8.19
  • Target EVM version (as per compiler settings): paris
  • Framework/IDE (e.g. Truffle or Remix): Hardhat and solc CLI
  • EVM execution environment / backend / blockchain client:
  • Operating system: MacOS
@ekpyron
Copy link
Member

ekpyron commented Jun 7, 2023

For the record: we did found the cause and we'll fix this with the next release, but I'm not sure what to do for helping with verifying already deployed cases of this.

@kuzdogan
Copy link
Member Author

kuzdogan commented Jun 7, 2023

If there is a way to recognize this, we can try to compile with everything provided (instead of just the sources in metadata) through the tooling, or warn the user about the issue on the UI. In fact this is what we do for ethereum/sourcify#618 and makes me realize this really is a similar case (and the Solidity bug #12281). I should've mentioned that but it didn't cross my mind 🤦

In that case, however, it is not possible to get a "perfect match" including the metadata, because the metadata will be changed.

@ekpyron
Copy link
Member

ekpyron commented Jun 12, 2023

I'm not sure there's a really easy way to recognize this. It's very similar to the older issues we had where additional source files affected compilation results, but a different place in the compiler that we missed back then.

The main cause of the difference is that the compiler chooses a different permutation of memory offsets for moving variables from stack to memory depending on some internal IDs that can change if you add more source files...

The problem, however, is, that the smallest offset we usually use for these variables is 0x80, for which a PUSH1 suffices, but we can easily end up with offsets that require two bytes, which results in a PUSH2, which will shift all offsets of all jumps...

So overall you will see constant pushes before jumps (i.e. offsets into bytecode) to change slightly - and you'll see constant offsets preceding memory loads and stores to change slightly.

So the pattern should be PUSH1/2 <someConstant> JUMP/JUMPI, PUSH1/2 <someConstant> MSTORE , PUSH1/2 <someConstant> MLOAD, in which you may get different constant values, while flipping between a PUSH1 and a PUSH2 can happen, which shifts all bytecode offsets around a bit (causing the first pattern involving jumps)... not sure if it's feasible to detect this easily...

@ekpyron
Copy link
Member

ekpyron commented Jul 11, 2023

So we can consider this fixed in newer compiler versions by #14311, right? Based on that I'm closing the issue, but feel free to reopen.

@ekpyron ekpyron closed this as completed Jul 11, 2023
@kuzdogan
Copy link
Member Author

Yep, we'll handle this ourselves. Thanks

@cameel
Copy link
Member

cameel commented Dec 22, 2023

@marcocastignoli

does the additional files content matter in the generated bytecode or it's just the presence of the files in the compilation? (so actually the additional files can also be empty but with the same identifier)

When a file is parsed, every AST node in it gets an ID. IDs are generated sequentially across all the input files as the compiler goes through them. As long as the files are processed in the same order and have the same content, you get the same IDs. So yeah, in case of bugs like this, where the compiler fails to make code generation completely independent of those IDs, changing the content in files unrelated to a contract may result in different bytecode for that contract.

To be more specific, here's what must be avoided if you want to get identical bytecode in presence of such bugs:

  • Adding or removing files:
    • Directly, in the initial input (i.e. the names you give explicitly on the CLI or in sources in Standard JSON).
    • Indirectly, via imports (i.e. files that are not in initial input, but imported by other files).
  • Reordering files:
    • By renaming them (affects sorting order).
    • By changing their paths (affects sorting order).
    • By changing remappings, resulting in the compiler seeing different names and/or paths.
    • By changing whether they are included or not included in the initial input (because imported files are processed as soon as they're discovered).
    • By reordering imports inside a file or between files.
  • Adding or removing AST nodes:
    • By changing file content, which may result in a different number of AST nodes being generated.

Note that:

  • Changing the order of files only in the initial input does not affect IDs. The compiler iterates over those files in alphabetical order.
  • Symlinks should not affect the order either. The compiler only takes into account the path and name of the symlink itself, as if it was the target file, ignoring the path of what it points to.
  • Even when a file is empty, at least one ID is dispensed - for the node that represents the root of its tree (i.e. SourceUnit).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants