Blockchain attacks on privacy

Users can verify the blockchain to check that all the rules of bitcoin were followed throughout its history. For example, users can check that nobody printed infinite bitcoins and that every coin was only spent with a valid signature created by its private key. This is what leads to bitcoin's unique value proposition as a form of electronic cash which requires only small amounts of trust. But the same blockchain structure leads to privacy problems because every transaction must be available for all to see, forever. This section discusses known methods an adversary may use for analyzing the public blockchain.

Bitcoin uses a UTXO model. Transactions have inputs and outputs, they can have one or more of each. Previous outputs can be used as inputs for later transactions. An output which hasn't been spent yet is called an unspent transaction output (UTXO). UXTOs are often called "coins". UTXOs are associated with a bitcoin address and can be spent by creating a valid signature corresponding to the scriptPubKey of the address.

Addresses are cryptographic information, essentially random numbers. On their own they do not reveal much about the real owner of any bitcoins on them. Usually an adversary will try to link together multiple addresses which they believe belong to the same wallet. Such address collections are called "clusters", "closures" or "wallet clusters", and the activity of creating them is called "wallet clustering". Once the clusters are obtained the adversary can try to link them real-world identities of entities it wants to spy on.

For example, it may find wallet cluster A belonging to Alice and another wallet cluster B belonging to Bob. If a bitcoin transaction is seen paying from cluster A to cluster B then the adversary knows that Alice has sent coins to Bob.

It can be very difficult to fine-tune heuristics for wallet clustering that lead to obtaining actually correct information.

Common-input-ownership heuristic

Bitcoin Milan Meetup 46 - Talk by Adam Gibson

This is a heuristic or assumption which says that if a transaction has more than one input then all those inputs are owned by the same entity.

For example, consider this transaction with inputs A, B and C; and outputs X and Y.

A (1 btc) --> X (4 btc)

              B (2 btc)     Y (2 btc)

              C (3 btc)

This transaction would be an indication that addresses B and C are owned by the same person who owns address A.

One of the purposes of CoinJoin is to break this heuristic. Nonetheless this heuristic is very commonly true and it is widely used by transaction surveillance companies and other adversaries as of 2019. The heuristic is usually combined with address reuse reasoning, which along with the somewhat-centralized bitcoin economy as of 2018 is why this heuristic can be unreasonably effective[4]. The heuristic's success also depends on the wallet behaviour: for example, if a wallet usually receives small amounts and sends large amounts then it will create many multi-input transactions.

Change address detection

Many bitcoin transactions have change outputs. It would be a serious privacy leak if the change address can be somehow found, as it would link the ownership of the (now spent) inputs with a new output. Change outputs can be very effective when combined with other privacy leaks like the common-input-ownership heuristic or address reuse. Change address detection allows the adversary to cluster together newly created address, which the common-input-ownership heuristic and address reuse allows past addresses to be clustered.

DeAnonymization

Change addresses lead to a common usage pattern called the peeling chain. It is seen after a large transactions from exchanges, marketplaces, mining pools and salary payments. In a peeling chain, a single address begins with a relatively large amount of bitcoins. A smaller amount is then peeled off this larger amount, creating a transaction in which a small amount is transferred to one address, and the remainder is transferred to a one-time change address. This process is repeated - potentially for hundreds or thousands of hops - until the larger amount is pared down, at which point (in one usage) the amount remaining in the address might be aggregated with other such addresses to again yield a large amount in a single address, and the peeling process begins again.

Now are listed possible ways to infer which of the outputs of a transaction is the change output:

Address reuse

If an output address has been reused it is very likely to be a payment output not a change output. This is because change addresses are created automatically by wallet software but payment addresses are manually sent between humans. The address reuse would happen because the human user reused an address out of ignorance or apathy. This heuristic is probably the most accurate, as its very hard to imagine how false positives would arise (except by intentional design of wallets). This heuristic is also called the "shadow heuristic".

Some very old software (from the 2010-2011 era which did not have Deterministic wallets) did not use a new address change but sent the change back to the input address. This reveals the change address exactly.

Avoiding address reuse is an obvious remedy. Another idea is that that wallets could automatically detect when a payment address has been used before (perhaps by asking the user) and then use a reused address as their change address; so both outputs would be reused addresses.

Wallet fingerprinting

A careful analyst sometimes deduce which software created a certain transaction, because the many different wallet softwares don't always create transactions in exactly the same way. Wallet fingerprinting can be used to detect change outputs because a change output is the one spent with the same wallet fingerprint.

As an example, consider five typical transactions that consume one input each and produce two outputs. A, B, C, D, E refer to transactions. A1, A2, etc refer to output addresses of those transactions

   --> C1
A1 --> B2  --> C2
   --> B2  --> D1
              --> D2 --> E1
                     --> E2

If wallet fingerprinting finds that transactions A, B, D and E are created by the same wallet software, and the other transactions are created by other software, then the change addresses become obvious. The same transactions with non-matching addresses replaced by X is shown. The peel chain is visible, it's clear that B2, D2, E1 are change addresses which belong to the same wallet as A1.

		--> X
A1	--> X   --> X
	--> B2  --> X
		--> D2 --> E1
		       --> X

There are a number of ways to get evidence used for identifying wallet software:

Address formats. Wallets generally only use one address type. If a transaction has all inputs and one output of the same address type (e.g. p2pkh), with the remaining output of a different type (p2sh), then a reasonable assumption is that the same-address-format output (p2pkh) is change and the different-address-format output (p2sh) is the payment which belongs to someone else.
Script types. Each wallet generally uses only one script. For example, the sending wallet may be a P2SH 2-of-3 multisignature wallet, which makes a transaction to two outputs: one 2-of-3 multisignature address and the other 2-of-2 multisignature address. The different script is a strong indication that the output is payment and the other output is change.
BIP69 Lexicographical Indexing of Transaction Inputs and Outputs. This BIP describes a standard way for wallets to order their inputs and outputs for privacy. Right now the wallet ecosystem has a mixture of wallets which do and don't implement the standard, which helps with fingerprinting. Note that the common one-input-two-output transaction with random ordering will follow BIP69 just by chance 50% of the time.
Number of inputs and outputs. Different users often construct transactions differently. For example, individuals often make transaction with just two outputs; a payment and change, while high-volume institutions like casinos or exchanges use consolidation and batching[6][7]. An output that is later use to create a batching transaction was probably not the change. This heuristic is also called the "consumer heuristic".
Transaction fields. Values in the transaction format which may vary depending on the wallet software: nLockTime is a field in transactions set by some wallets to make fee sniping less profitable. A mixture of wallets in the ecosystem do and don't implement this feature. nLockTime can also be used as in certain privacy protocols like CoinSwap. nSequence is another example. Also the version number.
Low-R value signatures. The DER format used to encode Bitcoin signatures requires adding an entire extra byte to a signature just to indicate when the signature’s R value is on the top-half of the elliptical curve used for Bitcoin. The R value is randomly derived, so half of all signatures have this extra byte. As of July 2018 Bitcoin Core only generates signatures with a low-R value that don't require this extra byte. By doing so, Bitcoin Core transactions will save one byte per every two signatures (on average). As of 2019 no other wallet does this, so a high-R signature is evidence that Bitcoin Core is not being used.
Uncompressed and compressed public keys. Older wallet software uses uncompressed public keys[10]. A mixture of compressed and uncompressed keys can be used for fingerprinting.
Miner fees. Various wallet softwares may respond to block space pressure in different ways which could lead to different kinds of miner fees being paid. This might also be a way of fingerprinting wallets.
Coin Selection. Various wallet softwares may choose which UTXOs to spend using different algorithms that could be used for fingerprinting.

If multiple users are using the same wallet software, then wallet fingerprinting cannot detect the change address. It is also possible that a single user owns two different wallets which use different software (for example a hot wallet and cold wallet) and then transactions between different softwares would not indicate a change of ownership. Wallet fingerprinting on its own is never decisive evidence, but as with all other privacy leaks it works best with data fusion when multiple privacy leaks are combined.

Round numbers

Many payment amounts are round numbers, for example 1 BTC or 0.1 BTC. The leftover change amount would then be a non-round number (e.g. 1.78213974 BTC). This potentially useful for finding the change address. The amount may be a round number in another currency. The amount 2.24159873 BTC isn't round in bitcoin but when converted to USD it may be close to $100.

Fee bumping

BIP 0125 defines a mechanism for replacing an unconfirmed transaction with another transaction that pays a higher fee. In the context of the market for block space, a user may find their transaction isn't confirming fast enough so they opt to "fee bump" or pay a higher miner fee. However generally the new higher miner fee will happen by reducing the change amount. So if an adversary is observing all unconfirmed transactions they could see both the earlier low-fee transaction and later high-fee transaction, and the output with the reduced amount would be the change output.

This could be mitigated by some of the time reducing the amount of both outputs, or reducing only the payment amount (in a sender-pays-for-fee model).

Unnecessary input heuristic

Also called the "optimal change heuristic". Consider this bitcoin transaction. It has two inputs worth 2 BTC and 3 BTC and two outputs worth 4 BTC and 1 BTC.

2 btc --> 4 btc
3 btc     1 btc

Assuming one of the outputs is change and the other output is the payment. There are two interpretations: the payment output is either the 4 BTC output or the 1 BTC output. But if the 1 BTC output is the payment amount then the 3 BTC input is unnecessary, as the wallet could have spent only the 2 BTC input and paid lower miner fees for doing so. This is an indication that the real payment output is 4 BTC and that 1 BTC is the change output.

This is an issue for transactions which have more than one input. One way to fix this leak is to add more inputs until the change output is higher than any input, for example:

2 btc --> 4 btc
3 btc     6 btc
5 btc

Now both interpretations imply that some inputs are unnecessary. Unfortunately this costs more in miner fees and can only be done if the wallet actually owns other UTXOs.

Some wallets have a coin selection algorithm which violates this heuristic. An example might be because the wallets want to consolidate inputs in times of cheap miner fees. So this heuristic is not decisive evidence.

Sending to a different script type

Sending funds to a different script type than the one you're spending from makes it easier to tell which output is the change.

For example, for a transaction with 1 input spending a p2pkh coin and creating 2 outputs, one of p2pkh and one of p2sh, it is very likely that the p2pkh output is the change while the p2sh one is the payment.

This is also possible if the inputs are of mixed types (created by wallets supporting multiple script types for backwards compatibility). If one of the output script types is known to be used by the wallet (because the same script type is spent by at least one of the inputs) while the other is not, the other one is likely to be the payment.

This has the most effect on early adopters of new wallet technology, like p2sh or segwit. The more rare it is to pay to people using the same script type as you do, the more you leak the identity of your change output. This will improve over time as the new technology gains wider adoption.

Wallet bugs

Some wallet software handles change in a very un-private way. For example certain old wallets would always put the change output in last place in the transaction. An old version of Bitcoin Core would add input UTXOs to the transaction until the change amount was around 0.1 BTC, so an amount of slightly over 0.1 BTC would always be change.

Equal-output CoinJoin

Equal-output-CoinJoin transactions trivially reveal the change address because it is the outputs which are not equal-valued. For example consider this equal-output-coinjoin:

A (1btc)
X (5btc) ---> B (1btc)
Y (3btc)      C (4btc)
D (2btc)

There is a very strong indication that output D is change belongs to the owner of input Y, while output C is change belonging to input X. However, CoinJoin breaks the common-input-ownership heuristic and effectively hides the ownership of payment outputs (A and B), so the tradeoffs are still heavily in favour of using coinjoin.

Cluster growth

Wallet clusters created by using the common-input-ownership heuristic usually grow (in number of addresses) slowly and incrementally[11]. Two large clusters merging is rare and may indicate that the heuristics are flawed. So another way to deduce the change address is to find which output causes the clusters to grow only slowly. The exact value for "how slowly" a cluster is allowed to grow is an open question.

Transaction graph

As described in the introduction, addresses are connected together by transactions on the block chain. This information from all the transactions and addresses is often called the transaction graph.

Transactions are usually assumed to correspond to real economic transactions, but sometimes transactions actually just represent someone sending bitcoins to themselves.

Taint analysis

Taint analysis is a technique sometimes used to study the flow of bitcoins and extract privacy-relevant information. If an address A is connected to privacy-relevant information (such as a real name) and it makes a transaction sending coins to address B, then address B is said to be tainted with coins from address A. In this way taint is spread by "touching" via transactions[12]. It is unclear how useful taint analysis is for spying, as it does not take into account transfer of ownership. For example an owner of tainted coins may donate some of them to some charity, the donated coins could be said to be tainted yet the charity does not care and could not give any information about the source of those coins. Taint analysis may only be useful for breaking schemes where someone tries to hide the origin of coins by sending dozens of fake transactions to themselves many times.

Amount

Blockchain transactions contain amount information of the transaction inputs and outputs, as well as an implicit amount of the miner fee. This is visible to all.

Often the payment amount of a transaction is a round number, possibly when converted to another currency. An analysis of round numbers in bitcoin transactions has been used to measure the countries or regions where payment have happened.

Input amounts revealing sender wealth

A mismatch in the sizes of available input vs what is required can result in a privacy leak of the total wealth of the sender. For example, when intending to send 1 bitcoins to somebody a user may only have an input worth 10 bitcoins. They create a transaction with 1 bitcoin going to the recipient and 9 bitcoins going to a change address. The recipient can look at the transaction on the blockchain and deduce that the sender owned at least 10 bitcoins.

By analogy with paper money, if you hand over a 100 USD note to pay for a drink costing only 5 USD the bartender learns that your balance is at least 95 USD. It may well be higher of course, but it's at least not lower.

Exact payment amounts (no change)

Payments that send exact amounts and take no change are a likely indication that the bitcoins didn't move hands.

This usually means that the user used the "send maximum amount" wallet feature to transfer funds to her new wallet, to an exchange account, to fund a lightning channel, or other similar cases where the bitcoins remain under the same ownership.

Other possible reasons for sending exact amounts with no change is that the coin-selection algorithm was smart and lucky enough to find a suitable set of inputs for the intended payment amount that didn't require change (or required a change amount that is negligible enough to waive), or advanced users sending donations using manual coin selection to explicitly avoid change.

Batching

Medium logo
Merge avoidance. A note on privacy-enhancing techniques in the Bitcoin protocol.
Dec 11, 2013, Mike Hearn

Payment batching is a technique to reduce the miner fee of a payment. It works by batching up several payments into one block chain transaction. It is typically used by exchanges, casinos and other high-volume spenders.

The privacy implication comes in that recipients can see the amount and address of recipients.

When you receive your withdrawal from Kraken, you can look up your transaction on a block chain explorer and see the addresses of everyone else who received a payment in the same transaction. You don’t know who those recipients are, but you do know they received bitcoins from Kraken the same as you.

That’s not good for privacy, but it’s also perhaps not the worst thing. If Kraken made each of those payments separately, they might still be connected together through the change outputs and perhaps also by certain other identifying characteristics that block chain analysis companies and private individuals use to fingerprint particular spenders.

However, it is something to keep in mind if you’re considering batching payments where privacy might be especially important or already somewhat weak, such as making payroll in a small company where you don’t want each employee to learn the other employees’ salaries.

Unusual scripts

Most but not all bitcoin scripts are single-signature. Other scripts are possible with the most common being multisignature. A script which is particularly unusual can leak information simply by being so unique.

2-of-3 multisig is by far the most common non-single-signature script as of 2019.

Mystery shopper payment

A mystery shopper payment is when an adversary pays bitcoin to a target in order to obtain privacy-relevant information. It will work even if address reuse is avoided. For example, if the target is an online merchant then the adversary could buy a small item. On the payment interface they would be shown one of the merchant's bitcoin addresses. The adversary now knows that this address belongs to the merchant and by watching the blockchain for later transactions other information would be revealed, which when combined with other techniques could reveal a lot of data about the merchant. The common-input-ownership heuristic and change address detection could reveal other addresses belonging to the merchant (assuming countermeasures like CoinJoin are not used) and could give a lower-bound for the sales volume. This works because anybody on the entire internet can request one of the merchant's addresses.

Forced address reuse

Forced address reuse are when an adversary pays small amounts of bitcoin to addresses that have already been used on the block chain. The adversary hopes that users or their wallet software will use the payments as inputs to a larger transaction which will reveal other addresses via the the common-input-ownership heuristic. These payments can be understand as a way to intentionally do address reuse.

The correct behaviour by wallets is to not spend coins that have landed on an already-used addresses.

Amount correlation

Amounts correlation refers to searching the entire block chain for output amounts.

For example, say we're using any black box privacy technology that breaks the transaction graph.

V --> [black box privacy tech] --> V - fee

The privacy tech is used to mix V amount of bitcoins, and it returns V bitcoins minus fees back to the user. Amount correlation could be used to unmix this tech by searching the blockchain for transactions with an output amount close to V.

A way to resist amount correlation is to split up the sending of bitcoins back to user into many transactions with output amounts (w0, w1, w2) which together add up to V minus fees.

V --> [privacy tech] 	--> w0
--> w1
--> w2

Another way of using amount correlation is to use it to find a starting point. For example, if Bob wants to spy on Alice. Say that Alice happens to mention in passing that she's going on holiday costing $5000 with her boyfriend, Bob can search all transactions on the blockchain in the right time period and find transactions with output amounts close to $5000. Even if multiple matches are found it still gives Bob a good idea of which bitcoin addresses belong to Alice.

Timing correlation

Timing correlation refers to using the time information of transactions on the blockchain. Similar to amount correlation, if an adversary somehow finds out the time that an interesting transaction happened they can search the blockchain in that time period to narrow down their candidates.

This can be beaten by uniform-randomly choosing a time between now and an appropriate time_period in which to broadcast the bitcoin transaction. This forces an adversary to search much more of the existing transactions; they have to equally consider the entire anonymity set between now and time_period.