MEV, QUIC and QoS

Jul 20, 2022

What is going on with Solana, why is congestion an issue and how are the current TPS problems being addressed?

Strap in and get ready for a quic ride 🫡

As a quick recap, Solana has been fighting congestion issues for the past few months, experiencing three network halts (one in September 2021, one in May 2022 and the last one in June 2022) as well as numerous periods in which transactions regularly failed.

There were several occasions on which Solana was hardly usable as the network hit significant stretches of congestion. Attempts to swap a token or sell an NFT would be in vain as most transactions simply failed.

While these phases have become significantly rarer as the network has stabilized over the past couple of months, FUD from anti-Solana maxis still seems to be omnipresent. Whether it’s justified or unfounded, let’s find out!

is the fud justified or are we about to enter solana summer?

[Note: this post is an updated version of a thread I originally wrote for Ora.]

Into the Mempool

Taking a step back, Solana doesn’t have a mempool.

Say what?

Yup. Unlike its good friends in Ethereum-land, Solana uses a mempool-less transaction-forwarding protocol called “Gulf Stream”.

A mempool consists of a global pool of transactions that have been sent out but not yet confirmed.

Instead of sending TXs to such a mempool, clients & validators on Solana forward TXs directly to the expected leader.

An overview of Ethereum’s blockspace and mempool by @gakonst & @Leorzhang

This is possible since the order of leaders is known ahead of time.
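
To make the forwarding idea concrete, here is a minimal sketch of how a client could pick its targets from a known leader rotation. The schedule format, the `SLOTS_PER_LEADER` constant and the function names are assumptions made up for illustration, not Solana's actual API.

```python
from typing import List

SLOTS_PER_LEADER = 4  # assumption for this sketch: each leader gets a run of consecutive slots


def leader_for_slot(schedule: List[str], slot: int) -> str:
    """Return the validator expected to lead `slot`, given a known rotation."""
    return schedule[(slot // SLOTS_PER_LEADER) % len(schedule)]


def forward_targets(schedule: List[str], current_slot: int, lookahead: int = 2) -> List[str]:
    """Pick the current leader plus the next distinct upcoming leaders to forward a TX to."""
    targets: List[str] = []
    slot = current_slot
    # Stop after one full rotation so the loop ends even if lookahead > number of leaders.
    last_slot = current_slot + SLOTS_PER_LEADER * len(schedule)
    while len(targets) < lookahead and slot < last_slot:
        leader = leader_for_slot(schedule, slot)
        if leader not in targets:
            targets.append(leader)
        slot += 1
    return targets


schedule = ["validator_a", "validator_b", "validator_c"]  # hypothetical validators
print(forward_targets(schedule, current_slot=5))  # ['validator_b', 'validator_c']
```

Because the rotation is deterministic, no global pool is needed: the sender can work out on its own who should receive the transaction.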

MEV

The problem in the current design is that bots can spam an arbitrary number of transactions to the validators.

They do this to maximize the chance of getting their transaction confirmed (by saturating the links and boxing out competing transactions → thanks Austin for the clarification!).

For example, if there’s a popular NFT mint, a bot can spam a mint transaction as many times as it wants to the expected leader in the hopes of successfully hitting the Candy Machine (Solana’s protocol to distribute and mint NFTs).

These bots do this because the expected value for popular mints is usually quite high as the floor for secondary sales post-mint tends to be much higher than mint price.

The same concept applies to events like IDOs, liquidations or arbitrages where multiple parties compete for the same assets/transactions.

All of these can be categorized as MEV.

Formerly known as Miner Extractable Value, it’s been reframed as Maximal Extractable Value.

Defined by Flashbots, MEV is

“the maximal value that can be permissionlessly extracted from transaction ordering.”

On Solana, this means having your transaction go through before others in order to mint an NFT, arbitrage a token pair or fill a liquidation (a non-exhaustive list of MEV examples).

Many of these MEV instances are zero-sum games.

For liquidations or arbitrage transactions, there is usually one entity that wins (the vast majority of) a specific opportunity.

For example, if a borrower on a lending platform becomes under-collateralized, lending markets allow for a permissionless liquidation of this borrower’s assets in exchange for a rebate.

If the assets under question are large enough, this rebate fee can be of significant monetary value.

Hence, people monitor such positions, so they can be the first ones to liquidate and reap the rewards.
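
As a rough illustration of what such a monitoring bot checks, here is a small sketch. The threshold and bonus values are hypothetical and do not come from any specific lending protocol; amounts are whole dollars and rates are in basis points to avoid floating-point rounding.

```python
BPS = 10_000  # basis points denominator (1 bps = 0.01%)


def is_liquidatable(collateral_usd: int, debt_usd: int, liquidation_threshold_bps: int) -> bool:
    """True once the debt exceeds the share of collateral the market lets you borrow against."""
    return debt_usd * BPS > collateral_usd * liquidation_threshold_bps


def liquidator_rebate(repaid_debt_usd: int, bonus_bps: int) -> int:
    """Rebate earned by whoever liquidates first, as a bonus on the debt they repay."""
    return repaid_debt_usd * bonus_bps // BPS


# Hypothetical position: $1,000,000 collateral, $850,000 debt, 80% threshold, 5% bonus.
print(is_liquidatable(1_000_000, 850_000, 8_000))  # True
print(liquidator_rebate(400_000, 500))             # 20000
```

With numbers like these, a single successful liquidation pays far more than the cost of sending thousands of failed attempts, which is exactly why the spam happens.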

ceteris @ceterispar1bus

tldr of what i understand is, solana limits the amount of compute in tx's, this was fine half a year ago, but now for more complex defi stuff the amount of compute on network is skyrocketing, liquidation bots causing issues. compute fee based model will be needed.

Solana Status @SolanaStatus

Mainnet Beta Validators: Please upgrade to https://t.co/lJyNPScnH7

The problem is that various entities compete for the same liquidation opportunity.

In such cases, you can either use a private relay to bribe a validator or spam as many transactions as possible to maximize your chances.

Without a Flashbots-style validator client for Solana (similar to MEV-Geth on Ethereum; Jito Labs is building this), basically all bots default to the latter option.

As Misaka pointed out and analyzed in great detail, MEV profits from arbitrage and liquidations came close to 5 million dollars during the LUNA/UST debacle.

It goes without saying that this is considerable value at stake for a span of a few days.

misaka @0xmisaka

$43m in total MEV on Solana 👇🧵 The collapse of LUNA / UST edition

In the case of popular IDOs, bots try to acquire as many tokens as possible.

This was the root cause of the September outage last year.

Bots spamming validators to acquire Grape tokens on Raydium brought the network to a halt.

Congestions, congestions, congestions…

Fundamentally, the question still arises: why does transaction spam lead to congestion? Shouldn’t Solana be able to handle such a load?

Two of Solana’s halts were triggered by transaction spam, and each time the network had to be restarted.

The first one was in September when bots spammed the aforementioned Raydium IDO to acquire a hotly contested token.

Some validators received over 300k transactions per second.

The second one happened a couple of months ago when bots were spamming the Candy Machine program to win an NFT mint.

That is an order of magnitude more than in the September outage: some validators were receiving upwards of 4 million transactions per second this time around.

Laine | stakewiz.com @laine_sa_

What a weekend. Yesterday the Solana blockchain halted, which means it stopped producing blocks. This resulted in validators coordinating a cluster restart which requires 80% of stake (minimum 605 validators). A 🧵 on what occured. 👇

With so many packets trying to get processed, hardware gets overwhelmed.

Excessive forking occurs, which requires validators to keep track of a myriad of increasingly larger forks.

All of this uses RAM, which can lead to validators running out of memory.

According to Toly (Solana’s co-founder),

“in theory, there is still only 1 block per slot even with a ton of forking and if the code didn't have bugs it can handle all the forks in constant memory”.

Local solutions

Protocols have some leeway to tackle these spamming issues locally. For example, Metaplex introduced a Candy Machine tax that charges an account a 0.01 SOL fee if one of the following conditions is met:

Trying to mint when the candy machine is not live before or after.
Trying to mint when there are no items left in the candy machine.
Calling Candy Machine via CPI when not using gumdrop.
Crafting a transaction where mint or set collection is not the last ix.
Using the wrong collection id than the configured one for the candy machine.
Setting Collection IX with a mismatched mint than what was just minted.
Signer Payer mismatch with collection set and mint ix.
Suspicious Transactions where disallowed programs are used.
Trying to mint on an AllowList candy machine with no allow list token.

These conditions are supposed to identify bots and deter them from spamming an NFT mint event. It’s not a perfect solution by any means. One could even posit that it primarily penalizes normal users, e.g. condition #2 is likely met for users who try to mint a popular collection that sells out almost immediately.

Yet, it certainly is one method of discouraging botting. So far, the program has collected around 5000 SOL through the implemented tax. How much of that amount was derived from bots vs retail users is a question for another time…

A global solution?

Arguably, these local patches are temporary band-aids that won’t solve the long-term issue of Solana’s liveness and congestion.

So, what’s being done on the fundamental layer-1 level to prevent bots from spamming validators and crashing the network?

Three things:

QUIC.
Stake-weighted transaction QoS.
Fee-based execution priority.

Alright, what the hell do those things mean?

I. QUIC

Let’s start with QUIC.

When applications exchange data over the internet, they usually use TCP or UDP.

TCP is reliable and accurate. It verifies that delivery was successful, uses congestion control and ensures packets are delivered in-order.

UDP, on the other hand, has less per-packet overhead and is faster since it doesn’t order packets, control congestion or recover from errors.

The tradeoff between TCP vs UDP is in reliability vs speed.

Solana currently uses a custom version of UDP.

Clients send transactions encoded as UDP packets to a validator’s Transaction Processing Unit (TPU).

Solana’s underlying packet transport layer, thus, fundamentally lacks congestion control.

When you sign a transaction in Solana for a product you interact with, the transaction is sent to the dApp’s specified RPC provider.

That provider then sends your transaction as a UDP packet to the current as well as the next leader (who is known in advance).
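
To show what "fire and forget" means in practice, here is a minimal loopback sketch using Python's standard `socket` module. The receiver merely stands in for a validator's TPU port and the payload is a placeholder, not a real Solana transaction.

```python
import socket

# Receiver standing in for a validator's TPU port (loopback only, OS-assigned port).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(1.0)
tpu_address = receiver.getsockname()

# Sender standing in for an RPC node: no handshake, no acknowledgement, no retry.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
fake_transaction = b"signed-transaction-bytes"  # placeholder payload, not a real Solana TX
bytes_sent = sender.sendto(fake_transaction, tpu_address)

# On loopback the datagram normally arrives; on a congested link it could be dropped silently.
payload, source = receiver.recvfrom(2048)
print(bytes_sent, payload == fake_transaction)

sender.close()
receiver.close()
```

Note that `sendto` returns as soon as the datagram is handed to the OS. The sender never learns whether the packet arrived, and nothing stops it from calling `sendto` in a tight loop.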

QUIC is a transport protocol that came out of Google.

It is built on top of UDP and designed as an alternative transport layer that sits somewhere between UDP and TCP.

QUIC is closer to the speed of UDP while allowing for TCP-like flow control.

The switch to QUIC was proposed as a way to support

larger transactions
more reliable packet transmission
enforceable rate limiting
standardized flow control.

The original proposal by the Solana team

QUIC specifically incorporates protection against “IP spoofing”.

This means it prevents one anonymous machine from sending the same transaction repeatedly while hiding behind forged source addresses.

Hence, bots would have to send each transaction from a new instance.
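
Once every packet belongs to an identifiable connection, rate limiting becomes enforceable. The sketch below uses a classic token bucket per peer; the class names, the limits and the idea of keying on a validated peer address are assumptions for illustration, not how the Solana validator implements it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size, in packets
    refill_per_sec: float  # sustained packets per second
    tokens: float
    last_seen: float

    def allow(self, now: float) -> bool:
        """Refill based on elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_seen = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ConnectionLimiter:
    """Keeps one bucket per connection identity (hypothetical: a validated peer address)."""

    def __init__(self, capacity: float, refill_per_sec: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, peer: str, now: float) -> bool:
        bucket = self.buckets.get(peer)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, self.capacity, now)
            self.buckets[peer] = bucket
        return bucket.allow(now)


limiter = ConnectionLimiter(capacity=3, refill_per_sec=1)
spam_results = [limiter.allow("bot", now=0.0) for _ in range(5)]
print(spam_results)                    # [True, True, True, False, False]
print(limiter.allow("user", now=0.0))  # True
```

The bot burns through its burst allowance and gets throttled, while a different peer is unaffected. With plain UDP there is no connection to hang such a bucket on.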

II. QoS

Moving on to Stake-weighted transaction QoS.

QoS = Quality of Service.

QoS comes from computer networking and is a methodology for controlling data traffic on limited-capacity networks so that important traffic gets predictable latency.

Concretely, this means that a node’s bandwidth to the leader will depend on its stake-weight (how much SOL is staked with it relative to other nodes).

Since the leader bandwidth is limited, this would replace the current practice of accepting transactions on a first-come-first-served basis.

So, a node with 1% stake-weight should be able to transmit 1% of packets to the leader.

To be more precise,

“if a node has 1% of the stake, it's connection to the leader should in theory not be starved by 99% of the staked or rest of unstaked senders” (Anatoly Yakovenko in response to the original thread).
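
A back-of-the-envelope version of this idea, assuming a hypothetical leader that can ingest a fixed number of packets per slot and simply caps every sender at its stake-proportional share (the real mechanism is more involved):

```python
from typing import Dict


def stake_weighted_quotas(stakes: Dict[str, int], leader_capacity: int) -> Dict[str, int]:
    """Split the leader's packet budget between senders in proportion to stake."""
    total_stake = sum(stakes.values())
    if total_stake <= 0:
        raise ValueError("total stake must be positive")
    return {node: leader_capacity * stake // total_stake for node, stake in stakes.items()}


def admitted_packets(sent: Dict[str, int], quotas: Dict[str, int]) -> Dict[str, int]:
    """Each sender gets at most its quota, no matter how much it sends."""
    return {node: min(count, quotas.get(node, 0)) for node, count in sent.items()}


stakes = {"small_node": 1, "big_node": 39, "everyone_else": 60}  # stake in percent (made up)
quotas = stake_weighted_quotas(stakes, leader_capacity=10_000)
sent = {"small_node": 80, "big_node": 1_000_000, "everyone_else": 5_000}
print(quotas)                          # {'small_node': 100, 'big_node': 3900, 'everyone_else': 6000}
print(admitted_packets(sent, quotas))  # {'small_node': 80, 'big_node': 3900, 'everyone_else': 5000}
```

Even though `big_node` floods the leader with a million packets, `small_node` still gets all 80 of its packets through because its 1% share is reserved.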

This comes into play as a second priority measure after a transaction’s fee-per-compute-unit to select packets for block inclusion.

III. Fee Markets

So, finally... fee-based execution priority.

Fee markets are coming to Solana!

A user is able to attach an optional fee to their transaction to get it prioritized, with 50% of the fee going to the validator and 50% being burned.

Why 50%? If 100% of the fee went to the validator, the validator could recycle the fees it earns into its own transactions and take up bandwidth at no net cost. Burning 50% makes this more expensive and disincentivizes validators from spamming the network.

With the introduction of such a fee, validators can then start prioritizing transactions based on fees.

The fee would still be deducted even if the transaction fails, thus penalizing spam transactions.
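
Here is a simplified sketch of what fee-based ordering and the 50/50 split could look like. The `PendingTx` type and function names are hypothetical, and giving the odd lamport to the validator is an arbitrary choice made for this example.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import List, Tuple


@dataclass(frozen=True)
class PendingTx:
    signature: str
    priority_fee_lamports: int
    compute_units: int  # must be > 0 in this sketch


def fee_per_compute_unit(tx: PendingTx) -> Fraction:
    """Exact ratio, so ordering is not affected by floating-point rounding."""
    return Fraction(tx.priority_fee_lamports, tx.compute_units)


def prioritize(txs: List[PendingTx]) -> List[str]:
    """Order transactions from highest to lowest fee per compute unit."""
    ordered = sorted(txs, key=fee_per_compute_unit, reverse=True)
    return [tx.signature for tx in ordered]


def split_priority_fee(fee_lamports: int) -> Tuple[int, int]:
    """Return (to_validator, burned) for a 50/50 split."""
    burned = fee_lamports // 2
    return fee_lamports - burned, burned


queue = [
    PendingTx("liquidation", priority_fee_lamports=10_000, compute_units=200_000),
    PendingTx("swap", priority_fee_lamports=5_000, compute_units=50_000),
    PendingTx("transfer", priority_fee_lamports=0, compute_units=100_000),
]
print(prioritize(queue))           # ['swap', 'liquidation', 'transfer']
print(split_priority_fee(10_000))  # (5000, 5000)
```

Note that the swap wins despite paying a smaller absolute fee: what matters is the fee relative to the compute it consumes.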

Buffalu from Jito has some neat notes on the actual implementation here 👇

The tricky thing with fee prioritization is that you don’t want one local issue to propagate to the entire network.

For instance, a popular NFT mint spiking gas for every other unrelated transaction is not ideal.

SMS T◎Ly, 🇺🇸 @aeyakovenko

@chainyoda quic is already on testnet, break.solana.com/wallet?cluster…. The boffins have a bunch of PRs mid flight for fee prioritization. It's tricky because a single write account shouldn't spike fees for everyone.

SMS T◎Ly, 🇺🇸 @aeyakovenko

ELI5: fee market on solana blockspace is limited, users need to bid for it, but if blocks aren't saturated why would user fees go up? 🧵

Hence, naturally, one might wonder: don’t fee markets lead to the same problem as Ethereum’s gas spikes?

Not quite.

In Ethereum, gas is used as a means to acquire blockspace.

Transactions compete globally to be included in the next block.

In Solana, there’s another nuance.

Besides being added to the next block, competition among transactions also boils down to which state is being written to.

Solana’s runtime Sealevel is able to parallelize transactions since every instruction has to specify which accounts it reads from and writes to.

The consequence is that transactions (which consist of multiple instructions) that only read the same state can be parallelized.

In contrast, only one transaction at a time can write to the same account.
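
The read/write rule can be expressed as a simple set check. This is a sketch of the concept only, with made-up account names, not Sealevel's actual scheduler:

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AccountAccess:
    name: str
    reads: FrozenSet[str] = frozenset()
    writes: FrozenSet[str] = frozenset()


def can_run_in_parallel(a: AccountAccess, b: AccountAccess) -> bool:
    """Two transactions conflict if either writes an account the other reads or writes."""
    a_blocks_b = a.writes & (b.reads | b.writes)
    b_blocks_a = b.writes & (a.reads | a.writes)
    return not (a_blocks_b or b_blocks_a)


price_check_1 = AccountAccess("price_check_1", reads=frozenset({"oracle"}))
price_check_2 = AccountAccess("price_check_2", reads=frozenset({"oracle"}))
oracle_update = AccountAccess("oracle_update", writes=frozenset({"oracle"}))
mint_nft = AccountAccess("mint_nft", writes=frozenset({"candy_machine"}))

print(can_run_in_parallel(price_check_1, price_check_2))  # True: both only read
print(can_run_in_parallel(price_check_1, oracle_update))  # False: read vs write
print(can_run_in_parallel(oracle_update, mint_nft))       # True: disjoint accounts
```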

a visualization of account write-locks by @jump_

The resulting priority fee change should, thus, introduce a local fee market instead of a global one.

This local fee market would result in spiking transaction fees solely for local states.

For example, a contested liquidation where bots outbid each other with higher fees should not significantly change fees for any other transaction like swaps or NFT purchases.

This is because only one of these liquidation transactions will be added to the block (since they’re all trying to write state to one account), leaving space in the block for other transactions that don’t affect the same account.
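
A toy model of this effect, under a deliberately strong assumption: each account can be write-locked by at most one transaction per block, and the block holds a fixed number of transactions. All names and fees are invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Bid:
    name: str
    fee: int
    writes: FrozenSet[str]


def pack_block(bids: List[Bid], max_txs: int) -> List[str]:
    """Greedy packing: highest fee first, skipping bids whose write accounts are already taken."""
    locked: Set[str] = set()
    block: List[str] = []
    for bid in sorted(bids, key=lambda b: b.fee, reverse=True):
        if len(block) >= max_txs:
            break
        if bid.writes & locked:
            continue  # lost the local auction for this account
        block.append(bid.name)
        locked |= bid.writes
    return block


bids = [
    Bid("liq_bot_1", fee=900, writes=frozenset({"obligation_x"})),
    Bid("liq_bot_2", fee=800, writes=frozenset({"obligation_x"})),
    Bid("liq_bot_3", fee=700, writes=frozenset({"obligation_x"})),
    Bid("swap", fee=5, writes=frozenset({"pool_sol_usdc"})),
    Bid("nft_buy", fee=1, writes=frozenset({"listing_42"})),
]
print(pack_block(bids, max_txs=3))  # ['liq_bot_1', 'swap', 'nft_buy']
```

In a purely global fee market the three liquidation bots would fill the block and price out the swap and the NFT purchase. Here they only compete with each other.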

In reality, there are Nth order effects that might still affect competition for blockspace and net-increase fees globally.

E.g., you might have to pay an additional fee anyway if you want to make sure that your transaction goes through since other transactions might do the same.

An extra measure to fend off the spam for write-locks (one transaction writing to an account preventing others from accessing the same state) is a proposal by Anatoly to let programs add an optional fee for writes with the ability to refund successful transactions.

For example, a program could charge any failed transaction X SOL and refund any successful transaction the X SOL.

This essentially introduces a fee for spam transactions that try to write-lock state.
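
In code, the charge-then-refund idea boils down to something like the following. This is purely illustrative, since the proposal had not been finalized; the function names and amounts are made up.

```python
def settle_write_lock_fee(balance_lamports: int, fee_lamports: int, succeeded: bool) -> int:
    """Charge the optional write fee up front, then refund it only on success."""
    if balance_lamports < fee_lamports:
        raise ValueError("payer cannot cover the write-lock fee")
    after_charge = balance_lamports - fee_lamports
    return after_charge + fee_lamports if succeeded else after_charge


def spam_cost(attempts: int, successes: int, fee_lamports: int) -> int:
    """Total fee lost by a sender whose failed attempts are not refunded."""
    return (attempts - successes) * fee_lamports


print(settle_write_lock_fee(1_000_000, 10_000, succeeded=True))   # 1000000
print(settle_write_lock_fee(1_000_000, 10_000, succeeded=False))  # 990000
print(spam_cost(attempts=1_000, successes=1, fee_lamports=10_000))  # 9990000
```

A regular user whose transaction lands pays nothing extra, while a bot that fires a thousand attempts to win once eats the fee on the other 999.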

A proposition to add an optional write-lock fee

A lot of the proposed changes make their first appearance in Solana’s 1.10 and 1.11 versions.

So is 1.10/1.11 the end-all-be-all solution?

Nope, but it’s an important step in the right direction.

SMS T◎Ly, 🇺🇸 @aeyakovenko

@bennybitcoins @chuddymanee @AutismCapital not totally there in 1.10. the per state account prioritization needs to be respected across a bunch of queues that lead to the block producer. its somewhat there in 1.10 but not everywhere. and will definitely require some iteration.

While the upcoming changes will require multiple iterations to get things right, the future implications could be tremendous.

For an overview of all changes, you can follow along with this GitHub issue or the Solana website where the team keeps track of the proposed solutions.

Devs are certainly doing something, and I’m optimistic about the road ahead.

If you’re interested in blockspace, MEV or security, stay tuned for more posts!

Next time, we’ll wander into the dark forest of Solana 🌳

sanny
