(Significant work is currently happening in the tyler_ranges
branch)
flexible linearizable distributed store
triumvirs: operational clarity, performance and composability
currently implemented: linearized KV set/get/cas/del. client code is happy-path only, so it's only fit for playing around with at this point!
current reasons why you don't want to use this beyond playing with it:
- Mostly unimplemented. We don't have support for automatic resharding, real transactions or collection types other than KV yet. These are still in the planning phase.
- Possibly incorrect. We have not yet proven the correctness of the core consensus algorithm. We may be able to adapt the Raft Coq proof to this end, as we are essentially replacing Raft's preemptible leadership with a non-preempting lease to improve throughput in the presence of partial partitions.
- Inefficient. The write path involves a TON of copying. We are in the process of designing a much more efficient buffer management system.
- Buggy. Client code is a complete hack that only occasionally works. We have a simulator in place for teasing out bugs in the state machine, but we haven't used it for simulating common datacenter conditions like partitions, delayed message arrival, node restarts/shutdowns/pauses, etc...
- Undocumented.
- Unpopular. No community and no production users (or, at least I hope nobody is using it yet!).
cargo build
./run.sh
tail -f _rasputin_test/*log
target/debug/rasputind \
--peer-port=7777 \
--cli-port=8888 \
--seed-peers="127.0.0.1:7777" \
--storage-dir=/var/lib/rasputin/ \
--logfile=/var/log/rasputin.log
Cargo.toml:
[dependencies]
rasputin = "0.1.0"
Code:
extern crate rasputin;
fn main() {
let peers = vec!["127.0.0.1:8888".parse().unwrap()];
let nthreads = 1;
let mut cli = rasputin::Client::new(peers, nthreads);
cli.set(b"k1", b"v1").unwrap();
assert!(cli.get(b"k1").unwrap().get_value() == b"v1");
// CAS returns the current value, and sets the success flag accordingly
assert!(cli.cas(b"k1", b"v1", b"v12").unwrap().get_value() == b"v12");
assert!(cli.cas(b"k1", b"vNever", b"vNever2").unwrap().get_value() == b"v12");
assert!(cli.cas(b"k1", b"vNever", b"vNever2").unwrap().get_success() == false);
assert!(cli.cas(b"k1", b"v12", b"v13").unwrap().get_value() == b"v13");
// deletes return the last value
assert!(cli.del(b"k1").unwrap().get_value() == b"v13");
assert!(cli.get(b"k1").unwrap().get_success() == false);
}
Rasputin will utilize shard size and request density metrics to faciliate intelligent splitting.
- kv: backed by RocksDB
- log: Kafka-like sequential segment files
- object: files on system VFS
- subscribe: in-order mutation stream
- watch: at most once mutation notification
- consensus: mutations block on replication to a quorum
- async: mutations return quickly, and are replicated later
- logarithmically bucketed histograms for efficient aggregation and consumption of extremely high velocity metrics, a la loghisto.
- mio event loops
- leader election
- rocksdb persistence layer
- log replication
- multipaxos consensus
- simple KV client operations
- reconfigurable membership
- range splitting
- mesos framework
- c/jvm/python/ruby/go client libs
Because Rasputin aims to be as general purpose of a replication mechanism as possible, it needs to be resilient against partitions. We aim to reuse the parts of Raft that work for this as much as we can, and replace the leader election mechanism with a lease-based one that does not preempt in the presence of partial partitions. This obviously needs to be tested extensively, and to that end a comprehensive simulator is being built for testing the state machine (see test/cluster.rs), and fault injection tooling is being built for inducing realistic datacenter conditions on a non-simulated cluster.
Raft is vulnerable to rapid leader churn when a partial partition exists between the leader and any other node. The partially partitioned node will fire its leader election timer and receive quorum. Because the old leader can't talk to this new leader, it will do the same. Leadership bounces a lot and we have suboptimal throughput. Harpoon is essentially just Raft with a modified election mechanism: candidates and leaders request leases from all peers, extend leadership if they reach quorum, and abdicate if they do not reach a quorum of successful extension request responses by the end of their lease. This prevents leadership churn in scenarios where there is a partial partition, which is common over the open internet, for example.
Harpoon has not yet been formally verified, but eventually we will adapt the Raft Coq proof for it.