in-memory LockFreeBtree<K, V> #62
I'd be very interested in including a lock-free B-tree in Crossbeam. I think it's possible to abstract a Bw-Tree into a separate crate, and then expose a completely in-memory API for use as a concurrent map.

Here are some of my thoughts on a completely in-memory Bw-Tree (let's call it `LockFreeBtree<K, V>`):

1. Types

In the standard library we have `BTreeMap<K, V>`, while I noticed that in rsdb keys are of type `Vec<u8>`.

2. Page Table

In rsdb you implemented a radix tree as the page table storage. The page table is one of the crucial tricks behind the Bw-Tree, and it took me a long time to realize that having the table as an array, hash table, or some kind of tree is completely unnecessary. For a purely in-memory tree, pages can simply be special heap-allocated nodes. In other words, IDs are just pointers to those special nodes, and the mapping from an ID to another node is written inside the heap-allocated node itself.

I think the Bw-Tree should allow configuring the page table implementation (the API might take the page table as a type parameter).

3. Iterators

Unlike the typical quick operations on a Bw-Tree (e.g. insert/search/remove), iterators may be long-lived. This is a problem in the context of memory reclamation: holding a thread pinned for the entire duration of iteration might block garbage collection for an unacceptably long time. I wonder how you're going to solve this problem.

My idea was to keep a reference count in each node of the tree. That would allow us to hold a reference to a node even after unpinning the current thread. The key idea is that we increment the reference count of a node just before unpinning, and then use the node outside the pinned region of code. All nodes start with a reference count of one. When removing a node from the tree we just decrement the count. Whoever decrements the count to zero has to add the node as a piece of garbage to the epoch GC.

4. Thoughts on ART

Finally, I wonder what you think about adaptive radix trees. In my benchmarks, an ART is typically several times faster than a B-tree for insert/search/delete, and several times slower for iteration. If your keys are byte arrays, ART can be a fantastic data structure.

Having an ART in Crossbeam is another long-term goal of mine. Servo uses a concurrent string cache for interning (it's just a cache of strings that maps every unique string to a unique integer ID), which is based on a sharded, mutex-protected hash table. I think some performance gains might be achieved by switching to ART.

ART is probably not a good fit for rsdb as the index data structure, though, since it's more difficult to seamlessly move nodes between memory and disk, keys have to be byte arrays, iteration is slightly worse, and it's not strictly speaking lock-free. But I'm still curious if you have any thoughts on this. :) Is your page mapping table perhaps going to be an ART?

cc @aturon - In case you are interested, this is the cool database project I mentioned to you last week, and Bw-Tree is the concurrent B+-tree behind it.
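To make the "IDs are just pointers" idea concrete, here is a minimal sketch with invented names (not rsdb's code): each logical page is a heap-allocated cell holding an atomic pointer to its delta chain, a page ID is simply the address of that cell, and prepending a delta is a plain CAS on that cell, with no array, hash table, or tree in sight.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// A delta record or base page in a Bw-Tree delta chain (contents elided).
struct Delta {
    next: *mut Delta, // older records in this page's chain
    // ... keys, values, split/merge info, etc.
}

// The "page table entry" lives inside a heap-allocated cell, and a PageId
// is just the address of that cell: no separate mapping structure needed.
struct PageCell {
    chain: AtomicPtr<Delta>,
}

type PageId = *const PageCell;

// Prepending a delta is a CAS on the cell, exactly what an update to an
// indirection-table slot would be.
unsafe fn prepend(id: PageId, delta: *mut Delta) -> bool {
    let cell = &*id;
    let current = cell.chain.load(Ordering::Acquire);
    (*delta).next = current;
    cell.chain
        .compare_exchange(current, delta, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```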
Background: I began this project as part of my larger (currently idle) rasputin distributed database. The primary goal was to provide a super fast persistent KV store that I am familiar enough with to bend to the whims of problems I face in stateful distributed systems, and to finally give the Rust community a persistent DB with a high focus on reliability. I'm super interested in collaborating on a lock-free tree for crossbeam!
For this to be done in a generic way that plays with the underlying pagecache, we just need to be able to choose either the in-memory, straight-to-pointer page table or the persistent pagecache-backed one.
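A hypothetical sketch of what that choice could look like (invented trait, not sled's actual API): the tree is written against a page table abstraction, the in-memory build plugs in a straight-to-pointer implementation, and the persistent build plugs in the pagecache.

```rust
// Invented abstraction for illustration; sled's real API differs.
// The Bw-Tree is generic over how page IDs are allocated, read, and
// CAS-replaced, so the same tree code can run purely in memory or on
// top of a persistent pagecache.
trait PageTable {
    type Id: Copy;
    type Page;

    /// Install a new page and hand back its ID.
    fn allocate(&self, initial: Self::Page) -> Self::Id;

    /// Read the current page. In real concurrent code this would return
    /// an epoch-guarded pointer rather than a plain reference.
    fn get(&self, id: Self::Id) -> &Self::Page;

    /// Atomically swing the mapping if it is still `old`; on failure the
    /// caller gets `new` back and should retry.
    fn cas(&self, id: Self::Id, old: &Self::Page, new: Self::Page)
        -> Result<(), Self::Page>;
}
```

For the in-memory implementation, `Id` can literally be the cell pointer from the sketch above; the persistent implementation would route the same three operations through the pagecache and its log.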
I like this for supporting long iterations directly on the tree! A single RC on a wide node is pretty low overhead. This sounds great to me after thinking about it!
So, I'm definitely up for combining forces for a lock-free in-memory B-tree!
I was thinking about this for a long time, but I'm still not really sure how to move forward, at least not right now. :) I'd just suggest keeping in mind that we eventually want to expose the tree as a purely in-memory data structure. My plan is to experiment a little bit and try coding up a simple prototype of a Bw-Tree, focusing more on the low-level implementation details than the high-level tree design (consolidations, supporting all operations, etc.). More specifically, I'm trying to answer questions like how to represent dynamic nodes with as few indirections as possible.
@stjepang awesome, I'm really looking forward to seeing what you come up with! RE: dynamic nodes with few indirections, it may be hard to beat a contiguous array of keys stored inline in the node. For non-Clone/Copy keys, this may not be the way forward, as you'll need a reference to a single instance. Also keep in mind that the inserted keys that were the source of node min/max bounds may have been deleted, so you may need a ref count on the underlying key. There's also a choice of how you handle re-inserting an equivalent key: do you keep the indices and node bounds referring to different, equivalent key objects, or do you try to coalesce these somehow? I haven't thought of a better solution to the iterator problem than your RC idea above. I'll definitely let you know if something else pops into my head, but I think that approach will give you some good mileage.
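One way the per-key ref count could work, sketched under the assumption that `Arc` overhead is acceptable (types invented, not from sled): node min/max bounds hold the same `Arc<K>` as the entry that produced them, so a bound stays valid after its source entry is deleted, and re-inserting an equivalent key can either reuse the existing `Arc` (coalescing) or allocate a fresh one.

```rust
use std::sync::Arc;

// Bounds share ownership of key instances with the entries, so deleting
// the entry that a bound came from cannot invalidate the bound.
struct Node<K, V> {
    lo: Option<Arc<K>>, // None = unbounded below
    hi: Option<Arc<K>>, // None = unbounded above
    entries: Vec<(Arc<K>, V)>,
}

fn in_range<K: Ord, V>(node: &Node<K, V>, key: &K) -> bool {
    node.lo.as_deref().map_or(true, |lo| lo <= key)
        && node.hi.as_deref().map_or(true, |hi| key < hi)
}
```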
Another paper you might find interesting: A Comparison of Adaptive Radix Trees and Hash Tables. It compares the performance of adaptive radix trees, hash tables, Judy arrays, and B+ trees.
@stjepang ahh cool, thanks for the paper! The current naive radix implementation is getting more and more significant on the flamegraphs, and pretty soon I think I'll give an ART implementation a go! This paper will be really useful for that. The current one has like a 1:64 pointer:data ratio and even though the traversal / insertion code is fairly minimal and probably nice to the L0 code cache, I wonder how much that node bloat messes up the overall performance.
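For comparison, the adaptivity that avoids that bloat comes from the four node sizes in the ART paper; here is a rough, non-concurrent Rust rendering of their layouts (field shapes follow the paper, naming is mine):

```rust
// The four adaptive node sizes from the ART paper. A naive 256-way radix
// node burns 256 pointers regardless of occupancy; ART starts small and
// upgrades a node only when it fills, which is what recovers the
// pointer:data ratio mentioned above.
enum ArtNode<V> {
    Node4 {
        keys: [u8; 4],
        children: [Option<Box<ArtNode<V>>>; 4],
        len: u8,
    },
    Node16 {
        keys: [u8; 16], // searched with SIMD in the paper
        children: [Option<Box<ArtNode<V>>>; 16],
        len: u8,
    },
    Node48 {
        // 256-entry byte index into a dense child array.
        child_index: [u8; 256],
        children: [Option<Box<ArtNode<V>>>; 48],
        len: u8,
    },
    Node256 {
        children: [Option<Box<ArtNode<V>>>; 256],
    },
    Leaf {
        key: Vec<u8>,
        value: V,
    },
}
```

A node is upgraded to the next size only when it fills, so sparse regions of the key space pay for 4 slots rather than 256.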
For supporting long-lived pointers such as iterators in Crossbeam, I have implemented a very preliminary version of a hybrid of EBR (epoch-based reclamation) and HP (hazard pointers). The high-level idea of this hybrid is to use HP for long-lived pointers and EBR for short-lived ones, so that long-lived pointers do not hinder epoch advancement while short-lived ones enjoy low access latency. This hybrid is very much inspired by Snowflake. Though I'm not sure HP is actually helpful in iteration workloads. First, maybe it's not costly to count the references to a node; I'd like to produce some benchmark results. Second, HP protects only a handful of pointers, while a user may want to iterate over a large set of data. With reference counting, you're going to mark all the nodes first and then iterate over them later, right?
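For the HP half of the hybrid, a minimal sketch of the classic protect handshake, simplified to a single global slot (real schemes keep a few slots per thread, and a reclaimer frees a node only after scanning every slot):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node; // contents elided

// A single hazard slot for brevity. Publishing a pointer here keeps the
// node alive across epoch advancement, so a long-lived iterator can unpin
// without holding the global epoch back.
static HAZARD: AtomicPtr<Node> = AtomicPtr::new(std::ptr::null_mut());

// Load/publish/re-check loop: once the re-check succeeds, any reclaimer
// that scans HAZARD before freeing is guaranteed to see our claim.
fn protect(src: &AtomicPtr<Node>) -> *mut Node {
    loop {
        let p = src.load(Ordering::Acquire);
        HAZARD.store(p, Ordering::SeqCst);
        if src.load(Ordering::Acquire) == p {
            return p;
        }
        // src changed under us; retry with the new pointer.
    }
}
```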
Not really. The idea is to increment and decrement reference counts as you go from one node to another. The same applies to hazard pointers: for example, if you want to iterate over a long chain of 1000 nodes, you only need two hazard pointers to do that.
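A sketch of that hand-over-hand counting, combined with the scheme from earlier in the thread (every node starts at a count of one for the tree's own reference, and whoever drops the count to zero hands the node to the epoch GC). Names and the restart-on-race policy are my choices, not settled design:

```rust
use crossbeam_epoch as epoch;
use std::sync::atomic::{fence, AtomicPtr, AtomicUsize, Ordering};

struct Node {
    refs: AtomicUsize,     // starts at 1: the tree's own reference
    next: AtomicPtr<Node>, // sibling link used during leaf iteration
    // ... keys and values elided
}

// Take a reference, refusing to resurrect a node whose count already hit
// zero (such a node is logically dead and queued for reclamation).
unsafe fn try_acquire(node: *mut Node) -> bool {
    let mut n = (*node).refs.load(Ordering::Relaxed);
    while n != 0 {
        match (*node).refs.compare_exchange_weak(
            n, n + 1, Ordering::Acquire, Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            Err(current) => n = current,
        }
    }
    false
}

// Drop one reference; whoever hits zero hands the node to the epoch GC.
unsafe fn release(node: *mut Node) {
    if (*node).refs.fetch_sub(1, Ordering::Release) == 1 {
        fence(Ordering::Acquire);
        let guard = epoch::pin();
        guard.defer_unchecked(move || drop(Box::from_raw(node)));
    }
}

// One hand-over-hand step: pin briefly, bump the successor's count, unpin.
// At most two nodes are protected at once, no matter how long the chain is.
unsafe fn advance(cur: *mut Node) -> Option<*mut Node> {
    let next = {
        let _guard = epoch::pin(); // short pin: memory stays valid meanwhile
        let next = (*cur).next.load(Ordering::Acquire);
        if next.is_null() || !try_acquire(next) {
            None // end of chain, or raced with removal: caller restarts
        } else {
            Some(next)
        }
    };
    release(cur); // drop our count on the node we are leaving
    next
}
```

The pin inside `advance` lasts only one step, so iteration never blocks epoch advancement for long, and `try_acquire` refuses to resurrect a node whose count already reached zero, which closes the race with concurrent removal.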
Hey @stjepang, have you played around with a lock-free ART in rust yet? I'm starting to think more about using one to replace the naive radix tree here in sled. |
I haven't, but my friend implemented this, which should be a good starting point for a concurrent ART.
@stjepang ahh cool, I pinged Andy Pavlo a couple days ago, and he was kind enough to provide some recommendations. He also mentioned ART, and I've kicked off a few mental threads around it. There is a pattern in use in sled, inspired by the LLAMA paper, for keeping lock-free data structures synchronized with a log-structured persistent store in terms of linearizing updates: we first reserve a certain amount of space in the log, then do a CAS on the in-memory structure, and then, based on the success of the CAS, either write a "failed flush" into the log buffer or the actual data for the update. I have been wondering about the performance of doing this to create a persistent ART index for point queries.
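A hedged sketch of that reserve-then-CAS pattern with invented names (sled's real pagecache API differs): space is reserved before the CAS so that log order can agree with CAS order, and the CAS outcome decides whether the slot receives the update or an explicit "failed flush" marker.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Invented abstractions for illustration only.
trait Log {
    type Reservation: LogReservation;
    /// Reserve `len` bytes; the slot's log offset is fixed at this point.
    fn reserve(&self, len: usize) -> Self::Reservation;
}

trait LogReservation {
    /// Fill the slot with the update and make it durable.
    fn complete(self, bytes: &[u8]);
    /// The CAS lost the race: write a "failed flush" marker so recovery
    /// knows to skip this slot.
    fn abort(self);
}

struct Update { /* serialized delta, elided */ }

fn linearized_update<L: Log>(
    log: &L,
    slot: &AtomicPtr<Update>,
    old: *mut Update,
    new: *mut Update,
    serialized: &[u8],
) -> bool {
    // 1. Reserve space *before* the CAS so log order can match CAS order.
    let res = log.reserve(serialized.len());
    // 2. The CAS on the in-memory structure decides the winner.
    match slot.compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire) {
        Ok(_) => { res.complete(serialized); true }  // 3a. persist the update
        Err(_) => { res.abort(); false }             // 3b. log a failed flush
    }
}
```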
@spacejam In fact, one of my mid-term goals is pagecache + persistent memory :) I really enjoyed Andy's recent work, and I'm glad that you already contacted him! I'm also thinking of an in-memory lock-free ART. It is very... intriguing, but I hope we will find a way. I haven't thought about a persistent ART, but it'll be very interesting if the page index for a log-structured storage (LSS) is also persisted in the LSS itself! I'm curious how to break this seeming circularity, though.
@jeehoonkang that sounds awesome! If you're going for optimization, there are a couple of tunables that are probably the biggest ones to play with for your workloads.
For hosting on the LSS, that's what sled is doing now. Even if the checkpoint gets deleted, it will scan the storage, record which page fragments are stored at which log offsets, and use the recover functionality in the BLinkMaterializer to track root page changes. Then, when requests come in, it just starts at the root page and uses the recorded log offsets to materialize pages while working down the tree. I think the same approach could be used directly for a persistent ART. I'm skeptical about the persistent version being able to beat the Bw-Tree in terms of space efficiency or read latency when not loaded completely into memory, but if we don't evict nodes from cache it might be good. In any case, it would probably not be the hardest thing to build a prototype, since pagecache is now split out :)
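A rough sketch of that two-phase recovery, with invented types (sled's real recovery code differs): one pass scans the log and buckets fragment offsets by page ID; pages are then materialized lazily by replaying their fragments when a request first touches them.

```rust
use std::collections::HashMap;

type PageId = u64;
type LogOffset = u64;

// Invented shapes for illustration.
struct Fragment {
    pid: PageId,
    at: LogOffset,
    bytes: Vec<u8>,
}

// Pass 1: scan the whole log once, recording where each page's fragments live.
fn scan(log: impl Iterator<Item = Fragment>) -> HashMap<PageId, Vec<LogOffset>> {
    let mut index: HashMap<PageId, Vec<LogOffset>> = HashMap::new();
    for frag in log {
        index.entry(frag.pid).or_default().push(frag.at);
    }
    index
}

// Pass 2 (lazy): materialize a page only when a request first touches it,
// reading its fragments from the recorded offsets and folding them from
// oldest to newest.
fn materialize(
    offsets: &[LogOffset],
    read: impl Fn(LogOffset) -> Vec<u8>,
    fold: impl Fn(Vec<Vec<u8>>) -> Vec<u8>,
) -> Vec<u8> {
    fold(offsets.iter().map(|&o| read(o)).collect())
}
```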
An interesting new paper was published.
While this seems interesting, and possibly something I'd like to do in the future, I'm going to close this because it's not relevant to the sled project itself. FWIW, it's possible to create this by basically just chopping out the pagetable completely in sled, and this might even be a nice exercise to see how small the tree implementation can become when it doesn't have a pagecache underneath to broker atomic operations.
implemented as the concurrent-map crate :)
and a non-concurrent, fixed-length-key ART as the art crate