-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[processor/interval] Persistence backed interval processor with support for multiple aggregation intervals #33949
Comments
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
We have an established notion of a storage extension which allows components to persist data via an interface. Users can then choose the implementation of the interface (i.e. the type of storage extension) that meets their needs. The use case you're describing seems reasonable but can we please first consider whether it could work with a storage extension? There are several examples of components using the extension - this might be a relatively straightforward one. |
@djaglowski Thanks for looking over the issue.
The current storage extensions would not be efficient for aggregations. The best we could do with the current storage extension would be to store each resource as a key-value pair and do something like update on write manually. For example, when we get something stored on the same key then we decode the data, perform merge, and write it back to the DB. Pebble, OTOH, can do periodic compactions and perform merges (aggregations) on compactions. I think if we can implement pebble as a new storage extension then we can use it with aggregations efficiently. WDYT, does it make sense? |
I'm not opposed to a new pebble extension (though I don't have capacity to help make it happen). What I'm most curious about is whether the storage extension really needs to be used so aggressively. In other words, can we maintain aggregations in memory and periodically write the overall state to the storage extension (e.g. once per second)? |
Got it. I can do the heavy lifting here once we have a general direction, if that would work for you.
No, we don't need to use it so aggressively as to hit the database for each value we get. I did a PoC for using pebble for aggregations and used a size-based batch buffer of 16MBs (ref) to keep the data in-memory before flushing it to the actual DB. In addition, before harvesting the period when it is mature we would flush all in-memory data to db. One of the advantages of using something like pebble over a generic key-value DB would be that it would make writes fairly straightforward, without a lot of computational overhead involved in merging. |
I had the same thought. Would it be fine to only store to the DB every X seconds or so, so we don't have to suffer the read, deserialize, write roundtrip for every operation |
To clarify, I don't have capacity to sponsor (review and ultimately be responsible for) another component. It's still worth proposing formally though since you may find another sponsor. Otherwise you can always host it outside of contrib and people can pull it into their builds. |
Hmm, I understand the point now. In the PoC I posted I am still serializing the data and writing it to an in-memory batch, however, we can save the whole serialization/deserialization bit by performing some merges in-memory and then serializing the partial aggregate. This makes sense to me and we could definitely do it.
Got it. Thanks for the clarification. |
A clarification to this, with pebble, in the hot path (the |
An update, I am working on building a pebble-based storage extension as per @djaglowski 's suggestion earlier. The existing storage extension API is basic and using it directly would be a bit detrimental for the performance as it has: a) no range keys support b) keys are always string. I will know more once I have made some progress with the code. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
I have been a bit away from this topic but I am working on this again. Looking into the storage extension, I found the current interface for the extension lacking support for the use-case for using pebble for aggregation. If we were to use the storage extension in the current state (with the support for simple get/put/batch) then the performance would be quite bad as we require range operations on keys. I am currently working on getting some benchmarks comparing the pebble approach with the current interval processor to get some data. |
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself. |
Component(s)
processor/interval
Is your feature request related to a problem? Please describe.
The current version of the interval processor does in-memory aggregations for a single given interval (more than one intervals can be added with extra-overhead as new processors). The in-memory aggregations can be lost due to crashes and can also cause high memory usage for high cardinality aggregations.
Describe the solution you'd like
I propose database-backed aggregation supporting more than one intervals. I have a working PoC for this here, it is rough around the edges and will require some work before it can be contributed but it demonstrates the proposed idea. The PoC uses pebble-db, an Log-Structured-Merge (LSM) database and provides option to aggregate over multiple intervals.
The currently linked PoC uses only the interval as a key in the database which would require us to load the full aggregated interval in-memory, however, we could provide granular keys or even simple hash-based partitioning on resource IDs to reduce memory requirements even further.
Describe alternatives you've considered
No response
Additional context
I would love to hear what the maintainers of the interval-processor think about the proposal. If the maintainers are on-board the idea, I can clean-up the PoC, get it to a feature parity with the current interval processor, and create a PR.
The text was updated successfully, but these errors were encountered: