Eliminate multi data point TimeSeries #54

Closed
tigrannajaryan opened this issue Nov 14, 2019 · 15 comments · Fixed by #77

tigrannajaryan (Member) commented Nov 14, 2019

The Metric proto currently allows representing multiple data points for each metric, i.e. a timeseries with several values of the same metric, one value per timestamp.

However, reading the metric SDK proposal tells me that each collection will normally emit only a single data point per metric: the last aggregate state of the metric. There may be custom exporters that accumulate data points and emit them all in one batch, but that seems to be an exceptional case rather than the norm. Typically an exporter will simply send the last aggregate value.

Another use case where we could possibly have multiple data points per metric is in the Collector, where we could in theory batch points from the same metric into one timeseries. However, in practice I do not think we will do it, since it is a very costly operation that requires maintaining maps of metrics.

Given that a single data point per metric is the most common case, I suggest that we optimize the proto for that case. The current proto definition is the following (only int64 is shown for simplicity):

message Metric {
  MetricDescriptor metric_descriptor = 1;
  Resource resource = 2;
  repeated Int64TimeSeries int64_timeseries = 3;
}
message Int64TimeSeries {
  repeated string label_values = 1;
  repeated Int64Value points = 2;
}
message Int64Value {
  fixed64 start_time_unixnano = 1;
  fixed64 timestamp_unixnano = 2;
  sfixed64 value = 3;
}

I propose to change this to the following:

message Metric {
  MetricDescriptor metric_descriptor = 1;
  Resource resource = 2;
  repeated Int64Point int64_points = 3;
}
message Int64Point {
  repeated string label_values = 1;
  fixed64 start_time_unixnano = 2;
  fixed64 timestamp_unixnano = 3;
  sfixed64 value = 4;
}

Preliminary benchmarking shows that, for the case of one int64 data point per metric, this change reduces CPU and RAM usage by 25% for encoding and decoding.

Note: this schema still allows representing multiple data points per metric if necessary, albeit at the cost of repeating the label values.
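
For illustration, two points of the same metric with the same label set would then be encoded roughly like this (proto text format of the proposed schema; the values are made up):

int64_points {
  label_values: "frontend"
  start_time_unixnano: 1573776000000000000
  timestamp_unixnano: 1573776060000000000
  value: 42
}
int64_points {
  label_values: "frontend"  # label values repeated for every additional point
  start_time_unixnano: 1573776000000000000
  timestamp_unixnano: 1573776120000000000
  value: 57
}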

tigrannajaryan self-assigned this Nov 14, 2019
tigrannajaryan (Member, Author) commented:

@jmacd @bogdandrutu I need your opinion please.

bogdandrutu (Member) commented:

> Another use case where we could possibly have multiple data points per metric is in the Collector, where we could in theory batch points from the same metric into one timeseries. However, in practice I do not think we will do it, since it is a very costly operation that requires maintaining maps of metrics.

We are working on a processor that does this, in order to reduce cardinality for the labels (some users need this).

I would suggest that we simply drop the repeated for the Points initially and keep only one Point. This way we still keep the chance to add support for multiple points in a backwards-compatible way in the future.

tigrannajaryan (Member, Author) commented:

OK. So you are suggesting to do this:

message Int64TimeSeries {
  repeated string label_values = 1;
  Int64Value point = 2;
}

bogdandrutu (Member) commented:

I think that keeps the possibility of supporting multiple points in the future if needed, while still removing one allocation.

tigrannajaryan (Member, Author) commented:

It is a smaller CPU saving (2% for encoding, 10% for decoding), but we can do it.

jmacd (Contributor) commented Nov 15, 2019

I might suggest that we take both options. Use the format Tigran suggests for clients to report a single data point with a relatively short period, where the timestamp is implied by the enclosing message. This will be fast and cheap for the clients to use.

Then, support a second format for multi-valued timeseries with timestamped values. This will be useful in situations where longer periods are being aggregated. This will enable flexibility. I'm not sure how this second format will be applied.

tigrannajaryan (Member, Author) commented:

@jmacd I understand what you are saying, but I am a bit reluctant to add a second, parallel representation of the data. That complicates the schema and the implementations (especially receivers, which need to be able to process both formats).

jmacd (Contributor) commented Nov 16, 2019

@akehlenbeck I'd like your input.

akehlenbeck commented:

I've never seen much demand at all for representing multiple points from the same time series in one write request. The main use I'm aware of actually has nothing to do with real-time collection, but rather with backfilling. Eventually, someone will want to rename or otherwise transform existing historical data. The natural way to do that is with some kind of extract-transform-load job, and in the "load" step you end up wanting to write a lot of points from a single time series in one big batch.

But: I would also imagine that kind of operation might be performed directly against a SaaS-specific API, rather than going through the same collectors that are used for real-time collection. So it seems somewhat out of scope for this discussion.

My inclination would be to go with the format suggested by Tigran above. I don't see a compelling benefit from the intermediate approach of keeping the Int64Value indirection. Tigran's proposed format does formally support multiple points per time series, just at some extra cost (repeating the label values), but I think that's exactly the right tradeoff: normal clients should get the efficient and simple behavior, and clients doing the weird thing should pay the cost.

If for some reason someone comes along in the future with a compelling reason for needing multiple points per time series and very high performance, then we can revisit this question with a separate API. (But I would be surprised if such a use case ever arose.) I suspect in that situation you'll want an entirely different representation anyway, for example:

message Int64TimeSeries {
  repeated string label_values = 1;
  // Three parallel arrays:
  repeated fixed64 start_time_unixnanos = 2;
  repeated fixed64 timestamp_unixnanos = 3;
  repeated sfixed64 values = 4;
}

So basically: do what Josh said about having a second format, but don't bother doing it until it's actually needed (if ever).

jmacd (Contributor) commented Nov 18, 2019

Great. I support Tigran's proposal.

tigrannajaryan (Member, Author) commented:

I put a bit more thought into this, and given that Bogdan thinks we may need to support multiple data points (and batching) in the future, I do not want to pursue this anymore. It would be difficult (or impossible) to add support for multiple data points in a way that does not break existing code that deals with data points in the Collector. Yes, at the wire level repeated fields are compatible with singular fields, but the generated code is not compatible, and changing it would be a breaking change for the Collector.
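
To illustrate (a sketch, not part of any proposal): the two declarations below use the same field tag, so a payload written with one can still be decoded with the other, but the generated code differs (for example, protoc-gen-go emits a pointer field for the singular form and a slice for the repeated form), so switching between them breaks compiling code.

message Int64TimeSeries {
  repeated string label_values = 1;

  // Today: a single point; generated Go code exposes something like `Point *Int64Value`.
  Int64Value point = 2;

  // A later change to `repeated Int64Value points = 2;` reuses the same tag and is
  // wire-compatible, but the generated field becomes a slice (`Points []*Int64Value`),
  // so every caller in the Collector would have to change.
}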

akehlenbeck commented:

What's the motivation for having multiple points from the same time series in one batch? I didn't really understand Bogdan's comment from higher up:

> We are working on a processor that does this, in order to reduce cardinality for the labels (some users need this).

How does cardinality reduction translate into multiple points?

tigrannajaryan (Member, Author) commented:

> How does cardinality reduction translate into multiple points?

I don't know exactly what Bogdan's use case is, but purely technically it is possible to eliminate one (or more) of the labels that you don't care about and combine all data points for all values of that label into a single array (a Timeseries message). Obviously this loses data; I am not sure it is ever really desirable.
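
For example (using the field names of the current schema above; the values are made up), dropping the only label, a host label, would merge three per-host points into a single timeseries with three points:

# Before: one timeseries per host
int64_timeseries { label_values: "host-a"  points { timestamp_unixnano: 100  value: 5 } }
int64_timeseries { label_values: "host-b"  points { timestamp_unixnano: 101  value: 7 } }
int64_timeseries { label_values: "host-c"  points { timestamp_unixnano: 102  value: 9 } }

# After dropping the host label: one timeseries, three points, host information lost
int64_timeseries {
  points { timestamp_unixnano: 100  value: 5 }
  points { timestamp_unixnano: 101  value: 7 }
  points { timestamp_unixnano: 102  value: 9 }
}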

jmacd (Contributor) commented Nov 25, 2019

I believe the use case being imagined is one where we combine, say, a number of gauge values taken across hosts. In the aggregation, the host label is removed, leaving us with many gauge values, still with independent timestamps, that should be preserved as multiple points.

I still endorse a plan of supporting both options. There will be a relatively simple adapter in code to visit the single-value timeseries as a length-1 multi-value timeseries.

tigrannajaryan (Member, Author) commented:

I have just chatted with @bogdandrutu and we came to the conclusion that there is no good use case for having multiple data points. All the ways of combining data points that we could imagine are either very expensive or make no semantic sense.

I am going to submit a PR with a single Int64Value point. The intent is to keep Int64Value (and other value types) as a separate message for usability.

If in the future we discover a use case for multiple data points, we have two possible paths to extend the proto:

  1. Add a corresponding repeated points field to Int64TimeSeries.
  2. Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum.

There are probably other ways as well.
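
As an illustration of option 2 (a hypothetical sketch; the message name and enum value are made up, not part of any agreed design):

// A separate message type for the multi-point case, selected via a new
// MetricDescriptor.Type value such as INT64_TIMESERIES.
message Int64TimeSeriesList {
  repeated string label_values = 1;
  repeated Int64Value points = 2;
}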

bogdandrutu pushed a commit that referenced this issue Dec 10, 2019
Resolves #54

After a discussion we came to the conclusion that there is no good use case
for having multiple data points. All possible ways to combine data points
that we imagined are either very expensive or make no semantic sense.

This commit eliminates timeseries that can have multiple data points and replaces
them with single data points.

If in the future we discover a use case for multiple data points we have 2
possible paths to extend the proto:

- Add a corresponding points repeated field in the Int64DataPoint.
- Introduce a different message type that is capable of storing multiple data
  points and assign it a different type in the MetricDescriptor.Type enum.

There are probably other ways as well.