Eliminate multi data point TimeSeries #54

Closed
tigrannajaryan opened this issue Nov 14, 2019 · 15 comments · Fixed by #77

tigrannajaryan (Member) commented Nov 14, 2019

The Metric proto currently allows representing multiple data points for each metric, i.e. a timeseries with several values of the same metric, one value per timestamp.

However, reading the metric SDK proposal tells me that each collection will normally emit only a single data point per metric: the last aggregate state of the metric. There may be custom exporters that accumulate data points and emit them all in one batch, but that seems to be an exceptional case rather than the norm. Typically an exporter will simply send the last aggregate value.

Another use case where we could possibly have multiple data points per metric is in the Collector, where we could in theory batch points from the same metric into one timeseries. However, in practice I do not think we will do it, since it is a very costly operation that requires maintaining maps of metrics.

Given that a single data point per metric is the most common case, I suggest that we optimize the proto for that case. The current proto definition is the following (only int64 is shown for simplicity):

message Metric {
  MetricDescriptor metric_descriptor = 1;
  Resource resource = 2;
  repeated Int64TimeSeries int64_timeseries = 3;
}
message Int64TimeSeries {
  repeated string label_values = 1;
  repeated Int64Value points = 2;
}
message Int64Value {
  fixed64 start_time_unixnano = 1;
  fixed64 timestamp_unixnano = 2;
  sfixed64 value = 3;
}

I propose to change this to the following:

message Metric {
  MetricDescriptor metric_descriptor = 1;
  Resource resource = 2;
  repeated Int64Point int64_points = 3;
}
message Int64Point {
  repeated string label_values = 1;
  fixed64 start_time_unixnano = 2;
  fixed64 timestamp_unixnano = 3;
  sfixed64 value = 4;
}

Preliminary benchmarking shows that, for the case of one int64 data point per metric, this change reduces CPU and RAM usage by 25% for encoding and decoding.

Note: this schema still allows representing multiple data points per metric if necessary, albeit at the cost of repeating the label values.
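
For illustration, two points of the same metric with the same label set would then be encoded roughly like this (proto text format of the proposed schema; the values are made up):

int64_points {
  label_values: "frontend"
  start_time_unixnano: 1573776000000000000
  timestamp_unixnano: 1573776060000000000
  value: 42
}
int64_points {
  label_values: "frontend"  # label values repeated for every additional point
  start_time_unixnano: 1573776000000000000
  timestamp_unixnano: 1573776120000000000
  value: 57
}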

tigrannajaryan self-assigned this Nov 14, 2019
tigrannajaryan (Member, Author) commented:

@jmacd @bogdandrutu I need your opinion please.

bogdandrutu (Member) commented:

> Another use case where we could possibly have multiple data points per metric is in the Collector, where we could in theory batch points from the same metric into one timeseries. However, in practice I do not think we will do it, since it is a very costly operation that requires maintaining maps of metrics.

We are working on a processor that does this, in order to reduce cardinality for the labels (some users need this).

I would suggest that we simply drop the repeated for the Points initially and keep only one Point. This way we still keep the chance to add support for multiple points in a backwards-compatible way in the future.

tigrannajaryan (Member, Author) commented:

OK. So you are suggesting to do this:

message Int64TimeSeries {
  repeated string label_values = 1;
  Int64Value point = 2;
}

bogdandrutu (Member) commented:

I think that keeps the possibility of supporting multiple points in the future if needed, while still removing one allocation.

tigrannajaryan (Member, Author) commented:

It is a smaller CPU saving (2% for encoding, 10% for decoding), but we can do it.

jmacd (Contributor) commented Nov 15, 2019

I might suggest that we take both options. Use the format Tigran suggests for clients to report a single data point with a relatively short period, where the timestamp is implied by the enclosing message. This will be fast and cheap for the clients to use.

Then, support a second format for multi-valued timeseries with timestamped values. This will be useful in situations where longer periods are being aggregated. This will enable flexibility. I'm not sure how this second format will be applied.

tigrannajaryan (Member, Author) commented:

@jmacd I understand what you are saying, but I am a bit reluctant to add a second, parallel representation of the data. That complicates the schema and the implementations (especially receivers, which need to be able to process both formats).

jmacd (Contributor) commented Nov 16, 2019

@akehlenbeck I'd like your input.

akehlenbeck commented:

I've never seen much demand at all for representing multiple points from the same time series in one write request. The main use I'm aware of actually has nothing to do with real-time collection, but rather with backfilling. Eventually, someone will want to rename or otherwise transform existing historical data. The natural way to do that is with some kind of extract-transform-load job, and in the "load" step you end up wanting to write a lot of points from a single time series in one big batch.

But: I would also imagine that kind of operation might be performed directly against a SaaS-specific API, rather than going through the same collectors that are used for real-time collection. So it seems somewhat out of scope for this discussion.

My inclination would be to go with the format suggested by Tigran above. I don't see a compelling benefit from the intermediate approach of keeping the Int64Value indirection. Tigran's proposed format does formally support multiple points per time series, just at some extra cost (repeating the label values), but I think that's exactly the right tradeoff: normal clients should get the efficient and simple behavior, and clients doing the weird thing should pay the cost.

If for some reason someone comes along in the future with a compelling reason for needing multiple points per time series and very high performance, then we can revisit this question with a separate API. (But I would be surprised if such a use case ever arose.) I suspect in that situation you'll want an entirely different representation anyway, for example:

message Int64TimeSeries {
  repeated string label_values = 1;
  // Three parallel arrays:
  repeated fixed64 start_time_unixnanos = 2;
  repeated fixed64 timestamp_unixnanos = 3;
  repeated sfixed64 values = 4;
}

So basically: do what Josh said about having a second format, but don't bother doing it until it's actually needed (if ever).

jmacd (Contributor) commented Nov 18, 2019

Great. I support Tigran's proposal.

tigrannajaryan (Member, Author) commented:

I put a bit more thought into this, and given that Bogdan thinks we may need to support multiple data points (and batching) in the future, I do not want to pursue this anymore. It would be difficult (or impossible) to add support for multiple data points in a way that does not break existing code that deals with data points in the Collector. Yes, at the wire level repeated fields are compatible with singular fields, but the generated code is not compatible, and changing it would be a breaking change for the Collector.
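
To illustrate (a sketch, not part of any proposal): the two declarations below use the same field tag, so a payload written with one can still be decoded with the other, but the generated code differs (for example, protoc-gen-go emits a pointer field for the singular form and a slice for the repeated form), so switching between them breaks compiling code.

message Int64TimeSeries {
  repeated string label_values = 1;

  // Today: a single point; generated Go code exposes something like `Point *Int64Value`.
  Int64Value point = 2;

  // A later change to `repeated Int64Value points = 2;` reuses the same tag and is
  // wire-compatible, but the generated field becomes a slice (`Points []*Int64Value`),
  // so every caller in the Collector would have to change.
}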

akehlenbeck commented:

What's the motivation for having multiple points from the same time series in one batch? I didn't really understand Bogdan's comment from higher up:

> We are working on a processor that does this, in order to reduce cardinality for the labels (some users need this).

How does cardinality reduction translate into multiple points?

tigrannajaryan (Member, Author) commented:

> How does cardinality reduction translate into multiple points?

I don't know exactly what Bogdan's use case is, but purely technically it is possible to eliminate one (or more) of the labels that you don't care about and combine all data points for all values of that label into a single array (a Timeseries message). Obviously this loses data; I am not sure it is ever really desirable.
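
For example (using the field names of the current schema above; the values are made up), dropping the only label, a host label, would merge three per-host points into a single timeseries with three points:

# Before: one timeseries per host
int64_timeseries { label_values: "host-a"  points { timestamp_unixnano: 100  value: 5 } }
int64_timeseries { label_values: "host-b"  points { timestamp_unixnano: 101  value: 7 } }
int64_timeseries { label_values: "host-c"  points { timestamp_unixnano: 102  value: 9 } }

# After dropping the host label: one timeseries, three points, host information lost
int64_timeseries {
  points { timestamp_unixnano: 100  value: 5 }
  points { timestamp_unixnano: 101  value: 7 }
  points { timestamp_unixnano: 102  value: 9 }
}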

jmacd (Contributor) commented Nov 25, 2019

I believe the use case being imagined is one where we combine, say, a number of gauge values taken across hosts. In the aggregation, the host label is removed, leaving us with many gauge values, still with independent timestamps, that should be preserved as multiple points.

I still endorse a plan of supporting both options. There will be a relatively simple adapter in code to visit the single-value timeseries as a length-1 multi-value timeseries.

tigrannajaryan (Member, Author) commented:

I have just chatted with @bogdandrutu and we came to the conclusion that there is no good use case for having multiple data points. All the ways of combining data points that we could imagine are either very expensive or make no semantic sense.

I am going to submit a PR with a single Int64Value point. The intent is to keep Int64Value (and other value types) as a separate message for usability.

If in the future we discover a use case for multiple data points, we have two possible paths to extend the proto:

  1. Add a corresponding repeated points field to Int64TimeSeries.
  2. Introduce a different message type that is capable of storing multiple data points and assign it a different type in the MetricDescriptor.Type enum.

There are probably other ways as well.
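
As an illustration of option 2 (a hypothetical sketch; the message name and enum value are made up, not part of any agreed design):

// A separate message type for the multi-point case, selected via a new
// MetricDescriptor.Type value such as INT64_TIMESERIES.
message Int64TimeSeriesList {
  repeated string label_values = 1;
  repeated Int64Value points = 2;
}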

bogdandrutu pushed a commit that referenced this issue Dec 10, 2019
Resolves #54

After a discussion we came to the conclusion that there is no good use case
for having multiple data points. All possible ways to combine data points
that we imagined are either very expensive or make no semantic sense.

This commit eliminates timeseries that can have multiple data points and replaces
them with single data points.

If in the future we discover a use case for multiple data points we have 2
possible paths to extend the proto:

- Add a corresponding points repeated field in the Int64DataPoint.
- Introduce a different message type that is capable of storing multiple data
  points and assign it a different type in the MetricDescriptor.Type enum.

There are probably other ways as well.