Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support to send loggregator envelopes type LOG and EVENT to OtelCollector as logs #595

Closed
wants to merge 10 commits into from

Conversation

jriguera
Copy link
Contributor

Description

This PR adds support to forward envelopes type LOG (and optionally EVENT) as "go.opentelemetry.io/proto/otlp/collector/logs/v1" to OtelCollector.

Summary of changes:

  • Envelope Log payload is defined as a string type in OTLP Message. Logs type Log_OUT will have severity INFO and logs type Log_ERR will become severity ERROR. Envelope tags will be OTLP LogRecord attributes.
  • Events can be forwarded as logs if a boolean flag defined via a new env var EMIT_EVENTS_AS_OTEL_LOGS is set to true (default is false).
  • Event data is transformed in OTLP Message to a Key-Value list with two items: title and event with the contents of these fields from the Envelope Event.
  • Bosh release has been updated to include the new environment variable for loggr-forwarder-agent and loggr-forwarder-agent-windows jobs.
  • Test for new functionality have been added .

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Testing performed?

  • Unit tests
  • Integration tests
  • Acceptance tests

Checklist:

  • This PR is being made against the main branch, or relevant version branch
  • I have made corresponding changes to the documentation
  • I have added testing for my changes

If you have any questions, or want to get attention for a PR or issue please reach out on the #logging-and-metrics channel in the cloudfoundry slack

@jriguera jriguera requested a review from a team as a code owner July 15, 2024 11:57
Copy link

linux-foundation-easycla bot commented Jul 15, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I like this change, but I have some some small things which I think have to be changed.

src/cmd/forwarder-agent/app/forwarder_agent_test.go Outdated Show resolved Hide resolved
src/cmd/forwarder-agent/app/forwarder_agent_test.go Outdated Show resolved Hide resolved
src/cmd/forwarder-agent/app/forwarder_agent_test.go Outdated Show resolved Hide resolved
src/pkg/otelcolclient/otelcolclient.go Show resolved Hide resolved
@jriguera
Copy link
Contributor Author

I will fix the rest of the issues tomorrow. Thanks for reviewing the PR!

@jriguera jriguera requested a review from chombium July 19, 2024 10:52
@chombium
Copy link
Contributor

Hi @jriguera. Thanks for your contribution and the quickly made changes. I've approved the checks and saw that the linting fails. Could you please take a look at that?

I would also like to see what the others think. @ctlong, @acrmp, @mkocher Could you please take a look a this PR?

Copy link
Contributor

@chombium chombium left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jriguera Please take a look at the PR checks and fix the linting error

@jriguera jriguera requested a review from chombium July 23, 2024 22:45
@mkocher mkocher dismissed chombium’s stale review August 9, 2024 20:25

this has been addressed, linter is green

Copy link
Member

@ctlong ctlong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jriguera can you please rebase this PR, it has merge conflicts at the moment

@mkocher
Copy link
Member

mkocher commented Aug 9, 2024

@ctlong and I tried deploying this today and came across a concern with this approach that we're not sure how to deal with - when the forwarder agent is attempting to send logs to an otel collector that does not have a logs pipeline configured it will generate an error log line for every log message that it tries to forward.

We thought about a few ways to handle this but don't have a clear preferred path forward:

  1. ask the grpc server what services it supports: otel-collector doesn't support reflection, so this doesn't seem possible
  2. add support in the ingress_port.yml file to specify which signals should be forwarded: this is painful as the otel-collector-release would have to parse the config to see what signals have configured pipelines
  3. add dummy pipelines in the otel-collector with a noop exporter: Ugly, and adds an extra load of sending everything to otel-collector even if nothing is interested in it

@ctlong says maybe we should do the loggregator receiver in otel-collector and avoid this problem entirely 😄

@chombium
Copy link
Contributor

chombium commented Aug 10, 2024

@ctlong says maybe we should do the loggregator receiver in otel-collector and avoid this problem entirely 😄

It's a sarcasm, right? :)

The Loggtegator receiver in its current state, is actually a Firehose nozzle which means:

  • we still have the noisy neighbor problem
  • possible log loss due to underscaled Loggregator
  • Cannot be used in Foundations where Loggregator is deployed with the Shared Nothing Architecture

IMO, The Loggregator Otel Receiver should not even exist. if we were more active and started earlier in the Otel community, we could have explained why this implementation is bad and stopped its further development.

@mkocher
Copy link
Member

mkocher commented Aug 10, 2024

Agreed that the current loggregator receiver is a flawed approach. What @ctlong is talking about is a loggregator receiver which can be configured to receive loggregator envelopes locally. Initially from the forwarder agent, and eventually directly from system components taking the place of the forwarder agent as otel-collector replaces the agent suite.

@jriguera
Copy link
Contributor Author

My 2 cents:

  1. To make everybody aware, there is already a component in Otel Collector contrib which implements the firehose v2 consumer protocol. We have added support for logs (and traces is in progress). Apart of that, all OTEL components have a well defined lifecycle (see stability), so they can be deprecated and removed as soon as there are alternatives or other implementations. There is no problem at all to have temporary components see for example logs-transform-processor. I understand the problems of this architecture (we are suffering them and I would like to have all telemetry in Otel as soon as it is generated by each component), but the reality is it's being used by other (firehose) implementations, and it just works for small foundations, and most important: it is documented. So, the first step is offering to the operators a suitable alternative.

  2. Anyway, I think there is no point on finding out what types OTLP server supports, it is a push model and (apart of the technical limitations) I am not aware of other implementations in this direction. The clients should emit the different telemetry types depending on its configuration (defined by the operators) and the server has to accept the OTL protocol, as simple as that. In my opinion, is perfectly valid to have default pipelines with NOP exporter for all types of telemetry or just ignore the errors in the logs. If this is still a concern, the best approach will be define flags to enable/disable pushing the different telemetry types and maybe using bosh links for automatic configuration (just a quick idea!).

  3. I have already commented in other threads that we do not use the official OTEL Collector boshrelease. It is quite limited in terms of configuration, it is not suitable for us; for example, we are collecting system metrics (similar to node_exporter) and other system logs from the OS, transforming them, filtering and sending to different endpoints. In my opinion, as the OTEL collector boshrelease is an add-on, it is responsibility of the operators of the platform having a proper configuration according to their needs, which include the definition of the pipeline for logs, metrics and traces (that's is also in line with the philosophy of the OTEL project, being vendor neutral). The boshrelease should enable operators to define as many pipelines as they need. I think is fine having the default ones automatic, but with an option to disable this automatic generation and enabling named pipelines (eg logs/system, metrics/system).

  4. I have noticed that CF community (platform level) is not participating in OpenTelemetry. Comparing it with other platforms, there are lack of schemas, definitions, attributes .... I think a lot of people is not aware that Otel is a reality, getting more and more traction, the flexibility it gives and the amount of vendors collaborating there (ofc, on their own interests).

@mkocher
Copy link
Member

mkocher commented Aug 13, 2024

  1. Yep, I understand why the cf receiver exists and the use case it fulfills. We're hoping it becomes unnecessary eventually, but that's a goal we're working towards. The confusion is that there's a naming collision when I said 'loggregator-receiver' I meant a component which listened locally for envelopes, not one which subscribed to the firehose.
  2. Agreed that feature flagging this is the easiest path forward. We've started work on that on a branch over here - https://github.com/cloudfoundry/loggregator-agent-release/tree/add-otel-logs-flag (along with a few minor changes)
  3. That's awesome how much you're doing with the otel-collector already. We've been gradually adding to what's possible with the cf otel collector bosh release, but we've certainly been taking it slower as we navigate what we want to support in Cloud Foundry. We recently exposed a lot more of the config, but we are hesitating to add more non-GA otel-collector components. Is your bosh release open source? I'd love to take a look, and I'd love to hear anything you've learned from running it in prod.
  4. I know @chombium has been thinking about this some. We're seeing more and more interest in the Otel features of CF, hopefully that allows us to spend more time on it.

@mkocher
Copy link
Member

mkocher commented Aug 13, 2024

One thought we had - if events are getting mapped to otel logs, should we just control events and logs with the one emit_otel_logs property? We've gone back and forth on if that would be better. It would map closer to the various otel signals.

I haven't gone looking, but I seem to recall that there are very few actual usages of loggregator events.

@chombium
Copy link
Contributor

Hi @jriguera,

Don't get me wrong and I'm sorry if I offended you somehow.

I'm really glad to see that there is some interest for Otel in CF and I really appreciate your initiative and the things that you've done so far. You've started with your own view of CF, found what's available at the moment and started using it, which is totally fine. I really like that there is interest in the CF community about Otel and as @mkocher said, we could use that to raise its importance and give higher priority to it, so that we can intensify the work on it.

At the moment, we've noticed two major challenges with the Otel Collector in general:

  • It seems that no one is running it somewhere outside of Kubernetes, and no one up until now has figure out how to manage it environments outside Kubernetes. We've explained where and how do we want to run it and all the suggestions were more or less find a workaround or use half baked PoC things like Supervisor for example. As we have to manage a fleet Otel Collectors the things are not that trivial and we have to figure out our selves how to do it in CF
  • We still haven't found a replacement for the Syslog Drains as the Otel Collector doesn't support partial restarts, where we can reconfigure only a single pipeline. Everyone that we've talked to up until now, told us that that's our problem, and we have to figure it out ourselves and throw a bunch of workarounds at as :-/ The best solution which we can think of at the moment to support similar functionality as the Syslog Drains is to build a exporter creator, similar to the receiver creator and reconfigure a single pipeline to use a new/exchange and existing exporter.

We are aware that the Otel Collector Release is not where it should be for prime time usage, but because of the above mentioned challenges, we have decided to start small and slowly expose more and more Otel Collector features in it. We have started with platform metrics metrics only to test things and because the CF operators are the one who are in charge of them. We already have few more PRs which open up more and more feature from the Otel Collector's configuration like Expose wider collector configuration and Config to allow subset of processors and exporters. We play a bit defensive, so that we don't break things.

In regards to Otel Semantic conventions I want to share two things:

  • We already have an opened PR to add semantic conventions for CF and maybe later we'll do something for BOSH as well. The PR is opened for a long time as we had to decide how we map the fields from the Loggregator API. Now it's waiting for a review

  • CF has a lot of platform components written in various programming languages and we cannot expect that we can do the switch for all at once. That's why we have to do that gradually. On the other side, there are many people who have build monitoring around the current metric names. Hence, we need some migration strategy.
    There are two strategies that I can think of at the monent:

    • Let OTEL and Loggregator Agent (Firehose) and Syslog Agent (Syslog Drains + Log Cache) run in parallel to the Otel Collector for a while
    • Consume the telemetry data in it's current format in a Otel Receiver and then have a
      • "Legacy Pipeline" with metric names unchanged and Syslog Exporter + "Loggregator Agent Exporter"
      • "Otel Pipeline" which will do the conversion of the names to semantic conventions on the fly with a Processor or something similar and use OTLP to export data

    As we have a system with "legacy" metric names which we want to migrate to Otel convention conform names, I've asked what to do and how to do the migration and they've suggested us to check the Weaver Project which helps defining application based schemas. I've been following the project for a while, but I haven't got deep enough to understand how it will help us... :-/

    I'm really glad that that someone else thinks about Otel in CF and it will be great if we team up and work together on the topic.

@jriguera
Copy link
Contributor Author

Hi @chombium ,

No ofended at all!! We are just discussing different ideas and implementations!. To be clear, even if this PR1 or the ones we (in our team) are doing in Otel (regarding cloudfoundryreceiver or the new component cfgardenobserver) are not going to production, or they are deprecated in N months, I will be happy If eventually they helped triggering discussions like this and drive to better implementations. I think is good if you see all my work like PoC or experimental (and to be honest, there is one part of trying to re-implement some things in different ways, which is similar to the famous meme on the internet which says something like: "hey!, are you coming; no, wait, somebody is wrong on the internet" ;-) ).

Regarding the other topics:

  1. We are using Otel collector in our deployments with this release: https://github.com/springernature/otelcollector-binary-boshrelease . This is not a proper way to build a release (it uses binary blobs) but we need it in order to overcome the limitations with the official release and also be able to use our own binary collectors with specific components. Regarding supervisor, I do not think is needed (at least with my current knowledge), I prefer to avoid clustering as much as possible and have a bunch of collectors. I will try to share the feedback of our implementations during the next months.

  2. Regarding syslog drains: we are working on the new component cfgardenobserver (we are in the process of submitting the functionality in the current PR). We are testing this configuration:

    extensions:
      cfgarden_observer:
        cache_sync_interval: 10s
        cloud_foundry:
          endpoint: https://api.cf
          auth_type: user_pass
          username: user
          password: pass
    
    receivers:
      receiver_creator:
        watch_observers: [cfgarden_observer]
        receivers:
          prometheus_simple:
            rule:  labels["telemetry.springernature.com/o11y-scrape"] == "true" && (labels["telemetry.springernature.com/o11y-port"] ?? string(port)) == string(port)
            config:
              metrics_path: '`"telemetry.springernature.com/o11y-path" in labels ? labels["telemetry.springernature.com/o11y-path"] : "/metrics"`'
    
    exporters:
      debug:
        verbosity: detailed
    
    service:
      pipelines:
        metrics:
          receivers: [receiver_creator]
          exporters: [debug]
      extensions: [cfgarden_observer]
      telemetry:
        metrics:
          level: none
    

    With the previous configuration, as soon as a user adds a label or annotation telemetry.springernature.com/o11y-scrape=true to an app, it is automatically scraped.

  3. Probably we will end up implementing another component similar to k8sattributesprocessor in order to be able to enrich metrics with some API metadata from the app_id (like labels and annotations). Our goal is also be completely multi-tenant: telemetry (logs, metrics) from a group of apps (or just a space, or org) will belong to the same tenant.

  4. I was not aware of all those links you shared to improve Otel in CF, good initiatives!! Yesterday, a colleague told me about new PRs in the official Otel Collector boshrelease to make the configuration more flexible for operators. Also, regarding telemetry attributes, in the future, I would like to propose a PR to move up some attributes at metric datapoints and log record levels to the resource level (I am talking about attributes regarding bosh job names, source types, .... which currently are copied directly from envelope tags to log record attributes and they belong to resource attributes).

1 Well, in the case of this PR, I really think the functionality is needed anyway. Ofc, the implementation can be different.
Btw, during August, I have a limited availability and it may take some days to me to reply.

@jriguera
Copy link
Contributor Author

Regarding the cfgarderobserver configuration I have shared, I forgot to mention one detail. Currently, it uses the unsecure port 8080, but nothing stops the operator to setup mTLS in prometheus_simple to use the envoy secured ports (we are going to test it today).

@mkocher
Copy link
Member

mkocher commented Aug 15, 2024

I included these changes in #603 and just merged them in. Thank you @jriguera for the contribution! Closing this PR, but lets continue the conversation above here or on Slack.

@mkocher mkocher closed this Aug 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants