Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long delay between EA received input and loaded #14337

Open
carsonip opened this issue Oct 10, 2024 · 5 comments
Open

Long delay between EA received input and loaded #14337

carsonip opened this issue Oct 10, 2024 · 5 comments
Labels

Comments

@carsonip
Copy link
Member

carsonip commented Oct 10, 2024

APM Server version (apm-server version): 8.14.3

Description of the problem including expected versus actual behavior:

In EA managed apm-server, there are observations that there can be delay between log "received input from elastic-agent" and "loaded input config" in the order of days. It implies 2 servers are actually running in the apm-server process during this long period when the old server is stopping. The fact that 2 TBS gc goroutines may cause #14305 , actually makes #14305 also an evidence that 2 servers are running at the same time. Additionally, as only 1 reload can happen at a time, a long reload will stall other input updates.

Edit: as explained below, there are a few parts to this problem:

  • Check and fix processor code that are not concurrent-safe (e.g. monitoring metrics reporting in go-docappender), which would be unsafe when EA hot reloads
  • Check and fix code that manipulate global variables, which would be unsafe when EA hot reloads

Provide logs (if relevant):

"Oct 7, 2024 @ 19:46:30.592"	"loaded input config"
"Oct 7, 2024 @ 19:46:30.577"	"tail sampler aborted"
"Oct 7, 2024 @ 19:46:30.576"	"LSM aggregator stopped"
"Oct 7, 2024 @ 19:46:30.576"	"stopping aggregator"
"Oct 7, 2024 @ 19:46:30.576"	"Server stopped"
"Sep 30, 2024 @ 23:41:27.228"	"Path /intake/v3/rum/events added to request handler"
"Sep 30, 2024 @ 23:41:27.228"	"Path /config/v1/rum/agents added to request handler"
"Sep 30, 2024 @ 23:41:27.228"	"Path /intake/v2/rum/events added to request handler"
"Sep 30, 2024 @ 23:41:27.228"	"RUM endpoints enabled!"
"Sep 30, 2024 @ 23:41:27.228"	"Starting apm-server [ceaf482859c86da0ba6de99005ab2fffae7551c6 built 2024-07-08 17:22:03 +0000 UTC]. Hit CTRL-C to stop it."
"Sep 30, 2024 @ 23:41:27.224"	"stopping apm-server... waiting maximum of 30s for queues to drain"
"Sep 30, 2024 @ 23:41:27.224"	"Listening on: [::]:8200"
"Sep 30, 2024 @ 23:41:27.224"	"Stop listening on: 0.0.0.0:8200"
"Sep 30, 2024 @ 23:41:27.222"	"received input from elastic-agent"
@carsonip carsonip added the bug label Oct 10, 2024
@carsonip carsonip self-assigned this Oct 10, 2024
@carsonip
Copy link
Member Author

carsonip commented Oct 10, 2024

I'm fairly convinced that it is due to slow httpServer and grpcServer shutdown, as I managed to reproduce something similar with an artificial delay:

diff --git a/internal/beatcmd/beat.go b/internal/beatcmd/beat.go
index 8a62e1c89..6d19f67fb 100644
--- a/internal/beatcmd/beat.go
+++ b/internal/beatcmd/beat.go
@@ -365,7 +365,7 @@ func (b *Beat) Run(ctx context.Context) error {
 		return err
 	}
 
-	if b.Manager.Enabled() {
+	if b.Manager.Enabled() || true {
 		reloader, err := NewReloader(b.Info, b.newRunner)
 		if err != nil {
 			return err
@@ -377,6 +377,37 @@ func (b *Beat) Run(ctx context.Context) error {
 			return fmt.Errorf("failed to start manager: %w", err)
 		}
 		defer b.Manager.Stop()
+
+		g.Go(func() error {
+			for {
+				in := config.MustNewConfigFrom(map[string]interface{}{
+					"apm-server": map[string]interface{}{
+						"rum.enabled": true,
+						"host":        "0.0.0.0:8200",
+						"sampling.tail": map[string]interface{}{
+							"enabled": true,
+							"policies": []map[string]interface{}{
+								{"sampling_rate": 0.1},
+							},
+							"storage_gc_interval": "2s",
+						},
+					},
+				})
+				out := config.MustNewConfigFrom(map[string]interface{}{
+					"elasticsearch": map[string]interface{}{
+						"host":     []string{"localhost:9200"},
+						"username": "admin",
+						"password": "changeme",
+					},
+				})
+				if err := reloader.reload(in, out, nil); err != nil {
+					logp.Err("reload error")
+				}
+				time.Sleep(time.Minute)
+			}
+
+			return nil
+		})
 	} else {
 		if !b.Config.Output.IsSet() {
 			return errors.New("no output defined, please define one under the output section")
diff --git a/internal/beater/server.go b/internal/beater/server.go
index 397e21e50..8bfec6c05 100644
--- a/internal/beater/server.go
+++ b/internal/beater/server.go
@@ -21,6 +21,7 @@ import (
 	"context"
 	"net"
 	"net/http"
+	"time"
 
 	"go.elastic.co/apm/module/apmgorilla/v2"
 	"go.elastic.co/apm/v2"
@@ -227,6 +228,7 @@ func (s server) run(ctx context.Context) error {
 		// See https://github.com/elastic/gmux/issues/13
 		s.httpServer.stop()
 		s.grpcServer.GracefulStop()
+		time.Sleep(5 * time.Minute)
 		return nil
 	})
 	if err := g.Wait(); err != http.ErrServerClosed {
diff --git a/x-pack/apm-server/sampling/processor.go b/x-pack/apm-server/sampling/processor.go
index 82dc2df59..48546a542 100644
--- a/x-pack/apm-server/sampling/processor.go
+++ b/x-pack/apm-server/sampling/processor.go
@@ -394,8 +394,10 @@ func (p *Processor) Run() error {
 		for {
 			select {
 			case <-p.stopping:
+				p.logger.Error("gc stopping")
 				return nil
 			case <-ticker.C:
+				p.logger.Error("gc tick")
 				const discardRatio = 0.5
 				var err error
 				for err == nil {

With this change, initially "gc tick" will be logged every 2 seconds, but after the first reload, "gc tick" frequency doubles, indicate that 2 gc routines are running. This will cause #14305 when gc is called when another gc is running.

The reason why this delay causes 2 apm-server processors (e.g. TBS) to run concurrently is that, when reload is triggered from EA "received input", the server's context is canceled. The server is actually a wrapper over the actual gmux server and processors, let's call it wrappedServer. The shutdown sequence will be that gmux server first shuts down, then only after gmux server shuts down, the processors .Stop() will be called. But in the reloader, a new wrappedServer is already running while the old wrappedServer is stopping (yes, it stops listening, but processors are still running). This is a plausible explanation of the observed logs.

TLDR: in a hot reload, there is a period of time where an old server and new server run concurrently. We need to limit this reload time, as well as to ensure the processors are fine with running concurrently (e.g. to have 2 TBS processor running at the same time during hot reload).

@carsonip
Copy link
Member Author

carsonip commented Oct 10, 2024

To fix this bug, it requires a few changes

  1. httpServer and grpcServer need to respect shutdown timeout. It cannot just wait indefinitely for connections to terminate.
  2. But even with that and e.g. a shutdown timeout of 10s, in the current design, it is still possible for 2 wrapped servers to run in parallel during this 10s. This is fine for the most part, e.g. aggregation processor (since badger run in memory), bulk indexer, but will be fatal for TBS processor, and possibly some libbeat metric registry code (e.g. in newFinalBatchProcessor which may or may not be broken now, as well as TBS metrics). TBS indexing might actually be ok, but GC definitely cannot run concurrently.
  3. Alternatively, we can make the new wrappedServer wait for the old wrappedServer to shutdown completely, but that may imply some downtime, especially when there's a shutdown timeout > 0. But this is the safest option.

@carsonip
Copy link
Member Author

#14339 is merged, but keeping this issue open, as we want to double check if all the processors are fine with concurrent runs.

@simitt
Copy link
Contributor

simitt commented Nov 11, 2024

@carsonip please close this task if everything works as expected or update with follow up steps.

@carsonip
Copy link
Member Author

carsonip commented Jan 7, 2025

Sorry for the late reply. I believe we should still quickly audit the code to check for similar issues across the codebase. These will be code that manipulate global variables without synchronization, and code that should not be run concurrently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants