For a primer on various issues that may appear with a Graphite-based system, please see this article: 25 Graphite, Grafana and statsd gotchas
You should monitor the dependencies according to their best practices. In particular, pay attention to delays in your kafka queue, if you use it, especially for metric persistence messages, which flow from primary to secondary nodes: if those are delayed, chunks may be saved multiple times when new primaries come online (or get promoted). (see clustering transport)
Metrictank reports metrics about itself. See the list of documented metrics.
You can import the Metrictank dashboard from Grafana.net into your Grafana. This will give instant insight into all the performance metrics of Metrictank.
- process is running and listening on its http port (and carbon port, if you enabled it). Use your monitoring agent of choice for this.
- `metrictank.stats.$environment.$instance.cluster.primary.gauge1`: assure you have exactly 1 primary node (saving to cassandra), or as many as you have shardgroups for sharded setups. (See the example query after this list.)
- `metrictank.stats.$environment.$instance.input.kafka-mdm.partition.*.lag.gauge64`: kafka lag. Depending on your throughput you can always expect some lag, but it should be in the thousands, not millions.
- `metrictank.stats.$environment.$instance.store.cassandra.write_queue.*.items.{min,max}.gauge32`: make sure the write queues are able to drain. For primary nodes that are also used for queries, assert the write queues don't reach capacity, otherwise ingest will block and data will lag behind in queries.
- `metrictank.stats.$environment.$instance.input.*.metricpoint.discarded.unknown.counter32`: counter of MetricPoint messages for an unknown metric; these will be dropped.
- `metrictank.stats.$environment.$instance.input.*.metrics_decode_err.counter32`: counter of incoming data that could not be decoded.
- `metrictank.stats.$environment.$instance.input.*.*.discarded.invalid*.counter32`: counter of incoming data that was discarded because it is invalid.
- `metrictank.stats.$environment.$instance.tank.discarded.*.counter32`: counter of points that can't be added to the tank, segregated by reason.
- `metrictank.stats.$environment.$instance.api.request.render.latency.*.gauge32`: shows how fast/slow metrictank responds to http queries.
- `metrictank.stats.$environment.$instance.api.request.render*.status.*.counter32`: counters per status code. Make sure most, or all, result in http-200's.
- `metrictank.stats.$environment.$instance.store.cassandra.error.*`: shows erroring queries. Queries that result in errors (or timeouts) will result in missing data in your charts.
- `perSecond(metrictank.stats.$environment.$instance.tank.discarded.received-too-late.counter32)`: points dropped due to chunks being closed. Tune the `chunk-max-stale` setting or fix your data stream to not send old points so late.
- `metrictank.stats.$environment.$instance.recovered_errors.*.*.*`: any internal errors that were recovered from automatically. This should be 0; if not, please create an issue.
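As a quick sanity check outside of Grafana, you can query a couple of these series directly through the Graphite render API. This is only a sketch: the host and the `prod`/`*` path components stand in for your own `$environment` and `$instance` values.

```
# spot check via a Graphite-compatible render API (host and path components are placeholders)

# exactly one series should report 1 here (or one per shardgroup in sharded setups)
curl -s "http://graphite.example.com/render?target=metrictank.stats.prod.*.cluster.primary.gauge1&from=-5min&format=json"

# kafka lag should stay in the thousands, not millions
curl -s "http://graphite.example.com/render?target=metrictank.stats.prod.*.input.kafka-mdm.partition.*.lag.gauge64&from=-5min&format=json"
```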
If you expect consistent or predictable load, you may also want to monitor:
- `metrictank.stats.$environment.$instance.store.cassandra.chunk_operations.save_ok.counter32`: number of saved chunks (based on your chunkspan settings).
- `metrictank.stats.$environment.$instance.api.request.render.values.rate32`: rate per second of render requests.
- `metrictank.stats.$environment.$instance.input.*.*.received.counter32`: input counter (derive with perSecond, as shown below).
- `metrictank.stats.$environment.$instance.cache.accounting.queue.size.used.max.gauge32`: accounting queue size. If this queue fills up, it will slow down requests (compare to size.max).
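For example, the received counter can be turned into a per-second ingest rate with a Graphite `perSecond()` call; as above, the host and path components are placeholders for your setup.

```
# ingest rate derived from the received counter (host and path components are placeholders)
curl -s "http://graphite.example.com/render?target=perSecond(metrictank.stats.prod.*.input.*.*.received.counter32)&from=-1h&format=json"
```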
Metrictank crashed. What to do?
- Check `dmesg` to see if it was killed by the kernel, maybe it was consuming too much RAM. If it was, check the grafana dashboard, which may explain why (sudden increase in ingested data? increase in requests or the amount of data requested? slow requests?). See the example after this list. Tips:
  - The profiletrigger functionality can automatically trigger a memory profile and save it to disk. This can be very helpful if suddenly memory usage spikes up and then metrictank gets killed in seconds or minutes. It helps diagnose problems in the codebase that may lead to memory savings. The profiletrigger can look at both RSS used and heap size.
  - Use rollups to be able to answer queries for long timeframes with less data.
- Check the metrictank log. If it exited due to a panic, you should probably open a ticket with the output of `metrictank --version`, the panic, and perhaps preceding log data. If it exited due to an error, it could be a problem in your infrastructure or a problem in the metrictank code (in the latter case, please open a ticket as described above).
- If the crashed instance was a secondary, you can just restart it and, after it has warmed up (or backfilled data from Kafka), it will be ready to serve requests. Verify that you have other instances that can serve requests; otherwise you may want to start it with a much shorter warm up time. It will be ready to serve requests sooner, but may have to reach out to Cassandra more to load data.
- If the crashed instance was a primary, you have to bring up a new primary. Based on when the primary was last able to save chunks, and how much data you keep in RAM (chunkspan * numchunks), you can calculate how quickly you need to promote an already running secondary to primary to avoid data loss. If you don't have a secondary that has been up long enough, pick whichever was up the longest.
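A couple of quick commands for the checks above. The journalctl line assumes you run metrictank as a systemd unit named `metrictank`; adapt it to however you run the process.

```
# did the kernel OOM-kill it?
dmesg -T | grep -iE 'out of memory|oom|killed process'

# gather version and recent logs for a ticket (assumes a systemd unit named "metrictank")
metrictank --version
journalctl -u metrictank --since "1 hour ago" | tail -n 200
```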
If you use the kafka-mdm input (at grafana we do), check your offset option before restarting. Most of our customers who run a single instance seem to prefer the `last` option: it gets realtime insights back immediately, at the cost of missing older data.
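A quick way to confirm which offset option is in effect, assuming the stock config location and the `kafka-mdm-in` section name (adjust both to your deployment):

```
# print the offset setting from the kafka-mdm-in section (path and section name are assumptions)
grep -A 20 '^\[kafka-mdm-in\]' /etc/metrictank/metrictank.ini | grep '^offset'
```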
If the metrictank process seems "stuck" (up and running, but not doing anything), you can report a bug. Please include the following information:
- stacktrace obtained with `curl 'http://<ip>:<port>/debug/pprof/goroutine?debug=2'`
- cpu profile obtained with `curl 'http://<ip>:<port>/debug/pprof/profile'`
- output of `metrictank --version`
You can also send SIGQUIT via Control-Backslash to the process, in which case it will print the stack dump and then exit. See behavior of signals in Go programs for more information.
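If the process is not attached to your terminal, the same SIGQUIT can be delivered with `kill`; this sketch assumes a single metrictank process on the host.

```
# send SIGQUIT: metrictank prints its stack dump and then exits
kill -QUIT "$(pidof metrictank)"
```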
If you're feeling lucky or are a developer, you can also get a trace like so: `curl 'http://<ip>:<port>/debug/pprof/trace?seconds=20' --output trace.data`
To promote a secondary to primary:
- stop the primary: `curl -X POST -d primary=false http://<node>:6060/node`
- make sure it's finished saving metricpersist messages (see its dashboard)
- promote the candidate to primary: `curl -X POST -d primary=true http://<node>:6060/node`
- verify the cluster status through `curl http://<node>:6060/node` or on the Grafana dashboard (see above)
For more information see Clustering: Promoting a secondary to primary
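Strung together, the steps look roughly like the sketch below. The node addresses are placeholders, and the pause is there for you to confirm on the dashboard that metricpersist messages have been processed before promoting.

```
# rough sketch of the failover steps above; OLD/NEW are placeholders for your nodes
OLD=old-primary:6060
NEW=candidate-secondary:6060

curl -X POST -d primary=false "http://$OLD/node"    # demote the current primary
read -p "check the dashboard: metricpersist messages saved? press enter to promote $NEW"
curl -X POST -d primary=true "http://$NEW/node"     # promote the candidate
curl "http://$NEW/node"                             # verify cluster status
```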
See HTTP api docs
If metrictank ingestion speed is lower than expected, or decreased for seemingly no reason, it may be due to:
- Indexing of metadata puts backpressure on the ingest stream. New metrics (including metrics with new settings such as interval, unit, or tags) need to get indexed into:
  - the in-memory index (which generally should not exert backpressure)
  - the Cassandra index, if enabled, which may not keep up with throughput, resulting in backpressure and a lowered ingestion rate. Check the index stats on the dashboard.
- Saving of chunks. Metrictank saves chunks at the rhythm of your chunkspan (10 minutes in the default docker image). When this happens, it needs to save a bunch of chunks, and depending on the configuration of your write queues and how many series you have, the queues may run full and exert ingestion backpressure, also lowering ingestion speed. Store (cassandra) saving backpressure is also visualized on the 'metrics in' graph of the dashboard. Additionally, the 'write workers & queues' graph shows the queue limit and how many items are in the queues (see the example query after this list). The queues are drained by saving chunks, but populated by new chunks that need to be saved. Backpressure is active when the queues are full (when the number of items equals the limit). It's possible for the queues to stay at the limit for a while, despite chunks being saved (when there are new chunks that also need to be saved), but you should probably tune your queue sizes in this case. See our Cassandra page. Of course, if metrictank is running near peak capacity, the added workload of saving data may also lower ingest speed.
- golang GC runs may cause ingest drops. Look at 'golang GC' in the Grafana dashboard and see if you can get the dashboard zoom right to look at individual GC runs, and see if they correspond to the ingest drops (a shared cursor is really handy here).
- doing http requests to metrictank can lower its ingestion performance (note that the dashboard in the docker stack loads from metrictank as well). Normally it takes hundreds of requests (or very large ones) before you start to see this effect, but it also becomes more apparent with large ingest rates, where metrictank gets closer to saturation.
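If you'd rather check the write queues outside of the dashboard, the underlying series can be queried directly; the host and the `prod`/`*` path components below are placeholders for your own setup.

```
# write queue fill level; sustained values at the configured limit mean backpressure
curl -s "http://graphite.example.com/render?target=metrictank.stats.prod.*.store.cassandra.write_queue.*.items.max.gauge32&from=-1h&format=json"
```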
If metrictank consumes too much memory and you want to know why, there are a few usual suspects:
- the ringbuffers, especially if you have high values for numchunks or chunkspan, and especially for rollup archives that use many aggregation functions.
- chunk-cache (see your max-size setting)
- metadata index, especially if you have many series.
To get insights into memory usage:
- use the grafana dashboard. If memory grows significantly at a given point, figure out what happened at that point (were new metrics ingested into the system?)
- use the profiletrigger: this automatically collects profiles when memory usage reaches a certain point (see the `proftrigger-*` settings)
- if you want to understand memory usage "right now", you can take a live profile. This is fairly easy and only requires that you have the `go` tool installed. It has a very low, usually insignificant resource impact, unless you have set a low, non-default value for `mem-profile-rate` in your config.
Taking a profile:
wget http://localhost:6060/debug/pprof/heap
go tool pprof <path-to-binary> heap
It is very important that you use the correct metrictank binary. It should not be a different version.
The `go tool pprof` command will give you a command prompt.
- type `top30` to get a good idea of where memory is spent. The first columns (`flat` and `flat%`) show memory used by just that code. The last columns (`cum` and `cum%`) show memory used by that code and everything called from it (which is why certain things like the input plugins use a lot of memory cumulatively: all data being added to the memory cache originates from that code path).
- type `web` to see a visual representation in the browser, which can show how code calls each other.
- type `list <regex matching type and function name>` to see a breakdown of the lines of code of a function, along with the memory allocated by each line (e.g. `list chunk.New`).

Type `exit` (or ctrl-D) to exit the prompt. For more information on profiling see the excellent Profiling Go Programs article.
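The same analysis can also be run non-interactively, which is handy for sharing results. These are standard pprof flags, and `metrictank` below must be the exact binary that served the profile.

```
# top allocations by in-use space, without entering the interactive prompt
go tool pprof -top -inuse_space metrictank heap | head -n 40

# render the call graph to an image (requires graphviz)
go tool pprof -png metrictank heap > heap.png
```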
- make sure you specify a correct interval. Sending minutely data with the interval specified as 10s will result in 5 nulls for each point, which in combination with certain grafana display settings (like "nulls connected") may not show anything. Also note that currently all series are normalized to the same resolution, so even when you fix intervals, if the query matches any older / lower-resolution data, the entire output will have a lower resolution (the lowest common denominator).
- make sure consumption from input works fine and is not lagging (see dashboard)
- check if any points are being rejected, using the ingest chart on the dashboard (e.g. out of order, invalid)
- you can use debug logging to trace data throughout the pipeline, mt-store-cat to see what's in cassandra, mt-kafka-mdm-sniff, etc.
- if it's old data, make sure you have a primary that can save data to cassandra, and that the write queue can drain
- if you have a sharded cluster, make sure each node can see all other nodes. Check the cluster metrics in the dashboard or check the /cluster endpoint of each node (see the example after this list).
- check the `metric-max-stale` and `chunk-max-stale` settings and make sure chunks are not being prematurely sealed (this happens in some rare cases if you send data very infrequently; see the `tank.discarded.received-too-late` metric)
- did you restart instances? If so, make sure your instances start replaying data within the allotted "overhead window". E.g. if your kafka retention is 7 hours and your largest chunks are 6 hours, then instances need to start replaying data within an hour after startup (so make sure processing of metricpersist messages, index loading, etc. doesn't take too long). Any subsequent restart (e.g. due to kafka removing a segment currently being consumed) starts the process from zero again, so watch out. Increase kafka retention as needed.
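For the cluster check above, the /cluster endpoint can be queried from each node directly; `<node>` is a placeholder and `jq` is only used here for readability.

```
# each node should list all of its peers
curl -s "http://<node>:6060/cluster" | jq .
```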
In the below example, we:
- use a max chunk span of 6 hours
- use 30 minutes as time needed to get all chunks saved to the store
- assume startup until data backfill takes about 1 hour (see startup procedure)
- assume 7.5 hours Kafka retention
As you can see, you need at least 7.5 hours of Kafka retention to support this case. (Note: for simplicity, we don't include the time needed to process the initial segment, which may be dropped by Kafka.)
```
                                                                | 05:00 | 06:00 | 07:00 | 08:00 | 09:00 | 10:00 | 11:00 | 12:00 | 13:00 |
                                                                |-----------------------------------------------------------------------|
Times of chunk saving                                                   xxxx                                            xxxx
MT crash (worst case: right before chunk saving completes)                                                                 ^
MT start up (loading index etc until start of replaying data)                                                               xxxxxxx^
Time range that has been persisted to the store                 ...-----|
Time range that MT can replay from Kafka                                |-----------------------------------------------------------|
```
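The arithmetic behind the example is simply the sum of the three durations listed above:

```
# minimum Kafka retention = largest chunkspan + time to save chunks + startup/backfill delay
echo "6 + 0.5 + 1" | bc    # = 7.5 hours
```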
Metrictank supports opentracing via Jaeger. It can give good insights into why certain requests are slow, and is easy to run. To use it, enable it in the config and point it at a Jaeger collector.
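If you don't have a Jaeger deployment yet, the official all-in-one image is an easy way to get a collector and UI running for testing; the port mappings below are the standard Jaeger agent (6831/udp) and UI (16686) ports. Then point metrictank's tracing settings at this instance.

```
# throwaway Jaeger instance for testing; UI at http://localhost:16686
docker run -d --name jaeger -p 6831:6831/udp -p 16686:16686 jaegertracing/all-in-one:latest
```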