---
layout: default
title: Observability coverage
---
<style>
.list-unstyled {
padding-left: 10px;
list-style-type: square;
}
</style>
<div class="row">
<h1>Observability Coverage</h1>
<small>v2.2-2022-07-14</small>
</div>
<div>
<div>By following this coverage strategy we can increase the observability of each layer of our systems and make them easier to manage.</div>
</div>
<div class="row">
<table class="table table-hover table-bordered table-striped">
<caption>Coverage Strategy</caption>
<thead class="thead-dark">
<tr>
<th> </th>
<th scope="col">Metal</th>
<th scope="col">Server Side Code (APM)</th>
<th scope="col">API</th>
<th scope="col">Website</th>
<th scope="col">Client Side Code (APM)</th>
<th scope="col">Security</th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row">Covers</th>
<td>
Is the cloud infrastructure healthy, performing and efficient?
</td>
<td>
Application Performance Monitoring (APM): server-side code instrumentation for performance and errors, plus transaction tracing.
</td>
<td>
APIs and Blackbox. Are our underlying APIs up, performing well (globally) and returning the right data?
</td>
<td>
Are web pages up, performing well (globally) and returning the right content?
</td>
<td>
APM: client-side code instrumentation for performance and errors.
</td>
<td>
Has our code or infrastructure been compromised or have vulnerabilities?
</td>
</tr>
<tr>
<th scope="row">Examples</th>
<td>
<ul class="list-unstyled">
<li>Databases</li>
<li>Disks</li>
<li>Compute</li>
<li>Lambda Functions</li>
<li>Networks (VPCs)</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>DB Queries</li>
<li>API Queries</li>
<li>3rd party API invocations</li>
<li>Custom instrumentation markers</li>
<li>Errors</li>
<li>Function invocation rate/frequency</li>
<li>Lambda Invocations</li>
<li>Distributed Tracing</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Uptime</li>
<li>(Global) Latency</li>
<li>Contract Testing</li>
<li>SLO Monitoring</li>
<li>3rd Party Contract/SLA Monitoring</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Uptime</li>
<li>Latency</li>
<li>Accuracy</li>
<li>Synthetic user monitoring</li>
<li>Real user monitoring</li>
<li>Transaction monitoring</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Web page errors</li>
<li>Web page speed</li>
<li>Mobile errors</li>
<li>Mobile speed</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Unauthorised Access</li>
<li>Intrusion Detection</li>
<li>Compromised "Supply Chain" (libraries)</li>
<li>DDoS</li>
<li>SIEM</li>
</ul>
</td>
</tr>
<tr>
<th scope="row">
<span data-toggle="tooltip" data-placement="right" title="These are the recommended and supported tools we have (or would like*)">
Example Tools
<div class="text-muted">Green -> yellow -> teal: Current implementation level</div>
</span>
</th>
<td>
<ul class="list-unstyled">
<li data-toggle="tooltip" data-placement="right" title="Basic featureset built into AWS" class="btn btn-success">AWS CloudWatch</li>
<li data-toggle="tooltip" data-placement="right" title="Toolset that monitors basic" class="btn btn-success">CloudHealth</li>
<li data-toggle="tooltip" data-placement="right" title="Very mature in this space" class="btn btn-success">NewRelic</li>
<li data-toggle="tooltip" data-placement="right" title="Very mature in this space" class="btn btn-success">DataDog</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li data-toggle="tooltip" data-placement="right" title="APM" class="btn btn-warning">NewRelic</li>
<li data-toggle="tooltip" data-placement="right" title="Kernel monitoring" class="btn btn-info">Sysdig</li>
<li data-toggle="tooltip" data-placement="right" title="APM" class="btn btn-warning">DataDog</li>
<li data-toggle="tooltip" data-placement="right" title="Zipkin and Jaeger to trace transactions" class="btn btn-warning">OpenTracing</li>
<li data-toggle="tooltip" data-placement="right" title="Horizontal stack OpenTracing and custom instrumentation markers" class="btn btn-info">Logz.io</li>
<li data-toggle="tooltip" data-placement="right" title="Horizontal stack OpenTracing" class="btn btn-info">AWS X-Ray</li>
<li data-toggle="tooltip" data-placement="right" title="Lambda monitor" class="btn btn-warning">Dashbird</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li data-toggle="tooltip" data-placement="right" title="Our own suite of tools" class="btn btn-warning">BlackBox</li>
<li data-toggle="tooltip" data-placement="right" title="Pingdom for APIs" class="btn btn-warning">Runscope</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li data-toggle="tooltip" data-placement="right" title="Check sites for uptime" class="btn btn-success">Pingdom</li>
<li data-toggle="tooltip" data-placement="right" title="More comprehensive than Pingdom" class="btn btn-info">Catchpoint</li>
<li data-toggle="tooltip" data-placement="right" title="Pingdom competitor" class="btn btn-warning">StatusCake</li>
<li data-toggle="tooltip" data-placement="right" title="Synthetics" class="btn btn-warning">NewRelic</li>
<li data-toggle="tooltip" data-placement="right" title="Transaction monitoring" class="btn btn-info">Logz.io</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li data-toggle="tooltip" data-placement="right" title="Medium maturity in this space" class="btn btn-warning">NewRelic</li>
<li data-toggle="tooltip" data-placement="right" title="Very mature in this space" class="btn btn-info">NewRelic</li>
<li data-toggle="tooltip" data-placement="right" title="Website error monitoring" class="btn btn-warning">Rollbar</li>
<li data-toggle="tooltip" data-placement="right" title="" class="btn btn-warning">Mobile - Firebase Monitoring</li>
<li data-toggle="tooltip" data-placement="right" title="" class="btn btn-warning">Mobile - Crittercism</li>
<li data-toggle="tooltip" data-placement="right" title="" class="btn btn-warning">Mobile - Crashlytics</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li data-toggle="tooltip" data-placement="right" title="WAF and DDoS protection" class="btn btn-success">Incapsula</li>
<li data-toggle="tooltip" data-placement="right" title="Software 'Supply Chain' validation" class="btn btn-info">Snyk</li>
<li data-toggle="tooltip" data-placement="right" title="Central place for security alerts" class="btn btn-warning">AWS Security Hub</li>
<li data-toggle="tooltip" data-placement="right" title="Intrusion Detection on AWS infrastructure" class="btn btn-warning">AWS GuardDuty</li>
<li data-toggle="tooltip" data-placement="right" title="PII detection" class="btn btn-warning">AWS Macie</li>
<li data-toggle="tooltip" data-placement="right" title="SIEM" class="btn btn-info">Logz.io</li>
<li data-toggle="tooltip" data-placement="right" title="Kernel monitoring" class="btn btn-info">Sysdig &amp; Falco</li>
</ul>
</td>
</tr>
<tr>
<th scope="row">
Responsible and Accountable roles/functions
<a href="https://en.wikipedia.org/wiki/Responsibility_assignment_matrix">(RACI)</a>
</th>
<td>
<ul class="list-unstyled">
<li>Devs</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Devs</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Devs</li>
<li>QA</li>
<li>Service Delivery</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>QA</li>
<li>Service Delivery</li>
<li>Product Owners</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Devs</li>
<li>QA</li>
<li>Service Delivery</li>
<li>Product Owners</li>
</ul>
</td>
<td>
<ul class="list-unstyled">
<li>Devs</li>
<li>QA</li>
<li>Service Delivery</li>
</ul>
</td>
</tr>
<tr>
<th scope="row">
Current overall maturity
</th>
<td class="bg-warning">
low to medium
</td>
<td class="bg-warning">
low to medium
</td>
<td class="bg-danger">
very low
</td>
<td class="bg-success">
medium to high
</td>
<td class="bg-warning">
low to medium
</td>
<td class="bg-danger">
very low
</td>
</tr>
<tr>
<th scope="row">
Maturity criteria
<div class="text-muted">What does good look like?</div>
</th>
<td>
<ol >
<li>Can you pick up infra issues ahead of time</li>
<li>Do you have detailed load stats on underlying infra</li>
<li>Do you have enough information to make good infra rightsizing decisions</li>
<li>Can you spot underlying infra issues</li>
<li>Can you easily visualise all your data</li>
</ol>
</td>
<td>
<ol>
<li>Can you pick up code issues ahead of time</li>
<li>Do you have detailed stats on load and app load profiles</li>
<li>Do you have custom StatsD-type metrics to show behaviours, e.g. <em>Total Articles served today</em></li>
<li>Do you have detailed stats on application behaviour under load</li>
<li>Do you have a comprehensive view on 3rd party integrations</li>
<li>Do you have a handle on how each deployment affects application performance</li>
<li>Can you detect runtime errors very quickly</li>
<li>Do you have detailed info that enables you to make the right optimisations</li>
<li>Can you easily visualise all your data</li>
</ol>
</td>
<td>
<ol>
<li>Do you have a comprehensive view on 3rd party integrations</li>
<li>Do you have a handle on how each deployment affects application performance</li>
<li>Can you detect API errors before they ripple too far up the stack</li>
<li>Can you detect schema changes/breaks early (contract monitoring)</li>
<li>Do you have detailed info on global API performance</li>
<li>Can you easily visualise all your data</li>
</ol>
</td>
<td>
<ol>
<li>Can you detect, monitor and audit website uptime</li>
<li>Do you have detailed global data on website performance</li>
<li>Can you ensure that website content is consistently accurate</li>
<li>Can you easily visualise all your data</li>
</ol>
</td>
<td>
<ol >
<li>Can you pick up code issues ahead of time</li>
<li>Do you have detailed stats on load and app load profiles</li>
<li>Do you have detailed stats on application behaviour under load</li>
<li>Do you have a comprehensive view on 3rd party integrations</li>
<li>Do you have a handle on how each deployment affects application performance</li>
<li>Can you detect runtime errors very quickly</li>
<li>Do you have detailed info that enables you to make the right optimisations</li>
<li>Do you have data on user behaviours</li>
<li>Do you have data on the platforms your users are using?</li>
<li>Can you easily visualise all your data</li>
</ol>
</td>
<td>
<ol >
<li>Can you pick up security issues ahead of time</li>
<li>Do you get regular alerts and remedies on new vulnerabilities</li>
<li>Do you get heuristic pickup of suspicious behaviour on your infra and apps</li>
<li>Do you have constant data on current threat/exposure level</li>
<li>Do you get best practice recommendations automatically</li>
<li>Can you easily visualise all your data</li>
</ol>
</td>
</tr>
</tbody>
</table>
</div>
<script>
$(()=> {
$('[data-toggle="tooltip"]').tooltip()
})
</script>
</div>