Optimize costs #1149
related: #1152 (but I expect outbox is a small fraction of overall cost)
Looking at https://cloud.google.com/firestore/docs/audit-logging . I've turned on audit logs for the Firestore/Datastore API and Access Approval in https://console.cloud.google.com/iam-admin/audit?project=bridgy-federated .
Other leads: …
Got ~20h of datastore logs, digging into them in log analytics now. https://console.cloud.google.com/logs/analytics |
...maybe promising? E.g. this query breaks down calls by API method:

```sql
SELECT DISTINCT
  proto_payload.audit_log.method_name AS method_name,
  count(*) AS count
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
LIMIT 1000
```

and this one samples the actual contents of queries:

```sql
SELECT proto_payload
FROM `bridgy-federated.global._Default._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
LIMIT 200
```

I can't aggregate (group by) by fields inside …
It's in BigQuery, https://console.cloud.google.com/bigquery?ws=!1m5!1m4!4m3!1sbridgy-federated!2slogs!3s_Default, let's see how that goes. 37M rows total.
Damn, BigQuery can't do it either. Maybe if I pull the JSON out in a view.
Back to looking at this.
Some useful datastore API usage analytics below, and already finding some useful results. Queries are ~40% of datastore API calls, and the vast majority of those are looking up original …

```sql
SELECT DISTINCT proto_payload.audit_log.method_name AS method_name, count(*) AS count
FROM `bridgy-federated.logs._AllLogs`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
GROUP BY method_name
ORDER BY count DESC
```

```sql
SELECT string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  ARRAY_LENGTH(JSON_QUERY_ARRAY(proto_payload.audit_log.request.query.filter.compositeFilter.filters)) AS num_composite_filters,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
GROUP BY kind, num_composite_filters
ORDER BY count(*) DESC
```

```sql
SELECT string(proto_payload.audit_log.request.query.kind[0].name) AS kind,
  FilterStr(proto_payload.audit_log.request.query.filter) AS filter,
  count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.RunQuery'
  AND proto_payload.audit_log.request.query.filter.propertyFilter IS NOT NULL
GROUP BY kind, filter
ORDER BY count(*) DESC
```
^ should hopefully help with queries. Here's the monitoring I'm watching: https://console.cloud.google.com/apis/api/datastore.googleapis.com/metrics?project=bridgy-federated&pageState=%28%22duration%22%3A%28%22groupValue%22%3A%22PT1H%22%2C%22customValue%22%3Anull%29%29
Now looking at lookups aka gets. Surprising conclusion: the vast majority are for stored DID docs. Who knew.

```sql
SELECT array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) AS num_keys, count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
GROUP BY num_keys
ORDER BY count(*) DESC
```

```sql
SELECT string(proto_payload.audit_log.request.keys[0].path[0].kind) AS kind, count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND array_length(JSON_QUERY_ARRAY(proto_payload.audit_log.request.keys)) = 1
GROUP BY kind
ORDER BY count(*) DESC
```

Object lookups by id scheme:

```sql
SELECT split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] AS scheme, count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
GROUP BY scheme
ORDER BY count(*) DESC
```

I don't understand why the DID …

```sql
SELECT JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name) AS name, count(*)
FROM `bridgy-federated.logs._Default`
WHERE log_name = 'projects/bridgy-federated/logs/cloudaudit.googleapis.com%2Fdata_access'
  AND proto_payload.audit_log.method_name = 'google.datastore.v1.Datastore.Lookup'
  AND string(proto_payload.audit_log.request.keys[0].path[0].kind) = 'Object'
  AND split(JSON_EXTRACT_SCALAR(proto_payload.audit_log.request.keys[0].path[0].name), ':')[0] = 'did'
GROUP BY name
ORDER BY count(*) DESC
LIMIT 100
```
Now looking at optimizing log storage. We were doing 250G/mo a bit ago; we're now down to ~150G/mo or so. The main cost here is the initial ingest: we get 50G/mo for free, then $.50/G after that, so at ~150G/mo that's about (150 - 50) × $0.50 ≈ $50/mo. Ingest includes 30d retention, and our current retention period is already set to 30d, so reducing retention wouldn't help. https://cloud.google.com/stackdriver/pricing#cloud-monitoring-pricing

Tips on optimizing logging costs: https://cloud.google.com/architecture/framework/cost-optimization/cloudops#logging
Logs in BigQuery: https://console.cloud.google.com/bigquery?invt=AbjFiQ&project=bridgy-federated&inv=1&ws=%211m0
Logs dashboard: https://console.cloud.google.com/monitoring/dashboards/builder/24c22d42-91d8-4feb-aa6b-99dbb84c6417;duration=PT8H?project=bridgy-federated
The next cost to look at is CPU. router is currently on four cores, atproto-hub on one. We should be able to get them both down. Here's one place to start: https://cloud.google.com/profiler/docs
Next step for datastore reads: snarfed/arroba#30
^ The difficulty with merging router and atproto-hub is that we currently run four WSGI workers for router, and we want to run the atproto-hub threads on just one of them. I had trouble running all the threads in just one worker on router a while back, but that may have been a red herring; it may actually work OK after all, and I doubt we're compute-bound enough that we'd care about the GIL. Worth trying.
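One way to pin the hub threads to a single worker after a merge would be a lock-file election at startup. This is a hypothetical sketch, not bridgy-fed code; the helper name, lock path, and thread stand-ins are all invented:

```python
# Sketch: the first WSGI worker to grab an exclusive flock on a shared lock
# file starts the atproto-hub background threads; the others serve web
# requests only. The lock is released automatically when the worker dies.
import fcntl
import threading

LOCK_PATH = '/tmp/atproto-hub.lock'  # assumed path

def try_become_hub(lock_path=LOCK_PATH):
    """Returns the open lock file if this worker should run hub threads, else None."""
    f = open(lock_path, 'w')
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f  # keep this open for the worker's lifetime to hold the lock
    except BlockingIOError:
        f.close()
        return None

def start_hub_threads():
    # stand-in for starting the real atproto-hub threads (firehose etc.)
    threading.Thread(target=lambda: None, daemon=True).start()

if lock := try_become_hub():
    start_hub_threads()
```

flock locks are per open file description, so even workers forked from the same gunicorn master contend correctly, and a crashed worker's lock is released by the kernel.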
Another idea here: stop storing and fetching transient activities (create, update, delete, undo, accept/reject, etc) in the datastore entirely. Just use memcache for them.
Another idea here: cache follower counts in memcache instead of instance memory. Not sure how much that would matter, I'll see when I dig into datastore usage.
Yet another idea here: stop storing incoming activities altogether, only store objects.
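The memcache-only idea for transient activities could look roughly like this. Everything here is hypothetical (function names, key format, the TTL guess, and the fake client), just to make the shape concrete:

```python
# Sketch: transient activity types go to memcache with a short TTL instead
# of becoming datastore entities; only objects keep getting stored durably.
import json
import time

TRANSIENT_TYPES = {'create', 'update', 'delete', 'undo', 'accept', 'reject'}
TRANSIENT_TTL = 60 * 60 * 2  # guess: a couple hours covers retries/dedup

class FakeMemcache:
    """In-memory stand-in with a real memcache client's get/set shape."""
    def __init__(self):
        self.store = {}
    def set(self, key, value, expire=0):
        self.store[key] = (value, time.time() + expire if expire else None)
    def get(self, key):
        value, expires = self.store.get(key, (None, None))
        if expires and expires < time.time():
            return None
        return value

def store_activity(client, activity):
    if activity['type'].lower() in TRANSIENT_TYPES:
        client.set('activity:' + activity['id'], json.dumps(activity),
                   expire=TRANSIENT_TTL)
    else:
        raise NotImplementedError('non-transient: store in datastore as before')

def load_activity(client, id):
    cached = client.get('activity:' + id)
    return json.loads(cached) if cached else None
```

The trade-off is that a memcache eviction or restart silently drops in-flight activities, which is probably fine for this class of data but worth stating.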
Oh wow, I didn't realize that the datastore had separate pricing for multi-region vs single-region: https://cloud.google.com/datastore/pricing#location_pricing . Let's see if I can figure out how to switch us to single-region.
OK, so right now the App Engine app is in us-central, see dashboard or … The catch may be that datastore says it doesn't support App Engine in us-central1, argh.
Sadly, we may be too late here; evidently you can't change a datastore db's location after it's created. I can use a regional endpoint for requests, but I don't think that will affect storage locality or pricing.
Can't get a regional endpoint to work anyway:

```python
from google.api_core.client_options import ClientOptions

with ndb.Client(client_options=ClientOptions(api_endpoint='https://us-east1.googleapis.com')).context():
    w = Web.get_by_id('snarfed.org')
```

```
google.api_core.exceptions.RetryError: Maximum number of 3 retries exceeded while calling <function make_call.<locals>.rpc_call at 0x108723f60>, last exception: 503 DNS resolution failed for https://us-east1-firestore.googleapis.com:443: C-ares status is not ARES_SUCCESS qtype=A name=https://us-east1-firestore.googleapis.com:443 is_balancer=0: Domain name not found
```

Also tried variations like … Dropping the …
Adds a new generic common.memcache_memoize decorator for caching any function's output in memcache. For #1149.
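For reference, a minimal sketch of what such a decorator might look like. This is not the actual common.memcache_memoize, whose key format, TTL handling, and client wiring surely differ; the DictClient is a stand-in so the example is self-contained:

```python
# Sketch of a memcache memoize decorator: pickle the return value, key on
# the function name plus a hash of the arguments (memcache keys are capped
# at 250 bytes, hence the hash).
import functools
import hashlib
import pickle

def memcache_memoize(client, expire=0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            arg_hash = hashlib.sha256(
                pickle.dumps((args, sorted(kwargs.items())))).hexdigest()
            key = f'memoize:{fn.__qualname__}:{arg_hash}'
            cached = client.get(key)
            if cached is not None:
                return pickle.loads(cached)
            val = fn(*args, **kwargs)
            client.set(key, pickle.dumps(val), expire=expire)
            return val
        return wrapper
    return decorator

class DictClient:
    """In-memory stand-in for a real memcache client."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value, expire=0):
        self.store[key] = value

calls = []
client = DictClient()

@memcache_memoize(client)
def follower_count(user):
    calls.append(user)  # track actual (non-cached) invocations
    return 42

follower_count('alice')
follower_count('alice')  # second call comes from the cache
```

One subtlety this sketch punts on: a function that legitimately returns None is indistinguishable from a cache miss here, so the real decorator needs a sentinel for that case.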
Added some logging on datastore lookups; we're doing tons without caching, as expected. A few stack traces:
router: …
No good reason, but I vaguely suspect our … (lines 65 to 101 in b4270a5).
Still struggling to instrument datastore lookups and understand why they're not getting cached. Got logging on the source side of ndb lookups, via …
Here's everything I can think of that I've done to get ndb caching in memcache working, or to even check that it's working:
I still don't see …
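One thing worth double checking in that list is how each ndb context wires up its global cache, since ndb only consults memcache for contexts created with one. A config sketch of the standard google.cloud.ndb pattern (the host is a placeholder; bridgy-fed's actual setup may differ):

```python
# Config sketch, not bridgy-fed's actual wiring: google.cloud.ndb only uses
# memcache if the context is created with a global_cache. A context created
# without one (e.g. in a background thread) will bypass memcache entirely.
from google.cloud import ndb
from google.cloud.ndb import global_cache
from pymemcache.client.base import PooledClient

memcache = PooledClient('memcache-host:11211', timeout=10)  # placeholder host
client = ndb.Client()

with client.context(global_cache=global_cache.MemcacheCache(memcache)):
    # ndb gets inside this context check memcache before the datastore
    ...
```

A classic failure mode with this setup is one code path (a thread, a task handler) creating its own bare `client.context()`, which silently skips the cache.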
Don't want to draw attention to this; I've been looking at it mostly behind the scenes, but I'd like to start tracking at least the investigation and work more publicly.
I expect there's plenty of low-hanging fruit here. The biggest contributors right now are datastore reads and frontend instances, both of which I should be able to cut down. The biggest blocker right now is that I'm not sure what's driving the datastore read load, especially since I added a memcached instance a while back. Hrm.