Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datastore: (using NDB) Linear memory usage consumption when using fetch_page #752

Closed
Nathanmalnoury opened this issue Dec 8, 2021 · 7 comments · Fixed by #873 or #864
Closed

Datastore: (using NDB) Linear memory usage consumption when using fetch_page #752

Nathanmalnoury opened this issue Dec 8, 2021 · 7 comments · Fixed by #873 or #864
Labels
api: datastore Issues related to the googleapis/python-ndb API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@Nathanmalnoury
Copy link

Nathanmalnoury commented Dec 8, 2021

We were recently working on large scale migration batches (on ~550.000 entities) using NDB on App Engine. As we are limited to 2048MB of ram, memory consumption is a big subject when doing this kind of things.
When trying to lower it, I noticed something that I found an unexpected (in my opinion) behaviour of fetch_page.
I would have expected the memory consumption to be close to a plateau, as the new entities will weight approx. the same as the previous ones. I found that it instead grows linearly until it ends, even without processing the results at all.

Environment details

  1. Specify the API: Datastore using NDB 1.11.1
  2. OS type and version: Ubuntu 20.04.3 LTS x86_64
  3. Python version and virtual environment information: 3.8.10
  4. google-cloud- version: 1.11.1

Code example

import os

import psutil  # https://pypi.org/project/psutil/
from google.cloud import ndb

from app.models import Enterprise

print(ndb.__version__)


def memory_usage():
    pid = os.getpid()
    process = psutil.Process(pid)
    return process.memory_info().rss / 10 ** 6


with ndb.Client().context():
    more = True
    cursor = None
    standard_usage = memory_usage()
    while more:
        # 'Enterprise' is rather a big entity in our datastore, which makes the memory consumption increase more visible
        result, cursor, more = Enterprise.query().fetch_page(200, start_cursor=cursor, use_cache=False, use_global_cache=False)
        print(memory_usage() - standard_usage)

Output

I truncated the output, as it loops ~400 times. Using the best App Engine servers (B8) we would not be able to finish that script.

37.437439999999995
51.515392000000006
60.62079999999999
74.268672
83.07507199999999
92.27878399999999
101.040128
114.597888
129.179648
138.15398399999998
136.94976000000003
156.188672
164.958208
[...]
3663.07328
3672.14592
3681.140736
3690.008576
3698.93376
3708.043264
3710.746624
@product-auto-label product-auto-label bot added the api: datastore Issues related to the googleapis/python-ndb API. label Dec 8, 2021
@yoshi-automation yoshi-automation added triage me I really want to be triaged. 🚨 This issue needs some love. labels Dec 9, 2021
@meredithslota meredithslota added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed triage me I really want to be triaged. 🚨 This issue needs some love. labels Feb 9, 2022
@yoshi-automation yoshi-automation added 🚨 This issue needs some love. and removed 🚨 This issue needs some love. labels May 11, 2022
@yoshi-automation yoshi-automation added the 🚨 This issue needs some love. label Jun 6, 2022
@pnico
Copy link

pnico commented Feb 3, 2023

This sounds alarming; we use fetch_page(use_cache=False) a lot. It's not clear if you actually tried this or just were worried it wouldn't work, because of the output of memory_info(), and if you confirmed this same behavior didn't occur on python 2 / legacy ndb. It seems possible that this could just mean that garbage collection wasn't triggered during your test, and that in a real situation a collection would be triggered rather than just using up all available memory. More investigation into GC stats inside your test script might have shed some light on this, but instead this issue is still just sitting here mostly untouched for years.

Does the type: bug label mean this was confirmed by the team to be a real issue? If so, isn't this kind of important since GCP developer support is still urging developers to go ahead and use this library? If not, shouldn't the issue be closed with an explanation showing it's safe to use this library in production? Without knowing the answer to either of those questions, the responsible thing is to assume the worst - which is kind of worse than just having a known/confirmed bug like this.

@Nathanmalnoury
Copy link
Author

Hi, I'll try to address some of your questions to the best of my knowledge !

It's not clear if you actually tried this or just were worried it wouldn't work,

We actually experienced this bug on our productions servers; we use "B8" instances; which hold 2048MB of memory. We initially thought our code was simply not optimized enough, but with this snippet above we do reproduce the "soft memory exceeded" error.

Our application usually runs such long loops as Cloud Tasks. In that configuration, we managed to get around this bug by using ndb cursors. The function used in Cloud Tasks only processes a few loops at a time; and triggers another execution of itself with an updated cursor to let the instance clean its memory, and start again fresh.

and if you confirmed this same behavior didn't occur on python 2 / legacy ndb

I could'nt confirm how python 2 / legacy ndb behaves; we're currently using python 3.10 and google-cloud-ndb v1.11.1; which isn't legacy ndb.

It seems possible that this could just mean that garbage collection wasn't triggered during your test, and that in a real situation a collection would be triggered rather than just using up all available memory.

As I mentionned, this issue occured on real B8 servers; so we did experienced soft memory being exceeded in that sort of cases.

More investigation into GC stats inside your test script might have shed some light on this

I have tried to include your suggestion of investigating more into the GC stats. I must admit I have very little experience with GC; let alone GC Stats.
In an effort to clear-up all doubts; I've included both the stats and a forced gc.collect into the loop of my original script:

import gc
import os

import psutil  # https://pypi.org/project/psutil/
from google.cloud import ndb
from rsa.prime import gcd

from app.models import Enterprise

print(ndb.__version__)


def memory_usage():
    pid = os.getpid()
    process = psutil.Process(pid)
    return process.memory_info().rss / 10 ** 6


with ndb.Client().context():
    more = True
    cursor = None
    standard_usage = memory_usage()
    while more:
        # 'Enterprise' is rather a big entity in our datastore, which makes the memory consumption increase more visible
        result, cursor, more = Enterprise.query().fetch_page(200, start_cursor=cursor, use_cache=False, use_global_cache=False)
        print("memory: ", memory_usage() - standard_usage)
        print(f"{gc.get_stats()}")
        gc.collect()

And the output of the program is attached .
bug_fetch_page.txt.
I've interrupted the script as letting it run would take a while. The conclusion remains that the 2048Mb are still exceeded without any actual computations.

I'll let you investigate the GC stats and hopefully draw some conclusions. Without prior knowledge, nothing is uncollectable but the number of collections never decreases despite the regular gc.collect.

Hope this clarifies some your interogations; and makes it easier to investigate that issue for maintainers.

@pnico
Copy link

pnico commented Feb 16, 2023

Thanks for responding!

We do things like this all the time in tasks running on f1-micro instances and it works fine, memory gets freed up along the way because nothing is stored in instance cache. So if you're having memory issues fetching pages with 2GB of memory available, it seems like a pretty major regression.

Re. Python 2 ndb, if I'm not mistaken this library was made mainly to coax developers off the old appengine standard runtime. At some point, messages started appearing in issues about how development was stopping on this and no further support, Google started recommending against using it for new projects, and finally seemed to give in and just let developers continue using (for now, anyway) most of the legacy appengine services, including the ndb library, on Python 3 using this: https://github.com/GoogleCloudPlatform/appengine-python-standard

I'd be curious to know if the ndb support using the legacy bundled services sdk exhibits this issue for you. As it is, and without any comment from the team on it, it's kind of a red flag for us since we know our instances would just run out of memory right away if this is bug exists. If legacy bundled services support goes away at some point, using this instead wouldn't really be an option.

rwhogg added a commit to rwhogg/python-ndb that referenced this issue Feb 22, 2023
In certain circumstances, we were not respecting use_cache
for queries, unlike legacy NDB, which is quite emphatic
about supporting them.
(See https://github.com/GoogleCloudPlatform/datastore-ndb-python/blob/59cb209ed95480025d26531fc91397575438d2fe/ndb/query.py#L186-L187)

In googleapis#613 we tried to match legacy NDB behavior by updating the cache
using the results of queries. We still do that, but now we respect
use_cache, which was a valid keyword argument for Query.fetch()
and friends, but was not passed down to the context cache when
needed.

As a result, the cache could mysteriously accumulate lots of memory
usage and perhaps even cause you to hit memory limits, even if it was
seemingly disabled and it didn't look like there were any objects
holding references to your query results.
This is a problem for certain batch-style workloads where you know
you're only interested in processing a certain entity once.

Fixes googleapis#752
rwhogg added a commit to rwhogg/python-ndb that referenced this issue Feb 22, 2023
In certain circumstances, we were not respecting use_cache
for queries, unlike legacy NDB, which is quite emphatic
about supporting them.
(See https://github.com/GoogleCloudPlatform/datastore-ndb-python/blob/59cb209ed95480025d26531fc91397575438d2fe/ndb/query.py#L186-L187)

In googleapis#613 we tried to match legacy NDB behavior by updating the cache
using the results of queries. We still do that, but now we respect
use_cache, which was a valid keyword argument for Query.fetch()
and friends, but was not passed down to the context cache when
needed.

As a result, the cache could mysteriously accumulate lots of memory
usage and perhaps even cause you to hit memory limits, even if it was
seemingly disabled and it didn't look like there were any objects
holding references to your query results.
This is a problem for certain batch-style workloads where you know
you're only interested in processing a certain entity once.

Fixes googleapis#752
@rwhogg
Copy link
Contributor

rwhogg commented Feb 22, 2023

Hi, I'm working on a PR to fix this issue. Thanks for bringing it up.

In the meantime, you can try the following workaround of explicitly setting a never-cache policy:

Instead of

with ndb.Client().context():

you can do:

with ndb.Client().context(cache_policy=lambda l: False)

rwhogg added a commit to rwhogg/python-ndb that referenced this issue Feb 22, 2023
In certain circumstances, we were not respecting use_cache
for queries, unlike legacy NDB, which is quite emphatic
about supporting them.
(See https://github.com/GoogleCloudPlatform/datastore-ndb-python/blob/59cb209ed95480025d26531fc91397575438d2fe/ndb/query.py#L186-L187)

In googleapis#613 we tried to match legacy NDB behavior by updating the cache
using the results of queries. We still do that, but now we respect
use_cache, which was a valid keyword argument for Query.fetch()
and friends, but was not passed down to the context cache when
needed.

As a result, the cache could mysteriously accumulate lots of memory
usage and perhaps even cause you to hit memory limits, even if it was
seemingly disabled and it didn't look like there were any objects
holding references to your query results.
This is a problem for certain batch-style workloads where you know
you're only interested in processing a certain entity once.

Fixes googleapis#752
@patb2
Copy link

patb2 commented Feb 23, 2023

Great news ! This could also fix issue #751, where erroneous use of cache might play a role. Actually maybe not, since #751 is a case of accessing same entity from another instance, it would have to relate to global cache, not local cache.

@patb2
Copy link

patb2 commented Feb 23, 2023

Giving some more thoughts to this cache management issue, it seems to me that taking the use_cache into account is only part of the story. Indeed it will provide a way to alleviate the memory limit issue by adding use_cache=False to many query.fetch(), but there is another side to this: a proper local cache management should have some sort of memory management so that it does not use more than a pre-defined amount of memory, it should not take memory until the instance crashes.

The local cache as it is today is just a dict object, to which entities are added on Key.get() and Model.put(). I don't see where things get cleared. Could it be a good idea to use cachetools.LRUCache(), with a max_size that could be set as a context variable ?

If it is a good idea, I could give it a try.

@Nathanmalnoury
Copy link
Author

Hi @rwhogg,

Thank you for your response and your work on this issue.

Just to let you know that your suggested fix :

with ndb.Client().context(cache_policy=lambda l: False)

does the trick for me ! The maximum memory usage reached is 72 Mo, which is expected in our case.

rwhogg added a commit that referenced this issue Feb 27, 2023
In certain circumstances, we were not respecting use_cache
for queries, unlike legacy NDB, which is quite emphatic
about supporting them.
(See https://github.com/GoogleCloudPlatform/datastore-ndb-python/blob/59cb209ed95480025d26531fc91397575438d2fe/ndb/query.py#L186-L187)

In #613 we tried to match legacy NDB behavior by updating the cache
using the results of queries. We still do that, but now we respect
use_cache, which was a valid keyword argument for Query.fetch()
and friends, but was not passed down to the context cache when
needed.

As a result, the cache could mysteriously accumulate lots of memory
usage and perhaps even cause you to hit memory limits, even if it was
seemingly disabled and it didn't look like there were any objects
holding references to your query results.
This is a problem for certain batch-style workloads where you know
you're only interested in processing a certain entity once.

Fixes #752
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: datastore Issues related to the googleapis/python-ndb API. priority: p2 Moderately-important priority. Fix may not be included in next release. 🚨 This issue needs some love. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
6 participants