Datastore: (using NDB) Linear memory usage consumption when using fetch_page #752
This sounds alarming.
Hi, I'll try to address some of your questions to the best of my knowledge!
We actually experienced this bug on our production servers; we use "B8" instances, which hold 2048 MB of memory. We initially thought our code was simply not optimized enough, but with the snippet above we do reproduce the "soft memory exceeded" error. Our application usually runs such long loops as Cloud Tasks. In that configuration, we managed to get around this bug by using ndb cursors: the function used in Cloud Tasks only processes a few loops at a time, then triggers another execution of itself with an updated cursor, letting the instance clean up its memory and start again fresh.

I couldn't confirm how Python 2 / legacy ndb behaves; we're currently using Python 3.10 and google-cloud-ndb v1.11.1, which isn't legacy ndb.

As I mentioned, this issue occurred on real B8 servers, so we did experience soft memory being exceeded in that sort of case.
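The cursor-chaining workaround described above can be sketched roughly like this. This is a pure-Python simulation: `fetch_page`, `process_batch`, the in-memory `DATA` list, and the page sizes are all illustrative stand-ins, not the actual ndb or Cloud Tasks code.

```python
from typing import List, Optional, Tuple

DATA = list(range(1000))  # stand-in for datastore entities

def fetch_page(page_size: int, start_cursor: int = 0) -> Tuple[List[int], int, bool]:
    """Stand-in for ndb's Query.fetch_page(): returns (results, cursor, more)."""
    end = start_cursor + page_size
    return DATA[start_cursor:end], end, end < len(DATA)

def process_batch(cursor: int, pages_per_task: int = 3, page_size: int = 100) -> Optional[int]:
    """Process a few pages, then return the cursor for the next task.

    Returns None when nothing is left, so no follow-up task is enqueued.
    """
    for _ in range(pages_per_task):
        results, cursor, more = fetch_page(page_size, cursor)
        # ... process `results` here ...
        if not more:
            return None
    return cursor

# Each Cloud Task invocation would call process_batch and, if it gets a
# cursor back, enqueue a new task with that cursor (enqueueing not shown).
cursor: Optional[int] = 0
tasks = 0
while cursor is not None:
    cursor = process_batch(cursor)
    tasks += 1
print(tasks)  # 4 task invocations cover all 1000 entities
```

Because each task is a fresh invocation, any memory the runtime failed to release is reclaimed between tasks, which is why this sidesteps the leak.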
I have tried to follow your suggestion of investigating the GC stats. I must admit I have very little experience with GC, let alone GC stats.

```python
import gc
import os

import psutil  # https://pypi.org/project/psutil/
from google.cloud import ndb

from app.models import Enterprise

print(ndb.__version__)

def memory_usage():
    pid = os.getpid()
    process = psutil.Process(pid)
    return process.memory_info().rss / 10 ** 6

with ndb.Client().context():
    more = True
    cursor = None
    standard_usage = memory_usage()
    while more:
        # 'Enterprise' is rather a big entity in our datastore, which makes
        # the memory consumption increase more visible
        result, cursor, more = Enterprise.query().fetch_page(
            200, start_cursor=cursor, use_cache=False, use_global_cache=False
        )
        print("memory: ", memory_usage() - standard_usage)
        print(f"{gc.get_stats()}")
        gc.collect()
```

The output of the program is attached. I'll let you investigate the GC stats and hopefully draw some conclusions; without prior knowledge, nothing stands out to me. Hope this clarifies some of your questions and makes it easier for maintainers to investigate this issue.
Thanks for responding! We do things like this all the time in tasks running on f1-micro instances and it works fine; memory gets freed up along the way because nothing is stored in the instance cache. So if you're having memory issues fetching pages with 2 GB of memory available, it seems like a pretty major regression.

Re: Python 2 ndb, if I'm not mistaken this library was made mainly to coax developers off the old App Engine standard runtime. At some point, messages started appearing in issues about how development was stopping and there would be no further support, Google started recommending against using it for new projects, and finally seemed to give in and let developers continue using (for now, anyway) most of the legacy App Engine services, including the ndb library, on Python 3 via this: https://github.com/GoogleCloudPlatform/appengine-python-standard

I'd be curious to know whether ndb via the legacy bundled services SDK exhibits this issue for you. As it is, and without any comment from the team, it's kind of a red flag for us, since we know our instances would run out of memory right away if this bug exists. If legacy bundled services support goes away at some point, using this library instead wouldn't really be an option.
In certain circumstances, we were not respecting use_cache for queries, unlike legacy NDB, which is quite emphatic about supporting it. (See https://github.com/GoogleCloudPlatform/datastore-ndb-python/blob/59cb209ed95480025d26531fc91397575438d2fe/ndb/query.py#L186-L187)

In googleapis#613 we tried to match legacy NDB behavior by updating the cache using the results of queries. We still do that, but now we respect use_cache, which was a valid keyword argument for Query.fetch() and friends but was not passed down to the context cache when needed.

As a result, the cache could mysteriously accumulate lots of memory and perhaps even cause you to hit memory limits, even though it was seemingly disabled and it didn't look like any objects were holding references to your query results. This is a problem for batch-style workloads where you know you're only interested in processing a given entity once.

Fixes googleapis#752
Hi, I'm working on a PR to fix this issue. Thanks for bringing it up. In the meantime, you can try the following workaround of explicitly setting a never-cache policy. Instead of:

```python
with ndb.Client().context():
```

you can do:

```python
with ndb.Client().context(cache_policy=lambda l: False):
```
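For anyone wondering what that workaround does, here is a minimal simulation of the cache-policy idea, assuming the semantics the fix relies on: the context consults the policy per key and skips caching when it returns False. `Context` and `store_result` here are illustrative names for this sketch, not ndb internals.

```python
class Context:
    """Toy model of a context-local cache guarded by a cache policy."""

    def __init__(self, cache_policy=None):
        self.cache = {}
        # The default policy caches everything, mimicking the leaky behavior.
        self.cache_policy = cache_policy or (lambda key: True)

    def store_result(self, key, entity):
        # Only keep the entity if the policy allows caching this key.
        if self.cache_policy(key):
            self.cache[key] = entity
        return entity

ctx_default = Context()
ctx_nocache = Context(cache_policy=lambda key: False)
for i in range(1000):
    ctx_default.store_result(i, object())
    ctx_nocache.store_result(i, object())
print(len(ctx_default.cache), len(ctx_nocache.cache))  # 1000 0
```

With a never-cache policy, the cache stays empty no matter how many pages are fetched, which is why memory plateaus instead of growing linearly.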
Giving some more thought to this cache management issue, it seems to me that taking the … The local cache as it is today is just a dict object, to which entities are added on … If it is a good idea, I could give it a try.
Hi @rwhogg, thank you for your response and your work on this issue. Just to let you know that your suggested fix:

```python
with ndb.Client().context(cache_policy=lambda l: False):
```

does the trick for me! The maximum memory usage reached is 72 MB, which is expected in our case.
We were recently working on large-scale migration batches (on ~550,000 entities) using NDB on App Engine. As we are limited to 2048 MB of RAM, memory consumption is a big subject when doing this kind of thing.
When trying to lower it, I noticed an unexpected (in my opinion) behaviour of fetch_page. I would have expected the memory consumption to approach a plateau, as the new entities weigh approximately the same as the previous ones. Instead, I found that it grows linearly until the loop ends, even without processing the results at all.
Environment details
Code example
Output
I truncated the output, as it loops ~400 times. Even using the best App Engine servers (B8), we would not be able to finish that script.