[Performace] -- Instance validation is very slow when the instance data set is considerably huge #28

faraazbrcm · 2019-09-12T06:22:47Z

Hi,

I am trying to use Yangson in Sonic and as part of my experiment, I am trying to load huge instance data for validation. Although validation goes through but it is taking huge time to complete.

In my case I have sample yang for ACL. I am trying to create 1000 ACLs with 5 rules each and yang has one must expression.

Is there any way to improve performance. Can you please look into this case how to make yangson highly performant. Practically instance data will be huge in the production system.

Attaching a tar file containing dependent yang models along with instance data JSON file (acl_1000.json) along with metadata file (yang-library-ext2.json).

Command line:

bash-3.2$ time yangson yang-library-ext2.json -p ../common/:. -v acl_1000.json

real 62m59.614s

samples.zip

user 62m45.841s

sys 0m4.968s

bash-3.2$ yangson -V
yangson 1.3.43

iwanb · 2019-09-12T11:21:11Z

I've encountered similar issues, AFAIK there are two reasons.

First the backing data structure involves a lot of data copy when traversing the tree. It comes from the up, __getitem__, _member and _entry methods of the InstanceNode classes. For containers the keys are copied, and more importantly for lists the whole list is copied when you access an element (so it won't scale as the list grows). The copy should be avoided for all access methods, and the data should only be copied when modifying something. There are probably also possible optimizations to avoid copying data when modifying the tree.

Second there is no indexing for list instances, so looking up a list item by its key(s) is O(n). An indexing mechanism should be added to make it O(1).

llhotka · 2019-09-12T14:39:23Z

Yes, I made a similar experience when I tried to feed the entire .com DNS zone to an instance tree. But what was really slow was to read the data and build the tree. Traversing and modifying the zipper tree does have some overhead compared to a standard tree but the actual size of the tree shouldn't IMO play much role - all such operations are localised.

I think the slowness is partly due to Python, including the fact that Python doesn't support tail recursion, which may be significant when validating a deep tree.

I guess it is fair to say that Yangson isn't really suitable for huge instance data.

faraazbrcm · 2019-09-14T04:45:02Z

@llhotka I see same behavior is observed with libyang as well. are there any other tools available to validate yang instance efficiently?

faraazbrcm · 2019-09-14T04:45:32Z

I've encountered similar issues, AFAIK there are two reasons.

First the backing data structure involves a lot of data copy when traversing the tree. It comes from the up, __getitem__, _member and _entry methods of the InstanceNode classes. For containers the keys are copied, and more importantly for lists the whole list is copied when you access an element (so it won't scale as the list grows). The copy should be avoided for all access methods, and the data should only be copied when modifying something. There are probably also possible optimizations to avoid copying data when modifying the tree.

Second there is no indexing for list instances, so looking up a list item by its key(s) is O(n). An indexing mechanism should be added to make it O(1).

How did you overcome this?

iwanb · 2019-09-16T10:15:11Z

I haven't yet but I would like to improve the data structures in yangson.

Quick experiment to show the impact:

In [16]: l = cont.put_member('listA', [{'leafE': str(i), 'leafF': True} for i in range(1000)], raw=True)
In [23]: l_short = cont.put_member('listA', [{'leafE': str(i), 'leafF': True} for i in range(1)], raw=True)
# Baseline
In [25]: timeit.timeit('l_short[0]', globals=locals(), number=1000)                                         
Out[25]: 0.011242438000408583
# 50 times slower (that's the copy overhead)
In [26]: timeit.timeit('l[0]', globals=locals(), number=1000)                                               
Out[26]: 0.542444660000001
# 100 times slower when trying to find the last element of the list
# 2 times slower than a lookup by index, that's the penalty due to a lack of index
In [27]: timeit.timeit('l.look_up(leafE="999", leafF=True)', globals=locals(), number=1000)                 
Out[27]: 1.021914311000728
# 50 times slower when looking at the first element
In [28]: timeit.timeit('l.look_up(leafE="0", leafF=True)', globals=locals(), number=1000)                   
Out[28]: 0.5635812550008268
# Baseline
In [29]: timeit.timeit('l_short.look_up(leafE="0", leafF=True)', globals=locals(), number=1000)             
Out[29]: 0.011247126000853314

I expect the list copy is done many times when checking a constraint because it traverses the tree.

While it's true that Python is not fast in general, all those operations could take the same time.

For other performance problems like recursion and number of function calls/object allocations, doing more pre-computations on the schema and caching on the instance nodes might help. Of course that's not straightforward, you need to do a lot of profiling to figure out what could help.

jktjkt · 2021-06-08T18:39:10Z

Hello, just a friendly question if there's some development going on to address this. We've stumbled upon this problem at @Telecominfraproject. Here's the data that we're loading, here's the ietf-yang-library data, and the data models are from this (IETF) and this (ours) directory.

A call to validate takes about 10s on the data set I mentioned -- and that's less than 5'000 lines of JSON. We also have other, non-public data (with a completely different YANG model) where the validation did not finish in an hour. Unfortunately, it was not a synthetic benchmark, but real data that we're working with. @llhotka, do you have plans to improve this in the near future? Our backup option is of course writing nice, object-level Python bindings to @CESNET's libyang, but I thought that it's a good idea to ask here first. Even though we already use numpy, we would love to avoid building wheels for a C library which is not in the distros.

llhotka · 2021-06-09T05:44:23Z

There are no plans for such development, it would probably mean writing something entirely different.

faraazbrcm changed the title ~~[Performace] -- Instance is very slow when the instance data set is considerably huge~~ [Performace] -- Instance validation is very slow when the instance data set is considerably huge Sep 12, 2019

jktjkt mentioned this issue Dec 7, 2021

yang library packaging Telecominfraproject/oopt-gnpy#426

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Performace] -- Instance validation is very slow when the instance data set is considerably huge #28

[Performace] -- Instance validation is very slow when the instance data set is considerably huge #28

faraazbrcm commented Sep 12, 2019 •

edited

Loading

iwanb commented Sep 12, 2019

llhotka commented Sep 12, 2019

faraazbrcm commented Sep 14, 2019

faraazbrcm commented Sep 14, 2019

iwanb commented Sep 16, 2019

jktjkt commented Jun 8, 2021

llhotka commented Jun 9, 2021

[Performace] -- Instance validation is very slow when the instance data set is considerably huge #28

[Performace] -- Instance validation is very slow when the instance data set is considerably huge #28

Comments

faraazbrcm commented Sep 12, 2019 • edited Loading

iwanb commented Sep 12, 2019

llhotka commented Sep 12, 2019

faraazbrcm commented Sep 14, 2019

faraazbrcm commented Sep 14, 2019

iwanb commented Sep 16, 2019

jktjkt commented Jun 8, 2021

llhotka commented Jun 9, 2021

faraazbrcm commented Sep 12, 2019 •

edited

Loading