Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Performace] -- Instance validation is very slow when the instance data set is considerably huge #28

Open
faraazbrcm opened this issue Sep 12, 2019 · 7 comments

Comments

@faraazbrcm
Copy link

faraazbrcm commented Sep 12, 2019

Hi,

I am trying to use Yangson in Sonic and as part of my experiment, I am trying to load huge instance data for validation. Although validation goes through but it is taking huge time to complete.

In my case I have sample yang for ACL. I am trying to create 1000 ACLs with 5 rules each and yang has one must expression.

Is there any way to improve performance. Can you please look into this case how to make yangson highly performant. Practically instance data will be huge in the production system.

Attaching a tar file containing dependent yang models along with instance data JSON file (acl_1000.json) along with metadata file (yang-library-ext2.json).

Command line:

bash-3.2$ time yangson yang-library-ext2.json -p ../common/:. -v acl_1000.json

real 62m59.614s

samples.zip

user 62m45.841s

sys 0m4.968s

bash-3.2$ yangson -V
yangson 1.3.43

@faraazbrcm faraazbrcm changed the title [Performace] -- Instance is very slow when the instance data set is considerably huge [Performace] -- Instance validation is very slow when the instance data set is considerably huge Sep 12, 2019
@iwanb
Copy link

iwanb commented Sep 12, 2019

I've encountered similar issues, AFAIK there are two reasons.

First the backing data structure involves a lot of data copy when traversing the tree. It comes from the up, __getitem__, _member and _entry methods of the InstanceNode classes. For containers the keys are copied, and more importantly for lists the whole list is copied when you access an element (so it won't scale as the list grows). The copy should be avoided for all access methods, and the data should only be copied when modifying something. There are probably also possible optimizations to avoid copying data when modifying the tree.

Second there is no indexing for list instances, so looking up a list item by its key(s) is O(n). An indexing mechanism should be added to make it O(1).

@llhotka
Copy link
Member

llhotka commented Sep 12, 2019

Yes, I made a similar experience when I tried to feed the entire .com DNS zone to an instance tree. But what was really slow was to read the data and build the tree. Traversing and modifying the zipper tree does have some overhead compared to a standard tree but the actual size of the tree shouldn't IMO play much role - all such operations are localised.

I think the slowness is partly due to Python, including the fact that Python doesn't support tail recursion, which may be significant when validating a deep tree.

I guess it is fair to say that Yangson isn't really suitable for huge instance data.

@faraazbrcm
Copy link
Author

@llhotka I see same behavior is observed with libyang as well. are there any other tools available to validate yang instance efficiently?

@faraazbrcm
Copy link
Author

I've encountered similar issues, AFAIK there are two reasons.

First the backing data structure involves a lot of data copy when traversing the tree. It comes from the up, __getitem__, _member and _entry methods of the InstanceNode classes. For containers the keys are copied, and more importantly for lists the whole list is copied when you access an element (so it won't scale as the list grows). The copy should be avoided for all access methods, and the data should only be copied when modifying something. There are probably also possible optimizations to avoid copying data when modifying the tree.

Second there is no indexing for list instances, so looking up a list item by its key(s) is O(n). An indexing mechanism should be added to make it O(1).

How did you overcome this?

@iwanb
Copy link

iwanb commented Sep 16, 2019

I haven't yet but I would like to improve the data structures in yangson.

Quick experiment to show the impact:

In [16]: l = cont.put_member('listA', [{'leafE': str(i), 'leafF': True} for i in range(1000)], raw=True)
In [23]: l_short = cont.put_member('listA', [{'leafE': str(i), 'leafF': True} for i in range(1)], raw=True)
# Baseline
In [25]: timeit.timeit('l_short[0]', globals=locals(), number=1000)                                         
Out[25]: 0.011242438000408583
# 50 times slower (that's the copy overhead)
In [26]: timeit.timeit('l[0]', globals=locals(), number=1000)                                               
Out[26]: 0.542444660000001
# 100 times slower when trying to find the last element of the list
# 2 times slower than a lookup by index, that's the penalty due to a lack of index
In [27]: timeit.timeit('l.look_up(leafE="999", leafF=True)', globals=locals(), number=1000)                 
Out[27]: 1.021914311000728
# 50 times slower when looking at the first element
In [28]: timeit.timeit('l.look_up(leafE="0", leafF=True)', globals=locals(), number=1000)                   
Out[28]: 0.5635812550008268
# Baseline
In [29]: timeit.timeit('l_short.look_up(leafE="0", leafF=True)', globals=locals(), number=1000)             
Out[29]: 0.011247126000853314

I expect the list copy is done many times when checking a constraint because it traverses the tree.

While it's true that Python is not fast in general, all those operations could take the same time.

For other performance problems like recursion and number of function calls/object allocations, doing more pre-computations on the schema and caching on the instance nodes might help. Of course that's not straightforward, you need to do a lot of profiling to figure out what could help.

@jktjkt
Copy link
Contributor

jktjkt commented Jun 8, 2021

Hello, just a friendly question if there's some development going on to address this. We've stumbled upon this problem at @Telecominfraproject. Here's the data that we're loading, here's the ietf-yang-library data, and the data models are from this (IETF) and this (ours) directory.

A call to validate takes about 10s on the data set I mentioned -- and that's less than 5'000 lines of JSON. We also have other, non-public data (with a completely different YANG model) where the validation did not finish in an hour. Unfortunately, it was not a synthetic benchmark, but real data that we're working with. @llhotka, do you have plans to improve this in the near future? Our backup option is of course writing nice, object-level Python bindings to @CESNET's libyang, but I thought that it's a good idea to ask here first. Even though we already use numpy, we would love to avoid building wheels for a C library which is not in the distros.

@llhotka
Copy link
Member

llhotka commented Jun 9, 2021

There are no plans for such development, it would probably mean writing something entirely different.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants