Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

My thoughts on memory management #36

Open
NodixBlockchain opened this issue Nov 23, 2017 · 33 comments
Open

My thoughts on memory management #36

NodixBlockchain opened this issue Nov 23, 2017 · 33 comments

Comments

@NodixBlockchain
Copy link

As you probably already know, i'm working on a vm system for a little while, and as everyone i guess i want to make memory management as easy and performant as possible.

Most of the applications i have in mind will have the same configuration of having potentially lots of objects in memory, but very few of them being mutated frequently, even if any of them can potentially be mutated at any time depending on user actions.

It's the typical setup for lot of applications, either they are database/cache, video game, or most application where loads of mutable object are loaded in memory, but most likely the user is only going to mutate and potentially free a small amount of them at any given time.

The main issue i see with mark and sweep GC is that as far as i know, the GC will need to scan the whole memory at every collection step in order to find the dead objects, which can be a big waste of time in case there will be a very low ratio of dead object vs number of potential object killed since the last collection.

What make me think that static analysis is a better approach is that the analyzer can have access to the code, and can have opportunity to already organize the memory in sort that potential lifetime of object can be easily determined in many case.

The best example of GC i have in mind is in video game engine such as havok or nintendo sdk, where there are big level loaded in memory, but as the GC is completely integrated with engine, which contain bsptree or other, it can make in sort that the number of objects that GC need to scan can be limited to what the user can actually interact with.

It's the case where there is high degree of interaction between the code structure and the memory management, as the code can structure the memory with some sort of tree that the garbage collector can recognize in sort to only scan a subset of all the potential objects in memory.

I believe static analysis could get to this sort of result, and grouping objects together following the code logic in sort that objects lifetime can easily be associated with a particular components or class, and determining which area of the objects tree need to be scanned after a certain code path has been followed.

The problem i see with general purpose garbage collector is at this level they can only be dumb, and mostly blind to the relation between the code and the structure of objects in memory, and need to scan everything all the time, no matter what the code is doing, and there is not really much way to work around this for the general case.

I believe that it wouldn't be too hard with simple analysis in many case to figure out which object is potentially deleted or not by knowing which code path has been followed if memory is structured in way to easily match certain class code with certain subset of objects references that it can potentially access. And the GC need to only scan the subset of object that has been modified by the code that has been executed between the two collections.

But it's why for the moment i tend to prefer manual reference counting, because there is a sense of 'you pay for what you use', and in most case in the configuration i mentioned above, there will be a very low ratio of total objects to scan vs the ones that are actually used, and it can allow to make very close match between what the code use, and the amount of object to scan. Even if it can end up with you pay a lot if you use a lot, but generally it will still be much less wasteful than having to scan all objects in memory for every collection.

The only way i see to make some mark/sweep GC efficient for the general case is if there is a good way to organize the code in sort that it can be easy for the GC to know which subset of memory to scan in relation to what code path has been taken.

Knowing that most of the time, only a small subset of objects can be easily identified for potential release if the code is organized properly and the memory is structured with tree system where branch of the tree can be associated with potential access from some subset of the total program. And a branch of the tree can be marked as to be scanned when the code that can mutate objects in this branch is executed.

Well maybe i'm missing something there, but from everything i could read so far on GC, it seems to be a recurrent problem with most general purpose implementations.

@NodixBlockchain
Copy link
Author

Another perspective, which matches my experience of fighting the garbage collector: >http://prog21.dadgum.com/134.html

Also I re-read the paper, and I did get those performance metrics backwards. If it's as good as the >paper says, I wonder why nobody is using it.

One thing to consider also reading those benchmark is that they are made on high level language that are very intensive on the heap.

It's why object nursery give significant performance increase. In the general case those tracing GC tend to spend a lot of their work on no brainer case that wouldn't be here in stack efficient language like C or C++ with 'simple value on stack'.

I would really like to see those same benchmarks made on language who have efficient stack management like C or C++ and doesn't have all the heap-noise generated by the policy of object for everything, and allocating on heap at every sneeze.

It would not take me too long to make a tracing GC instead of RC to do comparison on language like C with a real stack to store "short lived object", i'm pretty sure already the gap between tracing GC and RC will be much less.

And most of those benchmark seem to be made on desktop application who use moderate number of highly dynamic objects, i'm speculating here, but i'm pretty sure if you would make those same benchmark on stack efficient language who don't overuse the heap for local objects and on configuration where there is large number of not so dynamic long lived object in the heap, RC would be mile ahead.

@keean
Copy link
Owner

keean commented Nov 24, 2017

I think providing the hooks so that the same code can be run with a modular 'pluggable' GC would allow interesting back-to-back comparisons of performance with other factors removed. You could take a simple object language with manual allocation/deallocation, and for the GC make deallocation a no-op.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Nov 24, 2017

I could probably come up with such benchmark using my system, and i already have plenty of test function.

It would not be too hard to dynamically configure the heap allocator to be RC or tracing, and skip all the reference manipulation on tracing GC heap, and adding garbage collection point running periodically.

I already have function who run test for long amount of time enough to trigger several garbage collection with a tracing GC.

But at any rate i don't see how a mark and sweep GC would be faster than RC if all the short lived object heap noise is on the stack like it's more natural with C or C++, and only objects that have longer life time than the function would be actually allocated on the heap.

Yesterday i started to read some blogs from video game programmers on this topic, where they care lot about higher performance and memory safety, they make quite good points.

https://www.sebastiansylvan.com/post/on-gc-in-games-response-to-jeff-and-casey/

https://www.sebastiansylvan.com/post/garbage-collection-thoughts/

https://www.sebastiansylvan.com/post/why-most-high-level-languages-are-slow/

Already according to this paper, only make the minor improvement on RC on heap intensive language such as delayed reference counting and these sort of tweak which amount to fixing high level language intensive use of heap for temporary objects, it give similar performance than tracing mark and sweep.

http://users.cecs.anu.edu.au/~steveb/downloads/pdf/rc-ismm-2012.pdf

It seem most of the problems of GC with high level language is that they really over use the heap for everything, and give lot of no brainer work to do to the heap manager for short lived temporary objects.

With this removed or at least mitigiated enough not to introduce more complication in programming, or hidden cost to copy objects out of the scope, or introducing some complexity with the two types of memory, i'm pretty sure it would give different results with benchmark.

The problem that all those benchmark have is they already have to deal with a language that have rather inefficient use of heap, encourage sloppy code and poor memory management, and they have to deal with it.

In the case that they could be tested on language that have a support for temporary object of some sort, it could be much easier to test various system of temporary object allocation that wouldn't clog the heap as much.

In my case i believe it would not be too hard to come to something close to the 'allocate as dead' system, as each thread already have their own heap allocator, and can't allocate or free from other thread's pool, it would be easy to just grab a free zone without really removing it from the free stack or really allocating it with the RC or anything, and only pulling it out of the free stack and making it 'active' as in used after it's copied into an active reference tree. And if not, then the zone is just still in the free stack and there is nothing special to do to free it. It would give a pointer to free heap zone rapidly without really allocating it fully, without needing to copy the data or do anything special if it's lifetime need to be extended.

I'm sure there are plenty of system that can be though of to have cheap allocation of temporary objects that could give a different set of benchmark to test the efficiency of memory management in the case where it really matter, and getting rid of the no brainer case that are used everywhere with most high level languages without there is really an easy way to sort them out because the language is not though that way.

@NodixBlockchain
Copy link
Author

Ok so i have implemented the tracing GC today, which scan the stack & data sections of modules to mark active zone and then sweep.

It's not optimized yet and there seem to still be few glitches with the multi thread, but it seem to be working fine, i will try to finish the debugging tomorrow and remove the glitches and optimize a bit the algo and i can probably do some testing starting from C with relatively clean heap management.

At first glance it seems somehow faster, even if i don't really get why, as on one side, the only removed code is the fetch_add for ref count, and the other side, there is the stack/data section tracing of live objects, which amount to 1000 - 2000 objects, but i need to do more accurate benchmarking.

Knowing that if the GC is only used for the script, it will probably be faster because the whole tree of reference to trace is already known, so that would skip the whole tracing of stack and data sections, even if that shouldn't make a huge difference as they are not very big anyway.

@NodixBlockchain
Copy link
Author

Examples of deterministic allocation that would overload the automated non-deterministic GC, are >assignment of a new instance either exclusively to a block scoped pointer (aka pointer stored locally on >stack) in a loop, or as in @keean’s image buffering example allocating a new 16MB buffer each time his >code needs one:

p = new Object
So in the first case, the young generation allocator is slammed with an new allocation for each iteration of >the loop, and in the latter case the old generation allocator accumulates 16 MB allocations which have to >be collected (and with a MS collector that is going to be very costly).

the pb with these cases is that most of the time in modern languages, you would use an asynchronous loader to do this who will do the allocation itself.

Maybe a smart management can allow to rotate the two buffers or so, but generally now day you wouldnt do big allocation of objects in a synch loop like that.

@keean
Copy link
Owner

keean commented Nov 25, 2017

Maybe a smart management can allow to rotate the two buffers or so, but generally now day you wouldnt do big allocation of objects in a synch loop like that.

I think you misunderstand, the objects are actually much bigger than 16MB, we are processing them in 16MB chunks to make them smaller. If you try and load the whole object at once it might be 1-2 GB in size. This very quickly results in memory fragmentation which will cause the allocations to fail and the application to crash. Splitting it into 16MB chunks and loading them all at the same time in parallel does not help the situation. The problem is to use as little memory as possible, so the only solution is to load each 16MB chunk sequentially and process it then store it back. Of course you do this in the background, not in the UI thread.

In terms of programming, you want to do a fold over the promises, not use Promise.all

@NodixBlockchain
Copy link
Author

Yeah oki, i guess with a tracing GC it could end up keeping all the chunks in memory for a while if a new buffer is allocated at each iteration and only one is used at once.

I think for most cases in buffered/chunked io, manual memory management is generally quite easy to handle and can hardly have down side vs any kind of automatic management.

Anyway i think the tracing GC start to works well, i will think about several kind of test to see how it perform against RC in various situations.

In any case that will be a nice feature to have :)

@shelby3
Copy link

shelby3 commented Nov 25, 2017

@NodixBlockchain wrote:

It's why object nursery give significant performance increase. In the general case those tracing GC tend to spend a lot of their work on no brainer case that wouldn't be here in stack efficient language like C or C++ with 'simple value on stack'.

Note I wrote:

possibly to gain the efficiency of fast generational collection of the young objects which escape deterministic lifetimes (if any significant quantity still do)

But bear in mind that the 2003 BG-RC collector I cited found it worthwhile to retain mark-sweep for the highly mutated of the younger generation which escape a simplistic block (stack) scope and only use reference counting (RC) for the infrequently mutated older generation. But you might be correct that the generational (“BG”) aspect may become unnecessary with good static analysis of deterministically deallocated young objects.


@NodixBlockchain wrote:

At first glance it seems somehow faster, even if i don't really get why, as on one side, the only removed code is the fetch_add for ref count, and the other side, there is the stack/data section tracing of live objects, which amount to 1000 - 2000 objects, but i need to do more accurate benchmarking.

Mark-sweep is more efficient than reference counting except for those objects whose reference counts are rarely mutated. The downside is that mark-sweep has unbounded (asymptotic) pause times. That 2003 BG-RC research attempted to balance the tradeoffs and hit the sweet spot for the use of RC and MS.


@NodixBlockchain wrote:

The only way i see to make some mark/sweep GC efficient for the general case is if there is a good way to organize the code in sort that it can be easy for the GC to know which subset of memory to scan in relation to what code path has been taken.

The likely problem is designing such a static analyzer that is sufficiently generalized and not create a NP hard problem.

@keean
Copy link
Owner

keean commented Nov 25, 2017

Ada has statically safe pointers (called access types) without all the complex lifetimes of Rust. This is the kind of model I was thinking about.

@NodixBlockchain
Copy link
Author

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Nov 25, 2017

The thing is that I have some requirement for what I want to do that make it doesnt exactly fit with general strategies.

Already it's made to be distributed, so virtually all objects must be convertible to json, and retain keyname of all members. Which is not great for cache and stuff, but data can already be stored in binary packed form in the node's data, and type signature could be stored else where for non distributed objects. It's typically the kind of stuff that should be optimized at compile time to reduce the runtime definition to minimum used.

It need to be multithread, and as lock less as possible. Like dynamic list that can be manipulated by a thread and iterated over by other threads without lock.

For the moment I think it works with per/thread tracing without having to pause all threads, i added a system of shared slot that threads can use to iterate over a list that can be potentially resized/deleted by another thread, in sort that the iterator use a slot in this shared list that is going to be scanned by the tracer in the allocator thread.

I still need to test well the case where several threads can resize/deleted the list while other iterate on it to make sure there is no concurrency bugs.

Atomic rc still make this easier as its easy to "pin" an object in memory by increasing the RC, specially with the system of outer counter, there can be active reference to an object without a pointer or if its not really deferenced.

With the tracing GC need normally to scan all threads stack to find active references , but i want to avoid to pause all threads and having system specific code.

For the moment it seem to works with the shared slot for iterator in sort that list that are being iterated by other threads dont get collected before the threads are done with it. It should pick the new version of the list at each start of iteration, but in case the list it's iterating get freed by another thread while its doing the iteration, it needs to add it in a shared slot in the allocator thread with a cas.

Other than this it seems to work, couldn't find good bench mark for C, but there are many for java, i might end up copying the algorithm to C to make tests.

Binary tree manipulation seem to be good tests.

Normally i think it should be able to avoid to fall into the GC moto of "most object die young", first because I made a system with two heaps, one for regular allocation, i/o string & such for temporary buffers, which is supposed to be manually managed, and because it should be easier to avoid the "heaps objects everywhere" that seem to cripple high level language performance. Generally when a node is allocated in the tree it's supposed to stay at least a while. Generally at least time to be sent over the network in case of messages. Or another system need to be thought to allocate temporary object with fixed short life time like stack to avoid clogging tracing path with no brainer cases.

@NodixBlockchain
Copy link
Author

The likely problem is designing such a static analyzer that is sufficiently generalized and not create a NP >hard problem.

In the simplest form, it could be like Android ContentProvider class which hold collections of objects, it should be easy to skip the tracing of those objects if no mutating method of the class is called.

In the more complex form, it should take advantage of mvc architecture, where a controller is bound statically to certain models, who are seen as objects store, and skipping tracing of objects inside of object stores unless a controller use a mutating method of the model.

The main issue is that its not happening directly at the language level, unless the language enforce a certain semantic or global structure to manipulate object stores.

Or if the structure of the application can make it easy for the compiler to generate local allocators used by different part of the program that only needs to be collected when those part of the program are executed.

@shelby3
Copy link

shelby3 commented Nov 26, 2017

@NodixBlockchain wrote:

It need to be multithread, and as lock less as possible.

I’m going to repeat what I explained you to multiple times in the Concurrency thread. Multithreaded contention is a can-of-worms in numerous facets such as requires a deterministic total ordering on borrowing (because non-deterministic locking will always have corner case live lock bugs which are impossible to statically analyse because you can’t enumerate a priori the unbounded non-determinism in the I/O environment!) and even makes the GC less robust and slower.

Instead I’m aiming for a separate heap for each thread (a la JavaScript) which simplifies matters, and copying objects between threads. Or the alternative would be allowing the reference counted heap section (or a RC/MS hybrid such in the BG-RC design) be muiltithreaded, so objects could be shared by exclusive borrowing of references.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Nov 26, 2017

Lock have many problems. One of the main one for me is specially it can make refactoring much longer, because each time an array or anything has to be added or removed from threading, need to change all code that use it to add/remove locks.

Multithread contention is mostly a problem for non atomic operations, and i already have the solution for shared dynamic lists.

It already has one heap/thread, with lockless allocation. And most of the useful cases can already be shared between threads without problems. For the other case can use some new instance + swap to modify non atomic value. For lists i already have the thread safe solution which can't deadlock, and is lockless for read access, without having to copy it for each thread.

It already has GC where each thread collect it's own heap, and with the shared slot for iterators, it deals with all he cases i use for the moment. As the GC can already trace list, it only need to have explicit shared pointer on the list, and all the elements in it are marked. Which give already lockless (not waitfree) iterators on shared list without a copy / thread. It only needs to delay the collection of the list it uses.

Given that a thread can only use on shared iterator at a particular moment, there is a maximum wait of n threads CAS per shared iterator. Which is already much better than a copy. And even with copy, i would think it still need some kind of locks for copying it in case it's modified by another thread during the copy.

@keean
Copy link
Owner

keean commented Nov 26, 2017

@NodixBlockchain

"All objects die young" directly leads to heap fragmentation especially if you also have "everything is an object".

We can make things better by having value semantics, and storing values directly on the stack (providing they are under a certain size to prevent stack overflows), and by having object pools that re-use objects, and mutating object in-place.

@NodixBlockchain
Copy link
Author

"All objects die young" directly leads to heap fragmentation especially if you also have "everything is an >object".

We can make things better by having value semantics, and storing values directly on the stack >>(providing they are under a certain size to prevent stack overflows), and by having object pools that re->use objects, and mutating object in-place.

Yes something like this :)

The main issue for me in the today state of things, is that there is either low level language who are fast but 'unsafe', or either high level language that are great for small programs, but they can become a trap too because they will scale very poorly, and hit a performance wall at some point. And when this wall is hit, there is not much that can be done about it.

(And paradoxically, i would believe most of the safest program in the world are still made in C or C++, and using android, it's not like crash in java app is that ecxeptional either.)

My general idea is to have something that can be high level for small program, but give opportunity introducing potentially more complexity like different memory (stack vs heap), or other way to get past the performance wall for program who need to scale beyond what high level language currently allow, while still keeping an high level synthax with GC etc for the case where the performance are sufficient.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Nov 27, 2017

Or the alternative would be allowing the reference counted heap section (or a PC/MS hybrid such in the >BG-PC design) be muiltithreaded, so objects could be shared by exclusive borrowing of references.

The problem i see with the borrowing synthax is as far as i know, it's made with the paradigm of manually spawning threads and passing the object to it manually.

That can work for embeded programming because thread creation is cheap, but on modern multi task OS, thread creation and management always introduce some latency,

And spawning more threads than core will never give much scaling benefits anyway.

I prefer much more the idea with thread pools and shared job lib, and when a request is to be executed in a thread, it's insert in the job list which is a lockless shared iterator, which mean any thread can insert a job into it easily, and the thread pool manage state of the object, with a 'done' state set when the job is done, which can be caught by the main thread to collect the job or do some continuation or other.

Like this there is zero overhead to parallelize a job, and the scaling benefits is always maximal, as the number of threads can always match the number of core available, and there is almost no overhead in adding a job in the pool with the shared iterator of job list.

With my system, job already can be made to call a specified method in a module, on any objects due to collection of dynamic type, so it's very easy to a synthaxe like

do_job(module,method,object);

that will be inserted in the job list, and all threads in the pool compete on the job list to acquire a job and execute it, and then set a done state when the method has been executed.

It would be not be too hard to have a 'done job list' queue to deal with continuation, still with zero copy.

And implicitly, the object members should not be mutated while a thread is executing on it, but in most case the object can be created with a reference on the stack that will disapear just after the request is put in the job list, in sort that the calling program don't keep a reference to an object used in the job list.

Like this there is no copy, no spawning/destroying of thread, and zero overhead for executing it in a thread. The threads are created once at the start of the application and will be running for the whole lifetime of the application, in sort that there is one thread/ core, to maximize the potential scaling benefits.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 2, 2017

So now i think the GC is 100% bug free, i started to do test on tracing GC vs RC with binary tree.

The test first fill up the balanced tree, then delete à portion of nodes, then run the mark & sweep and keep deleting node until the tree is empty.

I only tested in debug mode without optim so far.

As i expected, for small tree < 8000 nodes, GC is faster, but after 16000 nodes, rc start to become faster.

At 16000 nodes

RC take 1537 ms to fill the tree, and 8000ms to deleted it.

GC 1400 ms to fill the tree 11000ms to delete it by 128 increment.

I will do more test tomorow using c struct instead of the full DAG and with optim activated. As the DAG manipulation can add some over head and defavorize RC.

But im pretty sure RC will become faster and faster with bigger tree.

@keean
Copy link
Owner

keean commented Dec 2, 2017

@NodixBlockchain how does your RC deal with cycles? Does it leak them, or do you detect them and free them? Mark Sweep deals with cycles automatically, but RC needs to have cycle detection added, and that may affect the speed.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 2, 2017

For the moment no cycle détection with rc.

I have my idea for this which should work with tree with low depth.

If each time a new node is added in the tree, it recurse back to the root node, and if it see a ref to self, it increase a self ref count, and then instead of deleting when ref CNT = 0, it delete when ref = self ref.

I may try to add this, in case the depth of the tree is in the 3/4 level to the root, should not be too expansive. For long linked list it could sucks. But I dont use linked list for the moment.

Did more testing with optim active and 32000 nodes deleted by 128

RC : init the tree ~900 ms, delete the tree 800-900ms
MS: init the tree ~900ms, delete the tree ~1500-1600ms

On 32000 nodes using the DAG the GC is roughtly two time slower. On small tree the GC is much faster thought. On 1024 nodes it 10x faster, 8000 nodes 2 times faster.

I will try to put the code on git tomorrow, to check if there are no bugs.

@keean
Copy link
Owner

keean commented Dec 2, 2017

@NodixBlockchain checkout this quota question www.quora.com/How-do-reference-counting-and-garbage-collection-compare the top answer explains the tradeoffs, and appears to say that optimised MS beats optimised RC. Naive MS has a problem with total heap size, which seems to be the issue you are having with it.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 2, 2017

The total number of active objects is what make mark and sweep slower vs RC in case small number of objects are collected between two collections. Because MS still need to trace all actives objects where RC just free the ones that are deleted.

As my test run the GC every 128 deleted nodes, it simulate this case where lot of objects are in memory and few of them deleted occasionally. With big tree over 16000 nodes, RC become faster, as the GC needs to trace all actives objects every 128 nodes deleted, where the RC just need to free the 128 deleted nodes.

in my case where there is ~3000 active objects max at GC runtime, tracing GC should still be faster.

But in case of database/cache like things where lot of permanent objects needs to be stored in memory, tracing GC might still end up much slower.

Maybe with the strategy to only collect when the memory is full and cant allocate the next object that can offset the problem, but im not sure if like this strategy too much. Because it's probably going to mean higher pause time, and more risk of crash due to running out of memory.

Same with incremental GC it can have same pb if the program allocate more than the GC free per increment.

I will still do test using C struct instead of the automatic DAG to see if it make a difference.

@keean
Copy link
Owner

keean commented Dec 2, 2017

This is why they have generational MS. You have a nursery where objects are created, if they live for more than a certain amount of time, objects get moved to the next pool, keeping the number of objects in the nursery that need to be scanned much less. Often you have 4 pools, the nursery, short lived objects, long lived objects and immortal objects. The longer lived pools are scanned incrementally and less often.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 2, 2017

In fact scratch that, seem there was a bug in the algorithm.

Now with the good algorithm, tracing GC is always much faster even with 32000 nodes tree.

Now with the good algorithm with 32000 nodes

MS init tree 1200ms, delete ~1700ms
RC init tree 1200ms, delete ~9800ms

Lol

So in any case tracing GC = much faster !

Still need to do test with the C struct but apparently in any case counter intuitively it's slower to deal with reference counter & free "on spot" rather than tracing live object & delete per 128.

Using atomic rée count and memory fence have marginal effect ~400ms on the whole thing.

Now i think the algorithm is ok.

Put the source of the test there

https://github.com/NodixBlockchain/memtest/blob/master/memtest/main.c

The gist of tracing code is there

https://github.com/NodixBlockchain/memtest/blob/master/libcon/base/mem_base.c#L1447
+
https://github.com/NodixBlockchain/memtest/blob/master/libcon/tpo_mod.c#L1041
+
https://github.com/NodixBlockchain/memtest/blob/master/libcon/runtime.asm#L406

Will do the test with c struct tomorow.

@NodixBlockchain
Copy link
Author

I have put the test result there

Automatic Dag :

https://github.com/NodixBlockchain/memtest/tree/master/memtest

C structures :

https://github.com/NodixBlockchain/memtest/tree/master/memtest2

Conclusion Tracing GC much faster in all case.

for the tracing GC

C structure ~1200 ms
automatic DAG ~1600ms

@shelby3
Copy link

shelby3 commented Dec 5, 2017

Those recent 3 blogs by Eric S. Raymond indicate that the issue of GC (aka memory allocation mgmt) design choice is critical to the programming language which is going to replace C and C++ for large project scale systems programming. I think I might have found a huge design win, but I am still formulating and writing down my understanding of the issues.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 6, 2017 via email

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 6, 2017

Rust seem to be the kind of language where everything need to be thought of before writing anything. To know the good qualifier that need to be used for every case.

But to make bare metal code, it always needs to manipulate direct pointer and binary data, for memory mapped register or dma buffer to communicate with the hardware, which is different from allocated memory for storing high level objects.

And it's impossible to beat C at performance, especially for things that are computation intensive, where cache management and simd can be more critical, and given the decade of research into optimizing C code, it's impossible to beat it at performance.

Anything added to C will have performance impact.

But people have been doing multi threaded code for decade in C with zero security, resulting in still rather safe code, so I would think rust is a bit too much with declaration of 4/5 qualifier for all variables, and I wonder how good the compiler really is to detect real problem, vs case where there can be potential contention, but prevented by high level logic ( like two threads who will not be run at the same time due to high level logic), and if it's not too hard-core for refactoring. And of course also lack of compatibility with C and introducing lot of new logic and synthax.

All high level language retain a trapdoor to link with C code when they need to do high performance multi media or scientific computation.

For this i think my system is good trade off, as it can keep all the good side of C for high performance and bare metal, but also providing an higher level with memory safety, garbage collection, object convertibles to json, cross compiler binary compatibility, safe io to print variable and serialize/deserialize objects, and some convenience for lockless multithread algorithm for high level programming, but still interoperable with low level C when needed, either for bare metal, or the simple algorithm for io or other where bare C is not too problematic, but keeping the managed heaps separated from the heaps manipulated manually from C.

I would think that rust is trying to make the language having built in high level abtraction, but im not sure it's that much a good idea.

Generally application developers build their code on top of an sdk or framework who deal with the low level part, to specialize the application code for some specific cases, and leave the bare language as fundemental as possible.

It's same with c++, java, php and most languages. Application developers rarely use the bare language.

And it's there is wonder if it's not better to have a VM/compiler with already built in high level construction akin to BASIC where expression can be more semantic than formal, and the VM/compiler can generate the good low level code in the context to execute the expression.

Like for web server, having all the stack from kernel to sdk like codeigniter in low level code, only leaving the high level logic in high level language with the VM/compiler using built in expression to express high level logic, and leaving low level code that is close to the kernel up to high level application code in low level language because it's what low level language are efficient and relatively safe for.

Like this it avoid to have to write high level declaration in low level language and give too much head ache to the compiler to express high level logic with simple expression implying high level logic.

For me rust on this regard is either doing too much or too little. It's too complex to really do quick low level bare metal routine, but too simple to have something like high level sdk for app development.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 7, 2017

Cause it's the thing that always bugs me out when people are saying c/c++ is unsafe, but java/c#/vb/js etc are because all those language run on top of very large code base in c or c++.

All kernel, drivers, and large part of system is made with c/c++ with largely "unsafe" code. So if this code is no safe, nothing on top of it can be.

Rust is probably the only one who can hold the promise to have full stack built from the ground up with safe language.

But to run on most operating system, need to deal with kernel/drivers/sdk made in c/c++. This plus the need for high performance algorithm like video/audio/image decompression, physics etc, cant really avoid c/c++ entierely.

Or need a language that is low level enough to build a kernel with, in order to have the whole stack using safe memory language from the ground up.

Otherwise it will always be sitting between two chairs, and relying on unsafe code to run a safe VM, which is sort of logic contradiction.

Maybe im wrong on this, but i see rust easily following same fate than Delphi, as lack of good interfacing with c/c++, slower developpement of the runtime than the rest of the world, need to wait two year to have latest directX interface or access to latest interface for network, or system stuff, lack of many libs that cant compete with their equivalent in c/c++, and staying marginal, even if the language/IDE itself is not so bad, it's not the main problem.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 8, 2017

And the way i see, C is not even to be seen as a full language, but rather like an abstraction of cpu logic.

At CPU level, there is only pointers and registers. The CPU doesn't know about heap, allocators, object reference or threads.

So in a way it's not the job of C as a language to have memory management, heap and threads, because it doesn't know how to translate this to machine code.

C operate at the abstraction level of the CPU. so it doesn't have good reason to add more restrictions than what is allowed at CPU level.

To translate "pointer to register code" to machine code, C is still the best by far.

C compilers are very advanced, it can abstract floating point, SIMD, word size, do all the job of memory alignment, and there are C compilers for virtually any CPU out there that will have good level of optimization.

And if the goal is to have high level language compiled to assembler, there will always be a step where the high level code need to be translated to "pointer to register level", before to be turned to machine code.

So from my point of view, it's still the best option to have this abstraction layer of memory management, threads, i/o made in C.

It will be easily portable to any CPU architecture, with all the optimizations.

Then the high level language is expressed at the abstraction level that the C runtime implement

Either considering the C code as a runtime, like intrinsec functions, or more involved compilation process to turn high level language into something that run at this abstraction level by turning object access to runtime function call, or inlining the assembler code generated by C compiler directly in a way or another.

The other alternative to directly compile high level language to assembler fully is not to appealing because anyway it will never beat C at what it does, which is turning "pointer to register code" to machine code, it will need lot of work to tune it to all CPU architectures, and will never be as advanced as what C compilers can do well.

The runtime already provide safe abstraction layers for allocation and manipulation of object references, as well as text formatting, high precision timers, date, and all the kernel level things.

I'm probably part of those people who consider C more as 'portable assembler' rather than as high level language, and i don't consider the stdlib is part of the language.

Any abstraction level that need to be created on top of CPU will need to be translated to CPU machine code anyway, and this include memory management and threads (as well as files, timers etc). Having this abstraction layer made in C is not the worst option for all reason mentioned above.

I guess most high level language prefer the solution to compile to byte code, and having a byte code interpreter who translate this byte code in CPU machine code.

But in the principle it's more or less equivalent to have the language expressed in an abstraction layer provided by a runtime in C.

The runtime in C is the equivalent of byte code interpreter, except the code is easier to compile to executable or bare metal, and it make it easier to interface with the kernel, drivers, system SDK, and all high performance libraries in C.

Even transpiling it to languages like JS or so should not be too hard, as the C code already operate at this level of abstraction.

tree_set_value(var,value) => var = value.
tree_set_key_value(object,"key",value) => object.key= value.

The whole allocation and object manipulation is already abstracted by the runtime, and the code above this layer of abstraction is already close to JS level of abstraction.

@NodixBlockchain
Copy link
Author

Virtual memory is practically cost-free on 64-bit address space (but intra-page fragmentation is not). I >think I mentioned upthread I’m not prioritizing design for 32-bit address space. Some trade-offs have >to be made in design and I think that’s an acceptable one since there will still be Rust for specialty and >archaic use cases.

This can be a tempting approach, i believe bitmonero follow this philosophy, as when it's run for a few days on the server, it ends up using 18gb+ of swap memory, which doesn't seem to disturb it too much, as the unused pages just ends up sleeping on the swap.

But for running it as a service, need to take in account that there will be potentially other services running on the server, and if all services follow this policy, it quickly become unmangeable. If all services can have memory spike at the same, it will end up exhausting all physical memory, and make all the server slow, which is generally to be avoided.

Both database services and php have memory limit to avoid this sort of situation. I would think most admins will prefer to adjust the max memory used by each services manually for it to always fit to available physical memory, rather than letting all process allocate as much as they want.

Knowing the kernel will always make in sort to have a new page for the process, and will never run out of virtual memory, it will not really lead to a crash, so it can always be an option, but if the process has to live with other process who needs to share the available physical memory for their operation, that can still ends up killing server performance.

@shelby3
Copy link

shelby3 commented Dec 17, 2017

Indeed Rust's static safety analysis (aka lifetimes + exclusive mutability borrowing of pointers) can't model all of the cases of optimizations the programmer can reason about in his head. That's why I have proposed combining the generational portion of a runtime GC with statically typed reference counting for the long-lived objects. So that we can get the best low-level optimizations of C with the benefits of automatic, highly optimized allocation/deallocation.

Next I plan to marry a C-like syntax with a HLL. So we have the best of all worlds wrapped in one mainstream programming language.

@NodixBlockchain you're correct that Rust's static safety checking limits the programmer and is a lot of tsuris. I have also proposed a single-threaded with thread pool model for achieving mutability safety without needing Rust's inflexible, highly annoying exclusive mutable borrowing paradigm.

So what I have proposed is to paradigm-shift the issues so we get all the safety we want without the tsuris and impacts on flexibility for maximizing performance and coherently expressing algorithms in code without a lot of superfluous noisy annotations.

I think you're incorrect to think that you can just build a framework on top of C as the optimum solution. For example, afaics the GC scheme I proposed can't work as a framework on top of C. Also a compiler can enable more elegant syntax sugar.


This can be a tempting approach, i believe bitmonero follow this philosophy, as when it's run for a few days on the server, it ends up using 18gb+ of swap memory, which doesn't seem to disturb it too much, as the unused pages just ends up sleeping on the swap.

It wasn't my intention to say that we should have memory leaks, because of course eventually with memory leaks even 64-bit address space (or more likely the harddisk space) could potentially be exhausted.

I was just pointing out that @keean's pathological use case was likely much more constrained on 32-bit virtual address space. Meaning that I don't want to strive for absolute timeliness of the deallocation of virtual address space, but I'm not advocating memory leaks.

@shelby3 wrote in private to @keean:

I think this pathological example is because you're allocating the same memory over and over instead of reusing buffers?

I guess the large array buffers are entirely skipping the nursery and going directly to concurrent mark-sweep? The concurrent mark-sweep probably can't keep pace with the rate of allocation and deallocation that is bypassing the generational GC. Mark-sweep has very bad asymptotic performance both on total heap size and as allocated size approached even 50% of the physical memory size.

Sounds to me like your use case would work very well with RC? So my proposal should be fine.

@NodixBlockchain
Copy link
Author

NodixBlockchain commented Dec 17, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
@shelby3 @keean @NodixBlockchain and others