request: a server for Gzemnid #1540
LGTM. Someone from @nodejs/build-infra needs to allocate the resources, but then we could add it to the test CI cluster and configure a job for it.
We will need to assess which hosters have additional capacity that we can use. @rvagg @jbergstroem
Aside: @rvagg is around a bit and @jbergstroem is not.
@Trott it's back to the same discussion we've had about expanding overall access, although along a slightly different dimension, as we could break up access based on hoster versus some other criteria. @rvagg had the action to try to suggest a breakdown where we could give out privileges in a more granular way, but I doubt a breakdown based on hoster would have been the first suggestion. On the Softlayer side I can chime in that I think we are relatively "full". For the others I could probably figure it out, but it's a matter of time, so it's easier to ask @rvagg. Aside to the aside: I'm not sure an aside in an issue on a different topic is the best way to push for change on this front. A dedicated issue somewhere (build or otherwise) makes more sense to me.
#1337 (just dropping it here to keep someone else from opening a new issue unless they feel that one doesn't quite cover the aside issue--I'd say it does in a broad sense, but won't argue if someone thinks another issue is the way to go) Now back to Gzemnid on its own server...
@ChALkeR in terms of a mirror, what do you need? The package files on disk, or something that serves up packages like npm does?
Is the idea here to run both Gzemnid and an npm registry mirror on the same machine? If so, we need to re-dimension the disk a bit more.
@jbergstroem I think that would be the case. What size of disk would we need?
@bengl: do you have any recent numbers on npm repo mirror size? Any growth predictions?
@mhdawson I meant a mirror that includes (at least) the latest versions of packages.

@jbergstroem It does not have to be physically on the same machine; a network disk would be fine. If that's not possible, even HTTP access to a fast mirror could speed things up, but that is less optimal. I do not need a full documents mirror; I plan to use replicate.npmjs.com directly. There is no reason to set up a mirror (on the same machine or not) just for Gzemnid. I do not need a mirror; the only reason I asked whether we already have one is to save disk storage space. If we don't have such a mirror that stores packages, I can store them locally (see the «Slow storage» space requirements).
Should I provide some additional information? I can answer more questions =). |
@ChALkeR IIUC we just need to find where we can allocate this (it seems like we are close to the limits of the donated quotas).
OK, so I'm getting back into sync and tuning in to this now. This is a pretty hefty resource request, which makes it tricky to find an allocation. Joyent and DigitalOcean are probably the two obvious options for now. Since we have a pretty good relationship with DO and have freed up some resources there recently, I've gone ahead and put a server there. I've allocated a big disk so we can put a mirror of the npm data on it too; that'll be handy for other purposes, I suspect. It's in Europe, so connectivity should be fast for you @ChALkeR, and your SSH key is in there for root.
The primary disk is 160 GB and the extra attached 1 TB is at /mnt/volume_fra1_01/. Ping me if you need any help with any of this, and keep us updated with what's going on on this machine so we don't forget about it.
@rvagg Thanks! I'm installing things now, and moving the existing datasets there.
I moved the existing datasets to the new server, pointed gzemnid.oserv.org (my domain, as I don't have a better one) to it, and turned off the old server. The existing datasets are now available at https://gzemnid.oserv.org/datasets/ and are served by the new server. Packages and other things needed to rebuild the dataset are still being downloaded; that'll finish in a day or so.
I've added a DNS record for nodejs.org and it's funneling through Cloudflare to get our HTTPS cert: https://gzemnid.nodejs.org/datasets/ @ChALkeR perhaps at some point you could add something to the root to describe how to use this? Also, an http -> https redirect would be good too.
@rvagg Done and done. The packages for rebuilding are still being downloaded.
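For anyone landing on the datasets URL above, a hypothetical usage sketch (not official documentation): it assumes the dataset files are line-oriented text compressed with lz4, that the npm `lz4` package's streaming decoder can read them, and it uses a placeholder file name.

```js
'use strict';
const https = require('https');
const readline = require('readline');
const lz4 = require('lz4'); // assumption: the npm "lz4" package provides createDecoderStream()

// Placeholder file name; see https://gzemnid.nodejs.org/datasets/ for the real ones.
const url = 'https://gzemnid.nodejs.org/datasets/example.txt.lz4';

https.get(url, (res) => {
  // Decompress the lz4 stream on the fly and scan it line by line.
  const lines = readline.createInterface({ input: res.pipe(lz4.createDecoderStream()) });
  lines.on('line', (line) => {
    // Example filter: look for deprecated Buffer constructor calls.
    if (line.includes('new Buffer(')) console.log(line);
  });
});
```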
Status update: I fetched the required packages and moved the historic datasets over to the volume. Other things will be stored directly on the main partition; they are not ready yet. I will keep everyone updated. In the meantime, I am working on the registry follower that will (later) keep the dataset updated automatically.
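As a rough illustration of what such a follower involves, a minimal sketch (not Gzemnid's actual implementation) that polls the CouchDB `_changes` feed on replicate.npmjs.com, the data source mentioned earlier in this thread:

```js
'use strict';
const https = require('https');

// Fetch a URL and parse the response body as JSON.
function getJSON(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => {
        try { resolve(JSON.parse(body)); } catch (err) { reject(err); }
      });
    }).on('error', reject);
  });
}

// Poll the _changes feed, starting from `since`, and log changed package names.
async function follow(since = 'now') {
  let seq = since;
  for (;;) {
    const batch = await getJSON(
      `https://replicate.npmjs.com/_changes?limit=100&since=${encodeURIComponent(seq)}`
    );
    for (const change of batch.results) {
      console.log('package changed:', change.id); // a real follower would queue a re-fetch here
    }
    seq = batch.last_seq;
    if (batch.results.length === 0) {
      await new Promise((resolve) => setTimeout(resolve, 30 * 1000)); // idle before polling again
    }
  }
}

follow().catch((err) => { console.error(err); process.exit(1); });
```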
@ChALkeR how hard would it be to document the setup in an Ansible script (example)? It might also be good to consider having the "pull new dataset" script as Ansible, but since it's a singleton machine, it's not high priority. Could also be just-a-bunch-of-shell-scripts. P.S. Taking this off the agenda since it's unstalled.
Status update:
@refack I think it's too early for that; I expect significant changes there once the Web API/UI is ready, and I don't want to redo the work =). The setup is fairly simple and shouldn't be needed again in the short term, so I am going to postpone that.
I just had a thought: recognize an opt-out option.
@ChALkeR I know this is about the server, but is there any chance that we'll soon get server-side searches? Or is there any chance to get ssh access?
@BridgeAR I added your ssh key to the server.
Ask me if you have any questions =).
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
@ChALkeR is this still active?
We eventually did get a server into our infra for this, so this issue could be closed. I am wondering, though, whether it's even used and whether we shouldn't be decommissioning it.
Gzemnid is currently hosted on my personal VPS, which is limited in resources and suboptimal, and that is what currently limits improving it further. Also, virtually no one uses it from there (even though I tried setting up access once).
Now, with lz4 compression fully supported (since recently), I can outline the required specs for moving it to a foundation-owned server.
Performing the move would both remove the need for me to host it and provide room for future improvements (e.g. turning it into a close-to-real-time follower with rebuilds and a proper web interface for contributors, once I or someone else has time to implement that).
CPU
A 4-core CPU is preferred if we do both the UI and the scheduled rebuilds on the server.
No hard speed requirements; a faster CPU would mean faster rebuilds and shorter queues.
Storage uses compression, and a significant part of the operations is CPU-bound.
Memory
Something around 4 GiB should be fine; more could be used for caches.
Fast storage (SSD)
Around 120 GiB should be fine here, but that could grow.
Faster is better here.
Total: ~80 GiB.
As the ecosystem grows exponentially, I expect growth of under ~1.5× in a year or so, so we should probably plan storage resources accordingly.
Slow storage (possibly network)
Could be merged with «fast storage», but could also be stored separately, even on a remote network drive.
50-100 Mbps would be enough.
Total: ~330 GiB.
This could be significantly relaxed by dropping «current» in case we can use some sort of pre-existing npm mirror (do we have one, or access to one?); then we won't need to store those 285 GiB.
The most prominent use case for the npm package cache is to allow full rebuilds when/if the extraction logic changes and the whole dataset needs to be regenerated, or when we want to extract some new filetypes, etc.
If we have fast (i.e. ~50-100 Mbps) access to an npm package cache that holds most of the current versions of all npm packages, and that we could use to re-pull the packages on the fly in 5-10 minutes, then we won't need a local cache.
«Historic» could be thrown out, but I currently store it on my VPS and find that data useful in some circumstances.
As the ecosystem grows exponentially, I expect under ~1.5× growth for «current» in a year or so, so we should probably plan storage resources accordingly.
«Historic» would grow by about 12-15 GiB per snapshot, which I suppose would make sense to take about once every 3 months.
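Putting the numbers above together, a back-of-the-envelope projection (purely illustrative, using the upper estimates stated in this issue):

```js
'use strict';
// Numbers taken from this issue; growth factor and snapshot cadence are the
// upper estimates stated above.
const fastNow = 120;             // GiB of fast (SSD) storage requested
const currentNow = 285;          // GiB of the «current» npm package cache
const slowTotalNow = 330;        // GiB of slow storage in total
const growthPerYear = 1.5;       // "under ~1.5× growth in a year or so"
const historicPerSnapshot = 15;  // GiB added per «historic» snapshot (upper estimate)
const snapshotsPerYear = 4;      // roughly one snapshot every 3 months

const fastNextYear = fastNow * growthPerYear;      // ~180 GiB
const slowNextYear =
  (slowTotalNow - currentNow) +                    // existing «historic» and other data
  currentNow * growthPerYear +                     // grown «current» package cache
  historicPerSnapshot * snapshotsPerYear;          // new snapshots over the year
console.log({ fastNextYear, slowNextYear });       // { fastNextYear: 180, slowNextYear: 532.5 }
```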
OS
Any recent Linux distro. I personally am more used to Debian-based distros, but that does not matter much.
/cc @nodejs/build @Trott @mhdawson