
request: a server for Gzemnid #1540

Closed
ChALkeR opened this issue Oct 19, 2018 · 26 comments
Comments

@ChALkeR
Member

ChALkeR commented Oct 19, 2018

Gzemnid is currently hosted on my personal VPS, which is limited in resources and suboptimal — that is what currently limits improving it further. Also, virtually no one uses it from there (even though I tried setting up access once).

Now that lz4 compression is fully supported (as of recently), I can outline the specs required to move this to a foundation-owned server.

Performing the move would both remove the need for me to host it and provide room for future improvements (e.g. turning it into a close-to-real-time follower with rebuilds and a proper web interface for contributors, once I or someone else has time to implement that).

CPU

A 4-core CPU is preferred if we run both the UI and the scheduled rebuilds on the server.

There are no hard speed requirements; a faster CPU would mean faster rebuilds and queues.
Storage uses compression, and a significant part of the operations are CPU-bound.

Memory

Something around 4 GiB should be fine; more could be used for caches.

A rebuild consumes around 2 GiB for a single task, and we still want to run the UI alongside it. I am not sure whether more would be needed for complex tasks, but I assume that could be solved later if needed.

Fast storage (SSD)

Around 120 GiB should be fine here, but that could grow.
Faster is better here.

Fast storage holds the data that is constantly needed for queues and rebuilds.
| Title | Purpose | Size |
| --- | --- | --- |
| slim.code | code search | 12 GiB |
| slim.topcode | partial code search | 1.2 GiB |
| deps | dependency stuff | 0.3 GiB |
| files | file listings | 0.2 GiB |
| byField | [system] packages listing from npm | 0.2 GiB |
| meta | [system] meta info about packages | 4 GiB |
| partials | [system] slim.code partials | 60 GiB |
| OS | operating system | 2 GiB |

Total: ~80 GiB.

As the ecosystem grows exponentially, I expect growth of up to ~1.5× within a year or so, so we should probably plan storage resources accordingly.

Slow storage (possibly network)

  • ~530 GiB — if we don't have fast access to an npm mirror, or
  • ~100 GiB — if we have fast access to an npm mirror.

Could be merged with «fast storage», but also could be stored separately, even on a remote network drive.
50-100 Mbps would be enough.

| Title | Purpose | Size |
| --- | --- | --- |
| current | current versions of npm packages | 285 GiB |
| historic | historic versions of slim.code/deps — to check how the ecosystem evolves | 45 GiB |

Total: ~330 GiB.

This could be significantly relaxed by removing «current» if we can utilize some pre-existing npm mirror (do we have one, or access to one?) — then we won't need to store those 285 GiB.
The most prominent use case for the npm package cache is the ability to perform full rebuilds if the extraction logic changes and we need to regenerate the whole dataset, or when we want to extract some new file types / whatever.

If we have fast (i.e. ~50-100 Mbps) access to an npm package cache that holds most of the current versions of all npm packages, and that we could use to just re-pull the packages on the fly in 5-10 minutes, then we won't need a local cache.
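As a rough illustration of what re-pulling on the fly looks like (a sketch only; the destination directory is hypothetical, and `npm view` is just one way to resolve a tarball URL):

```sh
# Resolve the latest tarball URL for a package and fetch it.
# ~/pool/current/ is a hypothetical destination directory.
pkg=express
url=$(npm view "$pkg" dist.tarball)
mkdir -p ~/pool/current
curl -sL "$url" -o ~/pool/current/"$(basename "$url")"
```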

«Historic» could be thrown out, but I currently store that on my VPS and I find that data useful in some circumstances.

As the ecosystem grows exponentially, I expect growth of up to ~1.5× for «current» within a year or so, so we should probably plan storage resources accordingly.
«Historic» would increase in size by about 12-15 GiB for each snapshot, which I suppose would make sense to take once every 3 months or so.

OS

Any recent Linux distro. I personally am more used to Debian-based, but that does not matter much.


/cc @nodejs/build @Trott @mhdawson

@ChALkeR ChALkeR changed the title from "A server for Gzemnid" to "request: a server for Gzemnid" Oct 19, 2018
@refack
Contributor

refack commented Oct 19, 2018

LGTM. Someone from @nodejs/build-infra needs to allocate the resources, but then we could add it to the test CI cluster and configure a job for it.

@mhdawson
Member

We will need to assess where/which hosters we have additional capacity that we can use. @rvagg @jbergstroem

@Trott
Member

Trott commented Oct 23, 2018

> We will need to assess where/which hosters we have additional capacity that we can use. @rvagg @jbergstroem

Aside: @rvagg is around a bit and @jbergstroem hardly at all. How do we get from where we are to where it's possible for new people to move into that area? Right now, it seems like there are a few old-timers who have the maximum privileges and there's no defined path for anyone else to be able to take that stuff on. It seems like sort of a modified BDFL setup, and I don't think that's what we want.

@mhdawson
Member

@Trott it's back to the same discussion we've had about expanding overall access, although along a slightly different dimension, as we could break up access based on hoster versus some other criteria. @rvagg had the action to suggest a breakdown where we could give out privileges in a more granular way, but I doubt a breakdown based on hoster would have been the first suggestion.

On the SoftLayer side I can chime in that I think we are relatively "full". For the others I could probably figure it out, but it's a matter of time, so it's easier to ask @rvagg.

Aside: Aside, I'm not sure an Aside in an issue on a different topic is the best way to push for change on this front. A dedicated issue somewhere (build or otherwise) makes more sense to me.

@Trott
Member

Trott commented Oct 23, 2018

> Aside: Aside, I'm not sure an Aside in an issue on a different topic is the best way to push for change on this front. A dedicated issue somewhere (build or otherwise) makes more sense to me.

#1337 (just dropping it here to keep someone else from opening a new issue unless they feel that one doesn't quite cover the aside issue--I'd say it does in a broad sense, but won't argue if someone thinks another issue is the way to go)

Now back to gzemnid on its own server....

@mhdawson
Member

@ChALkeR in terms of a mirror, what do you need? The package files on disk, or something that serves up packages like npm does?

@jbergstroem
Member

Is the idea here to run both Gzemnid and an npm registry mirror on the same machine? If so, we need to re-dimension the disk a bit.

@mhdawson
Member

@jbergstroem I think that would be the case. What size of disk would we need?

@nodejs nodejs deleted a comment from jbergstroem Oct 23, 2018
@jbergstroem
Member

@bengl: do you have any recent numbers on npm repo mirror size? Any growth predictions?

@ChALkeR
Member Author

ChALkeR commented Oct 25, 2018

@mhdawson I meant a mirror that includes (at least) the latest versions of packages, in tgz form, i.e. files like https://registry.npmjs.org/express/-/express-4.16.4.tgz, preferably on a mountable network share. That is not required, though (see below).

@jbergstroem It does not have to be physically on the same machine. A network disk would be fine. If that's not possible, even HTTP access to a fast mirror could speed things up, but that is less optimal.

I do not need a full documents mirror, I plan to use replicate.npmjs.com directly.

There is no reason to set up a mirror (on the same machine or not) just for Gzemnid — I do not need a mirror; the only reason I asked whether we already have one is to save disk space. If we don't have such a mirror that stores packages, I can store them locally (see the «Slow storage» space requirements).
Perhaps that would be easier?

@ChALkeR
Member Author

ChALkeR commented Nov 5, 2018

Should I provide some additional information? I can answer more questions =).

@refack
Contributor

refack commented Nov 5, 2018

@ChALkeR IIUC we just need to find where we can allocate this (it seems like we are close to the limits of the donated quotas).
I'm marking this as wg-agenda.

@rvagg
Member

rvagg commented Nov 6, 2018

OK, so I'm getting back in sync and tuning in to this now. This is a pretty hefty resource, which makes it tricky to find an allocation. Joyent and DigitalOcean are probably the two obvious options for now. Since we have a pretty good relationship with DO and have recently freed up some resources there, I've gone ahead and put a server there. I've also allocated a big disk so we can put a mirror of the npm data on it; that'll be handy for other purposes, I suspect.

It's in Europe, so connectivity should be fast for you, @ChALkeR, and your SSH key is in there for root.

```
Host infra-digitalocean-ubuntu1804-x64-1 gzemnid
  HostName 178.128.202.158
  User root
```
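
With that entry in ~/.ssh/config, connecting is simply:

```sh
ssh gzemnid
```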

The primary disk is 160G and the extra attached 1 TB volume is at /mnt/volume_fra1_01/. Ping me if you need any help with any of this, and keep us updated on what's going on on this machine so we don't forget about it.

@ChALkeR
Member Author

ChALkeR commented Nov 7, 2018

@rvagg Thanks! I'm installing things now, and moving the existing datasets there.

@ChALkeR
Member Author

ChALkeR commented Nov 7, 2018

I moved the existing datasets to the new server, pointed gzemnid.oserv.org (my domain, as I don't have a better one) to it, and turned off the old server.

The existing datasets are now available at https://gzemnid.oserv.org/datasets/ and are served by the new server.

Packages and other things needed to rebuild the dataset are still being downloaded; that'll finish in a day or so.

@rvagg
Member

rvagg commented Nov 7, 2018

I've added a DNS record for nodejs.org and it's funneling through Cloudflare to get our HTTPS cert. https://gzemnid.nodejs.org/datasets/

@ChALkeR perhaps at some point you could add something to the root to describe how to use this? Also, an http -> https redirect would be good.

@ChALkeR
Member Author

ChALkeR commented Nov 8, 2018

@rvagg Done and done.
I added the redirection and some minimal info for now; the next step would be to make a web UI for server-side searches.

The packages for rebuilding are still being downloaded.
I placed them in the /mnt/volume_fra1_01/gzemnid/pool/current/ dir.
I will also move the historic datasets to the volume, and I plan to place everything else on the main disk.

@ChALkeR
Member Author

ChALkeR commented Nov 9, 2018

Status update: I fetched the required packages and moved the historic datasets over to the volume.
Current volume usage: 289G (packages) + 58G (datasets) = 347G (35%).
That might increase once I add support for scoped packages (hopefully soon), but not drastically.

Other things will be stored directly on the main partition; they are not ready yet. I will keep everyone updated.

In the meantime, I am working on the registry follower that would (later) keep the dataset updated automatically.
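For context, such a follower consumes the registry changes feed. A minimal sketch of inspecting that raw stream, assuming replicate.npmjs.com exposes the standard CouchDB `_changes` endpoint:

```sh
# Stream registry change events (one JSON line per changed package) and show a few.
curl -sN 'https://replicate.npmjs.com/_changes?feed=continuous&since=now' | head -n 5
```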

@refack
Contributor

refack commented Nov 11, 2018

@ChALkeR how hard would it be to document the setup in an Ansible script (example)? It might also be good to consider having the "pull new dataset" script as Ansible.

But since it's a singleton machine, it's not high priority. Could also be just-a-bunch-of-shell-scripts.

P.S. Taking this off the agenda since it's unstalled.

@ChALkeR
Member Author

ChALkeR commented Nov 13, 2018

Status update:

  1. Rebuilt the dataset for 2018-11-12 — the first one on the new server. Available at https://gzemnid.nodejs.org/datasets/
  2. Stumbled upon «Rare decompression issue in 1.8.2» (lz4/lz4#560) / «Data corruption with highCompression encoding (testcase attached)» (pierrec/node-lz4#69), but for now I just manually excluded/removed the package that gets broken. That's react-cryptocharts, with only 345 downloads/month.
  3. Sync using changes-stream (the follower needed for real-time rebuilds and scoped packages) is partially done — it now fetches and stores the required data, but that data is not used yet.

@refack I think it's too early for that; I expect significant changes there once the web API/UI is ready, and I don't want to redo the work =). The setup is fairly simple and shouldn't be needed again in the short term, so I am going to postpone that.

@refack
Contributor

refack commented Nov 16, 2018

I just had a thought: recognize an opt-out option { gzemnid: false } or { code_index: false } in package.json.
Everything on npm is OSS, but IMO it's a nice courtesy (similar to robots.txt). A rough sketch of how an indexer could honor it is below.
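
A hedged sketch of honoring such an opt-out (the gzemnid field name is the hypothetical one suggested above; requires jq; package/package.json is the standard path inside npm tarballs):

```sh
# Skip indexing if the (hypothetical) opt-out flag is set in the package's package.json.
tarball=express-4.16.4.tgz
if tar -xzOf "$tarball" package/package.json | jq -e '.gzemnid == false' > /dev/null; then
  echo "opted out, skipping"
fi
```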

@BridgeAR
Member

BridgeAR commented Apr 4, 2019

@ChALkeR I know this is about the server, but is there any chance that we get server-side searches soon? Or is there any chance to get SSH access?

@ChALkeR
Member Author

ChALkeR commented Apr 4, 2019

@BridgeAR I added your ssh key to the server.
ssh [email protected].

```sh
cd ~/public/datasets/out.2019-01-18
./search.topcode.sh "regex" > ~/BridgeAR/whatever1
./search.code.sh "regex" > ~/BridgeAR/whatever2
```

Ask me if you have any questions =).
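
For reference, the published datasets can also be searched without SSH access. A rough local equivalent, assuming a hypothetical lz4-compressed file name (check https://gzemnid.nodejs.org/datasets/ for the actual listing):

```sh
# Hypothetical filename; see https://gzemnid.nodejs.org/datasets/ for the real one.
curl -sLO https://gzemnid.nodejs.org/datasets/out.2019-01-18/slim.code.txt.lz4
lz4 -d slim.code.txt.lz4 slim.code.txt   # decompress with the lz4 CLI
grep -E "regex" slim.code.txt > whatever1
```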

@github-actions

github-actions bot commented Mar 7, 2020

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Mar 7, 2020
@mhdawson
Member

mhdawson commented Mar 9, 2020

@ChALkeR is this still active?

@rvagg
Member

rvagg commented Mar 9, 2020

ubuntu1804-x64-1: {ip: 178.128.202.158, alias: gzemnid}

We eventually did get a server into our infra for this, so this issue could be closed. I am wondering, though, whether it's even used and whether we shouldn't be decommissioning it.

@github-actions github-actions bot closed this as completed Apr 9, 2020