
request: a server for Gzemnid #1540

Closed
ChALkeR opened this issue Oct 19, 2018 · 26 comments
Comments

@ChALkeR
Member

ChALkeR commented Oct 19, 2018

Gzemnid is currently hosted on my personal VPS, which is limited in resources and suboptimal — that is what currently limits improving it further. Also, virtually no one uses it from there (even though I tried setting up access once).

Now that lz4 compression is fully supported (as of recently), I can outline the specs required to move this to a foundation-owned server.

Performing the move would both remove the need for me to host it and provide room for future improvements (e.g. turning it into a close-to-real-time follower with rebuilds and a proper web interface for contributors, once I or someone else has time to implement that).

CPU

A 4-core CPU is preferred if we run both the UI and the scheduled rebuilds on the server.

There are no hard speed requirements; a faster CPU would mean faster rebuilds and queues.
Storage uses compression, and a significant part of the operations are CPU-bound.

Memory

Something around 4 GiB should be fine; more could be used for caches.

A rebuild consumes around 2 GiB for a single task, and we still want to run the UI alongside it. I am not sure whether more would be needed for complex tasks, but I assume that could be solved later if needed.

Fast storage (SSD)

Around 120 GiB should be fine here, but that could grow.
Faster is better here.

Fast storage holds the data that is constantly needed for queues and rebuilds.
| Title | Purpose | Size |
| --- | --- | --- |
| slim.code | code search | 12 GiB |
| slim.topcode | partial code search | 1.2 GiB |
| deps | dependency stuff | 0.3 GiB |
| files | file listings | 0.2 GiB |
| byField | [system] packages listing from npm | 0.2 GiB |
| meta | [system] meta info about packages | 4 GiB |
| partials | [system] slim.code partials | 60 GiB |
| OS | operating system | 2 GiB |

Total: ~80 GiB.

As the ecosystem grows exponentially, I expect growth of up to ~1.5× within a year or so, so we should probably plan storage resources accordingly.

Slow storage (possibly network)

  • ~530 GiB — if we don't have fast access to an npm mirror, or
  • ~100 GiB — if we have fast access to an npm mirror.

Could be merged with «fast storage», but also could be stored separately, even on a remote network drive.
50-100 Mbps would be enough.

| Title | Purpose | Size |
| --- | --- | --- |
| current | current versions of npm packages | 285 GiB |
| historic | historic versions of slim.code/deps — to check how the ecosystem evolves | 45 GiB |

Total: ~330 GiB.

This could be significantly relaxed by removing «current» if we can utilize some pre-existing npm mirror (do we have one, or access to one?) — then we won't need to store those 285 GiB.
The most prominent use case for the npm package cache is the ability to perform full rebuilds if the extraction logic changes and we need to regenerate the whole dataset, or when we want to extract some new file types / whatever.

If we have fast (i.e. ~50-100 Mbps) access to an npm package cache that holds most of the current versions of all npm packages, and that we could use to just re-pull the packages on the fly in 5-10 minutes, then we won't need a local cache.
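As a rough illustration of what re-pulling on the fly looks like (a sketch only; the destination directory is hypothetical, and `npm view` is just one way to resolve a tarball URL):

```sh
# Resolve the latest tarball URL for a package and fetch it.
# ~/pool/current/ is a hypothetical destination directory.
pkg=express
url=$(npm view "$pkg" dist.tarball)
mkdir -p ~/pool/current
curl -sL "$url" -o ~/pool/current/"$(basename "$url")"
```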

«Historic» could be thrown out, but I currently store that on my VPS and I find that data useful in some circumstances.

As the ecosystem grows exponentially, I expect growth of up to ~1.5× for «current» within a year or so, so we should probably plan storage resources accordingly.
«Historic» would increase in size by about 12-15 GiB for each snapshot, which I suppose would make sense to take once every 3 months or so.

OS

Any recent Linux distro. I personally am more used to Debian-based, but that does not matter much.


/cc @nodejs/build @Trott @mhdawson

@ChALkeR ChALkeR changed the title from "A server for Gzemnid" to "request: a server for Gzemnid" Oct 19, 2018
@refack
Contributor

refack commented Oct 19, 2018

LGTM. Someone from @nodejs/build-infra needs to allocate the resources, but then we could add it to the test CI cluster and configure a job for it.

@mhdawson
Member

We will need to assess where/which hosters we have additional capacity that we can use. @rvagg @jbergstroem

@Trott
Member

Trott commented Oct 23, 2018

> We will need to assess where/which hosters we have additional capacity that we can use. @rvagg @jbergstroem

Aside: @rvagg is around a bit and @jbergstroem hardly at all. How do we get from where we are to where it's possible for new people to move into that area? Right now, it seems like there are a few old-timers who have the maximum privileges and there's no defined path for anyone else to be able to take that stuff on. It seems like sort of a modified BDFL setup, and I don't think that's what we want.

@mhdawson
Member

@Trott it's back to the same discussion we've had about expanding overall access, although along a slightly different dimension, as we could break up access based on hoster versus some other criteria. @rvagg had the action to suggest a breakdown where we could give out privileges in a more granular way, but I doubt a breakdown based on hoster would have been the first suggestion.

On the SoftLayer side I can chime in that I think we are relatively "full". For the others I could probably figure it out, but it's a matter of time, so it's easier to ask @rvagg.

Aside: Aside, I'm not sure an Aside in an issue on a different topic is the best way to push for change on this front. A dedicated issue somewhere (build or otherwise) makes more sense to me.

@Trott
Member

Trott commented Oct 23, 2018

> Aside: Aside, I'm not sure an Aside in an issue on a different topic is the best way to push for change on this front. A dedicated issue somewhere (build or otherwise) makes more sense to me.

#1337 (just dropping it here to keep someone else from opening a new issue unless they feel that one doesn't quite cover the aside issue--I'd say it does in a broad sense, but won't argue if someone thinks another issue is the way to go)

Now back to gzemnid on its own server....

@mhdawson
Member

@ChALkeR in terms of a mirror, what do you need? The package files on disk, or something that serves up packages like npm does?

@jbergstroem
Member

Is the idea here to run both Gzemnid and an npm registry mirror on the same machine? If so, we need to re-dimension the disk a bit.

@mhdawson
Member

@jbergstroem I think that would be the case. What size of disk would we need?

@nodejs nodejs deleted a comment from jbergstroem Oct 23, 2018
@jbergstroem
Member

@bengl: do you have any recent numbers on npm repo mirror size? Any growth predictions?

@ChALkeR
Member Author

ChALkeR commented Oct 25, 2018

@mhdawson I meant a mirror that includes (at least) the latest versions of packages, in tgz form, i.e. files like https://registry.npmjs.org/express/-/express-4.16.4.tgz, preferably on a mountable network share. That is not required, though (see below).

@jbergstroem It does not have to be physically on the same machine. A network disk would be fine. If that's not possible, even HTTP access to a fast mirror could speed things up, but that is less optimal.

I do not need a full documents mirror, I plan to use replicate.npmjs.com directly.

There is no reason to set up a mirror (on the same machine or not) just for Gzemnid — I do not need a mirror; the only reason I asked whether we already have one is to save disk space. If we don't have such a mirror that stores packages, I can store them locally (see the «Slow storage» space requirements).
Perhaps that would be easier?

@ChALkeR
Member Author

ChALkeR commented Nov 5, 2018

Should I provide some additional information? I can answer more questions =).

@refack
Contributor

refack commented Nov 5, 2018

@ChALkeR IIUC we just need to find where we can allocate this (it seems like we are close to the limits of the donated quotas).
I'm marking this as wg-agenda.

@rvagg
Member

rvagg commented Nov 6, 2018

OK, so I'm getting back in sync and tuning in to this now. This is a pretty hefty resource, which makes it tricky to find an allocation. Joyent and DigitalOcean are probably the two obvious options for now. Since we have a pretty good relationship with DO and have recently freed up some resources there, I've gone ahead and put a server there. I've also allocated a big disk so we can put a mirror of the npm data on it; that'll be handy for other purposes, I suspect.

It's in Europe, so connectivity should be fast for you, @ChALkeR, and your SSH key is in there for root.

```
Host infra-digitalocean-ubuntu1804-x64-1 gzemnid
  HostName 178.128.202.158
  User root
```
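
With that entry in ~/.ssh/config, connecting is simply:

```sh
ssh gzemnid
```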

The primary disk is 160G and the extra attached 1 TB volume is at /mnt/volume_fra1_01/. Ping me if you need any help with any of this, and keep us updated on what's going on on this machine so we don't forget about it.

@ChALkeR
Member Author

ChALkeR commented Nov 7, 2018

@rvagg Thanks! I'm installing things now, and moving the existing datasets there.

@ChALkeR
Member Author

ChALkeR commented Nov 7, 2018

I moved the existing datasets to the new server, pointed gzemnid.oserv.org (my domain, as I don't have a better one) to it, and turned off the old server.

The existing datasets are now available at https://gzemnid.oserv.org/datasets/ and are served by the new server.

Packages and other things needed to rebuild the dataset are still being downloaded; that'll finish in a day or so.

@rvagg
Member

rvagg commented Nov 7, 2018

I've added a DNS record for nodejs.org and it's funneling through Cloudflare to get our HTTPS cert. https://gzemnid.nodejs.org/datasets/

@ChALkeR perhaps at some point you could add something to the root to describe how to use this? Also, an http -> https redirect would be good.

@ChALkeR
Member Author

ChALkeR commented Nov 8, 2018

@rvagg Done and done.
I added the redirection and some minimal info for now; the next step would be to make a web UI for server-side searches.

The packages for rebuilding are still being downloaded.
I placed them in the /mnt/volume_fra1_01/gzemnid/pool/current/ dir.
I will also move the historic datasets to the volume, and I plan to place everything else on the main disk.

@ChALkeR
Member Author

ChALkeR commented Nov 9, 2018

Status update: I fetched the required packages and moved the historic datasets over to the volume.
Current volume usage: 289G (packages) + 58G (datasets) = 347G (35%).
That might increase once I add support for scoped packages (hopefully soon), but not drastically.

Other things will be stored directly on the main partition; they are not ready yet. I will keep everyone updated.

In the meantime, I am working on the registry follower that would (later) keep the dataset updated automatically.
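For context, such a follower consumes the registry changes feed. A minimal sketch of inspecting that raw stream, assuming replicate.npmjs.com exposes the standard CouchDB `_changes` endpoint:

```sh
# Stream registry change events (one JSON line per changed package) and show a few.
curl -sN 'https://replicate.npmjs.com/_changes?feed=continuous&since=now' | head -n 5
```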

@refack
Contributor

refack commented Nov 11, 2018

@ChALkeR how hard would it be to document the setup in an Ansible script (example)? It might also be good to consider having the "pull new dataset" script as Ansible.

But since it's a singleton machine, it's not high priority. Could also be just-a-bunch-of-shell-scripts.

P.S. Taking this off the agenda since it's unstalled.

@ChALkeR
Member Author

ChALkeR commented Nov 13, 2018

Status update:

  1. Rebuilt the dataset for 2018-11-12 — the first one on the new server. Available at https://gzemnid.nodejs.org/datasets/
  2. Stumbled upon «Rare decompression issue in 1.8.2» (lz4/lz4#560) / «Data corruption with highCompression encoding (testcase attached)» (pierrec/node-lz4#69), but for now I just manually excluded/removed the package that gets broken. That's react-cryptocharts, with only 345 downloads/month.
  3. Sync using changes-stream (the follower needed for real-time rebuilds and scoped packages) is partially done — it now fetches and stores the required data, but that data is not used yet.

@refack I think it's too early for that; I expect significant changes there once the web API/UI is ready, and I don't want to redo the work =). The setup is fairly simple and shouldn't be needed again in the short term, so I am going to postpone that.

@refack
Contributor

refack commented Nov 16, 2018

I just had a thought: recognize an opt-out option { gzemnid: false } or { code_index: false } in package.json.
Everything on npm is OSS, but IMO it's a nice courtesy (similar to robots.txt). A rough sketch of how an indexer could honor it is below.
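
A hedged sketch of honoring such an opt-out (the gzemnid field name is the hypothetical one suggested above; requires jq; package/package.json is the standard path inside npm tarballs):

```sh
# Skip indexing if the (hypothetical) opt-out flag is set in the package's package.json.
tarball=express-4.16.4.tgz
if tar -xzOf "$tarball" package/package.json | jq -e '.gzemnid == false' > /dev/null; then
  echo "opted out, skipping"
fi
```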

@BridgeAR
Member

BridgeAR commented Apr 4, 2019

@ChALkeR I know this is about the server, but is there any chance that we get server-side searches soon? Or is there any chance to get SSH access?

@ChALkeR
Member Author

ChALkeR commented Apr 4, 2019

@BridgeAR I added your ssh key to the server.
ssh [email protected].

```sh
cd ~/public/datasets/out.2019-01-18
./search.topcode.sh "regex" > ~/BridgeAR/whatever1
./search.code.sh "regex" > ~/BridgeAR/whatever2
```

Ask me if you have any questions =).
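
For reference, the published datasets can also be searched without SSH access. A rough local equivalent, assuming a hypothetical lz4-compressed file name (check https://gzemnid.nodejs.org/datasets/ for the actual listing):

```sh
# Hypothetical filename; see https://gzemnid.nodejs.org/datasets/ for the real one.
curl -sLO https://gzemnid.nodejs.org/datasets/out.2019-01-18/slim.code.txt.lz4
lz4 -d slim.code.txt.lz4 slim.code.txt   # decompress with the lz4 CLI
grep -E "regex" slim.code.txt > whatever1
```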

@github-actions

github-actions bot commented Mar 7, 2020

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label Mar 7, 2020
@mhdawson
Member

mhdawson commented Mar 9, 2020

@ChALkeR is this still active?

@rvagg
Member

rvagg commented Mar 9, 2020

ubuntu1804-x64-1: {ip: 178.128.202.158, alias: gzemnid}

We eventually did get a server into our infra for this, so this issue could be closed. I am wondering, though, whether it's even used and whether we shouldn't be decommissioning it.

@github-actions github-actions bot closed this as completed Apr 9, 2020