JupyterHub Maintenance Nightmare

I have just finished what was a week-long nightmare involving JupyterHub.

It started simply enough. I have always kept my own documentation on “how to do” various things. In this case I had recorded the steps for keeping all the packages in my JupyterHub install up to date on an ongoing basis. JupyterHub lives inside the Anaconda distribution, which is managed by a package manager called conda. Upgrading is supposed to be as simple as “conda upgrade xxx”, where xxx is the package to upgrade.
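The routine itself is only a handful of commands; a sketch of what it looks like (the package names here are examples, not my exact list):

```shell
# Activate the base environment (path assumes a default Anaconda install)
source ~/anaconda3/bin/activate

# Update conda itself first, then each package in turn
conda upgrade conda
conda upgrade jupyterhub
conda upgrade notebook
conda upgrade xeus-cling
```

(“conda upgrade” and “conda update” are aliases for the same command.)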

I have been using this process since first installing JupyterHub on a server back in January 2019.

Last week I started working through the 8 or so packages that I normally update, and one of them failed with a blizzard of messages and warnings. But that wasn’t the worst part. The worst part was that it also killed conda dead. Even typing “conda -h”, which should print a simple help message, instead produced a page of errors warning of missing libraries before failing. Nothing I tried would work; conda was dead.

I grabbed my notes from January, and started to reinstall anaconda/conda. It’s really quite simple: delete the anaconda3 directory and reinstall.

It all worked until I came to the last package: xeus-cling, which provides the C++ support I require. That installation failed with (again) a set of rather bizarre “missing stuff” messages.

I left it at that… after all, everything else worked. I posted a message on the xeus-cling GitHub page and waited.

This week I got a reply: conda 4.7.9 was broken, and conda 4.7.10 worked with xeus-cling. I checked, and sure enough I was on version 4.7.9. I updated (conda upgrade --all) and tried xeus-cling. It did not work. Instead, it just hung trying to resolve the ‘environment’.

Eventually, after several trial-and-error sessions, I resolved the situation and once again have a perfectly good, working JupyterHub. I’m very glad that reinstalling anaconda3 is so simple, as I ended up doing it several times before I got the order of operations correct.

In a nutshell: you must install anaconda3 (from the 2019.03 script, which ships conda 4.6.11), then install xeus-cling before anything else. Only in this way can you avoid the hang with xeus-cling and conda 4.7.10, because every install via conda also updates conda itself to the latest version.
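Condensed into commands, the working order was roughly this (the installer filename is the standard one for the 2019.03 release; check the exact script you have):

```shell
# Remove the broken install and start clean
rm -rf ~/anaconda3
bash Anaconda3-2019.03-Linux-x86_64.sh    # ships conda 4.6.11

# Install xeus-cling FIRST; this step upgrades conda to the latest
# version as a side effect, which is fine once xeus-cling is in place
conda install -c conda-forge xeus-cling

# Only now install everything else (jupyterhub, notebook, etc.)
conda install -c conda-forge jupyterhub
```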

Once xeus-cling is installed, everything else installs just fine, and the system is back up and running. Still, it was a nightmare figuring out what was wrong, and I can’t say I’m very impressed with conda for breaking things that were working in a production environment.

Ubuntu LAMP & WordPress

As discussed in a prior post (https://jrcrofter.huntrods.com/updating-wordpress-just-got-messy-really-messy/), I was able to install LAMP (Linux, Apache, MySQL & PHP) on my home server, which had been converted from Solaris 10 to Ubuntu 18 about a month ago.

It was converted to run the latest version of Tomcat with my application. The conversion wasn’t strictly necessary, but Solaris 10 is long in the tooth, and updating the Java virtual machine (JVM) and MySQL had become nigh-on impossible. On the other hand, installing the latest versions on Linux was almost easy. I chose Ubuntu 18.04 because I’ve liked Ubuntu since the very first release; it’s easy to install and maintain, and seems a very robust Linux distro.

With Apache2 installed and running, my web sites were in good shape… except some things like WordPress started to complain because they weren’t ‘secure’.

Securing Apache means moving from HTTP to HTTPS, which in turn requires getting a security certificate. Fortunately, I’d already switched to LetsEncrypt for my security certs, so I was comfortable with the process. Likewise, my application was already using HTTPS, so I’m also comfortable with that process.

What was new was a) moving my application to another port so that the default HTTPS port (443) was free for my Apache web pages, and b) getting HTTPS running on Apache2.

As it turns out, it was quite easy. Following a good guide, I installed and used Certbot to obtain new LetsEncrypt certs for my web domains, which also handled renewals and much of the background installation work. With a few config changes and adding the secure mods, all the web pages were switched over to HTTPS without incident.
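For anyone following along, the Certbot steps amount to something like this on Ubuntu 18.04 (the domain names are placeholders):

```shell
# Install Certbot with the Apache plugin
sudo apt-get install certbot python-certbot-apache

# Obtain a cert and let Certbot rewrite the Apache config for HTTPS
sudo certbot --apache -d example.com -d www.example.com

# Renewals happen automatically; this verifies the renewal machinery works
sudo certbot renew --dry-run
```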

The more I use Ubuntu (and thus Linux) to do actual tasks, the more I’m liking the ease of installation and maintenance. The fact that there is now a critical mass of GOOD help online is also a great bonus.

PiDP11 Issues

The PiDP11 is a scale model of the PDP11 front panel, complete with working lights and switches, all driven by the SIMH software running on a Raspberry Pi.

It was designed by Oscar Vermeulen, who sells it as a kit for the hobbyist to build and enjoy. I bought my kit last summer, but due to the home renovations did not build it until Christmas.

Oscar and others have developed software for it that customizes SIMH and drives the lights and reads the switches. There is also a marvelous manual describing how to build and operate the kit in great detail.

There’s been one problem: the front panel occasionally locks up. There was a lot of discussion on the PiDP11 discussion forum, and several causes and solutions were offered and discussed.

Eventually the source of the lock-up was traced to a race condition in one of the files controlling the front panel. A corrected version of the file was provided in February 2019, and I finally got around to installing it (and recompiling the software) on May 22. It ran until May 28, when it locked up again.

I reported this to the group, and one of the developers (Bill) modified the source file to try to fix the problem permanently. The fix was offered as a set of lines to edit and recompile rather than a new file, to keep versions clean. It was posted yesterday (May 29).

Today (May 30), I edited the source file and recompiled, then rebooted the PiDP11 at 9am. The main purpose of this post is to document the date & time of this latest fix in the event it does not lock up again. I can then refer back to this post for the date/time the fix was made.

Updating WordPress Just got Messy. Really Messy.

As the title says, really, really messy.

It’s not WordPress’s fault. Rather, WordPress is keeping up with the times, and the times say PHP needs to be kept current.

Up until the last version of WordPress (5.2), my OpenBSD server, created many years ago, was OK. Its version of PHP was old, but it worked. When I updated to 5.1, WordPress warned me that my PHP was obsolete, but there wasn’t much I could do at the time. My old version of OpenBSD did not have a simple path to update PHP; rather, to update such things one is expected to update OpenBSD itself.

Then came WordPress 5.2. My existing WordPress stated “cannot upgrade due to older PHP” or something similar.

Time to update PHP, which meant a new OpenBSD.

That’s when Messy happened. The latest OpenBSD (6.4) is wonderful. It’s shiny and new and fast, but … they replaced Apache with a new program ‘httpd’ that was written to be ultra-secure. Too secure, in fact.

I spent two weeks fighting OpenBSD 6.4 and httpd, but could not get it to do what I needed. Worse, there’s almost zero helpful documents written about it. The manual is OK but dense, and the only “how to” site covered setting up the ultra-secure version, and nothing else.

Yesterday I finally gave up. Instead of updating OpenBSD, what if I just modified the firewall to scoot all the Apache pages to a new server? Something like Ubuntu 18.04 that I’d recently put on all my Tomcat/JupyterHub servers? Would it be as difficult as OpenBSD?

I found a couple of how-tos, and they seemed utterly simple. “apt-get install apache2”. Done, and it was running! “apt-get install php”. Done, and also running. “apt-get install mysql” (this was for a virtual test server). Done and running. This was scary easy.
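For the record, the exact package names matter a little; on Ubuntu 18.04 the working incantations are closer to:

```shell
sudo apt-get update
sudo apt-get install apache2                   # web server, starts automatically
sudo apt-get install php libapache2-mod-php    # PHP plus the Apache module
sudo apt-get install mysql-server              # the package is mysql-server, not mysql
sudo apt-get install php-mysql                 # lets PHP talk to MySQL (needed by WordPress)
```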

Even configuring Apache has gotten much easier due to the plug-and-play structure that’s been adopted (probably for some time).

The only difficult part was installing wordpress. It really wants to be in one place, and doesn’t like port forwarding. For example, if the test server was “http://10.1.1.214”, then that’s where wordpress wanted to be. Port forward “http://huntrods.com:8008” and it just reverted to either 10.1.1.214 or “huntrods.com” and didn’t work. Eventually I realized that port forwarding would happen when I threw the “big switch” and turned off the current (old) server.
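One standard way to pin WordPress to the address you actually want, rather than whatever it auto-detects, is to set the site URLs explicitly in wp-config.php. Whether this would have fixed my port-forwarding test I can’t say for certain, but the mechanism looks like this (values are illustrative):

```php
<?php
// In wp-config.php, above the "That's all, stop editing!" line:
// force WordPress to use one canonical address regardless of how
// the request arrived.
define( 'WP_HOME',    'http://huntrods.com:8008' );
define( 'WP_SITEURL', 'http://huntrods.com:8008' );
```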

With that in mind I did some more testing on WordPress, duplicating this blog from a database backup plus the latest WordPress. It hated the older version’s files and refused to run, but copying over the latest files did the trick.

Finally, the big moment: install apache2 and php on the physical server, create the various accounts necessary for users, then copy all the files from the old server to the new server. Most were static web sites, but there were 3 WordPress blogs that had to come over, complete with new databases (from today’s backups).

At last it was all working locally. Time to throw the switch. On the OpenBSD server, I stopped the old Apache and disabled it, then forwarded port 80 to the new server. SUCCESS, or at least partial success. I still needed to create all the Virtual Hosts from before, but with the plug-and-play Apache2 that turned out to be easy, if time-consuming.
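With Apache2’s sites-available/sites-enabled layout, each virtual host is just a small file plus an a2ensite call. A minimal one looks like this (names and paths are placeholders):

```apache
# /etc/apache2/sites-available/example.com.conf
<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com
    DocumentRoot /var/www/example.com
    ErrorLog ${APACHE_LOG_DIR}/example.com-error.log
    CustomLog ${APACHE_LOG_DIR}/example.com-access.log combined
</VirtualHost>
```

Enable it with “sudo a2ensite example.com.conf” followed by “sudo systemctl reload apache2”.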

Lastly I fired up the WordPress accounts, and they failed. It turned out I had to copy “latest.tar.gz” over the older WordPress files, and then everything worked.

So after two weeks of fighting httpd, I was able to get Apache2 working on an Ubuntu server, complete with full testing, in just under two days.

Success, indeed.

Updating Java is not fun

I’ve been using OpenJDK (Java 11) on my Ubuntu servers for a while now, so I thought it would be a good idea to update Java on my main production machine.

After digging into “what’s the difference between OpenJDK and the Oracle JDK”, I came to the conclusion the main difference was … nothing. Since it was a lot easier to find the Oracle JDK for a Win 7 box, I chose that and downloaded the installer.

After installing… nothing. It was still pointing to the older version. I removed the older version(s) and still… nothing. I reinstalled and there was a “next steps” button which led to a help file that basically said “you must manually adjust the system PATH variable for Java to work”. Really? In 2019???

Oh well, it’s easy to do, and afterward I did indeed have Java 11 running. I also updated my ‘go.bat’ script that sets up Java for command-shell compiling, which is used by my production Ant builds of my enterprise application.
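For what it’s worth, the relevant part of ‘go.bat’ amounts to something like this (the install path here is illustrative, not necessarily my actual one):

```bat
@echo off
rem Point JAVA_HOME at the new JDK 11 install and put its bin on the PATH
set JAVA_HOME=C:\Program Files\Java\jdk-11
set PATH=%JAVA_HOME%\bin;%PATH%

rem Sanity check: should report version 11
java -version
```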

Of course, the ant build failed, but I knew it would as the (very old) libraries were removed when I toasted Java 8.

And now came the messy part. I pointed CLASSPATH to the new (better) Tomcat location of the various servlet libraries, but still no good news. After much reading to little effect, I grabbed some of the code and tried compiling it in a shell window, which worked just fine. I had established that Ant was not getting the CLASSPATH.

In the midst of this I upgraded Apache Ant to version 1.10 with no ill effect.

Again, after much more reading, it became clear that Ant actually and openly HATES the classpath variable and erases it when used. There was no functional work-around to this, so I reluctantly rewrote my build.xml files to embed a classpath for the application compiles. This worked, but showed more missing files.
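The embedded classpath ends up looking something like this in build.xml (directory names here are examples, not my actual layout):

```xml
<!-- Define the compile classpath explicitly; Ant deliberately avoids
     the CLASSPATH environment variable, so every jar must be listed -->
<path id="compile.classpath">
    <fileset dir="${tomcat.home}/lib">
        <include name="*.jar"/>
    </fileset>
    <fileset dir="altlib">
        <include name="*.jar"/>
    </fileset>
</path>

<target name="compile">
    <javac srcdir="src" destdir="build/classes" includeantruntime="false">
        <classpath refid="compile.classpath"/>
    </javac>
</target>
```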

I adjusted the Ant build until everything compiled, but it required a new ‘altlib’ directory of lesser-used jar files to fully work.

Finally, an almost-good compile. There were only a few remaining errors, but they ended up costing me time. All the errors amounted to the removal of “com.sun” libraries. One had a replacement in apache.commons, and the other was so deprecated that the advice just said “comment it out”. I did, and everything compiled successfully.

Now I’m running on Java 11 and Ant 1.10 with what is considered a “proper” build link to libraries. I disagree, but there’s little I can do.

Server Madness

I have been running industrial strength servers for many years now, as part of a project that has been moderately successful.

The current set of servers are 1U rack-mount ‘pizza boxes’: mostly quad-core Intel Xeon servers, plus one dual-CPU AMD Opteron. All are Hewlett-Packard (HP) ProLiant servers from roughly 2005, and all have been running without major incident, many of them since 2005.

About 2 years ago the dual Opteron died in production. No warning, it just up and died. I got the call in the morning and went and retrieved it from the co-locate service by lunch. Basically I just took one of the quad-core Xeons with all the software ‘ready to go’ and did a box swap.

Once home I didn’t spend much time on it, but a few reboots seemed to confirm a hard drive problem.

Fast forward to just before Christmas last year. I wanted a box to run JupyterHub on Ubuntu, and this box, with its dual AMD processors and 16GB of memory, seemed a quick prospect.

After various diagnostics, I determined the hard drive was OK, but the SATA system was not. I managed to install Ubuntu from a burned DVD iso, but it took a very long time and then didn’t reboot successfully. I shelved the server at that point and went instead with a brand new purpose built server, which works wonderfully.

In late January I wanted a server to run OPM – an open source reservoir simulator. I’ve run it before on virtual machines, but wanted a ‘real’ server. I pulled out the dual Opteron and tried installing Ubuntu again, this time from a known good USB data key that I’d used on the JupyterHub purpose-built box. This time it installed quickly and rebooted to an operating system.

I installed it in my newly built server rack and after moving everything to the server room it fired up and has been running since.

Late last week Ubuntu updates indicated a reboot of all my Ubuntu systems was needed. All except the dual Opteron rebooted without incident. Unfortunately, the dual Opteron would not boot, instead returning the dreaded “SATA channel X is slow to respond…”. Eventually this fails with “SATA channel is down” and no hard drive.

It’s annoying as the POST clearly shows a SATA HDD as expected. It’s only on OS boot that SATA dies.

I decided to diagnose and fix the problem if I could. I started by pulling the cover and then tried a few boots – no good. I then pulled all the cables and eventually pulled the HDD and set it all on top. This time it booted without a hiccup! I verified all was good with a few proper shutdowns and then cold boots, and proceeded to replace all the cables and button up the case. And… the boot failed.

After much checking and rechecking, I had a niggling feeling that there was ‘something up’. At one point when it was failing to boot (at the “SATA channel X is slow to respond…” message), I pulled the mini-power connector off the optical drive. INSTANTLY the boot process proceeded normally.

I tested this a couple of times to confirm, and indeed the problem is the optical drive (DVD ROM). Specifically, the optical drive is a removable mini-height drive that unclips and slides out of a circuit card. The circuit card stays in the machine, and has a 36-pin IDE header plus the micro-power connector. Pulling the power connector kills the drive electronics and allows the machine to boot cleanly.

It would seem that somehow the single IDE channel for the DVD ROM optical drive is interfering with the SATA channels for the HDD. That or there’s a power drain when the DVD ROM is plugged in that messes with the power to the HDD. Both HDD bays and the Optical drive share the same power harness. Power to the DVD splits off the power connector for one drive, but all are on the same set of wires from the power supply. It’s possible there’s a drain that kills one of the power lines. Unfortunately with the HDD in place you cannot tell if it’s spinning or not.

However, I don’t really think that’s the issue. Before Christmas one of the problems I experienced was the utter slowness of the DVD ROM to read the known-good disc. I suspect the DVD ROM electronics are bad and somehow influence the SATA controller. For all I know it’s the same chipset on the motherboard.

At any rate, the machine is now back together and running perfectly – just with the DVD ROM power cable disconnected.

It was very difficult to diagnose, but I’m glad I was able to finally solve the mystery.


Server – A Nightmare No More

My last post about the new server indicated the Gigabyte motherboard was returned and an ASUS motherboard ordered as a replacement. I also ordered the cheapest video card I could find as there is/was no on-board video with the basic AMD Ryzen CPU.

The parts arrived in the first week of March, and I promptly put everything together. This time I installed the CPU and heat sink/fan on a sturdy table (with static protection), as well as the memory and M.2 SSD. The new motherboard is a bit longer than the first one; it nicely picks up some mounting standoffs on the end, leaving nothing unsupported.

The motherboard went into the case, and the power supply connected easily. There was an initial problem with the front panel connectors, but ASUS had a QR code linking to a very complete ‘motherboard connector and header’ manual that helped immeasurably.

With everything connected, I started the machine and was immediately rewarded by a good boot sequence and the video BIOS screen. After verifying the BIOS settings, I started to install the OS.

Here I had a problem: the 16GB data key was not recognized about 9 times out of 10. Finally I grabbed a different brand of data key (same size), re-flashed Ubuntu 18.04 Server onto it and was able to install the OS in very short order.
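Re-flashing a data key from a Linux box is a one-liner, though you must be certain of the device name (sdX below is a placeholder, and the ISO filename is the standard one for this release; dd will happily destroy the wrong disk):

```shell
# Write the server ISO to the USB key; /dev/sdX is the key, NOT a hard drive
sudo dd if=ubuntu-18.04-live-server-amd64.iso of=/dev/sdX bs=4M status=progress conv=fsync
```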

Once I was sure all was well, I buttoned up the case and installed it in my server rack. It’s been running without issues since, having JupyterHub installed as the primary application. It’s also blazingly fast compared to all my other Ubuntu boxes.

Overall I am now quite happy with the 6-core AMD Ryzen chip, though I still wish it had come with at least minimal VGA graphics, as that’s really all a server requires.

Server Nightmare, continued

When last we left our tale of woe, the server was running but without video. Messages to AMD and Gigabyte were unanswered, and the internet was not much help other than to suggest a BIOS update was needed.

Since then much has happened. I did finally hear from both vendors; more on that later.

In the meantime, I decided to try one internet suggestion – adding a separate graphics card to flash the BIOS. The motherboard has three PCI slots, but I didn’t have a PCI video card. I called a local computer shop to inquire whether they might have something in a “junk bin”, and they did. I went to town (literally) and picked it up… FREE.

Back home I plugged it in, and it worked. I now had VGA video, which is all I really need for a server, and certainly sufficient to flash BIOS.

With video, I managed to flash the BIOS from the older version (F2), first to F3 and finally to F4 (the latest). There were a few issues along the way, but by then I had downloaded a full manual and used it as a guide. I also noticed a few issues with the motherboard and USB data keys that bothered me at the time (for instance, booting with certain keys inserted locked up the USB keyboard), but set that aside for the moment.

With the BIOS updated to “latest”, I removed the PCI graphics card, powered on, and… STILL no video. I was flummoxed.

However, help came from a surprising source. FINALLY I heard back from both vendors. Both said the same thing: the AMD Ryzen 5 2600 CPU does not have on-board graphics. This was a surprise to me as there was NOTHING on the manufacturer sites or manufacturer materials supplied to Amazon.ca to suggest this was the case when I chose these components. However, given what I was seeing, it made sense. As it turns out, AMD sells two “things”: a CPU with no on-board GPU, and a thing called an APU which has the GPU on-board. Who knew?

I decided I could live with this and sourced a cheap low-profile PCI graphics card, as the free one was full height and wouldn’t fit in the 2U rack case.

I also decided to go ahead and install my server OS, Ubuntu 18.04.1 Server, since the video card wouldn’t be an issue: I always use VGA on servers.

Here is where the USB issue finally bit me. More than half of the time, the install failed with a USB error. Sometimes it locked up the USB keyboard as well. Only once in perhaps 12 attempts did it start to load the OS, and even then it failed when I plugged in the network cable, scrambling the video (what???).

Ultimately I decided the USB was flakey and initiated a return from Amazon.ca for the Gigabyte motherboard (reason: defective… ‘flakey USB’). It’s already boxed and mailed back as I write this.

I decided to keep everything else, as I do like the other components and am willing to give the AMD Ryzen a chance. I would have kept the Gigabyte motherboard as well were it not defective. However, given the several reports of similar USB flakey-ness by other reviewers, I decided to buy an ASUS motherboard designed for this CPU.

One last annoying tidbit – the ASUS site actually states that the AMD chip does not have on-board graphics and you’ll need to buy a video card. I wish I’d gone with ASUS from the start – at least I would not have been surprised and wasted 3 days chasing phantom video problems.

Vendors who should know better (a server nightmare)

Gigabyte and AMD, I’M TALKING TO YOU!!!

I need a new server for JupyterHub, and since I do like building servers and such things, I decided to do some research and buy a decent lower-cost “server-as-parts”.

I found from many reviews that the Gigabyte B450M DS3G motherboard, paired with the AMD Ryzen 5 2600 CPU, was a killer low-cost solution. I added appropriately fast (3000MHz) Corsair DDR4 memory (16GB to start) and an M.2 250GB SSD, all to go into a 2U rack case with an EVGA 500W power supply.

After all the bits came, I carefully assembled it and tried the first “smoke test”. It ran, but immediately gave a set of BIOS “error beeps”. Specifically “long-short-short” which means NO VIDEO for this BIOS.

Sure enough, plugging in either known good HDMI or DVI cables to a working monitor gave nothing.

Searching the internet proved this to be a VERY common problem, known since at least Nov 2018. Essentially, the motherboard ships with an early BIOS that doesn’t know about the new CPU with on-board video.

The solution is to flash a new BIOS… but how? With no video, you can’t see what’s going on to flash a BIOS. Very expensive motherboards have “Qflash+” which lets you put the bios on a data key in a special USB slot and it “just flashes”. My motherboard, the less expensive one, doesn’t have that feature. It can update from USB key (Qflash) but not “the plus”.

AMD’s solution is to have you request “a boot kit”. They send you a lesser (older) CPU “on loan” to fire up the motherboard, flash the bios and then send back. However, it was instantly obvious they have zero intention of doing this – you must “prove” you own the chip by taking a photo of the CPU clearly showing the serial number and model. PROBLEM: these are now covered with opaque white thermal compound if you’ve installed the supplied CPU cooling fan as any intelligent builder would do. So AMD wants you to scrape off the thermal compound and take the photo, then use ??? (what???) when you finally put it all back together. Well, I’m not stupid so I’m not running a CPU “dry”. Which means I can’t take the obligatory photo, so I can’t have the “boot kit”. What a bunch of idiots. (and I told them so by reply email and in an on-line review).

Next idea: put in a PCIE graphics card into one of the PCIE slots and boot graphics that way. I was able to score a very old PCIE VGA/DVI card from a local computer company’s scrap bin, and sure enough, it WORKED!!!

It sits an inch higher than the case, so it’s not a permanent solution, but it worked and I had VGA to see the BIOS screen.

After reading the BIOS update procedures, I carefully updated the BIOS to the latest version. Everything works… EXCEPT STILL NO VIDEO!!!

I’ve got a second trouble ticket in with Gigabyte, but who knows when they’ll answer.

Since this is a server I could buy a $50 shorter PCI graphics card and just use it to install Ubuntu, as the server will actually never be connected to a video monitor unless there’s a problem.

BUT WHAT WERE THESE IDIOTS THINKING – SELLING STUFF THAT DOESN’T WORK AND THEN HAVING ABYSMAL CUSTOMER SUPPORT (and the latest BIOS still doesn’t work).

If this was ‘bleeding edge’ like a game machine, I could see this as a typical issue, but this isn’t bleeding edge stuff – or shouldn’t be.

WORST CASE, BOTH CHIP AND MOTHERBOARD GET RETURNED IN MARCH.

Well, this is unexpected (a server story)

As the title says, I’ve been having a most weird server experience, culminating in a rather fascinating and unexpected discovery.

As posted recently, I’ve been experimenting with Jupyter Notebooks using JupyterHub on Ubuntu 18.04 Server.

I started with a server built on Oracle’s VirtualBox 5.x running on my development machine, which is an Intel quad-core i7 with 16GB of memory and a couple of smaller SSDs. I gave the virtual Ubuntu 8GB of memory and 2 cores, plus 64GB of disk space. This is where I cut my teeth on installing Jupyter, first locally, then JupyterLab, then JupyterHub (again locally) before finally installing JupyterHub globally. Along the way I learned quite a lot, and took these lessons to all other platforms via some detailed documents I wrote.

The first Ubuntu was desktop, complete with lots of X-type stuff. It was fast, it was good, but I wanted a more “dedicated” server.

My second server was Ubuntu Server running on my Windows Server 2008 R2 file server. It’s a backup file server, so it wasn’t doing much. I installed Oracle VirtualBox, this time v6.0, and Ubuntu 18.04 Server, as I didn’t need the X stuff and wanted a lean, fast server rather than a desktop (the ISO install images differ quite dramatically in size). This machine is a quad-core Xeon of recent vintage.

The server only had 8GB of physical memory, so I could only give the virtual server 4GB. As a result, it was very slow.

About that time I resurrected a ‘pizza box’ 1U quad-core Xeon server that also had 8GB of memory (it was the max for that vintage machine). As this was a dedicated box, I could install Ubuntu 18.04 server as the native OS and give it all the memory. After installing JupyterHub, it seemed… VERY sluggish. Opening notebooks took a very long time (minutes) and sometimes they would not open at all. I experienced problems connecting to the kernel, and it was just very frustrating.

I’d deleted both virtual machines, so I decided to try another on the development i7 box. Giving it 8GB, 2 cores and 64GB of disk as before, I installed JupyterHub.

At this point I had two almost identical servers with the same memory. The quad-core Xeon has 4 cores, the virtual i7 has only 2, but otherwise things are very close.

And here came the unexpected surprise. The i7 virtual machine is easily 10x faster, to my perception, than the Xeon. It’s truly a night-and-day difference. Where the Xeon is sluggish to open notebooks and connect the kernel (if it even succeeds), the virtual i7 is quick and responsive. Editing notebooks is a joy instead of a grind. Things are quick, kernels don’t die, and it’s just a totally different environment.

Yet aside from hardware, everything about the installs is identical. Even the notebooks come from the same github repo, so are identical.

Today I ran some benchmark tests on both machines, and every test shows the i7 virtual machine (with 2 cores) is double the speed of the quad-core Xeon with 4 cores. It’s astounding.
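For anyone wanting to run the same kind of comparison, a simple CPU benchmark with sysbench (one example of this kind of test, not necessarily the exact suite I used) looks like this on Ubuntu:

```shell
sudo apt-get install sysbench

# Single-threaded CPU test, then one using all available cores
sysbench cpu --cpu-max-prime=20000 run
sysbench cpu --cpu-max-prime=20000 --threads=4 run
```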