
The power of data science

TL;DR: I’m a member of the board of directors of Alphacruncher, the EdTech startup building nuvolos.cloud. We’ve just raised CHF 1.5M to rock the EdTech space.

I’m a natural scientist by education. I do my best to operate on objective reasoning, rely on data, and trust the interpretation of experts. A lot of that thinking goes back to my time at university (NTB Buchs in Switzerland, the University of Karlsruhe in Germany, but also studies abroad, e.g. Harvard Business School).

When founding Tree.ly, it was clear to us that data science was going to play an important role. We started with some local Jupyter Notebooks in Visual Studio Code (which actually works pretty well), but quickly realized we needed something that lets us collaborate within the team and with others.

Analyzing Forests
Image Credits: Tree.ly, Ocell.io, illwerkevkw, TU Vienna

We continued with tools like Amazon SageMaker, Microsoft Notebooks, and Google Colaboratory – not to forget running JupyterHub within our Kubernetes infrastructure. It’s amazing how easily everybody can do data science nowadays! I strongly encourage you to check out these tools.


That was until my friend Oliver from Zeughaus connected me with Alexandru Popescu from Alphacruncher and I learnt about Nuvolos Cloud.

It’s primarily targeted at educational customers (that’s also where it comes from), but I also see great potential for “commercial” data science. What do I find especially cool?

Snapshots and shared workspaces

Inside the platform one can easily snapshot and share datasets with others. For example, a teacher can create an environment for an exercise and share that very environment with the entire class – each student can then continue in their personal environment and hand the solved problem back in to the teacher afterwards.

Or assume you wrote a scientific paper whose findings are based on a larger dataset and a couple of computations. You can not only share the PDF, but also provide access to the full environment, so others can validate the findings – or even build upon them. Magic!

In our case we’re working collaboratively on larger datasets (e.g. a few TB of airborne laser scanning point clouds, parcel data and all kinds of other readings). We can use a shared kernel image with all dependencies installed and work on the same shared data folders – while still preserving our personal preferences and spaces.

Resource efficiency / Shared resources

Data science has the characteristic that you need quite a lot of resources for a rather short amount of time – and then, for long stretches, you don’t need the instances running at all. Nuvolos and its billing/usage model make that cloud elasticity super simple for the user. One can book a base level of resources (which are only spun up when needed and automatically terminated afterwards), but also book spike resources.

That’s not only convenient for companies, but even more so for universities or school classes that need these resources for every student.

Everything in the Cloud

We spoke mainly about Jupyter Notebooks, but the Nuvolos environment also provides access to Snowflake (one of the coolest databases on earth) and many other tools. In the cloud. In the browser. In a shared space. The team is constantly working on expanding that tool suite. Right now it includes RStudio, VS Code, Spyder, JupyterLab, Julia, Stata, Matlab, GNU Octave, SAS, IBM SPSS, REDCap, Airflow and others.

A winning team

During due diligence I took a look at the tech stack and got to know some of the core team members personally. They are not only using state-of-the-art technology and methodology, but have also managed to attract top talent to build and operate the product.

I couldn’t be more excited to play a small role in Alexandru’s and his team’s journey. Thanks for letting me ride with you.


/me, an oldskool SQL fanboi

Recently I came across this tweet by Steve O’Grady (he helped found Redmonk, a research firm I value highly).


Tweet by Steve O’Grady

This graph in particular struck me. It shows what I predicted 6 years ago: NoSQL isn’t the solution. And it’s one of the main reasons we built CrateDB.

Providing a SQL interface on top of a NoSQL architecture was and is one of the core ideas of CrateDB. CrateDB is purpose-built to scale modern applications, and SQL is a major component of that.

The first wave – and the reason SQL was invented – was relational databases. It started with SEQUEL, developed in the mid-1970s at IBM for System R. SQL1 became an ANSI standard in 1986, window functions were added in SQL:2003, and SQL:2016 brought JSON support.

Glad to see that SQL is back and still going strong. There are many reasons why it is great and why we fully rely on it. But the main point to talk about:

It’s standardized.

“The nice thing about standards is that you have so many to choose from.”

Andrew S. Tanenbaum

Over the past 20+ years, some of the brightest computer scientists have thought about how to best work with data. SQL has four language subcategories:

Querying Data (DQL)
(CrateDB’s querying commands)

That’s what you do most of the time, and CrateDB excels when it comes to querying data. Nowadays, real-time analytics use cases require aggregations (SELECT SUM(revenue)) over a table (FROM my_table) that is potentially large (billions of rows) and are limited to a subset of these rows by a filter clause (WHERE ts > myts AND id = myid). Benchmarks show that we consistently outperform other query engines for these types of queries. Add JOINs to the mix and the complexity explodes – read about the complexity involved in making that really fast. Not only on standard data types, but also on modern things like geospatial data.
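To make that shape concrete, here is a minimal sketch of such a query in plain SQL (the table and column names, e.g. sensor_readings and device_id, are made up for illustration):

```sql
-- Aggregate revenue for one device over a recent time window,
-- filtering a potentially huge table down to a small subset of rows.
SELECT device_id,
       COUNT(*)     AS readings,
       SUM(revenue) AS total_revenue
FROM sensor_readings
WHERE ts > '2021-06-01T00:00:00'
  AND device_id = 'dev-4711'
GROUP BY device_id;
```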

Manipulating Data (DML)
(CrateDB’s manipulating commands)

Correct – to be able to query data, you need to start by writing data to the database: in single rows, in batches, via bulk load, through updates, or from other queries. CrateDB is fully standard compliant; the only slight difference stems from the eventually consistent model (also read about optimistic concurrency control) that CrateDB uses: it might take up to the REFRESH_INTERVAL until records are fully considered in results (selecting by primary key is always consistent).
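A rough sketch of those write paths, reusing the hypothetical tables from above; the explicit REFRESH TABLE at the end makes freshly written rows visible to queries without waiting for the refresh interval:

```sql
-- Single- and multi-row inserts.
INSERT INTO sensor_readings (device_id, ts, revenue)
VALUES ('dev-4711', '2021-06-01T12:00:00', 4.2),
       ('dev-4712', '2021-06-01T12:00:01', 1.7);

-- Update existing rows.
UPDATE sensor_readings
SET revenue = 0
WHERE device_id = 'dev-4711';

-- Write the result of another query into a second table.
INSERT INTO hourly_revenue (device_id, hour, total_revenue)
SELECT device_id, DATE_TRUNC('hour', ts), SUM(revenue)
FROM sensor_readings
GROUP BY device_id, DATE_TRUNC('hour', ts);

-- Make the writes immediately visible to subsequent queries.
REFRESH TABLE sensor_readings;
```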

Defining Schemas (DDL)
(CrateDB’s definition commands)

Here’s where one needs a bit more than the original SQL standard had to offer. We’re storing JSON data and need the ability to define object columns (which can have a strict schema, a dynamic schema, or simply remain unindexed). By the way, schemas are a great idea – storing data without a schema is just postponing problems.
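As a sketch, this is roughly what object columns look like in CrateDB’s DDL (the events table and its columns are hypothetical):

```sql
CREATE TABLE events (
    id      TEXT PRIMARY KEY,
    ts      TIMESTAMP WITH TIME ZONE,
    -- strict: only the declared sub-columns are allowed
    device  OBJECT(STRICT) AS (
        device_id TEXT,
        model     TEXT
    ),
    -- dynamic: new sub-columns are added and indexed on the fly
    payload OBJECT(DYNAMIC),
    -- ignored: stored as-is, not indexed
    raw     OBJECT(IGNORED)
);
```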

When it comes to scaling to large amounts of data, the concepts of sharding, partitioning and replication are key. Here’s where the knowledge of a database administrator is needed. In contrast to a traditional database, where you can change more aspects of a table after creation, you need to think about and define more settings beforehand. I don’t think this is a problem – everybody who has ever been forced to sit through hour-long schema migrations with a locked database will agree.
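Those decisions end up right in the table definition. A minimal sketch, assuming a time-series table that is sharded by device and partitioned by month:

```sql
CREATE TABLE sensor_readings (
    device_id TEXT,
    ts        TIMESTAMP WITH TIME ZONE,
    -- derived partition key
    month     TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS DATE_TRUNC('month', ts),
    revenue   DOUBLE PRECISION
)
CLUSTERED BY (device_id) INTO 12 SHARDS   -- sharding
PARTITIONED BY (month)                    -- partitioning
WITH (number_of_replicas = 1);            -- replication
```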

Rights and Transaction Control (DCL, DTL)

Now we’re talking about differences. Traditional OLTP databases and their workloads heavily rely on transactions and fine-grained permissions. But this comes at a high price (performance, speed, concurrency) – a price that one doesn’t want to pay in the IoT or time-series space, especially because the yield is low. Therefore CrateDB supports user management, but no further functionality (yet).
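User management in practice looks like standard DCL; a small sketch (the user and table names are made up):

```sql
-- Create a user and grant it read-only access to a single table.
CREATE USER analyst WITH (password = 'change-me');
GRANT DQL ON TABLE doc.sensor_readings TO analyst;
```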

SQL-99 Complete, Really

In case you want to learn more about SQL or look up some reference documentation: at Crate.io we’ve acquired the rights to republish the fantastic book SQL-99 Complete, Really. So for all oldskool SQL fanbois, here’s a teaser – the World’s Longest SQL Poem (for obvious reasons):

All the snake-oil peddlers say, there's a fast and easy way,
To get your SQL program up and running,
But they're silent re the traps, that cause subtly buggy apps,
For to catch the unaware a chasm's yawning!

Date-arithmetic exceptions, auto-rollbacked disconnections,
Bit precisions, overflows, collate coercions,
And how NULL affects your summing, for to keep your DB humming,
You must know what happens in all vendors' versions!

Should this field be DOUBLE PRECISION?
Will logic rules soon see revision?
By the ANSI:Databases sub-committee?
When you DROP should you CASCADE?
How are NATURAL joins made?
Re UNIQUE-keys matching check the nitty-gritty!

Yeah the true and standard facts, you'll avoid those later hacks
That make Structured Query Language such a bore,
You'll find tips and charts aplenty, in this one-thousand-and-twenty
Four page volume (with an index), and yet more!

Author anon (also for obvious reasons)