Platform as a Service – Data Management and Open-Source Software Services

Computing Center

Our group has a dedicated computing cluster for research on big biomedical data. The cluster has 24 compute nodes, 674 CPU cores, 6 terabytes of memory (RAM) and 2 petabytes of storage capacity.

Platform as a Service with HOPS

The cluster runs HOPS (Hadoop Open Platform-as-a-Service), a next-generation distribution of Apache Hadoop. It is complemented by Hopsworks, the UI front end to Hops, which lowers the barrier to entry for users getting started with Hadoop by providing graphical access to the Hops platform ecosystem and its services, such as Spark, Flink, Kafka, HDFS, YARN, Python, and Jupyter and Zeppelin notebooks.

Hopsworks covers users, projects and datasets, analytics, metadata management and free-text search.

Users

authentication (normal/two-factor).
roles.

Projects and Datasets

project-based multi-tenancy with dynamic roles.
the ability to share DataSets securely between projects (reuse of DataSets without copying).
DataSet browser.
import/export of data using the Browser.
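
The sharing model in the list above can be pictured with a toy sketch. This is not how Hopsworks implements it; the class names, the role strings and the `share`/`can_read` helpers are all hypothetical and only illustrate two ideas: a user's role is attached to a project rather than to the user globally, and sharing a DataSet adds a reference to it instead of copying its data.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Dataset:
    name: str
    owner: str  # name of the owning project
    shared_with: Set[str] = field(default_factory=set)  # projects granted access


@dataclass
class Project:
    name: str
    # user -> role within this project; the same user may hold a
    # different role in another project (hypothetical role names).
    members: Dict[str, str] = field(default_factory=dict)


def share(dataset: Dataset, target: Project) -> None:
    """Grant another project access by reference; no data is copied."""
    dataset.shared_with.add(target.name)


def can_read(user: str, project: Project, dataset: Dataset) -> bool:
    """A user reaches a dataset only through a project they belong to."""
    if user not in project.members:
        return False
    return dataset.owner == project.name or project.name in dataset.shared_with


if __name__ == "__main__":
    genomics = Project("genomics", {"alice": "data owner"})
    imaging = Project("imaging", {"bob": "data scientist"})
    reads = Dataset("reads", owner="genomics")

    print(can_read("bob", imaging, reads))  # False: not shared yet
    share(reads, imaging)
    print(can_read("bob", imaging, reads))  # True: shared by reference
```

Because access is always evaluated through a project, removing a user from a project removes their access to every DataSet that project owns or has been given.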

Analytics

interactive data analytics with Jupyter and Zeppelin notebooks.
Apache Spark as a general-purpose cluster computing system and analytics engine for big data processing.
high-level APIs in Java, Scala, Python and R.
long-running jobs and batch-based submission of analytics with Spark.
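
As an illustration of the Python API mentioned above, here is a minimal word-count sketch of the kind that could be run from a notebook or submitted as a batch job. The input path is a hypothetical placeholder, and the sketch assumes a PySpark installation is available where the job runs. The per-line logic is kept in a plain function so it can be checked locally without a cluster.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tokenize(line: str) -> List[str]:
    """Split one line of text into lowercase words (plain Python, no Spark)."""
    return line.lower().split()


def count_words_local(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Reference implementation for checking the Spark job on small inputs."""
    counts = Counter(word for line in lines for word in tokenize(line))
    return sorted(counts.items())


def count_words_spark(input_path: str) -> List[Tuple[str, int]]:
    """Same computation as a Spark job; input_path is a hypothetical path."""
    # Imported lazily so the helpers above work without a Spark installation.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()
    try:
        lines = spark.sparkContext.textFile(input_path)
        counts = (
            lines.flatMap(tokenize)            # one record per word
            .map(lambda word: (word, 1))       # pair each word with a count of 1
            .reduceByKey(lambda a, b: a + b)   # sum counts per word across the cluster
        )
        return sorted(counts.collect())
    finally:
        # Appropriate for a standalone batch job; in a notebook that already
        # provides a session, one would typically leave it running instead.
        spark.stop()
```

Note that `collect()` brings all results back to the driver, which is only sensible for small outputs; larger results would normally be written back to storage.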

Metadata Management

intuitive UI.
support for the design and entry of extended metadata for files and directories.

Free-text Search

Elasticsearch-backed free-text search for files/directories and their extended metadata.
global free-text search for projects and DataSets in the filesystem.
project-based free-text search of all files and extended metadata within a project.
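
The difference between global and project-based search over files and their extended metadata can be sketched with a toy in-memory model. This is not the platform's search implementation, which relies on a real search index; `FileEntry`, `search` and the sample paths and metadata keys are hypothetical, and matching here is a simple case-insensitive substring test.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FileEntry:
    project: str
    path: str
    metadata: Dict[str, str] = field(default_factory=dict)  # extended metadata


def _searchable_text(entry: FileEntry) -> str:
    """Concatenate the path with metadata keys and values, lowercased."""
    parts = [entry.path, *entry.metadata.keys(), *entry.metadata.values()]
    return " ".join(parts).lower()


def search(
    entries: List[FileEntry], query: str, project: Optional[str] = None
) -> List[str]:
    """Return paths whose name or metadata contain every query term.

    project=None searches across all projects (global search);
    otherwise only files belonging to the named project are considered.
    """
    terms = query.lower().split()
    hits = []
    for entry in entries:
        if project is not None and entry.project != project:
            continue
        text = _searchable_text(entry)
        if all(term in text for term in terms):
            hits.append(entry.path)
    return hits


if __name__ == "__main__":
    entries = [
        FileEntry("genomics", "/genomics/reads/sample1.fastq", {"tissue": "liver"}),
        FileEntry("imaging", "/imaging/scans/mri_001.nii", {"tissue": "brain"}),
    ]
    print(search(entries, "tissue"))                     # both paths
    print(search(entries, "tissue", project="imaging"))  # only the imaging file
```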