
Data Engineering Stack

A practical open-source data engineering stack focused on batch and analytical workloads. It combines scalable storage, pipeline automation, orchestration, visualization, and observability for data teams.

Tags: data-engineering, etl, analytics, orchestration, observability

Tools in this Stack

SeaweedFS: Data Storage / Data Lake

SeaweedFS is an open-source distributed file system that supports WebDAV, the S3 API, FUSE mounts, and HDFS, among other interfaces. It is optimized for large numbers of small files and makes it easy to add capacity. `Apache-2.0` `Go`

open source

parapipe: ETL / Data Pipelines

A FIFO pipeline that runs each stage in parallel while preserving the order of messages and results.

open source
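The ordering guarantee described for parapipe can be illustrated with a minimal sketch using only Python's standard library. This is not parapipe's API (parapipe is a separate library); `run_stage` and `run_pipeline` are hypothetical helper names chosen for illustration. Items within a stage are processed concurrently, yet results come back in input order because `Executor.map` returns results in the order its inputs were submitted.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def run_stage(func, items, workers=4):
    """Apply func to every item concurrently, returning results in input order.

    Hypothetical helper: Executor.map yields results in the order the inputs
    were submitted, regardless of which worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


def run_pipeline(stages, items, workers=4):
    """Run items through each stage in turn (hypothetical helper)."""
    for stage in stages:
        items = run_stage(stage, items, workers)
    return items


def slow_double(x):
    # Later items sleep less, so completion order differs from input order.
    time.sleep(0.01 * (5 - x))
    return x * 2


def add_one(x):
    return x + 1


result = run_pipeline([slow_double, add_one], [1, 2, 3, 4])
print(result)  # [3, 5, 7, 9]
```

Unlike a streaming pipeline, this sketch finishes one stage for the whole batch before starting the next. It only demonstrates the combination of parallel execution and preserved ordering.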

Rundeck: Orchestration

Enables self-service operations by giving specific users access to your existing tools, services, and scripts.

open source, Apache-2.0

JupyterLab: Data Analysis

Web-based environment for interactive and reproducible computing. ([Demo](https://mybinder.org/v2/gh/jupyterlab/jupyterlab-demo/try.jupyter.org?urlpath=lab), [Source Code](https://github.com/jupyterlab/jupyterlab/)) `BSD-3-Clause` `Python/Docker`

open source

DataEase: Visualization / BI

🔥 An open-source BI tool that anyone can use, and a powerful data visualization tool. An open-source alternative to Tableau.

open source, GPL-3.0

Tempo: Monitoring / Tracing

Distributed tracing backend for collecting and querying traces from pipeline jobs.

open source

Why This Stack Works

This stack is designed around the core lifecycle of data engineering: ingest, transform, store, analyze, and monitor. SeaweedFS provides a distributed storage layer suited to data lake patterns, while parapipe supports repeatable, code-first ETL pipelines. Rundeck complements these by handling scheduling, dependency management, and operational control of data jobs in production. For analytics and insight delivery, JupyterLab supports exploratory data analysis and experimentation, and DataEase serves as the visualization and BI layer for stakeholders. Tempo adds observability through distributed tracing, helping teams understand pipeline performance and quickly diagnose bottlenecks or failures. Together, these tools cover each lifecycle stage and leave clear upgrade paths to more specialized databases or streaming platforms if needed.
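To make the idea of tracing pipeline stages concrete, here is a minimal in-process sketch in Python. It does not use Tempo or any tracing library; the `span` helper, the `spans` list, and the toy `run_job` function are all hypothetical names for illustration. A real deployment would typically emit spans through an instrumentation library and ship them to the tracing backend rather than keeping them in a list.

```python
import time
from contextlib import contextmanager

spans = []  # collected (stage name, duration in seconds) pairs


@contextmanager
def span(name):
    """Record how long the enclosed block takes (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the stage raises, so failed stages stay visible.
        spans.append((name, time.perf_counter() - start))


def run_job(rows):
    """Toy ingest -> transform -> store job with one span per stage."""
    with span("ingest"):
        data = list(rows)
    with span("transform"):
        data = [r * 2 for r in data]
    with span("store"):
        stored = {"count": len(data), "total": sum(data)}
    return stored


summary = run_job(range(5))
slowest = max(spans, key=lambda s: s[1])
print(summary)  # {'count': 5, 'total': 20}
print("slowest stage:", slowest[0])
```

Comparing per-stage durations in this way is the basic mechanism by which traces point to a bottleneck: the stage with the longest span is the first place to look.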
