OverviewArchitectureSingle Node ClusterExamplesAmazon Reviews(create indexes)TPCH Benchmark(perform joins)Foreign DataFile WrapperMongoDB WrapperSQL on HadoopGoogle DremelFAQPseudo-distributedDistributedMultiple Node ClusterLinux NodesEC2 NodesPerformance BoostFeatures Not in v2.0Before Production |
OverviewCitus DB is a scalable analytics database that is built on top of PostgreSQL. Designed with parallelism in mind, it is the first such database that enables running distributed SQL queries on data that's external to the database. Citus DB gives flexible and fast access to large volumes of data. Datasets that are append-only are particularly applicable; and these include user actions, event streams, log files, and machine generated data. Citus DB partitions these data and the queries that operate on the data; and efficiently executes queries that involve look-ups, complex selections, groupings and orderings, and analytic functions. Further, Citus DB supports joins between one large and multiple small tables. Citus DB also enables real-time responsiveness. Query run times start at 100ms for simple queries, and increase depending on query complexity and the dataset size. One feature that is missing from Citus DB involves real-time inserts and updates; and the database doesn't yet fully support real-time analytics. High-level ArchitectureAt a high level, Citus DB distributes the data across a cluster of nodes, and then processes incoming analytic queries in parallel across these nodes. Citus DB achieves this by making three particular changes to the underlying database:
These three changes closely align Citus DB's architecture with that of Hadoop's. In fact, they allow us to combine the scalability and fault tolerance of Hadoop with the performance and SQL-compliance of databases. Users can easily scale out a cluster as more data are added, take advantage of standard data visualization tools, and also benefit from performance improvements that databases typically employ (indexes, join order optimizations, etc).
|