Citus DB is an analytics database that modifies and extends PostgreSQL for scalability. Users talk to Citus DB's master node as they do with a regular database; and the master node partitions the data and queries across worker nodes in the cluster. The specifics of the underlying architecture closely resemble that of Hadoop's.
In other words, Citus DB combines the SQL expressiveness and performance of relational databases with the scalability and availability of Hadoop, in a single, uniform product.
CitusDB isn't an OLTP database, and doesn't do everything that Postgres can. You should not use CitusDB:
CitusDB outperforms purpose-built analytics appliances by more than 10x. The graph below uses the industry standard TPC-H benchmark, and compares the performance of a CitusDB cluster running on 100 EC2 instances to a dedicated analytics appliance.
CitusDB is optimized for performing ad-hoc analysis, standard reporting, and data exploration on your append-only event data; but not for modifying this data in real-time.
Data that has a natural temporal ordering (e.g. user actions, event data, text-based log files, machine generated data, clickstreams, ad impressions) and that grows rapidly is particularly well suited to CitusDB.
On these data, you can ask questions like:
Your queries can involve a particular range, join multiple tables together, filter based on complex selection criteria, group and sort results, perform aggregations, and execute other standard analytic functions.
Data should be in a structured form (csv, tab delimited, etc.) before it gets loaded into CitusDB. However, the underlying raw data can be semi-structured or unstructured. In such cases, a script (run for example by a cron job) converts the data prior to loading it into CitusDB.
CitusDB stores data in an extended PostgreSQL database; and therefore provides the SQL expressiveness and core performance benefits of databases (indexes, join optimizations, etc.) that are not available in Hadoop.
The database also enables real-time responsiveness. Simple queries can take as little as 100ms, and complex aggregations over large data sets complete within seconds.
CitusDB is built from the start with true parallelism in mind; and can efficiently scale to 100s of nodes. Its software-only architecture allows it to run anywhere: on-premise or in the cloud, without any specific expectations from hardware.
We know the devil is not in the claims we make, but in the details. Watch our video
Since you can set up multiple instances of CitusDB on a single box, you can get started right away by downloading and installing it.
We have no special expectations from the underlying hardware (such as disk arrays, RAID setup, infiniband connections), which means you can use CitusDB on your commodity hardware, or get instances in the cloud.
The number of machines you need for production depends on (1) how much data you have and (2) what performance you need out of your queries (the configuration of each machine is also relevant of course). CitusDB allows you to get linear performance improvements for a given amount of data by adding more nodes, so you choose where you want to be on the performance vs. cluster size curve.
You can use PostgreSQL's streaming replication feature to replicate the master node's data in real-time. If the master fails, one of its slaves can then take over from where the master left off. For details on setting this up, please refer to the PostgreSQL wiki.
The installation will complete normally. Our installer will also create a new postgres user in that case; and will make that user own the database directories. To start up the database, you then need to switch to the postgres user by typing "su - postgres".
Certain OS versions refer to localhost as localhost.localdomain; and the master node fails to match this name with the names in its membership file. To fix this issue, you need to change hostnames in /opt/citusdb/2.0/data/pg_worker_list.conf to localhost.localdomain. Then, you need to restart the master node to pick up your changes:
localhost# /opt/citusdb/2.0/bin/pg_ctl -D /opt/citusdb/2.0/data -l logfile restart
You may have a firewall configured on your Linux instances. You may temporarily disable this firewall for testing purposes by running the command "sudo service iptables stop".
On RHEL 5.x distributions, libreadline's symbols are provided by libncurses. You therefore need to explicitly specify this library through running the command "export LD_PRELOAD=/usr/lib/libncurses.so".