What is Citus Data about?

CitusDB is an analytics database that modifies and extends PostgreSQL for scalability. Users talk to CitusDB's master node as they would to a regular database, and the master node partitions the data and queries across the worker nodes in the cluster. The underlying architecture closely resembles that of Hadoop.

In other words, CitusDB combines the SQL expressiveness and performance of relational databases with the scalability and availability of Hadoop in a single, uniform product.
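The master/worker split described above can be pictured with a toy scatter-gather sketch. This is not CitusDB's implementation: the hash-based partitioning, the function names, and the in-process "workers" are all assumptions made for illustration. It only shows the general shape of a master that splits rows into shards, lets each worker compute a partial result, and merges the partials.

```python
import zlib
from collections import Counter

def partition(rows, num_workers):
    """Master side: assign each (key, value) row to a shard.

    Hash partitioning on the key is an assumption for this sketch.
    """
    shards = [[] for _ in range(num_workers)]
    for key, value in rows:
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        shards[zlib.crc32(key.encode()) % num_workers].append((key, value))
    return shards

def worker_partial_counts(shard):
    """Worker side: count rows per key within a single shard."""
    return Counter(key for key, _ in shard)

def master_count_by_key(rows, num_workers=3):
    """Master side: scatter rows, gather partial counts, merge them."""
    partials = [worker_partial_counts(s) for s in partition(rows, num_workers)]
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

events = [("click", 1), ("view", 1), ("click", 1)]
print(master_count_by_key(events))
```

The caller sees a single function call, much as a user sees a single database, while the per-shard work could run on separate machines.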

When should I use CitusDB?

  • You have a transactional database (SQL or NoSQL) already in place for your transactional needs.
  • You want flexible access to your historic data (user actions, event streams, text logs, machine generated data) through SQL.
  • Parts of your data are growing at a pace that prohibits putting them into an expensive data warehouse.
  • You want to be able to scale your analytics, regardless of your data volume.
  • You want your customers to access their historic data with real-time responsiveness: complex aggregations over very large data sets should return within seconds, and key-value lookups in under a second.

When should I not use CitusDB?

CitusDB isn't an OLTP database, and doesn't do everything that Postgres can. You should not use CitusDB:

  • As the only database that manages both your transactions and analytics.
  • As a replacement for your expensive, full-featured enterprise data warehouse.
  • If your data volume does not require sharding across multiple machines.

How fast is CitusDB?

CitusDB outperforms purpose-built analytics appliances by more than 10x. The graph below uses the industry-standard TPC-H benchmark and compares the performance of a CitusDB cluster running on 100 EC2 instances to that of a dedicated analytics appliance.

What use-cases and data sets are best suited to CitusDB?

CitusDB is optimized for ad-hoc analysis, standard reporting, and data exploration on your append-only event data, but not for modifying that data in real time.

Data that has a natural temporal ordering (e.g. user actions, event data, text-based log files, machine generated data, clickstreams, ad impressions) and that grows rapidly is particularly well suited to CitusDB.

On these data, you can ask questions like:

  • Who are my most valuable/engaged customers, based on the activities they perform on my site?
  • Which group of users clicks on a given category of ads most often?
  • As an advertiser, what are all the sites I’ve published this specific ad on?
  • Is my CampaignA traffic converting better than my CampaignB traffic?
  • How many users in each age group used our newly released feature last month?
  • What are all the IP addresses involved with this specific event between these two dates?

Your queries can restrict results to a particular range (such as a time window), join multiple tables together, filter on complex selection criteria, group and sort results, perform aggregations, and execute other standard analytic functions.
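As a rough illustration, the question "How many users in each age group used our newly released feature last month?" might be written as the query below, which combines a date range, a join, a filter, grouping, sorting, and an aggregate. The table names, column names, and sample rows are hypothetical, and the query runs against an in-memory SQLite database only so that the example is self-contained; whether CitusDB accepts each construct exactly as written should be checked against its documentation.

```python
import sqlite3

# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY, age_group TEXT);
CREATE TABLE events (user_id INTEGER, feature TEXT, event_date TEXT);
INSERT INTO users VALUES (1, '18-24'), (2, '25-34'), (3, '25-34');
INSERT INTO events VALUES
    (1, 'new_feature', '2013-03-05'),
    (2, 'new_feature', '2013-03-12'),
    (2, 'new_feature', '2013-03-20'),
    (3, 'new_feature', '2013-03-28'),
    (3, 'search',      '2013-03-29'),
    (1, 'new_feature', '2013-04-02');
""")

QUERY = """
SELECT u.age_group, COUNT(DISTINCT e.user_id) AS user_count
FROM events e
JOIN users u ON u.user_id = e.user_id          -- join two tables
WHERE e.feature = 'new_feature'                -- selection criterion
  AND e.event_date >= '2013-03-01'             -- date range: one month
  AND e.event_date <  '2013-04-01'
GROUP BY u.age_group                           -- grouping and aggregation
ORDER BY user_count DESC, u.age_group;         -- sorting
"""

rows = conn.execute(QUERY).fetchall()
for age_group, user_count in rows:
    print(age_group, user_count)
```

With the sample rows above, the 25-34 group has two distinct users and the 18-24 group has one; the April event and the 'search' event are filtered out.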

How is CitusDB different than Hadoop?

CitusDB stores data in an extended PostgreSQL database, and therefore provides the SQL expressiveness and core performance benefits of databases (indexes, join optimizations, etc.) that are not available in Hadoop.

The database also enables real-time responsiveness. Simple queries can take as little as 100ms, and complex aggregations over large data sets complete within seconds.

How is CitusDB different than other analytics databases?

CitusDB was built from the start with true parallelism in mind, and can scale efficiently to hundreds of nodes. Its software-only architecture allows it to run anywhere, on premises or in the cloud, without any specific hardware requirements.

We know the devil is not in the claims we make, but in the details. Watch our video to see CitusDB in action, and then get started with our sample data sets or your own data.

What infrastructure do I need to get started?

Since you can set up multiple instances of CitusDB on a single box, you can get started right away by downloading and installing it.

CitusDB makes no special demands on the underlying hardware (such as disk arrays, a RAID setup, or InfiniBand connections), which means you can run it on your commodity hardware or on instances in the cloud.

The number of machines you need for production depends on (1) how much data you have and (2) what performance you need from your queries; the configuration of each machine also matters. For a given amount of data, CitusDB gives you linear performance improvements as you add nodes, so you choose where you want to be on the performance vs. cluster size curve.
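The trade-off between cluster size and query time can be estimated with simple arithmetic. The sketch below assumes ideal linear scaling for a fixed amount of data, as described above, and ignores per-machine configuration and coordination overhead; the function name and the sample numbers are hypothetical, and a real deployment should be measured rather than extrapolated.

```python
import math

def nodes_needed(baseline_nodes: int, baseline_seconds: float,
                 target_seconds: float) -> int:
    """Estimate the node count needed to hit a target query time.

    Assumes ideal linear scaling: doubling the nodes halves the query
    time for the same data volume. Real clusters will deviate from this.
    """
    if baseline_nodes < 1 or baseline_seconds <= 0 or target_seconds <= 0:
        raise ValueError("node count and timings must be positive")
    estimate = baseline_nodes * baseline_seconds / target_seconds
    return max(1, math.ceil(estimate))

# A query measured at 60 s on 4 nodes, with a 10 s target.
print(nodes_needed(4, 60, 10))
```

Under these assumptions, a query that takes 60 seconds on 4 nodes would need about 24 nodes to finish in 10 seconds.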

How can I guard against master node failures?

You can use PostgreSQL's streaming replication feature to replicate the master node's data in real time. If the master fails, one of its standbys can then take over from where the master left off. For details on setting this up, please refer to the PostgreSQL wiki.

What if I install as root and the installer can't determine the current user?

The installation will complete normally. In that case, our installer also creates a new postgres user and makes that user the owner of the database directories. To start up the database, you then need to switch to the postgres user by typing "su - postgres".

When I try to stage data, why do I get the error "could not connect to server: No route to host"?

You may have a firewall configured on your Linux instances. For testing purposes, you can temporarily disable it by running the command "sudo service iptables stop".
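Before and after changing the firewall, it can help to confirm whether the node's port is reachable at all. The following is a minimal sketch using only Python's standard library; the function name is hypothetical, and the host and port in the usage comment are placeholders that you would replace with the address your own worker node listens on.

```python
import socket

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        # Covers "No route to host", "Connection refused", and timeouts.
        print(f"{host}:{port} unreachable: {exc}")
        return False

# Example (placeholder host and port, adjust to your cluster):
#   can_connect("worker-1.example.com", 5432)
```

If the check fails with the firewall enabled and succeeds once it is stopped, the firewall rules are the likely cause of the staging error.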

On RHEL 5.x, why does psql fail to find the symbols defined in the libreadline shared object?

On RHEL 5.x distributions, libreadline's symbols are provided by libncurses. You therefore need to specify this library explicitly by running the command "export LD_PRELOAD=/usr/lib/libncurses.so".