Apache Cassandra — The minimum internals you need to know

For effective data modeling and comparison with other databases

We will discuss two parts here: first, the database design internals that help you compare databases, and second, the main intuition behind auto-sharding/auto-scaling in Cassandra and how to model your data to align with that model for the best performance.

Part 1: Database Architecture: Master-Slave and Masterless, and Their Impact on HA and Scalability

There are two broad types of HA architecture: master-slave, and masterless (or master-master).

Here is an interesting Stack Overflow Q&A that sums up quite simply one major trade-off between these two types of architecture.

Q. I’ve heard about two kinds of database architectures: master-master and master-slave. Isn’t master-master more suitable for today’s web, because it’s like Git: every unit has the whole set of data, and if one goes down, it doesn’t quite matter.

A. There’s a fundamental tension:

One copy: consistency is easy, but if it happens to be down everybody is out of the water, and if people are remote then they may pay horrid communication costs. Bring portable devices, which may need to operate disconnected, into the picture and one copy won’t cut it.

Master-Slave: consistency is not too difficult because each piece of data has exactly one owning master. But then what do you do if you can’t see that master? Some kind of postponed work is needed.

Master-Master: well, if you can make it work then it seems to offer everything: no single point of failure, everyone can work all the time. Trouble is, it is very hard to preserve absolute consistency. See the Wikipedia article for more.

https://stackoverflow.com/questions/3736969/master-master-vs-master-slave-database-architecture

In master-slave, the master is the one that generally does the writes, and reads can be distributed across master and slave; the slave is like a hot standby. The main problem arises when there is an automatic switchover facility for HA that kicks in when a master dies.

I will add a word here about database clusters. I worked briefly on a project with a huge Oracle RAC system, and have seen the problems of maintaining it as the data scaled out over time. We needed Oracle support and also an expert in storage/SAN networking to balance disk usage. I am, however, no expert. With this disclaimer: Oracle RAC is said to be masterless, but I will consider it a pseudo-master-slave architecture, as there is a shared ‘master’ disk that is the basis of its architecture. This means that the slaves (multiple Oracle instances on different nodes) can scale reads, but when it comes to writes things are not that easy. Here is a quote from a better expert:

I’ll start this blog post with a quick disclaimer. I’m what you would call a “born and raised” Oracle DBA. My first job, 15 years ago, had me responsible for administration and developing code on production Oracle 8 databases. Since then, I’ve had the opportunity to work as a database architect and administrator with all Oracle versions up to and including Oracle 12.2. Throughout my career, I’ve delivered a lot of successful projects using Oracle as the relational database componen…

Although you can scale read performance easily by adding more cluster nodes, scaling write performance is a more complex subject. Technically, Oracle RAC can scale writes and reads together when adding new nodes to the cluster, but attempts from multiple sessions to modify rows that reside in the same physical Oracle block (the lowest level of logical I/O performed by the database) can cause write overhead for the requested block and affect write performance. This is well-known phenomena and why RAC-Aware applications are a real thing in the real world. …

With the limitations for pure write scale-out, many Oracle RAC customers choose to split their RAC clusters into multiple “services,” which are logical groupings of nodes in the same RAC cluster.

· Splitting writes from different individual “modules” in the application (that is, groups of independent tables) to different nodes in the cluster. This is also known as “application partitioning” (not to be confused with database table partitions).
· In extremely un-optimized workloads with high concurrency, directing all writes to a single RAC node and load-balancing only the reads.

https://aws.amazon.com/blogs/database/amazon-aurora-as-an-alternative-to-oracle-rac/

(Note that for scalability there can be clusters of master-slave nodes handling different tables, but that will be discussed later.)

Obviously, the automatic switchover is done by a third node that is neither master nor slave, and all it can know is whether the master appears to be down or not (to it, a network failure is also a master failure). This is fundamentally flawed.

The Split-Brain Curse: High Availability in a Master-Slave Auto-Failover System

In a master-slave-based HA system where master and slaves run on different compute nodes (because there is a limit to vertical scalability), the split-brain syndrome is a curse that does not have a good solution.


The split-brain syndrome: if there is a network partition in a cluster of nodes, then which of the two nodes is the master and which is the slave? It depends on where the network partition happens. It seems easy to solve, but unless there is some guarantee that the third node/common node has 100% connection reliability with the other nodes, it is hard to resolve. We were using pgpool-II, and this was, I guess, one of the bugs that bit us. If you are the kind who thinks that rare things do not happen, remember that in the computer world they do; by that reasoning you would never walk on the road, since at any time a meteor could hit your head.
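To make the failure mode concrete, here is a minimal Python sketch. It is a toy model, not how pgpool or any real failover manager is implemented; the function names and the three-node "witness" layout are assumptions made only for illustration. It contrasts a naive "promote if I cannot see the master" rule with a majority-vote rule.

```python
def naive_should_promote(standby_sees_master: bool) -> bool:
    """Naive rule: promote the standby whenever it cannot reach the master."""
    return not standby_sees_master


def quorum_should_promote(votes_master_down: int, cluster_size: int) -> bool:
    """Promote only if a strict majority of all nodes agree the master is down."""
    return votes_master_down > cluster_size // 2


# Network partition: master and standby are both alive but cannot see each other.
master_alive = True
standby_sees_master = False

# Naive rule: the standby promotes itself while the old master keeps serving.
masters_after_naive = int(master_alive) + int(naive_should_promote(standby_sees_master))

# Majority rule in a hypothetical 3-node cluster (master, standby, witness):
# the witness still sees the master, so only 1 of 3 votes says "down".
masters_after_quorum = int(master_alive) + int(quorum_should_promote(1, 3))

print(masters_after_naive)   # 2 -> two masters, i.e. split brain
print(masters_after_quorum)  # 1
```

Note that the majority rule only moves the problem: it still relies on the witness being reachable, which is exactly the reliability assumption questioned above.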

Here is a short snippet about something I was part of trying to solve, analyzed by a colleague wrestling to make pgpool work a few years back: basically, automatic failover of a master-slave system.

· We have two pgpool and two postgresql services configured as master and standby in two VMs. In the case of postgresql, they are configured as active-standby. A virtual IP resides on the master pgpool and migrates to the standby pgpool in case of failure.
· In normal operation, all DB calls pass through the master pgpool, which redirects them to the master postgresql node.
· The standby postgresql keeps replicating data from the master node using WAL log synchronization.
· If the master postgresql node goes down/crashes, either of the two pgpools triggers a failover and promotes the standby postgresql as the new master. When the old master node comes back up, it is brought up as a standby node.

Problem: With the current implementation, we have the following issues/shortcomings:
· If the pgpool node that detects a master postgresql failure is on the same node as the failed master, then it has to trigger a remote failover to the other postgresql node using SSH. If there is an SSH failure at that moment, the failover will fail, resulting in a standby-standby situation.
· If a postgresql node gets detached from pgpool due to heavy load (this happens if pgpool is not able to make connections to postgresql), then there is no way to re-attach the node again. It has to be manually attached using the repmgr library.
· In the above case, if the slave node detaches itself and the master node goes down, then pgpool has no more nodes to trigger failover to. This again results in a standby-standby situation…

It is not just a Postgres problem; a general Google search on this will throw up many such problems in most software of this kind: Postgres, MySQL, Elasticsearch and so on.


Before we leave this, for those curious, you can see here the mechanism Oracle RAC uses to deal with split-brain (this will crop up in all master-slave architectures, but never in a true masterless system), where they assume the common shared disk is always available from all nodes in the cluster. I do not know the RAC structure in depth, but this looks like a classical distributed computing fallacy, or a single point of failure if not configured redundantly; on further reading, they do recommend covering this part.

Voting disk needs to be mirrored; should it become unavailable, cluster will come down. Hence, you should maintain multiple copies of the voting disks on separate disk LUNs so that you eliminate a Single Point of Failure (SPOF) in your Oracle 11g RAC configuration. http://oracleinaction.com/voting-disk/

Another, from a blog referred to from the Google Cloud Spanner page, which captures the essence of this problem.

We use MySQL to power our website, which allows us to serve millions of students every month, but is difficult to scale up: we need our database to handle more writes than a single machine can process. There are many solutions to this problem, but these can be complex to run or require extensive refactoring of your application’s SQL queries. https://quizlet.com/blog/quizlet-cloud-spanner

These kinds of scenarios are common, and many instances can be found of software trying to fix this. You may want to steer clear of it. The databases using master-slave (with or without automatic failover) include MySQL, Postgres, MongoDB and Oracle RAC (note that recent MySQL clustering seems to use a masterless concept, similar to or based on Paxos, but with limitations; read about MySQL Galera Cluster).

You may want to choose a database that supports masterless High Availability (also read Replication).

  1. Apache Cassandra

Cassandra has a peer-to-peer (or “masterless”) distributed “ring” architecture that is elegant, easy to set up, and maintain. In Cassandra, all nodes are the same; there is no concept of a master node, with all nodes communicating with each other via a gossip protocol. https://www.datastax.com/wp-content/uploads/2012/09/WP-DataStax-MultiDC.pdf

Apache Cassandra does not use Paxos for normal reads and writes, yet it has tunable consistency (traded off against availability) without the complexity/read slowness of Paxos consensus. It uses Paxos only for lightweight transactions (LWT). (Here is a gentle introduction to Paxos that seems easier to follow than others; I do not know in detail how it works.)

2. Google Cloud Spanner?

Spanner claims to be consistent and available

Despite being a global distributed system, Spanner claims to be consistent and highly available, which implies there are no partitions and thus many are skeptical. Does this mean that Spanner is a CA system as defined by CAP? The short answer is “no” technically, but “yes” in effect and its users can and do assume CA. The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system. We explore the impact of partitions below.

First, Google runs its own private global network. Spanner is not running over the public Internet; in fact, every Spanner packet flows only over Google-controlled routers and links (excluding any edge links to remote clients).

One subtle thing about Spanner is that it gets serializability from locks, but it gets external consistency (similar to linearizability) from TrueTime.

https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45855.pdf

3. CockroachDB is an open-source, on-premise database in the style of Cloud Spanner; it is highly available and strongly consistent, and uses a Paxos-type consensus algorithm.

Writes are serviced using the Raft consensus algorithm, a popular alternative to Paxos. https://www.cockroachlabs.com/docs/stable/strong-consistency.html

The main difference is that since CockroachDB does not have Google infrastructure to implement TrueTime API to synchronize the clocks across the distributed system, the consistency guarantee it provides is known as Serializability and not Linearizability (which Spanner provides). http://wp.sigmod.org/?p=2153

CockroachDB may be something to watch as it gets more stable.
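Coming back to Cassandra's tunable consistency mentioned above: the usual rule of thumb is that a read is guaranteed to overlap the latest acknowledged write when the number of replicas written plus the number of replicas read exceeds the replication factor. The sketch below only does that arithmetic; the function names are my own and it does not call any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority (Cassandra's QUORUM level)."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor: int, write_replicas: int, read_replicas: int) -> bool:
    """True when every read replica set must intersect every write replica set."""
    return write_replicas + read_replicas > replication_factor


def tolerated_failures(replication_factor: int, required_replicas: int) -> int:
    """How many replicas may be down while the operation can still succeed."""
    return replication_factor - required_replicas


rf = 3
print(quorum(rf))                                        # 2
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))   # True  (QUORUM writes, QUORUM reads)
print(read_overlaps_write(rf, 1, 1))                     # False (ONE writes, ONE reads)
print(tolerated_failures(rf, quorum(rf)))                # 1
print(tolerated_failures(rf, rf))                        # 0     (ALL: no replica may be down)
```

This is the trade-off in miniature: asking for more replicas per operation buys consistency and costs availability, and the choice is made per query rather than fixed by the architecture.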

Scalability: Application Sharding and Auto-Sharding

This takes us directly to the evolution of NoSQL databases. Database scaling is done via sharding; the key question is whether sharding is automatic or manual. By manual, I mean that application developers write custom code to distribute the data: application-level sharding. Automatic sharding is done by NoSQL databases like Cassandra, whereas with almost all older SQL-type databases (MySQL, Oracle, Postgres) one has to do sharding manually.

Auto-sharding is a key feature that ensures scalability without increasing complexity in the code.
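To show what "custom code to distribute the data" tends to look like, here is a minimal sketch of application-level sharding in Python. The shard names and the modulo routing rule are hypothetical, chosen only for illustration; real applications often use lookup tables or consistent hashing instead.

```python
import zlib


def shard_for(key: str, shards: list) -> str:
    """Route a key to a shard with a stable hash (crc32) modulo the shard count."""
    return shards[zlib.crc32(key.encode("utf-8")) % len(shards)]


three = ["db-0", "db-1", "db-2"]   # hypothetical shard names
four = three + ["db-3"]            # the same cluster after adding one shard

keys = [f"user-{i}" for i in range(1000)]
moved = sum(1 for k in keys if shard_for(k, three) != shard_for(k, four))
print(f"{moved} of {len(keys)} keys change shard when a fourth shard is added")
```

With this routing rule, adding a shard remaps most keys, so the application also has to own data migration and every query path has to know about `shard_for`. That is the complexity auto-sharding keeps out of application code.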

Here is a snippet from the net. It covers two aspects: the disk I/O part (which I guess early designers never thought would become a bottleneck later on with more data; Cassandra's designers knew this problem full well and designed to minimize disk seeks), and the other, more important one, which touches on application-level sharding.

Why doesn’t PostgreSQL naturally scale well? It comes down to the performance gap between RAM and disk.

But if the data is sufficiently large that we can’t fit all (similarly fixed-size) pages of our index in memory, then updating a random part of the tree can involve significant disk I/O as we read pages from disk into memory, modify in memory, and then write back out to disk (when evicted to make room for other pages). And a relational database like PostgreSQL keeps an index (or other data structure, such as a B-tree) for each table index, in order for values in that index to be found efficiently. So, the problem compounds as you index more columns.

In general, if you are writing a lot of data to a PostgreSQL table, at some point you’ll need partitioning. https://blog.timescale.com/scaling-partitioning-data-postgresql-10-explained-cd48a712a9a1

There is another part to this, and it relates to the master-slave architecture, in which the master is the one that writes and the slaves just act as standbys that replicate and serve reads. (More accurately, Oracle RAC and MongoDB replica sets are not strictly limited to a single master for writes and multiple slaves for reads. Oracle RAC uses shared storage with multiple master-slave sets to write to and read from; MongoDB similarly uses multiple replica sets, each being a master-slave combination, but without shared storage. Please see above, where I mentioned the practical limits of a pseudo-master-slave system such as a shared-disk system.)

Let us now see how this automatic sharding is done by Cassandra and what it means for data modelling.

Part 2: Cassandra Internals for Data Modelling

Cassandra Write: Intuition

Figure: Cassandra write path (Source)

Note the memory part and the disk part. The flush from memtable to SSTable is one operation, and the SSTable file, once written, is immutable (no more updates). Many people may have seen the above diagram and still missed a few parts. The SSTable flush happens periodically, when memory is full. The commit log also holds the data of the commit and is used for persistence and for recovery in scenarios like a power-off before flushing to an SSTable. It is always written in append mode and read only on startup. Since the SSTable and the commit log are different files, and since there is only one arm in a magnetic disk, the main guideline is to put the commit log on a different disk (not merely a different partition) from the SSTables (data directory).
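The write path described above can be sketched in a few lines of Python. This is a toy model under loud assumptions: `ToyNode`, its fields and the "flush after N keys" rule are invented for illustration, and real Cassandra flushes by memory thresholds and recycles commit log segments rather than clearing a list.

```python
class ToyNode:
    """Toy single-node write path: commit log append, memtable update, flush."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only; replayed only after a crash
        self.memtable = {}     # in memory: key -> (timestamp, value)
        self.sstables = []     # on disk: immutable, sorted by key
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))  # 1. sequential append
        self.memtable[key] = (timestamp, value)          # 2. update in memory
        if len(self.memtable) >= self.memtable_limit:    # 3. flush when "full"
            self.flush()

    def flush(self):
        # A tuple stands in for an immutable file: it is never modified again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs to be replayed


node = ToyNode(memtable_limit=2)
node.write("a", 1, timestamp=1)
node.write("b", 2, timestamp=2)    # memtable is full, so it is flushed
node.write("a", 10, timestamp=3)   # the update goes to a fresh memtable

print(node.sstables)    # [(('a', (1, 1)), ('b', (2, 2)))]
print(node.memtable)    # {'a': (3, 10)}
print(node.commit_log)  # [('a', 10, 3)]
```

Notice that no write ever seeks into an existing file: the commit log is only appended to and each SSTable is written once, which is why keeping the two on separate physical disks preserves sequential I/O.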

Cassandra performs very well on both spinning hard drives and solid state disks. In both cases, Cassandra’s sorted immutable SSTables allow for linear reads, few seeks, and few overwrites, maximizing throughput for HDDs and lifespan of SSDs by avoiding write amplification. However, when using spinning disks, it’s important that the commitlog (commitlog_directory) be on one physical disk (not simply a partition, but a physical disk), and the data files (data_file_directories) be set to a separate physical disk. By separating the commitlog from the data directory, writes can benefit from sequential appends to the commitlog without having to seek around the platter as reads request data from various SSTables on disk. http://cassandra.apache.org/doc/4.0/operating/hardware.html

Please note that the SSTable file is immutable. This means that after multiple flushes there will be many SSTables, so a read query may have to read multiple SSTables. Also, updates to rows are new inserts in another SSTable with a higher timestamp, and these have to be reconciled across the different SSTables when reading. To optimize this, there is periodic compaction, in which multiple SSTables are combined into a new SSTable file and the older ones are discarded.
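A minimal sketch of that reconciliation and of compaction, in Python. SSTables are modelled here as plain dicts of key -> (timestamp, value); that representation and the function names are assumptions for illustration only, and real compaction strategies are far more involved.

```python
def read(key, sstables):
    """Return the value with the highest timestamp across all SSTables, or None."""
    versions = [table[key] for table in sstables if key in table]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


def compact(sstables):
    """Merge several SSTables into one, keeping only the newest version per key."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return dict(sorted(merged.items()))


sstables = [
    {"a": (1, "old"), "b": (2, "x")},  # older flush
    {"a": (3, "new")},                 # later flush holding an update to "a"
]

print(read("a", sstables))  # new  (two SSTables had to be consulted)
print(compact(sstables))    # {'a': (3, 'new'), 'b': (2, 'x')}
```

After compaction a read of "a" touches one file instead of two, which is the read-side payoff for the extra background I/O.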

Note that deletes are like updates, but with a marker called a tombstone, and the data is actually removed during compaction. However, because of the complexity of a distributed database, there is additional safety (read: complexity) added, such as gc_grace_seconds, to prevent zombie rows. This is one of the reasons Cassandra does not like frequent deletes.
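Here is a hedged sketch of how a tombstone and the grace period interact during compaction. It reuses the toy dict-based SSTable model (key -> (timestamp, value)), uses seconds for timestamps to keep the arithmetic readable, and the `TOMBSTONE` sentinel is my own invention; 864000 seconds (10 days) is Cassandra's default for gc_grace_seconds.

```python
TOMBSTONE = object()        # hypothetical marker standing in for a delete
GC_GRACE_SECONDS = 864000   # Cassandra's default: 10 days


def compact_with_tombstones(sstables, now):
    """Keep the newest version per key; drop tombstones older than the grace period."""
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return {
        key: (timestamp, value)
        for key, (timestamp, value) in merged.items()
        if not (value is TOMBSTONE and now - timestamp > GC_GRACE_SECONDS)
    }


sstables = [
    {"a": (100, "v1")},       # original insert
    {"a": (200, TOMBSTONE)},  # later delete, written as a new record
]

soon = compact_with_tombstones(sstables, now=300)
later = compact_with_tombstones(sstables, now=200 + GC_GRACE_SECONDS + 1)

print(soon["a"][1] is TOMBSTONE)  # True: the tombstone is kept, the old value is gone
print(later)                      # {}: the tombstone itself is finally purged
```

The tombstone has to outlive the delete for a while because a replica that was down may still hold the old value; if the marker were purged before that replica was repaired, the old value could be copied back as a zombie row.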

If you want to get an intuition behind compaction and how it relates to very fast writes (the LSM storage engine), you can read more here:

These SSTables might contain outdated data, e.g., different SSTables might contain both an old value and new value of the same cell, or an old value for a cell later deleted. That is fine, as Cassandra uses timestamps on each value or deletion to figure out which is the most recent value. However, it is a waste of disk space. It also slows down reads: different SSTables can hold different columns of the same row, so a query might need to read from multiple SSTables to compose its result.

For these reasons, compaction is needed. Compaction is the process of reading several SSTables and outputting one SSTable containing the merged, most recent, information.

This technique, of keeping sorted files and merging them, is a well-known one and often called Log-Structured Merge (LSM) tree.

https://github.com/scylladb/scylla/wiki/SSTable-compaction-and-compaction-strategies + others

This blog covers the internals of LSM if you are interested.

We have skipped some parts here. One main part is replication. When we need to distribute the data across multiple nodes for data availability (read: data safety), each write has to be replicated to as many nodes as the replication factor specifies.

Also, when there are multiple nodes, which node should a client connect to?

It connects to any node whose IP it has, and that node becomes the coordinator node for the client.

The coordinator node is typically chosen by an algorithm which takes “network distance” into account. Any node can act as the coordinator, and at first, requests will be sent to the nodes which your driver knows about….The coordinator only stores data locally (on a write) if it ends up being one of the nodes responsible for the data’s token range. https://stackoverflow.com/questions/32867869/how-cassandra-chooses-the-coordinator-node-and-the-replication-nodes

Role of the PARTITION KEY in Writes

Now let us see how the auto-sharding takes place. Suppose there are three nodes in a Cassandra cluster. Each node will own a particular token range.

Assume a particular row is inserted. Cassandra takes the value of the partition key column and feeds it to a hash function, which tells which bucket (token range, and hence node) the row has to be written to.

It applies the same function to the partition key value in the WHERE clause of a read query, which yields exactly the same node to which the row was written.
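The token lookup for the three-node example can be sketched as follows. This is a toy ring: the token space of 0..299, the node names and the use of crc32 are assumptions for illustration (Cassandra's default partitioner is Murmur3 over a much larger token space), and replicas are placed on the next nodes clockwise, as the simplest replication strategy does.

```python
import bisect
import zlib


class ToyRing:
    """Toy token ring: each node owns the token range ending at its own token."""

    TOKEN_SPACE = 300  # toy token space 0..299

    def __init__(self, node_tokens):
        self.ring = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in self.ring]

    @classmethod
    def token_for(cls, partition_key: str) -> int:
        # Stand-in hash; the same key always yields the same token.
        return zlib.crc32(partition_key.encode("utf-8")) % cls.TOKEN_SPACE

    def replicas(self, partition_key: str, replication_factor: int = 1):
        """Owner of the key's token, followed by the next nodes clockwise."""
        token = self.token_for(partition_key)
        start = bisect.bisect_left(self.tokens, token) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(replication_factor)]


ring = ToyRing({"node-A": 99, "node-B": 199, "node-C": 299})

written_to = ring.replicas("videoid-42", replication_factor=2)  # on INSERT
read_from = ring.replicas("videoid-42", replication_factor=2)   # on SELECT ... WHERE
print(written_to == read_from)  # True: reads land where the writes went
```

Any node can run this computation, which is why any node can act as coordinator: no central lookup table or master is needed to find the data.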

A primary key should be unique. More specifically, the full primary key (partition key plus any clustering keys) identifies a row, and all columns of the partition key are needed in the WHERE clause. (Cassandra does not do a read before a write, so there is no uniqueness constraint check like the primary key of relational databases; an insert with an existing primary key simply overwrites that row.)

The partition key has a special use in Apache Cassandra beyond showing the uniqueness of the record in the database. https://www.datastax.com/dev/blog/the-most-important-thing-to-know-in-cassandra-data-modeling-the-primary-key

The relation between PRIMARY KEY and PARTITION KEY

PARTITION KEY == first key in PRIMARY KEY; the rest are clustering keys

Example 1: PARTITION KEY == PRIMARY KEY == videoid

CREATE TABLE videos (
  …
  PRIMARY KEY (videoid)
);

Example 2: PARTITION KEY == userid; the rest of the PRIMARY KEY columns are clustering keys for ordering/sorting the rows within a partition

CREATE TABLE user_videos (
  …
  PRIMARY KEY (userid, added_date, videoid)
);

Example 3: COMPOSITE PARTITION KEY == (race_year, race_name)

CREATE TABLE rank_by_year_and_name (
  …
  PRIMARY KEY ((race_year, race_name), rank)
);

Now here is the main intuition. Part 1

For writes to be distributed and scaled, the partition key should be chosen so that it distributes writes in a balanced way across all nodes.

But don’t you think it is common sense that if a read query has to touch all the nodes in the network it will be slow? Yes, you are right, and that is what I wanted to highlight. Before that, let us go shallowly into the Cassandra read path.

Now here is the main intuition. Part 2

For reads NOT to be distributed across multiple nodes (that is, fetched and combined from multiple nodes), a read triggered by a client query should fall in a single partition (forget replication for simplicity).

This is illustrated beautifully in the diagram below.

You can see how the COMPOSITE PARTITION KEY is modeled so that writes are distributed across nodes and reads for a particular state land in a single partition.

Source: http://db.geeksinsight.com/2016/07/19/cassandra-for-oracle-dbas-part-2-three-things-you-need-to-know/

To have good read performance (a fast query), we need the data for a query to be in a single partition, read from one node. There is a balance between write distribution and read consolidation that you need to achieve, and you need to know your data and your queries to find it.
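The balance can be seen with a small experiment in Python. The data set (four states with three cities each), the query ("all rows for one state") and the helper names are made up for illustration; the point is only to count partitions under three candidate partition keys.

```python
from collections import defaultdict

# Hypothetical rows: (state, city)
rows = [(state, f"{state}-city-{i}")
        for state in ("CA", "NY", "TX", "WA")
        for i in range(3)]


def partitions(rows, partition_key):
    """Group rows by the value of the chosen partition key."""
    parts = defaultdict(list)
    for row in rows:
        parts[partition_key(row)].append(row)
    return parts


def partitions_read_for_state(parts, state):
    """How many partitions the query 'all rows for this state' must touch."""
    return sum(1 for part in parts.values() if any(r[0] == state for r in part))


by_city = partitions(rows, lambda r: r[1])   # very fine-grained key
by_state = partitions(rows, lambda r: r[0])  # key that matches the query
one_big = partitions(rows, lambda r: "all")  # everything in one partition

for name, parts in (("city", by_city), ("state", by_state), ("constant", one_big)):
    print(name, len(parts), partitions_read_for_state(parts, "CA"))
# city 12 3      -> writes spread well, but the read touches 3 partitions
# state 4 1      -> writes still spread, and the read touches 1 partition
# constant 1 1   -> the read is cheap, but every write hits the same partition
```

Partitioning by state is the sweet spot for this particular query: enough partitions to spread over the nodes, yet one partition per read. A different query would call for a different key, which is why the modeling starts from the queries.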

Conflicting Rules?

If it’s good to minimize the number of partitions that you read from, why not put everything in a single big partition? You would end up violating Rule #1, which is to spread data evenly around the cluster. The point is, these two goals often conflict, so you’ll need to try to balance them.

Model Around Your Queries

The way to minimize partition reads is to model your data to fit your queries. Don’t model around relations. Don’t model around objects. Model around your queries. Here’s how you do that. https://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling

This is the most essential skill one needs when modeling data for Cassandra.


A more detailed example of modeling the partition key, along with some explanation of how the CAP theorem applies to Cassandra with tunable consistency, is described in part 2 of this series: https://hackernoon.com/using-apache-cassandra-a-few-things-before-you-start-ac599926e4b8