Wednesday 18 May 2016

Distributed Deletes in Cassandra



A Cassandra cluster defines a ReplicationFactor that determines how many nodes each key and its associated columns are written to. The client controls how many replicas to block for on writes, and deletions are writes too. In particular, the client may, and typically will, specify a ConsistencyLevel lower than the cluster's ReplicationFactor. In that case the coordinating server node reports the write as successful even if some replicas are down or otherwise unresponsive.
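To make the relationship between ConsistencyLevel and ReplicationFactor concrete, here is a minimal Python sketch. It is a toy model, not the Cassandra driver API: the function names required_acks and write_succeeds are made up for illustration, and only the ONE, QUORUM and ALL levels are covered.

```python
# Toy model only: required_acks and write_succeeds are hypothetical helpers,
# not part of Cassandra or any driver.

def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements the coordinator waits for."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,  # a majority of the replicas
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def write_succeeds(consistency_level: str, replication_factor: int,
                   responsive_replicas: int) -> bool:
    """A write (or delete) is reported successful once enough replicas answer."""
    return responsive_replicas >= required_acks(consistency_level, replication_factor)


# With ReplicationFactor 3 and one replica down, a QUORUM delete still succeeds,
# so the down replica never sees it. An ALL delete would fail instead.
print(write_succeeds("QUORUM", 3, 2))  # True
print(write_succeeds("ALL", 3, 2))     # False
```

The QUORUM case is the interesting one for deletes: the client is told the delete worked while one replica still holds the old data.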

A delete operation can't just wipe out all traces of the data being removed immediately. If it did, and a replica did not receive the delete operation, then when that replica became available again it would treat the replicas that did receive the delete as having missed a write update, and repair them! So, instead of wiping out data on delete, Cassandra replaces it with a special value called a tombstone. The tombstone can then be propagated to replicas that missed the initial remove request.
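The difference between wiping data and writing a tombstone can be sketched with a small simulation. This is an illustrative model under simplifying assumptions, not Cassandra's implementation: each replica is a plain dict, each cell is a (timestamp, value) pair, reconciliation is last-write-wins, and the names TOMBSTONE, merge and read_repair are invented for the example.

```python
TOMBSTONE = object()  # hypothetical marker standing in for a deleted value


def merge(a, b):
    """Last-write-wins between two (timestamp, value) cells; None means no data."""
    if a is None:
        return b
    if b is None:
        return a
    return a if a[0] >= b[0] else b


def read_repair(replicas, key):
    """Pick the newest cell across replicas, copy it everywhere, return the value."""
    winner = None
    for replica in replicas:
        winner = merge(winner, replica.get(key))
    if winner is not None:
        for replica in replicas:
            replica[key] = winner
    if winner is None or winner[1] is TOMBSTONE:
        return None
    return winner[1]


def naive_delete_demo():
    replicas = [{"k": (1, "v")} for _ in range(3)]
    for replica in replicas[:2]:      # the third replica is down and misses the delete
        del replica["k"]              # wipe the data with no trace
    return read_repair(replicas, "k")


def tombstone_delete_demo():
    replicas = [{"k": (1, "v")} for _ in range(3)]
    for replica in replicas[:2]:      # the third replica is down and misses the delete
        replica["k"] = (2, TOMBSTONE) # record the delete as a newer write
    return read_repair(replicas, "k")


print(naive_delete_demo())      # v    (the deleted value comes back)
print(tombstone_delete_demo())  # None (the tombstone wins and is propagated)
```

In the naive case the replicas that applied the delete look like they simply missed a write, so the old value is copied back to them. With a tombstone, the delete carries a newer timestamp and overrides the stale value.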

Tombstones exist for a period of time defined by gc_grace_seconds (a table property). After data is marked with a tombstone, it is automatically removed during the normal compaction process, and the tombstone itself becomes eligible for removal once gc_grace_seconds has elapsed.
Facts about deleted data to keep in mind:
  • Cassandra does not immediately remove data marked for deletion from disk.
  • The deletion occurs during compaction. If you use the size-tiered or date-tiered compaction strategy, you can drop data immediately by manually starting the compaction process. Before doing so, understand the documented disadvantages of the process.
  • A deleted column can reappear if you do not run node repair routinely.
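The timing rule behind these points can be sketched in a few lines of Python. This is a simplification under stated assumptions: the helper names can_purge_tombstone and compact are hypothetical, timestamps are plain seconds, and a real compaction applies further conditions that are ignored here. The value 864000 (10 days) is the default gc_grace_seconds.

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # 10 days, the default table setting


def can_purge_tombstone(deleted_at: int, gc_grace_seconds: int, now: int) -> bool:
    """A tombstone may be dropped only after the grace period has fully passed."""
    return now - gc_grace_seconds > deleted_at


def compact(cells: dict, gc_grace_seconds: int, now: int) -> dict:
    """Keep live cells and young tombstones; drop tombstones past the grace period.

    cells maps key -> (value, deleted_at), where deleted_at is None for live data.
    """
    kept = {}
    for key, (value, deleted_at) in cells.items():
        if deleted_at is not None and can_purge_tombstone(deleted_at, gc_grace_seconds, now):
            continue  # tombstone and the data it shadows are gone for good
        kept[key] = (value, deleted_at)
    return kept


cells = {
    "live": ("v1", None),
    "old_delete": (None, 0),          # deleted at t=0
    "recent_delete": (None, 900000),  # deleted at t=900000
}
print(sorted(compact(cells, DEFAULT_GC_GRACE_SECONDS, now=1000000)))
# ['live', 'recent_delete']
```

The recent tombstone survives the compaction so that it can still reach replicas that missed the delete.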

Why deleted data can reappear

Marking data with a tombstone signals Cassandra to retry sending a delete request to a replica that was down at the time of the delete. If the replica comes back up within the grace period, it eventually receives the delete request. However, if a node is down longer than the grace period, it can miss the delete because the tombstone disappears after gc_grace_seconds. The returning node still holds the old value, and with no tombstone left to override it, that value looks like live data and can be copied back to the other replicas. Cassandra always attempts to replay missed updates when the node comes back up again. After a failure, it is a best practice to run node repair to fix inconsistencies across all of the replicas when bringing a node back into the cluster. If the node doesn't come back within gc_grace_seconds, remove the node, wipe it, and bootstrap it again.
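The recovery rule in this paragraph boils down to comparing the node's downtime with gc_grace_seconds. The sketch below encodes that comparison. The function name recovery_action and the returned labels are hypothetical, chosen only for illustration, and the boundary case is treated conservatively.

```python
def recovery_action(downtime_seconds: int, gc_grace_seconds: int) -> str:
    """Suggest how to bring a node back, based on how long it was down."""
    if downtime_seconds < gc_grace_seconds:
        # Tombstones written while the node was down still exist,
        # so a repair can deliver the missed deletes.
        return "repair"
    # Some tombstones may already be gone; the node's stale data could
    # resurrect deleted values, so start it again from empty.
    return "wipe and bootstrap"


ten_days = 864000  # default gc_grace_seconds
print(recovery_action(3 * 86400, ten_days))   # repair
print(recovery_action(12 * 86400, ten_days))  # wipe and bootstrap
```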
