Big Data Analysis: Apache Cassandra

Apache Cassandra

Download the ppt: https://www.dropbox.com/scl/fi/sqtdndeiw297gbe17ohtr/07_Cassandra.pptx?rlkey=ngf7igwqqajk5l2eokf1o9486&dl=0

What is the Gossip Protocol?
The gossip protocol is a peer-to-peer communication mechanism that lets the nodes of a distributed cluster exchange state information with one another.
Key functions and characteristics of the gossip protocol include:
Sharing Metadata: Nodes use this protocol to share and discover information about the metadata of the network and the cluster.
Determining Data Responsibility: Through gossip, nodes communicate about which specific node is responsible for which token ranges (data partitions).
Request Coordination: When a client request hits any node in the cluster (acting as a coordinator), that node uses the information gained through the gossip protocol to identify the correct node responsible for that data and forwards the request accordingly.
Masterless Architecture Support: Because Cassandra has a masterless architecture where every node is equal, the gossip protocol is essential for keeping all nodes updated on the status and location of their peers without a central authority.
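The points above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not Cassandra's actual implementation: the Node class, the helper names, the integer tokens 0 to 75, and the fixed random seed are all hypothetical, and real clusters use hashed tokens and richer heartbeat state. Each node starts out knowing only its own token, gossips with random peers until all views agree, and can then act as coordinator by looking up the owner of a key's token.

```python
import random

class Node:
    def __init__(self, name, token):
        self.name = name
        # Each node starts out knowing only its own entry: name -> (version, token)
        self.view = {name: (1, token)}

    def gossip_with(self, peer):
        """Exchange views; both sides keep the entry with the highest version."""
        merged = dict(self.view)
        for name, entry in peer.view.items():
            if name not in merged or entry[0] > merged[name][0]:
                merged[name] = entry
        self.view = dict(merged)
        peer.view = dict(merged)

def run_gossip(nodes, rng, max_rounds=1000):
    """Run gossip rounds until every node holds the same view; return rounds used."""
    for round_no in range(1, max_rounds + 1):
        for node in nodes:
            peer = rng.choice([n for n in nodes if n is not node])
            node.gossip_with(peer)
        if all(n.view == nodes[0].view for n in nodes):
            return round_no
    raise RuntimeError("views did not converge")

def owner_of(view, key_token):
    """Owner is the first node whose token is >= key_token, wrapping around the ring."""
    ring = sorted((token, name) for name, (_, token) in view.items())
    for token, name in ring:
        if key_token <= token:
            return name
    return ring[0][1]

rng = random.Random(42)  # fixed seed (an assumption) so runs are repeatable
nodes = [Node(f"node{i}", token) for i, token in enumerate([0, 25, 50, 75])]
rounds = run_gossip(nodes, rng)

# Any node can now act as coordinator using only its own local view.
coordinator = nodes[3]
print(owner_of(coordinator.view, 30))  # token 30 falls in (25, 50], owned by node2
print(owner_of(coordinator.view, 80))  # beyond the highest token, wraps to node0
```

No node is special here: after convergence every node holds the same token map, which is what allows a request to land on any node and still be routed correctly.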
---------------------------------------------------------------------------------------------------------------------
What is Query-first modeling?
Query-first modeling is a database design methodology used in Apache Cassandra in which tables are designed to satisfy the exact queries an application will perform, rather than being modeled around entities and normalization as is common in relational (SQL) databases.
This approach involves the following key principles:

Focus on Fetching over Storage: Instead of thinking about how to store data efficiently to save space, developers must think about how they will fetch the data and what the specific use cases are. This ensures that when a query is fired, it hits a node that can return the data instantly without complex processing.
Denormalization: Unlike relational databases (SQL) that use normalization to avoid repeating data, query-first modeling favors denormalization. This means data is purposefully repeated across different tables so that all information needed for a specific query is available in one place, eliminating the need for time-consuming JOIN operations.
Optimization for Speed: The primary goal of this model is to retrieve data in a fraction of a second. At Big Data scale, joining several huge tables becomes the weakest link because it is too slow for real-time transactional workloads.
Primary Key and Partition Design: In this model, the primary key (specifically the partition key) is chosen based on how the application filters data. For example, if a website frequently searches for users based on their location, the table would be designed with "location" as the partition key to ensure all related data is stored together on the same node for fast retrieval.
Clustering Keys for Sorting: After the partition key is determined, clustering keys are used to define how the data is sorted within that partition, further optimizing the specific query's output.
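The principles above can be sketched in plain Python. This is only a toy model under stated assumptions, not Cassandra or its driver API: the class and function names, the columns, and the sample users are all made up for illustration. One table is partitioned by location and kept sorted by signup date (the clustering key), and the same user data is deliberately written a second time into a lookup-by-id table so that each query reads from exactly one place.

```python
from collections import defaultdict

class UsersByLocation:
    """Toy stand-in for a table with PRIMARY KEY ((location), signup_date, user_id)."""

    def __init__(self):
        # partition key -> rows of that partition, kept in clustering order
        self.partitions = defaultdict(list)

    def insert(self, location, signup_date, user_id, name):
        rows = self.partitions[location]
        rows.append((signup_date, user_id, name))
        rows.sort()  # clustering keys: signup_date first, then user_id

    def select_by_location(self, location):
        # A single-partition read: no scan over other locations, no join.
        return list(self.partitions.get(location, []))

users_by_location = UsersByLocation()
users_by_id = {}  # second, denormalized table serving the "fetch by id" query

def register_user(user_id, name, location, signup_date):
    """Write the same data to every table that serves a query."""
    users_by_location.insert(location, signup_date, user_id, name)
    users_by_id[user_id] = {"name": name, "location": location,
                            "signup_date": signup_date}

# Hypothetical sample data; ISO date strings sort chronologically.
register_user(1, "Asha", "Pune", "2024-03-10")
register_user(2, "Ben", "Oslo", "2024-01-05")
register_user(3, "Chen", "Pune", "2024-01-20")

print(users_by_location.select_by_location("Pune"))
print(users_by_id[2]["location"])
```

The cost of this design is extra storage and an extra write per table, which is the trade query-first modeling accepts in exchange for fast single-partition reads.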

In summary, while SQL modeling starts with the data's structure, Cassandra's query-first modeling starts with the application's requirements to ensure maximum performance and horizontal scalability.

------------------------------------------------------------------------------------------------------

Whether MongoDB or Cassandra is the better choice depends on your specific requirements for data flexibility, availability, and consistency. Below are examples of situations in which one outperforms the other.

Situation where MongoDB is better: Evolving Content Management

MongoDB is the stronger choice for applications whose data is unstructured or whose structure changes rapidly, such as a modern e-commerce platform or a content management system.

Reasoning: Because MongoDB is schema-less and stores data in flexible, JSON-like BSON documents, you can store items with entirely different attributes in the same collection without performing complex and time-consuming database migrations.
Advantage: This makes it ideal for startups or agile projects where the data model is constantly evolving as new features are added. It also offers richer query capabilities (MQL) and more intuitive APIs for complex data transformations compared to Cassandra.
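The flexible-document idea can be illustrated with plain Python dictionaries standing in for BSON documents. This is a rough sketch and not the MongoDB API: the find helper, the field names, and the product data are all hypothetical. It shows differently shaped items living in one collection, and a new attribute appearing on a later document without any change to the earlier ones.

```python
# One "collection" holding documents with entirely different attributes.
products = [
    {"_id": 1, "type": "book", "title": "Example Novel", "pages": 300},
    {"_id": 2, "type": "laptop", "brand": "ExampleBrand",
     "specs": {"ram_gb": 16, "ports": ["usb-c", "hdmi"]}},  # nested data
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria (hypothetical helper)."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

# A new feature adds an "ebook" flag; existing documents need no migration.
products.append({"_id": 3, "type": "book", "title": "Another Novel",
                 "pages": 250, "ebook": True})

books = find(products, type="book")
print([doc["title"] for doc in books])
print([doc.get("ebook", False) for doc in books])  # older documents lack the field
```

Readers of such a collection have to tolerate missing fields, as the final line does with a default value, which is the price of skipping schema migrations.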

Situation where Cassandra is better: Mission-Critical Global Logging

Cassandra is the better choice for high-velocity, globally distributed systems that must stay available even when individual nodes fail, such as a real-time sensor logging system for a logistics company or a global messaging app.

Reasoning: Cassandra uses a masterless, peer-to-peer architecture where every node is equal and any node can accept writes. In a MongoDB replica set, by contrast, all writes go through a single primary node, so writes are briefly unavailable while a new primary is elected after a failure. Cassandra has no primary to elect and therefore no single point of failure.
Advantage: It provides linear scalability and is specifically optimized for fast write performance, making it the right choice for "always on" mission-critical data. Its tunable consistency also allows developers to choose exactly how many nodes must acknowledge a request, balancing speed and data integrity per query.
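The trade-off behind tunable consistency can be made concrete with a little arithmetic. The sketch below assumes a replication factor of 3 and uses hypothetical helper names; ONE, QUORUM, and ALL are Cassandra consistency levels, but the code only models the counting rule and does not talk to a cluster. A read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledged the write plus the number consulted by the read exceeds the replication factor, because the two sets must then share at least one replica.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

def read_sees_latest_write(write_acks, read_acks, replication_factor):
    """True when every read replica set must overlap every write replica set."""
    return write_acks + read_acks > replication_factor

RF = 3  # assumed replication factor
levels = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}

for write_level, w in levels.items():
    for read_level, r in levels.items():
        verdict = "overlap" if read_sees_latest_write(w, r, RF) else "may be stale"
        print(f"write={write_level:<6} read={read_level:<6} {verdict}")
```

Writing and reading at ONE is the fastest combination but can return stale data, while QUORUM on both sides gives overlapping replica sets and still tolerates one replica being down when the replication factor is 3.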

Comparison Summary

Feature	MongoDB is better when...	Cassandra is better when...
Data Structure	Data is dynamic, nested, or unstructured.	Data is in a more fixed, structured format.
Availability	Small amounts of downtime during failover are acceptable.	You need continuous availability, even during node failures.
Scalability	You need granular control over sharding based on logic.	You need linear scaling by simply adding hardware.
Consistency	Strong consistency is required (typically a CP system).	Eventual consistency is acceptable (typically an AP system, with consistency tunable per query).

Last modified: Wednesday, 6 May 2026, 9:58 AM