If data analysis is one of the core features of your product, then you probably already know that choosing a data storage and processing solution requires careful consideration. Let's discuss the pros and cons of the most popular choices: Redshift/EMR, DynamoDB + EMR, AWS RDS for PGSQL, and Cassandra + Spark.
Managed Amazon Redshift/EMR
Pro - It's fully managed by Amazon, so there is no need to hire support staff for maintenance.
Pro - It scales to petabyte size with very few mouse clicks.
Pro - Redshift is SQL-compatible, so you can use external BI tools to analyze data.
Pro - Redshift is fast for its price on typical BI queries.
Con - SQL is the only way to structure and analyze data inside Redshift. That is convenient for simple tasks, but for complex tasks such as social network analysis or text mining (or even custom AWS EMR jobs) you have to export all the data manually to external storage (S3, for example), run your analytics there, and load the results back into Redshift. The amount of manual work only grows with time, and Redshift eventually becomes an obstacle.
Con - Redshift's SQL dialect is very limited (a tradeoff for its performance). The main drawbacks are no secondary indexes, no full-text search, and no support for unstructured JSON data. That is usually fine for structured, pre-cleaned data, but it is really hard to store and analyze semi-structured data there (such as data from social networks or text from web pages).
Con - EMR integrates very weakly with Redshift: you have to export and import all data through S3.
Con - To write analytical EMR jobs, you have to hire people with pricey Big Data/Hadoop competence.
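To make the export/import round trip through S3 more concrete, here is a minimal sketch that builds the two SQL statements involved: an UNLOAD to push query results out of Redshift and a COPY to load processed results back in. The table name, bucket path, and IAM role below are hypothetical placeholders, and the Parquet format is just one possible choice; treat it as an illustration rather than a recommended pipeline.

```python
def unload_sql(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that exports query results to S3."""
    # UNLOAD takes the query as a quoted string, so inner single quotes are doubled.
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}') "
        f"TO '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )


def copy_sql(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY statement that loads files from S3 into a table."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS PARQUET"
    )


if __name__ == "__main__":
    # All names below are hypothetical and exist only for illustration.
    role = "arn:aws:iam::123456789012:role/example-redshift-role"
    print(unload_sql("SELECT * FROM posts WHERE lang = 'en'",
                     "s3://example-bucket/export/posts_", role))
    print(copy_sql("post_topics", "s3://example-bucket/results/topics/", role))
```

Every external analysis adds another pair of statements like these, plus the job that runs in between, which is where the manual work accumulates.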
Managed Amazon DynamoDB + EMR
PRO - It's fully managed by Amazon, so there is no need to hire support staff for maintenance.
PRO - It scales to petabyte size with very few mouse clicks.
CON - Pricing is opaque, and analytical workloads (those that need full-table scans, such as text mining) can be rather costly on large datasets.
CON - DynamoDB is a key-value NoSQL store. For most analytical queries you have to use EMR tools like Hive, which is rather slow: simple queries that typically execute instantly on Redshift/RDS can take minutes.
CON - DynamoDB is a closed technology that is unpopular in the big data community (mostly because of its price). We've also noticed that it is difficult to find people with the competence required to extend the system later.
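A rough way to reason about the cost of full-table scans is to count read units. The sketch below assumes DynamoDB's documented accounting, where a scan is billed on the data it reads (not the data it returns), one read unit covers up to 4 KB with strong consistency, and an eventually consistent read costs half as much. It ignores per-request rounding, so the result is only an estimate, and the price per million units is a parameter you have to take from the current AWS price list.

```python
import math

READ_UNIT_BYTES = 4 * 1024  # one strongly consistent read unit covers up to 4 KB


def scan_read_units(table_bytes: int, strongly_consistent: bool = False) -> int:
    """Estimate the read units consumed by one full scan of a table."""
    units = math.ceil(table_bytes / READ_UNIT_BYTES)
    if not strongly_consistent:
        # Eventually consistent reads cost half a unit per 4 KB.
        units = math.ceil(units / 2)
    return units


def scan_cost(table_bytes: int, price_per_million_units: float,
              strongly_consistent: bool = False) -> float:
    """Estimate the cost of one full scan for a caller-supplied unit price."""
    units = scan_read_units(table_bytes, strongly_consistent)
    return units / 1_000_000 * price_per_million_units


if __name__ == "__main__":
    one_tb = 10**12
    # The price here is a placeholder, not a real AWS price.
    print(scan_read_units(one_tb), scan_cost(one_tb, price_per_million_units=1.0))
```

Since an analytical job that scans the whole table pays this amount on every run, the cost grows with both data size and job frequency.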
Custom 'light' solution with AWS RDS for PGSQL
PRO - PostgreSQL is easy to deploy anywhere, has a very large community, and there are a lot of people with the required competence. You can use the hosted RDS version or install your own on EC2. It does not require heavy maintenance (unlike your own custom Hadoop cluster) and just works.
PRO - RDS PostgreSQL supports querying unstructured JSON data (so you can store social network data in a more natural way than in Redshift), full-text search (so you can search a user's friends for custom keywords), and multiple data types (such as arrays, which are very useful for storing social graph data).
PRO - It has full-featured, unrestricted SQL support for your analytical needs and external BI tools.
CON - PGSQL is not “big data” friendly. It is versatile for small to medium data, but in our experience it is difficult to scale to large datasets. It may speed up development while data size is not an issue, but scaling may become a serious problem later and require non-trivial architectural changes across the whole analytical backend.
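As an illustration of how the JSON, array, and full-text features combine, here is a minimal sketch of a single query that searches the profiles of a user's friends for a keyword. The `profiles` table and its columns are hypothetical (a `friend_ids` array column and a `data` JSON column with a `bio` field), and the connection is assumed to be any DB-API connection with psycopg2-style `%s` placeholders and context-manager cursors.

```python
# Hypothetical schema, for illustration only:
#   profiles(user_id bigint PRIMARY KEY, friend_ids bigint[], data jsonb)
FRIENDS_MENTIONING_SQL = """
SELECT f.user_id, f.data->>'name' AS name
FROM profiles AS u
JOIN profiles AS f ON f.user_id = ANY(u.friend_ids)
WHERE u.user_id = %s
  AND to_tsvector('english', f.data->>'bio') @@ plainto_tsquery('english', %s)
"""


def friends_mentioning(conn, user_id: int, keyword: str):
    """Return friends of user_id whose JSON 'bio' field matches the keyword."""
    # conn is assumed to be a psycopg2-style DB-API connection.
    with conn.cursor() as cur:
        cur.execute(FRIENDS_MENTIONING_SQL, (user_id, keyword))
        return cur.fetchall()
```

With psycopg2 installed, a call might look like `friends_mentioning(psycopg2.connect(dsn), 42, "hiking")`, where `dsn` is your own connection string.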
Custom 'heavy' solution with Cassandra + Spark
PRO - Cassandra + Spark can easily handle storing and analyzing petabytes of data.
PRO - Cassandra deals well with semi-structured data, which comes in handy when storing social network data such as a user's Facebook wall posts, friends, etc.
PRO - Spark includes good machine-learning (for example, dimensionality reduction) and graph-processing (usable for social network analysis) libraries. It also has a Python API, so you can use any other external tools from NumPy and scikit.
PRO - As a self-hosted solution, Cassandra + Spark is much more flexible for future complex analysis tasks.
PRO - Spark has SparkSQL, which makes integration with external BI tools easy.
CON - Cassandra may need higher-tier competence for the challenges that arise when scaling, which in turn may require additional investment in support staff.
CON - Spark is a rather new technology, although it has already positioned itself well within the big data community as a next-gen Hadoop. At present it may be hard to find people with Spark competence, but the user community is growing quickly, so the skills should become easier to find over time.
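To show how the pieces fit together from the Python API, here is a minimal sketch that loads a Cassandra table into Spark and exposes it to SparkSQL. It assumes an existing `SparkSession` configured with the DataStax spark-cassandra-connector (which provides the `org.apache.spark.sql.cassandra` data source); the keyspace, table, view, and column names are hypothetical.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def load_cassandra_table(spark, keyspace: str, table: str):
    """Read a Cassandra table as a Spark DataFrame via the connector."""
    return (
        spark.read.format(CASSANDRA_FORMAT)
        .options(keyspace=keyspace, table=table)
        .load()
    )


def register_for_sql(spark, keyspace: str, table: str, view_name: str):
    """Load a table and register it as a temporary view for SparkSQL queries."""
    df = load_cassandra_table(spark, keyspace, table)
    df.createOrReplaceTempView(view_name)
    return df


def top_posters(spark, view_name: str, limit: int = 10):
    """Count posts per user with plain SQL; assumes a user_id column exists."""
    return spark.sql(
        f"SELECT user_id, COUNT(*) AS posts FROM {view_name} "
        f"GROUP BY user_id ORDER BY posts DESC LIMIT {int(limit)}"
    )
```

A typical call sequence would be `register_for_sql(spark, "social", "wall_posts", "wall_posts")` followed by `top_posters(spark, "wall_posts").show()`. The resulting DataFrame can also be handed to Spark's machine-learning or graph libraries without leaving the cluster.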
The final choice is dictated by your current business priorities.
If you need to move fast with less routine maintenance and are not afraid of technical debt later, we recommend the 'light' solution or Amazon DynamoDB. If your top priority is system scalability, then the 'heavy' solution is the clear choice.