Big Data at AWS re:Invent 2016
AWS re:Invent 2016 has kicked off for me in the realm of Big Data. It’s a challenging topic and one of great interest to companies around the globe, so it was a no-brainer to be hanging around with folks at The Mirage for the Big Data talks. This blog post is a quick write-up of some interesting topics, announcements, and features of the various tools covered today.
Big Data in AWS
The Big Data Mini Con had no announcements of new services. However, Amazon’s ecosystem of Big Data tools is growing rapidly, and we got a solid introduction to what is currently available. Here are some of the more interesting ones:
- Import/Export Snowball – A nearly indestructible, petabyte-scale means of importing or exporting data into or out of Amazon S3.
- Kinesis – AWS’s flavor of data streams and real-time analytics processing.
- Redshift – A petabyte-scale data warehouse solution as a service.
- EMR – The Apache Hadoop ecosystem as a service.
- Data Pipeline – A data orchestration service for inter-AWS and on-premises workflows.
- S3 – Durable, effectively infinitely scalable, distributed object storage in the cloud.
- Direct Connect – Up to 10 Gbps of dedicated connectivity between your VPC and your on-premises network.
- Machine Learning – A real-time predictive modeling service.
- QuickSight – A business intelligence, data visualization and analytics tool.
Announcement: Data Transformation with Lambda on Kinesis Streams
A new feature is coming to Kinesis: the ability to transform your streaming data with Lambda. The idea is to have a Lambda function transform your data as it arrives instead of relying on an application running on EC2 to process the stream. It’s another very effective way of controlling costs and reducing the overhead of dealing with scaling yourself. To kick-start adoption, Amazon intends to provide a library of templates for common transformation use cases.
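Since the feature hasn’t shipped yet, the exact interface wasn’t shown. As a rough sketch, assuming the transformation function ends up receiving a standard Kinesis-triggered Lambda event, it might look something like this (the field names and the transformation itself are purely illustrative):

```python
import base64
import json


def handler(event, context):
    """Transform each record in a Kinesis-triggered Lambda invocation."""
    transformed = []
    for record in event.get('Records', []):
        # Kinesis payloads arrive base64-encoded inside the event
        raw = base64.b64decode(record['kinesis']['data']).decode('utf-8')
        payload = json.loads(raw)

        # Illustrative transformation: normalize keys and drop empty fields
        cleaned = {k.lower(): v for k, v in payload.items() if v not in (None, '')}
        transformed.append(cleaned)

    # In a real pipeline the transformed batch would be handed to the next
    # stage (e.g. written to Firehose, S3, or another stream)
    return {'transformedRecordCount': len(transformed)}
```

Presumably the template library Amazon mentioned will cover common cases like this (JSON cleanup, format conversion, filtering) out of the box.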
Amazon EMR
Recently, autoscaling in Amazon Elastic MapReduce (EMR) became generally available. You can now configure your EMR clusters to scale based on metrics in Amazon CloudWatch. It’s no dummy either: not only will it perform scaling operations based on your actual processing throughput, but it will also optimize your instance time.
- The number of metrics available in CloudWatch for EMR clusters is staggering, and it’s this level of integration that makes autoscaling so intelligent. Instead of relying on abstract information about CPU and memory, which can be hit or miss depending on your workloads, you can configure scaling events to trigger on real throughput metrics such as MapSlotsOpen, ReduceSlotsOpen or AppsPending, depending on which tools you’re running (see the policy sketch after this list).
- Instance-time optimization is built into EMR autoscaling. The cluster will automatically give you full utilization of an instance before terminating it due to a scale-down event. So when you scale up and purchase an hour of EC2 capacity, you get the entire hour of extra horsepower before it scales back down; you get all of the capacity you paid for, rather than paying for the full hour and only utilizing a few minutes of it.
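To make the metric-driven scaling concrete, here is a minimal boto3 sketch of attaching an autoscaling policy to an EMR instance group so it scales out when YARN applications start queuing. The cluster ID, instance group ID, thresholds, and capacities are placeholders to replace with your own:

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

emr.put_auto_scaling_policy(
    ClusterId='j-XXXXXXXXXXXXX',           # placeholder cluster ID
    InstanceGroupId='ig-XXXXXXXXXXXXX',    # placeholder task instance group ID
    AutoScalingPolicy={
        'Constraints': {'MinCapacity': 2, 'MaxCapacity': 10},
        'Rules': [{
            'Name': 'ScaleOutOnPendingApps',
            'Description': 'Add task capacity when YARN applications are queuing',
            'Action': {
                'SimpleScalingPolicyConfiguration': {
                    'AdjustmentType': 'CHANGE_IN_CAPACITY',
                    'ScalingAdjustment': 2,   # add two instances per scaling event
                    'CoolDown': 300,
                },
            },
            'Trigger': {
                'CloudWatchAlarmDefinition': {
                    'ComparisonOperator': 'GREATER_THAN',
                    'EvaluationPeriods': 1,
                    'MetricName': 'AppsPending',
                    'Namespace': 'AWS/ElasticMapReduce',
                    'Period': 300,
                    'Statistic': 'AVERAGE',
                    'Threshold': 2.0,
                    'Unit': 'COUNT',
                },
            },
        }],
    },
)
```

A matching scale-in rule (for example, triggered when AppsPending falls back to zero) would go in the same Rules list, so a single policy can both grow and shrink the fleet.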
Announcement: Advanced Spot Provisioning & Spot Block support
A feature coming soon to EMR is Advanced Spot Provisioning, an extension of spot instance requests specifically tailored for distributed systems in AWS. This new feature will allow you to configure spot requests for a list of instance types, so you can have a range of instance sizes running in your cluster and have spot instances requested differently for your core node fleet and your task node fleet. The provisioning tool will select the most optimal instance type and Availability Zone based on the capacity and price you have configured.
In addition to the provisioning optimizations mentioned above, EMR will also take advantage of Spot Instance Blocks. With traditional spot instances, you can get capacity at a steep discount, at the risk of losing it when normal demand increases. With Spot Instance Blocks, you can block off one to six hours of spot capacity. Spot Instance Blocks are priced differently than regular Spot Instances, but they can be a big source of cost reduction for larger data processing workloads.
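The EMR integration is still forthcoming, but Spot Blocks already exist as an EC2 primitive, so the pricing model is easy to experiment with today. Here is a minimal boto3 sketch of requesting defined-duration spot capacity; the AMI, subnet, and price are placeholders:

```python
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

response = ec2.request_spot_instances(
    SpotPrice='0.15',                   # example max hourly price, not a recommendation
    InstanceCount=4,
    BlockDurationMinutes=240,           # reserve the capacity for a fixed 4-hour block
    LaunchSpecification={
        'ImageId': 'ami-12345678',      # placeholder AMI
        'InstanceType': 'm4.xlarge',
        'SubnetId': 'subnet-12345678',  # placeholder subnet
    },
)
print(response['SpotInstanceRequests'][0]['SpotInstanceRequestId'])
```

BlockDurationMinutes must be a multiple of 60, up to 360, which is where the one-to-six-hour window comes from.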
Compute & Storage Decoupling
The final concept with EMR that was really driven home today was the decoupling of your compute and storage resources. In a traditional setup, storage and compute are bound together, meaning that when you need more storage you end up paying for more compute as well, and vice versa.
With the latest release of Amazon EMR (5.2.0, released recently), HBase can now be fully integrated with EMRFS, which uses Amazon S3 for storage. By moving your storage to an S3-backed solution, you no longer have to scale your cluster for storage demands, you get effectively infinite scalability, and you take advantage of S3’s eleven 9s of durability. Traditional HDFS is still installed on EMR, so you can take advantage of a distributed local data store as needed.
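For reference, here is a minimal boto3 sketch of launching an HBase-on-S3 cluster. The two configuration classifications are what point HBase at EMRFS; the bucket name, instance types, and IAM roles are placeholders to swap for your own:

```python
import boto3

emr = boto3.client('emr', region_name='us-east-1')

emr.run_job_flow(
    Name='hbase-on-s3',
    ReleaseLabel='emr-5.2.0',
    Applications=[{'Name': 'HBase'}],
    Configurations=[
        # Tell HBase to use S3 (via EMRFS) instead of local HDFS for its root dir
        {'Classification': 'hbase',
         'Properties': {'hbase.emr.storageMode': 's3'}},
        {'Classification': 'hbase-site',
         'Properties': {'hbase.rootdir': 's3://my-example-bucket/hbase'}},
    ],
    Instances={
        'MasterInstanceType': 'm4.large',
        'SlaveInstanceType': 'm4.large',
        'InstanceCount': 3,
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
```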
Amazon Redshift
Amazon Redshift is an exceptional tool for data warehousing and one of my favorite services offered by AWS. If you haven’t already, take some time to dive deep into the documentation and understand the complexities behind getting the most out of your architecture. This session was an excellent starter on concepts like data distribution, distribution keys, sort keys, and compression.
Keys
Optimizing your queries in Redshift, as in any other database, revolves around your keys. Since this is a lengthy topic, I’ll give you a quick overview of the important keys in Redshift (with a short table-definition sketch after the list); I recommend watching this session online or reviewing the documentation to fully understand how to architect them.
- Distribution keys determine how rows are physically distributed and collocated across the cluster. Using the distribution key in all of your JOINs is a best practice for performance, even when it may seem redundant.
- Sort keys are columns you specify for Redshift to optimize queries with. Redshift keeps block-level metadata for sort key columns, so it can skip entire blocks of data when scanning, dramatically improving query performance.
- Interleaved Sort Keys are designed for very large data sets and provide more performance as the table size increases. You can create interleaved sort keys with up to eight columns.
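As a concrete illustration of declaring these keys, here is a small sketch that creates a table over a psycopg2 connection to Redshift; the cluster endpoint, credentials, and table design are all made up for the example:

```python
import psycopg2

# Placeholder connection details -- point these at your own cluster
conn = psycopg2.connect(
    host='my-cluster.xxxxxxxxxxxx.us-east-1.redshift.amazonaws.com',
    port=5439, dbname='analytics', user='admin', password='change-me')

ddl = """
CREATE TABLE sales (
    sale_id      BIGINT,
    customer_id  BIGINT,
    sale_date    DATE,
    amount       DECIMAL(12, 2)
)
DISTKEY (customer_id)   -- rows for the same customer land on the same slice
SORTKEY (sale_date);    -- range-restricted scans on sale_date can skip blocks
"""

with conn, conn.cursor() as cur:
    cur.execute(ddl)
conn.close()
```

An interleaved variant would swap the last table attribute for `INTERLEAVED SORTKEY (sale_date, customer_id)`.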
AWS Schema Conversion Tool
The Schema Conversion Tool can be pointed at an existing database to copy its schema into a Redshift cluster and recommend schema changes for compatibility. Currently, it works with Oracle, Netezza, Greenplum, Teradata and, more recently, Redshift itself.
The Schema Conversion Tool now supports Redshift-to-Redshift conversion as a means of optimizing your existing data structure. By analyzing your existing Redshift cluster, the tool can provide recommendations for distribution keys and sort keys. Since the optimization service only provides recommendations based on existing usage and a fairly flat view of your data, you’ll want to test extensively before making a hard switchover.
Wrap Up
Today’s talks were enlightening, and I eagerly await the new big-data-as-a-service products to be announced in the coming days. If you want more information about these topics or want to hear the case studies yourself, I recommend watching the sessions when they are released:
- BDM205 – Big Data Mini Con State of the Union
- BDM401 – Deep Dive: Amazon EMR Best Practices & Design Patterns
- BDM402 – Best Practices for Data Warehousing with Amazon Redshift
For those of you at re:Invent, some of these sessions will be repeated and I highly recommend them.
If you see my ugly mug over the next couple of days, be sure to say hello. I’d love to know what you’re doing with Big Data, DevOps or AWS.
Going to AWS re:Invent is one of many excellent perks for Stelligent engineers. Good news! We’re hiring, check out our Careers page.