New features give AWS users improved control over data processing

AWS has announced two new integrations at this year’s re:Invent conference in Las Vegas, US, that make it easier for users to connect and analyze data across data stores without having to move data between services.

With the new integrations, users will be able to analyze Amazon Aurora data with Amazon Redshift in near real time, eliminating the need to extract, transform, and load (ETL) data between services, the company says. Customers can also now run Apache Spark applications on Amazon Redshift data using AWS analytics and machine learning (ML) services (e.g., Amazon EMR, AWS Glue, and Amazon SageMaker). Together, these capabilities help customers move toward a zero-ETL future on AWS.

“The vastness and complexity of data that customers manage today means they cannot analyze and explore it with a single technology or even a small set of tools,” said Swami Sivasubramanian, vice president of Databases, Analytics, and Machine Learning at AWS. “Many of our customers rely on multiple AWS database and analytics services to extract value from their data, and ensuring they have access to the right tool for the job is important to their success.” He explained that the new capabilities would help AWS move customers toward a zero-ETL future, reducing the need to manually move or transform data between services. By eliminating ETL and other data movement tasks, users are freed to focus on analyzing data and driving new insights for their business—regardless of the size and complexity of their organization and data.

The first is the Amazon Aurora zero-ETL integration with Amazon Redshift, which makes it easier to run petabyte-scale analytics on transactional data in Amazon Aurora in near real time. The need for near real-time insights on transactional data (e.g., purchases, reservations, and financial trades) is growing as organizations seek to better understand core business drivers and develop strategies to increase sales, reduce costs, and gain a competitive advantage, AWS says.

Currently, many organizations rely on a three-part solution to analyze their transactional data: a relational database to store data, a data warehouse to perform analytics, and a data pipeline to ETL data between the relational database and the data warehouse. However, data pipelines can be expensive to build and maintain, and it can take days before data is ready for analysis.

With Amazon Aurora zero-ETL integration with Amazon Redshift, transactional data is automatically and continuously replicated seconds after it is written into Amazon Aurora and seamlessly made available in Amazon Redshift. Once data is available in Amazon Redshift, customers can start analyzing it immediately and apply advanced features like data sharing and Amazon Redshift ML to get holistic and predictive insights.

The second is the Amazon Redshift integration for Apache Spark, which makes it easier for developers to build and run Apache Spark applications on data in Amazon Redshift using AWS-supported analytics and ML services. The integration is certified, packaged, and supported by AWS, eliminating the cumbersome and error-prone process associated with third-party connectors. Developers can begin running queries on Amazon Redshift data from Apache Spark-based applications within seconds using popular languages (e.g., Java, Python, R, and Scala). Intermediate data-staging locations are managed automatically, so customers no longer need to configure and manage them in application code.

Describing a customer use case, Jack Lull, principal scientist for Adobe Acrobat Sign, said, “Adobe’s mission is to change the world through digital experiences, and in today’s world, that means having analytics that can deliver both deep and real-time insights.”

He further said, “As an Amazon Aurora customer, we are excited for Amazon Aurora support for zero-ETL integration with Amazon Redshift, which will provide our growing Acrobat Sign customer base with new insights and faster analytics performance as their usage increases—all without the need for ongoing maintenance for our own teams.”

GE Aerospace, a global provider of jet engines, components, and systems for commercial and military aircraft, also welcomed the announcement. “Amazon Redshift is a focal point of our strategy to make data extremely accessible and usable across our organization,” said Alcuin Weidus, senior principal data architect at GE Aerospace. “Data scientists, engineers, and developers leverage Apache Spark to build data products and run analytics workloads on Amazon EMR, AWS Glue, and third-party ML platforms hosted on AWS. We are excited for the Amazon Redshift integration for Apache Spark, which will streamline our developers’ building process and help make applications more performant and secure.”