DataSync

Problem Solved by DataSync

Manual transfer of large datasets is often difficult, slow, and unreliable, requiring complex handling of errors, integrity, encryption, and parallelization. DataSync automates and accelerates this process.

DataSync addresses the difficulty, slowness, and unreliability of manually transferring large datasets (terabytes to petabytes) over the internet or Direct Connect with tools like rsync.

Manual transfer requires custom error handling, data integrity verification after connection drops, and complex setup for networking, encryption, and parallelization.

DataSync automates and accelerates online data transfer, simplifies setup, and verifies data integrity.

DataSync Architecture

AWS DataSync operates as a managed service, using an on-premises agent to connect local storage to AWS cloud destinations with an optimized protocol.

AWS DataSync is a managed data transfer service.

The DataSync agent is deployed on-premises as a virtual machine (VMware, Hyper-V) or a physical hardware appliance. It requires TCP/IP network connectivity (internet or private network). Direct Connect or Site-to-Site VPN is typically not required, but either is recommended for better performance, security, and predictability.

Technical Specs: Deployment: VMware, Hyper-V VM or physical hardware appliance; Connectivity: TCP/IP network

The agent connects to on-premises storage using standard protocols like NFS or SMB. It communicates with the AWS cloud control plane for task management.

Technical Specs: On-premises storage protocols: NFS, SMB

DataSync uses a proprietary optimized protocol for accelerated transfers (up to 10x faster than standard methods). It automatically parallelizes operations and handles integrity checks and scheduling.

Technical Specs: Transfer speed: Up to 10x faster than standard methods; Features: Automatic parallelization, integrity checks, scheduling

DataSync efficiently transfers data to AWS storage services like Amazon S3, Amazon EFS, and various FSx file systems (FSx for Windows File Server, FSx for Lustre, FSx for OpenZFS, FSx for NetApp ONTAP).

Key Features

DataSync offers a suite of features to optimize and secure data transfer operations.

AWS DataSync includes features such as transfer acceleration, data integrity verification, filtering capabilities, and flexible scheduling for automated data movement.

Accelerated Transfers

Utilizes a proprietary protocol to significantly speed up data movement compared to standard methods.

Data Integrity

Guarantees data integrity throughout the transfer process through end-to-end verification and checksums.
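
The general idea behind checksum verification can be sketched as follows. This is only an illustration using SHA-256 from Python's standard library; these notes do not say which algorithm DataSync uses internally, and the function names are hypothetical.

```python
import hashlib
from typing import Iterable


def checksum(chunks: Iterable[bytes]) -> str:
    """Hash data chunk by chunk so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def transfer_is_intact(source_chunks: Iterable[bytes],
                       destination_chunks: Iterable[bytes]) -> bool:
    """Compare the source checksum with the checksum of what arrived."""
    return checksum(source_chunks) == checksum(destination_chunks)


# The same bytes split differently still hash to the same value.
print(transfer_is_intact([b"sensor-", b"data"], [b"sensor-data"]))
# A single corrupted byte is detected.
print(transfer_is_intact([b"sensor-data"], [b"sensor-dat4"]))
```

With a manual tool, this comparison would have to be scripted and rerun after every dropped connection; the point of the feature is that the service performs the equivalent check itself.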

Filtering

Allows users to filter files during transfer, which is useful for moving specific datasets.

Use Cases:

Moving specific data sets
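
As a rough illustration, a filter for a DataSync task can be assembled like this. The sketch assumes the filter shape used by the DataSync API (a FilterType of SIMPLE_PATTERN and a single Value string with patterns separated by "|"). The helper name and the example patterns are hypothetical, and no AWS call is made.

```python
from typing import Dict, List


def build_filter(patterns: List[str]) -> List[Dict[str, str]]:
    """Join individual patterns into the single pipe-delimited filter string."""
    if not patterns:
        raise ValueError("at least one pattern is required")
    return [{"FilterType": "SIMPLE_PATTERN", "Value": "|".join(patterns)}]


# Hypothetical example: move only one project folder and JSON files.
includes = build_filter(["/projects/2024", "*.json"])
print(includes)
```

A list built this way could be supplied as the Includes or Excludes argument when creating or starting a task, for example through boto3's DataSync client.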

Scheduling

Automates recurring data movement, such as nightly backups or replication tasks.

Use Cases:

Nightly backups
Replication
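
A nightly schedule could be expressed along these lines. The sketch assumes the Schedule structure of the DataSync API, which takes a ScheduleExpression in AWS cron syntax, with times assumed to be in UTC. The helper name is hypothetical and no AWS call is made.

```python
from typing import Dict


def nightly_schedule(hour_utc: int) -> Dict[str, str]:
    """Build a schedule that runs once per day at the given hour (minute 0)."""
    if not 0 <= hour_utc <= 23:
        raise ValueError("hour_utc must be between 0 and 23")
    # AWS cron fields: minutes hours day-of-month month day-of-week year
    return {"ScheduleExpression": f"cron(0 {hour_utc} * * ? *)"}


# Hypothetical nightly backup at 02:00.
print(nightly_schedule(2))
```

The resulting dictionary is the kind of value that would be passed as the Schedule argument when a task is created.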

Task Modes - Enhanced Mode

Transfers virtually unlimited objects with higher performance by optimizing and parallelizing listing, preparing, transferring, and verifying operations. Available for transfers to Amazon S3 destinations or between Azure Blob/other cloud storage and Amazon S3 (without an agent).

Technical Specs: Object limit: Virtually unlimited; Performance optimization: Parallelized listing, preparing, transferring, verifying

Task Modes - Basic Mode

Transfers files/objects between supported AWS storage services and other DataSync locations. It is subject to quotas on the number of files, objects, and directories, and sequentially prepares, transfers, and verifies data, making it slower than Enhanced Mode.

Technical Specs: Processing method: Sequentially prepares, transfers, and verifies data
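
The choice between the two modes can be summarized as a small rule of thumb. The sketch below only encodes what these notes state (Enhanced Mode for transfers to Amazon S3, Basic Mode otherwise). The function name and the location labels are hypothetical; the returned strings match the TaskMode values of the DataSync API, "ENHANCED" and "BASIC". Actual availability depends on the source and destination pair, so treat this as a study aid rather than a compatibility check.

```python
def suggest_task_mode(destination: str) -> str:
    """Suggest a task mode based on the destination, per the notes above."""
    if destination.lower() == "s3":
        # Parallelized listing, preparing, transferring, and verifying.
        return "ENHANCED"
    # Sequential processing, subject to file/object/directory quotas.
    return "BASIC"


print(suggest_task_mode("s3"))
print(suggest_task_mode("efs"))
```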

Practical Use Cases

DataSync is utilized across various scenarios for moving data to, from, and between AWS storage services.

Migrating active data or data from NAS servers/fileshares to S3, EFS, or FSx. Also used for staging data for cloud-based applications or data lakes.

Keeping a secondary copy of on-premises data in AWS for backups or analysis staging.

Automating the transfer of critical data to AWS for disaster recovery.

Moving sensor data or log files to S3 for processing by analytics services (e.g., Redshift, Athena).

Restoring data from S3, EFS, or FSx back to on-premises during recovery, or making AWS-processed data available to local applications or users.

Replicating data to a different AWS region for business continuity, replicating data between file systems in different AWS accounts, or transferring data between storage services (e.g., S3 to EFS for applications on ECS/Fargate).

DataSync is purpose-built for securely and efficiently transferring large amounts of data (e.g., 10TB of daily JSON instrumentation data) from an on-premises SAN to Amazon S3 for near-real-time analytics. It automates the transfer and accelerates it, with built-in encryption and integrity checks.

Technical Specs: Data volume: 10TB daily; Data type: JSON instrumentation data; Source: On-premises SAN; Destination: Amazon S3
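
To see why a 10TB daily volume calls for an accelerated, automated transfer, it helps to work out the sustained bandwidth it implies. The calculation below assumes decimal terabytes (1 TB = 10^12 bytes) and a transfer spread evenly over 24 hours; the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_mbps(terabytes_per_day: float) -> float:
    """Sustained throughput in megabits per second to move the daily volume."""
    bits_per_day = terabytes_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


print(f"{required_mbps(10):.0f} Mbps")
```

For 10TB per day this works out to roughly 926 Mbps sustained around the clock, which leaves almost no headroom on a 1 Gbps link for retries or restarts after a failed transfer.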

DataSync Sources and Destinations

DataSync supports a variety of on-premises storage systems and AWS cloud storage services.

AWS DataSync is flexible, supporting common on-premises protocols and integrating with a broad range of AWS object and file storage services.

On-Premises Sources

DataSync can pull data from Network File System (NFS) shares, Server Message Block (SMB) shares, Hadoop Distributed File System (HDFS) clusters, and object storage systems.

AWS Destinations

Data can be transferred to Amazon S3, Amazon EFS, Amazon FSx for Windows File Server, Amazon FSx for Lustre, Amazon FSx for OpenZFS, and Amazon FSx for NetApp ONTAP.

Other Cloud Storage Services

Other cloud storage services, such as Azure Blob Storage, are also supported as transfer locations, especially when using Enhanced Mode.

Exam Tips and Distinctions

Understanding DataSync's capabilities and how it differs from other AWS services is crucial for certification exams.

DataSync is often contrasted with other data transfer and storage solutions. Key differentiators revolve around transfer method (online vs. offline), integration with other services, and primary purpose.

Option	Primary Use Case	Transfer Method	Scalability/Performance	Key Differentiators	Limitations
AWS DataSync	Online data transfer (internet, Direct Connect, VPN) for large datasets; migrations, recurring replication, DR, moving locally generated data.	Online, proprietary optimized protocol; agent-based for on-premises; Direct S3 Glacier/Deep Archive transfer.	Up to 10x faster than standard methods; automates parallelization; handles terabytes/petabytes.	Managed service, automates/accelerates transfers, ensures data integrity, direct copy to S3 Glacier/Deep Archive.	Cannot directly replicate DynamoDB tables; Not optimal for offline/petabyte-scale transfers that would take weeks (use Snow Family). Does not seamlessly extend on-premises storage or provide automatic lifecycle management (unlike S3 File Gateway).
AWS Snow Family	Offline environments, remote locations, or petabyte-scale data transfers that would take weeks over a network.	Offline, physical devices (Snowball Edge).	Designed for petabyte-scale transfers where online methods are impractical.	Physical appliance for disconnected/offline data transfer.	Not for online data transfer.
AWS Storage Gateway	Hybrid cloud access and keeping data synced, extending on-premises storage.	Online; provides hybrid access.	Focus on hybrid cloud integration and continuous sync.	Primarily for hybrid access, not one-time migration acceleration. File Gateway provides SMB/NFS interface with local caching backed by S3, and supports S3 lifecycle policies for archiving.	Not primarily a one-time migration accelerator like DataSync.
rsync	Manual file transfers.	Online (standard methods).	Slow for large transfers.	Requires custom scripting, manual error handling.	Slow, complex for large transfers, lacks automated error handling and integrity verification.
AWS DMS	Migrating/replicating databases.	Online (Public Internet or Direct Connect).	Optimized for database workloads.	Database-specific migration service.	Not for general-purpose file transfers like JSON instrumentation data.
Amazon S3 File Gateway	Extending on-premises SMB file server storage to S3 with local caching and lifecycle management.	Online, SMB interface with local caching.	Provides SMB access with local caching for low latency, storing files in S3; automates transitions to cheaper storage classes via S3 lifecycle policies.	Seamlessly extends on-premises storage, provides SMB interface, supports S3 lifecycle policies for automatic tiered storage.	Lacks full NTFS support (compared to FSx for Windows File Server); better for archival/backup than a primary highly available backend for applications like SharePoint.
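
The distinctions in the table can be condensed into a simple decision helper for exam practice. This is a deliberately simplified sketch of the table's rows, not an official selection rule; the function name and the workload labels are hypothetical.

```python
def suggest_service(*, network_available: bool, workload: str) -> str:
    """Map a scenario to the service the comparison table points to."""
    if workload == "database":
        return "AWS DMS"                 # database migration/replication
    if not network_available:
        return "AWS Snow Family"         # offline, physical devices
    if workload == "hybrid-file-access":
        return "AWS Storage Gateway"     # extends on-premises storage
    return "AWS DataSync"                # online bulk or recurring transfer


print(suggest_service(network_available=True, workload="file-migration"))
print(suggest_service(network_available=False, workload="file-migration"))
```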

Learning Objectives