Windows Azure Storage Redundancy Options and Read Access Geo Redundant Storage

We are excited to announce the ability for customers to achieve higher read availability for their data. This preview feature, called “Read Access - Geo Redundant Storage (RA-GRS)”, allows you to read an eventually consistent copy of your geo-replicated data from the storage account’s secondary region in the event the storage account’s primary region becomes unavailable.

Before we dive into the details of this new ability, we will briefly summarize the available redundancy options in Windows Azure Storage. We will then cover each of the available options in detail, including the new Read Access – Geo Redundant Storage (RA-GRS) option and how to sign up for this limited preview. We will also cover the storage client library changes you can use to achieve higher read availability with RA-GRS.

Redundancy Options in Windows Azure Storage

Windows Azure Storage provides the following redundancy options for Blobs, Tables and Queues:

1. Locally Redundant Storage (LRS): All data in the storage account is made durable by replicating transactions synchronously to three different storage nodes within the same region. The section below covers LRS in more detail, including how to select it.

2. Geo Redundant Storage (GRS): This is the default redundancy option when a storage account is created. Like LRS, transactions are replicated synchronously to three storage nodes within the primary region chosen when creating the storage account. However, the transaction is also queued for asynchronous replication to a secondary region (hundreds of miles away from the primary), where the data is again made durable by replicating it to three more storage nodes there. The section below covers the asynchronous replication process, region pairings and the failover process in depth.

3. Read Access - Geo Redundant Storage (RA-GRS): For a GRS storage account, we have now introduced, in limited preview, the ability to turn on read-only access to a storage account’s data in the secondary region. Since replication to the secondary region is done asynchronously, this provides an eventually consistent copy of the data to read from. The section below covers RA-GRS in more detail, including how to enable it in preview mode and details on storage analytics.

Locally Redundant Storage (LRS)

What is LRS?

Locally redundant storage stores multiple copies of your data synchronously within a region for durability. To ensure durability, we replicate each transaction synchronously across three different storage nodes spread across different fault domains and upgrade domains. A fault domain (FD) is a group of nodes that represents a physical unit of failure and can be thought of as nodes belonging to the same physical rack. An upgrade domain (UD) is a group of nodes that are upgraded together during the process of a service upgrade (rollout). The three replicas are spread across UDs and FDs to ensure that data remains available even if a hardware failure impacts a single rack or when nodes are upgraded during a rollout.

In addition to returning success only when all three replicas are persisted, we store CRCs of the data to ensure correctness and periodically read and validate the CRCs to detect bit rot (random errors occurring on the disk media over a period of time). In addition, Windows Azure Storage erasure codes data which further improves durability. More details on how data is made durable can be found in our SOSP paper.

Scenarios for LRS

LRS costs less than GRS. Based on the current pricing structure, the reduction in price compared to GRS is around 23% to 34%, depending on how much data is stored. Here are some reasons why one may choose LRS over GRS.

1. Applications that store data which can be easily reconstructed may choose not to geo-replicate it, not only for the lower cost but also because they get higher throughput for the storage account. LRS accounts get 10 Gibps ingress and 15 Gibps egress, as compared to 5 Gibps ingress and 10 Gibps egress for a GRS account.

2. Some customers want their data replicated only within a single region due to their application’s data governance requirements.

3. Some applications may have built their own geo replication strategy and not require geo replication to be managed by Windows Azure Storage service.

How to configure LRS

GRS is the default redundancy option when creating a storage account and is included in the current pricing for Azure Storage. To configure LRS using the Windows Azure Portal, choose “Locally Redundant” for replication on the “configure” page of the selected storage account; only then does the discounted pricing apply. When you select LRS, data will be deleted from the secondary location. It is important to note that after you select LRS, changing back to GRS (i.e. Geo Redundant) would incur an additional charge for the egress involved in copying existing data from the primary location to the secondary location. Once the initial data is copied, there is no further egress charge for geo-replicating data from the primary to the secondary location for GRS. The details for bandwidth charges can be found here.

Geo Redundant Storage (GRS):

What is GRS?

A geo redundant storage account has its blob, table and queue data replicated to a secondary region hundreds of miles away from the primary region. So even in the case of a complete regional outage or a regional disaster in which the primary location is not recoverable, your data is still durable. As explained above in LRS, updates to your storage account are synchronously replicated to three storage nodes in the primary region and success is returned only once three copies are persisted there. For GRS, after the updates are committed to the primary region they are asynchronously replicated to the secondary region. On the secondary, the updates are again committed to a three replica set before returning success back to the primary.

Our goal is to keep the data completely durable at both the primary location and the secondary location. This means we keep three replicas in each of the locations (i.e. a total of 6 copies) to ensure that each location can recover by itself from common failures (e.g., a disk, node, rack, or top-of-rack (TOR) switch failing), without having to communicate with the other location. The two locations only have to talk to each other to geo-replicate the recent updates to storage accounts. This is important, because it means that if we had to fail over a storage account from the primary to the secondary, then all the data that had been committed to the secondary location via geo-replication will already be durable there. Since transactions are replicated asynchronously, it is important to note that opting for GRS does not impact the latency of transactions made to the primary location. However, since there is a delay in geo-replication, in the event of a regional disaster it is possible that delta changes that have not yet been replicated to the secondary region may be lost if the data cannot be recovered from the primary region.

What are the secondary locations?

When a storage account is created, the customer chooses the primary location for their storage account. However, the secondary location for the storage account is fixed and customers do not have the ability to change this. The following table shows the current primary and secondary location pairings:

Primary            | Secondary
North Central US   | South Central US
South Central US   | North Central US
East US            | West US
West US            | East US
North Europe       | West Europe
West Europe        | North Europe
South East Asia    | East Asia
East Asia          | South East Asia
East China         | North China
North China        | East China

What transactional consistency can be expected with geo replication?

To understand transactional consistency with geo-replication, it is important to understand that Windows Azure Storage uses a range-based partitioning system in which every object has a property called the Partition Key, which is the unit of scale. All objects with the same Partition Key value will be served by the same Partition Server (see the SOSP paper for details). The Partition Key for each type of object is:

  • Blob: Account name, Container name and Blob name
  • Table Entity: Account name, Table name and app defined PartitionKey
  • Queue Message: Account name and Queue name

More details on the scalability targets of a storage account and these objects can be found here.

Geo-replication ensures that transactions to objects with same Partition Key value are committed in the same order at the secondary location as at the primary location. That said, it is also important to note that there are no geo-replication ordering guarantees across objects with different Partition Key values. This means that different partitions can be geo-replicating at different speeds. Once all the updates have been geo-replicated and committed at the secondary location, the secondary location will have the exact same state as the primary location.

For example, consider the case where we have two blobs, foo and bar, in our storage account (for blobs, the complete blob name is the Partition Key).  Now say we execute transactions A and B on blob foo, and then execute transactions X and Y against blob bar.  It is guaranteed that transaction A will be geo-replicated before transaction B, and that transaction X will be geo-replicated before transaction Y.  However, no other guarantees are made about the respective timings of geo-replication between the transactions against foo and the transactions against bar. If a disaster happened and caused recent transactions to not be geo-replicated, it would be possible for transactions A and X to be geo-replicated while transactions B and Y are lost, or for transactions A and B to be geo-replicated while neither X nor Y made it to the secondary. The same holds true for operations involving Tables and Queues, where for Tables the partitions are determined by the application-defined PartitionKey of the entity instead of the blob name, and for Queues the Queue name is the Partition Key.

Because of this, to best leverage geo-replication, one best practice is to avoid cross-Partition Key relationships whenever possible. This means you should try to restrict relationships for Tables to entities that have the same Partition Key value. All transactions within a single Partition Key value are committed on the secondary in the same order as on the primary. However, for high-scale scenarios, it is not advisable to have all entities use the same Partition Key value, since the scalability target for a single partition is a lot lower than that of a single storage account.

The only multi-object transaction supported by Windows Azure Storage is Entity Group Transactions for Windows Azure Tables, which allow clients to commit a batch of entities, all within the same Partition Key, together as a single atomic transaction. Geo-replication also treats this batch as an atomic operation, so the whole batch transaction is committed atomically on the secondary.
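
For example, here is a minimal sketch of an Entity Group Transaction using the storage client library (the table name, keys and property below are purely illustrative):

CloudStorageAccount account = CloudStorageAccount.Parse(cxnString);
CloudTableClient tableClient = account.CreateCloudTableClient();
CloudTable table = tableClient.GetTableReference("orders");

// All entities in the batch share the same PartitionKey, so the batch commits as a single
// atomic transaction on the primary and is geo-replicated to the secondary atomically as well.
TableBatchOperation batch = new TableBatchOperation();
for (int i = 1; i <= 3; i++)
{
    DynamicTableEntity entity = new DynamicTableEntity("customer1", "order-00" + i);
    entity.Properties["Total"] = EntityProperty.GeneratePropertyForInt(i * 10);
    batch.Insert(entity);
}

table.ExecuteBatch(batch);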

What is the Geo-Failover Process?

Geo-failover is the process of configuring a storage account’s secondary location as the new primary location. At present, failover is at the stamp level, and we do not have the ability to fail over a single storage account. We plan to provide an API to allow customers to trigger a failover at an account level, but this is not available yet. Given that failover is at the stamp level, in the event of a major disaster that affects the primary location, we will first try to restore the data in the primary location. Restoring the primary is given precedence because failing over to the secondary may result in recent delta changes being lost (replication is asynchronous), and not all applications may prefer failing over if availability to the primary can be restored.

If we needed to perform a failover, affected customers will be notified via their subscription contact information. As part of the failover, the customer’s “account.<service>.core.windows.net” DNS entry would be updated to point from the primary location to the secondary location. Once this DNS change is propagated, the existing Blob, Table, and Queue URIs will work. This means that you do not need to change your application’s URIs – all existing URIs will work the same before and after a geo-failover. For example, if the primary location for a storage account “myaccount” was North Central US, then the DNS entry for myaccount.<service>.core.windows.net would direct traffic to North Central US. If a geo-failover became necessary, the DNS entry for myaccount.<service>.core.windows.net would be updated so that it would then direct all traffic for the storage account to South Central US. After the failover occurs, the location that is accepting traffic is considered the new primary location for the storage account. Once the new primary is up and accepting traffic, we will bootstrap to a new secondary to get the data geo redundant again.

What is the RPO and RTO with GRS?

Recovery Point Objective (RPO): In GRS and RA-GRS, the storage service asynchronously geo-replicates data from the primary to the secondary location. If there were a major regional disaster and a failover had to be performed, then recent delta changes that had not been geo-replicated could be lost. The number of minutes of potential data loss is referred to as the RPO (i.e., the point in time to which data can be recovered). We typically have an RPO of less than 15 minutes, though there is currently no SLA on how long geo-replication takes.

Recovery Time Objective (RTO): The other metric to know about is RTO. This is a measure of how long it takes us to do the failover, and get the storage account back online if we had to do a failover. The time to do the failover includes the following:

  • The time it takes us to investigate and determine whether we can recover the data at the primary location or whether we should fail over
  • The time to fail over the account by changing the DNS entries

We take the responsibility of preserving your data very seriously, so if there is any chance of recovering the data, we will hold off on doing the failover and focus on recovering the data in the primary location. In the future, we plan to provide an API to allow customers to trigger a failover at an account level, which would then allow customers to control the RTO themselves, but this is not available yet.

Scenarios for GRS

GRS is chosen by customers requiring the highest level of durability for Business Continuity Planning (BCP) by keeping their data durable in two regions hundreds of miles apart from each other in case of a regional disaster.

Introducing Read-only Access to Geo Redundant Storage (RA-GRS):

RA-GRS gives your storage account higher read availability by providing “read only” access to the data replicated to the secondary location. Once you enable this feature, the secondary location may be used to achieve higher availability in the event the data is not available in the primary region. This is an “opt-in” feature that requires the storage account to be geo-replicated.

How to enable RA-GRS?

During the limited preview, customers will need to sign up for the preview on the Windows Azure Preview page. This puts your subscription ID in a queue to be approved; once it is approved, you will receive an email notification, and you can then enable RA-GRS for any of the accounts associated with that subscription. You can enable read-only access to your secondary region via the Service Management REST APIs for Create and Update Storage Account, or via the Windows Azure Portal. When using the APIs to enable RA-GRS, ensure that the properties GeoReplicationEnabled and SecondaryReadEnabled are set to true in the request payload. Via the portal, you can set the storage account’s replication property to “Read Access Geo-Redundant” storage. Note that during this preview RA-GRS is not yet available in North China and East China, but it is available everywhere else. We will update the blog post once RA-GRS becomes available in China.

How does RA-GRS work?

When you enable read-only access to your secondary region, you get a secondary endpoint in addition to the primary endpoint for accessing your storage account. This secondary endpoint is similar to the primary endpoint except for the suffix “-secondary”. For example: if the primary endpoint is myaccount.<service>.core.windows.net, the secondary endpoint is myaccount-secondary.<service>.core.windows.net. The secret keys used to access the primary endpoint are the same ones used to access the secondary endpoint. Using the same keys enables the same Shared Access Signature to work for both the primary and secondary endpoints. This means that the canonicalization of the resource used for signing must remain the same for both the primary and the secondary; therefore, the account name used in the canonicalized resource should exclude the “-secondary” suffix (note that existing storage explorers that use the DNS name to extract the account name may not exclude it and hence may not be able to read from the secondary endpoint). The secondary endpoint can then be used to dispatch read requests when the primary is not available, in order to achieve higher availability. Please note that any put/delete requests to this secondary endpoint will automatically be rejected with HTTP status code 403.
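
To make the endpoint naming concrete, here is a minimal sketch; the account name, container, blob and SAS token below are placeholders, and the SAS would be generated against the account as usual:

// Placeholders for illustration only.
string accountName = "myaccount";
string sasToken = "?sv=2013-08-15&..."; // a Shared Access Signature generated for the blob as usual

// The secondary endpoint differs from the primary only by the "-secondary" suffix on the account name.
Uri primaryUri = new Uri(string.Format(
    "https://{0}.blob.core.windows.net/mycontainer/blob1.txt", accountName));
Uri secondaryUri = new Uri(string.Format(
    "https://{0}-secondary.blob.core.windows.net/mycontainer/blob1.txt", accountName));

// The same account keys and SAS are valid on both endpoints, so the same token can be
// appended to either URI.
CloudBlockBlob secondaryBlob = new CloudBlockBlob(new Uri(secondaryUri.AbsoluteUri + sasToken));
string contents = secondaryBlob.DownloadText(); // reads are allowed; any put/delete would be rejected with 403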

Let us revisit the geo-replication process we explained above. A transaction on the primary is replicated asynchronously to the secondary region. However, since transactions across Partition Keys can be replicated out of order, we introduce a new term called “Last Sync Time”, which acts as the conservative RPO time. All primary updates preceding the Last Sync Time (defined in UTC) are guaranteed to be available for read operations at the secondary; primary updates after this point in time may or may not be available for reads. A separate Last Sync Time value is provided for Blobs, Tables and Queues for a storage account. The Last Sync Time is calculated by tracking the geo-replicated sync time for each partition and then reporting the minimum time for blobs, tables and queues respectively. Let’s use an example to better illustrate the concept. The table below lists a timeline of operations on the blobs “myaccount.blob.core.windows.net/mycontainer/blob1.txt” and “myaccount.blob.core.windows.net/mycontainer/blob2.txt”.

All timestamps are UTC on Wed, 23 Oct 2013.

UTC Time | User Action                                                  | Replication                                           | Last Sync | Read request on secondary
22:00:00 | User uploaded blob1 with contents “Hello” – Transaction # 1  |                                                       | 21:58:00  |
22:01:00 | User uploaded blob2 with contents “Cheers” – Transaction # 2 |                                                       | 21:58:00  |
22:02:00 | User issues read requests for blob1 and blob2 on secondary   |                                                       | 21:58:00  | A read on blob1 & blob2 returns 404
22:03:00 |                                                               | Upload transaction # 2 is replicated to the secondary | 21:58:00  |
22:04:00 | User updated blob1 with contents “Adios” – Transaction # 3   |                                                       | 21:58:00  |
22:05:00 | User issues read requests for blob1 and blob2 on secondary   |                                                       | 21:58:00  | A read on blob1 returns 404 and blob2 returns “Cheers”
22:05:30 |                                                               | Upload transaction # 1 is replicated to the secondary | 22:01:00  |
22:06:00 | User issues a read request for blob1 on secondary             |                                                       | 22:01:00  | A read on blob1 returns “Hello”
22:07:00 |                                                               | Upload transaction # 3 is replicated to the secondary | 22:04:00  |
22:08:00 | User issues a read request for blob1 on secondary             |                                                       | 22:04:00  | A read on blob1 returns “Adios”

A few things to note here:

  • At Wed, 23 Oct 2013 22:03:00, even though transaction 2 has been replicated and blob2 is available for read, the Last Sync Time is still “Wed, 23 Oct 2013 21:58:00” because transaction 1 has not been replicated yet. The Last Sync Time is a conservative RPO time and guarantees that all transactions up to that time are available for read access on the secondary across all blobs in the storage account.
  • At Wed, 23 Oct 2013 22:05:30, once transaction 1 has been replicated, the Last Sync Time moves to 22:01 (since transaction 2 was already replicated). A read on blob1 returns the contents “Hello” set by transaction 1, since the changes related to transaction 3 have not yet been replicated.
  • At Wed, 23 Oct 2013 22:07:00, once transaction 3 has been replicated, the Last Sync Time moves to 22:04:00 to signify that all changes to blob1 and blob2 are available for read on the secondary. At that point, any read on blob1 will reflect the change in contents to “Adios”.

How to find the Last Sync Time using RA-GRS?

Starting with REST version 2013-08-15, a new REST API, “Get Service Stats”, is available for the Blob, Table and Queue services. This API is available only from the secondary endpoint and provides the geo-replication stats for the service. The stats include two pieces of information maintained for each service:

1. The status of geo replication: This can be one of the following three values

a. Live: Indicates that geo replication is enabled, active and operational

b. Bootstrap: Indicates the initialization phase of bootstrapping the data from the primary to the secondary when the storage account is changed from LRS to GRS. During this phase, the secondary endpoint may not be available for reads.

c. Unavailable: The system cannot compute the Last Sync Time due to an outage or has not yet computed the Last Sync Time.

2. Last Sync Time: This indicates the conservative RPO time for the service, as explained above. It is empty if the status is Bootstrap or Unavailable. When the status is “Live”, this is a valid UTC time.
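
For example, an application can check that the status is Live before relying on the reported Last Sync Time. Here is a minimal sketch using the storage client library (covered in more detail later in this post):

CloudStorageAccount account = CloudStorageAccount.Parse(cxnString);
CloudBlobClient blobClient = account.CreateCloudBlobClient();

// Get Service Stats is served only from the secondary endpoint.
blobClient.LocationMode = LocationMode.SecondaryOnly;
ServiceStats stats = blobClient.GetServiceStats();

if (stats.GeoReplication.Status == GeoReplicationStatus.Live &&
    stats.GeoReplication.LastSyncTime.HasValue)
{
    // All writes up to this UTC time are guaranteed to be readable from the secondary.
    Console.WriteLine("Blob service Last Sync Time (UTC): {0}",
        stats.GeoReplication.LastSyncTime.Value);
}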

What is the RA-GRS SLA and Pricing?

The benefit of using RA-GRS is that it provides higher read availability (99.99+%) for a storage account than GRS (99.9+%). With RA-GRS, write availability continues to be 99.9+% (same as GRS today), and read availability is 99.99+%, where the data is expected to be read from the secondary if the primary is unavailable. In terms of pricing, the capacity (GB) charge is slightly higher for RA-GRS than for GRS, whereas the transaction and bandwidth charges are the same for both. See the Windows Azure Storage pricing page here for more details about the SLA and pricing.

Storage Analytics for RA-GRS?

The Windows Azure Storage service provides storage analytics data that can be used to monitor usage of the storage service. With RA-GRS, storage metrics for transactions made to the secondary endpoint are also made available, provided metrics are enabled via Set Service Properties on the primary endpoint for the Windows Azure Blob, Table and Queue services. To keep it simple, the metrics data for transactions issued against the secondary endpoint are made available only on the primary endpoint, in the following tables (a sketch of how to query one of these tables follows the lists below):

  • $MetricsHourSecondaryTransactionsBlob
  • $MetricsHourSecondaryTransactionsTable
  • $MetricsHourSecondaryTransactionsQueue
  • $MetricsMinuteSecondaryTransactionsBlob
  • $MetricsMinuteSecondaryTransactionsTable
  • $MetricsMinuteSecondaryTransactionsQueue

Metrics for transactions made to primary endpoints are still available at:

  • $MetricsHourPrimaryTransactionsBlob
  • $MetricsHourPrimaryTransactionsTable
  • $MetricsHourPrimaryTransactionsQueue
  • $MetricsMinutePrimaryTransactionsBlob
  • $MetricsMinutePrimaryTransactionsTable
  • $MetricsMinutePrimaryTransactionsQueue

With the preview release of RA-GRS, logs are not yet made available for transactions against the secondary endpoint. Please refer to MSDN for details on analytics.
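
As an illustration, here is a minimal sketch that reads the hourly secondary transaction metrics for the Blob service using the storage client library (this assumes metrics have been enabled as described above; the $Metrics tables are queried like any other table, from the primary endpoint):

CloudStorageAccount account = CloudStorageAccount.Parse(cxnString);
CloudTableClient tableClient = account.CreateCloudTableClient();

// Metrics for transactions issued against the secondary endpoint are stored in a regular
// table that is read from the primary endpoint.
CloudTable metricsTable = tableClient.GetTableReference("$MetricsHourSecondaryTransactionsBlob");

TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>();
foreach (DynamicTableEntity row in metricsTable.ExecuteQuery(query))
{
    // PartitionKey is the time bucket for the hour; RowKey identifies the request category and API.
    Console.WriteLine("{0} {1}", row.PartitionKey, row.RowKey);
}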

Scenarios

Read-only access to the secondary enables higher read availability. Applications that require higher availability and can handle eventually consistent reads can issue secondary reads when the primary for a storage account is unavailable.

RA-GRS support in the Storage Client Library

The Storage Client Library 3.0, which uses REST version 2013-08-15, provides new capabilities for RA-GRS. It provides the ability to query the Last Sync Time for Blobs, Tables and Queues, and it provides library support for reads to automatically retry against the secondary if the request to the primary times out.

1. GetServiceStats API for CloudBlobClient, CloudTableClient and CloudQueueClient: This API allows applications to easily retrieve the replication status and LastSyncTime for each service.

Example:

CloudStorageAccount account = CloudStorageAccount.Parse(cxnString);
CloudTableClient client = account.CreateCloudTableClient();

// Note that Get Service Stats is supported only on the secondary endpoint
client.LocationMode = LocationMode.SecondaryOnly;
ServiceStats stats = client.GetServiceStats();
string lastSyncTime = stats.GeoReplication.LastSyncTime.HasValue ?
    stats.GeoReplication.LastSyncTime.Value.ToString() : "empty";
Console.WriteLine("Replication status = {0} and LastSyncTime = {1}",
    stats.GeoReplication.Status, lastSyncTime);

2. LocationMode property: This property allows the secondary endpoint to be used when the primary is not available. The important values for this are:

a. PrimaryOnly: All read requests should be issued only to the primary endpoint.

b. PrimaryThenSecondary: Read requests are first issued to the primary and, if they fail with a retryable error, subsequent retries alternate between the secondary and the primary. If access to the secondary returns 404 (Not Found) for the object, then subsequent retries remain on the primary.

c. SecondaryOnly: The read will be issued to the secondary endpoint.

Note that all LocationMode options except “SecondaryOnly” continue to send write requests only to the primary endpoint. Using the “SecondaryOnly” option for write requests will throw a StorageException.

This property can be set either on:

  • Cloud[Blob|Table|Queue]Client: The LocationMode option is used for all requests issued via objects associated with this client.
  • [Blob|Table|Queue]RequestOptions: The LocationMode option can be overridden at an API level using the same client object.

Code example using PrimaryThenSecondary for a download blob request:

CloudStorageAccount account = CloudStorageAccount.Parse(cxnString);
CloudBlobClient client = account.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference(containerName);
CloudBlockBlob blob = container.GetBlockBlobReference(blobName);

// Set the location mode for the request using request options. This request will first try
// to download the blob using the primary and, if that fails, it will try the secondary location
// for subsequent retries.
blob.DownloadToFile(
    localFileName,
    FileMode.OpenOrCreate,
    null /* access condition */,
    new BlobRequestOptions()
    {
        LocationMode = LocationMode.PrimaryThenSecondary,
        ServerTimeout = TimeSpan.FromMinutes(3)
    });

3. A new retry policy interface IExtendedRetryPolicy has been introduced to allow users to extend retries so that the target location for subsequent retries can be changed. This interface provides a new method, Evaluate, which replaces ShouldRetry. Note that ShouldRetry is still retained in this interface for backward compatibility but is unused.

The Evaluate method allows users to return RetryInfo which contains the location to use on subsequent retry in addition to the RetryInterval.

In this example of a retry policy implementation, we change the target location to the secondary only on the last attempt.

public RetryInfo Evaluate(RetryContext retryContext, OperationContext operationContext)
{
    int statusCode = retryContext.LastRequestResult.HttpStatusCode;

    if (retryContext.CurrentRetryCount >= this.maximumAttempts
        || ((statusCode >= 300 && statusCode < 500 && statusCode != 408)
            || statusCode == 501 // Not Implemented
            || statusCode == 505 // Version Not Supported
           ))
    {
        // do not retry
        return null;
    }

    RetryInfo info = new RetryInfo();
    info.RetryInterval = EvaluateBackoffTime();
    if (retryContext.CurrentRetryCount == this.maximumAttempts - 1)
    {
        // retry against the secondary on the last attempt
        info.TargetLocation = StorageLocation.Secondary;
    }

    return info;
}
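
To use such a policy, assign it to the RetryPolicy property of the request options (or of the client). The class name below is hypothetical and stands in for a full implementation of IExtendedRetryPolicy (i.e. the Evaluate method above plus ShouldRetry, CreateInstance, and the maximumAttempts and EvaluateBackoffTime members it references). A minimal usage sketch:

// Hypothetical class implementing IExtendedRetryPolicy as sketched above.
IRetryPolicy retryPolicy = new LastAttemptSecondaryRetryPolicy();

blob.DownloadToFile(
    localFileName,
    FileMode.OpenOrCreate,
    null /* access condition */,
    new BlobRequestOptions()
    {
        // PrimaryThenSecondary allows retries to target either location; the custom
        // policy above then decides that only the last attempt goes to the secondary.
        LocationMode = LocationMode.PrimaryThenSecondary,
        RetryPolicy = retryPolicy
    });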

We hope you enjoy this new feature. Please provide us feedback using comments on this blog or via the Windows Azure Storage forums.

Jai Haridas and Brad Calder

