MongoPush Your GridFS to Atlas

GridFS is one of the most popular features of MongoDB, and many implementations use GridFS as their file system to store binary files such as PDF documents and images.  Atlas is also gaining popularity among corporations and has become the home of many mission-critical databases.  In this blog, I will be solving an interesting use case: migrating a replica set to Atlas when the existing storage size is greater than the maximum Atlas allows.

Use Case

For applications that use GridFS and do not have a high transaction rate, a replica set works well. The maximum storage size Atlas allows is 4TB, but the storage provided by cloud providers can go as high as 16TB on a single volume. The challenge is to migrate 10TB of compressed data into a 4TB bucket. To scale out, converting to a sharded cluster is recommended, but the team is seeking a way to complete the data migration to Atlas immediately with minimal application changes.

The Officially Supported Way

Let’s first examine the officially supported method.  Because the existing 10TB of data is more than the 4TB limit Atlas allows, an upgrade to a sharded cluster is what MongoDB recommends.  The first step is to shard all fs.chunks collections with the shard key {files_id: 1, n: 1} before using the Atlas Live Migration service.  However, for MongoDB clusters that sit behind corporate firewalls and are not accessible by the Atlas Live Migration service, this is simply not an option.
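
For reference, sharding a GridFS chunks collection comes down to two mongosh commands run against the mongos of the upgraded source cluster.  This is only a sketch; the connection string and the database name (db) are placeholders.

mongosh "mongodb://source-mongos:27017" --eval '
  // Enable sharding on the database, then shard the GridFS chunks collection
  // with the shard key mentioned above: { files_id: 1, n: 1 }.
  sh.enableSharding("db");
  sh.shardCollection("db.fs.chunks", { files_id: 1, n: 1 });
'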

The MongoPush Way

It would be ideal if we could migrate a replica set directly to a sharded cluster on Atlas.  MongoPush makes that possible with the few steps below:
  1. Create a sharded cluster on Atlas.  The number of shards depends on the storage requirement.

  2. Use MongoPush to create all indexes on the target cluster.

  3. Shard all fs.chunks collections on {files_id: 1, n: 1} and ensure there is at least one chunk on each shard (see the sketch after this list).

  4. Use MongoPush to copy data.
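
Here is a minimal mongosh sketch of step 3, run against the target Atlas cluster.  The connection string, shard names, and ObjectId split points are placeholders; choose boundaries that roughly match the spread of your own files_id values.

mongosh "mongodb+srv://target.example.mongodb.net" --eval '
  // Shard the GridFS chunks collection with the same key as the source.
  sh.enableSharding("db");
  sh.shardCollection("db.fs.chunks", { files_id: 1, n: 1 });

  // Pre-split and move chunks so every shard owns at least one chunk before
  // MongoPush starts copying data (split points and shard names are illustrative).
  sh.splitAt("db.fs.chunks", { files_id: ObjectId("555555555555555555555555"), n: 0 });
  sh.splitAt("db.fs.chunks", { files_id: ObjectId("aaaaaaaaaaaaaaaaaaaaaaaa"), n: 0 });
  sh.moveChunk("db.fs.chunks", { files_id: ObjectId("555555555555555555555555"), n: 0 }, "atlas-example-shard-1");
  sh.moveChunk("db.fs.chunks", { files_id: ObjectId("aaaaaaaaaaaaaaaaaaaaaaaa"), n: 0 }, "atlas-example-shard-2");
'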

This is probably the simplest way for those looking to eliminate years of technical debt in one shot.  I have already discussed similar steps in another blog, Change Shard Key and Migrate to Atlas Using MongoPush.

The Online Archive Way

Users are cost conscious and always look for ways to reduce cost.  To “expand” the storage size of a replica set beyond the 4TB limit, Atlas Online Archive seems an ideal place to store infrequently used documents.  Online Archive is a feature that archives infrequently accessed data from an Atlas cluster to a MongoDB-managed, read-only data lake on S3 storage.  Once the data is archived, a unified view of the Online Archive and Atlas data is available from a new connection endpoint.  But here comes the challenge: can Online Archive move data quickly enough during the migration to prevent the 10TB of data from overflowing the 4TB limit?  The answer is yes, if the collections are properly indexed and the archives are properly partitioned.  The instructions are as follows.


Configuration and Migration Steps

There are two collections supporting GridFS: one stores file information and the other keeps the sliced file data.  Documents belonging to a file in both collections have to be archived together.  Atlas provides a web user interface as well as APIs to configure Online Archive.  In the steps that follow, I’ll use the APIs to demonstrate.  The steps are summarized as follows:

  1. Create an {archive: 1} index on each GridFS collection of the source cluster.

  2. Identify the documents in the GridFS collections to be archived, add a boolean field archive, and set its value to true (see the sketch after this list).

  3. Use MongoPush to create all indexes in the target cluster.

  4. Configure Online Archive using the APIs.

  5. Use MongoPush to copy data. 
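
Steps 1 and 2 can be scripted in mongosh against the source cluster.  This is only a sketch: the connection string and the 180-day cutoff for “infrequently used” are assumptions, and the loop flags the chunks of every flagged file so that both collections are archived together.

mongosh "mongodb://source-host:27017/db" --eval '
  // Step 1: index the archive flag on both GridFS collections.
  db.fs.files.createIndex({ archive: 1 });
  db.fs.chunks.createIndex({ archive: 1 });

  // Step 2: flag files not uploaded within the last 180 days, then flag their
  // chunks by files_id so a file and its chunks are archived together.
  var cutoff = new Date(Date.now() - 180 * 24 * 3600 * 1000);
  db.fs.files.updateMany({ uploadDate: { $lt: cutoff } }, { $set: { archive: true } });
  db.fs.files.find({ archive: true }, { _id: 1 }).forEach(function (doc) {
    db.fs.chunks.updateMany({ files_id: doc._id }, { $set: { archive: true } });
  });
'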

Online Archive API Examples

Configure db.fs.files

# PUBLIC_KEY and PRIVATE_KEY hold an Atlas programmatic API key pair.
curl --user "${PUBLIC_KEY}:${PRIVATE_KEY}" --digest \
  --header "Content-Type: application/json" \
  --include \
  --request POST "https://cloud.mongodb.com/api/atlas/v1.0/groups/<group_id>/clusters/target/onlineArchives?pretty=true" \
  --data '
{
  "dbName": "db",
  "collName": "fs.files",
  "partitionFields": [
    {
      "fieldName": "_id",
      "order": 0
    }
  ],
  "criteria": {
    "type": "CUSTOM",
    "query": "{ \"archive\": true }"
  }
}'

Configure db.fs.chunks

# Same request for the chunks collection, partitioned on files_id.
curl --user "${PUBLIC_KEY}:${PRIVATE_KEY}" --digest \
  --header "Content-Type: application/json" \
  --include \
  --request POST "https://cloud.mongodb.com/api/atlas/v1.0/groups/<group_id>/clusters/target/onlineArchives?pretty=true" \
  --data '
{
  "dbName": "db",
  "collName": "fs.chunks",
  "partitionFields": [
    {
      "fieldName": "files_id",
      "order": 0
    }
  ],
  "criteria": {
    "type": "CUSTOM",
    "query": "{ \"archive\": true }"
  }
}'


This solution works but has its own disadvantages.  Files retrieved from S3 buckets have higher read latency than files served from MongoDB clusters.  In addition, Atlas provides three different connection endpoints, including a federated one, to access data from the different data sources, so this implementation certainly requires minor application changes.  Unless cost considerations outweigh performance and application maintainability, I would recommend upgrading to a sharded cluster over this solution.
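
To see what the different endpoints mean for an application, here is a rough mongosh comparison.  The host names below are placeholders, not real endpoints; the actual connection strings are copied from the cluster’s Connect dialog in the Atlas UI, which offers cluster-only, archive-only, and federated options.

# Cluster endpoint: reads only the hot documents still stored on the Atlas cluster.
mongosh "mongodb+srv://target.example.mongodb.net/db"

# Federated endpoint: a unified view spanning the cluster and its Online Archive.
mongosh "mongodb://cluster-and-archive.example.query.mongodb.net/db?tls=true&authSource=admin"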

Summary

This blog discussed three solutions for migrating a large amount of GridFS data to Atlas.  Alternatively, for cost reasons, one should consider storing binary files in S3 buckets or other cheaper storage and keeping only the metadata in MongoDB.  Migrating big data to MongoDB Atlas is a time-consuming task.  I hope this provides helpful information to assist your Atlas migration project.  Please post your feedback in the comments.
