MongoPush Your GridFS to Atlas
Use Case
A self-managed MongoDB cluster holds 10TB of GridFS data that needs to move to Atlas, but the largest storage an Atlas replica set allows is 4TB. The sections below compare three ways to bridge that gap.
The Officially Supported Way
Let’s first examine the officially supported method. Because the existing 10TB of data exceeds the 4TB limit Atlas allows, MongoDB recommends upgrading to a sharded cluster. The first step is to shard all fs.chunks collections with the shard key {files_id: 1, n: 1} before using the Atlas Live Migration service. However, for MongoDB clusters that sit behind corporate firewalls and are not reachable by the Atlas Live Migration service, this is simply not an option.
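Sharding the chunks collection uses standard mongosh commands. Here is a minimal sketch, assuming a database named db and the default fs bucket (both are placeholders for your own names):

```
// mongosh, connected to the cluster
sh.enableSharding("db")

// GridFS drivers already maintain a unique index on {files_id: 1, n: 1},
// which doubles as the recommended shard key for the chunks collection
sh.shardCollection("db.fs.chunks", { files_id: 1, n: 1 })
```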
The MongoPush Way
1. Create a sharded cluster on Atlas. The number of shards depends on the storage requirement.
2. Use MongoPush to create all indexes on the target cluster.
3. Shard all fs.chunks collections on {files_id: 1, n: 1} and ensure there is at least one chunk on each shard (see the sketch after this list).
4. Use MongoPush to copy data.
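The sketch below illustrates steps 3 and 4. It assumes a two-shard cluster, a database named db, and the default fs bucket; the ObjectId split point and the shard name are placeholders, and the mongopush flags follow the project’s README, so verify them against the version you run:

```
// mongosh, connected to the target sharded cluster
sh.enableSharding("db")
sh.shardCollection("db.fs.chunks", { files_id: 1, n: 1 })

// Pre-split and move a chunk so every shard owns at least one;
// the split point and destination shard name are illustrative only
sh.splitAt("db.fs.chunks", { files_id: ObjectId("800000000000000000000000"), n: 0 })
sh.moveChunk("db.fs.chunks",
             { files_id: ObjectId("800000000000000000000000"), n: 0 },
             "atlas-abc123-shard-1")
```

With the chunks collection distributed, copy the data:

```
# flag names follow the MongoPush README; verify against your version
./mongopush -push data \
    -source "mongodb://user:secret@source-host:27017/" \
    -target "mongodb+srv://user:secret@target.example.mongodb.net/"
```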
The Online Archive Way
Users are cost conscious and always look for ways to reduce costs. To “expand” the storage size of a replica set beyond the 4TB limit, Atlas Online Archive seems an ideal place to store infrequently used documents. Online Archive is a feature that archives infrequently accessed data from an Atlas cluster to a MongoDB-managed, read-only Data Lake on S3 storage. Once the data is archived, a unified view of the Online Archive and Atlas data is available from a new connection endpoint. But here comes the challenge: can Online Archive move data quickly enough during the migration to prevent the 10TB of data from overflowing the 4TB limit? The answer is yes, if the collections are properly indexed and the archives are properly partitioned. The instructions are as follows.
GridFS uses two collections: fs.files stores the file metadata and fs.chunks keeps the sliced file data. Documents belonging to the same file have to be archived together across both collections. Atlas provides a web user interface as well as APIs to configure Online Archive. In the steps that follow, I’ll use the APIs to demonstrate. The steps are summarized as follows:
1. Create indexes of {archive: 1} on the GridFS collections of the source cluster.
2. Identify the documents to be archived in both GridFS collections, and add a boolean field archive set to true (see the sketch after this list).
3. Use MongoPush to create all indexes in the target cluster.
4. Configure Online Archive using the APIs.
5. Use MongoPush to copy data.
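Here is a minimal mongosh sketch of steps 1 and 2 against the source cluster, assuming the default fs bucket and using an illustrative uploadDate cutoff to decide which files count as infrequently used:

```
// Step 1: index the archive flag on both GridFS collections
db.fs.files.createIndex({ archive: 1 })
db.fs.chunks.createIndex({ archive: 1 })

// Step 2: flag infrequently used files; the cutoff date is illustrative
const cutoff = new Date("2020-01-01")
db.fs.files.find({ uploadDate: { $lt: cutoff } }).forEach(doc => {
  db.fs.files.updateOne({ _id: doc._id }, { $set: { archive: true } })
  // a file's chunks must carry the flag too, so file and chunks archive together
  db.fs.chunks.updateMany({ files_id: doc._id }, { $set: { archive: true } })
})
```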
Online Archive API Examples
Configure db.fs.files
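A sketch of the API call for the files collection, assuming the Atlas Admin API v1.0 endpoint with digest authentication; {PUBLIC-KEY}, {PRIVATE-KEY}, {GROUP-ID}, {CLUSTER-NAME}, and the db database name are placeholders, and the partition field choice is my assumption. The custom criteria matches the archive flag set earlier:

```
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Content-Type: application/json" \
  --request POST \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP-ID}/clusters/{CLUSTER-NAME}/onlineArchives" \
  --data '{
    "dbName": "db",
    "collName": "fs.files",
    "criteria": {
      "type": "CUSTOM",
      "query": "{\"archive\": true}"
    },
    "partitionFields": [
      { "fieldName": "_id", "order": 0 }
    ]
  }'
```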
Configure db.fs.chunks
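Similarly for the chunks collection, partitioning on files_id and n (again an assumption) so that chunks of the same file land together in the archive:

```
curl --user "{PUBLIC-KEY}:{PRIVATE-KEY}" --digest \
  --header "Content-Type: application/json" \
  --request POST \
  "https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP-ID}/clusters/{CLUSTER-NAME}/onlineArchives" \
  --data '{
    "dbName": "db",
    "collName": "fs.chunks",
    "criteria": {
      "type": "CUSTOM",
      "query": "{\"archive\": true}"
    },
    "partitionFields": [
      { "fieldName": "files_id", "order": 0 },
      { "fieldName": "n", "order": 1 }
    ]
  }'
```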
This solution works but has its own disadvantages. Files retrieved from S3 buckets have higher read latency than files served from MongoDB clusters. In addition, Atlas provides three different connection endpoints, including a federated one, to access data from the different data sources, so this implementation requires minor application changes to handle them. Unless cost considerations outweigh performance and application maintainability, I would recommend upgrading to a sharded cluster over this solution.