Exploring MongoPush Using a Case of Filtering and Rename
One useful application of MongoPush is copying a subset of a collection and renaming the namespace in the target cluster. In this blog, I'll use a simple use case to describe its usage, the snapshot file, and the task metadata of a migration. If you are unfamiliar with MongoPush, see MongoPush - Push-Based MongoDB Atlas Migration Tool for an introduction.
Use Case
The use case presented here is to copy a subset of a collection from a 2-shard MongoDB cluster to a replica set. I use Keyhole to populate data into the vehicles collection. The requirements of this use case are:
Locate all red vehicles, using filter {"color": "Red"}, from namespace atlanta.vehicles of the source cluster
Copy the data to the cars collection under the austin database in the target replica set, using the optional to field
To satisfy the described requirements, use the command below:
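The exact invocation depends on your environment; the sketch below assumes mongopush accepts -source, -target, and -include flags, and that the include file lists the namespace, the filter, and the optional to field. Treat the flag names, connection strings, and file layout as assumptions rather than definitive syntax.

include.json (layout assumed):

    [
      { "namespace": "atlanta.vehicles", "filter": { "color": "Red" }, "to": "austin.cars" }
    ]

mongopush invocation (flag names assumed):

    ./mongopush \
      -source "mongodb://user:password@source-mongos.example.com:27017/" \
      -target "mongodb://user:password@target-host.example.com:27017/?replicaSet=rs0" \
      -include include.json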
Upon completion, verify the number of documents from both clusters.
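For example, with mongosh connected to each cluster (the two counts should match):

    // On the source cluster: documents matching the filter
    use atlanta
    db.vehicles.countDocuments({ color: "Red" })

    // On the target replica set: documents copied into the renamed namespace
    use austin
    db.cars.countDocuments({})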
Execution
The above command executes the following steps:
Connect to the source, a 2-shard sharded cluster
Connect to the target, a replica set
Query documents from namespace atlanta.vehicles of the source using filter {"color": "Red"}. There are two possible scenarios:
If the total number of matched documents is less than the default block size, 10,000, begin copying documents to the target cluster
Otherwise, divide them into small blocks and process them in parallel using multiple threads
Documents are copied to a different namespace, austin.cars, in the target cluster
It is very important to have an index on the color field of the source collection atlanta.vehicles. Without a proper index, mongopush performs a collection scan, resulting in poor performance.
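For instance, the index can be created from mongosh on the source cluster, and the explain plan should then show an IXSCAN stage instead of a COLLSCAN:

    use atlanta
    db.vehicles.createIndex({ color: 1 })

    // Verify the filter is served by the index
    db.vehicles.find({ color: "Red" }).explain("queryPlanner")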
In addition to the console messages, status is also available in a browser. The default port is 5226; for example, use http://hostname.example.com:5226 to view progress and status.
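If a browser is not handy, the same endpoint can be checked from the command line, for example with curl:

    curl http://hostname.example.com:5226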
Review and Audit
To stop MongoPush, click the Stop MongoPush button in the UI, and an HTML status report is downloaded automatically. The report is a beautified rendering of the snapshot file, which is a compressed BSON file with a -mongopush.bson.gz suffix under the snapshot directory. We can view the progress of the migration and regenerate an HTML report using the following command:
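A sketch of such an invocation follows; the snapshot file name matches the pattern described above, but the flag for reading an existing snapshot and the snapshot directory path are assumptions rather than confirmed syntax:

    ./mongopush -print ./snapshot/hostname.example.com-mongopush.bson.gz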
Here, hostname.example.com is the FQDN of the host where mongopush was executed.
Splitting a large collection into smaller tasks makes high parallelism, resumption, auditing, and progress/status reporting possible. Tasks, each stored as a JSON document, are saved to the _mongopush.tasks.<replica> collections in the target cluster. Each document has four fields (an illustrative example follows the list):
_id: task ID
ids: an array of _id from the source shard (or replica set)
ns: namespace
replica_set: shard (or replica set) name
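An illustrative task document is shown below; the values are hypothetical and only the shape follows the four fields above:

    {
      "_id": "atlanta.vehicles.task-00042",   // hypothetical task ID
      "ids": [                                 // _id values from the source shard (two shown)
        ObjectId("5f1d3c2e9b1e8a0001a00000"),
        ObjectId("5f1d3c2e9b1e8a0001a02710")
      ],
      "ns": "atlanta.vehicles",
      "replica_set": "shard01"                 // source shard (or replica set) name
    }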
This information is used to report status and to resume a migration. As for resuming, unless the server hosting mongopush crashes, mongopush keeps retrying through network connectivity interruptions or temporary cluster outages.
Pause and Resume
What's Next?
Migrating big data to MongoDB Atlas is a time-consuming task. I hope this provides helpful information to assist your Atlas migration project. In the next blog, Change Shard Key and Migrate to Atlas Using MongoPush, I'll discuss changing the shard key and migrating to Atlas at the same time. Please post your feedback in the comments below.