Moving CouchDb database files between servers
There are several methods available for copying data between CouchDb servers. The most obvious is replication, which is built in and which CouchDb does extremely well. If that option is viable, it is probably the way to go. A while back I posted about a method that uses CouchDb's bulk document handling to copy data. That process works well too, and I continue to use it from time to time for some data.
Recently I had a situation in which I needed to set up several development and testing servers with the same initial state for the data. These were not on the same networks, and replication wasn’t convenient. I was moving a good bit of data across several databases, so the bulk document approach wasn’t attractive either. So I resorted to just copying the data files between servers. CouchDb’s design makes this easy to do.
The steps I take are probably overly cautious, but here’s what I do:
- Stop the couchdb service on the source host
- tar.gz the data files. On my Ubuntu servers these typically live in /var/lib/couchdb (sometimes in a subdirectory based on the Couch version). If you aren't sure where the files are, you can find the path in your CouchDb config files, or often by running
ps -A w
to see the full command that started CouchDb. Make sure you include the subdirectories that start with . when you archive the files.
- Restart the couchdb service on the source host.
- scp the tar.gz file to the destination host and unpack it in a temporary location there.
- chown the files to the user and group that own the files already in the database directory on the destination. This is likely couchdb:couchdb. This is important, as messing up the file permissions is the only way I've managed to break this process so far.
- Stop CouchDb on the destination host.
- cp the files into the destination directory. Again on my hosts this has been /var/lib/couchdb.
- Double check the file permissions in their new home.
- Restart CouchDb on the destination host.
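The steps above can be sketched as a script. The /var/lib/couchdb path, hostnames, and service commands are assumptions from my own Ubuntu setup; the remote and root-only steps are shown as comments, and the archive step runs against a stand-in directory so you can see that the hidden (dot) entries survive the round trip.

```shell
# Sketch of the copy procedure. Paths, hostnames, and service names are
# assumptions; demonstrated on a stand-in data directory.
DATA=$(mktemp -d)                         # stand-in for /var/lib/couchdb (source)
mkdir -p "$DATA/.mydb_design"             # hidden view subdirectory
touch "$DATA/mydb.couch" "$DATA/.mydb_design/mrview.view"

# sudo service couchdb stop               # on the real source host

# Archiving with `-C dir .` includes the entries that start with a dot:
tar -czf couch-data.tar.gz -C "$DATA" .

# scp couch-data.tar.gz user@dest:/tmp/   # copy to the destination host
# sudo service couchdb start              # restart the source

# On the destination (shown locally): stop CouchDb, unpack, fix ownership.
DEST=$(mktemp -d)                         # stand-in for /var/lib/couchdb (dest)
tar -xzf couch-data.tar.gz -C "$DEST"
# sudo chown -R couchdb:couchdb "$DEST"   # needs root on a real host
ls -A "$DEST"                             # the .mydb_design directory is there
```

The `-C "$DATA" .` form is what keeps the dot-directories in the archive; a glob like `$DATA/*` would silently skip them.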
You may or may not have to stop the CouchDb services, but it seems like a good idea to me, since it decreases the chances of copying inconsistent files.
I have done this a number of times now with no problems, other than when I manage to mess up the file permissions.
This doesn't work when moving down a version, from 1.1.0 to 1.0.1:
An error occurred retrieving a list of all documents in futon.
Unexpected message, restarting couch_server: {'EXIT',,
    {{badmatch,{error,eacces}},
     [{couch_file,init,1},
      {gen_server,init_it,6},
      {proc_lib,init_p_do_apply,3}]}}
Right, this approach requires the databases to be the same version. If they aren’t, use replication or a dump and bulk load instead.
Nice article David, good recipe to have! In addition to this, would you happen to know how to copy or move documents between two databases on the same CouchDB instance?
If I were moving documents between two databases in the same instance, and had a lot of them to move at once, I’d probably dump the documents to a file and use a bulk load. The other option would be replication.
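A minimal sketch of that dump-and-bulk-load path, assuming a CouchDB on localhost:5984 with hypothetical source_db and target_db databases. The curl lines are shown as comments, and the sample response below is hand-made so the reshaping step can run on its own.

```shell
# Hand-made sample of what GET /source_db/_all_docs?include_docs=true returns.
# A real dump would come from something like:
#   curl http://127.0.0.1:5984/source_db/_all_docs?include_docs=true > dump.json
cat > dump.json <<'EOF'
{"total_rows":2,"offset":0,"rows":[
 {"id":"a","key":"a","value":{"rev":"1-x"},"doc":{"_id":"a","_rev":"1-x","n":1}},
 {"id":"b","key":"b","value":{"rev":"1-y"},"doc":{"_id":"b","_rev":"1-y","n":2}}]}
EOF

# Reshape the rows into the {"docs":[...]} body that POST /target_db/_bulk_docs
# expects. To keep the revision history, add "new_edits": false to the body;
# otherwise strip the _rev field from each doc before loading.
python3 -c '
import json
rows = json.load(open("dump.json"))["rows"]
json.dump({"docs": [r["doc"] for r in rows]}, open("bulk.json", "w"))
'

# Load into the target database:
#   curl -X POST http://127.0.0.1:5984/target_db/_bulk_docs \
#        -H 'Content-Type: application/json' -d @bulk.json
cat bulk.json
```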
“There are several methods available for copying data between CouchDb servers”.
=> How about other ways?
The only ones I can really comment on are things I’ve actually done, which include:
1. replication, mentioned above. This works well if you are copying thousands of documents, less well if you are copying tens of millions or are simultaneously writing a lot and have several views.
2. copying files as described in the post above. This is often the quickest, but it does require compatible couchdb versions and that you can take databases offline.
3. dumping to json text files and loading them in bulk (described in another post). This works well if the couchdb versions don't match and the quantity of data is relatively small (thousands of records).
4. writing a script to walk one database and write to the other. This is useful when you need logic in the middle rather than just a raw copy. It is obviously the most resource intensive.
In the vast majority of cases I’d do the first or second options of these four.
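For the first option, a one-shot replication can be kicked off over HTTP. A sketch of the request body that POST /_replicate accepts; the hostnames and database name are placeholders, and the curl line is shown as a comment.

```shell
# Request body for CouchDB's POST /_replicate endpoint (placeholder hosts/db).
cat > repl.json <<'EOF'
{"source": "http://sourcehost:5984/mydb",
 "target": "http://targethost:5984/mydb",
 "create_target": true}
EOF

# Kick off a one-shot replication (run against either server):
#   curl -X POST http://127.0.0.1:5984/_replicate \
#        -H 'Content-Type: application/json' -d @repl.json
cat repl.json
```

With create_target set, the destination database is created if it doesn't exist yet.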
We tried doing it that way with a massive 500GB database, since the inbuilt replication would have taken weeks to finish the initial replication of both databases. Both servers are connected through a rather small connection which we can't change.
We assumed that copying those files from A to B would do the trick, yet after doing so we set up replication only to find that the initial replication still takes a long time. Do those two databases need to “exchange” their documents' states after the files have been copied from one place to another?
Did you ever run into similar issues with that method? From what I know there’s a small delta on the source database already, could that cause the issue?
Yes, Couch still has to go through all of the documents to start the replication, which takes a long time when there are many of them. As far as I know, there isn't a good way to skip that. Copying the files does save the replication from transferring the document data itself, however; the replication just has to walk through and compare IDs and revision numbers.