Archive

Archive for the ‘couchDB’ Category

Moving CouchDb database files between servers

July 8, 2011 6 comments

There are several methods available for copying data between CouchDb servers. The most obvious is replication, which CouchDb does extremely well and it’s built in. If that option is viable it is probably the way to go. A while back I posted about a method I have used to use bulk document handling in Couch to copy data. That process works well too, and I continue to use that from time to time for some data.

Recently I had a situation in which I needed to set up several development and testing servers with the same initial state for the data. These were not on the same networks, and replication wasn’t convenient. I was moving a good bit of data across several databases, so the bulk document approach wasn’t attractive either. So I resorted to just copying the data files between servers. CouchDb’s design makes this easy to do.

The steps I take are probably overly cautious, but here’s what I do:

  1. Stop the couchdb service on the source host
  2. tar.gz the data files. On my Ubuntu servers this is typically in /var/lib/couchdb (sometimes in a subdirectory based on the Couch version). If you aren’t sure where these files are, you can find the path in your CouchDb config files, or often by doing a ps -A w to see the full command that started CouchDb. Make sure you get the subdirectories that start with . when you archive the files.
  3. Restart the couchdb service on the source host.
  4. scp the tar.gz file to the destination host and unpack them in a temporary location there.
  5. chown the files to the user and group that owns the files already in the database directory on the destination. This is likely couchdb:couchdb. This is important, as messing up the file permissions is the only way I’ve managed to mess up this process so far.
  6. Stop CouchDb on the destination host.
  7. cp the files into the destination directory. Again on my hosts this has been /var/lib/couchdb.
  8. Double check the file permissions in their new home.
  9. Restart CouchDb on the destination host.

You may or may not have to stop the CouchDb services, but it seems like a good idea to me to decrease the chances of inconsistent files.

I have done this a number of times now with no problems, other than when I manage to mess up the file permissions.

Categories: couchDB

Running out of disk space for CouchDB

June 16, 2011 Leave a comment

I added some new views to CouchDb on a development server the other day. The views included three or four emits and the full document in the view, and the disk space used exploded to several times the size of the actual data stored. Everything came to a screeching halt as the server had no space to write logs, no space to build views, no space to store the data we were pushing in, and no space for the compacts that were trying to run. It was one of those moments when you are really happy that the host you just messed up is a development box. This episode raised several lessons for me, some of which were new in the specific context, some of which were good reminders.

First, CouchDb needs lots of disk space. If you are working with views, lots of disk space can disappear very fast, as Couch copies the data files for compacting. This is not a bad thing, its what allows Couch to keep running and performing while it is doing these operations, but it is something to plan for when sizing resources. Running out of disk space is a bad thing. In my case, I had estimated the size of the data and allowed some room for compacts and views, but not nearly enough.

Second, this episode started up quite a debate within my team about designing views. I tend toward making views overly inclusive. My friend tends toward making the views as small as possible, and then throwing in an include_docs if you need more in the results. This is a trade-off, and in the end you have to consider it with every view you write. Small views save lots of disk space, and speed up writes and simple queries where you don’t need much in the results. Using include_docs is fairly cheap if you are getting a handful of docs. If you’re looking for results from hundreds or thousands of records, include_docs loses its appeal, because the server has to fetch each of those documents. Its a balancing act. If you argue for larger views for faster queries as the side on which you should err, I’d suggest making sure you have plenty of disk space. Turns out it is a little embarrassing when you make that argument and then badly under estimate how much disk space you really need, even on a development box.

Third, I got to learn how to clean up from dead compact jobs. When the server is out of disk space when it tries to compact a database, it can leave behind compact files that prevent compacts from working even after the space problem is solved. In my case on Ubuntu, these are in /var/lib/couchdb. If there are 0 size .compact files in your database directory you just need to delete them and restart the compact. I also noticed that there were some non-zero size .compact files in databases that were not actually compacting, and I removed these as well. Everything went back to humming after that.

CouchDb is a great tool. It does have some peculiarities it is good to keep in mind. Life was simple when we had basically no real choices. “Welcome to LAMP! Would you like MySQL or Postres with that?” Now we have all kinds of options on the menu, and we actually have to think about finding the right tool and understanding its strengths, weaknesses, and quirks.

Categories: couchDB

couchdb bulk document transfer

July 17, 2010 1 comment

I have a few test and development servers scattered around at various locations, some of which are there just because of convenience for the different people and locations I work with. I needed to copy some databases from one couchdb server to the other, but since the servers weren’t running the same version of couchdb I couldn’t just copy the files. The bulk document functionality almost gets me there, but it has extra junk besides just the bare documents. I also wanted to be able to remove the _rev tags, since in this case they are just an extra headache (I’m deleting the database and loading it fresh from the other site).

My solution was to write a little method in PHP to dump a database and write out the contents to a file. The approach I used was sparked by the specific project I was working on when I needed it, so it is written based on Zend Framework. I’m just using Zend_HTML for my CouchDB interaction, which enables working with couchdb very nicely.

First, we need to get the data from CouchDB.


       $client = new Zend_Http_Client( );
        $resp = $client->setUri( $this->URI . '/' . $db . '/_all_docs?include_docs=true' )
                ->request( 'GET' );
        $body = $resp->getBody( );

        if( $resp->getStatus( ) != 200 ){
            echo '<div class="error">Status: ' . $resp->getStatus( ) . "<br>Did you spell the database name right?<br><pre>$body</pre></div>"; 
            exit;
        }

We’re using Zend_Http_Client to connect to CouchDB. We set the URI, which we have in our class constructor. I’m getting it from Zend_Config, and piecing together the login credentials if they are in the config, but that’s a different post. The $db is just the database name we’re dumping. If the status returned in the response isn’t 200, that almost always means (in this specific context) that we asked for a database that isn’t there ($db doesn’t exist in CouchDB), so since we can’t continue we show an error and exit.

So now we have a big glob of JSON in $body, but it isn’t formatted the way we need it to load, and we need a file to write this into. One side note here: I’m assuming the size of our database is fairly moderate. My use case was a couple thousand documents, and this approached worked well. You would need to handle it in pieces if you had more data than you could process at once.

So in the next part we convert the JSON to an array so we can iterate through it in PHP, and we drop the parts we don’t care about. We only need the part that’s in “rows.” CouchDB gives us extra information about the number of rows in the data and some other stuff we aren’t going to use here.

        $data = Zend_Json::decode( $body );
        $total_rows = $data['total_rows'];
        $data = $data['rows'];

I use Zend_Json::decode instead of just json_decode here because it converts to a nice array instead of to an object, and its just nicer in this situation to work with.

        $resource = fopen( $outpath, 'w' );

        $count = 0;
        $comma = "{\"docs\": [\n";
        foreach( $data as $key => $value ){
            if( !$preserve_rev ) unset( $value['doc']['_rev'] ); // we aren't going to use this to update, so we don't need the _rev
            $docs = $comma . Zend_Json::encode( $value['doc'] );
            fwrite( $resource, $docs );
            unset( $data[$key] );
            $comma = ",\n";
            $count++;
        }
        fwrite( $resource, "\n]}" );
        fclose( $resource );
        $filesize = filesize( $outpath );
        echo "<div>CouchDB reported $total_rows rows.  I wrote $count rows to $outfile, with a resulting file size of $filesize bytes.</div>";
    }

The first line here is opening a file so we can write to it. Make sure you check if the file exists or not and that you are safely handling the file name first.

We’re using $count and $total_rows (from that last section) to report on the results at the end.

We add the JSON opening to the front of our file (the first $comma setting does that), and iterate through the array, sticking each document from the data into the file, separating each document with a comma. We unset the documents from the array as we use them, just to tidy up as we go.

After we’ve gone through the whole array we close the JSON array and object, close the file, get the filesize to report, and output a summary of what we did.

This results in a file at $outpath that we can move to the destination server and just load in. Typically for my case it means I copy the file, delete and recreate the database, and push the file in as the new data. So far I’ve done that from the command line with curl, an example of why CouchDB is so easy to work with.

Categories: couchDB, Zend Framework
Follow

Get every new post delivered to your Inbox.