

Archive for the ‘couchDB’ Category

Can I see your ID please? The importance of CouchDb record IDs

May 26, 2012 1 comment

One of the things I badly underestimated when I first started working with CouchDB was the importance of a meaningful ID for documents. The default of letting the database assign a UUID seems reasonable at first, but a UUID is largely useless for querying. A good ID scheme, on the other hand, is like having a free index for which you don’t have to maintain a view. You can query it using the built-in _all_docs view. Well-thought-out IDs can save you tons of headaches and fiddling with views later. With large data sets in particular this can be a big deal, because views can take up a significant amount of storage space and processing. Unfortunately, IDs are hard to change once you have a lot of data in the database, so it is worth thinking about them before you get very far in.

There are a handful of primary considerations when choosing your document IDs. In most data sets, there is a key piece of information that most of your searches are based on. Account or customer records are usually looked up by ID. Customer transactions are usually retrieved by account and date range. That’s not to say all queries are based on these data elements, but they tend to account for a significant majority of queries. Since documents are automatically indexed by ID, a good candidate for the ID is the information you most often search on.

Another important consideration when designing views and IDs in CouchDb is storage space and view size. A view that stores several pieces of information from every document can double the size of the overall data required, and even more space is needed for maintenance operations like compacting. If you need a particular view that includes several pieces of data from the document, consider designing your IDs to replace the need for the view. In fact, getting rid of as many views as you can is a worthwhile goal. Views are powerful and useful, but unnecessary views can consume huge amounts of extra space and processing to maintain.

A third consideration for IDs is ID sequence. Documents are sorted by ID. It is often worthwhile including a timestamp as a part of the ID. For example, for transaction documents the account plus a timestamp often makes a good ID. This automatically sorts the transactions within an account by time. In fact, sometimes the primary factor in looking up documents is time, and in those cases it might be a good practice to start the ID with a timestamp. How to format the time depends on your use. Do it in the way closest to how you will retrieve the data. That might be a standard JavaScript timestamp (1338051515556), or a date/time in a format like YYYYMMDD-HHMMSS (something like 20120526-165835). Remember the point here is a useful sort order, so any date and time format that will result in the desired left-to-right string sort order is what you want. When you’re using timestamps, remember that IDs have to be unique. Milliseconds are not a guaranteed unique identifier for a web application. More on that in a moment.
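As a rough sketch of the account-plus-timestamp idea in Python: the function name, the account numbers, and the choice of a compact YYYYMMDDHHMMSS stamp in UTC are all my own assumptions for illustration. Python's string sort stands in here for the database's ordering; for IDs made only of digits and a dash the two should agree.

```python
from datetime import datetime, timezone


def make_transaction_id(account, when):
    """Build an ID of the form <account>-<YYYYMMDDHHMMSS>, using UTC."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{account}-{stamp}"


# Hypothetical accounts and times, deliberately listed out of order.
ids = [
    make_transaction_id("100234", datetime(2012, 5, 26, 16, 58, 35, tzinfo=timezone.utc)),
    make_transaction_id("100234", datetime(2011, 7, 8, 9, 0, 0, tzinfo=timezone.utc)),
    make_transaction_id("100187", datetime(2012, 1, 2, 3, 4, 5, tzinfo=timezone.utc)),
]

# A plain string sort groups by account, then orders by time within each account.
for doc_id in sorted(ids):
    print(doc_id)
```

This only works cleanly if the account part has a fixed width; otherwise account "99" would sort after account "100234".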

Sequence is also important if you have multiple types of documents that you need to retrieve together. For example, blog posts and comments are often retrieved at the same time. So it often makes sense to have the ID of comments start with the ID of the post they relate to. That way, you can easily retrieve the blog post, together with all of the documents that are sorted between that post and the next post.
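Here is a small sketch of what such a range query could look like against _all_docs, assuming comment IDs start with the post's ID. The host, database name, and post ID are placeholders, and closing the range with the high code point "\ufff0" is a common convention rather than the only way to do it.

```python
import json
from urllib.parse import urlencode

# A high Unicode code point, commonly used to close a "starts with" key range.
HIGH_SENTINEL = "\ufff0"


def prefix_range_url(base_url, db, prefix, include_docs=True):
    """Build an _all_docs URL covering every ID that starts with `prefix`."""
    params = {
        # Key parameters must be JSON-encoded strings, quotes included.
        "startkey": json.dumps(prefix),
        "endkey": json.dumps(prefix + HIGH_SENTINEL),
    }
    if include_docs:
        params["include_docs"] = "true"
    return f"{base_url}/{db}/_all_docs?{urlencode(params)}"


# Hypothetical server, database, and post ID.
url = prefix_range_url("http://localhost:5984", "blog", "post-20120526")
print(url)
```

One GET on that URL would return the post followed by everything whose ID begins with the post's ID.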

A fourth consideration is to make the information in the IDs be enough data for at least some queries. An _all_docs lookup without an include_docs returns the ID and the revision. If the ID is enough information that it is all you need for a significant number of queries, you can reduce the data you need to move over the wire in at least some of your queries.
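To make that concrete, here is a sketch that answers "when did this account have transactions?" from the row IDs alone. The response body is a hand-written sample (the IDs and rev values are invented), and it assumes IDs of the form account-timestamp.

```python
import json

# Illustrative _all_docs response without include_docs; the values are invented.
sample_response = """
{"total_rows": 3, "offset": 0, "rows": [
 {"id": "100234-20110708090000", "key": "100234-20110708090000", "value": {"rev": "1-aaa"}},
 {"id": "100234-20120526165835", "key": "100234-20120526165835", "value": {"rev": "2-bbb"}}
]}
"""


def transaction_stamps(response_text):
    """Return the timestamp part of each row ID; no document bodies are needed."""
    rows = json.loads(response_text)["rows"]
    return [row["id"].split("-", 1)[1] for row in rows]


print(transaction_stamps(sample_response))
```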

In CouchDB, people often store documents of multiple types together in the same database. I already mentioned blog posts and comments. In some cases, you almost always look for only one type of document at a time. In that case, it makes sense to start the ID with an indicator of the type. Alternatively, you might have several types of documents that all relate to a parent or master document (again, blog post comments are one example of this), and in those cases it might make sense for the secondary documents’ IDs to start with the master document’s ID, followed by a type indicator, and then a unique identifier within that type. One or two characters is usually enough for the type indicator.

IDs often end up being several pieces of information appended together. Maintain readability by adding a separator that will also be useful in queries. Often a dash is a good choice. For example, if I have an account number plus a timestamp for transaction IDs, I typically put a dash between the account and time. On the other hand, keeping the ID short can save on space, so don’t add a lot of extra stuff to the ID. The minimum of separators to make the result readable is a good goal. So, for example, if you’re using timestamps, keep them to the meaningful digits rather than including colons and time zone information.

Remember that IDs have to be unique, so if there is any chance that you’ll end up with two documents with the same ID, change it so they are guaranteed to be unique. Milliseconds are not guaranteed to make unique IDs, particularly when you have more than one database being replicated or using BigCouch. If you have a few dozen inserts a second you’ll end up with conflicts at some point. If you can guarantee that your ID will be unique using unique information from the record itself, that’s great. However, that’s not always the case. In a distributed web application generating guaranteed sequences at the application level is not practical. So once I get all of the information I care about in the ID I will often append a few random characters to the end. Milliseconds plus four or five semi-random characters is much less likely to generate collisions. The other approach is to just figure there might be collisions, and have your application watch for document conflicts and PUT the document again with a slightly different ID if you get a collision. As a practical matter that’s not a good idea if it is very likely to happen much, but if collisions are possible but very unlikely it is often a good compromise.
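Both approaches can be sketched together: append a few random characters to a millisecond timestamp, and retry with a fresh ID if the database still reports a conflict. The `put_doc` callable is a hypothetical stand-in for whatever HTTP client you use; I am assuming it returns the status code of a PUT to /db/doc_id, where 201 or 202 means the document was accepted and 409 means a document with that ID already exists.

```python
import secrets
import string
import time

ALPHABET = string.ascii_lowercase + string.digits


def make_id(prefix, suffix_len=5):
    """Prefix, current time in milliseconds, and a short random suffix."""
    millis = int(time.time() * 1000)
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
    return f"{prefix}-{millis}-{suffix}"


def save_with_retry(put_doc, prefix, doc, attempts=3):
    """Try to store `doc` under a fresh ID, regenerating the ID on a conflict.

    put_doc(doc_id, doc) is a hypothetical callable returning the HTTP status.
    """
    for _ in range(attempts):
        doc_id = make_id(prefix)
        status = put_doc(doc_id, doc)
        if status in (201, 202):
            return doc_id
        if status != 409:
            raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("gave up after repeated ID collisions")
```

With five characters from a 36-character alphabet there are about 60 million possible suffixes per millisecond, so the retry path should rarely run.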

Of course, all of this has to be weighed against the size of the ID. The ID is stored with every record. Make it as useful as possible, but at the same time avoid having a lot of extra stuff in it that you won’t need. Also, avoid redundancy. If information is in the ID, consider whether you can eliminate a field or two from the document body. If you have the account number and time in the transaction document’s ID, maybe you can remove those fields from the document itself and just split the ID after you retrieve it.
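A minimal sketch of that trade, assuming the ID is account-dash-time and that the account itself never contains a dash. The field names `account` and `time` are placeholders for whatever your documents actually use.

```python
def to_stored(doc):
    """Move the account and time into the _id and drop them from the body."""
    stored = dict(doc)
    stored["_id"] = f"{doc['account']}-{doc['time']}"
    del stored["account"], stored["time"]
    return stored


def from_stored(stored):
    """Rebuild the account and time fields by splitting the _id."""
    account, time_part = stored["_id"].split("-", 1)
    doc = dict(stored)
    doc["account"] = account
    doc["time"] = time_part
    return doc


# Hypothetical transaction document.
original = {"account": "100234", "time": "20120526165835", "amount": 42}
stored = to_stored(original)
print(stored)
print(from_stored(stored))
```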

The overall goal is to make the document IDs as useful as possible by taking advantage of the fact that the ID is stored with every record anyway, is always indexed, and can be queried through _all_docs just like a view on a simple key. With a little thought put into your database before you start adding records in volume, you can reduce your storage and maintenance resource requirements, and optimize your ability to query the data.

Categories: couchDB

Moving CouchDb database files between servers

July 8, 2011 8 comments

There are several methods available for copying data between CouchDB servers. The most obvious is replication, which is built in and which CouchDB does extremely well. If that option is viable it is probably the way to go. A while back I posted about a method that uses Couch’s bulk document handling to copy data. That process works well too, and I continue to use it from time to time for some data.

Recently I had a situation in which I needed to set up several development and testing servers with the same initial state for the data. These were not on the same networks, and replication wasn’t convenient. I was moving a good bit of data across several databases, so the bulk document approach wasn’t attractive either. So I resorted to just copying the data files between servers. CouchDb’s design makes this easy to do.

The steps I take are probably overly cautious, but here’s what I do:

  1. Stop the couchdb service on the source host
  2. tar.gz the data files. On my Ubuntu servers this is typically in /var/lib/couchdb (sometimes in a subdirectory based on the Couch version). If you aren’t sure where these files are, you can find the path in your CouchDb config files, or often by doing a ps -A w to see the full command that started CouchDb. Make sure you get the subdirectories that start with . when you archive the files.
  3. Restart the couchdb service on the source host.
  4. scp the tar.gz file to the destination host and unpack it in a temporary location there.
  5. chown the files to the user and group that owns the files already in the database directory on the destination. This is likely couchdb:couchdb. This is important, as messing up the file permissions is the only way I’ve managed to mess up this process so far.
  6. Stop CouchDb on the destination host.
  7. cp the files into the destination directory. Again on my hosts this has been /var/lib/couchdb.
  8. Double check the file permissions in their new home.
  9. Restart CouchDb on the destination host.
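The archiving part of the steps above (step 2) could be scripted roughly like this. It is only a sketch: the function name is mine, the paths in the usage comment are examples, and stopping the service, copying the archive, and fixing ownership are still done by hand as described. Adding the directory itself, rather than a shell wildcard, is what picks up the entries that start with a dot.

```python
import tarfile
from pathlib import Path


def archive_data_dir(data_dir, archive_path):
    """Write a .tar.gz of everything under data_dir, hidden entries included.

    Returns the member names so the result can be checked before copying.
    The archive must be written outside data_dir, or it would include itself.
    """
    root = Path(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Adding the directory recursively includes dot-prefixed subdirectories.
        tar.add(root, arcname=".")
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


# Example use, with the service stopped first (paths are examples only):
# names = archive_data_dir("/var/lib/couchdb", "/tmp/couchdb-data.tar.gz")
```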

You may or may not have to stop the CouchDb services, but it seems like a good idea to me to decrease the chances of inconsistent files.

I have done this a number of times now with no problems, other than when I manage to mess up the file permissions.

Categories: couchDB

Running out of disk space for CouchDB

June 16, 2011 2 comments

I added some new views to CouchDb on a development server the other day. The views included three or four emits and the full document in the view, and the disk space used exploded to several times the size of the actual data stored. Everything came to a screeching halt as the server had no space to write logs, no space to build views, no space to store the data we were pushing in, and no space for the compacts that were trying to run. It was one of those moments when you are really happy that the host you just messed up is a development box. This episode raised several lessons for me, some of which were new in the specific context, some of which were good reminders.

First, CouchDB needs lots of disk space. If you are working with views, lots of disk space can disappear very fast, and Couch also copies the data files when compacting. This is not a bad thing. It’s what allows Couch to keep running and performing while it is doing these operations, but it is something to plan for when sizing resources. Running out of disk space is a bad thing. In my case, I had estimated the size of the data and allowed some room for compacts and views, but not nearly enough.

Second, this episode started quite a debate within my team about designing views. I tend toward making views overly inclusive. My friend tends toward making the views as small as possible, and then throwing in an include_docs if you need more in the results. This is a trade-off, and in the end you have to consider it with every view you write. Small views save lots of disk space, and speed up writes and simple queries where you don’t need much in the results. Using include_docs is fairly cheap if you are getting a handful of docs. If you’re looking for results from hundreds or thousands of records, include_docs loses its appeal, because the server has to fetch each of those documents. It’s a balancing act. If you argue for erring on the side of larger views for faster queries, I’d suggest making sure you have plenty of disk space. It turns out to be a little embarrassing when you make that argument and then badly underestimate how much disk space you really need, even on a development box.

Third, I got to learn how to clean up after dead compact jobs. When the server runs out of disk space while trying to compact a database, it can leave behind compact files that prevent compacts from working even after the space problem is solved. In my case on Ubuntu, these are in /var/lib/couchdb. If there are zero-size .compact files in your database directory, you just need to delete them and restart the compact. I also noticed that there were some non-zero-size .compact files for databases that were not actually compacting, and I removed these as well. Everything went back to humming after that.
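A small helper along these lines can list the leftovers before you delete anything. This is a sketch under my own assumptions: the function name is invented, and it only reports files; deciding which ones are safe to remove (and confirming no compaction is actually running) is still up to you.

```python
from pathlib import Path


def find_compact_files(data_dir):
    """Return (path, size_in_bytes) for every *.compact file under data_dir."""
    return sorted(
        (str(path), path.stat().st_size)
        for path in Path(data_dir).rglob("*.compact")
        if path.is_file()
    )


# Example use (the path is an example only):
# for path, size in find_compact_files("/var/lib/couchdb"):
#     print(f"{size:>12}  {path}")
```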

CouchDB is a great tool. It does have some peculiarities that are good to keep in mind. Life was simple when we had basically no real choices. “Welcome to LAMP! Would you like MySQL or Postgres with that?” Now we have all kinds of options on the menu, and we actually have to think about finding the right tool and understanding its strengths, weaknesses, and quirks.

Categories: couchDB

couchdb bulk document transfer

July 17, 2010 1 comment

I have a few test and development servers scattered around at various locations, some of which are there just because of convenience for the different people and locations I work with. I needed to copy some databases from one couchdb server to the other, but since the servers weren’t running the same version of couchdb I couldn’t just copy the files. The bulk document functionality almost gets me there, but it has extra junk besides just the bare documents. I also wanted to be able to remove the _rev tags, since in this case they are just an extra headache (I’m deleting the database and loading it fresh from the other site).

My solution was to write a little method in PHP to dump a database and write out the contents to a file. The approach I used was sparked by the specific project I was working on when I needed it, so it is written based on Zend Framework. I’m just using Zend_Http_Client for my CouchDB interaction, which makes working with couchdb very pleasant.

First, we need to get the data from CouchDB.


        // Fetch every document in the database, bodies included.
        $client = new Zend_Http_Client( );
        $resp = $client->setUri( $this->URI . '/' . $db . '/_all_docs?include_docs=true' )
                ->request( 'GET' );
        $body = $resp->getBody( );

        if( $resp->getStatus( ) != 200 ){
            // Escape the response body so it cannot inject markup into the error page.
            echo '<div class="error">Status: ' . $resp->getStatus( ) . '<br>Did you spell the database name right?<br><pre>' . htmlspecialchars( $body ) . '</pre></div>';
            exit;
        }

We’re using Zend_Http_Client to connect to CouchDB. We set the URI, which we have in our class constructor. I’m getting it from Zend_Config, and piecing together the login credentials if they are in the config, but that’s a different post. The $db is just the database name we’re dumping. If the status returned in the response isn’t 200, that almost always means (in this specific context) that we asked for a database that isn’t there ($db doesn’t exist in CouchDB), so since we can’t continue we show an error and exit.

So now we have a big glob of JSON in $body, but it isn’t formatted the way we need it to load, and we need a file to write this into. One side note here: I’m assuming the size of our database is fairly moderate. My use case was a couple thousand documents, and this approach worked well. You would need to handle it in pieces if you had more data than you could process at once.
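For what it's worth, handling it in pieces could look something like the following sketch, written in Python rather than PHP to keep it short. It pages through _all_docs with limit and startkey, asking for one extra row per page and using that row's ID as the start of the next page. The `fetch_page` callable is hypothetical: I am assuming it performs the GET with the given query parameters and returns the decoded "rows" list.

```python
import json


def iter_all_docs(fetch_page, page_size=1000):
    """Yield _all_docs rows one page at a time instead of all at once.

    fetch_page(params) is a hypothetical callable that does
    GET /<db>/_all_docs with `params` and returns the "rows" list.
    """
    startkey = None
    while True:
        # Ask for one row more than we need; it marks where the next page starts.
        params = {"include_docs": "true", "limit": page_size + 1}
        if startkey is not None:
            params["startkey"] = json.dumps(startkey)
        rows = fetch_page(params)
        for row in rows[:page_size]:
            yield row
        if len(rows) <= page_size:
            return  # fewer rows than requested: this was the last page
        startkey = rows[page_size]["id"]
```

Each row could then be written to the output file as it arrives, so memory use stays at roughly one page.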

So in the next part we convert the JSON to an array so we can iterate through it in PHP, and we drop the parts we don’t care about. We only need the part that’s in “rows.” CouchDB gives us extra information about the number of rows in the data and some other stuff we aren’t going to use here.

        $data = Zend_Json::decode( $body );
        $total_rows = $data['total_rows'];
        $data = $data['rows'];

I use Zend_Json::decode instead of just json_decode here because by default it converts to a nice array instead of to an object, and it’s just nicer to work with in this situation. (Passing true as the second argument to json_decode would also give you an array.)

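        $resource = fopen( $outpath, 'w' ); // open the output file for writing
        if( $resource === false ){
            echo '<div class="error">Could not open the output file for writing.</div>';
            exit;
        }

        $count = 0;
        $comma = "{\"docs\": [\n"; // the first write opens the JSON object and the docs array
        foreach( $data as $key => $value ){
            if( !$preserve_rev ) unset( $value['doc']['_rev'] ); // we aren't going to use this to update, so we don't need the _rev
            $docs = $comma . Zend_Json::encode( $value['doc'] );
            fwrite( $resource, $docs );
            unset( $data[$key] ); // free each row once it has been written
            $comma = ",\n"; // every later document is preceded by a comma
            $count++;
        }
        fwrite( $resource, "\n]}" ); // close the docs array and the JSON object
        fclose( $resource );
        $filesize = filesize( $outpath );
        echo "<div>CouchDB reported $total_rows rows.  I wrote $count rows to $outpath, with a resulting file size of $filesize bytes.</div>";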
    }

The first line here is opening a file so we can write to it. Make sure you check if the file exists or not and that you are safely handling the file name first.

We’re using $count and $total_rows (from that last section) to report on the results at the end.

We add the JSON opening to the front of our file (the first $comma setting does that), and iterate through the array, sticking each document from the data into the file, separating each document with a comma. We unset the documents from the array as we use them, just to tidy up as we go.

After we’ve gone through the whole array we close the JSON array and object, close the file, get the filesize to report, and output a summary of what we did.

This results in a file at $outpath that we can move to the destination server and just load in. Typically for my case it means I copy the file, delete and recreate the database, and push the file in as the new data. So far I’ve done that from the command line with curl, an example of why CouchDB is so easy to work with.
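If you would rather script the load than type the curl command, the request is a POST of the dump file to the database's _bulk_docs endpoint with a JSON content type. Here is a sketch in Python that only builds the request; the function name, host, database, and file path are placeholders of mine.

```python
import urllib.request


def bulk_docs_request(base_url, db, dump_path):
    """Build (but do not send) a POST of the dump file to /<db>/_bulk_docs."""
    with open(dump_path, "rb") as handle:
        payload = handle.read()
    return urllib.request.Request(
        url=f"{base_url}/{db}/_bulk_docs",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


# Example use (placeholders only); urlopen actually sends the request:
# request = bulk_docs_request("http://localhost:5984", "mydb", "/tmp/mydb-dump.json")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```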

Categories: couchDB, Zend Framework