Can I see your ID please? The importance of CouchDB record IDs

One of the things I badly underestimated when I first started working with CouchDB was the importance of a meaningful ID for documents. The default of letting the database assign a UUID seems reasonable at first, but a UUID is largely useless for querying. A good ID scheme, on the other hand, is like having a free index for which you don’t have to maintain a view. You can just query the built-in _all_docs view. Well-thought-out IDs can save you a lot of headaches and fiddling with views later. With large data sets this can be a big deal, because views can take up a significant amount of storage space and processing time. Unfortunately, IDs are hard to change once you have a lot of data in the database, so it is worth thinking about them before you get very far.

There are a handful of primary considerations when designing your document IDs. In most data sets, there is a key piece of information that most of your searches are based on. Account or customer records are usually looked up by ID. Customer transactions are usually retrieved by account and date range. That’s not to say all queries are based on these data elements, but they tend to account for a significant majority of queries. Since documents are automatically indexed by ID, a good candidate for the ID is the information you search on most often.

Another important consideration when designing views and IDs in CouchDB is storage space and view size. A view that stores several pieces of information from every document can double the overall storage required, and even more space is needed for maintenance operations like compaction. If you need a particular view that includes several pieces of data from the document, consider designing your IDs to replace the need for the view. In fact, getting rid of as many views as you can is a worthwhile goal. Views are powerful and useful, but unnecessary views can consume huge amounts of extra space and processing to maintain.

A third consideration for IDs is ID sequence. Documents are sorted by ID, so it is often worthwhile to include a timestamp as part of the ID. For transaction documents, for example, the account plus a timestamp often makes a good ID, because it automatically sorts the transactions within an account by time. Sometimes the primary factor in looking up documents is time, and in those cases it might be good practice to start the ID with the timestamp. How to format the time depends on your use. Do it in the way closest to how you will retrieve the data. That might be a standard JavaScript timestamp (1338051515556), or a date and time in a format like YYYYMMDD-HHMMSS (something like 20120526-165835). Remember that the point here is a useful sort order, so any date and time format that results in the desired left-to-right string sort order is what you want. When you’re using timestamps, remember that IDs have to be unique. Milliseconds are not a guaranteed unique identifier for a web application. More on that in a moment.
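As a minimal sketch of the account-plus-timestamp idea, the Python below builds transaction IDs and checks that a plain string sort puts them in time order. The account number 10042 and the helper name `transaction_id` are made up for illustration. Python's `sorted` only stands in for CouchDB's own ID ordering, which should agree for fixed-width digit strings like these.

```python
from datetime import datetime, timezone


def transaction_id(account: str, when: datetime) -> str:
    """Build an ID like '10042-20120526-165835' (account, dash, UTC timestamp)."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d-%H%M%S")
    return f"{account}-{stamp}"


# Three transactions for one hypothetical account, deliberately out of order.
times = [
    datetime(2012, 5, 26, 16, 58, 35, tzinfo=timezone.utc),
    datetime(2011, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
    datetime(2012, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
]
ids = [transaction_id("10042", t) for t in times]

# A plain string sort puts them in chronological order, because every
# timestamp field is fixed width and ordered most significant first.
sorted_ids = sorted(ids)
```

The same reasoning applies to a millisecond timestamp, as long as every value has the same number of digits.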

Sequence is also important if you have multiple types of documents that you need to retrieve together. For example, blog posts and comments are often retrieved at the same time. So it often makes sense to have the ID of comments start with the ID of the post they relate to. That way, you can easily retrieve the blog post, together with all of the documents that are sorted between that post and the next post.
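One way to express "the post plus everything that sorts right after it" is a key range on _all_docs. This is a sketch under a few assumptions: the post ID `post-0042`, the database name `blog`, and the local server address are all hypothetical, and post IDs are assumed to be fixed width so that one post's ID is never a prefix of another's. The `\ufff0` upper bound is a commonly used high-sorting sentinel character.

```python
import json
from urllib.parse import urlencode


def post_with_comments_params(post_id: str) -> dict:
    """Query parameters selecting a post plus every ID that starts with its ID."""
    return {
        # Keys are sent as JSON values, so each string is JSON-encoded.
        "startkey": json.dumps(post_id),
        # U+FFF0 sorts after ordinary characters, so it closes the prefix range.
        "endkey": json.dumps(post_id + "\ufff0"),
        "include_docs": "true",
    }


params = post_with_comments_params("post-0042")
# Hypothetical server and database name; adjust to your own setup.
url = "http://localhost:5984/blog/_all_docs?" + urlencode(params)
```

A single GET on that URL would then return the post followed by its comments, in ID order.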

A fourth consideration is to put enough information in the ID to answer at least some queries on its own. An _all_docs lookup without include_docs returns only the ID and the revision of each document. If the ID is all you need for a significant number of queries, you can reduce the amount of data you move over the wire for those queries.
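To illustrate, here is a rough sketch that answers a question ("how many transactions per day?") from the IDs alone. The response below has the shape _all_docs returns without include_docs, but the rows and revision strings are made up, and the account-date-time ID layout is an assumption carried over from the earlier examples.

```python
import json

# Shape of an _all_docs response without include_docs: each row carries only
# the ID (as "id" and "key") and the revision. The rows themselves are made up.
sample_response = json.loads("""
{"total_rows": 3, "offset": 0, "rows": [
  {"id": "10042-20120525-091500", "key": "10042-20120525-091500", "value": {"rev": "1-aaa"}},
  {"id": "10042-20120526-101010", "key": "10042-20120526-101010", "value": {"rev": "1-bbb"}},
  {"id": "10042-20120526-165835", "key": "10042-20120526-165835", "value": {"rev": "2-ccc"}}
]}
""")


def transactions_per_day(response: dict) -> dict:
    """Count transactions per day using nothing but the document IDs."""
    counts = {}
    for row in response["rows"]:
        _account, day, _time = row["id"].split("-")
        counts[day] = counts.get(day, 0) + 1
    return counts


per_day = transactions_per_day(sample_response)
```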

In CouchDB, people often store documents of multiple types together in the same database. I already mentioned blog posts and comments. In some cases, you almost always look for only one type of document at a time. In that case, it makes sense to start the ID with an indicator of the type. Alternatively, you might have several types of documents that all relate to a parent or master document (again, blog post comments are one example of this), and in those cases it might make sense for the secondary documents’ IDs to start with the master document’s ID, followed by a type indicator, and then a unique identifier within that type. One or two characters is usually enough for the type indicator.
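A small sketch of the master ID, type indicator, unique identifier layout follows. The type codes `c` (comment) and `t` (tag), the helper name `child_id`, and the post IDs are hypothetical. Python's `sorted` is used as a stand-in for the database's ID ordering.

```python
def child_id(master_id: str, type_code: str, unique: str) -> str:
    """Build '<master>-<type>-<unique>', with a one or two character type code."""
    if not 1 <= len(type_code) <= 2:
        raise ValueError("type code should be one or two characters")
    return f"{master_id}-{type_code}-{unique}"


# Hypothetical documents: two posts, plus comments ("c") and a tag ("t").
doc_ids = [
    child_id("post-0042", "c", "0002"),
    "post-0043",
    child_id("post-0042", "t", "0001"),
    "post-0042",
    child_id("post-0042", "c", "0001"),
]

# Sorting groups each post with its secondary documents, clustered by type.
ordered = sorted(doc_ids)

# Selecting one type under one master is a simple prefix match.
comments = [d for d in ordered if d.startswith("post-0042-c-")]
```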

IDs often end up being several pieces of information appended together. Maintain readability by adding a separator that will also be useful in queries. Often a dash is a good choice. For example, when I have an account number plus a timestamp for transaction IDs, I typically put a dash between the account and the time. On the other hand, keeping the ID short saves space, so don’t add a lot of extra stuff to the ID. The minimum number of separators needed to make the result readable is a good goal. So, for example, if you’re using timestamps, keep them to the meaningful digits rather than including colons and time zone information.
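Here is a sketch of how the dash-separated layout turns "account plus date range" into a pair of ID bounds. The account numbers and dates are invented, the timestamps are assumed to already be in one consistent time zone, and the list comprehension is only a local stand-in for the range scan the database would do.

```python
import json
from datetime import datetime

STAMP = "%Y%m%d-%H%M%S"  # compact: no colons, no time zone suffix


def account_range(account: str, start: datetime, end: datetime) -> tuple:
    """Return the (startkey, endkey) ID bounds for one account's transactions."""
    return (
        f"{account}-{start.strftime(STAMP)}",
        f"{account}-{end.strftime(STAMP)}",
    )


low, high = account_range(
    "10042", datetime(2012, 5, 1), datetime(2012, 5, 31, 23, 59, 59)
)

# Query parameters for _all_docs; key values must be JSON-encoded strings.
params = {"startkey": json.dumps(low), "endkey": json.dumps(high)}

# Local stand-in for the server-side range scan over sorted IDs.
ids = [
    "10042-20120430-235959",
    "10042-20120526-165835",
    "10043-20120510-120000",
]
in_range = [i for i in ids if low <= i <= high]
```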

Remember that IDs have to be unique, so if there is any chance that you’ll end up with two documents with the same ID, change the scheme so they are guaranteed to be unique. Milliseconds are not guaranteed to make unique IDs, particularly when you are replicating between more than one database or using BigCouch. If you have a few dozen inserts a second, you’ll end up with conflicts at some point. If you can guarantee that your ID will be unique using unique information from the record itself, that’s great. However, that’s not always the case, and in a distributed web application generating guaranteed sequences at the application level is not practical. So once I have all of the information I care about in the ID, I will often append a few random characters to the end. Milliseconds plus four or five semi-random characters is much less likely to generate collisions. The other approach is to accept that there might be collisions, have your application watch for document conflicts, and PUT the document again with a slightly different ID if you get one. As a practical matter that’s not a good idea if collisions are likely to happen often, but if they are possible but very unlikely it is often a good compromise.
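Both approaches can be combined, roughly as in the sketch below. It assumes a hypothetical `put(doc_id, doc)` callable that performs the HTTP PUT and returns the status code; CouchDB answers 409 Conflict when a document with that ID already exists. The in-memory `fake_put` is only there so the sketch runs without a server.

```python
import secrets
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def unique_id(prefix: str, suffix_len: int = 5) -> str:
    """Prefix, millisecond timestamp, and a short random suffix."""
    millis = int(time.time() * 1000)
    suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
    return f"{prefix}-{millis}-{suffix}"


def put_with_retry(put, prefix: str, doc: dict, attempts: int = 3) -> str:
    """Try fresh IDs until the (hypothetical) put callable stops reporting 409."""
    for _ in range(attempts):
        doc_id = unique_id(prefix)
        status = put(doc_id, doc)
        if status in (201, 202):  # created or accepted
            return doc_id
        if status != 409:  # anything but a conflict is a real error
            raise RuntimeError(f"unexpected status {status}")
    raise RuntimeError("could not find a free ID")


# In-memory stand-in for the database, so the sketch runs without a server.
store = {}


def fake_put(doc_id: str, doc: dict) -> int:
    if doc_id in store:
        return 409
    store[doc_id] = doc
    return 201


new_id = put_with_retry(fake_put, "10042", {"amount": 12.5})
```

Keeping the number of attempts small makes sense here: if collisions are frequent enough to exhaust a few retries, the ID scheme itself probably needs more randomness.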

Of course, all of this has to be weighed against the size of the ID. The ID is stored with every record. Make it as useful as possible, but at the same time avoid having a lot of extra stuff in it that you won’t need. Also, avoid redundancy. If information is in the ID, consider whether you can eliminate a field or two from the document body. If you have the account number and time in the transaction document’s ID, maybe you can remove those fields from the document itself and just split the ID after you retrieve it.
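Splitting the ID after retrieval might look something like this. The layout (account, date, time, optional random suffix, all dash-separated, timestamp in UTC) is an assumption based on the examples in this post, and it only works if the account number itself never contains a dash.

```python
from datetime import datetime, timezone


def parse_transaction_id(doc_id: str) -> tuple:
    """Recover (account, timestamp) from an ID like '10042-20120526-165835'."""
    # Any trailing parts, such as a random suffix, are ignored.
    account, date_part, time_part = doc_id.split("-")[:3]
    when = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return account, when.replace(tzinfo=timezone.utc)


# A hypothetical stored document with no separate account or time fields.
doc = {"_id": "10042-20120526-165835-x7k2q", "amount": 12.5}
account, when = parse_transaction_id(doc["_id"])
```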

The overall goal is to make the document IDs as useful as possible. Take advantage of the fact that the ID is stored with every record anyway, that it is always indexed, and that _all_docs queries operate just like a view on a simple key. With a little thought put into your database before you start adding records in volume, you can reduce your storage and maintenance resource requirements and improve your ability to query the data.
