Diff compressed JavaScript files

February 9, 2016

I periodically need to find the differences between two JavaScript files that have been run through something like uglifyjs.  There are a variety of ways to do this, but I haven’t found a solution that succinctly gives me what I’m looking for.

For a while I used git’s diff, but I found it cumbersome, and git isn’t always available.

Then for a while I used wdiff with colordiff.  That would look like this:

wdiff /path/file1.js /path/file2.js | colordiff

The problem with that is that the output is often really long: compressed JS doesn’t have many line feeds, wdiff doesn’t recognize non-word characters as word boundaries, and compressed JS often doesn’t have whitespace to delineate word boundaries, so the strength of wdiff is largely thwarted.

What I really wanted was to split the files into their smaller pieces for diffing.  My latest solution is a bash alias, which looks like this in my .bash_alias file:

alias diffjs='function _diffjs(){ diff -w <(uglifyjs "$1" -b) <(uglifyjs "$2" -b); };_diffjs'

It’s not perfect, but in many situations it gets me what I want, assuming of course that uglifyjs is available.

Categories: Uncategorized

Using mongodb text search with node.js

May 22, 2013

In my last post I talked about enabling mongodb’s beta text search, which at least to me was a little less than intuitive to accomplish. That’s probably partly because of the beta nature of this feature.

The next challenge was figuring out how to interact with the text search functionality from node.js, since interacting with it from an application that needs to provide search is the whole point. I’m sure that at some point the node.js native driver will support syntax specifically for searching, but at the moment it’s not there yet. This post assumes that text searches are enabled and you’ve added an index.

Before I show how I am accessing the text search feature, it is helpful to know how my modules are put together in general. At the top of each module I set up the mongo connections. In this post I’m going to use “articles” as my example collection. The setup for the db object looks something like this:

var db = {};
var mongo = require('mongodb').MongoClient;
mongo.connect(config.mongodb.articles.url, function(err, cdb){
    db.articles = cdb.collection('articles');
});

This leaves me with a db.articles object that provides access to the collection’s methods, including find, update, save, and so on. I would add each collection needed for the module to the db object in the same way. Unfortunately, the collection object doesn’t have a method for text searches. For that, I need access to the cdb object included in the callback to mongo.connect. To do that, I add the cdb object to my db object, which puts it in scope for the rest of my module.


var db = {};
var mongo = require('mongodb').MongoClient;
mongo.connect(config.mongodb.articles.url, function(err, cdb){
    db.articles = cdb.collection('articles');
    db.cdb = cdb;
});

Obviously, if you have a collection called cdb you’ll need to change this. You could also just assign the cdb object to its own variable in module scope.

Then I add a search method to my module. Simplified, it looks like this:

search: function(query, callback){
    db.cdb.command({text: "articles", search: query}, function(error, results){
        results = (results && results.results) ? results.results : [];
        callback(error, results);
    });
}

We want this method to return an array, not the object with the extra stuff mongodb wraps around the results. The conditional is there to prevent it from throwing errors if results comes back undefined. The real thing has extra logic, ACL filtering, and so on; this is stripped down to just show the text search.

The results passed to the callback in this method will be an array of objects, each of which has two fields: score and obj. obj holds the full document for each match.
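
For completeness, here is a minimal usage sketch, calling the search method from elsewhere in the app. The ./articles path and the require pattern are assumptions about how the module is exported; the point is just that the callback receives a plain array of score/obj pairs.

var articles = require('./articles');

articles.search('Lorem', function(error, results){
    if (error) { return console.error('search failed:', error); }
    results.forEach(function(result){
        // result.score is the relevance score, result.obj is the full matching document
        console.log(result.score, result.obj._id);
    });
});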

The extra steps shown here will go away when they add text searches to the driver, but for now this is a fairly functional approach. I hope it saves someone the extra time it took me to sort this out.

Categories: mongodb, node.js

MongoDb text search

May 22, 2013

Full text search in NoSQL databases is far less common than one would think. Most apps I build can benefit from full text searches, even if they don’t need sophisticated search capabilities. There are external solutions for most databases, mostly tying in Lucene through Elasticsearch or Solr. Sometimes those external solutions are just the way you need to go, and I’ve used external Lucene integration with CouchDb before. But I was glad when I saw that text searches are included in MongoDb 2.4, at least as a beta.

The main catch in my testing so far was that I had a hard time figuring out how to enable the feature. Like many people (I expect), I’m using the Ubuntu packages, so I needed to figure out how to enable it in /etc/mongodb.conf. The documentation shows how to enable text search on the command used to start mongo, and mentions that you can put the setting in the config file, but it doesn’t show the config file syntax.

This doesn’t work:
textSearchEnabled=true

You end up with a response that says
error command line: unknown option textSearchEnabled

This is the syntax to put in the config file instead:
setParameter=textSearchEnabled=true

Once I added that, the feature was enabled. In the mongodb console I was then able to add my initial index like this:
db.content.ensureIndex( { title:'text', body: 'text' });

and search it like this:
db.content.runCommand("text", {search:'Lorem'})

This returns an object containing an array called results, with one entry for each document that matched the text in either the title or body field. Each entry is in turn an object with a score and the matching document. The response also includes a stats object that tells how many documents were found and how long the search took.
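
To work with just the matching documents in the shell, you can pull them out of that results array. A minimal sketch (the score cutoff is arbitrary, just to show that the score is there to filter or sort on):

var response = db.content.runCommand("text", { search: 'Lorem' });
var docs = response.results.map(function(r){ return r.obj; });            // just the matched documents
var better = response.results.filter(function(r){ return r.score > 1; }); // arbitrary relevance cutoff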

Overall, this feature is very promising. While it doesn’t appear as strong in its search capabilities as the Lucene solutions, having it directly available in MongoDb itself is a big win for deploying solutions for customers quickly.

Next up: interacting with the text search functionality through the native node.js driver.

Categories: mongodb

Can I see your ID please? The importance of CouchDb record IDs

May 26, 2012

One of the things I largely underestimated when I first started working with CouchDb was the importance of a meaningful ID for documents. The default of letting the database set a UUID for you seems reasonable at first, but the UUID is largely useless for querying. A good ID scheme, on the other hand, is like a free index you don’t have to maintain a view for: you can query it through the built-in _all_docs view. Well thought out IDs can save you tons of headaches and fiddling with views later. Particularly with large data sets, this can be a big deal, because views can take up a significant amount of storage space and processing. Unfortunately, IDs are hard to change after you get a lot of data into the database, so it is worth thinking about them before you get very far.

There are a handful of primary considerations when designing your document IDs. In most data sets, there is a key piece of information that most of your searches are based on: account or customer records are usually looked up by ID, and customer transactions are usually retrieved by account and date range. That’s not to say all queries are based on these data elements, but they tend to cover a significant majority of queries. Since documents are automatically indexed by ID, a good candidate for the ID is whatever information you most often search on.

Another important consideration when designing views and IDs in CouchDb is storage space and view size. A view that stores several pieces of information from every document can double the size of the overall data required, and even more space is needed for maintenance operations like compacting. If you need a particular view that includes several pieces of data from the document, consider designing your IDs to replace the need for the view. In fact, getting rid of as many views as you can is a worthwhile goal. Views are powerful and useful, but unnecessary views can consume huge amounts of extra space and processing to maintain.

A third consideration for IDs is sequence. Documents are sorted by ID, so it is often worth including a timestamp as part of the ID. For example, for transaction documents the account plus a timestamp often makes a good ID, which automatically sorts the transactions within an account by time. In fact, sometimes the primary factor in looking up documents is time, and in those cases it might be a good practice to start the ID with a timestamp. How to format the time depends on your use; do it in the way closest to how you will retrieve the data. That might be a standard JavaScript timestamp (1338051515556), or a date/time in a format like YYYYMMDDHHMMSS (something like 20120526165835). Remember the point here is a useful sort order, so any date and time format that will result in the desired left-to-right string sort order is what you want. When you’re using timestamps, remember that IDs have to be unique; milliseconds are not a guaranteed unique identifier for a web application. More on that in a moment.
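
As a concrete illustration, here is a minimal JavaScript sketch of the second format (the function name and the account prefix are mine); the zero padding is what makes the string sort correctly left to right:

function sortableTimestamp(d) {
    d = d || new Date();
    function pad(n) { return (n < 10 ? '0' : '') + n; }
    return '' + d.getUTCFullYear() + pad(d.getUTCMonth() + 1) + pad(d.getUTCDate()) +
        pad(d.getUTCHours()) + pad(d.getUTCMinutes()) + pad(d.getUTCSeconds());
}
// e.g. 'ACCT1234-' + sortableTimestamp() -> "ACCT1234-20120526165835"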

Sequence is also important if you have multiple types of documents that you need to retrieve together. For example, blog posts and comments are often retrieved at the same time, so it often makes sense to have the ID of each comment start with the ID of the post it relates to. That way, you can easily retrieve the blog post together with all of the documents that sort between that post and the next one.
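
Here is a minimal node.js sketch of that kind of prefix query against _all_docs. The database name, port, and ID scheme are hypothetical; the high Unicode character \ufff0 in the endkey acts as an "everything starting with this prefix" marker, and the keys have to be JSON-encoded strings.

var http = require('http');

// Fetch a post plus everything whose ID starts with the post's ID (e.g. its comments).
function getPostWithComments(postId, callback) {
    var query = '?include_docs=true' +
        '&startkey=' + encodeURIComponent(JSON.stringify(postId)) +
        '&endkey=' + encodeURIComponent(JSON.stringify(postId + '\ufff0'));
    http.get({ host: 'localhost', port: 5984, path: '/blog/_all_docs' + query }, function(res){
        var body = '';
        res.on('data', function(chunk){ body += chunk; });
        res.on('end', function(){ callback(null, JSON.parse(body).rows); });
    }).on('error', callback);
}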

A fourth consideration is to make the ID itself carry enough data for at least some queries. An _all_docs lookup without include_docs returns just the ID and the revision. If the ID alone is all you need for a significant number of queries, you can reduce the data you move over the wire for those queries.

In CouchDb, people often store documents of multiple types together in the same database. I already mentioned blog posts and comments. In some cases, you almost always look for only one type of document at a time; in that case, it makes sense to start the ID with an indicator of the type. Alternatively, you might have several types of documents that all relate to a parent or master document (again, blog post comments are one example), and in those cases it might make sense for the secondary documents’ IDs to start with the master document’s ID, followed by a type indicator, and then a unique identifier within that type. Usually one or two characters is enough for the type indicator.

IDs often end up being several pieces of information appended together. Maintain readability by adding a separator that will also be useful in queries; often a dash is a good choice. For example, if a transaction ID is an account number plus a timestamp, I typically put a dash between the account and the time. On the other hand, keeping the ID short saves space, so don’t add a lot of extra stuff to it. The minimum of separators needed to make the result readable is a good goal. So, for example, if you’re using timestamps, keep them to the meaningful digits rather than including colons and time zone information.

Remember that IDs have to be unique, so if there is any chance that you’ll end up with two documents with the same ID, change the scheme so they are guaranteed to be unique. Milliseconds are not guaranteed to make unique IDs, particularly when you have more than one database being replicated or are using BigCouch. If you have a few dozen inserts a second you’ll end up with conflicts at some point. If you can guarantee that your ID will be unique using information from the record itself, that’s great, but that’s not always the case, and in a distributed web application generating guaranteed sequences at the application level is not practical. So once I get all of the information I care about into the ID, I will often append a few random characters to the end. Milliseconds plus four or five semi-random characters is much less likely to generate collisions. The other approach is to accept that there might be collisions, have your application watch for document conflicts, and PUT the document again with a slightly different ID if you hit one. As a practical matter that’s not a good idea if collisions are likely to happen often, but if they are possible yet very unlikely it is often a good compromise.
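
A minimal sketch of that approach (the format is just an example, and Math.random is only pseudo-random, which is fine here since the goal is merely avoiding accidental collisions rather than security):

function makeTransactionId(account) {
    var rand = Math.random().toString(36).slice(2, 7); // up to five pseudo-random characters
    return account + '-' + Date.now() + '-' + rand;
}
// e.g. makeTransactionId('ACCT1234') -> something like "ACCT1234-1338051515556-k3x9p"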

Of course, all of this has to be weighed against the size of the ID. The ID is stored with every record. Make it as useful as possible, but at the same time avoid having a lot of extra stuff in it that you won’t need. Also, avoid redundancy. If information is in the ID, consider whether you can eliminate a field or two from the document body. If you have the account number and time in the transaction document’s ID, maybe you can remove those fields from the document itself and just split the ID after you retrieve it.
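
With an ID scheme like account-timestamp-random, recovering those fields after retrieval is trivial; a sketch matching the hypothetical format above:

function splitTransactionId(id) {
    var parts = id.split('-');   // e.g. "ACCT1234-1338051515556-k3x9p"
    return { account: parts[0], timestamp: parseInt(parts[1], 10) };
}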

The overall goal is to make the document IDs as useful as possible, taking advantage of the fact that the ID is stored with every record anyway, is always indexed, and can be queried through _all_docs just like a view on a simple key. With a little thought put into your ID scheme before you start adding records in volume, you can reduce your storage and maintenance requirements and optimize your ability to query the data.

Categories: couchDB

Finding the hostname in node.js

May 24, 2012

This is a simple thing, but something I tend to forget and have to go find again. So here it is for later…

To get the OS hostname, use the os module:
var hostname = require('os').hostname();
To get just the first piece of the name:
var hostname = require('os').hostname().split('.').shift();

Of course, only do it that way if you don’t otherwise need the os module. Otherwise, require the os module once, assign it to a variable, and reuse it without calling require each time.

To get the hostname for the current request, look in
request.headers.host
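
A minimal sketch putting the two together (the port is arbitrary): os.hostname() describes the machine the process runs on, while request.headers.host is whatever host the client asked for.

var os = require('os');
var http = require('http');

http.createServer(function(request, response){
    response.end('Server: ' + os.hostname() + '\n' +
        'Requested host: ' + request.headers.host + '\n');
}).listen(8080);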

Categories: node.js

Am I done with Ubuntu?

October 21, 2011

I made the mistake of hitting the Upgrade button in Ubuntu’s update manager on my main development box the other day when it asked me if I really was going to go another day without Oneiric, and within a fairly short time I had an unbootable Ubuntu box. Usually Ubuntu upgrades are fairly smooth; this one was bad. For a little bit of context, I have been using Linux for a long time. I started with RedHat, then wandered through Fedora when it appeared, Gentoo, Suse, OpenSuse, CentOS, ClearOS, and Debian, but for the past while I’ve been using Ubuntu. For years I used and advocated KDE until version 4, at which point (about the same time I switched to Ubuntu) I moved to Gnome. I have been using XFCE for a few months, and I’m basically done with both KDE and Gnome for now. So far I haven’t lasted more than a few minutes on Unity before I get completely disgusted and change to something else. Somebody told me this week I’m just one of those old grumpy Linux guys.

I didn’t spend a lot of time figuring out what went wrong with the Ubuntu upgrade. Instead, I downloaded a few updated versions and put them on a thumb drive, and tried out some variations on the setup I’ve been using for a while. I installed Mint 11, Mint’s Debian XFCE version, and Xubuntu. As a side note, why aren’t there any really good tools to make bootable live Linux installs on USB for Linux? Most of the directions on the web say to do it on Windows. Blech. I ended up using unetbootin, which works.

My brief reaction to each of the three installs? Mint 11 has all the advantages of Ubuntu, except that it’s currently a version behind and has mintier branding. The main reason I’d use it is if I wanted to stick with Gnome, which would be a possibility if it weren’t for the fact that I’m really liking XFCE. So, Mint 11 isn’t in my immediate future.

The Debian version of Mint is somewhat enticing. I like the idea of rolling updates. I’m not a huge fan of straight Debian, for no reason other than I have a kneejerk ideological reaction to software that is too ideological. Software should be practical. Debian is from a planet I’ve only ever visited for short periods. Yes, I know Ubuntu is Debian based, but it is suitably commercialized. Odd position for a Linux fan to take, isn’t it? I am fairly sure I’m not alone. All that being said, I could see myself using and liking this distro, certainly over the vanilla Mint 11 Gnome version.

Xubuntu works reasonably well. The new Ubuntu Software Center stinks. What happened to options and the ability to configure stuff? It’s pretty, but gutted. That’s basically my reaction to the direction Ubuntu is going generally. Ah, for the good old days when all the configurations were in bash and lisp files.

My first step on all three installations, after changing them so the focus follows the mouse properly, was to try to compile CouchDb 1.1. It failed on all three. There seems to be a mismatch between compiler versions and what CouchDb’s configure is expecting. I haven’t taken the time yet to figure out what the problem is. At this point I mostly just need to get on with my coding. The CouchDb binary package available on these distros is out of date. For my purpose on this dev box, it doesn’t matter enough to spend time on it. However, I will need to sort this issue out at some point. By contrast, node.js compiled easily on all three.

For now, I’ll probably use Xubuntu. When I have more time on my hands, I’ll likely wander off into a search for a different distro and move out of the Ubuntu family again. I’ll need to do something with my laptop (the machine I actually work on), which is a lightweight Acer currently running Ubuntu 11.04 with XFCE. I’m open to suggestions, but I guess I’m not in much of a hurry. None of the recent installs on my dev box were exciting enough to make me want to spend more time on it. And for someone who’s spent way too many hours over the past fifteen years or so distro hopping just for fun, that’s too bad.

Categories: linux

Launching a whole new thing

August 11, 2011

For the past 15 years I’ve worked in non-profits (missions and humanitarian relief organizations) to enable them to accomplish their ministry objectives through the strategic application of technology. Early on, that consisted largely of systems work, with administration and system design as the emphasis. Later it shifted almost entirely to development and application architecture, but still with a strain of making sure services are always available for their intended use.

Now I’m entering a new chapter of my career, with the launch of NodePing, a partnership with my good friend Shawn Parrish. For me (this is my blog after all), that has meant something of a return to activities I haven’t focused on in quite a while: business plans and contracts. Happily though, most of my time in pulling this together has been in code, working with Node.js, Redis, CouchDb, and jQuery. That part has been a lot of fun, and there’s still a lot more to do.

The new service brings together a lot of the things I have worked on over the course of my career. I have actually used my education, which is a change from my normal work. It also leans on all the years I’ve been responsible for system administration departments and for ensuring services are available. And of course, it has included lots of system and software architecture and coding, by far the funnest parts.

Some posts that might otherwise appear on this blog about Javascript and systems for hosting Node.js applications will likely appear on the NodePing blog.

Huge amounts of work to do from here. I’m excited.

Categories: work and career