Monday, July 26, 2010

Learning Scala from Dead Swiss Mathematicians : Palindromes

As part of my effort to get up to speed with Scala I returned to an old favorite for exploring new languages; working through some of the problems at Project Euler. Posting complete answers to specific problems isn't terribly interesting and I have no plans to do so here. That said, some facets of these problems do lead to useful digressions which allow for a deeper exploration of the language. We'll dig into one such example here.

The problem under consideration is Problem 4 which asks for "the largest palindrome made from the product of two 3-digit numbers". It doesn't take much imagination to determine that part of the solution will involve a method, most likely a function of some kind, for identifying whether a number is in fact a palindrome. Okay, but what would such a method look like? The problem lends itself to a simple recursive implementation: divide the integer into a leading digit, a middle integer and a trailing digit and return false if the leading and trailing digits differ, otherwise return the result of the a recursive call with the middle integer.

Easy enough, but a moments reflection tells us we have a problem; we can access the trailing digit in an integer using the mod function and the base of the input integer but there's no easy way to access the leading digit. It's not as if Int (or even RichInt) extend Seq[Int] or even Seq[Byte]. Fortunately we can shift the problem domain by observing that an input integer is equivalent to an input String in which each digit of the integer maps to a Char in the String. Even better, Scala offers built-in support for regular expressions, and a fairly simple regex should give us access to both the leading and trailing characters and everything in the middle. So, how does Scala's regex support stack up?


// Pattern to be used by the regex-based predicate.
// Absolutely must use conservative matching and
// the backref here to make this work.
val palindromePattern = """(\d)(\d*?)\1""".r

// Recursive helper function to check for a
//palindrome using a regex
def byRegexHelper(n:String):Boolean = {

// Base case; empty string and single characters
// are by definition palindromes. Place this
// test up front so that we can handle input values
// of a single character.
if (n.length == 0 || n.length == 1)
return true
palindromePattern.unapplySeq(n) match {

case Some(matches) => byRegexHelper(matches(1))
case None => false
}
}

// Regex-based predicate; convert to a string and call
// the recrusive function
def byRegex(arg:Int):Boolean = byRegexHelper(arg.toString)


Not bad, actually. It's not Perl or Ruby but the deep support for pattern matching in the language combined with relatively easy generation of a Regex from a String makes for a fairly clean syntax. Regex literals would be a small improvement but this is still cleaner than what one finds in Java or even Python.

So we've solved the problem, but can we do better? Do we really need the regex? Strings (or the StringOps they're implicitly converted to in 2.8) make use of the SeqLike[Char] trait allowing easy access to the leading and trailing characters using simple method calls. This leads to the following implementation which avoids the overhead of evaluating the regular expression:


// Recursive helper function to perform the check
// based on comparisons of the head and last
// characters in a string
def byStringHelper(n:String):Boolean = {

// Base case; empty string and single characters
// are by definition palindromes. Place this test
// up front so that we can handle input values of
// a single character.
if (n.length == 0 || n.length == 1)
return true
if (n.head != n.last)
false
else
byStringHelper(n.substring(1,n.length - 1))
}

// String-based predicate; convert to string and call
// the recursive function
def byString(arg:Int):Boolean = byStringHelper(arg.toString)


Also not bad, but still not completely satisfying. Moving away from the use of regular expressions minimizes the benefit of mapping the input integer onto a String in order to solve the problem. Recall that the primary argument for doing so was the inability to access leading digits in an input integer. How significant is this constraint? Is it something we can work around? More on this next time.

Saturday, July 3, 2010

Joins in CouchDB: How not to do it

My original motivation for experimenting with the dispatch library was an attempt to leverage my familiarity with a general technology (RESTful Web services) to gather experience with a pair of tools (CouchDB, Scala) which were less familiar to me. This turned out to be a spectacularly dumb idea, largely due to a problem that should have been clear at the outset: it's difficult to leverage general experience when you don't have significant experience with any of the tools involved. A better approach might be to use a more familiar tool (Python perhaps) to monkey around with CouchDB and do a bit of woodshedding with Scala on the side. Sounds sensible enough... so what now?

One of the facets of CouchDB that has always been unclear to me is how one might join two or more databases together. A bit of context first: my general understanding of a CouchDB database was that it was an unordered collection of documents of some common type. Within a database views could be used to return documents in a certain order them, select certain subsets of the documents or transform some set of documents into something else entirely. If we were to attempt to relate this model to a SQL database [1] then a CouchDB database is equivalent to a table while a view is equivalent to a select statement on that table.

This model leads to a simple question: if I have database A for documents of type A and database B for documents of type B how can I relate documents in A to documents in B? The natural answer would be some kind of view but views in CouchDB are defined as documents within a given database and there is no notion of a cross-database or global view.

Let's try and make this a bit more concrete. We'll store a set of data gathered from Twitter (using the brilliant Python Twitter Tools package) in a CouchDB instance and interrogate that data in an attempt to answer a few questions. We're interested in the following set of resources:


  1. The first 100 tweets corresponding to a search term

  2. Information about the set of authors for these tweets

  3. The set of followers for each of these authors



Information about these resources will be stored in databases named (respectively) tweets,authors and followers. Once we've obtained this data we'd like to get answers to the following questions:


  1. For each author: what language do they write their tweets in and how many followers do they have?

  2. How many tweets did each author write?

  3. For a given tweet author foo how many tweet authors are also followers of foo?



We'll use the nice folks at Shot of Jaq as our reference point by searching for tweets containing #shotofjaq.

With our multi-database model we can easily answer the first two questions; the first can be answered with a simple map function applied to the authors database and the second with a map and reduce function applied to the tweets database. The third question poses more of a problem. A correct answer requires data from both the authors and followers database and there is simply no way to access both databases using the standard map-reduce functionality. Our model appears to be broken.

Complete code for this example (including a simple Python script to submit temporary views to a CouchDB database) can be found at Github. Note that this repository also includes code for the correct single-database solution to this problem, so no peeking until the next post goes up!

[1] Yes, the CouchDB folks go to great lengths to point out that this is a silly thing to do. I employ this comparison only to illuminate the question at hand.