This article is based on Mahout in Action, to be published on June, 2011. It is being reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program,) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi and epub. When mobile formats become available all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]
The quality of recommendations is largely determined by the quantity and quality of data. Garbage in, garbage out was never truer than here. Likewise, recommender algorithms are data intensive, and runtime performance is greatly affected by quantity and representation of data. This article explores classes in Mahout for representing and accessing recommender-related data.
The input to a recommender engine is preference data—who likes what and how much. So, the input to Mahout recommenders is simply a set consisting of user ID, item ID, and preference value tuples—a large set, of course. Sometimes, even preference values are omitted.
The Preference object
A Preference is the most basic abstraction, representing a single user ID, item ID, and a preference value. One object represents one user’s preference for one item. Preference is an interface, and the implementation you’re most likely to use is GenericPreference. For example, the following code creates a representation of user 123’s preference value of 3.0 for item 456:
[code lang=”java”]new GenericPreference(123, 456, 3.0f).[/code]
How is a set of Preferences represented? If you gave reasonable answers like Collectionor Preference, you’d be wrong in most cases in the Mahout APIs. Collections and arrays turn out to be quite inefficient for representing large numbers of Preference objects. If you’ve never investigated the overhead of an Object in Java, prepare to be shocked!
A single GenericPreference contains 20 bytes of useful data: an 8-byte user ID (Java long), an 8-byte item ID (long), and a 4-byte preference value (float). The object entails a startling amount of overhead: 28 bytes! The actual amount of overhead varies depending on the JVM’s implementation; this figure was taken from Apple’s 64-bit Java 6 VM for Mac OS X 10.6. This includes an 8-byte reference to the object and, due to the Object overhead and other alignment issues, additional 20 bytes of space within the representation of the object itself. Hence, a GenericPreference object already consumes 140 percent more memory than it needs to, just due to the overhead.
What can be done? In the recommender algorithms, it is common to need a collection of all preferences associated to one user or one item. In such a collection, the user ID or item ID will be identical for all Preference objects, which seems redundant.
PreferenceArray and implementations
Enter PreferenceArray, an interface whose implementations represent a collection of preferences with an array-like API. For example, GenericUserPreferenceArray represents all preferences associated with one user.
Internally, it maintains a single user ID, an array of item IDs, and an array of preference values. The marginal memory required per preference in this representation then is only 12 bytes (one more 8-byte item ID and 4-byte preference value in an array). Compare this to the approximately 48 bytes required for a full Preference object. The four-fold memory savings alone justifies this special implementation, but it also provides a small performance win because far fewer objects must be allocated and examined by the garbage collector. Compare figures 1 and 2 to understand how the savings are accomplished.
Listing 1 Setting preference values in a PreferenceArray
[code lang=”java”]PreferenceArray user1Prefs = new GenericUserPreferenceArray(2);
user1Prefs.setUserID(0, 1L); A
user1Prefs.setValue(0, 2.0f); B
user1Prefs.setValue(1, 3.0f); C
Preference pref = user1Prefs.get(1); D
A Sets user ID for all preferences
B User 1 expresses preference 2.0 for item 101 now
C User 1 expresses preference 3.0 for 102
D Materializes a Preference for item 102[/code]
There is also an implementation called GenericItemPreferenceArray, which encapsulates all preferences associated to an item rather than a user. Its purpose and usage are entirely analogous.
We’ve looked at how preference data is represented in a Mahout recommender. This includes Preference objects, but also specialized array and collection-like implementations like PreferenceArray. These specializations exist largely to reduce memory usage.