This article is based on Tika in Action, to be published in Summer 2011. It is reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi, and epub. When mobile formats become available, all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]
The Standards of Metadata
Introduction
While PDF file properties and HTML page properties are useful for making decisions such as, “Do I care to read this research paper?” or “Is this web page the one I was looking for?”, the property names themselves don’t tell you everything that you need to know in order to make use of them. For example, is “PDFVersion” an integer or an alphanumeric value? This would be useful to know because it would allow you to compare different “PDFVersion” attributes. What about “Author”? Is it multivalued, meaning that a paper can have multiple authors, or is it only single-valued?
To answer these questions, we usually turn to metadata standards or metadata models. Standards describe all sorts of information about metadata, such as the cardinality of fields, relationships between fields, valid values and ranges, and field definitions, to name a few. Some representative properties of metadata standards are given in table 1.
The International Organization for Standardization (ISO) has published a reference standard for the description of metadata elements as part of metadata models, numbered ISO/IEC 11179. ISO/IEC 11179 prescribes a generally accepted mechanism for defining metadata models. There are tons of metadata models out there, and they can be loosely classified as either general models or content-specific models, as depicted in figure 1.
Dublin Core is a general metadata model consisting of fewer than 20 attributes (such as Creator, Publisher, and Format) that are said to describe any electronic resource. On the other side of the coin are content-specific models, which are unique to a particular file type and contain only the metadata elements and descriptions relevant to that content type. Examples include the Federal Geographic Data Committee (FGDC) model for describing spatial data files, and Adobe XMP, a metadata standard for media files (like images and videos).
Tika supports both general and content-specific metadata standards. You can get a list of the standard metadata models supported by your version of Tika via the --list-met-models option provided by the Tika command-line interface.
java -jar tika-app-0.8-SNAPSHOT.jar --list-met-models
Or, you can print the same list programmatically by calling the TikaCLI from a Java program.
The version of Tika at the time of writing this book (0.8-SNAPSHOT) supports the following metadata models, as output by the DescribeMetadata tool. Each metadata model name is shown without a left indent, and the model’s associated metadata attributes are capitalized and indented one space from the left.
ClimateForecast
 ACKNOWLEDGEMENT
 COMMAND_LINE
 COMMENT
 CONTACT
 CONVENTIONS
 EXPERIMENT_ID
 HISTORY
 INSTITUTION
 MODEL_NAME_ENGLISH
 PROGRAM_ID
 PROJECT_ID
 REALIZATION
 REFERENCES
 SOURCE
 TABLE_ID
CreativeCommons
 LICENSE_LOCATION
 LICENSE_URL
 WORK_TYPE
DublinCore
 CONTRIBUTOR
 COVERAGE
 CREATOR
 DATE
 DESCRIPTION
 FORMAT
 IDENTIFIER
 LANGUAGE
 MODIFIED
 PUBLISHER
 RELATION
 RIGHTS
 SOURCE
 SUBJECT
 TITLE
 TYPE
Geographic
 ALTITUDE
 LATITUDE
 LONGITUDE
HttpHeaders
 CONTENT_DISPOSITION
 CONTENT_ENCODING
 CONTENT_LANGUAGE
 CONTENT_LENGTH
 CONTENT_LOCATION
 CONTENT_MD5
 CONTENT_TYPE
 LAST_MODIFIED
 LOCATION
MSOffice
 APPLICATION_NAME
 APPLICATION_VERSION
 AUTHOR
 CATEGORY
 CHARACTER_COUNT
 CHARACTER_COUNT_WITH_SPACES
 COMMENTS
 COMPANY
 CONTENT_STATUS
 CREATION_DATE
 EDIT_TIME
 KEYWORDS
 LAST_AUTHOR
 LAST_PRINTED
 LAST_SAVED
 LINE_COUNT
 MANAGER
 NOTES
 PAGE_COUNT
 PARAGRAPH_COUNT
 PRESENTATION_FORMAT
 REVISION_NUMBER
 SECURITY
 SLIDE_COUNT
 TEMPLATE
 TOTAL_TIME
 VERSION
 WORD_COUNT
Message
 MESSAGE_BCC
 MESSAGE_CC
 MESSAGE_FROM
 MESSAGE_RECIPIENT_ADDRESS
 MESSAGE_TO
TIFF
 BITS_PER_SAMPLE
 EQUIPMENT_MAKE
 EQUIPMENT_MODEL
 EXPOSURE_TIME
 FLASH_FIRED
 FOCAL_LENGTH
 F_NUMBER
 IMAGE_LENGTH
 IMAGE_WIDTH
 ISO_SPEED_RATINGS
 ORIENTATION
 ORIGINAL_DATE
 RESOLUTION_HORIZONTAL
 RESOLUTION_UNIT
 RESOLUTION_VERTICAL
 SAMPLES_PER_PIXEL
 SOFTWARE
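To make the listing format concrete (model name flush left, attribute names indented one space), here is a minimal sketch that produces the same layout with plain Java reflection. It assumes, purely for illustration, that each metadata model is declared as an interface of String constants. The two small interfaces, their constant values, and the class name ListMetModelsSketch are hypothetical stand-ins and are not Tika's own classes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ListMetModelsSketch {

    // Hypothetical stand-in for a metadata model (values are made up).
    public interface Geographic {
        String ALTITUDE = "alt";
        String LATITUDE = "lat";
        String LONGITUDE = "long";
    }

    // Second hypothetical model, again with made-up values.
    public interface CreativeCommons {
        String LICENSE_LOCATION = "license-location";
        String LICENSE_URL = "license-url";
        String WORK_TYPE = "work-type";
    }

    // Returns one line per model name, followed by its constants indented one space.
    public static List<String> describe(Class<?>... models) {
        List<String> lines = new ArrayList<>();
        for (Class<?> model : models) {
            lines.add(model.getSimpleName());
            List<String> names = new ArrayList<>();
            for (Field field : model.getDeclaredFields()) {
                // Interface constants are implicitly static; skip compiler-generated fields.
                if (Modifier.isStatic(field.getModifiers()) && !field.isSynthetic()) {
                    names.add(field.getName());
                }
            }
            // Reflection does not guarantee field order, so sort for stable output.
            Collections.sort(names);
            for (String name : names) {
                lines.add(" " + name);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describe(CreativeCommons.class, Geographic.class)) {
            System.out.println(line);
        }
    }
}
```

Sorting the names is a design choice of this sketch: it keeps the output deterministic, which also matches the alphabetical ordering visible in the listing above.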
Now that you know which metadata models Tika supports and that there is a difference between the models (in other words, not all are created equal!), let’s explore more precisely the variations between generic and content-specific metadata models. We’ll use Tika to help us out.
General standards
Most electronic files available via the Internet share a common set of metadata properties, which together make up what we call general metadata models or general standards for metadata. General models describe electronic resources at a high level, answering questions such as “Who authored the content?” and “What format(s) is the content represented in?”
To illustrate, let’s take a look at the attributes of the Dublin Core metadata model, as output by Tika, which supports Dublin Core. Recall the command we showed you above.
java -jar tika-app-0.8-SNAPSHOT.jar --list-met-models
By piping the output of the above command through a simple grep command, we can filter the --list-met-models output to isolate only the Dublin Core part.
java -jar tika-app-0.8-SNAPSHOT.jar --list-met-models | grep -A16 DublinCore
Which produces the output:
DublinCore
 CONTRIBUTOR
 COVERAGE
 CREATOR
 DATE
 DESCRIPTION
 FORMAT
 IDENTIFIER
 LANGUAGE
 MODIFIED
 PUBLISHER
 RELATION
 RIGHTS
 SOURCE
 SUBJECT
 TITLE
 TYPE
Looking at some of these attributes, it’s pretty clear that most or all of them are highly representative of all electronic documents. Think back to table 1. What would the valid values be for something like the FORMAT attribute? Most of the time the metadata field is filled with a valid MIME media type. What would the cardinality be for something like the DATE attribute? Often, electronic documents have a single creation date, but perhaps many last modified dates, so the cardinality is one or more values.
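The two questions above (valid values and cardinality) can be captured in a small field descriptor. The following is a minimal sketch, not part of Tika: the class FieldSpecSketch, the cardinality bounds chosen for FORMAT and DATE, and the simplified regular expressions for MIME types and dates are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class FieldSpecSketch {

    public static final int UNBOUNDED = Integer.MAX_VALUE;

    // Assumption: FORMAT holds exactly one value shaped like type/subtype (simplified check).
    public static final FieldSpecSketch FORMAT =
            new FieldSpecSketch("FORMAT", 1, 1, "[\\w.+-]+/[\\w.+-]+");

    // Assumption: DATE holds one or more values shaped like yyyy-MM-dd (simplified check).
    public static final FieldSpecSketch DATE =
            new FieldSpecSketch("DATE", 1, UNBOUNDED, "\\d{4}-\\d{2}-\\d{2}");

    private final String name;
    private final int minOccurs;
    private final int maxOccurs;
    private final Pattern validValue;

    public FieldSpecSketch(String name, int minOccurs, int maxOccurs, String validValueRegex) {
        this.name = name;
        this.minOccurs = minOccurs;
        this.maxOccurs = maxOccurs;
        this.validValue = Pattern.compile(validValueRegex);
    }

    // True when the number of values fits the cardinality and every value is well formed.
    public boolean accepts(List<String> values) {
        if (values.size() < minOccurs || values.size() > maxOccurs) {
            return false;
        }
        for (String value : values) {
            if (!validValue.matcher(value).matches()) {
                return false;
            }
        }
        return true;
    }

    public String getName() {
        return name;
    }

    public static void main(String[] args) {
        System.out.println(FORMAT.getName() + " ok: "
                + FORMAT.accepts(Arrays.asList("application/pdf")));
        System.out.println(DATE.getName() + " ok: "
                + DATE.accepts(Arrays.asList("2010-01-01", "2010-06-15")));
    }
}
```

A descriptor like this is what lets software reject a FORMAT value that is not a media type, or accept several DATE values on the same document.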
Let’s now focus in on content-specific metadata models.
Content-specific metadata standards
Generic metadata standards and models are great because they address two fundamentally important facets of capturing and using metadata:
- Filling in at least some value per field: Because they’re generic, general metadata standards provide at least some value for each field (for example, FORMAT and TITLE in the case of Dublin Core, whose values for a PDF file might be application/pdf and mypdffile.pdf, respectively).
- Easily comparable: Because each field usually has some value, and because the attributes themselves are so general, they are more likely to mean the same thing from one document to the next. (It’s pretty clear what a TITLE is referring to for a document.)
On the other hand, content-specific metadata standards and models are less likely to fulfill either of the above properties. First, they aren’t guaranteed to fill in any values of any of their particular fields. Take MS Office files and their field, COMPANY, derived from the same grep trickery we showed you above:
java -jar target/tika-app-0.8-SNAPSHOT.jar --list-met-models | grep -A28 MSOffice
MSOffice
 APPLICATION_NAME
 APPLICATION_VERSION
 AUTHOR
 CATEGORY
 CHARACTER_COUNT
 CHARACTER_COUNT_WITH_SPACES
 COMMENTS
 COMPANY
 CONTENT_STATUS
 CREATION_DATE
 EDIT_TIME
 KEYWORDS
 LAST_AUTHOR
 LAST_PRINTED
 LAST_SAVED
 LINE_COUNT
 MANAGER
 NOTES
 PAGE_COUNT
 PARAGRAPH_COUNT
 PRESENTATION_FORMAT
 REVISION_NUMBER
 SECURITY
 SLIDE_COUNT
 TEMPLATE
 TOTAL_TIME
 VERSION
 WORD_COUNT
COMPANY is only filled out in an MS Word file’s metadata attributes when the Company name has been entered by the user or owner of the MS Office suite installed on the computer that created the MS Word file. So, in short, if you didn’t fill out the Company field when registering your MS Office, and you begin sharing MS Word files with your other software colleagues, and they want to use Tika to find out what company you work for, they are out of luck. (For privacy-minded people, though, this is a good thing!)
As for being easily comparable, this is another area where content-specific metadata models do not provide a silver bullet. The LAST_MODIFIED field in the HttpHeaders metadata model does not correspond directly to the MODIFIED field in the DublinCore model, nor does it correspond to the LAST_SAVED field in the MSOffice metadata model. So, content-specific metadata model attributes are not easily comparable across metadata models.
Most document formats have a content-specific or file-specific metadata model associated with them (even in the presence of a general model, like Dublin Core). There is the eXtensible Metadata Platform (XMP), pioneered by Adobe for media file formats (images, videos, and so on); there’s a whole slew of MS Office metadata formats; there are metadata models for JPEG files; there’s the Climate Forecast metadata model for climate-related science files; and there are other metadata formats (like FITS) for science files in the astrophysics community. The good news is this: Tika already supports a slew of existing content-specific metadata models, and where it doesn’t support one, it’s extensible and allows you to add your own metadata models and attributes/specifications that you can leverage in your own content-specific applications.
We’ll tell you a little bit about metadata quality and how it influences all sorts of things like comparing, understanding, and validating metadata.
Metadata quality
The biggest thing we’ve glossed over while informing you about the wonders of metadata until now is, “So, how does that metadata get populated?” This is a great question. There are plenty of ways. The application program that generates a particular file (for example, MS Office generates Word documents, PowerPoint files, and so on, and Adobe Photoshop generates PDF files) can be responsible for annotating a file with metadata.
An alternative is that a user may explicitly fill out metadata about the file on their own when authoring it. Many Software Project Management tools (like MS Project or Fastrack) prompt a user to fill out basic metadata fields (Title, Duration, Start Project Date, End Project Date, and so on) when authoring the file.
Sometimes, downstream software programs author metadata about files. A classic example of this is when a web server returns metadata about the file content it is delivering back to a user request. The web server was not the originator of the file; however, it has the ability to tell a requesting user things like file size, content-type (or MIME type), and other useful properties. This is depicted for the example in figure 2.
In the scenario shown in figure 2, a content creator authors a file in MS Word. During that process, Word annotates the file with basic MSOffice metadata, including AUTHOR and PAGE_COUNT. Later, the content creator may publish her file on the Apache HTTPD web server, where it becomes available for downstream users to acquire. When a downstream user requests the file from Apache HTTPD, the web server annotates the file with HttpHeaders metadata, such as CONTENT_TYPE.
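The flow just described, in which more than one actor contributes metadata to the same file, can be sketched as a map that each step adds to. This is only an illustration under stated assumptions: the class MetadataPipelineSketch, the model-qualified key names such as "MSOffice.AUTHOR", and the sample values are hypothetical and do not reflect how Tika, MS Word, or Apache HTTPD store metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MetadataPipelineSketch {

    // Step 1: the authoring application records only what it actually knows.
    // A null argument models a field that was never filled in.
    public static Map<String, String> authoringStep(String author, Integer pageCount) {
        Map<String, String> metadata = new LinkedHashMap<>();
        if (author != null) {
            metadata.put("MSOffice.AUTHOR", author);
        }
        if (pageCount != null) {
            metadata.put("MSOffice.PAGE_COUNT", String.valueOf(pageCount));
        }
        return metadata;
    }

    // Step 2: the web server adds its own field and leaves earlier ones untouched.
    public static Map<String, String> servingStep(Map<String, String> authored, String contentType) {
        Map<String, String> metadata = new LinkedHashMap<>(authored);
        metadata.put("HttpHeaders.CONTENT_TYPE", contentType);
        return metadata;
    }

    public static void main(String[] args) {
        // Illustrative values only: the author was never recorded for this file.
        Map<String, String> metadata =
                servingStep(authoringStep(null, 12), "application/msword");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Prefixing each key with its model name keeps fields from different actors apart, and a missing key makes it visible when a step never populated a field.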
With all of these actors in the system, it’s no wonder that metadata quality, or the examination and assessment of captured metadata for file types, is a big concern. In any one of the steps in figure 2, the metadata for the file could be changed or simply not populated, affecting some downstream user of the file or some software that must make sense of it later. What’s more, even if the metadata is populated, it’s often difficult to compare metadata captured in different files, even when the captured metadata does in fact represent the same concepts. This is often because each metadata model uses its own terms, potentially its own units for those terms, and, ultimately, its own definitions for those terms as well.
Metadata quality is of prime importance, especially in the case of correlating metadata for files of different types and, most often, different metadata models. For a writer of software that must deal with thousands of different file types and metadata models every day, it’s no easy challenge to tackle metadata correlation.
Here comes Tika to save the day again!
Unifying heterogeneous standards
Lucky for us, Tika’s metadata layer is designed with exactly the aforementioned metadata quality challenges in mind. Tika provides a Property class that implements the Adobe XMP standard for capturing metadata attributes. XMP defines a property (called PropertyType in Tika) as some form of metadata captured about an annotated document. XMP also defines property values that are captured for each property of metadata. In Tika we call XMP property values ValueTypes. Let’s take a quick look at a snippet of the Tika Property class.
Listing 1 Tika’s Property class and its support for XMP-like metadata.
public final class Property {

    public static enum PropertyType {
        SIMPLE, STRUCTURE, BAG, SEQ, ALT
    }

    public static enum ValueType {
        BOOLEAN, OPEN_CHOICE, CLOSED_CHOICE, DATE, INTEGER, LOCALE,
        MIME_TYPE, PROPER_NAME, RATIONAL, REAL, TEXT, URI, URL, XPATH
    }

    // ...
}
The PropertyType and ValueType Java enums allow Tika to define a metadata attribute’s cardinality (for example, is it a SIMPLE value or a sequence of them, called SEQ for short), controlled vocabulary (for example, a CLOSED_CHOICE or a simple OPEN_CHOICE), and units (for example, a REAL or an INTEGER). Using Tika and its Property class, you are able to decide whether or not LAST_MODIFIED in the HttpHeaders metadata model is roughly equivalent, in terms of units, controlled vocabulary, and cardinality, to LAST_SAVED in the MSOffice metadata model.
These capabilities are useful in comparing, validating, and understanding metadata properties (recall from table 1 that these are important things to capture for each metadata element) and in dealing with heterogeneous metadata models and formats. Tika’s goal is to allow you to curate high-quality metadata in your software application.
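As a rough illustration of the comparison described in this section, the sketch below tags a few fields with a property type and a value type and then checks whether two fields agree on both. It does not use Tika's Property class. The enums are redeclared locally (ValueType as a subset of listing 1) so the example stands alone, the Prop descriptor is hypothetical, and the types assigned to LAST_MODIFIED, LAST_SAVED, and PAGE_COUNT are assumptions rather than Tika's actual definitions.

```java
public class PropertyCompareSketch {

    // Redeclared from listing 1 so the sketch is self-contained.
    public enum PropertyType { SIMPLE, STRUCTURE, BAG, SEQ, ALT }

    // A subset of the value types in listing 1.
    public enum ValueType { BOOLEAN, DATE, INTEGER, MIME_TYPE, REAL, TEXT }

    // Hypothetical descriptor: a model-qualified field name plus its two type tags.
    public static final class Prop {
        final String model;
        final String name;
        final PropertyType propertyType;
        final ValueType valueType;

        public Prop(String model, String name, PropertyType propertyType, ValueType valueType) {
            this.model = model;
            this.name = name;
            this.propertyType = propertyType;
            this.valueType = valueType;
        }

        @Override
        public String toString() {
            return model + "." + name;
        }
    }

    // Assumed typings, chosen for illustration only.
    public static final Prop HTTP_LAST_MODIFIED =
            new Prop("HttpHeaders", "LAST_MODIFIED", PropertyType.SIMPLE, ValueType.DATE);
    public static final Prop OFFICE_LAST_SAVED =
            new Prop("MSOffice", "LAST_SAVED", PropertyType.SIMPLE, ValueType.DATE);
    public static final Prop OFFICE_PAGE_COUNT =
            new Prop("MSOffice", "PAGE_COUNT", PropertyType.SIMPLE, ValueType.INTEGER);

    // Two fields are "roughly equivalent" here when cardinality and value type both match.
    public static boolean roughlyEquivalent(Prop a, Prop b) {
        return a.propertyType == b.propertyType && a.valueType == b.valueType;
    }

    public static void main(String[] args) {
        System.out.println(HTTP_LAST_MODIFIED + " ~ " + OFFICE_LAST_SAVED + ": "
                + roughlyEquivalent(HTTP_LAST_MODIFIED, OFFICE_LAST_SAVED));
        System.out.println(OFFICE_LAST_SAVED + " ~ " + OFFICE_PAGE_COUNT + ": "
                + roughlyEquivalent(OFFICE_LAST_SAVED, OFFICE_PAGE_COUNT));
    }
}
```

Matching type tags only establish that two fields can be compared mechanically. Whether they mean the same thing still depends on each model's definition of the term.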
Summary
We’ve helped you familiarize yourself with metadata models, Tika’s support for the different properties of metadata models, and most of the important challenges behind dealing with metadata models.