Thread Contexts in ManifoldCF

April 28, 2011 by Krishna Srinivasan

This article is based on ManifoldCF in Action, to be published in October 2011. It is being reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi and epub. When mobile formats become available, all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘wright0535’ and get 35% discount on eBooks and pBooks ]

Thread Contexts

Introduction

To save work, ManifoldCF provides a mechanism for keeping around data that is local to a persistent thread. This thread-local storage mechanism is called a thread context. You will find that many of the classes and methods in ManifoldCF require a thread context as an argument. The type of the argument is org.apache.manifoldcf.core.interfaces.IThreadContext.

The methods supported by a thread context are relatively simple. If you look at the interface named above, you will see that there is a save() method and a get() method for saving and later retrieving some keyed piece of data.
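To make the save()/get() idea concrete, here is a minimal sketch of keyed storage that is meant to be used by a single thread. SimpleThreadContext is a hypothetical stand-in written for this illustration, not ManifoldCF's implementation of IThreadContext; the method names follow the text, while the String key type, the Object value type, and the "null removes the key" behavior are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a thread context: a keyed store used by one thread only.
class SimpleThreadContext {
    private final Map<String, Object> data = new HashMap<>();

    // Save an object under a key; saving null removes the key (an assumption of this sketch).
    void save(String key, Object value) {
        if (value == null) {
            data.remove(key);
        } else {
            data.put(key, value);
        }
    }

    // Return the object saved under the key, or null if nothing was saved.
    Object get(String key) {
        return data.get(key);
    }
}

class SimpleThreadContextDemo {
    public static void main(String[] args) {
        SimpleThreadContext context = new SimpleThreadContext();
        context.save("handle", "expensive-to-build object");
        System.out.println(context.get("handle"));
        System.out.println(context.get("missing"));
    }
}
```

Because the map is never synchronized, the sketch is only safe when one thread owns the context, which is exactly the usage the article describes.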

A thread context is usually created shortly after a persistent thread begins executing. In the case of a main thread or request thread, ManifoldCF creates the thread context as soon as ManifoldCF gets control. The way to create a thread context is as follows:

[code lang="java"]IThreadContext context = ThreadContextFactory.make();[/code]

This seems pretty straightforward. But that’s not all there is to it. While it is sometimes acceptable to have more than one thread context per actual thread, it is never acceptable to trade thread contexts across threads. When you are writing connectors in ManifoldCF, there is rarely any need to worry about this situation, because most objects are created and used by a single thread only. But, occasionally, people need to create shared objects. When this happens, they are sometimes tempted to store thread contexts or (more typically) other objects or handles that were created using a specific thread context, in objects which may be used by more than one ManifoldCF persistent thread. Doing this is a great way to build software that fails in all sorts of non-obvious ways, and this practice should be avoided at all costs.
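One way to picture the "never trade thread contexts across threads" rule is a context that remembers which thread created it and refuses to serve any other thread. The sketch below is purely illustrative: OwnedContext and its owner check are hypothetical, and nothing in the text says ManifoldCF performs such a check itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical context that remembers which thread created it.
class OwnedContext {
    private final Thread owner = Thread.currentThread();
    private final Map<String, Object> data = new HashMap<>();

    // Fail fast if any thread other than the creator touches this context.
    private void checkOwner() {
        if (Thread.currentThread() != owner) {
            throw new IllegalStateException("thread context used by a thread that does not own it");
        }
    }

    void save(String key, Object value) {
        checkOwner();
        data.put(key, value);
    }

    Object get(String key) {
        checkOwner();
        return data.get(key);
    }
}

class OwnedContextDemo {
    // Returns true if a second thread was rejected when it tried to use the context.
    static boolean rejectedOnOtherThread(OwnedContext context) {
        AtomicBoolean rejected = new AtomicBoolean(false);
        Thread other = new Thread(() -> {
            try {
                context.get("anything");
            } catch (IllegalStateException e) {
                rejected.set(true);
            }
        });
        other.start();
        try {
            other.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return rejected.get();
    }

    public static void main(String[] args) {
        OwnedContext context = new OwnedContext();
        context.save("key", "value"); // fine: called by the owning thread
        System.out.println(rejectedOnOtherThread(context));
    }
}
```

A guard like this turns a subtle cross-thread failure into an immediate, visible one, which can be useful while debugging shared objects.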

For example, listing 1 should immediately raise red flags in your mind.

Listing 1 A very bad way to construct a shared object

[code lang="java"]// ANTI-PATTERN: the shared object captures a handle tied to one thread's context
public class MyObjectFactory
{
  static final Object lock = new Object();
  static MyObject multithreadReference;

  public static MyObject get(IThreadContext tc)
  {
    synchronized (lock) // Only one thread at a time
    {
      // If the object has not been created yet, create it
      if (multithreadReference == null)
        multithreadReference = new MyObject(tc);
      return multithreadReference;
    }
  }

  public static class MyObject
  {
    IDBInterface database;

    public MyObject(IThreadContext tc)
    {
      // Create the member variable "database" from the caller's thread context
      database = DBInterfaceFactory.make(tc,"mydatabasename",
        "myusername","mypassword");
    }
    // … Do stuff that needs "database"
  }
}[/code]

In this example, the programmer attempts to use a shared object in multiple threads. The intent is that a thread uses the MyObjectFactory.get() method to get a reference to the shared object. The shared object (of class MyObject) is, however, initialized in a manner that requires a thread context! Even worse, looking inside the MyObject class, you can see it is using the thread context to create a secondary object, which is then being saved as a member variable. Someone looking only at the MyObject member variables might not even be able to figure this out. The database handle ends up bound to the thread context of whichever thread happened to call get() first, yet every other thread that obtains the shared object will use that same handle. This is definitely not going to work in real life.

So, what is the proper way to construct code like this? Actually, there are two answers. One answer involves passing the thread context into each method of the MyObject class so that each method will create the database handle. See listing 2.

Listing 2 One solution to the problem of cross-thread shared objects

[code lang="java"]public class MyObjectFactory
{
  static final Object lock = new Object();
  static MyObject multithreadReference;

  public static MyObject get()
  {
    synchronized (lock)
    {
      if (multithreadReference == null)
      {
        // Construct the object without a thread context
        multithreadReference = new MyObject();
      }
      return multithreadReference;
    }
  }

  public static class MyObject
  {
    public MyObject()
    {
    }

    // Protected method builds a database handle from the caller's thread context
    protected IDBInterface makeDatabaseHandle(IThreadContext tc)
    {
      return DBInterfaceFactory.make(tc,"mydatabasename",
        "myusername","mypassword");
    }
    // … All methods have a thread context argument
  }
}[/code]

In this solution, the object is still shared across threads with the same semantics, but care has been taken to be sure that no thread-context-related data is stored as a class member. This, of course, means that every MyObject method that needs a database will have to receive a thread context argument.
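The calling side of this pattern can be sketched without ManifoldCF on the classpath. In the example below, CallerContext and DbHandle are hypothetical stand-ins for IThreadContext and IDBInterface, and StatelessShared plays the role of MyObject; the point is only that each call builds its handle from the context of the thread making the call, so nothing thread-specific is kept in a field.

```java
// Hypothetical stand-in for a thread context; records the thread that created it.
class CallerContext {
    final String threadName = Thread.currentThread().getName();
}

// Hypothetical stand-in for a database handle built from a context.
class DbHandle {
    final CallerContext owner;

    DbHandle(CallerContext owner) {
        this.owner = owner;
    }
}

// Shared across threads; holds no context-derived state.
class StatelessShared {
    private DbHandle makeHandle(CallerContext tc) {
        return new DbHandle(tc);
    }

    // Every method that needs a handle receives the caller's context.
    String handleOwner(CallerContext tc) {
        DbHandle handle = makeHandle(tc);
        return handle.owner.threadName;
    }
}

class StatelessSharedDemo {
    static final StatelessShared SHARED = new StatelessShared();

    // Calls the shared object from a new thread and returns what it reported.
    static String callFromNewThread(String threadName) {
        String[] result = new String[1];
        Thread t = new Thread(
            () -> result[0] = SHARED.handleOwner(new CallerContext()), threadName);
        t.start();
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(callFromNewThread("worker-1"));
        System.out.println(callFromNewThread("worker-2"));
    }
}
```

Both threads use the same StatelessShared instance, but each handle belongs to the context of the thread that asked for it.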

Is there a better way? There may be—but only if the contract is changed somewhat. You see, as soon as the shared object becomes specific to a thread in any way, it can only be used by that one thread until that thread is done with it. So, instead of allowing multiple threads to use the same object simultaneously, we can imagine a shared object that only one thread at a time can use. See listing 3.

Listing 3 Introducing a setContext() method to allow cross-thread sharing

[code lang="java"]public class MyObjectFactory
{
  static final Object lock = new Object();
  static MyObject multithreadReference = new MyObject();

  // Method to get the shared object; blocks until it is available
  public static MyObject grab(IThreadContext tc)
    throws InterruptedException
  {
    synchronized (lock)
    {
      // Wait until no other thread is holding the object
      while (multithreadReference == null)
        lock.wait();
      MyObject rval = multithreadReference;
      multithreadReference = null;
      rval.setContext(tc); // Initialize with the thread context
      return rval;
    }
  }

  // Method to release the shared object
  public static void release(MyObject myobject)
  {
    synchronized (lock)
    {
      myobject.setContext(null); // Clear the thread context, to prevent problems
      multithreadReference = myobject;
      lock.notifyAll();
    }
  }

  public static class MyObject
  {
    protected IDBInterface database = null;

    public MyObject()
    {
    }

    // Context set method initializes "database"
    protected void setContext(IThreadContext tc)
    {
      if (tc == null)
        database = null;
      else
        database = DBInterfaceFactory.make(tc,"mydatabasename",
          "myusername","mypassword");
    }
    // … Methods use "database"
  }
}[/code]

In this implementation, the caller is meant to use the methods MyObjectFactory.grab() and MyObjectFactory.release() to obtain the object temporarily and later release it. As part of obtaining the object, the object is initialized with the current thread context. The object is expected to be used by the same thread that performed the grab() operation, or there would be little point in the whole exercise.
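On the calling side, the grab()/release() contract pairs naturally with try/finally, so the object is handed back even if the work in between throws. The sketch below assumes a simplified holder (Slot, with a StringBuilder standing in for MyObject and no thread context) so that it can run without ManifoldCF; all of these names are hypothetical.

```java
// Hypothetical single-slot holder following the grab()/release() pattern of listing 3.
class Slot {
    private final Object lock = new Object();
    private StringBuilder resource = new StringBuilder(); // stand-in for MyObject

    StringBuilder grab() throws InterruptedException {
        synchronized (lock) {
            // Wait until no other thread is holding the resource.
            while (resource == null) {
                lock.wait();
            }
            StringBuilder rval = resource;
            resource = null;
            return rval;
        }
    }

    void release(StringBuilder r) {
        synchronized (lock) {
            resource = r;
            lock.notifyAll();
        }
    }
}

class SlotDemo {
    // Starts the given number of workers; each grabs, appends one character, and releases.
    static int run(int threads) {
        Slot slot = new Slot();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    StringBuilder r = slot.grab();
                    try {
                        r.append('x'); // exclusive use by this thread
                    } finally {
                        slot.release(r); // always hand the object back
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
            StringBuilder r = slot.grab();
            try {
                return r.length();
            } finally {
                slot.release(r);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(run(8)); // prints 8
    }
}
```

StringBuilder is not thread-safe, so the count only comes out right because each worker has the object to itself between grab() and release().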

Summary

We discussed the thread-local storage mechanism called a thread context, which ManifoldCF uses to keep data that is local to a persistent thread.

Filed Under: Java Tagged With: ManifoldCF

Managing Output Connection Definitions in ManifoldCF

April 28, 2011 by Krishna Srinivasan

This article is based on ManifoldCF in Action, to be published in October 2011. It is being reproduced here by permission from Manning Publications. Manning publishes MEAP (Manning Early Access Program) eBooks and pBooks. MEAPs are sold exclusively through Manning.com. All pBook purchases include free PDF, mobi and epub. When mobile formats become available, all customers will be contacted and upgraded. Visit Manning.com for more information. [ Use promotional code ‘java40beat’ and get 40% discount on eBooks and pBooks ]

Introduction

When you click on the List Output Connections link on the navigation menu, you enter the area of the UI that manages output connection definitions. We will need an output connection definition of some kind in order to demonstrate web crawling. But we can certainly make do with an output connection that uses the null output connector. Nevertheless, let’s visit some UI pages, so we can discuss them in greater depth.

Output connection definition list fields

The presented list of output connection definitions provides a little general information about each connection definition. While incomplete, it helps you to keep track of exactly which connection definition is which. Figure 1 shows the UI page in question. The fields are described in table 1.

Standard output connection definition tabs

The three tabs that make up the standard output connection definition tab set are the Name tab, the Type tab, and the Throttling tab. These tabs set the basic parameters of every output connection definition. It’s also no accident that these tabs are where information displayed in the output connection definition list comes from.

The Name tab

Refer to figure 2 for a screen shot of the Name tab. This tab is used solely for setting a connection definition’s name and description.

As mentioned before, the name given must be unique among output connection definitions, while the description can be anything, although you will benefit enormously by choosing a description that is meaningful and helpful. The name field can contain any Unicode character but is limited to 32 characters in total. The description field also may consist of any Unicode characters, but can be up to 255 characters in length.
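The length limits above can be expressed as a small check. This is only a sketch: ConnectionNameRules is a hypothetical helper, not part of ManifoldCF, and it assumes that "characters" means Unicode code points; the text does not say how the UI or the database actually counts them.

```java
// Hypothetical helper that applies the stated limits: 32 characters for the name,
// 255 for the description. Counting by code point is an assumption of this sketch.
class ConnectionNameRules {
    static final int MAX_NAME_LENGTH = 32;
    static final int MAX_DESCRIPTION_LENGTH = 255;

    // Number of Unicode code points in the string.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    static boolean fits(String name, String description) {
        return length(name) <= MAX_NAME_LENGTH
            && length(description) <= MAX_DESCRIPTION_LENGTH;
    }
}

class ConnectionNameRulesDemo {
    public static void main(String[] args) {
        System.out.println(ConnectionNameRules.fits("Solr index", "Main search index"));
        System.out.println(ConnectionNameRules.fits(
            "A connection name that is far too long to be accepted", ""));
    }
}
```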

The Type tab

The output connection definition Type tab can be seen in figure 3. Use this tab to select the output connection type (which maps to the connector class by way of a database table).

If the connection type you are looking for is not in the dropdown, it is because your output connector has not been registered, and you will need to take care of that problem first.

The Throttling tab

Figure 4 shows what the Throttling tab looks like.

For an output connection definition, all you can do with this tab is to set the Max Connections (per JVM) connection parameter. This parameter limits the number of active, connected instances of the underlying connector class that have the same exact configuration. For example, you may have two Solr indexes and a connection definition for each one. Since the connection definition information differs between the two, ManifoldCF will keep two distinct pools of connected Solr connection instances and apply the corresponding limit to the number of connection instances managed by each pool.
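The idea of one limited pool per distinct configuration can be sketched with a map of semaphores. ConnectionLimiter below is a hypothetical illustration, not ManifoldCF's pooling code: the configuration is reduced to a string key, and the limit is applied only when a pool is first created.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: one permit pool per distinct connection configuration.
class ConnectionLimiter {
    private final Map<String, Semaphore> pools = new ConcurrentHashMap<>();

    // Try to take a connection slot for this configuration.
    // Returns false if that configuration's pool is already at its limit.
    // maxConnections is only used the first time a configuration is seen.
    boolean tryAcquire(String configKey, int maxConnections) {
        return pools
            .computeIfAbsent(configKey, k -> new Semaphore(maxConnections))
            .tryAcquire();
    }

    // Give a slot back to the pool for this configuration.
    void release(String configKey) {
        Semaphore pool = pools.get(configKey);
        if (pool != null) {
            pool.release();
        }
    }
}

class ConnectionLimiterDemo {
    public static void main(String[] args) {
        ConnectionLimiter limiter = new ConnectionLimiter();
        System.out.println(limiter.tryAcquire("solr-A", 2)); // true
        System.out.println(limiter.tryAcquire("solr-A", 2)); // true
        System.out.println(limiter.tryAcquire("solr-A", 2)); // false: solr-A is at its limit
        System.out.println(limiter.tryAcquire("solr-B", 2)); // true: separate pool
    }
}
```

The two Solr configurations never compete for each other's slots, which mirrors the two-index example in the paragraph above.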

So, what should you set this parameter to? In part, the answer depends on the semantics of the target system. If each target system connection has a cost, or there is a limit of some kind in the target system for the number of connections to it, then you will want to control the number of instances based on that constraint. If there is no need for limits of any kind, you may safely set this parameter as high as the number of working threads (30, by default). Setting the parameter higher than that does no harm but has little benefit, since there is a limit to the number of connection instances ManifoldCF can use at any given time in any case.

Table 2 describes the target system constraints for the output connectors currently part of ManifoldCF.

Output connection status

Any time you finish editing an output connection definition or if you click the View link from the output connection definition list, you will see a connection definition information and status summary screen, similar to figure 5.

This screen has two important purposes: first, letting you see what you actually have set up for your connection definition parameters, and, second, showing you the status of a connection created with your connection definition. The connection status usually represents the results of an actual interaction with the target system. It is designed to validate the connection definition parameters you’ve described. However, it can also be very helpful in diagnosing transient connection problems that might arise (for example, expired credentials or a crashed Solr instance). This makes the connection definition information and status summary screen perhaps the best invention since the Chinese food buffet, so please make it a point to visit this screen as part of your diagnostic bag of tricks.

Summary

Setting up connection definitions is a fundamental function of the ManifoldCF crawler UI. Each kind of connection definition has its own link in the navigation area. We discussed output connection definitions.

Filed Under: Java Tagged With: ManifoldCF
