In this tutorial we will learn how to set up an additional GraphStream module that provides a reader for extracting a graph from the Amazon.com product database.
The idea
Most of the time, when you view a product page on Amazon.com, the site recommends several other items. You see something like "customers who bought item A also bought B, C, D, etc.". The recommended items are usually semantically close: they are somewhat similar.
If we represent each item by a node and place an edge between items that are seen as "similar" according to the Amazon site, we can produce a graph of relationships between items.
Such graphs then show clusters of semantically close nodes.
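The idea above can be sketched in a few lines of plain Java. This is only an illustration: it does not use GraphStream or the Amazon services, the class name SimilarityGraphSketch is hypothetical, and the item names and recommendation lists are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class SimilarityGraphSketch {
    /** Builds an undirected adjacency map: one node per item, one edge per recommendation. */
    static Map<String, Set<String>> build(Map<String, List<String>> recommendations) {
        Map<String, Set<String>> adjacency = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : recommendations.entrySet()) {
            String item = entry.getKey();
            adjacency.computeIfAbsent(item, k -> new TreeSet<>());
            for (String similar : entry.getValue()) {
                // Store the edge in both directions, since "similar" is treated as symmetric here.
                adjacency.get(item).add(similar);
                adjacency.computeIfAbsent(similar, k -> new TreeSet<>()).add(item);
            }
        }
        return adjacency;
    }

    public static void main(String[] args) {
        // Made-up data: "customers that bought A also bought B and C", and so on.
        Map<String, List<String>> recommendations = new LinkedHashMap<>();
        recommendations.put("A", Arrays.asList("B", "C"));
        recommendations.put("B", Arrays.asList("C"));
        recommendations.put("D", Arrays.asList("E"));
        System.out.println(build(recommendations));
        // prints {A=[B, C], B=[A, C], C=[A, B], D=[E], E=[D]}
    }
}
```

In this toy data set the items A, B and C form one cluster and D and E another, which is the kind of structure the crawled graph is expected to reveal.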
Installation
The project is called "Amazon Crawler". You can download the latest version using SVN:
svn co https://graphstream.svn.sourceforge.net/svnroot/graphstream/GraphStreamAmazon/trunk AmazonCrawler
This module needs a Java library that gives access to the Amazon Web Services (AWS).
This library is called A2S and can be downloaded from the AWS site.
In addition, to use this library you need to register an AWS account to obtain an "access key id" and an "associate tag". Without these you will not be able to access the Amazon database.
You will need the following jars provided in A2S:
- amazon-a2s-X-X-X-java-library.jar
- commons-codec-X.X.jar
- commons-httpclient-X.X.X.jar
- commons-logging-X.X.jar
- activation.jar
- jaxb-all-deps.jar
- jaxb-api.jar
- jaxb-impl.jar
- jaxb-xjc.jar
- jsrXXX_X.X.jar
- log4j-X.X.X.jar
Here each "X" stands for the version numbers of the jars shipped with A2S.
You will also need to add the following command line arguments to the "java" command:
-Djava.endorsed.dirs="/where/you/have/installed/amazon-a2s-library/third-party/jaxb/" -Dlog4j.configuration="/where/you/checkouted/GraphStreamAmazon/log4j.properties"
How it works
The module is a simple graph reader. It is therefore used like any other reader: either by calling the read() method, which fetches all the data at once (this can take quite a long time since we are operating across the Internet), or by using the begin(), nextEvents(), end() cycle.
See this tutorial for details on the use of graph readers.
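To make the difference between the two styles concrete, here is a minimal sketch built around a toy reader. ToyReader is a hypothetical stand-in written for this illustration, not the GraphStream reader API; it only mimics the method names mentioned above, and its nextEvents() is assumed to return false once nothing is left to deliver.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IncrementalReadSketch {
    /** Hypothetical stand-in for a graph reader; not the GraphStream API. */
    static class ToyReader {
        private final Iterator<String> events;
        private final List<String> delivered = new ArrayList<>();

        ToyReader(List<String> source) { events = source.iterator(); }

        void begin() { delivered.clear(); }

        /** Delivers one event and returns false when nothing is left. */
        boolean nextEvents() {
            if (!events.hasNext()) return false;
            delivered.add(events.next());
            return true;
        }

        void end() { /* a real reader would release its resources here */ }

        /** Convenience method: runs the whole cycle in one blocking call. */
        void read() {
            begin();
            while (nextEvents()) { }
            end();
        }

        List<String> delivered() { return delivered; }
    }

    public static void main(String[] args) {
        ToyReader reader = new ToyReader(Arrays.asList("add node A", "add node B", "add edge A-B"));
        reader.begin();
        int steps = 0;
        while (reader.nextEvents()) {
            steps++; // between two steps the caller is free to redraw a display, check a timeout, etc.
        }
        reader.end();
        System.out.println(steps + " events: " + reader.delivered());
    }
}
```

With a slow source such as a remote web service, the step-by-step cycle lets the program stay responsive while the graph grows, whereas a single read() call blocks until everything has been fetched.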
You must, however, configure the reader. First, when creating it, you must provide two strings: your access key identifier and your associate tag. As seen above, Amazon gives you these two values when you register an account.
The Amazon database is simply huge. Therefore we cannot explore it completely. You can browse it until a given number of nodes have been created. By default this number is set to 10000. You can change it using the setLimit(int) method.
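The effect of such a limit can be sketched with a bounded exploration over a small in-memory map. This is an assumption-laden illustration: the source does not say in which order the reader explores the database, so the breadth-first order used here is a choice made for the example, and the class BoundedCrawlSketch and its data are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class BoundedCrawlSketch {
    /** Explores from `start` breadth-first and stops as soon as `limit` nodes have been created. */
    static Set<String> crawl(Map<String, List<String>> similar, String start, int limit) {
        Set<String> created = new LinkedHashSet<>();
        Deque<String> toVisit = new ArrayDeque<>();
        if (limit <= 0) return created;
        created.add(start);
        toVisit.add(start);
        while (!toVisit.isEmpty()) {
            String current = toVisit.poll();
            for (String next : similar.getOrDefault(current, Collections.<String>emptyList())) {
                if (created.size() >= limit) return created; // node budget exhausted
                if (created.add(next)) toVisit.add(next);    // only enqueue items seen for the first time
            }
        }
        return created;
    }

    public static void main(String[] args) {
        // Made-up "similar items" lists standing in for the remote database.
        Map<String, List<String>> similar = new HashMap<>();
        similar.put("A", Arrays.asList("B", "C"));
        similar.put("B", Arrays.asList("D"));
        similar.put("C", Arrays.asList("E"));
        similar.put("D", Arrays.asList("F"));
        System.out.println(crawl(similar, "A", 4));
        // prints [A, B, C, D]
    }
}
```

Items E and F are never reached because the budget of four nodes is used up first; a larger limit gives a larger graph at the cost of more requests.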
The last parameter to configure is the starting point used to explore the database. It takes the form of an ASIN (the unique identifier Amazon uses to identify a product). There are several ways to obtain ASINs, but the reader provides a utility method for fetching book ASINs: searchForBookWithTitle(String).
This method returns the first book that matches the given title (the title need not be exact; it can be a single word or several).
This ASIN must be given to a special begin(String) method, because the begin(InputStream) method cannot be used (it throws a runtime exception). You can also use the read(String) method (but not the read(InputStream) method, for the same reason).
Here is a simple example:
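// Imports are omitted: the packages of the GraphStream and Amazon Crawler
// classes depend on the versions you have installed.
public class TestAmazonCrawler {
    public static void main( String args[] ) {
        // Expects the access key identifier and the associate tag as arguments.
        if( args.length > 1 )
            new TestAmazonCrawler( 400, args[0], args[1] );
    }

    public TestAmazonCrawler( int limit, String key, String tag ) {
        Graph graph = new DefaultGraph();
        GraphReaderAmazon gra = new GraphReaderAmazon( key, tag );
        // The helper listens to the reader and reports its events to the graph.
        GraphReaderListenerHelper readerHelper = new GraphReaderListenerHelper( graph );
        gra.addGraphReaderListener( readerHelper );
        // Stop exploring once this number of nodes has been created.
        gra.setLimit( limit );
        graph.display();
        try {
            // Use the first book matching this title as the starting point.
            String ASIN = gra.searchForBookWithTitle( "The Art Of Computer Programming" );
            if( ASIN != null ) {
                // Also write every event to a DGS file while crawling.
                GraphReaderListenerWriter readerWriter = new GraphReaderListenerWriter( new GraphWriterDGS(), "amazon_"+ASIN+"_"+limit+".dgs" );
                gra.addGraphReaderListener( readerWriter );
                gra.begin( ASIN );
                while( gra.nextEvents() ) {}
                gra.end();
                readerWriter.end();
            }
        }
        catch( Exception e ) {
            e.printStackTrace();
        }
    }
}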