Tutorial : Text geotagging with OpenCalais

This tutorial will show how to use version 3.1 of the Calais API to identify geographic references in a text and display them on an OpenLayers map. I am going to use the Calais API with JSON output (which is a new feature of version 3.1), so that all the processing can be done easily in JavaScript entirely inside a web browser.

A running version of the demo application can be found here. The code can be downloaded here.

What is the Calais API?

From the documentation:

The Calais Web Service allows you to automatically annotate your content with rich semantic metadata, including entities such as people and companies and events and facts such as acquisitions and management changes.

Version 3.1, currently in beta, adds latitude/longitude information for a few types of entities: cities, provinces/states and countries. This makes it possible to plot those entities on a map, in a way similar to what can be done with the Metacarta GeoTagger API.

The application page

The application is a single page that communicates with the Calais Web Service through AJAX. Here is the HTML code for the page:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>OpenCalais Geo Tutorial</title>
	<link rel="stylesheet" type="text/css" href="style.css" />
<script src="http://maps.google.com/maps?file=api&v=2&key=xxx"></script>
<script type="text/javascript" src="OpenLayers.js"></script>
<script type="text/javascript" src="processing.js"></script>
</head>
<body onload="initMap();">
<div id="main">
<div id="form" >
<label for="inputText">Text:</label>
<textarea name="text" id="inputText" rows="12"
      cols="50" ></textarea>
<input type="button" name="button" value="Submit"
      onclick="submitToOC();" /></div>
<div id="result">
<label for="result">Result:</label>
<div id="resultText"></div>
</div>
<div id="map"></div>
</div>
</body>
</html>

Some observations:

  • The Google Maps API is included so the satellite imagery can be used inside OpenLayers. A key can be obtained here.
  • There are 3 areas: the text area used to enter the text to geotag; the result area, which will contain the same text annotated with links to markers on the map; and the OpenLayers map, with markers at the places identified in the text by OpenCalais.
  • A suitable stylesheet can be found in the code archive.

Initialization of the map

When the page is loaded, the OpenLayers map gets initialized in the initMap function:

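// Shared state used by the other functions of the application.
var map, markerLayer, featureMap;

function initMap(){
  map = new OpenLayers.Map('map');
  // Background layer: Google hybrid imagery.
  var google = new OpenLayers.Layer.Google( "Google Hybrid" , {type: G_HYBRID_MAP });
  // Layer holding the markers for the places identified by OpenCalais.
  markerLayer = new OpenLayers.Layer.Markers("OpenCalais Places");
  // Index of map features, keyed by entity ID.
  featureMap = {};
  map.addLayers([google,markerLayer]);
  map.addControl(new OpenLayers.Control.LayerSwitcher());
  map.zoomToMaxExtent();
}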

This is straightforward OpenLayers initialization code. Two layers are created: the Google Hybrid layer serves as the background, and the marker layer holds the markers created at the locations identified in the text by OpenCalais.

Proxy

Due to cross-domain restrictions in browsers, I need to proxy my requests to the Calais Web Service through the web server that serves the application. The proxy transparently forwards the queries to the web service and the responses back to the browser. Since Google App Engine is easy to install and makes it possible to host the application for free, I decided to use its URL Fetch API for this, but there are countless other ways to achieve the same thing.

An App Engine project needs to be created with a single request handler acting as a proxy (for brevity, the configuration plumbing is omitted, but it can be found in the code archive):

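import urllib

from google.appengine.api import urlfetch
from google.appengine.ext import webapp

OCURL = "http://beta.opencalais.com/enlighten/rest/"
LICENSEID = "xxxxxxxxxxxxxxx"  # replace with your own Calais license key

class OCProxy(webapp.RequestHandler):
  def post(self):
    try:
      # Forward the browser's arguments, prepending the license key.
      str_ocargs = self.encoded_args()
      result = urlfetch.fetch(OCURL,
            "licenseID=%s&%s" % (LICENSEID, str_ocargs),
            urlfetch.POST)
      self.write_response(result)
    except Exception:
      # Any failure is reported to the browser as the JSON string "error".
      self.response.out.write('"error"')

  def encoded_args(self):
    # Re-encode every argument of the original request as a form body.
    # Values are encoded to UTF-8 so that urlencode accepts non-ASCII text.
    ocargs = {}
    for arg in self.request.arguments():
      ocargs[arg] = self.request.get(arg).encode("utf-8")
    return urllib.urlencode(ocargs)

  def write_response(self, result):
    # Pass the Calais response through unchanged when the call succeeded.
    if result.status_code == 200:
      self.response.out.write(result.content)
    else:
      self.response.out.write('"error"')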

So the behavior is as follows:

  • All POST queries are forwarded to http://beta.opencalais.com/enlighten/rest/, which is the endpoint for the prerelease 3.1 version of the Calais non-SOAP API
  • All the arguments of the original browser queries are forwarded as is
  • The license key is added in the form of a licenseID argument. You need to register in order to get a valid key. Even if you have downloaded the source package, you will still need to replace the key with your own in the app.py file.
  • This handler is registered for the /opencalais-geo/ocproxy route

Querying the Calais API

The next step is to actually query the Calais service with a text entered by the user. When the user clicks on the “Submit” button, the submitToOC function gets called:

function submitToOC(){
  clearMarkers();
  // Processing directives: analyze the raw text and return JSON.
  var paramsXML = '<c:params xmlns:c="http://s.opencalais.com/1/pred/" ' +
      'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">' +
      '<c:processingDirectives ' +
      'c:contentType="text/raw" ' +
      'c:outputFormat="application/json">' +
      '</c:processingDirectives>' +
      '<c:userDirectives />' +
      '<c:externalMetadata />' +
      '</c:params>';
  new OpenLayers.Ajax.Request('/opencalais-geo/ocproxy',
      {method: 'post',
       asynchronous: true,
       contentType: 'application/x-www-form-urlencoded',
       onComplete: processOCResponse,
       // paramsXML is a string, not a function. encodeURIComponent is used
       // because escape mishandles '+' and non-ASCII characters.
       postBody: "content=" + encodeURIComponent($('inputText').value) +
                 "&paramsXML=" + encodeURIComponent(paramsXML)});
}

function clearMarkers(){
  for(var featureK in featureMap)
    featureMap[featureK].destroyPopup();
  markerLayer.clearMarkers();
  featureMap = {};
}

The Calais web service can be queried with 3 arguments:

  • licenseID: This argument is added by the proxy, so it is not sent from the browser
  • content: This is the text to be analyzed
  • paramsXML: The configuration parameters for the analysis, in an XML format. For the purpose of this demo application, I am particularly interested in the contentType and outputFormat parameters, set respectively to text/raw (to perform the analysis on the text without any cleaning) and application/json (to receive the response in JSON format, instead of the standard RDF)

On completion of the query, the processOCResponse function will be called with the response from the Calais service.
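
As a rough illustration of what the browser sends, the form body can be assembled by a small helper. This is only a sketch: buildPostBody is a hypothetical name that does not appear in the code archive, and the XML passed in the example is a shortened placeholder rather than the real directives.

```javascript
// Hypothetical helper: builds the application/x-www-form-urlencoded body
// carrying the two arguments sent from the browser. The licenseID argument
// is deliberately absent, since the proxy adds it.
function buildPostBody(text, paramsXML) {
  return "content=" + encodeURIComponent(text) +
         "&paramsXML=" + encodeURIComponent(paramsXML);
}

// Placeholder inputs, for illustration only.
var body = buildPostBody("Paris & Rome", "<c:params />");
console.log(body);
```

Encoding each value separately keeps characters such as & inside the text from being mistaken for argument separators.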

Processing the response

Since the response is in JSON format, it can directly be evaled and put into a local variable. It is then an object with 2 kinds of attributes:

  • One attribute is doc, the value of which is an object that contains some meta information about the text. It also contains the text in the form analyzed by the service, in case it has been cleaned up. In my case, since the contentType processing directive was set to text/raw when submitting the text, the analyzed text should be identical to the one that was sent.
  • The other attributes contain information detected in the text and are in the form of http://d.opencalais.com/genericHasher-1/1d1529b7-da5f-3884-8de0-c765b3b7d3a3, that is, a unique ID for an entity or relation that is then described in the value object for the attribute. The full list of possible information can be found here. When a piece of information is about a relation between entities, the value does not contain the full data about the entities but instead refers to their IDs.
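
Where JSON.parse is available, it can be used instead of eval for this step. The sketch below is not part of the tutorial's code: parseCalaisResponse is a hypothetical helper, and the sample text is hand-written to mimic only the shape described above (a doc attribute plus ID-keyed attributes). It also returns null for the "error" string that the proxy writes on failure.

```javascript
// Hypothetical, hand-written response text. The ID suffix is invented and
// real responses carry many more attributes.
var sampleText = JSON.stringify({
  doc: { info: { document: "Paris is in France." } },
  "http://d.opencalais.com/genericHasher-1/example-id": {
    _typeGroup: "entities",
    _type: "City"
  }
});

// Returns the parsed object, or null when the text is not valid JSON or
// is not an object (for instance the "error" string sent by the proxy).
function parseCalaisResponse(text) {
  var parsed;
  try {
    parsed = JSON.parse(text);
  } catch (e) {
    return null;
  }
  return (parsed !== null && typeof parsed === "object") ? parsed : null;
}

var response = parseCalaisResponse(sampleText);
for (var key in response) {
  if (key === "doc") continue;  // meta information about the text
  console.log(key, "->", response[key]._type);
}
```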

The first step after evaluating the response is then to resolve the references to IDs. Since I am only concerned with entities (and only the geographic ones), this is not strictly necessary for this tutorial, but the Calais documentation details it and it can be useful if I wish to do something with the rest of the information. This is the purpose of the resolveReferences function. Along with that, information is grouped by type group (entities or relations) and then by the specific kind of entity or relation (e.g., City, Person, PersonPolitical), using the _typeGroup and _type attributes respectively. This is done so that entities.City in the resulting object refers to an index of all city entities found in the text. This step is performed by the createHierarchy function.

function processOCResponse(jsonResponse){
  var jsonObject = null;
  eval('jsonObject = ' + jsonResponse.responseText);
  resolveReferences(jsonObject);
  processGeoReferences(createHierarchy(jsonObject));
}

function resolveReferences(flatdb) {
  for (var element in flatdb)
    for (var attribute in flatdb[element]) {
      var val = flatdb[element][attribute];
      if (typeof val == 'string')
        if (flatdb[val] != null)
          flatdb[element][attribute] = flatdb[val];
    }
}

function createHierarchy(flatdb) {
  var hdb = new Object();
  for (var element in flatdb) {
    var elementType = flatdb[element]._type;
    var elementGroup = flatdb[element]._typeGroup;
    if (elementGroup != null) {
      if (hdb[elementGroup] == null)
        hdb[elementGroup] = new Object();
      if (elementType != null) {
        if (hdb[elementGroup][elementType] == null)
          hdb[elementGroup][elementType] = new Object();
        hdb[elementGroup][elementType][element] = flatdb[element];
      } else
        hdb[elementGroup][element] = flatdb[element];
    } else
      hdb[element] = flatdb[element];
  }
  return hdb;
}
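
To see what these two steps produce, here is a standalone run on a tiny flat response. The functions are compact restatements of the ones above so that the snippet runs on its own. The data is entirely hypothetical: the IDs, the ExampleRelation type and the person/place attribute names are invented for illustration and are not taken from the Calais documentation.

```javascript
// Compact restatement: replace string values that are IDs of other
// elements with the elements themselves.
function resolveReferences(flatdb) {
  for (var element in flatdb) {
    for (var attribute in flatdb[element]) {
      var val = flatdb[element][attribute];
      if (typeof val === "string" && flatdb[val] != null) {
        flatdb[element][attribute] = flatdb[val];
      }
    }
  }
}

// Compact restatement: group elements by _typeGroup, then by _type.
function createHierarchy(flatdb) {
  var hdb = {};
  for (var element in flatdb) {
    var group = flatdb[element]._typeGroup;
    var type = flatdb[element]._type;
    if (group == null) { hdb[element] = flatdb[element]; continue; }
    hdb[group] = hdb[group] || {};
    if (type == null) { hdb[group][element] = flatdb[element]; continue; }
    hdb[group][type] = hdb[group][type] || {};
    hdb[group][type][element] = flatdb[element];
  }
  return hdb;
}

// Hypothetical flat response (invented IDs, relation type and attributes).
var flat = {
  doc: { info: { document: "Alice moved to Paris." } },
  "id:city-1": { _typeGroup: "entities", _type: "City" },
  "id:person-1": { _typeGroup: "entities", _type: "Person" },
  "id:rel-1": { _typeGroup: "relations", _type: "ExampleRelation",
                person: "id:person-1", place: "id:city-1" }
};

resolveReferences(flat);
var hdb = createHierarchy(flat);
console.log(Object.keys(hdb.entities.City));
console.log(hdb.relations.ExampleRelation["id:rel-1"].place === flat["id:city-1"]);
```

After the run, the relation's place attribute points at the city object itself rather than at its ID, and doc stays at the top level because it has no _typeGroup.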

When the hierarchy is created, the result is passed to the processGeoReferences function, which does the work of annotating the text and putting pushpins on the map.

Preparing the data

For the purpose of this tutorial, I am only interested in entities which can potentially hold information about latitude and longitude. According to the documentation, these are cities, countries and provinces/states (but it is still possible that such an entity does not hold such information, in which case I won’t represent it on the map or annotate its location in the text).  Entities of these types are merged together in the mergeObjects function.
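
The check for missing coordinates can be sketched in isolation. The helper name hasCoordinates is hypothetical. The resolutions[0].latitude and resolutions[0].longitude shape follows the way the tutorial's own code reads the data, while the IDs and values below are made up.

```javascript
// Returns true when the entity's first resolution carries both coordinates.
function hasCoordinates(entity) {
  var detail = entity.resolutions && entity.resolutions[0];
  return !!detail && detail.latitude != null && detail.longitude != null;
}

// Hypothetical merged entities: IDs and coordinate values are illustrative.
var merged = {
  "id:city-1": { _type: "City",
                 resolutions: [{ name: "Paris,France", latitude: "48.85", longitude: "2.35" }] },
  "id:city-2": { _type: "City" }  // recognized as a city, but without a resolution
};

var mappable = Object.keys(merged).filter(function (id) {
  return hasCoordinates(merged[id]);
});
console.log(mappable);
```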

Each entity can appear multiple times in the text, possibly in different forms (e.g., US and United States). This is why there is an instances array attached to each entity returned in the result: each instance indicates where in the text the entity is present, by giving its offset from the start of the text and its length. I will use this information to annotate the text with links that, when clicked, center the map on the location of the entity. Since this requires modifying the text, the offsets of instances not yet processed could become invalid along the way. To prevent that, I perform the annotation in order of decreasing offset, so that the offsets of the instances still to be processed stay valid. This is why instances are sorted in the sortInstances function.
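
The effect of working backwards through the text can be seen in a small standalone sketch. It is a simplification of the approach and not the tutorial's code: annotate is a hypothetical helper that wraps each instance in square brackets instead of building links.

```javascript
// Wraps each instance in brackets, working from the end of the text
// backwards so that insertions never shift offsets still to be processed.
function annotate(text, instances) {
  var sorted = instances.slice().sort(function (a, b) {
    return b.offset - a.offset;  // decreasing offset
  });
  for (var i = 0; i < sorted.length; i++) {
    var inst = sorted[i];
    text = text.slice(0, inst.offset) +
           "[" + text.slice(inst.offset, inst.offset + inst.length) + "]" +
           text.slice(inst.offset + inst.length);
  }
  return text;
}

var sampleInstances = [
  { offset: 0, length: 5 },   // "Paris"
  { offset: 10, length: 4 }   // "Rome"
];
var annotated = annotate("Paris and Rome", sampleInstances);
console.log(annotated);
```

Processed in increasing order instead, the brackets added around the first instance would push the second one two characters to the right of its recorded offset.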

function processGeoReferences(dt){
  if(dt.entities && (dt.entities.Country || dt.entities.City || dt.entities.ProvinceOrState))
    analyzeData(dt.entities, dt.doc.info.document);
}

function analyzeData(entities,text){
  var sInstances =
    sortInstances(getInstances(mergeObjects([entities.Country,
					     entities.City,
					     entities.ProvinceOrState])));
  createTextAndMarkers(text,sInstances);
}

function mergeObjects(objects){
  var result = {};
  for(var i = 0 ; i < objects.length ; i++){
    if(!objects[i])
      continue;
    for(var prop in objects[i])
      result[prop] = objects[i][prop];
  }
  return result;
}

function getInstances(entities){
  var instances = [];
  for(var entityK in entities){
    var entity = entities[entityK];
    for(var i = 0 ; i < entity.instances.length ; i++){
      var tDetails = entity.instances[i];
      instances.push({id: entityK,
        offset: tDetails.offset,
        length: tDetails.length,
        exact: tDetails.exact,
        // First resolution, if the entity has any.
        detail: entity.resolutions ? entity.resolutions[0] : undefined,
        category: entity._type});
    }
  }
  return instances;
}

function sortInstances(instances){
  return instances.sort(function(a,b){
      return b.offset - a.offset;}); //DECR offset
}

We then get a list of sorted instances, ready to be used for annotating the text and displaying on the map in the createTextAndMarkers function.

Using the data

[Screenshot: Financial meltdown]

The next step, inside the createTextAndMarkers function, is to go through all the instances (now sorted by decreasing offset) and check whether each one has geographic coordinates. If not, the instance is simply ignored. Otherwise, I create a marker and annotate the text with a link that, when clicked, centers the map on the marker and pops up the complete name of the identified place. The marker is set up so that a click on it also opens this popup. To make this easy, the createFeatureWithMarker function uses an OpenLayers.Feature instead of a plain OpenLayers.Marker. This is basically a marker with data and instructions on how to present this data in a popup. Different icons are used for each type of place (city, province/state, country). Finally, the text is presented to the user and the extent of the map is changed in the updateZoom function to the smallest extent that makes all markers visible at once. If there is only one marker, since the Calais Web Service does not return an indication of scale, the map is centered on it and zoomed out to show the whole earth.

function createTextAndMarkers(text,sInstances){
  var icons = createIcons();
  for(var i = 0 ; i < sInstances.length ; i++){
    var instance = sInstances[i];
    if(instance.detail && instance.detail.longitude != undefined){
      text = text.substr(0,instance.offset) +
        '<a href="javascript:zoomOnAndPopup(\'' + instance.id +'\')">' +
        instance.exact + '</a>' +
        text.substr(instance.offset + instance.length);
      if(!featureMap[instance.id]){
        var feature =
          createFeatureWithMarker(instance.detail.longitude,
                                  instance.detail.latitude,
                                  instance.detail.name,
                                  icons[instance.category].clone());
        markerLayer.addMarker(feature.marker);
        featureMap[instance.id] = feature;
      }
    }
  }
  $("resultText").innerHTML = "<p class='ptop'>" +
    text.replace(/\n+/g,"</p><p>") + "</p>";
  updateZoom();
}

function createIcons(){
  var iconSize =  new OpenLayers.Size(20,20);
  var iconOffset = new OpenLayers.Pixel(-(iconSize.w/2), -iconSize.h);
  return {Country: new OpenLayers.Icon("img/marker.png",iconSize,iconOffset),
      City : new OpenLayers.Icon("img/marker-blue.png",iconSize,iconOffset),
      ProvinceOrState : new OpenLayers.Icon("img/marker-gold.png",iconSize,iconOffset)
      };
}

function zoomOnAndPopup(id){
  var feature = featureMap[id];
  map.setCenter(feature.lonlat);
  displayPopup(feature);
}

function createFeatureWithMarker(lon,lat,info,icon){
  var lonlat = new OpenLayers.LonLat(lon,lat);
  var contentHTML = "<div class='info' >" +
    info.split(",").join('<br/>') + "</div>";
  var config = {popupContentHTML : contentHTML,
		icon:icon, popupSize : new OpenLayers.Size(250, 40)};
  var feature = new OpenLayers.Feature(markerLayer,lonlat,config);
  var marker = feature.createMarker();
  marker.events.register('click', feature,
			 function(){displayPopup(feature);});
  return feature;
}

function displayPopup(feature){
  feature.destroyPopup();
  map.addPopup(feature.createPopup(true),true);
}

function updateZoom(){
  if(markerLayer.markers.length > 1)
    map.zoomToExtent(markerLayer.getDataExtent());
  else if(markerLayer.markers.length == 1)
    map.setCenter(markerLayer.markers[0].lonlat,1);
}
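
The idea of the smallest extent that contains all markers can be sketched with plain objects. This is not how OpenLayers implements getDataExtent: boundsOf is a hypothetical helper, the coordinates are approximate, and the sketch ignores extents that cross the 180° meridian.

```javascript
// Computes the smallest lon/lat box containing every point,
// or null when the list is empty.
function boundsOf(points) {
  if (points.length === 0) return null;
  var b = { left: Infinity, bottom: Infinity, right: -Infinity, top: -Infinity };
  for (var i = 0; i < points.length; i++) {
    b.left = Math.min(b.left, points[i].lon);
    b.right = Math.max(b.right, points[i].lon);
    b.bottom = Math.min(b.bottom, points[i].lat);
    b.top = Math.max(b.top, points[i].lat);
  }
  return b;
}

// Approximate, illustrative marker positions.
var markerPositions = [
  { lon: 2.35, lat: 48.85 },
  { lon: 12.5, lat: 41.9 }
];
var extent = boundsOf(markerPositions);
console.log(extent);
```

With a single point the box collapses to zero width and height, which is why the one-marker case needs a fixed zoom level instead of an extent.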

Conclusion

I tested the geotagging with a couple of articles, including Election fever fires up U.S. retirees in Mexico, and S.Korea joins global rescue, crisis summit from Reuters (who own OpenCalais and, incidentally, whose news maps are powered by Metacarta). These 2 articles are used in the screenshots. The result is good, I think. Still, there are a few instances where a place is recognized as a city (with a City entity present in the response) but does not have geographic coordinates. This usually seems to happen with minor cities. It might be useful to query an additional service like GeoNames to resolve them. Another thing that could be added is an indication of scale in addition to the lat/lon, to know which zoom level is suitable to show the entity in its entirety. Despite these minor quibbles, I am looking forward to the final 3.1 release of the API.

The final application can be tried here.

8 thoughts on “Tutorial : Text geotagging with OpenCalais”

  1. Tom Tague from Calais here.

    Guilhem – what an exceptionally great article! You take a new Calais capability, you create a great demonstration application – and you share the code and the steps you took to make it happen. This is really useful stuff and will help jump start anyone that wants to start experimenting.

    We’ll continue to evolve the geolocation capability of Calais for the foreseeable future. In particular we’ll be investing in more sophisticated disambiguation of place names.

    Soon each geography identified will also resolve to a de-referenceable URI. That URI will contain a little bit of information about the geography – but more importantly will point at other linked data assets (like Freebase, DBpedia, etc) that know a whole lot about the geography. Should allow some interesting applications.

    Thanks again for taking the time to not only do this – but to present it so that others can learn from your work.

    Regards,

  2. Great work and much needed. JSON / JavaScript and BONUS, Google App Engine tutorial and working code. Fantastic!

    This approach is THE way to go.

  3. Hello,

    I notice a bug when you click several times on a marker…
    The popup height size keeps growing (on IE 7 but not on FF)…

    Have you ever worked on a solution for this kind of bug?

    Thanks.

  4. @Sylvain : Actually I had never tested on IE7 so I hadn’t noticed. The popup content is just a very basic div, so it may be possible there is a bug in the OpenLayers library. If you have a local version of this tutorial, you could try to update OpenLayers and check if the problem has disappeared.

    1. @Guilhem : Thank you for your reply. I have spent several hours on it and I noticed some bug reports (now resolved) on the OpenLayers project about the createPopup function (the popup grows bigger each time it is displayed).

      It seems we have the same problem with the createMarker function with a feature.
      I didn’t succeed in resolving this bug but I noticed that it doesn’t exist if we don’t specify a size for the popup.
      So I just changed the OpenLayers.Popup.WIDTH and the OpenLayers.Popup.HEIGHT properties in the OpenLayers.js (I know it’s a shame but I don’t have much time to find a solution now and I only want to open popup in a fixed size).
