How I used the CIA World Factbook to test my product

In preparation for the release of the first version of PerfectLearn, testing is the order of the day. To make the testing process both more realistic and more enjoyable I decided to load an external dataset into PerfectLearn to see how it handled a non-trivial topic map.


Screencast showing the CIA World Factbook data after it has been imported into PerfectLearn.

After searching online for a couple of hours I finally settled on the CIA World Factbook, which, in its own words, “provides information on the history, people, government, economy, geography, communications, transportation, military, and transnational issues for 267 world entities.” All in all, the World Factbook is an interesting dataset that the CIA has made available for personal use.

The first thing to do when confronted with a task like this is to try to get a basic understanding of the nature of the data. After examining the contents of the decompressed factbook.zip file I concluded that the following files and directories were sufficient to extract the necessary information to build the initial topic map ontology with some supporting images for each country’s topic:

  • geos
    • *.html: HTML documents for the 267 world entities.
    • print/country/*.pdf: the corresponding PDF documents for the 267 world entities.
  • graphics
    • flags/large/*.gif: country flags in GIF format.
    • maps/newmaps/*.gif: country maps in GIF format.
  • wfbExt
    • sourceXML.xml: XML file mapping country names, codes, and the corresponding regions.
CIA World Factbook Directory

There is much more data available in the World Factbook than I am alluding to here. For example, the fields and rankorder directories contain all kinds of data for comparing countries (across several categories), and the appendix directory contains information about international organizations and groups, international environmental agreements, and so forth. Furthermore, there are both physical and political maps and population pyramids (in BMP format!) for all of the countries and territories. In short, the World Factbook is comprehensive, to say the least.

With an initial understanding of the data, the next step is to extract the information that is relevant for the current purpose. The HTML files in the geos directory provide the majority of the actual content for the countries, territories, and regions. In addition, the wfbExt/sourceXML.xml file (an excerpt of which is provided below) provides a convenient mapping between countries and their regions: each country record in the sourceXML.xml file includes “name”, “fips”, and “Region” attributes, which link countries to regions while also providing the country code (the fips attribute) for the individual countries and territories. The sourceXML.xml file will be crucial in the next phase, when the data is actually imported into the topic map. For now, however, the focus is on extracting the text for each country’s topic.

sourceXML.xml file excerpt

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<country>
	<country name="Afghanistan" fips="AF" Region="South Asia" />
	<country name="Akrotiri" fips="AX" Region="Europe" />
	<country name="Albania" fips="AL" Region="Europe" />
	<country name="Algeria" fips="AG" Region="Africa" />
	<country name="American Samoa" fips="AQ" Region="Oceania" />
	<country name="Andorra" fips="AN" Region="Europe" />
	<country name="Angola" fips="AO" Region="Africa" />
	<country name="Anguilla" fips="AV" Region="Central America" />
    ...
</country>
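
As a quick sanity check, and as a preview of how the import script reads this file later on, the mapping can be inspected with Groovy’s XmlSlurper. The following is a minimal sketch that assumes the same sourceXML.xml path used throughout this article:

def countriesXml = new XmlSlurper().parse(
    new File('/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/wfbExt/sourceXML.xml'))

// Print each country record's name, FIPS code, and region.
countriesXml.country.each { country ->
    println "${country.@name.text()} (${country.@fips.text()}) -> ${country.@Region.text()}"
}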

To painlessly extract data from HTML I normally resort to Apache Tika. Apache Tika is a Java library that makes it easy to extract metadata and text from numerous file types, including (but not limited to) PDF, Word, Excel, and PowerPoint files and, in this case, HTML files.
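
Before committing to a full extraction script, it is worth getting a feel for what Tika actually returns. The snippet below is a minimal sketch using Tika’s facade class; the af.html file (Afghanistan) is merely an example input:

import org.apache.tika.Tika

// The Tika facade detects the file type and returns the extracted plain text in a single call.
def tika = new Tika()
def sampleFile = new File('/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/geos/af.html')
println tika.parseToString(sampleFile)

The Extract.groovy script below uses the lower-level HtmlParser and BodyContentHandler combination instead, which gives more control over how each HTML document is parsed.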

All in all, only two (Groovy) scripts are required to extract the text from the HTML files and to import the data into PerfectLearn while creating the necessary relationships between the topics. What the first script, Extract.groovy (provided below), does is relatively straightforward. First, it imports the necessary Apache Tika classes (lines 7-11) and defines the source and target paths, that is, the directory containing the original HTML files and the directory into which the extracted text files are written (lines 14-19), after which it creates the target directory (line 25). Next, it iterates over all of the HTML files in the source directory, calling the extractContent function to extract the textual content from each HTML file and writing the result to a file in the processed directory (lines 27-39). The extractContent function is the most complex code in this script, but all it does is ask Tika to return the content of the document’s body as a plain-text string, stripping all of the HTML markup (lines 45-65). The extracted text is then passed to the sanitize function (lines 71-79), which removes superfluous text and injects some markup to improve the legibility of the text when it is finally rendered in PerfectLearn. As you can see, Tika does the vast majority of the heavy lifting in this script.

Extract.groovy

/*
Extract country text script (from accompanying HTML files)
By Brett Alistair Kromkamp
January 09, 2015
*/

import org.apache.tika.Tika
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.html.HtmlParser
import org.apache.tika.parser.ParseContext
import org.apache.tika.sax.BodyContentHandler

// ***** Constants *****
final def ORIGINAL_PATH = '/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/geos'
final def PROCESSED_PATH = '/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/processed/geos'

// ***** Setup *****
def originalDirectory = new File(ORIGINAL_PATH)
def processedDirectory = new File(PROCESSED_PATH)

// ***** Logic *****
println 'Starting extraction process.'

// Create 'processed' directory.
processedDirectory.mkdirs() // Non-destructive.

originalDirectory.eachFile { file ->
    if (file.isFile() && file.name.endsWith('.html')) {
        def textFileName = generateTextFileName(file.name.toString())

        // Create file with extracted text.
        def textFile = new File("$PROCESSED_PATH/$textFileName")
        textFile.withWriter { out ->
            def textContent = extractContent(file.text)
            println textFileName
            out.writeLine(textContent)
        }
    }
}

println 'Done!'

// ***** Helper functions *****

String extractContent(String content) {
    BodyContentHandler handler = new BodyContentHandler()
    Metadata metadata = new Metadata()
    InputStream stream

    def result = ''
    try {
        if (content != null) {
            stream = new ByteArrayInputStream(content.getBytes())
            new HtmlParser().parse(
                stream, 
                handler, 
                metadata, 
                new ParseContext())
            result = sanitize(handler.toString()).trim()
        } 
        return result
    } finally {
        stream?.close() // Avoid a NullPointerException if no stream was ever opened.
    }
}

String generateTextFileName(String htmlFileName) {
    return htmlFileName.replaceAll(~/\.html/, '') + '.txt'
}

String sanitize(String content) {
    return content
        .replaceAll(~/(?m)^\s+/, '') // Strip leading whitespace (and blank lines).
        .replaceAll(~/(?s)^Javascript.*Introduction ::/, 'Introduction ::') // Drop everything before the 'Introduction ::' section.
        .replaceAll(~/(?s)EXPAND ALL.*/, '') // Drop everything from the 'EXPAND ALL' control onwards.
        .replaceAll(~/(?m)^([A-Z].*\s+)::.*/, '<h2>$1</h2>') // Section headings, e.g. 'Introduction :: AFGHANISTAN', become <h2> elements.
        .replaceAll(~/(?m)^([a-z])([a-z|\s]*):/, '<strong>$1$2</strong>: ') // Lower-case field labels, e.g. 'noun:', become bold text.
        .replaceAll(~/(?m)^([A-Z])([a-z|\s|-]*):/, '<h3>$1$2</h3>') // Capitalised field names, e.g. 'Background:', become <h3> elements.
}
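
Once the script has run, a quick way to verify that the extraction is complete is to compare the number of generated text files against the number of source HTML files. A minimal sketch, reusing the same paths as Extract.groovy:

def originalDirectory = new File('/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/geos')
def processedDirectory = new File('/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/processed/geos')

// Count source HTML files and generated text files; they should match one-to-one.
def htmlCount = originalDirectory.listFiles().count { it.isFile() && it.name.endsWith('.html') }
def textCount = processedDirectory.listFiles().count { it.isFile() && it.name.endsWith('.txt') }

println "HTML files: $htmlCount, text files: $textCount"
assert htmlCount == textCount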

The next script, Import.groovy (provided below), although longer than the previous one, is also relatively straightforward. The important thing to realize is that its main job is to iterate over the previously mentioned sourceXML.xml file and to create and store the countries, territories, and regions (as topics) in the topic map. First, the script imports the necessary Java libraries, including the PerfectLearn topic map engine (lines 8-17), and sets up the necessary constants for paths, database-related parameters, and other miscellaneous values (lines 21-32). Next, it instantiates the PerfectLearn topic map engine (line 38) and creates the topics required by the World Factbook topic map ontology (lines 42-56). Once these topics have been created, the sourceXML.xml file is loaded and the country/territory/region records are read into a list (lines 62-64) for subsequent iteration (line 69). On each iteration the following actions are performed:

  • The region identifier, region name, country identifier, country name, and country code are extracted for subsequent use (lines 72-77).
  • The textual content for each country, territory, or region is retrieved from the corresponding text file generated by the Extract.groovy script (lines 81-86).
  • The background information, excerpt, and timeline year are retrieved for each country, territory, or region to create the metadata needed by the timeline component (lines 90-105, and lines 256-272, 274-283, and 285-290 for the getBackgroundExcerpt, getBackground, and getTimelineYear functions, respectively).
  • The country or territory topic is created and stored (lines 109-114).
  • The country or territory text occurrence is created and stored (lines 118-126).
  • The region topic is created and stored (lines 130-137).
  • The association (that is, relationship) between a country or territory and its region is stored (line 141).
  • Coordinates are extracted from the country’s textual content and, if a second set of coordinates (for the capital city) is present, a metadatum with those coordinates is created and stored for subsequent visualization in the map component. The convertToDdCoordinates function converts the extracted coordinates from a degrees-and-minutes format to the decimal degrees format required by Google Maps (lines 145-149, and lines 241-254 for the convertToDdCoordinates function); see the worked example after this list.
  • A link (occurrence) is added for each country, territory, or region, pointing back to the appropriate page on the CIA World Factbook website (lines 153-161).
  • The flag (occurrence) is added for each country (lines 165-179, and lines 292-301 for the copyFile function).
  • The map (occurrence) is added for each country or territory (lines 185-195, and lines 292-301 for the copyFile function).
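
To make the coordinate conversion concrete, take the example pair from the comment in convertToDdCoordinates, “17 49 S, 31 02 E”: the latitude becomes 17 + 49/60 ≈ 17.82, negated because the hemisphere marker is S, and the longitude becomes 31 + 2/60 ≈ 31.03, giving approximately (-17.82, 31.03) in decimal degrees, the format Google Maps expects.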

After the loop, the associations that establish the relationships between the regions themselves, and between the regions and the “world” topic, are created and stored in the topic map (lines 201-219). Finally, the textual content for the world topic is retrieved (lines 223-226) and the accompanying occurrence is created and stored (lines 228-236).

Import.groovy

/*
Import CIA World Factbook into PerfectLearn Topic Map Engine
By Brett Alistair Kromkamp
January 15, 2015
*/

// Import necessary Java libraries including the PerfectLearn topic map engine.
import com.polishedcode.crystalmind.base.Utils
import com.polishedcode.crystalmind.base.Language
import com.polishedcode.crystalmind.map.store.TopicStore
import com.polishedcode.crystalmind.map.store.TopicStoreException
import com.polishedcode.crystalmind.map.model.*

import java.io.IOException
import java.nio.file.Files
import java.nio.file.Path
import java.nio.file.Paths

// ***** Constants *****
// Setup necessary paths, database-related parameters, and other miscellaneous values.
final def COUNTRIES_PATH = '/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/wfbExt/sourceXML.xml'
final def MAPS_PATH = '/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/graphics/maps/newmaps'
final def FLAGS_PATH = '/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/original/graphics/flags/large'
final def PROCESSED_PATH = '/home/brettk/Source/groovy/perfectlearn-miscellaneous/cia-factbook/data/processed/geos'

final def DATABASE = 'pldb_1'
final def SHARD_INFO = "localhost;3306;${DATABASE}"
final def USERNAME = '********'
final def PASSWORD = '********'
final long TOPIC_MAP_IDENTIFIER = 64L
final def COUNTRIES_TOTAL = 268
final def UNIVERSAL_SCOPE = '*'

// ***** Logic *****
println 'Starting importing process.'

// Instantiate the PerfectLearn topic store.
TopicStore topicStore = new TopicStore(USERNAME, PASSWORD)

// Bootstrap required topics.
println 'Bootstrapping...'
def bootstrapTopics = [
	new Entity(identifier: 'country', name: 'Country', instanceOf: 'topic'),
	new Entity(identifier: 'region', name: 'Region', instanceOf: 'topic'),
	new Entity(identifier: 'world', name: 'The World', instanceOf: 'topic'),
	new Entity(identifier: 'part-of', name: 'Part Of', instanceOf: 'topic')
]

bootstrapTopics.each { bootstrapTopic ->
	Topic topic = new Topic(
		bootstrapTopic.identifier,
		bootstrapTopic.instanceOf,
		bootstrapTopic.name, 
		Language.EN)
	topicStore.putTopic(SHARD_INFO, TOPIC_MAP_IDENTIFIER, topic, Language.EN)
}

/*
Iterate over country records (in sourceXML.xml) by extracting the necessary
attributes to create countries, territories, and regions.
*/
def countriesContent = new File(COUNTRIES_PATH).text
def countriesXml = new XmlSlurper().parseText(countriesContent)
def countries = countriesXml.country

assert COUNTRIES_TOTAL == countries.size()

println 'Iterating over countries...'
for (country in countries) {
	// For each country/territory/region extract the region identifier, 
	// region name, country identifier, country name, and country code. 
	def regionIdentifier = Utils.slugify(country.@Region.text())
	if (regionIdentifier) {	
		def regionName = country.@Region.text()
		def countryIdentifier = Utils.slugify(country.@name.text())
		def countryName = country.@name.text() 
		def countryCode = country.@fips.text().toLowerCase()

		// Get topic's text.
		println "Getting topic's text..."
		def topicContentPath = "$PROCESSED_PATH/${countryCode}.txt"
		def topicContentFile = new File(topicContentPath)
		def topicContent = ''
		if (topicContentFile.exists()) {
			topicContent = topicContentFile.text
		}

		// Extract the country's background excerpt.
		println "Extracting the country's background excerpt..."
		if (topicContent) {
			def excerpt = getBackgroundExcerpt(topicContent)
			def background = getBackground(topicContent)

			// Add the appropriate timeline related meta data.
			println 'Adding the timeline metadata...'
			if (excerpt && background) {
				def timelineYear = getTimelineYear(background)
				def timelineMedia = "<blockquote>${excerpt.find(~/(?s)^\S*^(.*?)[.?!]\s/).trim()}</blockquote>".toString()
				if (timelineYear && timelineMedia && excerpt) {
					topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'timeline-event-startdate', timelineYear, countryIdentifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)
					topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'timeline-media', timelineMedia, countryIdentifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)
					topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'timeline-text', excerpt, countryIdentifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)
				}
			}
		}

		// Create and store the country or territory topic.
		println 'Creating and storing the country topic...'
		Topic countryTopic = new Topic(
			countryIdentifier,
			'country',
			countryName, 
			Language.EN)
		topicStore.putTopic(SHARD_INFO, TOPIC_MAP_IDENTIFIER, countryTopic, Language.EN)

		// Create and store the topic's text occurrence.
		println "Creating and storing the topic's text..."
		Occurrence occurrence = new Occurrence(countryIdentifier)
		occurrence.with {
			instanceOf = 'text'
			scope = UNIVERSAL_SCOPE
			language = Language.EN
			resourceData = topicContent.getBytes()	
		}
		topicStore.putOccurrence(SHARD_INFO, TOPIC_MAP_IDENTIFIER, occurrence)
		topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'label', countryName, occurrence.identifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)

		// Create and store the region topic.
		println 'Creating and storing the region topic...'
		if (!topicStore.topicExists(SHARD_INFO, TOPIC_MAP_IDENTIFIER, regionIdentifier)) {
			Topic regionTopic = new Topic(
				regionIdentifier,
				'region',
				regionName, 
				Language.EN)
			topicStore.putTopic(SHARD_INFO, TOPIC_MAP_IDENTIFIER, regionTopic, Language.EN)
		}

		// Create associations between countries and regions.
		println 'Creating associations between countries and regions...'
		topicStore.createAssociation(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'country', countryIdentifier, 'region', regionIdentifier)

		// Create coordinates metadatum for each country's capital.
		println "Creating coordinates for country's capital..."
		def coordinates = topicContent.findAll(~/(?m)(^[-+]?\d{1,2}\s*\d{1,2}\s*[A-Z]),\s*([-+]?\d{1,2}\s*\d{1,3}\s*[A-Z])/)
		if (coordinates[1]) {
			def ddCoordinates = convertToDdCoordinates(coordinates[1])
			topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'map-coordinates', ddCoordinates, countryIdentifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)
		}

		// Add link occurrence to each topic pointing to the original CIA World Factbook country page. 
		println 'Adding CIA World Factbook country page link...'
		Occurrence linkOccurrence = new Occurrence(countryIdentifier)
		linkOccurrence.with {
			instanceOf = 'url'
			scope = UNIVERSAL_SCOPE
			language = Language.EN
			resourceRef = "https://www.cia.gov/library/publications/the-world-factbook/geos/${countryCode}.html"
		}
		topicStore.putOccurrence(SHARD_INFO, TOPIC_MAP_IDENTIFIER, linkOccurrence)
		topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'label', "$countryName CIA World Factbook Page", linkOccurrence.identifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)

		// Add flag (occurrence) to each topic and copy image to appropriate (web application resources) directory.
		println 'Adding flag...'
		def imageDirectoryName = "/home/brettk/www/static/$TOPIC_MAP_IDENTIFIER/images/$countryIdentifier"
		def imageDirectory = new File(imageDirectoryName)
		imageDirectory.mkdirs() // Non-destructive.

		def serverImageDirectoryName = "/static/$TOPIC_MAP_IDENTIFIER/images/$countryIdentifier"
		
		Occurrence flagOccurrence = new Occurrence(countryIdentifier)
		flagOccurrence.with {
			instanceOf = 'image'
			scope = UNIVERSAL_SCOPE
			language = Language.EN
			resourceRef = "$serverImageDirectoryName/${flagOccurrence.identifier}.gif"
		}
		topicStore.putOccurrence(SHARD_INFO, TOPIC_MAP_IDENTIFIER, flagOccurrence)
		topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'label', "$countryName (Flag)", flagOccurrence.identifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)

		copyFile("$FLAGS_PATH/${countryCode}-lgflag.gif", "$imageDirectoryName/${flagOccurrence.identifier}.gif")

		// Add map (occurrence) to each topic.
		println 'Adding map...'
		Occurrence mapOccurrence = new Occurrence(countryIdentifier)
		mapOccurrence.with {
			instanceOf = 'image'
			scope = UNIVERSAL_SCOPE
			language = Language.EN
			resourceRef = "$serverImageDirectoryName/${mapOccurrence.identifier}.gif"
		}
		topicStore.putOccurrence(SHARD_INFO, TOPIC_MAP_IDENTIFIER, mapOccurrence)
		topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'label', "$countryName (Map)", mapOccurrence.identifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)

		copyFile("$MAPS_PATH/${countryCode}-map.gif", "$imageDirectoryName/${mapOccurrence.identifier}.gif")
	}
}

// Create associations between regions.
println 'Creating associations between regions...'
def regionIdentifiers = [
	'africa',
	'central-america',
	'central-asia',
	'east-asia',
	'europe',
	'middle-east',
	'north-america',
	'oceania',
	'south-america',
	'south-asia'
]
for (outerRegionIdentifier in regionIdentifiers) {
	for (innerRegionIdentifier in regionIdentifiers.findAll { it != outerRegionIdentifier } ) {
		topicStore.createAssociation(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'region', outerRegionIdentifier, 'region', innerRegionIdentifier)
	}
	// Create associations between the world topic and the regions.
	topicStore.createAssociation(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'part-of', 'world', 'region', outerRegionIdentifier)
}

// Add the appropriate text occurrence ('xx.txt') to the 'world' topic.
println "Adding text occurrence to the 'World' topic..."
def worldTopicContentFileName = "${PROCESSED_PATH}/xx.txt"

def worldTopicContentFile = new File(worldTopicContentFileName)
def worldTopicContent = worldTopicContentFile.text

Occurrence worldOccurrence = new Occurrence('world')
worldOccurrence.with {
	instanceOf = 'text'
	scope = UNIVERSAL_SCOPE
	language = Language.EN
	resourceData = worldTopicContent.getBytes()
}
topicStore.putOccurrence(SHARD_INFO, TOPIC_MAP_IDENTIFIER, worldOccurrence)
topicStore.createMetadatum(SHARD_INFO, TOPIC_MAP_IDENTIFIER, 'label', 'world', worldOccurrence.identifier, Language.EN, '', DataType.STRING, UNIVERSAL_SCOPE)

println 'Done!'

// ***** Helper methods *****
def convertToDdCoordinates(String dmsCoordinates) { // Format: 17 49 S, 31 02 E
	// http://en.wikipedia.org/wiki/Geographic_coordinate_conversion
	def parts = dmsCoordinates.replace(',', '').split(' ')

	def ddLatitude = parts[0].toInteger() + (parts[1].toInteger() / 60) 
	if (parts[2] == 'S') {
		ddLatitude = 0 - ddLatitude
	}
	def ddLongitude = parts[3].toInteger() + (parts[4].toInteger() / 60)
	if (parts[5] == 'W') {
		ddLongitude = 0 - ddLongitude
	}
	return "($ddLatitude, $ddLongitude)"
}

def getBackgroundExcerpt(String content) {
	def result = content
		.find(~/(?s)<\/h3>.*<h2>Geography/)
		?.replaceAll(~/<\/h3>/, '')
		?.replaceAll(~/<h2>Geography/, '')
	if (result) {
		if (result.size() > 320) {
			result = result[0..320]
		}
		if (result[-1] != '.') {
			result = result << '...'
		}
	} else {
		result = ''
	}
	return result.toString()
}

def getBackground(String content) {
	def result = content
		.find(~/(?s)<\/h3>.*<h2>Geography/)
		?.replaceAll(~/<\/h3>/, '')
		?.replaceAll(~/<h2>Geography/, '')
	if (result == null) {
		result = ''
	}
	return result
}

def getTimelineYear(String content) {
	// Return the first four-digit year mentioned in the text, ignoring B.C. years.
	def bcYears = content.findAll(~/\d{4}\sB\.C\./).collect { it.replace(' B.C.', '') }
	def adYears = content.findAll(~/\d{4}/)
	def years = adYears - bcYears
	return years[0]
}

def copyFile(String sourcePath, String targetPath) {
	Path source = Paths.get(sourcePath)
	Path destination = Paths.get(targetPath)

	try {
		Files.copy(source, destination);
	} catch (IOException e) {
		e.printStackTrace();
	}
}

// ***** Models *****

class Entity {
	String identifier
	String name
	String instanceOf
}

And that’s it, folks! In a follow-up article I will document how to improve the import process outlined in this article to make much better use of the resources provided by the World Factbook. In this first iteration, however, the current import process provides me with sufficient data to thoroughly test PerfectLearn with a non-trivial topic map.

Stay tuned for updates. Subscribe to the PerfectLearn newsletter.
