The technology behind DirectScot

DirectScot is an ambitious project because it attempts to aggregate content from a wide variety of sources and present it in a coherent and usable form. The data sources used by DirectScot implement a range of different standards for the interchange of information, so a lot of work is needed to merge their content into a single content store.

The technical components in DirectScot have been designed to address this problem and to allow the platform to be extended to other data sources in the future. They also reflect DirectScot’s ambition to implement best practice in terms of usability and discoverability of content. Consideration is given to clustering search results, providing guided navigation, offering dynamic previews, using the user’s context to filter results and providing onward journeys to the user.

Hosting architecture

DirectScot is hosted on the Amazon EC2 Cloud platform. The hosting architecture is still being assessed but currently consists of the following:

Technical architecture

The schematic below illustrates the flow of information between the Data Sources (Local Authority sites, Directgov, etc) and Data Consumers (users of DirectScot, 3rd parties using the API).

Conceptually DirectScot performs the following tasks:

  • Crawls / extracts raw data
  • Stores and modifies the data
  • Classifies and indexes the data
  • Provides search and discovery over that index

These core processes are represented in the following schematic and described in more detail below.

Data sources and collating data

DirectScot is designed to aggregate data from a variety of sources. These currently include:

  • Scottish Local Authorities
  • Directgov
  • Scottish Business Gateway

Local authority data

Local Authority information is extracted by scraping the local authority website. We hope to improve on this by working with our Local Authority partners to agree a standard API that will considerably simplify this process and make it more robust.

Apache Nutch is used to crawl Local Authority websites. We are still experimenting with crawl rates and frequencies but need to balance the desire to have up-to-date data with the load on the Local Authority sites.
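The trade-off between freshness and load can be pictured as a minimum delay between requests to the same host. The Python sketch below is an illustration only and is not Nutch configuration; the class name, method names and delay value are assumptions made for the example.

```python
from urllib.parse import urlparse


class PoliteScheduler:
    """Tracks the earliest time each host may be fetched again (hypothetical helper)."""

    def __init__(self, min_delay_seconds: float) -> None:
        self.min_delay_seconds = min_delay_seconds
        self._next_allowed = {}  # host -> earliest permitted fetch time

    def wait_time(self, url: str, now: float) -> float:
        """Return how long to wait before fetching url (0.0 if it may be fetched now)."""
        host = urlparse(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: float) -> None:
        """Note that url was fetched at time now, pushing back the next fetch for its host."""
        host = urlparse(url).netloc
        self._next_allowed[host] = now + self.min_delay_seconds
```

A longer delay reduces load on a Local Authority site but means a full re-crawl takes longer, which is the balance described above.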

For the prototype, we decided to extract all content and then filter and classify it into an internal store. This decision was driven by the long development iterations that would have resulted from having to re-crawl content when an extraction rule changed.
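To show why keeping the raw content helps, here is a minimal Python sketch of re-running classification rules over an already-crawled store. The function name, the rule signature and the dictionary-based store are assumptions for illustration, not the prototype's actual code.

```python
from typing import Callable, Dict, List, Optional

# A rule takes raw HTML and returns a category, or None if it does not match.
Rule = Callable[[str], Optional[str]]


def classify_store(raw_pages: Dict[str, str], rules: List[Rule]) -> Dict[str, Optional[str]]:
    """Apply the rules in order to every stored page; the first matching rule wins."""
    results: Dict[str, Optional[str]] = {}
    for url, html in raw_pages.items():
        category = None
        for rule in rules:
            category = rule(html)
            if category is not None:
                break
        results[url] = category
    return results
```

When a rule changes, only `classify_store` needs to be run again over the stored pages; nothing has to be fetched from the source sites.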

Directgov data

Directgov offer a nice API over their data which considerably simplifies the process of accessing and parsing their content.

http://innovate.direct.gov.uk/syndication/

The content provided by the API carries metadata that can be used directly within the classification system.

Scottish Business Gateway

We extract content from the Scottish Business Gateway by directly requesting a collection of hand chosen directories from the site at: http://www.business.scotland.gov.uk

Future data sources

Expanding the number of data sources in DirectScot to include other providers, such as the Police and Fire Services, will be relatively straightforward.

Where content is easy to extract and categorise it will be imported directly into SharePoint. Where it is difficult to extract or needs extra classification it will be crawled and extracted using custom rules.

Data storage and management

Our primary Data Store and Content Management tool is SharePoint 2010. This was chosen for the flexibility in designing content types, the extensive management tools, the comprehensive versioning and auditing components and support for APIs. It is also a robust and scalable system which we felt comfortable using at the heart of the DirectScot platform.

Data store

We use SharePoint to hold a copy of all data extracted from Directgov and the Scottish Business Gateway.

We also use the SharePoint content store to hold the metadata classifications used to categorise content. The classification system used in DirectScot is adapted from the Directgov categorisation schema merged with the Scottish Navigation List (SNL). The resultant list of DirectScot Topic Groups and Topics is a curated restricted vocabulary which retains links to the classification systems from which it is derived.

Content management 

As well as storing data we use SharePoint to manage content. SharePoint CMS tools allow us to remove content which has been aggregated but is not relevant on DirectScot. We also revise content imported from Directgov to ensure it’s relevant to a Scottish audience.

On fully devolved matters we add new content which is relevant only to Scotland. Authors write this directly in the SharePoint CMS.

Finally, we author and curate lists of services and tasks to facilitate the user experience.

APIs

We use SharePoint APIs (supported out of the box) to access data for indexing and also to serve content used on leaf nodes when accessing the data via the site or APIs.

Metadata

The content types and metadata used to describe content are included here for reference:

Content Type         | Classification                          | Notes
National Information | DG classification schema for articles   | Directgov API
Local Service        | DS classification system mapped to SNL  | Curated content
Local Application    | DS classification                       | Curated content
Local Information    | DS classification                       | Indexed web pages
National Application | DS classification                       | Curated content
National Campaign    | DS classification                       | Curated content
Business Gateway     | Business Gateway classification         | Indexed web pages

Extraction / classification

Content held in SharePoint is classified when it is imported. For the Directgov content we import, article metadata is included in the API response. For Business Gateway we extract the content and categorise it based on its location in the Business Gateway site.
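Categorising by location can be as simple as mapping the first directory in a URL path to a category. The Python sketch below shows the idea; the directory names and category labels in `DIRECTORY_CATEGORIES` are made up for the example and are not taken from the Business Gateway site.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from a top-level directory to a category label.
DIRECTORY_CATEGORIES = {
    "starting-up": "Starting a business",
    "finance": "Business finance",
}


def categorise_by_location(url: str) -> Optional[str]:
    """Return the category for the first path segment, or None if it is not mapped."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None
    return DIRECTORY_CATEGORIES.get(segments[0])
```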

Extraction and classification of pages crawled on Local Authority sites is a more complicated process. We use an extraction pipeline which is tailored for each Local Authority but which relies on a few core systems shared by all pipelines. These look for the SNL category in the following parts of a page:

URL
e.g. http://www.edinburgh.gov.uk/info/1054/rubbish_and_recycling
Contains the SNL ID and the SNL title

Breadcrumb navigation
e.g. Home > Rubbish and recycling > Rubbish – commercial waste
Contains the SNL titles in the breadcrumb

Page metadata
e.g. <meta name="DC.subject" lang="en" scheme="eGMS.SNL" content="Rubbish – commercial waste" />
Explicitly states the SNL category in metadata

Page Title
e.g. <title>Rubbish – commercial waste – City of Edinburgh Council</title>
Refers to the SNL category in the page title

Page Content
e.g. <h1>Rubbish – commercial waste</h1>
Refers to the SNL category in the content of the page

We expect to add further approaches as necessary to classify new crawled data sources.
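As a rough illustration of how the signals listed above could be read from a page, here is a Python sketch using only the standard library. It is not the DirectScot pipeline: the class and function names are hypothetical, and the URL pattern is an assumption based solely on the Edinburgh example, treating the numeric segment as the SNL ID.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional


class SignalParser(HTMLParser):
    """Collects the eGMS.SNL metadata value, the <title> text and the first <h1> text."""

    def __init__(self) -> None:
        super().__init__()
        self.snl_meta: Optional[str] = None
        self.title = ""
        self.h1 = ""
        self._current: Optional[str] = None  # tag whose text is currently being read
        self._h1_done = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and attr.get("scheme") == "eGMS.SNL":
            self.snl_meta = attr.get("content")
        elif tag == "title" or (tag == "h1" and not self._h1_done):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            if tag == "h1":
                self._h1_done = True
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
        elif self._current == "h1":
            self.h1 += data


def snl_from_url(url: str) -> Optional[Dict[str, str]]:
    """Assumed pattern /info/<id>/<slug>, as in the Edinburgh example."""
    match = re.search(r"/info/(\d+)/([^/?#]+)", url)
    if match is None:
        return None
    return {"id": match.group(1), "slug": match.group(2)}


def breadcrumb_titles(breadcrumb: str) -> List[str]:
    """Split a 'Home > A > B' breadcrumb into its individual titles."""
    return [part.strip() for part in breadcrumb.split(">") if part.strip()]
```

A pipeline tailored to one Local Authority might trust the explicit metadata first and fall back to the title or heading; that ordering is a design choice the sketch leaves open.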

Search engine

We are using Lucene as a Search Engine. Lucene is a well-supported open source system which is very developer friendly and highly suited to a prototype of this type. It is also a good candidate for a production system should it satisfy load and scalability testing.

We store all extracted and classified content in a single Lucene Search Catalogue, in a schema optimised for the searches we anticipate users will perform. Searches are served via Solr, an enterprise search server built on Lucene.
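For readers unfamiliar with Solr, a search is an HTTP request to its select handler. The Python sketch below builds such a request URL using standard Solr parameters (`q`, `rows`, `wt`, `facet`, `facet.field`). The host, the core name and the field name in the usage note are assumptions, since the DirectScot schema is not described here.

```python
from typing import Sequence
from urllib.parse import urlencode


def build_select_url(base_url: str, core: str, query: str,
                     facet_fields: Sequence[str] = (), rows: int = 10) -> str:
    """Build a Solr select URL that asks for JSON and, optionally, facet counts."""
    params = [("q", query), ("rows", str(rows)), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url}/{core}/select?{urlencode(params)}"
```

For example, `build_select_url("http://localhost:8983/solr", "directscot", "commercial waste", ["topic"])` would request the top ten matches along with counts per value of a hypothetical `topic` field.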

Search services

Search is used in various ways throughout DirectScot and not just to provide a set of ranked search results. We have built a set of search services that allow us to provide journeys through content beyond paging through results.

For instance, search lets us find similar pieces of content via clustering, which we use to provide ‘related content’ on leaf nodes. It lets us suggest alternative searches that steer users towards areas with richer results. It also lets us offer guided navigation, which helps users understand the content domain and explore it efficiently.

We use Solr to provide these search services. Solr is a well-supported open source system which complements Lucene very well.
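Guided navigation is commonly built on Solr's facet counts. In Solr's default JSON output, each facet field is returned as a flat list that alternates values and counts. The Python sketch below turns that list into pairs; the `topic` field name and the sample values in the usage note are assumptions for illustration.

```python
import json
from typing import List, Tuple


def facet_counts(response_text: str, field: str) -> List[Tuple[str, int]]:
    """Return (value, count) pairs for one facet field of a Solr JSON response."""
    data = json.loads(response_text)
    flat = data.get("facet_counts", {}).get("facet_fields", {}).get(field, [])
    # Solr's default layout is [value1, count1, value2, count2, ...].
    return list(zip(flat[0::2], flat[1::2]))
```

Given a response whose `topic` facet is `["Rubbish and recycling", 12, "Council tax", 7]`, the function returns the two pairs in that order, ready to render as navigation links with result counts.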

API

We are keen to make the content we have aggregated available to third parties so that they can provide value-added services to Scottish citizens. To that end we have created a RESTful service that lets interested parties review the content we have already aggregated, assess its usefulness and give us feedback to improve the service over time.

The API returns JSON data. We believe that this is the most convenient format for 3rd party developers to consume.

DirectScot website

The DirectScot site is a bespoke web application which uses the DirectScot Search Services and the SharePoint 2010 and SQL Server 2008 Data Stores. The site is built in ASP.NET MVC3, running on IIS7 on a Windows Server 2008 R2 machine.

Tell us what you think. You can comment on this blog by leaving feedback here. Please also contribute your views as part of the consultation on DirectScot.

The DirectScot Team
