DirectScot is an ambitious project because it attempts to aggregate content from a wide variety of sources and present it in a coherent and usable form. The data sources used by DirectScot implement a range of different standards for the interchange of information, so a lot of work is needed to merge their content into a single content store.
The technical components in DirectScot have been designed to address this problem and to allow the platform to be extended to other data sources in the future. They also reflect DirectScot’s ambition to implement best practice in terms of usability and discoverability of content. Consideration is given to clustering search results, providing guided navigation, offering dynamic previews, using the user’s context to filter results and providing onward journeys to the user.
DirectScot is hosted on the Amazon EC2 Cloud platform. The hosting architecture is still being assessed but currently consists of the following:
The schematic below illustrates the flow of information between the Data Sources (Local Authority sites, Directgov, etc) and Data Consumers (users of DirectScot, 3rd parties using the API).
Conceptually DirectScot performs the following tasks:
- Crawls / extracts raw data
- Stores and modifies the data
- Classifies and indexes the data
- Provides search and discovery over that index
These core processes are represented in the following schematic and described in more detail below.
Data sources and collating data
DirectScot is designed to aggregate data from a variety of sources. These currently include:
- Scottish Local Authorities
- Directgov
- Scottish Business Gateway
Local authority data
Local Authority information is extracted by scraping the local authority website. We hope to improve on this by working with our Local Authority partners to agree a standard API that will considerably simplify this process and make it more robust.
Apache Nutch is used to crawl Local Authority websites. We are still experimenting with crawl rates and frequencies but need to balance the desire to have up-to-date data with the load on the Local Authority sites.
For the prototype, we decided to extract all content and then filter and classify it into an internal store. This decision was driven by the long development iterations that would have resulted from having to re-crawl content when an extraction rule changed.
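The benefit of keeping the raw crawl can be illustrated with a minimal sketch. It assumes the crawled pages are available as a simple URL-to-HTML mapping; the `RawStore` type, the `apply_rule` helper and the example URL are hypothetical and are not part of the DirectScot code base.

```python
import re
from typing import Callable, Dict

# Hypothetical stand-in for the store of raw crawled pages: URL -> raw HTML.
RawStore = Dict[str, str]


def extract_title(html: str) -> str:
    """Example extraction rule: return the text of the <title> element."""
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def apply_rule(store: RawStore, rule: Callable[[str], str]) -> Dict[str, str]:
    """Run an extraction rule over every stored page, with no re-crawl."""
    return {url: rule(html) for url, html in store.items()}


# Illustrative data only; the URL is made up.
store = {
    "http://example.org/bins": (
        "<html><title>Rubbish – commercial waste – Example Council</title></html>"
    )
}
titles = apply_rule(store, extract_title)
```

When a rule changes, only `apply_rule` needs to be run again over the stored pages, which keeps each development iteration short and puts no extra load on the source sites.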
Directgov
Directgov offer a nice API over their data, which considerably simplifies the process of accessing and parsing their content.
The content provided by the API carries metadata that can be used directly within the classification system.
Scottish Business Gateway
We extract content from the Scottish Business Gateway by directly requesting a collection of hand-chosen directories from the site at http://www.business.scotland.gov.uk
Future data sources
Expanding the number of data sources in DirectScot to include other providers, such as Police and Fire Services, will be relatively straightforward.
Where content is easy to extract and categorise, it will be imported directly into SharePoint. Where it is difficult to extract or needs extra classification, it will be crawled and extracted using custom rules.
Data storage and management
Our primary Data Store and Content Management tool is SharePoint 2010. This was chosen for the flexibility in designing content types, the extensive management tools, the comprehensive versioning and auditing components and support for APIs. It is also a robust and scalable system which we felt comfortable using at the heart of the DirectScot platform.
We use SharePoint to hold a copy of all data extracted from Directgov and the Scottish Business Gateway.
We also use the SharePoint content store to hold the metadata classifications used to categorise content. The classification system used in DirectScot is adapted from the Directgov categorisation schema merged with the Scottish Navigation List (SNL). The resultant list of DirectScot Topic Groups and Topics is a curated, restricted vocabulary which retains links to the classification systems from which it is derived.
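One way to picture a restricted vocabulary that keeps its links to the source schemes is sketched below. This is an illustration under stated assumptions, not the SharePoint content type itself: the `Topic` class, the `find_topics` helper, and the topic and topic group names are all hypothetical. Only the SNL term comes from the examples in this post.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Topic:
    name: str          # DirectScot Topic (hypothetical value below)
    topic_group: str   # DirectScot Topic Group (hypothetical value below)
    # (scheme, term) pairs linking back to the source classification systems
    derived_from: List[Tuple[str, str]] = field(default_factory=list)


def find_topics(vocabulary: List[Topic], scheme: str, term: str) -> List[Topic]:
    """Return the topics that were derived from a given source term."""
    return [t for t in vocabulary if (scheme, term) in t.derived_from]


VOCABULARY = [
    Topic("Commercial waste", "Environment",
          [("SNL", "Rubbish – commercial waste")]),
]

matches = find_topics(VOCABULARY, "SNL", "Rubbish – commercial waste")
```

Keeping the source terms on each topic means content that arrives already tagged with an SNL or Directgov term can be mapped to a DirectScot topic by lookup.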
As well as storing data, we use SharePoint to manage content. The SharePoint CMS tools allow us to remove content which has been aggregated but is not relevant to DirectScot. We also revise content imported from Directgov to ensure it’s relevant to a Scottish audience.
On fully devolved matters we add new content which is exclusively relevant to Scotland and this is done directly by authors into the SharePoint CMS.
Finally, we author and curate lists of services and tasks to facilitate the user experience.
We use SharePoint APIs (supported out of the box) to access data for indexing and also to serve content used on leaf nodes when accessing the data via the site or APIs.
The content types and metadata used to describe content are included here for reference:
|Content type|Classification|Source|
|---|---|---|
|National Information|DG classification schema for articles|Directgov API|
|Local Service|DS classification system mapped to SNL|Curated content|
|Local Application|DS classification|Curated content|
|Local Information|DS classification|Indexed web pages|
|National Application|DS classification|Curated content|
|National Campaign|DS classification|Curated content|
|Business Gateway|Business Gateway classification|Indexed web pages|
Extraction / classification
Content held in SharePoint is classified when it is imported. For Directgov content, the article metadata is supplied by the API. For Business Gateway content, we categorise each page based on its location in the Business Gateway site.
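Categorising by location can be sketched as a lookup on the first directory in the URL path. The directory names and category labels below are hypothetical placeholders, since the hand-chosen directories are not listed in this post, and `categorise_by_location` is an illustrative helper rather than the real importer.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from a top-level directory to a category label.
DIRECTORY_CATEGORIES = {
    "starting-up": "Starting a business",
    "finance": "Business finance",
}


def categorise_by_location(url: str) -> Optional[str]:
    """Return the category for a page based on its first path segment."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if not segments:
        return None  # the site root carries no category
    return DIRECTORY_CATEGORIES.get(segments[0])


# The path in this URL is made up for illustration.
category = categorise_by_location(
    "http://www.business.scotland.gov.uk/starting-up/write-a-plan"
)
```

Pages outside the mapped directories return `None`, so they can be skipped or sent for manual review.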
Extraction and Classification of pages crawled on Local Authority sites is a more complicated process. We use an extraction pipeline which is tailored for each Local Authority but which relies on a few core systems shared by all pipelines.
The approaches currently used look for a page element that:
- Contains the SNL ID and the SNL title
- Contains the SNL titles in the breadcrumb, e.g. Home > Rubbish and recycling > Rubbish – commercial waste
- Explicitly states the SNL category in metadata, e.g. <meta name="DC.subject" lang="en" scheme="eGMS.SNL" content="Rubbish – commercial waste" />
- Refers to the SNL category in the page title, e.g. <title>Rubbish – commercial waste – City of Edinburgh Council</title>
- Refers to the SNL category in the content of the page, e.g. <h1>Rubbish – commercial waste</h1>
We expect to add further approaches as necessary to classify new crawled data sources.
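A minimal sketch of how such signals might be combined is shown below. It covers only the metadata, page title and heading examples given in this post. The priority order, the regular expressions, the `classify` function and the second SNL title are assumptions made for illustration, and a real pipeline would more likely use a proper HTML parser.

```python
import re
from typing import Optional, Tuple

# Illustrative subset of SNL titles; the second entry is a made-up placeholder.
SNL_TITLES = ["Rubbish – commercial waste", "Council tax"]

# Assumes the scheme attribute appears before content, as in the example tag.
META_RE = re.compile(
    r'<meta[^>]*scheme="eGMS\.SNL"[^>]*content="([^"]*)"', re.IGNORECASE
)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
H1_RE = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def classify(html: str) -> Optional[Tuple[str, str]]:
    """Return (snl_title, signal) for the first signal that matches, or None."""
    # Explicit metadata is treated as the strongest signal (an assumption).
    meta = META_RE.search(html)
    if meta and meta.group(1) in SNL_TITLES:
        return meta.group(1), "metadata"
    # Fall back to looser textual matches in the page title, then the heading.
    for signal, pattern in (("title", TITLE_RE), ("heading", H1_RE)):
        found = pattern.search(html)
        if found:
            for snl_title in SNL_TITLES:
                if snl_title in found.group(1):
                    return snl_title, signal
    return None


page = "<title>Rubbish – commercial waste – City of Edinburgh Council</title>"
result = classify(page)
```

Returning the name of the signal alongside the category makes it possible to review weaker matches, such as headings, separately from explicit metadata.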
We use Lucene as our search engine. Lucene is a well-supported open source system which is very developer friendly and highly suited to a prototype of this type. It is also a good candidate for a production system should it satisfy load and scalability testing.
We store all extracted and classified content in a single Lucene search catalogue, using a schema optimised for the searches we anticipate users will perform. Searches are served via Solr, an enterprise search server backed by Lucene.
Search is used in various ways throughout DirectScot and not just to provide a set of ranked search results. We have built a set of search services that allow us to provide journeys through content beyond paging through results.
For instance, search allows us to find similar pieces of content via clustering, which is used to provide ‘related content’ on leaf nodes. It also allows us to suggest alternative searches to users, steering them towards areas with rich results. Finally, it allows us to offer guided navigation, which helps users understand the content domain and explore it efficiently.
We use Solr to provide these search services. Solr is a well-supported open source system which complements Lucene very well.
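As an illustration of guided navigation, the sketch below builds a Solr select request that asks for facet counts alongside the ranked results. The parameters (`q`, `fq`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr query parameters, but the Solr address and the `topic_group` and `local_authority` field names are assumptions, not the actual DirectScot schema.

```python
from typing import Optional
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the real host and path are not given in this post.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_search_url(query: str,
                     local_authority: Optional[str] = None,
                     rows: int = 10) -> str:
    """Build a select URL that returns results plus topic group facet counts."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", "topic_group"),  # hypothetical field name
    ]
    if local_authority:
        # A filter query narrows results to the user's context.
        # Note: quotes inside the value are not escaped in this sketch.
        params.append(("fq", 'local_authority:"%s"' % local_authority))
    return SOLR_SELECT + "?" + urlencode(params)


url = build_search_url("commercial waste", local_authority="Edinburgh")
```

The facet counts in the response can be rendered as navigation links, and each link can then add a further `fq` filter to narrow the result set.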
We are keen to make the content we have aggregated available to third parties so that they can provide value-added services to Scottish citizens. To that end, we have created a RESTful service that lets interested parties review the content we have already aggregated, assess its usefulness and give us feedback to improve the service over time.
The API is RESTful and emits JSON data. We believe that this is the most convenient format for 3rd party developers to consume.
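Consuming JSON from a service like this might look like the sketch below. The response shape is entirely hypothetical: the `contentType` and `title` field names and the sample items are invented for illustration, because the real payload is not described in this post.

```python
import json
from collections import defaultdict
from typing import Dict, List


def group_by_content_type(payload: str) -> Dict[str, List[str]]:
    """Group item titles by content type from a JSON array of items."""
    grouped = defaultdict(list)
    for item in json.loads(payload):
        grouped[item["contentType"]].append(item["title"])
    return dict(grouped)


# Made-up sample payload standing in for an API response body.
sample = json.dumps([
    {"contentType": "Local Service", "title": "Rubbish – commercial waste"},
    {"contentType": "National Information", "title": "Example article"},
])
grouped = group_by_content_type(sample)
```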
The DirectScot site is a bespoke web application which uses the DirectScot Search Services and the SharePoint 2010 and SQL Server 2008 data stores. The site is built in ASP.NET MVC3, running on IIS7 on a Windows Server 2008 R2 machine.
Tell us what you think. You can comment on this blog by leaving feedback here. Please also contribute your views as part of the consultation on DirectScot.
The DirectScot Team