wiki:doc/importExport
Last modified on 02/24/10 17:18:22

Importing and exporting data, and integrating Eureka into your own site

Eureka supports multiple exchange file formats. Since the whole system is based on the IEEE LOM standard, that format is by far the most complete and flexible way of exchanging data. The system allows both automatic (harvesting) and manual import. Harvesting is discussed in the spiders section; manual import features are available through the «Import learning objects» option in the left menu. Here is a list of the features:

Supported metadata Import/Export formats

IEEE LOM

Learning objects can be both imported and exported as IEEE LOM XML files that must validate against the following schema: http://eureka.ntic.org/xsd/lom/

This XSD is a copy of the one published with the Extensible Markup Language (XML) Schema Definition Language Binding for Learning Object Metadata.

You may also want to read the IMS Meta-data Best Practice Guide for IEEE 1484.12.1, which is mostly applicable and may be useful if you are new to the standards referred to above.

The import process for an IEEE LOM learning object

The IEEE LOM import process can be triggered automatically or manually. In both cases, the same logic is applied. The following is a detailed explanation of how Eureka processes the XML input file.

  1. The XML file is uploaded manually or automatically extracted by a system harvester.
  2. The XML file is opened and checks are performed to make sure it is a well-formed document.
  3. The loaded document is then validated against the IEEE LOM 1484.12.1 XML Schema Definition (XSD) (a copy can be found here: http://eureka.ntic.org/xsd/lom/)
  4. When found to be valid, XPath queries are performed within the IEEE LOM XML namespace.
  5. The OAI ID (mandatory in the Eureka repository) is extracted from the document (LOM 3.1). If the unique OAI ID already exists in the database, the import process will stop. If it is not present, a unique identifier will be automatically generated.
  6. The process goes on for each and every LOM element.
  7. Langstring and Locale entries are automatically created, while making sure the country-specific locales contain a valid country entry.
  8. When the system reaches a vocabulary-based element, it extracts the Source element. If Source equals "LOMv1.0", the standard LOM-defined vocabulary is used. Otherwise, the system tries to match the Source string with a local vocabulary. If one exists, it will assume the Value element belongs to this vocabulary. If the vocabulary cannot be found locally, or if the term cannot be found in the vocabulary, Eureka will parse the Source string and check whether it is a valid Internet URL. It will then attempt to download the remote XML file. The standard VDEX import process is then applied, and the import process resumes.
  9. When the system reaches a vCard element, it searches its database for a vCard that exactly matches the name, email and organization. If it finds one, it uses the version already in its database. If it doesn't find one, it imports the vCard into its database.
  10. A reference to the primary Eureka theme in which the learning object will be placed is automatically added as a LOM 9.2 element. Duplicate LOM 9.2 entries are cleaned up at the end of the process.
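The vocabulary handling in step 8 can be sketched as follows. This is a minimal illustration of the decision logic only, assuming a simple in-memory vocabulary store; the function and parameter names are illustrative, not Eureka's actual API.

```python
# Sketch of step 8: resolving a <source>/<value> pair from a LOM vocabulary
# element. local_vocabularies maps a source string to a set of known term
# values; download_vdex stands in for the standard VDEX import process.
from urllib.parse import urlparse

LOM_SOURCE = "LOMv1.0"

def resolve_vocabulary(source, value, local_vocabularies, download_vdex):
    if source == LOM_SOURCE:
        return ("lom-standard", value)       # standard LOM-defined vocabulary
    terms = local_vocabularies.get(source)
    if terms is not None and value in terms:
        return ("local", value)              # matched a local vocabulary
    # Not found locally: check whether the source string is a valid URL
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        fetched = download_vdex(source)      # download and parse the VDEX file
        local_vocabularies[source] = fetched
        if value in fetched:
            return ("downloaded", value)
    return ("unresolved", value)
```

An unresolved value would leave the metadata element without a usable vocabulary binding, which is why publishing your vocabularies as resolvable VDEX URLs matters (see the IMS VDEX section below).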

Dublin Core

Exporting

Learning objects can be exported in an XML representation of Dublin Core. This Dublin Core metadata is accessible:

  • Through OAI-PMH
  • As part of all RSS feeds
  • As an XML file from the LO's page (oai_dc XML file)

Importing

  • From RSS feeds
  • Manually (oai_dc XML file)
  • From an OAI-PMH harvester

Standard to reference VDEX vocabulary entries in Dublin Core metadata

Dublin Core (even qualified) does not have a mechanism to unambiguously reference a VDEX vocabulary term, so the Eureka project had to come up with one.

Instead of extending Dublin Core Qualified, it was thought best to create a new non-standard Uniform Resource Name (URN) format for expressing VDEX references. This is simpler to specify, applicable beyond the Dublin Core XML context, and even usable in non-XML representations.

The URN looks like this:

urn:vdex:rfc_1738_encoded_vocabulary_id:rfc_1738_encoded_term_id

Note that once decoded, vocabulary_id SHOULD be a URL resolvable to a valid IMS VDEX file.

Example: Term "urn:uuid:26828976645ef3945603759.34805177" of VDEX vocabulary http://eureka.ntic.org/vdex/meq_disciplines_collegiales.xml would be represented as

<dc:subject>
urn:vdex:http%3A%2F%2Feureka.ntic.org%2Fvdex%2Fmeq_disciplines_collegiales.xml:urn%3Auuid%3A26828976645ef3945603759.34805177
</dc:subject>
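The reference above can be produced with standard percent-encoding. The following Python sketch assumes only what the URN format section describes; quote() with safe="" also encodes ':' and '/':

```python
# Build a urn:vdex reference from a vocabulary URL and a term id, using
# percent-encoding for both components as described above.
from urllib.parse import quote

def make_vdex_urn(vocabulary_url, term_id):
    return "urn:vdex:%s:%s" % (quote(vocabulary_url, safe=""), quote(term_id, safe=""))

urn = make_vdex_urn(
    "http://eureka.ntic.org/vdex/meq_disciplines_collegiales.xml",
    "urn:uuid:26828976645ef3945603759.34805177",
)
# urn now matches the dc:subject content shown above
```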

Parsing example in PHP

//Content of a dc:subject node
$valueFromDcSubject = trim($node_result->nodeValue);
$vdexHeader = 'urn:vdex:';
$vdexHeaderPos = stripos($valueFromDcSubject, $vdexHeader);
if ($vdexHeaderPos === 0) { //We have a vdex header and it's at the start
	$vdexEncodedIds = str_ireplace($vdexHeader, '', $valueFromDcSubject);
	//echo "We have a VDEX id: $vdexEncodedIds<br/>";
	$vdexIdArray = explode(':', $vdexEncodedIds);
	foreach ($vdexIdArray as &$id) {
		$id = rawurldecode($id);
	}
	unset($id); //Break the reference left by the foreach
	//$vdexIdArray[0] is the VDEX URL
	//$vdexIdArray[1] is the term id
	if ($vdexIdArray[0] == 'http://eureka.ntic.org/vdex/meq_disciplines_collegiales.xml') {
		//Do something
	}
}

CSV (Comma-Separated Values)

Very basic learning objects can be imported from a CSV file.

TODO: Provide sample file

RSS 2.0

Learning objects can be exported in RSS 2.0 feeds. Those feeds can be obtained by:

  • Exporting a theme hierarchy as RSS
  • Requesting search results in RSS format. See "Interfacing the Search Engine using HTTP" below.

Open Document (ISO/IEC 26300:2006)

A theme hierarchy and its learning objects can be exported as an Open Document 1.0 (ISO/IEC 26300:2006) text file. It can then be styled and printed.

IMS VDEX

All the vocabularies and theme hierarchies used in Eureka can be imported and exported as VDEX (Vocabulary Definition Exchange) XML files that must validate against the schema at http://eureka.ntic.org/xsd/vdex/. Eureka will automatically try to import VDEX vocabularies, notably:

  • When importing LOM metadata vocabulary entries: Eureka will try to resolve the URL in the source element and parse it as a VDEX file.
  • When downloading Dublin Core metadata, when dc:source contains data of the form urn:vdex:

Also, Eureka will try re-downloading the VDEX file if it tries to import a term claiming to be from a vocabulary already in its database.

While making the vocabularies you use for classification (LOM element 9) available as VDEX isn't technically mandatory, if you do, Eureka will download them and make extensive use of them to improve the user experience.

VCard 3.0 (RFC2425 and RFC2426)

Eureka has extensive support (including photo and logo binary data) for importing and exporting contributor contact information as the industry-standard vCard 3.0; see RFC2425 and RFC2426 for the specifications.

Note that unlike other systems, Eureka makes every effort to keep a single vCard for every individual, even if multiple metadata sources with different information are imported. This allows editing the person's vCard to add details or images, or to update information. As such, it is essential that the name, email and organization are always spelled the same way.

Advice for using Eureka to help develop export functionality in your own application

Eureka can import anything it exports. As such, you can create useful example metadata (vCards, vocabularies, learning objects) in Eureka and export them to see what it should look like. Be careful however: an example will not usually cover every case. Common errors when relying only on examples instead of reading the specifications include:

  • VCard encoding
  • OAI-ID generation
  • Date representation in LOM metadata
  • System update contribution when using the Eureka spider

Please note the following constraints in Eurêka:

  • For all vocabulary values except in element 9 (Classification), either:
    • The value is taken from the LOM specification, in which case the <source> element must be the string "LOMv1.0".
    • The value is NOT taken from the LOM specification, in which case the <source> element must be either:
      • A URI that resolves as a URL to a valid VDEX file containing the vocabulary the value is taken from.
      • A URI that identifies a vocabulary already on the Eureka server that you want to import your resource into.
  • Eurêka will re-use vCards already in its database if the name, email and organization exactly match those in the vCard contained in a learning object being imported (see the section on vCards for the reasons). As such, it will automatically import vCards only once, and if you need to update your vCard, you must do so on the Eureka server.
  • For Eurêka to determine that a learning object being imported should update an existing one, the learning object MUST already have a valid OAI-ID.
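To illustrate the <source> constraint above, here are the two allowed shapes of a vocabulary element. The element names are taken from LOM; the vocabulary URL and term id in the second fragment are hypothetical, and namespace declarations are omitted for brevity:

```xml
<!-- Value taken from the LOM specification -->
<interactivityType>
  <source>LOMv1.0</source>
  <value>active</value>
</interactivityType>

<!-- Value taken from a custom vocabulary published as a VDEX file
     (URL and term id are placeholders) -->
<learningResourceType>
  <source>http://example.org/vdex/my_vocabulary.xml</source>
  <value>my_term_id</value>
</learningResourceType>
```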

Testing your implementation

If you must use someone else's Eureka server to test, make sure you do not pollute their database with bogus data.

Eureka has a test mode, accessed by selecting "Import learning objects"/"Import a LOM learning object" and checking the "DO NOT import, only validate" check box. Unfortunately, doing this isn't entirely neutral to the database: in order to run a full test, Eureka goes through the entire LOM import process described at the beginning of this document. As such, vocabularies and vCards are imported and not automatically deleted afterwards. The recommended testing sequence is therefore:

  1. Begin by using your favourite XML tool to validate your XML export against the schema http://standards.ieee.org/reading/ieee/downloads/LOM/lomv1.0/xsd/lomLoose.xsd. Note that while Eureka can and will validate your data against the proper schema, its error output cannot be as helpful as a local XML validator's, so we strongly suggest you use one.
  2. If you maintain custom vocabularies, test VDEX import of each of your vocabularies, deleting between tests.
  3. Test vCard import with a few sample vCards, deleting between tests.
  4. Test a full LOM import using the "DO NOT import, only validate" check box. Make sure that the VDEX vocabularies are properly linked in the resulting file.

A few additional notes for testing:

  • Make sure you test your export with a learning object that has data in ALL the fields your application supports. This will make sure ALL elements are syntactically correct.
  • Also test exporting a completely empty learning object, to make sure the empty elements are handled properly.

Supported Harvesting methods

OAI-PMH v2.0 Client (SpiderOaiPmh) and Server

Eureka fully complies with the Open Archives Initiative Protocol for Metadata Harvesting Version 2.0 (OAI-PMH 2.0). We provide a full-featured client and server for harvesting and exposing metadata. Client configuration is accessible using «Spiders management» under the «Import Learning Object» menu item. As for the server, an OAI-PMH harvester can reach the Eureka HTTP endpoint at http://example.org/oai-pmh/ (where example.org should be replaced by the hostname of your server). Read the Open Archives Initiative documentation for details.

Server-side features

As suggested by the OAI recommendation, because lists of learning objects can be very long, Eureka has an implementation of the resumptionToken. Thus, a harvester will need to pass the token along with subsequent requests to get the next portions of the lists.

The OAI protocol handles all ID, date and update issues in an elegant manner.
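The harvester side of the resumptionToken flow can be sketched as follows. This is a hedged illustration assuming only the endpoint layout described above; example.org is a placeholder hostname, and the function names are not part of any Eureka API:

```python
# Sketch of an OAI-PMH ListRecords harvesting loop with resumptionToken.
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def next_params(xml_text):
    """Extract the resumptionToken from a ListRecords response and build the
    parameters for the follow-up request; return None when the list is done."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI_NS + "resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    # Subsequent requests carry only the verb and the token
    return {"verb": "ListRecords", "resumptionToken": token.text.strip()}

def harvest(base_url="http://example.org/oai-pmh/"):
    """Yield every <record> element, following resumptionTokens."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    while params is not None:
        with urlopen(base_url + "?" + urlencode(params)) as response:
            body = response.read()
        for record in ET.fromstring(body).iter(OAI_NS + "record"):
            yield record
        params = next_params(body)
```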

ECL Search / Expose

We provide an ECL connector for the EduSource search federator, using ECL Protocol 0.3.03. The search engine can be found at http://edusource.licef.teluq.uquebec.ca/ese/en/index.jsp

Eureka HTML Spider protocol (SpiderHTML)

For those who can't afford to implement the full OAI-PMH protocol, or who do not have the skillset to do so, Eureka also offers a simpler (but less capable) spider protocol.

What you have to do to allow Eureka to spider your LOs

  • Generate VALID IEEE 1484.12.3 XML metadata for each of your Learning Objects.
  • Make sure each of your resources has an OAI-ID compliant with http://www.openarchives.org/OAI/2.0/guidelines-oai-identifier.htm
  • Generate a single web page containing A HREF links to the XML of every LO you want to have spidered. Normally, this XML is generated on the fly. However, it can also be generated as static files in a single directory, with Eureka pointed at the HTML directory listing (but this isn't recommended).

What you have to do to get Eureka to update or remove a LO from the database

  • To cause the spider to retrieve a new version of one of your LO, you must either:
    • Increment the version in LOM 2.1 Version, if the version is numeric.
    • Increment or add an additional (and more recent) LOM 2.3.3 Date, LOM 3.2.3 Date, LOM 8.2 Date
  • To cause the spider to remove one of your LO from the database, you must either:
    • Set LOM 2.2 Status to "unavailable"
    • Make sure that the URL where the Spider downloaded the LO metadata returns an HTTP 404

What the spider will do

  • Fetch the HTML page at the URL associated with the Spider.
  • Extract all hyperlinks in the remote HTML page, and fetch the data they point to.
    • Validate against the schema that the data is valid 1484.12.3; if it isn't, the resource is skipped.
    • If LOM 2.2 Status is "unavailable", don't import. If the resource was already in the database, delete it.
    • Look for an OAI-ID under LOM 3.1; if there isn't one, the resource is skipped.
      • If the OAI-ID does not match a LO already in the database, the LO is imported
      • If the OAI-ID matches a LO already in the database, it will extract some metadata to determine if the LO was updated. Unlike OAI-PMH, this is a very ambiguous process.
        • If there is a LOM 2.1 Version, and it is numeric, and larger than the one already in the database, the resource is replaced.
        • The spider will find the most recent date among LOM 2.3.3 Date, LOM 3.2.3 Date and LOM 8.2 Date. If any of them is more recent than any of those from the LO in the DB, the resource is replaced. For resources to be updated when they are modified, it is IMPERATIVE that one of the above dates, or the version, be incremented in the LOM metadata. In practice, that generally means adding or updating a "System" contribution in LOM element 3.2 each time the metadata is modified.
  • For each LO that was not visited in the last spider run, check if there is still XML data at the URL where it was originally downloaded. If that URL returns a permanent HTTP error (such as a 404), the LO is removed from the database.
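The update decision in the steps above can be sketched as follows. This is one reading of the rules, not Eureka's actual code, and the parameter names are illustrative:

```python
# Decide whether an incoming LO should replace the stored copy, based on the
# LOM 2.1 Version and the dates found in LOM 2.3.3, 3.2.3 and 8.2.
from datetime import date

def should_replace(new_version, old_version, new_dates, old_dates):
    # Numeric version comparison (LOM 2.1 Version)
    try:
        if float(new_version) > float(old_version):
            return True
    except (TypeError, ValueError):
        pass  # non-numeric or missing versions: fall back to dates
    # Date comparison: the most recent incoming date must be newer than the
    # most recent stored date (one reading of the rule above)
    if new_dates and old_dates and max(new_dates) > max(old_dates):
        return True
    return False
```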

Warning: Since the spider heavily relies on date comparisons, make sure you have a stable or patched version of PEAR:Date. See PEAR:Date Warning in wiki:doc/install

Eureka RSS and Atom feed spider (SpiderFeed?)

Eureka can spider RSS feeds and import Dublin Core metadata. Note that this is even more limited than SpiderHTML, because there are far fewer assumptions that can realistically be made about RSS feeds.

What you have to do to allow Eureka to spider your LOs using RSS

  • Generate Dublin Core metadata and include it as part of your feed items.
  • Make sure each of your resources has at least one dc:identifier in the form of a URL

What you have to do to get Eureka to remove a LO from the database

Post-processing cannot be done on RSS feeds, so Eureka has no way to remove your metadata from its database.

What the spider will do

  • Fetch feed at the URL associated with the Spider.
  • For each item
    • Look for any dc:identifier; if there isn't any, the resource is skipped.
      • If none of the identifiers in the form of a valid URL match an identifier of any LO already in the database, the LO is imported
      • If any of the identifiers in the form of a valid URL matches an identifier of a LO already in the database (ANY identifier, not just OAI-IDs), it will extract some metadata to determine if the LO was updated. Be careful: unlike OAI-PMH, this is a very ambiguous process.
        • If the matching LO was not originally spidered from this spider, abort to avoid accidentally clobbering a valid LO that happened to have a matching identifier.
        • The spider will find the most recent dc:date. If it is more recent than any of the dates from the LO in the DB (LOM 2.3.3 Date, LOM 3.2.3 Date, LOM 8.2 Date), the resource is replaced. For resources to be updated when they are modified, it is IMPERATIVE that the dc:date be updated in the DC metadata.

Warning: Since the spider heavily relies on date comparisons, make sure you have a stable or patched version of PEAR:Date. See PEAR:Date Warning in wiki:doc/install

Supported web services

RESTful XML Interface for Learning Objects Validation

Eureka provides a powerful yet simple RESTful XML interface for dynamically validating that a given learning object exists and validates against a specific application profile. The interface uses simple HTTP queries and responds with a simple XML format.

Sample LomInfo Query / Response

Query

The HTTP GET or POST parameters are echoed back in the response XML so you can follow the inner workings.

Response

<?xml version="1.0"?>
<LOMInfo xmlns="http://eureka.coeus.ca/xsd/lominfo">
  <Request>
    <Parameter Name="lom_oai_id" Value="oai:eureka.ntic.org:42dd3538c3a467.70046325"/>
    <Parameter Name="validation_profile" Value="NORMETICv1.1,MANDATORY"/>
  </Request>
  <Exists>true</Exists>
  <ValidatesProfile>true</ValidatesProfile>
</LOMInfo>
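A client for this interface might look like the following sketch. Only the parameter names (lom_oai_id, validation_profile) and the response format are taken from the sample above; the endpoint URL itself is not documented here and must be supplied by the caller.

```python
# Build the query string for a LOMInfo request and parse its XML response.
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

LOMINFO_NS = "{http://eureka.coeus.ca/xsd/lominfo}"

def build_query(oai_id, profile):
    """Query string carrying the two documented parameters."""
    return urlencode({"lom_oai_id": oai_id, "validation_profile": profile})

def parse_lominfo(xml_text):
    """Return (exists, validates_profile) from a LOMInfo response document."""
    root = ET.fromstring(xml_text)
    return (root.findtext(LOMINFO_NS + "Exists") == "true",
            root.findtext(LOMINFO_NS + "ValidatesProfile") == "true")
```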

Interfacing the Search Engine using HTTP

The HTTP GET / POST parameters for interfacing the Search Engine are documented from inside your Eureka server. To access this documentation, click on "Search" in the main menu, then "Advanced Search", and finally, click on the small link "Get help for integrating this search on my web site" at the bottom of the page. This will get you to http://your_server_url/search.php?action=help

On this page, not only do you have the full documentation of every parameter, but you can also test them.

Supported metadata Vocabulary Import/Export formats

VDEX

The VDEX format is fully supported.

Furthermore, a VDEX vocabulary can be downloaded on top of a previous version of itself as an update. Any terms not yet present will be added to the vocabulary.

Text hierarchy

A hierarchical text file, with the number of tabs representing nesting, can be imported:

test1
	test2
		test3
	test4

Term IDs are unique GUIDs, and each term has a label corresponding to the string in the file. Importing the text above would result in the following hierarchy:

/ Your voc / test1
/ Your voc / test1 / test2
/ Your voc / test1 / test2 / test3
/ Your voc / test1 / test4 

This functionality is accessible from the vocabulary editor once a vocabulary has been created.
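The import of this format can be sketched as follows, assuming one tab per nesting level as in the example; GUID generation is left out, and the function is illustrative rather than Eureka's actual code:

```python
# Convert the tab-indented text format above into the full hierarchy paths
# shown in the example; the vocabulary name is passed in as `root`.
def parse_hierarchy(text, root="Your voc"):
    paths = []
    stack = []  # labels of the current ancestor chain
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        depth = len(line) - len(line.lstrip("\t"))
        label = line.strip()
        stack = stack[:depth] + [label]  # drop siblings deeper than this level
        paths.append("/ " + " / ".join([root] + stack))
    return paths
```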

CSV

Accessible from the vocabulary editor once you have created your vocabulary: you can import terms from a CSV file where the first column contains the term identifier and the second column contains the initial label.

Feature: Vocabulary term substitutions

VDEX relationships can be used to support substitutions when a term isn't found in a specific vocabulary.

Suppose we are trying to instantiate term_id_1 in voc_1, but term_id_1 doesn't exist in voc_1.

However if term_id_1 exists in voc_2, and term_id_1 has a "USE" or "exact" relationship with term_id_2 in voc_1, an attempt to instantiate term_id_1 in voc_1 will return an instance of term_id_2 in voc_1.

The relationship type must be one of the following for this substitution to occur:

  • ISO 2788 Term Relationships for Monolingual Thesauri as Used by the IMS Vocabulary Definition Exchange Specification
    • Term "USE" (only source->destination)
  • ISO 5964 Degrees of Equivalence for Multilingual Thesauri as Used by the IMS Vocabulary Definition Exchange Specification
    • Term "exact" (bidirectionally)
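The substitution rule above can be sketched as follows. The data structures are illustrative, not Eureka's schema; the relationships mapping would be built from the VDEX "USE" entries (source->destination only) and "exact" entries (in both directions):

```python
# Sketch of vocabulary term substitution: instantiating a term that is absent
# from the target vocabulary falls back to a related term, when one exists.
def instantiate(term_id, voc, relationships):
    """voc: {"id": vocabulary id, "terms": set of term ids}.
    relationships: dict mapping (term_id, target_voc_id) -> substitute term id."""
    if term_id in voc["terms"]:
        return term_id                   # term exists directly in the vocabulary
    substitute = relationships.get((term_id, voc["id"]))
    if substitute is not None and substitute in voc["terms"]:
        return substitute                # USE/exact relationship points here
    return None                          # term cannot be instantiated
```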