Processing Elements - Code Examples

This is a collection of code examples that highlight certain aspects of processing files with GRID.

Harvesting

The code examples below refer to harvesting, which is the process of downloading the content of a URI pointing to a remote location (i.e. harvesting the content) and sending it to GRID for processing. These code snippets can also be used for sending local files; in that case, harvestURI needs to be substituted with the URI of the file to send for processing.

Note: The methods "fetchContentFromRemoteURI(...)" and "createMetadataForRemoteURI(...)" are assumed to exist in the scope of the code snippets and must be implemented inside the client that uses GRID (i.e. your code). Furthermore, the code snippets assume that "processingService" and "sourceService" are fields holding references to initialized client stubs for the corresponding SOAP services.

The type HarvestedContent is used to wrap downloaded content and has the structure shown below. It is not part of the harvesting API; it serves as a helper class within the examples. Implement something similar on your side.

class HarvestedContent {
    byte[] contentSHA1Hash;
    InputStream contentStream;
    Date lastModified;
    String contentTag;
}
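The contentSHA1Hash field can be filled while reading the downloaded stream. A minimal sketch using the standard java.security.MessageDigest API (the helper class and method name below are our own, not part of any GRID API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashingSupport {

    // Reads the stream to its end and returns the 20-byte SHA-1 digest
    // of everything that was read. The stream is not closed here.
    static byte[] sha1Of(InputStream contentStream) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = contentStream.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        return digest.digest();
    }
}
```

Note that computing the hash this way consumes the stream, so a harvester that needs to upload the same content afterwards must buffer it (in memory or in a temporary file) before handing it to the examples below.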

Simplified Standard Harvesting Flow

for (URI harvestURI : urisToHarvest) {
    // Handle content download (lastModified and contentTag are not passed with the
    // request; see the full example for how conditional downloads can be handled)
    HarvestedContent content = fetchContentFromRemoteURI(harvestURI, null, null);
  
    // Extract metadata
    Metadata metadata = createMetadataForRemoteURI(harvestURI, content);
  
    // Prepare job and assign process source
    UUID jobId = processingService.prepareJob();
    FileIdentifier contentIdentifier = new FileIdentifier(content.contentSHA1Hash);
    URL putURL = processingService.assignProcessSourceWithContent(
            jobId, contentIdentifier, harvestURI, content.lastModified, content.contentTag, metadata);
  
    // Perform the upload if required
    if (putURL != null) {
        httpPutData(putURL, content.contentStream);
    }
  
    processingService.startJob(jobId);
}
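The httpPutData(..) call used above is another client-side helper that must be supplied by the harvester. A possible sketch using java.net.HttpURLConnection (the method name is taken from the snippets; error handling is omitted):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

class HttpUpload {

    // Streams the given content to the URL with an HTTP PUT request
    // and returns the HTTP status code of the response.
    static int httpPutData(URL putURL, InputStream contentStream) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) putURL.openConnection();
        connection.setRequestMethod("PUT");
        connection.setDoOutput(true);
        try (OutputStream out = connection.getOutputStream()) {
            contentStream.transferTo(out);
        }
        int statusCode = connection.getResponseCode();
        connection.disconnect();
        return statusCode;
    }
}
```

A production implementation should additionally check the returned status code and retry or report failed uploads.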

Standard Harvesting Flow

for (URI harvestURI : urisToHarvest) {
    Date lastModified = null;
    FileIdentifier contentIdentifier = null;
    String contentTag = null;
  
    // Decide whether content has to be downloaded
    Source source = sourceService.getSourceForURL(harvestURI);
    if (source != null) {
        lastModified = source.getSourceInformation().getLastModified();
        contentTag = source.getSourceInformation().getContentTag();
  
        Metadata md = source.getMetadata();
        Meta sha1Hash = md.get("sourceContentSHA1");
        Meta md5Hash = md.get("sourceContentMD5");
        contentIdentifier = new FileIdentifier(sha1Hash.getBinaryValue(), md5Hash.getBinaryValue());
    }
  
    // Handle conditional content download
    HarvestedContent content = fetchContentFromRemoteURI(harvestURI, lastModified, contentTag);
    // Content may be 'null' if it's already known by GRID and did not change with
    // regards to 'lastModified' and 'contentTag'.
    if (content == null) continue;
  
    // Extract metadata
    Metadata metadata = createMetadataForRemoteURI(harvestURI, content);
  
    // Prepare job and assign process source
    UUID jobId = processingService.prepareJob();
    SourceIdentifier identifier = processingService.assignProcessSource(
            jobId, harvestURI, lastModified, contentTag, metadata);
  
    // Decide whether content must be uploaded
    FileIdentifier harvestedIdentifier = new FileIdentifier(content.contentSHA1Hash);
    boolean uploadIsRequired = contentIdentifier == null || !contentIdentifier.equalsSHA1(harvestedIdentifier);
  
    // Perform the upload if required
    if (uploadIsRequired) {
        try {
            // See if the content exists already based on an upload of another source.
            processingService.assignExistingContentToProcessSource(identifier, harvestedIdentifier);
        } catch (IllegalRequestException e) {
            URL uploadURL = processingService.assignContentToProcessSource(identifier);
            httpPutData(uploadURL, content.contentStream);
        }
    }
  
    processingService.startJob(jobId);
}
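Inside fetchContentFromRemoteURI(..), the lastModified and contentTag parameters typically map onto the HTTP If-Modified-Since and If-None-Match request headers; a 304 Not Modified response then corresponds to returning null. A sketch of how the header values could be derived, assuming HTTP sources (helper class and method names are our own):

```java
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;

class ConditionalRequestSupport {

    // Formats a Date as an RFC 1123 timestamp suitable for the
    // If-Modified-Since request header, e.g. "Thu, 1 Jan 1970 00:00:00 GMT".
    static String ifModifiedSinceValue(Date lastModified) {
        return DateTimeFormatter.RFC_1123_DATE_TIME
                .format(lastModified.toInstant().atOffset(ZoneOffset.UTC));
    }

    // An HTTP entity tag (the natural choice for contentTag with HTTP
    // sources) is used verbatim as the If-None-Match header value.
    static String ifNoneMatchValue(String contentTag) {
        return contentTag;
    }
}
```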

Single file with multiple sources

The following example covers the special case that a single file may exist at multiple source URIs. Harvesters can report this case to GRID by adding the same file multiple times with differing sources and metadata. The example below assumes that a method detectAllAvailableSourceURIs(..) implements the source URI collection.

for (URI primaryContentUri : urisToHarvest) {
    // Prepare job and source
    UUID jobId = processingService.prepareJob();
    HarvestedContent content = fetchContentFromRemoteURI(primaryContentUri, null, null);
    FileIdentifier contentIdentifier = new FileIdentifier(content.contentSHA1Hash);
  
    List<URI> sourceURIs = detectAllAvailableSourceURIs(primaryContentUri);
    for (URI sourceURI : sourceURIs) {
        // Assign URI and Metadata
        Metadata metadata = createMetadataForRemoteURI(sourceURI, content);
        SourceIdentifier sourceIdentifier = processingService.assignProcessSource(
                jobId, sourceURI, content.lastModified, content.contentTag, metadata);
  
        // Assign binary content
        try {
            // See if the content exists already.
            processingService.assignExistingContentToProcessSource(sourceIdentifier, contentIdentifier);
        } catch (IllegalRequestException e) {
            URL uploadURL = processingService.assignContentToProcessSource(sourceIdentifier);
            httpPutData(uploadURL, content.contentStream);
        }
    }
  
    // Start job
    processingService.startJob(jobId);
}
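How detectAllAvailableSourceURIs(..) discovers the alternative locations is entirely source-specific and must be implemented by the harvester. A minimal hypothetical sketch backed by a static mirror map (the example URIs and the map are purely illustrative):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class SourceUriDetection {

    // Hypothetical mapping from a primary URI to known mirror URIs.
    // In practice this information might come from a mirror list,
    // a catalogue, or link extraction during harvesting.
    private static final Map<URI, List<URI>> KNOWN_MIRRORS = Map.of(
            URI.create("http://example.org/file.pdf"),
            List.of(URI.create("http://mirror.example.net/file.pdf")));

    // Returns the primary URI plus all known alternative source URIs.
    static List<URI> detectAllAvailableSourceURIs(URI primaryContentUri) {
        List<URI> sourceURIs = new ArrayList<>();
        sourceURIs.add(primaryContentUri);
        sourceURIs.addAll(KNOWN_MIRRORS.getOrDefault(primaryContentUri, List.of()));
        return sourceURIs;
    }
}
```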

Of course the snippet is a simplification: it handles neither conditional downloads nor any further "upload-required" checks. With the API enhancements in GACL-1.3, the upload handling can be simplified greatly using "assignProcessSourceWithContent(..)", which already includes the "upload-required" handling:

for (URI primaryContentUri : urisToHarvest) {
    // Prepare job and source
    UUID jobId = processingService.prepareJob();
    HarvestedContent content = fetchContentFromRemoteURI(primaryContentUri, null, null);
    FileIdentifier contentIdentifier = new FileIdentifier(content.contentSHA1Hash);
  
    List<URI> sourceURIs = detectAllAvailableSourceURIs(primaryContentUri);
    for (URI sourceURI : sourceURIs) {
        // Assign URI and Metadata
        Metadata metadata = createMetadataForRemoteURI(sourceURI, content);
        URL uploadURL = processingService.assignProcessSourceWithContent(
                jobId, contentIdentifier, sourceURI, content.lastModified, content.contentTag, metadata);
  
        // Assign binary content
        if (uploadURL != null) httpPutData(uploadURL, content.contentStream);
    }
  
    // Start job
    processingService.startJob(jobId);
}