This is a collection of code examples that highlight certain aspects of processing files with GRID.
The code examples below refer to harvesting, which is the process of downloading the content of a URI pointing to a remote location (i.e., harvesting the content) and sending it to GRID for processing. These code snippets can also be used for sending local files; in that case harvestURI needs to be substituted with the URI of the file to send for processing.
Note: The methods "fetchContentFromRemoteURI(...)" and "createMetadataForRemoteURI(...)" are assumed to exist in the scope of the code snippets and must be implemented inside the client that uses GRID (i.e., your code). Furthermore, the snippets assume that "processingService" and "sourceService" are fields holding references to initialized client stubs of the corresponding SOAP services.
The type HarvestedContent is used to wrap downloaded content and has the structure below. It is not part of the harvesting API; it serves as a helper class within the examples. Implement something similar on your side.
```java
class HarvestedContent {
    byte[] contentSHA1Hash;
    InputStream contentStream;
    Date lastModified;
    String contentTag;
}
```
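As an illustration, a HarvestedContent instance for a local file could be assembled as sketched below. The SHA-1 computation via java.security.MessageDigest is the part GRID relies on for content identification; the helper method name is an assumption, not part of any API:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.Date;

public class LocalHarvester {

    /** Reads a local file and fills the HarvestedContent helper fields. */
    static HarvestedContent harvestLocalFile(Path path) throws Exception {
        byte[] data = Files.readAllBytes(path);
        HarvestedContent content = new HarvestedContent();
        // GRID identifies binary content by its SHA-1 hash.
        content.contentSHA1Hash = MessageDigest.getInstance("SHA-1").digest(data);
        content.contentStream = new ByteArrayInputStream(data);
        content.lastModified = new Date(Files.getLastModifiedTime(path).toMillis());
        content.contentTag = null; // local files have no ETag equivalent
        return content;
    }

    // Minimal copy of the helper class from the text so the sketch compiles standalone.
    static class HarvestedContent {
        byte[] contentSHA1Hash;
        InputStream contentStream;
        Date lastModified;
        String contentTag;
    }
}
```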
```java
for (URI harvestURI : urisToHarvest) {
    // Handle content download (ignoring lastModified and contentTag with the request,
    // see full example to see how conditional downloads can be handled)
    HarvestedContent content = fetchContentFromRemoteURI(harvestURI, null, null);

    // Extract metadata
    Metadata metadata = createMetadataForRemoteURI(harvestURI, content);

    // Prepare job and assign process source
    UUID jobId = processingService.prepareJob();
    FileIdentifier contentIdentifier = new FileIdentifier(content.contentSHA1Hash);
    URL putURL = processingService.assignProcessSourceWithContent(
            jobId, contentIdentifier, harvestURI,
            content.lastModified, content.contentTag, metadata);

    // Perform the upload if required
    if (putURL != null) {
        httpPutData(putURL, content.contentStream);
    }

    processingService.startJob(jobId);
}
```
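The examples call a helper httpPutData(..) that is not shown. A minimal sketch using plain java.net.HttpURLConnection might look like the following; the method name and the status-code handling are assumptions, since GRID only requires that the bytes arrive via HTTP PUT at the signed URL:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class HttpPutHelper {

    /** Streams the given content to the target URL with an HTTP PUT request. */
    static int httpPutData(URL putURL, InputStream contentStream) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) putURL.openConnection();
        connection.setRequestMethod("PUT");
        connection.setDoOutput(true);
        connection.setChunkedStreamingMode(8192); // avoid buffering the whole file in memory
        try (OutputStream out = connection.getOutputStream()) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = contentStream.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        int status = connection.getResponseCode();
        connection.disconnect();
        if (status / 100 != 2) {
            throw new IOException("Upload failed with HTTP status " + status);
        }
        return status;
    }
}
```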
The full example below additionally handles conditional downloads (based on lastModified and contentTag) and avoids re-uploading content that GRID already knows:

```java
for (URI harvestURI : urisToHarvest) {
    Date lastModified = null;
    FileIdentifier contentIdentifier = null;
    String contentTag = null;

    // Decide whether content has to be downloaded
    Source source = sourceService.getSourceForURL(harvestURI);
    if (source != null) {
        lastModified = source.getSourceInformation().getLastModified();
        contentTag = source.getSourceInformation().getContentTag();
        Metadata md = source.getMetadata();
        Meta sha1Hash = md.get("sourceContentSHA1"),
             md5Hash = md.get("sourceContentMD5");
        contentIdentifier = new FileIdentifier(sha1Hash.getBinaryValue(), md5Hash.getBinaryValue());
    }

    // Handle conditional content download
    HarvestedContent content = fetchContentFromRemoteURI(harvestURI, lastModified, contentTag);
    // Content may be 'null' if it is already known to GRID and did not change with
    // regard to 'lastModified' and 'contentTag'.
    if (content == null)
        continue;

    // Extract metadata
    Metadata metadata = createMetadataForRemoteURI(harvestURI, content);

    // Prepare job and assign process source
    UUID jobId = processingService.prepareJob();
    SourceIdentifier identifier = processingService.assignProcessSource(
            jobId, harvestURI, lastModified, contentTag, metadata);

    // Decide whether content must be uploaded
    FileIdentifier harvestedIdentifier = new FileIdentifier(content.contentSHA1Hash);
    boolean uploadIsRequired = contentIdentifier == null
            || !contentIdentifier.equalsSHA1(harvestedIdentifier);

    // Perform the upload if required
    if (uploadIsRequired) {
        try {
            // See if the content exists already based on an upload of another source.
            processingService.assignExistingContentToProcessSource(identifier, harvestedIdentifier);
        } catch (IllegalRequestException e) {
            URL uploadURL = processingService.assignContentToProcessSource(identifier);
            httpPutData(uploadURL, content.contentStream);
        }
    }

    processingService.startJob(jobId);
}
```
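Inside fetchContentFromRemoteURI(..), the conditional download typically maps lastModified and contentTag to the HTTP If-Modified-Since and If-None-Match request headers, returning null on a 304 response. This is a sketch under that assumption; everything beyond the standard HTTP semantics is hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.Date;

public class ConditionalFetcher {

    /**
     * Downloads the URI unless the server reports it unchanged.
     * Returns null on HTTP 304 (Not Modified), mirroring the examples above.
     */
    static InputStream fetchIfModified(URI uri, Date lastModified, String contentTag)
            throws IOException {
        HttpURLConnection connection = (HttpURLConnection) uri.toURL().openConnection();
        if (lastModified != null) {
            connection.setIfModifiedSince(lastModified.getTime());
        }
        if (contentTag != null) {
            // The contentTag of the examples corresponds to the HTTP ETag.
            connection.setRequestProperty("If-None-Match", contentTag);
        }
        int status = connection.getResponseCode();
        if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
            return null; // unchanged, nothing to process
        }
        if (status / 100 != 2) {
            throw new IOException("Download failed with HTTP status " + status);
        }
        return connection.getInputStream();
    }
}
```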
The following example covers the special case that a single file may exist at multiple source URIs. Harvesters can report this case to GRID by adding the same file multiple times with differing sources and metadata. The example below assumes that a method detectAllAvailableSourceURIs(..) exists that implements the source URI collection.
```java
for (URI primaryContentUri : urisToHarvest) {
    // Prepare job and source
    UUID jobId = processingService.prepareJob();
    HarvestedContent content = fetchContentFromRemoteURI(primaryContentUri, null, null);
    FileIdentifier contentIdentifier = new FileIdentifier(content.contentSHA1Hash);

    List<URI> sourceURIs = detectAllAvailableSourceURIs(primaryContentUri);
    for (URI sourceURI : sourceURIs) {
        // Assign URI and metadata
        Metadata metadata = createMetadataForRemoteURI(sourceURI, content);
        SourceIdentifier sourceIdentifier = processingService.assignProcessSource(
                jobId, sourceURI, content.lastModified, content.contentTag, metadata);

        // Assign binary content
        try {
            // See if the content exists already.
            processingService.assignExistingContentToProcessSource(sourceIdentifier, contentIdentifier);
        } catch (IllegalRequestException e) {
            URL uploadURL = processingService.assignContentToProcessSource(sourceIdentifier);
            httpPutData(uploadURL, content.contentStream);
        }
    }

    // Start job
    processingService.startJob(jobId);
}
```
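How detectAllAvailableSourceURIs(..) collects the URIs is entirely client-specific. One simple interpretation, sketched here, is a lookup in a table of known mirror locations; the mirror table and the class name are made-up assumptions for illustration:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SourceUriDetector {

    // Hypothetical mapping from a primary URI to additional mirror URIs of the same file.
    private final Map<URI, List<URI>> knownMirrors;

    SourceUriDetector(Map<URI, List<URI>> knownMirrors) {
        this.knownMirrors = knownMirrors;
    }

    /** Returns the primary URI plus all known mirrors of the same file. */
    List<URI> detectAllAvailableSourceURIs(URI primaryContentUri) {
        List<URI> sources = new ArrayList<>();
        sources.add(primaryContentUri); // the primary location always counts as a source
        sources.addAll(knownMirrors.getOrDefault(primaryContentUri, List.of()));
        return sources;
    }
}
```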
Of course, this snippet is a simplification, as it does not handle conditional downloads or any further "upload required" checks. With the API enhancements in GACL-1.3, the upload handling can be simplified greatly with "assignProcessSourceWithContent(..)", where the "upload required" handling is already included:
```java
for (URI primaryContentUri : urisToHarvest) {
    // Prepare job and source
    UUID jobId = processingService.prepareJob();
    HarvestedContent content = fetchContentFromRemoteURI(primaryContentUri, null, null);
    FileIdentifier contentIdentifier = new FileIdentifier(content.contentSHA1Hash);

    List<URI> sourceURIs = detectAllAvailableSourceURIs(primaryContentUri);
    for (URI sourceURI : sourceURIs) {
        // Assign URI and metadata
        Metadata metadata = createMetadataForRemoteURI(sourceURI, content);
        URL uploadURL = processingService.assignProcessSourceWithContent(
                jobId, contentIdentifier, sourceURI,
                content.lastModified, content.contentTag, metadata);

        // Assign binary content
        if (uploadURL != null)
            httpPutData(uploadURL, content.contentStream);
    }

    // Start job
    processingService.startJob(jobId);
}
```