General Questions on the API usage

Why is there a "Public ApiDocs" and how does it relate to the SOAP and REST interfaces?

The Access Layer is implemented with "Java Enterprise Edition" technology. All interfaces published via SOAP or REST have direct Java delegates that can be viewed inside the "Public ApiDocs".

WSDL and WADL documents are built dynamically from the interface definitions; the "Public ApiDocs" therefore always contain the latest and most accurate information on the interfaces offered for the web.

IMPORTANT NOTE: Interfaces that conform to the "Profile Level 0" can be found inside the packages "com.trendmicro.grid.acl.l0" and "com.trendmicro.grid.acl.l0.datatypes".

More information on this topic can be obtained inside the module documentation for "WS Server Api".

Where do I find WADL, WSDL and co.?

When TinyJEE starts, the application server prints all servlet entry points to stdout. Opening these paths in a browser yields most of the relevant information.

A complete list of WADL and WSDL URLs, including a quick reference, can be found at WSServer Api.


How can I check whether the GRID knows something about a specific file?

Look at the interfaces:

REST:
 - http://host:port/rs/level-0/files/isKnownGood/{sha1}
 - http://host:port/rs/level-0/files/isKnownGood/{sha1}-{md5}

SOAP:
 - http://host:port/ws/level-0/internal/files
    - isFileTaggedWith(FileIdentifier, tags)
    - getFileInformation(FileIdentifier)

The information provided by these interfaces typically contains update times, reference counts and tags. The latter can be used to decide whether a file is of a specific type or "infected" by a virus.

Note: REST services may respond with HTTP 204 if the implementing service method returns a 'null' value. Please check the apidocs for details on when 'null' values are returned.

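The REST variant can be exercised with plain java.net classes. The sketch below only builds the request (host, port and the hash value are placeholders) and documents how the status codes map to the 'null' behaviour described in the note; no I/O happens until the response code is read:

```java
import java.net.HttpURLConnection;
import java.net.URL;

String sha1 = "0123456789abcdef0123456789abcdef01234567"; // placeholder hash
// "gridhost:8080" is a placeholder for your GRID Access Layer instance.
URL url = new URL("http://gridhost:8080/rs/level-0/files/isKnownGood/" + sha1);

// openConnection() performs no network I/O yet; the request is only
// sent once getResponseCode() (or a stream) is accessed.
HttpURLConnection huc = (HttpURLConnection) url.openConnection();
huc.setRequestMethod("GET");

// huc.getResponseCode() == 200: the body contains the boolean result.
// huc.getResponseCode() == 204: the service method returned 'null',
//                               i.e. the GRID has no verdict for this hash.
```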
How do I obtain information about a "software product"?

Inside the interfaces, a "software product" is referred to as a "package". "Package" is a term applied to any grouping of files, including "software products".


How can I ask for a list of files or packages that fall under a certain grouping, e.g. "all system libraries"?

This requires two steps:

  1. Check the documentation about tagging inside the GRID
  2. Look at the interfaces related to tagging:
    SOAP:
     - http://host:port/ws/level-0/internal/files
       - isFileTaggedWith(FileIdentifier, tags)
       - getFilesTaggedWith(tags, ...)
       - getMatchingFiles(tagQueryExpression, ...)
       - isPackageTaggedWith...(..., tags, ...)
       - getPackageNamesTaggedWith(tags, ...)
       - getMatchingPackageNames(tagQueryExpression, ...)

Java Client Example:

int pageNumber = 0;
ListPage<FileIdentifier> fileIds  = null;
do {
    fileIds = fileService.getFilesTaggedWith(
        new String[] {"systemlibrary"}, pageNumber);
    pageNumber = fileIds.getPageNumber() + 1;

    //TODO: Do something with the IDs.

} while(!fileIds.isLastPage());

How can I find dependencies between packages?

Look at the interfaces:

SOAP:
 - http://host:port/ws/level-0/packages
    - Parents:  getReferencingPackageNames...(...)
    - Children: getFilesContainedInPackage...(...)
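These calls are paged and return only direct parents or children; resolving transitive dependencies is up to the client. Below is a sketch of a cycle-safe breadth-first walk over the "referencing packages" relation. The java.util.function.Function stands in for the (abbreviated) getReferencingPackageNames call, and the package names are made-up sample data:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Stand-in for the SOAP call; a real client would delegate to the
// packages service and page through the results.
Function<String, List<String>> parentsOf = name -> switch (name) {
    case "libc" -> List.of("coreutils", "bash");
    case "bash" -> List.of("scripts");
    default     -> List.of();
};

// Breadth-first walk, cycle-safe because every package name is
// enqueued at most once.
Set<String> seen = new LinkedHashSet<>();
Deque<String> queue = new ArrayDeque<>(List.of("libc"));
while (!queue.isEmpty()) {
    String current = queue.removeFirst();
    for (String parent : parentsOf.apply(current)) {
        if (seen.add(parent)) {
            queue.addLast(parent);
        }
    }
}
// seen now contains all transitive parents of "libc":
// [coreutils, bash, scripts]
```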

What limits apply to the data inside the GRID?

There are no general limits regarding the number of recursions, linked files, amount of packages etc.

Names, metadata, tags and some other values fall under certain limits that are enforced by the access layer. The following snippet shows the compiled-in (and configurable) limits:

	/**
	 * Defines the maximum amount of characters that may be used in the various
	 * name fields (Note: The size was chosen based on the maximum index size
	 * supported by MSSQL, it should not be increased).
	 */
	int MAX_NAME_LENGTH = 432;

	/**
	 * Defines the maximum amount of characters that may be used in the various
	 * display name fields.
	 */
	int MAX_DISPLAY_NAME_LENGTH = 256;

	/**
	 * Defines the maximum amount of characters that all tags may consume when
	 * serialized to a whitespace delimited string (hard limit, compiled-in).
	 */
	int MAX_TAG_STRING_LENGTH = 1024 * 1024;

	/**
	 * Defines the actually applied limit for the tag string length.
	 * Configurable; via command line parameter "-Dgacl.max.tag.string.length=value".
	 */
	int TAG_STRING_LENGTH = Math.min(MAX_TAG_STRING_LENGTH,
			Integer.getInteger("gacl.max.tag.string.length", 64 * 1024));

	/**
	 * Defines the maximum amount of bytes that the serialized metadata element
	 * may consume. (Note: "serialized metadata" means the length of the UTF-8
	 * encoded XML representation excluding any unnecessary whitespaces)
	 * (hard limit, compiled-in).
	 */
	int MAX_SERIALIZED_METADATA_LENGTH = 1024 * 1024;

	/**
	 * Defines the actually applied limit for the serialized metadata length.
	 * Configurable; use command line parameter "-Dgacl.max.serialized.metadata.length=value".
	 */
	int SERIALIZED_METADATA_LENGTH = Math.min(MAX_SERIALIZED_METADATA_LENGTH,
			Integer.getInteger("gacl.max.serialized.metadata.length", 256 * 1024));

	/**
	 * Defines the actually applied limit for incoming batch request.
	 * <p/>
	 * If the limit is reached, web-services will stop parsing the request and
	 * return an error in order to protect the server from DoS attacks.
	 * <p/>
	 * Configurable; use command line parameter "-Dgacl.max.incoming.request.batch.size=value".
	 */
	int MAX_INCOMING_REQUEST_BATCH_SIZE = Integer.
			getInteger("gacl.max.incoming.request.batch.size", 100);

	/**
	 * Defines the maximum amount of characters that may be used in the remote
	 * (public) URI field (using ASCII URI encoding).
	 */
	int SOURCE_MAX_REMOTE_URI_LENGTH = 2 * 1024;

	/**
	 * Defines the maximum amount of characters that may be used in the internal
	 * URI field (using ASCII URI encoding).
	 */
	int SOURCE_MAX_INTERNAL_URI_LENGTH = 1024;

	/**
	 * Defines the maximum amount of characters that may be used in the content
	 * tag field.
	 */
	int SOURCE_MAX_CONTENT_TAG_LENGTH = 128;

	/**
	 * Defines the maximum amount of characters that may be used in the domain
	 * name field.
	 */
	int SOURCE_DOMAIN_MAX_NAME_LENGTH = 256;
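
A client can mirror these limits to fail fast before a request is rejected server-side. A minimal sketch (the constant values repeat the compiled-in defaults above; the check itself is a hypothetical client-side helper, not part of the Access Layer API):

```java
int MAX_NAME_LENGTH = 432;          // mirrors the compiled-in limit above
int TAG_STRING_LENGTH = 64 * 1024;  // default, unless overridden via system property

String name = "some-package-name";
String[] tags = {"systemlibrary", "os", "runtime"};

// Tags count against the limit in their whitespace-delimited serialized form.
String serializedTags = String.join(" ", tags);

boolean nameOk = name.length() <= MAX_NAME_LENGTH;
boolean tagsOk = serializedTags.length() <= TAG_STRING_LENGTH;
```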


In addition, processing has a per-session limit to avoid out-of-memory errors:

	// The max prepared source count is MAX_PREPARED_JOBS * MAX_SOURCES_PER_JOB per session.
	//
	// This limit applies only to jobs & sources that were prepared but not yet started.
	// Exceeding the limit will log and discard the eldest job or source, but it will
	// not cause an error. It is therefore generally safe to prepare jobs and not process
	// them further, as the data is cleaned either when the session times out or when
	// more than the declared limit of jobs is prepared.
	// However, it is the client's responsibility not to exceed the limit with jobs or
	// sources that must not get lost.
	//
	// Attention, every source may consume several KB of RAM, depending directly on the
	// size of the metadata element. Adjust these values with care and only when needed.

	/**
	 * Defines the actually applied limit for the amount of prepared jobs within a single
	 * session.
	 * Configurable; via command line parameter "-Dgacl.max.prepared.jobs=value".
	 */
	public static final int MAX_PREPARED_JOBS = Math.max(1,
			Integer.getInteger("gacl.max.prepared.jobs", 256));

	/**
	 * Defines the actually applied limit for the amount of sources that may be assigned to
	 * a single job.
	 * Configurable; via command line parameter "-Dgacl.max.sources.per.job=value".
	 */
	public static final int MAX_SOURCES_PER_JOB = Math.max(1,
			Integer.getInteger("gacl.max.sources.per.job", 16));

Categorization-specific Questions

How does the categorization work conceptually?

Views

Categories are generally organized in tree structures called views. Views can be regional, to accommodate regional differences in how an item falling under a certain category is classified.

Categories

Categories are represented by "category definitions", which consist mainly of a name and a tag query expression that is used to query tagged packages or files from the GRID. A category is never directly assigned to a "package" or "file"; instead, the wiring is always performed by evaluating the "tag query expression".

Example - Category "Games":

Category {
    name = "Games";
    tagQueryExpression = "(gamevendor -development -productivity) (gameapplication)";
}
Technical aspect: As the class "Category" is a subclass of "CategoryView", every category may have child categories. The interfaces differentiate between "plain categories" and "views" to allow retrieving a single definition without caring about child categories or the views it belongs to.



The classes "Category" and "CategoryView" refer to category definitions and may appear in multiple locations. In contrast, a category is uniquely identified by its name: there may not be two category definitions that share the same name but have different "tag query expressions".

Tag Query Expressions

Tag query expressions can be used to retrieve file and package names that satisfy the given query. Whenever categorization is used, either package names or files must be queried at some point to support any further operation.

Example query:

(mustbe1Group1 -mustnotbe1Group1) (mustbe1Group2 -mustnotbe1Group2 mustbe2Group2)

The syntax is designed to be easy to parse and translate to a corresponding SQL query as well as powerful enough to satisfy the categorization needs.

Note: Details on the syntax definition can be found inside the module WSServerApi under the page "Tag Queries".
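For illustration only, here is a toy evaluator for one plausible reading of the examples above: terms within a parenthesized group are AND-combined ('-' prefixed terms must be absent), and the groups themselves are OR-combined. This semantics is an assumption; the "Tag Queries" page is authoritative:

```java
import java.util.Set;

// Toy evaluator for the ASSUMED semantics: a tag set matches the expression
// if at least one group matches; a group matches if all plain tags are present
// and no '-'-prefixed tag is present.
String expression = "(gamevendor -development -productivity) (gameapplication)";
Set<String> tags = Set.of("gamevendor", "freeware");

boolean matches = false;
for (String group : expression.split("\\)\\s*")) {
    String body = group.replace("(", "").trim();
    if (body.isEmpty()) continue;
    boolean groupMatches = true;
    for (String term : body.split("\\s+")) {
        boolean negated = term.startsWith("-");
        String tag = negated ? term.substring(1) : term;
        // Fail the group if a required tag is absent or a negated tag is present.
        if (negated == tags.contains(tag)) {
            groupMatches = false;
            break;
        }
    }
    if (groupMatches) { matches = true; break; }
}
// 'matches' is true here: the first group is satisfied
// (gamevendor present, development and productivity absent).
```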


How do I obtain a category view?

A view can be obtained via a SOAP interface and is generally speaking a tree of category definitions:

SOAP:
- http://host:port/ws/level-0/categories
  - String[] viewNames getCategoryViewNames(Locale)
  - CategoryView getCategoryView(Locale, String viewName)
Note: As views are regional, a Locale (e.g. en_US) is required to return the correct view for the region of the requester.

How do I obtain all packages that fall under a certain category?

There are, generally speaking, two possibilities: one involves some client-side logic, the other is completely GRID-controlled.

GRID controlled variant:

  1. The first step is to get a category definition from the GRID, using one of CategoryView getCategoryView(Locale, viewName) or Category getPlainCategory(Locale, categoryName).
  2. Next, use the "package" or "files" related services that understand the "tagQueryExpression" contained inside the category definition.

Java Client Example:

Locale locale = Locale.getDefault();
Category games = categoryService.getPlainCategory(locale, "Games");
int pageNumber = 0;
ListPage<String> packageNames = null;
do {
    packageNames = packageService.getMatchingPackageNames(
        games.getTagQueryExpression(),
        games.getTagQueryExpressionVersion(), pageNumber);
    pageNumber = packageNames.getPageNumber() + 1;

    //TODO: Do something with the names.

} while(!packageNames.isLastPage());

Custom variant:

Generally speaking, the custom variant is pretty much the same; however, it presumes that the client application is capable of understanding the query format for categories and is also aware of the underlying "tag pool", so it can build custom categories.

Harvesting-specific Questions

How can I query whether a source URL is known by the GRID and not outdated?

To query details on a source URL, the Access Layer offers a couple of interfaces with some being more lightweight than others.

Within the normal harvesting workflow, lots of queries are issued against the GRID; retrieving only as much information as required is highly recommended.

Start with the following interfaces, which are especially related to source URLs, processing and file information:

SOAP:
- http://host:port/ws/level-0/internal/sources
  - (All methods)
- http://host:port/ws/level-0/internal/processing
  - (All methods)
- http://host:port/ws/level-0/processing
  - getFileInformation(fileId)

How do I check for ETags and other HTTP & URL related information?

The Access Layer offers one particular data type called "SourceInformation" that mainly consists of the values "Last-Modified" and a custom "ContentTag" (e.g. usable as ETAG) that can be used inside the harvester to decide whether a remote URL should be processed further.

Java Client Example:

URI remoteURI = URI.create("http://remoteHost/path/to/file/to/harvest");
HttpURLConnection huc = (HttpURLConnection) remoteURI.toURL().openConnection();

SourceInformation info = sourceService.getSourceInformationForURL(remoteURI);
if (info != null) {
    String contentTag = info.getContentTag();
    if (contentTag != null)
        huc.setRequestProperty("If-None-Match", contentTag);
    Date lastModified = info.getLastModified();
    if (lastModified != null)
        huc.setIfModifiedSince(lastModified.getTime());
}

if (huc.getResponseCode() == HttpURLConnection.HTTP_NOT_MODIFIED)
    continue; // continue with the next URL (this snippet is assumed to run inside a loop)

How do I exchange URI specific information with the GRID processing modules?

Information that is specific to the source URI can be exchanged with the processing modules by converting it into Metadata packages as defined inside the module Metadata Handler.

Once converted, the Metadata package can be attached to the source when adding or updating it inside the GRID system. See the details below.

Note: Source information can be exchanged in both directions, as the interfaces allow reading & writing metadata on sources.

How do I exchange site (domain) specific information with the GRID processing modules?

When dealing with information that is common to a complete domain (e.g. microsoft.com), use the data type "SourceDomain" and look for the domain related retrieval and update services inside the SOAP service: http://host:port/ws/level-0/internal/sources

Technically, this works in a way similar to sources.


How do I send files for processing?

Sending a single file or multiple linked files for processing always involves four major steps:

  1. Preparing a new Job using the call "jobId prepareJob()"
  2. Assigning the remote source and metadata inside the GRID using "sourceId assignProcessSource(...)"
  3. Transfer the file content via "HTTP-PUT" using the "transferURL" returned by "transferURL assignContentToProcessSource(sourceId)"
  4. Start the job using "startJob(jobId)"

Java Client Example:

URI remoteURI = URI.create("http://remoteHost/path/to/setup.exe");
File localFile = new File("setup.exe");
String contentTag = "xyfre3sfds442";
Metadata metadata = ...;

// Step 1
UUID jobId = processingService.prepareJob();

// Step 2
SourceIdentifier sourceId = processingService.assignProcessSource(
    jobId, remoteURI, new Date(localFile.lastModified()), contentTag, metadata);

// Step 3
URL transferURL = processingService.assignContentToProcessSource(sourceId);

HttpURLConnection huc = (HttpURLConnection) transferURL.openConnection();
huc.setDoOutput(true);
huc.setRequestMethod("PUT");

FileUtils.copy(localFile, huc.getOutputStream());

// Step 4
processingService.startJob(jobId);