Module - CIFS Datasource (CIFS Datasource)

Implements the file content storage system using connected CIFS (SMB) servers or the local filesystem.

Implementations of FileRepository

  • CIFSFileRepository: Uses the JCIFS client library to connect to any CIFS/SMB server using a hostname, path and authentication credentials (domain, username, password). CIFS connections are configured in "config/tinyjee-configuration.xml". If more than one connection is enabled in the active context, the first valid connection is used.
  • LocalFileRepository: Uses the local filesystem to implement the file storage in exactly the same way as the CIFS repository does. The config property "localRepositoryPath" is used to set the path used by this implementation. (Note: By default this implementation is NOT selected. See API for details.)

General Configuration Options:

This module is configured exclusively through the file "config/tinyjee-configuration.xml". All available options are listed and documented there.
The following list highlights a subset of these options:

  • filenameHashKeyAlgorithm: Either MD5 or SHA1. The hash function is applied to the stored content to derive the file name and path.
    Note: The file storage system generally references content by its signature.
  • filenamePathSchema: Defines how file names and paths are assembled. By default the ACL takes the hexadecimal-encoded hash key generated from the content and splits it into equal parts of 4 characters. The resulting paths look like "A012/C1B4/5412/.../4DEF", which is 7 levels of folders with up to 64k entries per level. This default behaviour guarantees that no level holds more than 64k files or folders; however, it easily becomes quite slow on most file systems.

    Other, better-performing path schemes can easily be used instead of the symmetric default, provided the hash algorithm distributes hash values evenly. Setting the value of this property to "00/00/00/" produces paths like "A0/12/C1/B454129874F4..4DEF", which is only 3 levels of folders with 256 entries per level.
    The last level has no defined maximum within the range that file systems can generally handle; however, given an even distribution, 256^3 already offers room for 16,777,216 folders. Assuming growth of roughly 64,000 new files per day and at most 1,000 files per folder, this scheme covers the needs of the next 700 years ((16.8 million * 1,000) / 64,000 / 365).
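The two layouts described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the class and method names are hypothetical, and the reading of the schema string (each non-slash character consumes one hash character, each "/" starts a new folder, and the remaining hash characters form the file name) is an assumption inferred from the example paths.

```java
/** Hypothetical sketch of how a hash key could be turned into a relative path. */
final class PathSchemaSketch {

    /** Default layout: split the hex hash into equal groups, e.g. 4 characters each. */
    static String splitEvery(String hexHash, int groupLength) {
        StringBuilder path = new StringBuilder();
        for (int i = 0; i < hexHash.length(); i += groupLength) {
            if (i > 0) {
                path.append('/');
            }
            path.append(hexHash, i, Math.min(i + groupLength, hexHash.length()));
        }
        return path.toString();
    }

    /**
     * Assumed schema semantics: every non-'/' character in the schema consumes
     * one hash character, every '/' is copied as a folder separator, and the
     * rest of the hash is appended as the file name.
     */
    static String applySchema(String hexHash, String schema) {
        StringBuilder path = new StringBuilder();
        int next = 0;
        for (int i = 0; i < schema.length() && next < hexHash.length(); i++) {
            if (schema.charAt(i) == '/') {
                path.append('/');
            } else {
                path.append(hexHash.charAt(next++));
            }
        }
        path.append(hexHash, next, hexHash.length());
        return path.toString();
    }
}
```

With a 32-character MD5 key, splitEvery(key, 4) yields seven folder levels plus the file name, while applySchema(key, "00/00/00/") yields three folder levels followed by the remaining 26 characters as the file name.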

    Benchmark tests have shown that the 3-folder path scheme delivers around 10 times better file access performance than the default. It reaches its maximum slowdown after around 260 days of operation (that is, once 16.8 million files have been stored, which can easily be validated in benchmarks).
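The figures quoted here (about 260 days and about 700 years) follow from simple integer arithmetic, reproduced below as a small check. The class and method names are hypothetical; the inputs are the ones stated in the text (256^3 folders, 64,000 new files per day, at most 1,000 files per folder).

```java
/** Hypothetical helper that reproduces the capacity estimates from the text. */
final class CapacityEstimate {

    /** Number of leaf folders in a three-level scheme with 256 entries per level. */
    static final long FOLDERS = 256L * 256L * 256L; // 16,777,216

    /** Days until every folder holds filesPerFolder files at the given daily growth. */
    static long daysUntilFull(long filesPerFolder, long filesPerDay) {
        return FOLDERS * filesPerFolder / filesPerDay;
    }

    /** Same estimate expressed in whole years of 365 days. */
    static long yearsUntilFull(long filesPerFolder, long filesPerDay) {
        return daysUntilFull(filesPerFolder, filesPerDay) / 365;
    }
}
```

daysUntilFull(1, 64000) gives 262 days, the point at which roughly one file per folder has been stored, and yearsUntilFull(1000, 64000) gives 718 years, which matches the rounded values above.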

    Note: The ACL creates internal access URIs based on the configured path scheme; however, it does not use these URIs directly when accessing the content. Instead, the hash value is parsed and the current scheme is applied. Changing the path scheme later therefore requires existing files to be renamed manually.
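When planning such a rename, the new location of each file can be derived from its old relative path. The sketch below assumes that a relative path consists only of the hash characters and "/" separators, so removing the separators recovers the hash. It also assumes the schema semantics inferred from the "00/00/00/" example (one hash character per non-slash schema character, remainder as the file name). All names are hypothetical, and the actual move on the CIFS share or local disk is left out.

```java
/** Hypothetical planning helper for moving files to a new path scheme. */
final class SchemeMigrationSketch {

    /** Recover the hash key from a path by dropping the folder separators (assumption). */
    static String hashFromPath(String relativePath) {
        return relativePath.replace("/", "");
    }

    /** Compute where a file stored under the old scheme belongs under the new one. */
    static String targetPath(String oldRelativePath, String newSchema) {
        String hash = hashFromPath(oldRelativePath);
        StringBuilder path = new StringBuilder();
        int next = 0;
        // Assumed semantics: '/' is copied, any other schema character takes one hash character.
        for (int i = 0; i < newSchema.length() && next < hash.length(); i++) {
            if (newSchema.charAt(i) == '/') {
                path.append('/');
            } else {
                path.append(hash.charAt(next++));
            }
        }
        // Whatever is left of the hash becomes the file name.
        path.append(hash, next, hash.length());
        return path.toString();
    }
}
```

For example, targetPath("A012/C1B4/5412/9874/F4A1/B2C3/D4E5/4DEF", "00/00/00/") returns "A0/12/C1/B454129874F4A1B2C3D4E54DEF".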

    It is also important to note that other GRID modules access files either through the URIs or through their own implementation that generates paths from hash values.
    As a result, changing this value requires CAUTION and COMMUNICATION with the whole GRID team.
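To illustrate the content-based naming controlled by filenameHashKeyAlgorithm, the following sketch derives a hexadecimal hash key from the stored bytes using the standard java.security.MessageDigest class. The class and method names are hypothetical, and the uppercase hexadecimal encoding and the mapping of the configured value "SHA1" to the standard algorithm name "SHA-1" are assumptions based on the example paths, not confirmed behaviour of the module.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: derive a hex hash key from content. */
final class ContentHashSketch {

    /**
     * @param content             the bytes to be stored
     * @param configuredAlgorithm "MD5" or "SHA1", as set in the configuration
     * @return the digest as an uppercase hexadecimal string
     */
    static String hashKey(byte[] content, String configuredAlgorithm) {
        // "SHA-1" is the standard JCA name; the config value "SHA1" is mapped to it (assumption).
        String jcaName = "SHA1".equalsIgnoreCase(configuredAlgorithm) ? "SHA-1" : configuredAlgorithm;
        try {
            byte[] digest = MessageDigest.getInstance(jcaName).digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02X", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported hash algorithm: " + configuredAlgorithm, e);
        }
    }
}
```

MD5 produces a 32-character key and SHA1 a 40-character key, so the number of path segments produced by a given path scheme depends on the configured algorithm. Identical content always yields the same key, which is what allows the storage system to reference content by its signature.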