Cheminformatics Fingerprint Formats Wiki
The goal of the FPF project is to define and promote two file formats for storing and exchanging cheminformatics fingerprint data sets. For details on each format, see FPS and FPB.
Many cheminformatics tools generate fingerprints but there is no easy way to share that data between tools. In some sense the data is so simple that people would rather write a special purpose format than push for a more standardized format. As a result, almost every research group has its own format.
Use Cases
The FPF project is meant to address the following use cases.
Use case #1: Web service for 3-nearest similarity searches
A research group has a set of structures with associated experimental values. The information is updated every month. They want to provide a web service which takes an input structure and finds the 3 nearest compounds with better than 70% Tanimoto similarity, and reports the similarity score and structure information.
There are no ready-made tools for this, except for groups that have a chemistry-aware database. The pieces to build it are available from several projects, but large parts need to be filled in. Ideally there should be a set of command-line tools and library functions for working with FPF files.
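The search in this use case can be sketched in a few lines of Python once the fingerprints are available as byte strings. This is only an illustration, not part of any FPF library: the function names `tanimoto` and `nearest` are hypothetical, the targets are assumed to be (identifier, fingerprint) pairs of equal length, and treating the 70% cutoff as inclusive is an assumption.

```python
import heapq


def tanimoto(a: bytes, b: bytes) -> float:
    """Tanimoto similarity of two equal-length fingerprints stored as bytes."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    # Count the bits set in both fingerprints and the bits set in either one.
    both = sum(bin(x & y).count("1") for x, y in zip(a, b))
    either = sum(bin(x | y).count("1") for x, y in zip(a, b))
    # Two all-zero fingerprints have no bits in common; report 0.0 here.
    return both / either if either else 0.0


def nearest(query: bytes, targets, k: int = 3, threshold: float = 0.7):
    """Return up to k (score, identifier) hits at or above the threshold,
    ordered from most to least similar."""
    scored = ((tanimoto(query, fp), ident) for ident, fp in targets)
    return heapq.nlargest(k, (hit for hit in scored if hit[0] >= threshold))
```

A web service would wrap `nearest` and look up the structure information for each returned identifier. This linear scan is the simplest possible approach; it does not use any of the search optimizations the formats are meant to enable.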
Use case #2: Data provenance
It's time to write the paper. Where did this data set come from? Which program generated it and with which options? Was it the one which used the buggy SMARTS definitions?
Most formats don't track this information. While it's impossible to be perfect, some information would help.
Use case #3: New fingerprint types
A researcher develops a new fingerprint scheme and wants to compare its applicability with that of existing fingerprints, including the linear hash in OpenBabel and the topological hash in RDKit.
Currently that requires quite a bit of work to figure out how to use each program to read the input files and generate the fingerprints which can be used for the analysis.
Separating fingerprint generation from fingerprint use also makes it easier to implement things like N*N similarity clustering on a cluster without needing to install the chemistry tools on all the machines.
Use case #4: Comparison of search algorithms
An algorithms developer comes up with a new scheme for fast similarity searching and wants to compare the effectiveness of the algorithm against other implementations.
While there are many published papers on this topic, few of the algorithms and data sets are available for direct comparison. I don't think the problem is the lack of a common format, but I suggest it might help.
Fields in fingerprint records
The fingerprints should be relatively small (under 10,000 bits), dense fingerprints, which are best stored as a byte array of bit flags. Fingerprints may be of any positive length. Count fingerprints and large sparse fingerprints are specifically not covered.
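As an illustration of what "a byte array of bit flags" means, the following Python sketch packs a list of on-bit positions into bytes. The helper names are hypothetical, and the bit ordering used here (bit 0 is the least significant bit of the first byte) is an assumption made for the example; the FPS and FPB descriptions are the place to look for the actual convention.

```python
def bits_to_bytes(on_bits, num_bits: int) -> bytes:
    """Pack on-bit positions into a byte string of ceil(num_bits / 8) bytes.

    Assumed ordering: bit 0 is the least significant bit of byte 0.
    """
    if num_bits <= 0:
        raise ValueError("fingerprints must have a positive length")
    data = bytearray((num_bits + 7) // 8)
    for bit in on_bits:
        if not 0 <= bit < num_bits:
            raise ValueError(f"bit {bit} is outside 0..{num_bits - 1}")
        data[bit // 8] |= 1 << (bit % 8)
    return bytes(data)


def is_bit_set(data: bytes, bit: int) -> bool:
    """Test a single bit flag, using the same assumed ordering."""
    return bool(data[bit // 8] & (1 << (bit % 8)))
```

A fingerprint whose length is not a multiple of 8 is padded with zero bits in its final byte under this scheme.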
Each fingerprint has an associated identifier, which should be a short, unique string. Identifiers should be under 10-20 characters (although nothing precludes longer strings), should be composed of printable ASCII characters, and must not contain whitespace characters.
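The identifier rules can be checked mechanically. The sketch below is one possible reading of them, with hypothetical function names: printable ASCII without whitespace is taken to mean the code points 0x21 through 0x7E, empty identifiers are rejected (an assumption), and length is not enforced because longer strings are allowed.

```python
from collections import Counter


def is_valid_identifier(ident: str) -> bool:
    """True if ident is non-empty and uses only printable, non-whitespace ASCII."""
    return bool(ident) and all(0x21 <= ord(ch) <= 0x7E for ch in ident)


def duplicate_identifiers(idents):
    """Return the identifiers that occur more than once, in first-seen order."""
    counts = Counter(idents)
    return [ident for ident, n in counts.items() if n > 1]
```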
Overview of the formats
The "FPS" format is a stream-oriented text format. Some specific design goals are:
- easy to understand, generate, and parse
- line and text oriented
- can be streamed, with constant memory overhead
- similar to existing formats in this field
- fingerprints represented as hex-encoded strings
- trivially portable
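To show what streaming a line-oriented, hex-encoded format with constant memory overhead could look like, here is a small Python reader. The record layout is an assumption made for this sketch and is not taken from the FPS description: each record line is taken to be a hex-encoded fingerprint, whitespace, then the identifier, and lines starting with "#" are skipped. The function name `read_records` is hypothetical.

```python
def read_records(lines):
    """Yield (identifier, fingerprint bytes) one record at a time.

    Assumed layout per line: <hex fingerprint> <whitespace> <identifier>.
    Blank lines and lines starting with '#' are skipped (also an assumption).
    """
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 1)
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected a fingerprint and an identifier")
        hex_fp, ident = fields
        # bytes.fromhex raises ValueError on malformed hex input.
        yield ident, bytes.fromhex(hex_fp)
```

Because `read_records` is a generator over any iterable of lines, it works on an open file, a pipe, or a list of strings, and only one record is held in memory at a time.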
The "FPB" format is a cross-platform binary format. Some specific design goals are:
- cross-platform binary format
- block oriented
- more compact than the FPS format
- fast load times, with easily determined seek locations
- supports word and page alignment, for CPU- and OS-specific optimizations
- fingerprints represented as a sequence of bytes
- allow fingerprints sorted by popcount, for Baldi optimization
- open to future extensibility
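The popcount ordering matters because of a simple bound on the Tanimoto score: two fingerprints with a and b bits set cannot score higher than min(a, b) / max(a, b). For a query with a bits set and a threshold t, only targets with between a*t and a/t bits set can be hits. The Python sketch below shows how sorted popcounts turn that bound into a contiguous slice of candidates. It works on in-memory (identifier, fingerprint) pairs and says nothing about the actual FPB layout; the function names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right


def popcount(fp: bytes) -> int:
    """Number of bits set in a fingerprint."""
    return bin(int.from_bytes(fp, "big")).count("1")


def sort_by_popcount(records):
    """Order (identifier, fingerprint) pairs by increasing popcount."""
    return sorted(records, key=lambda rec: popcount(rec[1]))


def candidate_slice(sorted_records, query: bytes, threshold: float):
    """Return (start, end) indices of records that could reach the threshold."""
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be in (0, 1]")
    counts = [popcount(fp) for _, fp in sorted_records]
    a = popcount(query)
    # A small tolerance keeps float rounding from excluding a valid candidate.
    low = math.ceil(a * threshold - 1e-9)
    high = math.floor(a / threshold + 1e-9)
    return bisect_left(counts, low), bisect_right(counts, high)
```

Only `sorted_records[start:end]` needs a full Tanimoto calculation. In a real file the popcounts would be known in advance rather than recomputed for each query, as they are here for brevity.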
Software Download
A Python library to work with the FPF formats is available from the Mercurial repository. It currently supports making FPS output from the RDKit, OpenBabel and OpenEye libraries.
Mailing list
To join the FPF mailing list, send an email to sympa@fpf.kenai.com with the subject "subscribe fpf-discuss".





