There is a big problem with current sequence databases - they all use different identifiers and accession numbers. This means cross-referencing databases is a complicated exercise, especially as the quantity of databases increases, or if you need to compare “wild” DNA sequences.
Seqhash is a simple algorithm to produce consistent identifiers for any genetic sequence. The basic premise of the Seqhash algorithm is to hash sequences with the hash being a robust cross-database identifier. Sequences themselves shouldn’t be used as a database index (often, they’re too big), so a hash based off of a sequence is the next best thing. Usability wise, you should be able to Seqhash any rotation of a sequence in any direction and get a consistent hash.
For example, plasmids from different providers such as Addgene, iGEM, or FreeGenes could have slightly different names as they are passed from person to person to organization. A Seqhash for the plasmids in those repositories would ensure stable cross-platform identification of plasmid material.
A Seqhash is separated into 3 different elements divided by underscores. It looks like the following:
v1_DCD_4b0616d1b3fc632e42d78521deb38b44fba95cca9fde159e01cd567fa996ceb9
The first element is the version tag (v1 for version 1). If there is ever a Seqhash version 2, this tag will differentiate seqhashes. The second element is the metadata tag, which has 3 letters. The first letter codes for the sequenceType (D for DNA, R for RNA, and P for Protein). The second letter codes for whether or not the sequence is circular (C for Circular, L for Linear). The final letter codes for whether or not the sequence is double stranded (D for Double stranded, S for Single stranded). The final element is the blake3 hash of the sequence (once rotated and complemented).
The Seqhash algorithm makes several opinionated design choices, primarily to make working with Seqhashes more consistent and easy. The Seqhash algorithm only uses a single hash function, Blake3, and only operates on DNA, RNA, and Protein sequences. These identifiers will be seen by human beings, so versioning and metadata is attached to the front of the hashes so that a human operator can quickly identify problems with hashing.
If the sequence is DNA or RNA, the Seqhash algorithm needs to know whether or not the nucleic acid is circular and/or double stranded. If circular, the sequence is rotated to a deterministic point. If double stranded, the sequence is compared to its reverse complement, and the lexiographically minimal sequence is taken (whether or not the min or max is used doesn’t matter, just needs to be consistent). If the sequence is a Protein sequence, the Seqhash algorithm supports circularity in the rare case that you are working with a circular protein.
If the sequence is RNA, the sequence will be converted to DNA before hashing. While the full Seqhash will still be different between RNA and DNA (due to the metadata string), the hash afterwards will be the same. This makes it easy to cross reference DNA and RNA sequences. This becomes important in deduplication efforts for large sequence databases.
For DNA or RNA sequences, only ATUGCYRSWKMBDHVNZ characters are allowed. For Proteins, only ACDEFGHIKLMNPQRSTVWYUO*BXZ characters are allowed in sequences. Selenocysteine (Sec; U) and pyrrolysine (Pyl; O) are included in the protein character set - usually U and O don’t occur within protein sequences, but for certain organisms they do, and it is certainly a relevant amino acid for those particular proteins. B (Asx), X (Xaa), and Z (Glx) are included in the UniProtKB/SwissProt data dump, so these are included as valid amino acid sequences, even if their occurrence is historical baggage from a different time.
The reference implementation for Seqhash is in the Golang Poly package. Seqhash is also available for Python through pypi. If you are interested in Seqhash in your own language, please contact me. It is a fairly simple algorithm to implement.
Keoni Gandall
PS: PS: I had this idea for a while but didn’t know what the algorithm was called for deterministic points until I saw this Ginkgo Bioworks blog post.