Does linux DC++ read precalculated hashes from a file?

Asked by Vasco da Gama on 2012-03-06

Hello,

I would like to know whether precalculated hashes for files can be stored on disk, as is often done with md5sums or sha1sums.

For example, a file myvideo.mkv.tth next to myvideo.mkv, or a single tthsums file in a folder holding the hashes of all files within that folder.

Greets,

Vasco

Question information

Language: English
Status: Solved
For: LinuxDC++
Assignee: No assignee
Solved by: Vasco da Gama
Solved: 2012-03-07
Last query: 2012-03-07
Last reply: 2012-03-07

Steven Sheehy (steven-sheehy) said : #1

Not sure I understand your use case for needing such data, but you can find all the calculated hashes in $HOME/.dc++/HashIndex.xml

Vasco da Gama (iqb) said : #2

My use case is the following: I have a lot of data (say 30 TB) that I want to share via DC, along with a website for nicer search/browse capabilities. I don't want to hash this data twice, and I would rather not parse this XML file. I would prefer to store the hashes beside the files so that I can move the files around without rehashing and reuse the hashes for other purposes without having to parse XML.

I hope this clarifies my use case. But I assume from your answer that there is no such mechanism in linuxdcpp.

Steven Sheehy (steven-sheehy) said : #3

Hashing is meant to be a read-only operation: it constructs a key that uniquely identifies a file and leaves the shared folder as it was found. The approach you're suggesting would violate that read-only property and would mean writing a large number of extra files into the shared folder to store the hashes. It makes much more sense to store the TTHs in a centralized file like HashIndex.xml.

I don't see why parsing the XML is a problem. If you're capable of writing a website then you should be capable of parsing an XML file using your language of choice and an XML library. You could even parse it using a simple one-line command:

gawk -F '\"' '/<File/ {printf "%s %s\n", $2, $6}' HashIndex.xml
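
For anyone who prefers an XML library over field splitting, the same extraction could be sketched in Python roughly as follows. This is only an illustration: the attribute names `Name`, `TimeStamp` and `Root` are assumptions inferred from the field positions the gawk command uses, and the sample entry (path, timestamp, hash string) is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. The attribute names (Name, TimeStamp, Root) are an
# assumption inferred from the field positions used by the gawk one-liner.
SAMPLE = """<HashStore>
  <Files>
    <File Name="/data/myvideo.mkv" TimeStamp="1331000000" Root="EXAMPLETTHROOT"/>
  </Files>
</HashStore>"""


def list_hashes(xml_text):
    """Return (file name, TTH root) pairs for every <File> element."""
    root = ET.fromstring(xml_text)
    return [(f.get("Name"), f.get("Root")) for f in root.iter("File")]


# Print one "name hash" line per file, like the gawk command does.
for name, tth in list_hashes(SAMPLE):
    print(name, tth)
```

To run this against the real index, the document could be loaded with `ET.parse(path).getroot()` instead of `ET.fromstring`, with the rest unchanged.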

Vasco da Gama (iqb) said : #4

I don't want linuxdcpp to write a file into each folder. I want to calculate the hashes on my own and store them in a file; linuxdcpp should then use them if they exist, or calculate them if no such file exists.
It may be the case that linuxdcpp runs on a box with low computing power, accesses a lot of data via NFS, and serves it via DC. In this case it would be much better if the hashes could be precomputed on the NFS servers, which together have a lot more power than the DC server.

Steven Sheehy (steven-sheehy) said : #5

You could manually craft this HashIndex.xml to add the filename, size, timestamp and TTH to it (with linuxdcpp not running). You'd have to add a <Hash...> entry and a <File...> entry in the XML for each file. On next startup linuxdcpp will get a list of all files in the shared folders and hash any that are missing from the index. If a file already has an entry, it will check that the file size is the same and that the stored timestamp is newer than the file's last write time, and will re-hash the file only if that check fails. So if you craft it properly it will do just as you suggested.

Or you could just run linuxdcpp on the server (directly or via remote X) and hash it there. Then copy HashIndex.xml and HashData.dat over to the slower machine when it's done. Neither option is probably ideal for you, but both are workable. Your use case is not common enough to warrant developing such a feature, since most people don't hash files themselves, have 30 TB of data, fast NFS servers, etc.
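
To make the startup check from the first option concrete, here is a minimal Python sketch of the decision as described in this reply. It is not linuxdcpp's actual code: the function name `needs_rehash` is hypothetical, the timestamps are assumed to be plain integers in the same unit, and treating an equal timestamp as still valid is also an assumption.

```python
def needs_rehash(file_size, file_mtime, stored_size, stored_timestamp):
    """Return True if a stored index entry no longer matches the file.

    Mirrors the check described in the reply: the size must be unchanged
    and the stored timestamp must not be older than the file's last write
    time. Accepting an equal timestamp is an assumption of this sketch.
    """
    if file_size != stored_size:
        return True
    return stored_timestamp < file_mtime


# Entry written after the file was last modified: the stored hash is reused.
print(needs_rehash(700, 1331000000, 700, 1331000500))  # False
# File modified after the entry was written: it would be hashed again.
print(needs_rehash(700, 1331000900, 700, 1331000500))  # True
```

When crafting entries by hand, the size and last write time of each file could be taken from `os.stat(path).st_size` and `int(os.stat(path).st_mtime)`, so that the entries pass a check like this one.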

Vasco da Gama (iqb) said : #6

Although I would like the hashing and the DC file serving to be separated (thinking in the Unix toolbox style ;-)), I can understand why that is not the case.

So thanks, Steven, for your kind answers, and happy programming!