Two paths to the files hashing to the same value

Asked by Rostik Slipetskyy

Theoretically, two inputs to the hash function can result in the same output. Suppose that path accountA/containerA/objectA produces the same hash as accountB/containerB/objectB.

Based on the hash, partition is calculated. Since hashes are the same, the partition will be the same. And since datadir directory is created based on hash, these two objects will end up in the same directory.

Then, when an object is uploaded, it is stored in datadir and receives the name which equals to the timestamp. All the other files in that directory with older timestamp are deleted. So if objectA was uploaded before objectB, upon upload of objectB, objectA will be deleted.

Am I wrong somewhere in my reasoning?

Question information

Language:
English Edit question
Status:
Answered
For:
OpenStack Object Storage (swift) Edit question
Assignee:
No assignee Edit question
Last query:
Last reply:
Revision history for this message
John Dickinson (notmyname) said :
#1

If you find two (account, container, object) sets that hash to the same value (even accounting for the salt we use), then, to swift, they are referencing the same object. You will have simply found two names to the same logical object.

This applies to accounts and account, container pairs, too.

--John

On May 6, 2011, at 1:05 PM, Rostik Slipetskyy wrote:

> New question #156307 on OpenStack Object Storage (swift):
> https://answers.launchpad.net/swift/+question/156307
>
> Theoretically, two inputs to the hash function can result in the same output. Suppose that path accountA/containerA/objectA produces the same hash as accountB/containerB/objectB.
>
> Based on the hash, partition is calculated. Since hashes are the same, the partition will be the same. And since datadir directory is created based on hash, these two objects will end up in the same directory.
>
> Then, when an object is uploaded, it is stored in datadir and receives the name which equals to the timestamp. All the other files in that directory with older timestamp are deleted. So if objectA was uploaded before objectB, upon upload of objectB, objectA will be deleted.
>
> Am I wrong somewhere in my reasoning?
>
> --
> You received this question notification because you are a member of
> Swift Core, which is an answer contact for OpenStack Object Storage
> (swift).

Revision history for this message
Rostik Slipetskyy (rostik-2000) said :
#2

> to swift, they are referencing the same object

But for two users that are uploading their files to swift and incidently happen to get the same hash value, these are two different objects. Isn't it a problem?

Revision history for this message
Rostik Slipetskyy (rostik-2000) said :
#3

The probability of hash function giving one output to two different inputs is small, but if swift is ever used as a back-end for gmail or facebook, sooner or later such a situation might happen.

In order to test what will happen then, I have created a dummy implementation of md5 function which always results in the same hash value. Afterwards, I have changed swift.common.utils.py in line 26 to include my dummy version of md5 instead of hashlib.md5.

*** TEST ***

Actions below test for a situation when hash("accountA/files/file1 + HASH_PATH_SUFFIX") == hash("accountB/files/file2 + HASH_PATH_SUFFIX"):

1) python st -v -A http://10.0.0.2:11000/v1.0 -U accountA:adminA -K passadmina upload files file1
2) python st -v -A http://10.0.0.2:11000/v1.0 -U accountB:adminB -K passadminb upload files file2
3) python st -v -A http://10.0.0.2:11000/v1.0 -U accountA:adminA -K passadmina download files file1

As the result of action 2, file2 will overwrite file 1. As the result of action 3, user from accountA will successfully download the content of the file uploaded by user from accountB.

*** POSSIBLE FIX ***
The path to the file should not depend only on the output of the hash function, but also from some other information that is different for different accounts, for example account name.

Revision history for this message
gholt (gholt) said :
#4

You may wish to read: http://en.wikipedia.org/wiki/Birthday_Paradox and the Probability Table contained within. You would have to have 26,000,000,000,000,000 objects before you had even a 0.000001% chance of collision.

The problem with using the account name in the path is that you would have to encode the account name to avoid path naming problems and the account name without encoding is 256 bytes. That can take quite a strain on your inode space.

Revision history for this message
Rostik Slipetskyy (rostik-2000) said :
#5

The probability is indeed too low to cause any problems.

However, imagine that a malicious user generates a number of paths "account/container/object" that will hash to the same value by changing "container/object" suffix. (It is possible to perform this attack for md5, see for example http://www.win.tue.nl/hashclash/ChosenPrefixCollisions/).

Afterwards, the user uploads different files with the names that will have the same hash to OpenStack. The second file overwrites the first. You may think that such a malicious user harms only himself. BUT! The provider can be sued by that malicious user for data loss (we presume an existence of a legal agreement between a provider and a customer).

Of course, malicious user will need to know HASH_PATH_SUFFIX in order to perform this attack. But it is still possible using insider knowledge (for example by dishonest staff at provider's side disclosing HASH_PATH_SUFFIX value or even initiating the attack).

Revision history for this message
John Dickinson (notmyname) said :
#6

Yes, a collision could happen, but I believe swift is protected in 2 ways:

One is that the hash_path_suffix is just that, a suffix. The attack mentioned in your linked paper deals with generating suffixes for given prefixes. Since the hash_path_suffix is appended to anything that the user can provide to be hashed, unless I am misunderstanding the linked paper, the mentioned attack doesn't apply. In fact, this is exactly why we append the hash_path_suffix instead of prepend it.

Second, the hash_path_suffix is designed to be secret, and we caution deployers to keep it secret in the docs. As you noted, this does not prevent attacks from an insider. However, if an insider has access to the hash_path_suffix, that insider is very likely to not need to compute md5 collisions to cause chaos. The hash_path_suffix should be kept as secret as an /etc/shadow file or the secret key of a key pair.

Using a suffix also prevents attackers from generating many hashes and selecting those that are "close" in an effort to fill up a single storage volume in the swift cluster. Without knowledge of the hash_path_suffix, it is very difficult for an attacker to find "close" hashes.

--John

Revision history for this message
gholt (gholt) said :
#7

This attack vector would require knowing both /<account>/ prefixes and the HASH_PATH_SUFFIX with control only over the middle part with limited data size, which that paper does not describe (it requires being able to change both paths' suffixes). Assuming such an attack vector was feasible it would require the inside knowledge. Knowing that would indicate current or previous access to internal systems. With current access, there are much easier ways to cause problems. With current or previous access and the fact that the name would have to be quite odd looking to work would easily indicate the malicious intent.

All that said, if you have a solution in mind that isn't more harmful to the system than the remote possibility of the threat... It's all a matter of risk management really. :)

Revision history for this message
Rostik Slipetskyy (rostik-2000) said :
#8

> we caution deployers to keep it secret in the docs (...) The hash_path_suffix should be kept as secret as an /etc/shadow file or the secret key of a key pair.

I just rechecked the Administrator manual (http://docs.openstack.org/cactus/openstack-object-storage/admin/os-objectstorage-adminguide-cactus.pdf), and what I found with respect to swift_hash_path_suffix was:

"# random unique string that can never change (DO NOT LOSE)"

Maybe I have missed other places where the necessity to keep swift_hash_path_suffix secret were mentioned, but it might make sense to indicate it in Admin manual and also at etc/swift.conf-sample from the distribution. Just as a precaution measure.

Revision history for this message
John Dickinson (notmyname) said :
#9
Revision history for this message
Rostik Slipetskyy (rostik-2000) said :
#10

> This attack vector would require knowing both /<account>/ prefixes and the HASH_PATH_SUFFIX with control only over the middle part with limited data size, which that paper does not describe.

Just a comment. It is still possible to vary the middle part to achieve the same hash value. For example in http://www.win.tue.nl/hashclash/Nostradamus/ there are examples of different PDF files which predict different outcomes of US elections. All of the provided files have the same Md5 hash. If one makes byte-to-byte comparison of two such files, he will note that the prefixes and suffixes are identical.

The above considerations on "keeping hash path value secret" and "risk management" still stay valid. But maybe eventually it might be advisable to refuse from using MD5 even despite its "general availability, good distribution, and adequate speed".

Revision history for this message
Chuck Thier (cthier) said :
#11

There are 2 problems with that attack vector:

1. The attacker has to have total control over both documents in order to make the hashes collide. At the very worst, an attacker could only make two of his own objects have hash collisions (which is still doubtful). There are currently no preimage attacks on MD5 that would allow a user to take an arbitrarily hashed object, and fabricate a new object that generates the same hash.

2. All of the attacks you mention require a sizable chunk of data that is well beyond the 1024 bytes that we allow in the path to the object.

All that said, if you are truely paranoid enough about it, it wouldn't be hard to change the code to use a different hashing mechanism, and not much more difficult to make it configurable. Patches welcome :)

Can you help with this problem?

Provide an answer of your own, or ask Rostik Slipetskyy for more information if necessary.

To post a message you must log in.