Valerie Aurora ([info]valhenson) wrote,
@ 2007-11-19 08:27:00
Previous Entry  Add to memories!  Tell a Friend!  Next Entry
The code monkey's guide to cryptographic hashes for content-based addressing

At long last, I've written and published the "compare-by-hash for programmers" article everyone's always been asking for. You can read it chopped into 17 pieces and partially obscured by floating ads here:

http://www.linuxworld.com/news/2007/111207-hash.html

(My editor says: Please please complain about this! No one believes me when I say this is bad!) Or you can read it one piece with full size tables, etc. here:

http://www.linuxworld.com/cgi-bin/mailto/x_linux.cgi?pagetosend=/export/home/httpd/linuxworld/news/2007/111207-hash.html

I'm always looking for new article suggestions, especially for the Kernel Hacker's Bookshelf (search down the page for the entry). Writing is fun!

The part of the article that Edward Tufte would be most proud of are the two tables about hash function life cycles:

Stages in the life cycle of cryptographic hash functions
Stage Expert reaction Programmer reaction Non-expert ("slashdotter") reaction
Initial proposal Skepticism, don't recommend use in practice Wait to hear from the experts before adding to OpenSSL SHA-what?
Peer reviewal Moderate effort to find holes and garner an easy publication Used by a particularly adventurous developers for specific purposes Name-drop the hash at cocktail parties to impress other geeks
General acceptance Top-level researchers begin serious work on finding a weakness (and international fame) Even Microsoft is using the hash function now Flame anyone who suggests the function may be broken in our lifetime
Minor weakness discovered Massive downloads of turgid pre-prints from arXiv, calls for new hash functions Start reviewing other hash functions for replacement Long semi-mathematical posts comparing the complexity of the attack to the number of protons in the universe
Serious weakness discovered Tension-filled CRYPTO rump sessions! A full break is considered inevitable Migrate to new hash functions immediately, where necessary Point out that no actual collisions have been found
First collision found Uncork the champagne! Interest in the details of the construction, but no surprise Gather around a co-worker's computer, comparing the colliding inputs and running the hash function on them Explain why a simple collision attack is still useless, it's really the second pre-image attack that counts
Meaningful collisions generated on home computer How adorable! I'm busy trying to break this new hash function, though Send each other colliding X.509 certificates as pranks Tell people at parties that you always knew it would be broken
Collisions generated by hand Memorize as fun party trick for next faculty mixer Boggle Try to remember how to do long division by hand

Life cycles of popular cryptographic hashes (the "Breakout" chart)
Function199019911992199319941995199619971998199920002001200220032004200520062007
Snefru                                    
MD4                                    
MD5                                    
MD2                                    
RIPEMD                                    
HAVAL-128                                    
SHA-0                                    
SHA-1                                    
RIPEMD-128                                    
RIPEMD-160                                    
SHA-2 family                                    
KeyUnbrokenWeakenedBroken
The Hash Function Lounge has an excellent list of references for the dates.




(Post a new comment)

nice
[info]frisky070802
2007-11-19 05:14 pm UTC (link)
I like the article and I especially like the stages descriptive text.

Fred

(Reply to this)


[info]taniwha_nz
2007-12-01 04:48 am UTC (link)
So what happened in 2004? Did the price of alcohol fall thus placing swathes of "experts" in the Balmer peak for extended periods?

(Reply to this) (Thread)


[info]valhenson
2007-12-02 05:56 pm UTC (link)
2004 was a pretty astounding year; I have some vague conspiracy theories about that not worth the electrical potential I use to think them. The real reason is that some brilliant PRC researchers found techniques that exploited common design elements of several of the top hash functions - and published them all in the same paper. Another suspicion I have is that most other researchers were not interested in bothering to fully break hash functions other than SHA-*, since little glory would accrue to them for proving what "everyone" (i.e., cryptographers) already knew. For example, I suspect Joux could have broken them too, but he was busy finding a collision in SHA-0 (also reported in 2004).

(Reply to this) (Parent)


[info]bsittler
2007-12-17 12:53 am UTC (link)
that's a great visualization. i'm shocked that git was designed using sha-1 and has apparently still not been changed.

(Reply to this) (Thread)


[info]valhenson
2007-12-22 02:45 am UTC (link)
So the case where only trusted users can put data in the system is usually pretty annoying. When the hash isn't broken, they tell you that the system works because no one can generate collisions. When the hash is broken, they tell you that it doesn't matter if you can generate collisions because the users won't do that deliberately. To which my reply is, "Then why don't you use a cheaper hash?" The only one that gets this right is rsync - the hashes are pretty wimpy, but exactly as good as they need to be (and much faster to calculate).

So it's fine that git uses SHA-1 and will continue to use SHA-1. Git uses SHA-1 as a cheap way of creating UUIDs for objects, not for any security reason. But there's absolutely no point in using the latest greatest cryptographically secure hash if you don't think there's a reason to upgrade it when it's broken. Git would also work just fine if it was using MD5, or even MD4 - and it would go lots faster.

(Reply to this) (Parent)(Thread)


[info]bsittler
2007-12-22 03:05 am UTC (link)
I guess I worry a bit about accidental collisions. I know they should be vanishingly unlikely if the objects I'm storing in git are all of similar types, but it seems like that might break down once I start putting a lot of audio and photo captures in there -- for instance, if I used git to keep a complete revision history of my home directory. It's not a case of deliberately causing collision, it's more that I worry I might accidentally find out my backup is missing some day because that crucial .wav file hashed the same as the .png from a friend's amateur astronomy CCD exercise. I realize the probability of a random collision is very small, but who knows when one widely-used file format and another widely-used file-format will happen to fall afoul of whatever hash function was popular when the backup system was designed...

(Reply to this) (Parent)(Thread)


[info]bsittler
2007-12-22 03:08 am UTC (link)
Oh, and exactly these sorts of problems did arise when using CRC32, of course. Otherwise I'd be completely blissful in my ignorance, rather than needlessly fearful ;)

(Reply to this) (Parent)


[info]valhenson
2007-12-22 04:55 am UTC (link)
You're preaching to the choir. Compare-by-hash is somewhat defensible when it's used for a specific purpose by educated persons who trust each other. When it becomes invisible systems software, those qualifications go away.

One way I think about it that we see certain sequences of bits far more often than expected by probability alone (and others far less often).
Some of the sequences of bits that will appear more often than by chance within a few years are sequences that have colliding SHA-1 hashes. These sequences will be replicated and stored and generally spread through computer systems because they are so interesting and unique. And then there's someone quite reasonably using git to maintain their home directory who downloads the pair of sequences with the same SHA-1 hash. Yeah, probably git hashes in a lot of other information as well as the data, but it's just not that hard to construct something that collides in git once you have the original vanilla collisions.

(Reply to this) (Parent)(Thread)


[info]bsittler
2007-12-22 04:05 pm UTC (link)
I now think I will not use git, monotone or mercurial to back up my home directory. Thanks :)

(Reply to this) (Parent)

ouch.
[info]bsittler
2008-12-31 05:34 am UTC (link)
you probably saw this already, but [info]ephermata and friends broke ssl (also on /.) using weaknesses in md5. as noted there, combined with dns hijacking, this is basically the theoretical end of web security as we know it. if only the browser hackers had heeded your article when you published, we might be in a better position to work around this problem today.

(Reply to this) (Thread)

Re: ouch.
[info]ioerror
2009-01-29 09:39 am UTC (link)
I hear, according to the experts, no one really cares about certificates anyway. :-/

(Reply to this) (Parent)


Create an Account
Forgot your login or password?
Login w/ OpenID
English • Español • Deutsch • Русский…