Sunday, August 28, 2011

Automatic spelling corrections on GitHub

English has never been one of my strong points (as is fairly obvious from reading my blog), so my latest side project might surprise you a bit. Inspired by the results of Tarsnap’s bug bounty and the first pull request I received for a new project (slashem, a type-safe rogue-like DSL for querying Solr in Scala), I decided to write a bot for GitHub to fix spelling mistakes.



The code itself is very simple (albeit not very good; it was written after I got back from clubbing @ JWZ’s club [DNA Lounge]). There is something about a lack of sleep that makes Perl code and regexes seem like a good idea. If, despite the previous warnings, you still want to look at the code, https://github.com/holdenk/holdensmagicalunicorn is the place to go. It works by doing a GitHub search for all the README files in Markdown format and then running a limited spell checker on them. Documents with a known misspelled word are flagged and output to a file. Thanks to the wonderful GitHub API, the next steps are easy: it forks the repo and clones it locally, performs the spelling correction, commits, pushes, and submits a pull request.
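
To give a rough idea of the flagging step, here is a minimal sketch in Python (not the bot's actual Perl code). The word list, the function names, and the repo-name-to-README dictionary are all assumptions made up for illustration; the real bot gets its documents from a GitHub search and writes the flagged ones to a file.

```python
import re

# Hypothetical sample of known misspellings, for illustration only;
# this is not the bot's real word list.
KNOWN_MISSPELLINGS = {"recieve", "seperate", "occured"}

WORD_RE = re.compile(r"[A-Za-z]+")


def find_misspellings(text):
    """Return the sorted list of known misspellings that appear in text."""
    words = {w.lower() for w in WORD_RE.findall(text)}
    return sorted(words & KNOWN_MISSPELLINGS)


def flag_documents(readmes):
    """Map each repo name to its misspellings, skipping clean READMEs.

    `readmes` is an assumed {repo_name: readme_text} dictionary.
    """
    flagged = {}
    for repo, text in readmes.items():
        hits = find_misspellings(text)
        if hits:
            flagged[repo] = hits
    return flagged
```

Calling flag_documents on a handful of README texts returns only the repos worth forking, which keeps the expensive fork, clone, and pull request steps limited to documents that actually need a fix.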



The spelling correction is based on Pod::Spell::CommonMistakes; it works from a very restricted mapping of misspelled words to their corrections.
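
The idea of a restricted misspelling-to-correction mapping can be sketched as follows. This is Python rather than the Perl the bot is written in, it does not use Pod::Spell::CommonMistakes, and the three dictionary entries and the function names are hypothetical examples rather than the module's actual data or interface.

```python
import re

# Hypothetical entries, for illustration only; the real mapping comes
# from Pod::Spell::CommonMistakes and is not reproduced here.
CORRECTIONS = {
    "recieve": "receive",
    "seperate": "separate",
    "occured": "occurred",
}

# Match any known misspelling as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, CORRECTIONS)) + r")\b",
    re.IGNORECASE,
)


def _match_case(original, replacement):
    """Carry over simple capitalization from the misspelled word."""
    if original.isupper():
        return replacement.upper()
    if original[0].isupper():
        return replacement.capitalize()
    return replacement


def correct_spelling(text):
    """Replace each known misspelling in text with its correction."""
    def fix(match):
        word = match.group(0)
        return _match_case(word, CORRECTIONS[word.lower()])
    return _PATTERN.sub(fix, text)
```

Because only whole words from the mapping are touched, a sketch like this leaves anything it does not recognize alone (including inflected forms such as "recieved"), which is the trade-off of a small dictionary: few false positives, but plenty of missed mistakes.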



Writing a “future directions” section always seems like such a cliché, but here it is anyway. The code as it stands is really simple: for example, it only handles one repo of a given name, the dictionary is small, etc. The next version should probably also try to submit corrections only against the canonical repo. Other future plans include extending the dictionary. In the longer term I think it would be awesome to attempt to detect really simple bugs in actual code (things like memcpy(dest,0,0)).



You can follow the bot on Twitter: holdensunicorn.



Comments, suggestions, and patches are always appreciated. - holdenkarau (although I’m going to be AFK at Burning Man for a while; you can find me @ 6:30 & D)

19 comments:

Peter Petermann said...

saw someone i never heard of before fix spelling in one of my OS projects
googled a bit around, and found this post - before that i wasn't sure if that was a bot or some english teacher gone mad..

thanks for the patches your tool created!

Holden Karau said...

@Peter: I'm glad you found the patch useful :) If you have any suggestions on how to improve it that would be ++awesome :D

Rich Jones said...

Ha! I have made something similar, although I haven't announced it yet.

https://github.com/Miserlou/WhitespaceBot

Great minds? :)

Rich Jones said...

Ha! I have made something similar, although I haven't announced it yet:

https://github.com/Miserlou/WhitespaceBot

Great minds? :)

Anonymous said...

Very nice, but I think you should change the website address on holdensmagicalunicorn's github account to point to an explanation about the bot (such as this blog post).

Anonymous said...

You wrote:

"The code its self is very simple..."

What would your automatic spelling correction bot fix in this sentence fragment?

Anonymous said...

Your very last sentence: "In the longer term I think it would be awesome to attempt detect really simple bugs in actual code (things like memcpy(dest,0,0))."

It would be cool to run llvm clang over all C/Obj-C/C++ projects and use its diagnostic abilities to generate a patch that applies "fix-it hints". See http://clang.llvm.org/diagnostics.html

John C said...

By "its self" I think he means that his Perl code is self aware.

Xanni said...

conical -> canonical?

Anonymous said...

"agument" => "augument"

...What's augument?

Thomas Taschauer said...

Somehow similar to one of my projects: Bloki - http://bloki.tomtasche.at/ :)

Let's fix all those nasty misspellings! ;)

Have a great day
Tom

Anonymous said...

"conical -> canonical?"

That's really rather ionic!

Anonymous said...

This was great, thanks!

Lucas De Marchi said...

I have a project hosted at github to find misspellings on source code.

It's reported to successfully work on large projects like the Linux kernel, FreeBSD, oFono, ConnMan etc. It uses its own database for common misspelled words (based on the one found on wikipedia). The first time it was used for the Linux Kernel, it generated a patch of thousands of lines, which was accepted by Linus.

I wonder if you could use it to correct also the source code in github projects instead of only the README files. Here it is:
https://github.com/lucasdemarchi/codespell

Aaron Davies said...

The code its self is very simple (albeit not very good, it was written after I got back from clubbing @ JWZ’s club [DNA lounge]). There is something about a lack of sleep which makes perl code and regexs seem like a good idea.

oh the irony....

A Real Github User said...

Please leave my github repositories, and any repos I care about, alone. I don't want your bot bothering me with garbage. If I get a pull request from a bot like this, you bet it would be reported to github as spam...

YaZug said...

inital => initial

also how would I run it and point it at my github repos?

David Precious said...

@"A Real Github User" - you'd count submissions which are intended to be helpful, albeit automatically generated, as spam?

You could block the bot on GitHub if you didn't wish to receive any more pull requests from it, but I'm not convinced that GitHub would agree that it's spam.

Anonymous said...

A Real Github User: The rest of us are also real GH users. You might consider a spelling correction patch, garbage. But keep in mind that many would consider your project garbage if it contains common misspellings.
