Let's Search Future - Code to Learn

Search is the Achilles heel of an engineer: most problem solving begins with searching a dataset. Early this decade, an engineer named Doug Cutting dreamt of a complete open source search infrastructure, and in pursuit of it he created a family of Java projects. The work is far from complete. It's an ongoing effort, and if you work on any of these projects you will gain immense knowledge in the field. Search has three main parts: crawling the web, creating an index of the crawled data, and searching the index for a given query. The following two Apache projects handle this functionality.

Apache Nutch : This is the web crawler. Crawling sounds simple, but given the vastness of the internet and the hundreds of factors involved, such as web servers, content, and backlinks, it is a huge task. You can play around with Nutch and implement some cool new feature. Here is the JIRA. You can also use Nutch to build a specialized search engine. For example, Twitter search is broken. It sucks. Write a tweet search engine that sucks less.
Apache Lucene : Lucene is the best collection of information retrieval algorithms. It is used for indexing information and retrieving results. Check out the JIRA and see if something interests you. Lucene also has a subproject called Solr, a complete search server that exposes the index through an HTTP REST API and stores it as Lucene index files on disk. Many of us have worked on Lucene and should be able to guide you if you hit any blockers.
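
To make the three parts of search (crawl, index, query) concrete, here is a toy sketch in Python. It is not how Nutch or Lucene are implemented: the "web" is a hypothetical in-memory dictionary instead of real HTTP fetches, and every name in it (TOY_WEB, PageParser, crawl, build_index, search) is made up for illustration. The index is a plain inverted index (term to set of URLs), which is the basic idea behind what Lucene does, minus scoring, compression, and everything else that makes it fast.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML. A real crawler fetches over HTTP.
TOY_WEB = {
    "http://a.example/": '<a href="http://b.example/">next page</a> Lucene builds an inverted index',
    "http://b.example/": '<a href="http://a.example/">previous page</a> Nutch is a web crawler',
}


class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        self.text.append(data)


def crawl(seed, web):
    """Part 1: breadth-first crawl from a seed URL; returns {url: page_text}."""
    seen, queue, pages = {seed}, [seed], {}
    while queue:
        url = queue.pop(0)
        html = web.get(url)
        if html is None:  # dead link: nothing to parse
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        pages[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


def tokenize(text):
    """Lowercase and split into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Part 2: build an inverted index mapping each term to the URLs containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Part 3: AND query; returns sorted URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))
```

To try it, call crawl("http://a.example/", TOY_WEB), pass the result to build_index, and then run search(index, "inverted index"). A natural next step, if you want to learn more, is to swap the boolean AND for a ranked query.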

One indexing and search project that a friend of mine suggested long ago was to index all the research papers (download here) and build a recommendation engine on top. For example, if I am searching for the Cassandra paper, the search should also include results for papers related to Amazon Dynamo. FYI, these two links are awesome papers on distributed systems. I am saving distributed systems for last ;-) Cheers,