Sunday, May 6, 2007

PLink - The People's Search Engine

PLink - Search Engine, Blogging, Content Management, and more.

Using some of the libraries I was playing with around the end of the semester, I've put together a DIGG style web application.

http://www.plink-search.com


Note: Search results aren't spectacular yet this will change as more sites are added and more users are voting.

It's meant to be used as a general-purpose search engine (think Google), rather than as a news aggregator like DIGG (the entries aren't ordered by date, or deleted after a period of time). It's hoped that a user based voting system can help promote good search results - hopefully better than the methods, like link counting, that Google uses.
Regardless of whether enough people start using PLink to give it a true test (who knows if my server can handle that even), it was a fun experience making an application like this. It brought together a lot of stuff I’ve done in the past. If you’re curious, I think I’ll go into a bit more detail – well regardless of whether you’re interested actually.

Search Engine 101:

Indexer:

I used the same networking libraries that I used in my previous project Coezilla, this allows for multiple clients to index websites and to report them back to the central server. This is neat, but turned out to be somewhat redundant, given the quality of the computers I’m running it on - the bottleneck ended up being writing to MySQL and to the disk. Still, this approach is good because it allows you to move some processing away from the main server computer (which, if you're making a search engine like me, is already pretty bogged down).

Parser:

This was one of the harder things to get off the ground. HTML is not guaranteed to be properly formed, and to apply all the transformations to the data that I wanted; I needed (preferably) to get stuff parsed into a Document Object Model (DOM). I ended up using the CyberNeko HTML Parser, in conjunction with the Xerces XML parser. This worked great. Once stuff is in a DOM, you just traverse the elements in the HTML document like a tree.

NLP:

Natural language processing is a complex field. But, I’ll give you a two-minute primer.
I didn’t want to get too complex, in favor of letting voting determine search results. I started by collecting all the text data in the page I was indexing into one big chunk. Having done this, I extracted data one word at a time and applied the following transformations to it.

Stop Word Removal: Lots of words in the English language are really common and don’t provide much information about a web-page (the, at, him, etc.), it’s alright to just remove these words. You do this by using a big list of such words.

Stemming: Lot’s of different words have a root in common, e.g., cat, cats, cat’s. You can apply a stemming algorithm to extract this common root, rather than treating them as different words. Eliminating as many terms as possible can reduce a lot of the complexity in search engines. I used a stemmer called Snowball (which is an implementation of the Porter stemming algorithm).

While applying these two steps, I collected the words into a big alphabetical list, keeping track of the frequency of each word. This list is used to determine the importance of each word to the given web-page.

Applying a Distribution:

Web-pages have significantly variable amounts of information on them. I applied the Poisson distribution to the frequencies in the alphabetical list of words. This helps ensure that we can compare a ten thousand word website to a one hundred word website.

Once all these steps are applied to a given web-page, it is shipped off to the central server, where it is written to disk… using an,

Inverse-Index:

If you give me a big ordered list of words, and point me to the middle, I know I can eliminate either the upper or lower part of the list depending on the word I’m searching for. This process is extremely efficient, and is used to drive searching. Rather than maintain a big list of web-pages (like a phone book), I maintain a big list of words (like a dictionary). If you search for a given word, I retrieve a list of all the pages that have this word in them – this list of words is ordered by each sites importance to the given word. Using this efficient data structure millions of words can be searched, relatively instantaneously.

Voting:

When you vote with Plink, it allows you to shift sites rankings around on given words – in these lists that I mentioned previously. It is hoped that, with enough people doing this, it would lead to good search results.
Well, there you have it, now go out and make your own search engine, or, better yet, go sign-up for a PLink account… and start voting.

http://www.plink-search.com

No comments: