Recursive search

by asfarley

Recursive search tree: topic ‘ruby’

Goal: construct a tree populated with topics related to a keyword of interest. Algorithm:

  1. Every day:
  2. Search Google for keyword.
  3. Follow the top 10 links and fetch the HTML of each. Keep only the text, dropping JavaScript, formatting and other tags.
  4. Collate text documents & split into words.
  5. Remove “noise” words (discussed later) and occurrences of the keyword.
  6. Sort words by frequency of occurrence.
  7. Select the top n most frequently co-occurring terms and repeat the search with each of them as the new keyword.
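Steps 4 to 7 can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the names `tokenize`, `top_terms` and `build_tree` are hypothetical, the `depth` limit is an assumed stopping rule (the steps above do not state one), and `fetch_text` stands in for steps 2 and 3 as a caller-supplied callable that returns the plain-text documents for a keyword.

```ruby
require "set"

# Hypothetical helper: split text into lowercase word tokens.
def tokenize(text)
  text.downcase.scan(/[a-z][a-z'\-]*/)
end

# Steps 4-7: collate the documents, drop noise words and the keyword,
# count the remaining words and return the n most frequent ones.
# Ties are broken alphabetically so the result is deterministic.
def top_terms(documents, keyword, noise_words, n)
  blocked = Set.new(noise_words.map(&:downcase))
  blocked << keyword.downcase
  counts = Hash.new(0)
  documents.each do |doc|
    tokenize(doc).each { |word| counts[word] += 1 unless blocked.include?(word) }
  end
  counts.sort_by { |word, count| [-count, word] }.first(n).map(&:first)
end

# Recursive step: each top term becomes the keyword of a child node.
# fetch_text is an assumed callable: keyword -> array of text documents.
# depth is an assumed limit on how far the tree is expanded.
def build_tree(keyword, fetch_text, noise_words, n, depth)
  children = []
  if depth > 0
    documents = fetch_text.call(keyword)
    children = top_terms(documents, keyword, noise_words, n).map do |term|
      build_tree(term, fetch_text, noise_words, n, depth - 1)
    end
  end
  { topic: keyword, children: children }
end
```

Passing the fetcher in as a lambda keeps the ranking logic separate from the network code, so it can be tried out on canned text before pointing it at real search results.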

Try it out here.

Undesired search results can be blocked in subsequent searches by clicking the red ‘x’ in each node. A jQuery event then updates the noise list on the server.

This was accomplished with Rails, the gems Nokogiri, whenever, sanitize and elif, the Ruby standard libraries open-uri and net/http, the Google Search API and jQuery.

The project’s code (also my homepage) is here on Github.

Noise is modelled by searching Google for 10 preselected, unrelated topics. The algorithm described above is applied to the collated results (without the noise removal step), and the result is treated as a general model of the background noise in internet text. The most frequently occurring words in this model are stored and blocked from appearing in recursive topic search trees.
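As a rough sketch of that idea, the noise list is just the frequency ranking run over text from the unrelated topics, with nothing filtered out. The function name `build_noise_list` and the `size` cutoff are assumptions for illustration; the post does not say how many words are kept.

```ruby
# Build a noise list from background documents (text gathered for
# unrelated topics). No words are filtered here: the point is to find
# whatever is common across internet text in general.
# size is an assumed cutoff for how many words to keep.
def build_noise_list(background_documents, size)
  counts = Hash.new(0)
  background_documents.each do |doc|
    doc.downcase.scan(/[a-z][a-z'\-]*/) { |word| counts[word] += 1 }
  end
  # Most frequent first; ties broken alphabetically.
  counts.sort_by { |word, count| [-count, word] }.first(size).map(&:first)
end
```

Words blocked by hand with the red ‘x’ could simply be appended to the list this returns.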