| US 7,599,931 B2 | ||
| Web forum crawler | ||
| Bin Shi, Beijing (China); Gu Xu, Beijing (China); and Wei-Ying Ma, Beijing (China) | ||
| Assigned to Microsoft Corporation, Redmond, Wash. (US) | ||
| Filed on Mar. 03, 2006, as Appl. No. 11/368,261. | ||
| Prior Publication US 2007/0208703 A1, Sep. 06, 2007 | ||
| Int. Cl. G06F 7/00 (2006.01); G06F 17/30 (2006.01) | ||
| U.S. Cl. 707—6 [707/4] | 14 Claims |

| 1. A system with a processor and memory for crawling a site having pages, each page having a reference that identifies the
page, each reference having tokens, comprising:
a grouping component that identifies groups of pages with similar content;
a pattern component that identifies a reference pattern of a group based on the references of the pages of the group, the
reference pattern being identified by analyzing the tokens of the references of the pages of the group to identify sequences
of tokens indicating a pattern of tokens within the references; and
a decision component that, after encountering a reference that matches a reference pattern when crawling the site, decides
whether to access the page of the encountered reference based on characteristics of the pages of the group of the matching
reference pattern,
wherein the components are implemented as computer-executable instructions stored in the memory for execution by the processor.
|
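
The pattern and decision components recited in claim 1 can be illustrated with a minimal sketch. It is not the patented implementation: the tokenizer, the position-by-position comparison, the `useful_fraction` characteristic, the threshold, and the example forum URLs are all assumptions made for illustration. The grouping component is not modelled; groups of similar pages are taken as given.

```python
import re
from dataclasses import dataclass
from typing import Optional


def tokenize(reference: str) -> list:
    """Split a reference (URL) into alphanumeric runs and single delimiters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", reference)


def reference_pattern(references: list) -> Optional[list]:
    """Derive a pattern from the references of one group.

    Positions where every reference has the same token keep that token;
    positions that vary become None (a wildcard). This simplified sketch
    only handles references with the same number of tokens.
    """
    token_lists = [tokenize(r) for r in references]
    if len({len(tokens) for tokens in token_lists}) != 1:
        return None
    return [col[0] if len(set(col)) == 1 else None for col in zip(*token_lists)]


def matches(pattern: list, reference: str) -> bool:
    """Check whether a reference fits a pattern, treating None as a wildcard."""
    tokens = tokenize(reference)
    return len(tokens) == len(pattern) and all(
        p is None or p == t for p, t in zip(pattern, tokens)
    )


@dataclass
class Group:
    pattern: list
    useful_fraction: float  # hypothetical characteristic of the group's pages


def should_access(reference: str, groups: list, threshold: float = 0.5) -> bool:
    """Decide whether to fetch a page based on the matching group's characteristic."""
    for group in groups:
        if matches(group.pattern, reference):
            return group.useful_fraction >= threshold
    return True  # no known pattern matches: fetch the page (an assumed default)


# Hypothetical forum URLs grouped by similar content.
threads = ["/forum/viewtopic.php?t=101", "/forum/viewtopic.php?t=202"]
profiles = ["/forum/profile.php?u=7", "/forum/profile.php?u=9"]
groups = [
    Group(reference_pattern(threads), useful_fraction=0.9),
    Group(reference_pattern(profiles), useful_fraction=0.1),
]
```

With these assumed groups, `should_access("/forum/viewtopic.php?t=999", groups)` accepts a new thread page, while `should_access("/forum/profile.php?u=12", groups)` skips a profile page because its group was marked as mostly uninformative.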