| US 7,519,621 B2 | ||
| Extracting information from Web pages | ||
| Ralph Harik, Mountain View, Calif. (US) | ||
| Assigned to PageBites, Inc., Mountain View, Calif. (US) | ||
| Filed on May 04, 2004, as Appl. No. 10/838,982. | ||
| Prior Publication US 2005/0251536 A1, Nov. 10, 2005 | ||
| Int. Cl. G06F 7/00 (2006.01) | ||
| U.S. Cl. 707—200 [707/3] | 10 Claims |

| 1. A computer-implemented method for identifying webpage content, the method comprising: receiving from a memory storage device
a string of HTML source code that includes tags;
determining the sequence in which tags occur in the string;
using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated
with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining
including:
applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement
that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
removing from further consideration sub-sequences that do not satisfy the first set of criteria;
grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap
and are similar, as determined by a measure based on edit distance;
calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score
being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in
a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences
from the group having a highest associated score among sub-sequences currently selected; and
returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
|
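
The steps recited in claim 1 can be sketched in Python as follows. This is an illustrative reading of the claim language, not the patented implementation, and several choices are assumptions made only for the example: all function names are hypothetical, tags are found with a simple regular expression rather than a full HTML parser, grouping is greedy, the group score is taken to be the number of members times the number of tags per member (the claim does not define the score), and overlap between groups is tested on tag positions rather than on word tokens.

```python
import re

# Assumption: a tag is "<", an optional "/", a name, then anything up to ">".
TAG_RE = re.compile(r"<(/?[a-zA-Z][a-zA-Z0-9]*)[^>]*>")

def tag_sequence(html):
    """Return (name, start, end) for each tag, in order of occurrence."""
    return [(m.group(1).lower(), m.start(), m.end()) for m in TAG_RE.finditer(html)]

def edit_distance(a, b):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def overlaps(a, b):
    """True when two half-open index ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def tandem_windows(names, n, tol):
    """Windows of n tags whose adjacent window (before or after) matches within tol edits."""
    hits = []
    for i in range(len(names) - n + 1):
        window = names[i:i + n]
        neighbours = [names[i + n:i + 2 * n]]
        if i >= n:
            neighbours.append(names[i - n:i])
        if any(len(nb) == n and edit_distance(window, nb) <= tol for nb in neighbours):
            hits.append((i, i + n))
    return hits

def group_similar(names, windows, tol):
    """Greedily group windows that do not overlap and are within tol edits of a group's first member."""
    groups = []
    for w in windows:
        for g in groups:
            rep = g[0]
            if (not any(overlaps(w, m) for m in g)
                    and edit_distance(names[w[0]:w[1]], names[rep[0]:rep[1]]) <= tol):
                g.append(w)
                break
        else:
            groups.append([w])
    return groups

def extract_listings(html, min_len=2, max_len=6, tol=0):
    """Return the string portions that survive filtering, grouping and overlap removal."""
    tags = tag_sequence(html)
    names = [t[0] for t in tags]
    # First criterion: keep only sub-sequences repeated in tandem.
    windows = [w for n in range(min_len, max_len + 1)
               for w in tandem_windows(names, n, tol)]
    groups = group_similar(names, windows, tol)
    # Hypothetical score: members in the group times tags per member.
    ranked = sorted(groups, key=lambda g: len(g) * (g[0][1] - g[0][0]), reverse=True)
    kept = []
    for g in ranked:
        # A window survives only if no higher-scored group already covers its tags.
        survivors = [w for w in g if not any(overlaps(w, k) for k in kept)]
        kept.extend(survivors)
    # Each portion runs from the start of its first tag to the end of its last tag.
    return [html[tags[s][1]:tags[e - 1][2]] for s, e in sorted(kept)]

page = "<ul><li><a>A</a></li><li><a>B</a></li><li><a>C</a></li></ul>"
listings = extract_listings(page)
```

On the small `page` string above, the window starting at each `<li>` tag repeats three times in tandem, so its group outscores the shifted windows that overlap it, and only the three list items are returned. Raising `tol` above zero would admit approximate repeats, at the cost of more candidate windows to group.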