US 7,519,621 B2
Extracting information from Web pages
Ralph Harik, Mountain View, Calif. (US)
Assigned to PageBites, Inc., Mountain View, Calif. (US)
Filed on May 04, 2004, as Appl. No. 10/838,982.
Prior Publication US 2005/0251536 A1, Nov. 10, 2005
Int. Cl. G06F 7/00 (2006.01)
U.S. Cl. 707—200  [707/3] 10 Claims
1. A computer-implemented method for identifying webpage content, the method comprising:
receiving from a memory storage device a string of HTML source code that includes tags;
determining the sequence in which tags occur in the string;
using the sequence to identify one or more sub-sequences in which tags occur in the string, each sub-sequence being associated with a portion of the string that starts with the first tag of the sub-sequence and ends with the last tag of the sub-sequence;
determining whether the identified sub-sequences define webpage content constituting an entire webpage listing, the determining including:
applying a first set of criteria to filter the identified sub-sequences, the first set of criteria including a requirement that an identified sub-sequence be repeated in tandem, either exactly or approximately, in the string;
removing from further consideration sub-sequences that do not satisfy the first set of criteria;
grouping the remaining sub-sequences into groups, wherein sub-sequences are grouped together in a group when they do not overlap and are similar, as determined by a measure based on edit distance;
calculating a score for each group, the score for a group being associated with each sub-sequence in the group, the score being indicative of the likelihood that sub-sequences in the group define webpage content constituting entire webpage listings;
identifying overlapping sub-sequences between different groups, wherein identifying includes selecting each sub-sequence in a group and comparing the selected sub-sequence against sub-sequences of other groups for one or more overlapping word tokens;
removing from further consideration all identified overlapping sub-sequences between different groups except sub-sequences from the group having a highest associated score among sub-sequences currently selected; and
returning and storing in the memory storage device the sub-sequences that were not removed from further consideration.
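The first steps of claim 1 (determining the tag sequence, finding sub-sequences repeated in tandem, and comparing sub-sequences by edit distance) can be illustrated with a minimal sketch. It is not the patented implementation. The function names, the regular expression, the minimum sub-sequence length, and the similarity threshold are all assumptions made for illustration. Only exact tandem repeats are detected here; the approximate repeats, group scoring, and overlap removal recited in the claim are omitted.

```python
import re
from typing import List, Sequence, Tuple

# Assumed, simplified tag matcher; a real HTML tokenizer handles far more cases.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def tag_sequence(html: str) -> List[str]:
    """Return tag names in order of occurrence; closing tags keep a '/' prefix."""
    return [m.group(1) + m.group(2).lower() for m in TAG_RE.finditer(html)]


def tandem_repeats(tags: Sequence[str], min_len: int = 2) -> List[Tuple[int, int, int]]:
    """Return (start, length, count) for each sub-sequence that repeats
    exactly back-to-back at least twice (hypothetical filter criterion)."""
    found = []
    n = len(tags)
    for length in range(max(1, min_len), n // 2 + 1):
        for start in range(n - 2 * length + 1):
            unit = list(tags[start:start + length])
            count, pos = 1, start + length
            # Count consecutive copies of the unit that directly follow it.
            while list(tags[pos:pos + length]) == unit:
                count += 1
                pos += length
            if count >= 2:
                found.append((start, length, count))
    return found


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two tag sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]


def similar(a: Sequence[str], b: Sequence[str], threshold: float = 0.2) -> bool:
    """Assumed similarity measure: edit distance normalized by the longer length."""
    longest = max(len(a), len(b)) or 1
    return edit_distance(a, b) / longest <= threshold


if __name__ == "__main__":
    html = "<ul><li><a href='x'>A</a></li><li><a href='y'>B</a></li></ul>"
    tags = tag_sequence(html)
    print(tags)
    print(tandem_repeats(tags))
```

In this sketch a two-item list yields one tandem repeat, the four-tag unit that wraps each list item. Normalizing the edit distance by the longer sequence is one possible choice of "a measure based on edit distance"; the claim does not specify the measure.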