me of the page in the Web, on the distribution of subject characteristics, and the identification algorithm. In this paper, the concrete work is as follows:

(1) Spider. Starting from a set of seeds, the crawler is designed to download as much of each website as possible, covering the whole site in line with user requirements.

(2) Page preprocessing. This includes word segmentation, HTML parsing and page denoising.

(3) Theme relevance determination. This includes a feature extraction stage and a weighting stage. In the feature extraction stage, document frequency is combined with new features to achieve dimensionality reduction and improve classification accuracy. In the weighting stage, information gain, the TF-IDF algorithm and the traditional vector space model algorithm are combined into a weight calculation that is better suited to determining theme relevance.

(4) Finally, a simple web crawler system is implemented on the MyEclipse platform. A brief analysis of the crawler's operation shows that it reached a satisfactory result.

Keywords: page analysis; TF-IDF algorithm; space vector algorithm
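
As an illustration of the weighting step named in the abstract, the following is a minimal sketch of plain TF-IDF weights compared by cosine similarity in a vector space model. It is not the thesis's combined scheme: information gain and the document-frequency feature selection are left out, and the function names, the sample token lists and the choice of Python (the thesis system itself is built on MyEclipse) are assumptions made for brevity.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector is treated as unrelated
    return dot / (norm_u * norm_v)


# Hypothetical, already segmented documents: a topic description and two pages.
topic = ["crawler", "search", "engine", "crawler"]
page_on_topic = ["crawler", "web", "search"]
page_off_topic = ["football", "match", "score"]

vectors = tf_idf_vectors([topic, page_on_topic, page_off_topic])
on_topic_score = cosine(vectors[0], vectors[1])
off_topic_score = cosine(vectors[0], vectors[2])
print(on_topic_score > off_topic_score)
```

A focused crawler could keep a page when its score against the topic vector exceeds a chosen threshold. The threshold value is a tuning decision that this sketch does not fix.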
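Contents

Chapter 1 Introduction
1.1 Background and significance of the topic
1.2 Development of search engines
1.3 Research status in China and abroad
1.4 Main work of this thesis and thesis structure

Chapter 2 Working principles of web crawlers
2.1 The role of web crawlers in search engines
2.2 Basic principles of web crawlers
2.2.1 Architecture of the topic-focused web crawler
2.2.2 Functions of the system modules
2.3 Content extraction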