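Before writing a script, it helps to write pseudocode first. There is no fixed format; use whatever style you like. The point is to lay out the overall structure before committing your ideas and algorithm to code. That greatly reduces the chance of discovering an early mistake when you are nearly done and having to rework the whole script.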
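The first step is to fetch the lottery data with a crawler, scraping the draw date, the red balls, and the blue ball. Because the Double Color Ball (双色球) history is fairly large, this time the data is stored in a database. I chose MySQL, which is free and easy to use, with MySQLdb as the database interface module; I will write a separate note on it later. You could of course store the data in a plain file instead, or pick another database such as Oracle or a NoSQL store like MongoDB.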
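As a small illustration of the scraping step just described, the sketch below runs lookaround regular expressions over a hand-written HTML fragment. The fragment `sample_html` and the helper `parse_draw` are assumptions made up for this example, modeled on the values quoted in this post rather than on the real page. It targets Python 3 and uses only the standard library.

```python
import re

# Hypothetical fragment; the real page contains much more markup.
sample_html = (
    '<span>开奖日期:2018年8月26日</span>'
    '<li class="ball_red">03</li><li class="ball_red">07</li>'
    '<li class="ball_red">08</li><li class="ball_red">14</li>'
    '<li class="ball_red">25</li><li class="ball_red">32</li>'
    '<li class="ball_blue">06</li>'
)

# Lookbehind/lookahead keep the surrounding tags out of the match itself.
red_re = re.compile(r'(?<=<li class="ball_red">)[0-9]+(?=</li>)')
blue_re = re.compile(r'(?<=<li class="ball_blue">)[0-9]+(?=</li>)')
date_re = re.compile(r'(?<=开奖日期:)([0-9]+)年([0-9]+)月([0-9]+)日')


def parse_draw(html):
    """Return (open_date, red_n, blue_n); raise ValueError if a field is missing."""
    red = red_re.findall(html)
    blue = blue_re.search(html)
    date = date_re.search(html)
    if len(red) != 6 or blue is None or date is None:
        raise ValueError('page does not look like a draw result')
    year, month, day = (int(g) for g in date.groups())
    # Zero-padded ISO date, red balls joined with ";" as in the table layout.
    return '%04d-%02d-%02d' % (year, month, day), ';'.join(red), blue.group()


open_date, red_n, blue_n = parse_draw(sample_html)
print(open_date, red_n, blue_n)
```

Zero-padding the month and day is a choice made in this sketch so that dates stored as text compare in calendar order; raising `ValueError` makes a page with an unexpected layout fail loudly instead of writing a partial row.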
# -*- coding: utf-8 -*-
'''
Open the URL
Read the content
Parse the content
Pick out the items to scrape from the page source
1. Draw date: 2018年8月26日
2. Red balls
<li class="ball_red">03</li>
<li class="ball_red">07</li>
<li class="ball_red">08</li>
<li class="ball_red">14</li>
<li class="ball_red">25</li>
<li class="ball_red">32</li>
3. Blue ball
<li class="ball_blue">06</li>
Open a database connection
Write the scraped content to the database
Three fields in total
1. Draw date
2. Red balls, separated by semicolons ";" so they are easy to query and export
3. Blue ball
create table tow_color_ball(open_date varchar(10),red_n varchar(20),blue_n varchar(2))
'''
import urllib
import urllib2
import re
import numpy as np
import operator
import MySQLdb
# Connect to MySQL and return a cursor
def conn_db():
    db = 'pythondb'
    host = 'localhost'
    iuser = 'xxx'
    passwd = 'xxxxxx'
    conn = MySQLdb.connect(db=db, host=host, user=iuser, passwd=passwd)
    cursor = conn.cursor()
    return cursor
# Fetch the page and return its source
def get_html_values(url):
    url_open = urllib.urlopen(url)
    url_read = url_open.read()
    return url_read
# Parse the source and extract the draw date, red balls and blue ball
def manage_html(html_values):
    red_no_re = re.compile('(?<=<li class="ball_red">)[0-9]+(?=</li>)')
    blue_no_re = re.compile('(?<=<li class="ball_blue">)[0-9]+(?=</li>)')
    # This pattern has to match the page text, so the literal stays in Chinese
    date_re = re.compile('(?<=开奖日期:)[0-9]+年[0-9]+月[0-9]+日')
    red_no_list = re.findall(red_no_re, html_values)
    red_numbers = ';'.join(red_no_list)
    blue_number = re.search(blue_no_re, html_values)
    blue_number = blue_number.group()
    date_value = re.search(date_re, html_values)
    date_value = date_value.group()
    return date_value, red_numbers, blue_number
# Annoyingly, the date comes as YYYY年MM月DD日 and has to be converted to YYYY-MM-DD
def manage_date(date_value):
    date_value = date_value.replace('年', '-').replace('月', '-').replace('日', '')
    return date_value
# Step the page number in the URL down by 1 per call; end_page must be smaller than the page number in the URL
def get_page(url, end_page):
    # The dot is escaped so that only the number right before the file extension matches
    url_num = re.search(r'(?<=/)[0-9]+(?=\.)', url)
    url_num = url_num.group()
    # Stop once the page just crawled is end_page, so nothing below end_page is fetched
    if int(end_page) >= int(url_num):
        return 'end'
    url_num_1 = int(url_num) - 1
    url = url.replace(url_num, str(url_num_1))
    return url
# Check whether this draw date is already in the table, to avoid duplicate rows
def check_open_date(open_date):
    conn = conn_db()
    # Pass the value as a query parameter instead of formatting it into the SQL text
    check_sql = 'select 1 from tow_color_ball where open_date = %s'
    conn.execute(check_sql, (open_date,))
    excur = conn.fetchall()
    conn.close()
    # excur == () when no row is found
    return excur
# Write one draw to the database
def write_db(date_value, red_numbers, blue_number):
    conn = conn_db()
    in_sql = "insert into tow_color_ball(open_date,red_n,blue_n) values(%s,%s,%s)"
    conn.execute(in_sql, (date_value, red_numbers, blue_number))
    conn.execute('commit')
    conn.close()
# Crawler entry point: walks the result pages and stores the ball numbers
def ball_main(url, end_page):
    while True:
        html_values = get_html_values(url)
        date_value, red_numbers, blue_number = manage_html(html_values)
        date_value = manage_date(date_value)
        data_check = check_open_date(date_value)
        if data_check == ():
            write_db(date_value, red_numbers, blue_number)
        url = get_page(url, end_page)
        if url == 'end':
            print 'url_page has reached end_page, crawl finished'
            return 0
# Binomial distribution
class binomial_class(object):
    def __init__(self, case_count, real_count, p):
        self.case_count = case_count
        self.real_count = real_count
        self.p = p
    # Product of all numbers in xlist
    def multiply_fun(self, xlist):
        n = 1
        for x in xlist:
            n *= x
        return n
    # Factorial n!
    def fact_fun(self, n):
        if n == 0:
            return 1
        n += 1
        fact_list = [i for i in range(1, n)]
        fact_num = self.multiply_fun(fact_list)
        return fact_num
    # Binomial coefficient C(n, x) with n = case_count and x = real_count
    def c_n_x(self):
        fact_n = self.fact_fun(self.case_count)
        fact_x = self.fact_fun(self.real_count)
        fact_n_x = self.fact_fun(self.case_count - self.real_count)
        c_n_x_num = float(fact_n) / (fact_x * fact_n_x)
        return c_n_x_num
    # Binomial probability C(n, x) * p**x * (1 - p)**(n - x)
    def binomial_fun(self):
        c_n_k_num = self.c_n_x()
        pi = (self.p ** self.real_count) * ((1 - self.p) ** (self.case_count - self.real_count))
        binomial_num = c_n_k_num * pi
        return binomial_num
# Load the stored draws from the database
def get_ball_infomation(start_dt, end_dt):
    conn = conn_db()
    # %% keeps a literal % for date_format; the two %s placeholders are filled in by the driver
    sql = "select red_n,blue_n from tow_color_ball where date_format(open_date,'%%Y-%%m-%%d') >= %s and date_format(open_date,'%%Y-%%m-%%d') <= %s"
    conn.execute(sql, (start_dt, end_dt))
    excur = conn.fetchall()
    conn.close()
    case_array = np.array(excur)
    row_count = case_array.shape[0]
    col_count = case_array.shape[1]
    red_ball_array = case_array[:, 0]
    blue_ball_array = case_array[:, 1]
    return red_ball_array, blue_ball_array, row_count, col_count
# Count how often each ball number appears; this really belongs in the database, left here for now
def every_ball_count(ball_array):
    ball_list = []
    for ball_char in ball_array:
        ball_list += ball_char.split(';')
    ball_count = {}
    for ball_num in ball_list:
        ball_count[ball_num] = ball_count.get(ball_num, 0) + 1
    return ball_count
# Analysis entry point; needs a sample of at least 7 draws, otherwise nothing is processed
def analysis_main(start_dt, end_dt):
    red_ball_array, blue_ball_array, row_count, col_count = get_ball_infomation(start_dt, end_dt)
    if row_count < 7:
        print 'Sample too small to support the analysis'
        return 1
    red_count_dict = every_ball_count(red_ball_array)
    blue_count_dict = every_ball_count(blue_ball_array)
    # binomial_class(case_count, real_count, p) is called with (occurrences + 1, draws + 1, 0.5)
    for red_case in red_count_dict:
        red_rate = binomial_class((red_count_dict[red_case] + 1), (row_count + 1), 0.5)
        red_count_dict[red_case] = red_rate.binomial_fun()
    for blue_case in blue_count_dict:
        blue_rate = binomial_class((blue_count_dict[blue_case] + 1), (row_count + 1), 0.5)
        blue_count_dict[blue_case] = blue_rate.binomial_fun()
    # Sort by the computed value, highest first
    sorted_red_count = sorted(red_count_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
    sorted_blue_count = sorted(blue_count_dict.iteritems(), key=operator.itemgetter(1), reverse=True)
    print sorted_blue_count[0]
    print 'Selected red balls:'
    n = 1
    for key, value in sorted_red_count:
        if n == 7:
            break
        print '%s,%s' % (key, str(value))
        n += 1
    print 'Selected blue ball:'
    print '%s,%s' % (sorted_blue_count[0][0], str(sorted_blue_count[0][1]))
if __name__ == '__main__':
    # Keep showing the menu until the user types quit
    while True:
        input_n = raw_input('''
Choose a function
1. Crawl ball numbers from the pages
2. Analyze the ball numbers
Type quit to exit
Your choice: ''')
        if input_n == '1':
            url = raw_input('''
Enter the URL to crawl (this is the start URL, so pick one with a larger page number)
Input: ''')
            end_page = raw_input('''
Enter the end page number
(Note: if the end page number is greater than the page number in the URL, only the start page is crawled)
Input: ''')
            ball_main(url, end_page)
        elif input_n == '2':
            analysis_main('2018-08-15', '2018-09-09')
        elif input_n == 'quit':
            exit(0)
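For reference, the quantity that `binomial_fun` evaluates is the binomial probability of exactly x successes in n independent trials with success probability p, that is C(n, x) * p^x * (1 - p)^(n - x). The sketch below is a standalone cross-check, not part of the script above: the name `binomial_pmf` is made up for this example, it assumes 0 <= x <= n, and it relies only on `math.factorial` from the standard library.

```python
from math import factorial


def binomial_pmf(n, x, p):
    """Probability of exactly x successes in n independent trials."""
    if not 0 <= x <= n:
        raise ValueError('need 0 <= x <= n')
    # Integer division is exact here because C(n, x) is always an integer.
    coefficient = factorial(n) // (factorial(x) * factorial(n - x))
    return coefficient * (p ** x) * ((1 - p) ** (n - x))


# C(10, 3) = 120 and 0.5 ** 10 = 1/1024, so this is 120/1024.
print(binomial_pmf(10, 3, 0.5))
```

Summing `binomial_pmf(n, x, p)` over x from 0 to n should give 1, which is a quick sanity check on any hand-written implementation.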
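Anyone who read the earlier beginner crawler post should find this familiar. The area to scrape is the one marked 1. Looking at positions 2 and 3, it is easy to see that each draw has its own page, so it is enough to page through by decreasing the page number by 1 each time. Next, let's look at the source code behind position 1.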
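Before turning to that source, the paging idea can be sketched on its own. This is a minimal Python 3 sketch under stated assumptions: the helper `page_urls`, the `example.com` address, and the `.shtml` file extension are all made up for illustration and may not match the real site's URL layout.

```python
import re

# Assumed layout: the draw number is the last path segment before ".shtml".
PAGE_RE = re.compile(r'/([0-9]+)\.shtml$')


def page_urls(start_url, end_page):
    """Yield start_url and each earlier page, down to end_page inclusive."""
    match = PAGE_RE.search(start_url)
    if match is None:
        raise ValueError('no page number found in URL')
    digits = match.group(1)
    prefix = start_url[:match.start(1)]
    suffix = start_url[match.end(1):]
    for page in range(int(digits), int(end_page) - 1, -1):
        # zfill keeps any leading zeros the original number had
        yield prefix + str(page).zfill(len(digits)) + suffix


for page_url in page_urls('http://example.com/ssq/18099.shtml', 18097):
    print(page_url)
```

Rebuilding the URL from the matched position, rather than replacing the number wherever it occurs in the string, avoids touching the same digits elsewhere in the address. If `end_page` is larger than the page number in the start URL, the generator yields nothing.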