[Machine Learning 07] Bayesian Algorithms

你的瓦刀 (site owner) · edited July 22, 2020, 15:51
<div id="descriptionP"><h2>Table of Contents</h2><h3>&nbsp; 1. What is Bayes' Formula<br>&nbsp; 2. A Worked Bayes Example<br>&nbsp; 3. Spell Checker</h3><h3>&nbsp; &nbsp; 3.1 Spell Checker Example</h3><h3>&nbsp; &nbsp; 3.2 Spell Checker Implementation<br>&nbsp; 4. Spam Filtering</h3><h2>1. What is Bayes' Formula</h2><p>Bayes' formula:</p><p>&nbsp; &nbsp; P(A|B) = [P(B|A) ∗ P(A)] / P(B)</p><p>What kind of problem does it solve?</p><blockquote><span style="color: rgb(255, 0, 0);">Forward probability</span>: suppose a bag contains N white balls and M black balls; you reach in and draw one. What is the probability of drawing a black ball?<br><span style="color: rgb(255, 0, 0);">Inverse probability</span>: if we do not know the ratio of black to white balls in advance, but draw one (or several) balls with our eyes closed and observe their colors, what can we then infer about the proportion of black and white balls in the bag?</blockquote><h2>2. A Worked Bayes Example</h2><p>Setup: in a school, 60% of the students are boys and 40% are girls. Boys always wear pants; girls wear pants half the time and skirts half the time.</p><blockquote>Forward probability: pick a student at random and ask for the probability that he or she is wearing pants or a skirt.<br>Inverse probability: a student wearing pants walks by; what is the probability that the student is a boy or a girl?</blockquote><p>Question: among the students wearing pants, what fraction are girls?</p><p>Let the total number of students be U.<br>Boys wearing pants: U ∗ P(Boy) ∗ P(Pants|Boy), where P(Boy) = 60% and P(Pants|Boy) = 100%.<br>Girls wearing pants: U ∗ P(Girl) ∗ P(Pants|Girl), where P(Girl) = 40% and P(Pants|Girl) = 50%.<br>All students wearing pants: U ∗ P(Pants) = U ∗ P(Boy) ∗ P(Pants|Boy) + U ∗ P(Girl) ∗ P(Pants|Girl)<br>Dividing, we obtain:</p><p>&nbsp; &nbsp; P(Girl|Pants) = [P(Girl) ∗ P(Pants|Girl)] / [P(Boy) ∗ P(Pants|Boy) + P(Girl) ∗ P(Pants|Girl)]</p><h2>3. Spell Checker</h2><h3>3.1 Spell Checker Example</h3><p>The Bayesian approach scores each candidate correction h by P(h) ∗ P(D|h), where P(h) is the prior probability of that particular guess.</p><blockquote>For example, when the user types "tlp", the checker must decide between "tip" and "top". When maximum likelihood alone cannot make a decisive judgment, the prior tells us which candidate is more plausible.</blockquote><blockquote>Model comparison theory<br>Maximum likelihood: the model that best fits the observed data (i.e., the one with the largest P(D|h)) has the advantage.<br>Occam's razor: the model with the larger P(h) has the advantage.<br>Suppose there are N points on a plane that approximately, but not exactly, lie on a straight line. We could fit them with a line (model 1), a second-order polynomial (model 2), or a third-order polynomial (model 3); in particular, an (N-1)-order polynomial is guaranteed to pass through all N points perfectly. Which of these candidate models is the most credible?<br>Pick a middle ground: too many terms leads to overfitting.<br>Occam's razor: the higher the order of a polynomial, the less common it is, so its prior P(h) is smaller.</blockquote><h3>3.2 Spell Checker Implementation</h3><p>The idea behind the code is straightforward:</p><blockquote>a. Read and preprocess the corpus file (lowercase everything) and turn it into a dictionary (NWORDS) mapping each word to its number of occurrences.<br>b. Read the word to be checked via input() and call correct().<br>c. Inside correct(), four cases are tried in order:<br>&nbsp; &nbsp;1. (known) Is the word itself in NWORDS? If not, go to 2.<br>&nbsp; &nbsp;2. (known(edits1(word))) Is any word one edit away (an insertion, deletion, substitution, or transposition) in NWORDS? If not, go to 3.<br>&nbsp; &nbsp;3. (known_edits2(word)) Repeat step 2 one more edit away. If still nothing, go to 4.<br>&nbsp; &nbsp;4. ([word]) Return the word itself unchanged.<br>d. Look up each candidate in the returned set in the dictionary, compare their counts, and return the most frequent word.</blockquote><p><a href="https://pan.baidu.com/s/1tJUaxBJpRW8M2GcizER7Kw" target="_blank">Click to download the dataset</a></p><p>The code is as follows:</p><pre><code>import re, collections


def words(text):
    # Lowercase the text and extract all alphabetic words.
    return re.findall('[a-z]+', text.lower())


def train(features):
    # Count word frequencies; unseen words get a smoothed count of 1.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model


def edits1(word):
    # All strings one edit away: deletions, transpositions,
    # substitutions, and insertions.
    n = len(word)
    return set([word[0:i] + word[i+1:] for i in range(n)] +
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])


def known_edits2(word):
    # Known words two edits away.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)


def known(words):
    # Keep only the candidates that appear in the corpus.
    return set(w for w in words if w in NWORDS)


def correct(word):
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=lambda w: NWORDS[w])


if __name__ == "__main__":
    NWORDS = train(words(open('big.txt').read()))
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    while True:
        str1 = input('Enter a word: ')
        print(correct(str1))</code></pre><h2>4. Spam Filtering</h2><p><span style="font-weight: bold;">Problem</span>: given an email, decide whether it is spam. Let D denote the email, and note that D consists of N words. Let h+ denote spam and h- denote normal mail.</p><blockquote>P(h+|D) = P(h+) ∗ P(D|h+) / P(D)<br>P(h-|D) = P(h-) ∗ P(D|h-) / P(D)</blockquote><p><span style="color: rgb(255, 0, 0);">Prior probabilities</span>: <span style="font-weight: bold;">P(h+) and P(h-)</span> are both easy to obtain; <span style="font-weight: bold;">simply compute the proportion of spam and normal mail in a mail corpus.</span></p><p>D contains N words d1, d2, d3, ..., dn, so</p><blockquote>P(D|h+) = P(d1, d2, ..., dn|h+)</blockquote><p><span style="font-weight: bold;">P(d1, d2, ..., dn|h+)</span> is the probability that, among spam emails, one appears that is exactly identical to the current email. By the chain rule it expands to:</p><blockquote>P(d1, d2, ..., dn|h+) = P(d1|h+) ∗ P(d2|d1, h+) ∗ P(d3|d2, d1, h+) ∗ ...</blockquote><p>Assume each di is conditionally independent of the words before it (<span style="color: rgb(255, 0, 0);">naive Bayes assumes the features are independent and do not affect one another</span>):</p><blockquote>P(d1, d2, ..., dn|h+) = P(d1|h+) ∗ P(d2|h+) ∗ P(d3|h+) ∗ ...</blockquote><p>For P(d1|h+) ∗ P(d2|h+) ∗ P(d3|h+) ∗ ..., it suffices to count the frequency with which each word di appears in spam emails.</p><p>(<a href="https://blog.csdn.net/weixin_38280090/article/details/94850089" target="_blank">Reference article</a>)</p></div>
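The classroom example in Section 2 can be checked numerically. This is a minimal sketch that plugs the stated percentages straight into the total-probability and Bayes formulas:

```python
# Numbers from the worked example in Section 2.
p_boy, p_girl = 0.60, 0.40
p_pants_given_boy = 1.00   # boys always wear pants
p_pants_given_girl = 0.50  # girls wear pants half the time

# Total probability of seeing pants, then Bayes' formula.
p_pants = p_boy * p_pants_given_boy + p_girl * p_pants_given_girl
p_girl_given_pants = p_girl * p_pants_given_girl / p_pants

print(p_pants)             # 0.8
print(p_girl_given_pants)  # 0.2 / 0.8 = 0.25
```

So a quarter of the students in pants are girls, and notice that U cancels out, which is why the final formula contains no U.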
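The spell checker in Section 3.2 needs the big.txt corpus. To see the algorithm work without the download, here is a self-contained sketch that trains NWORDS on a made-up toy corpus (the corpus text and test words are illustrative only, and known_edits2 is folded inline to keep it short):

```python
import re, collections

def words(text):
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Word counts with add-one smoothing, as in the article's code.
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Deletions, transpositions, substitutions, insertions.
    n = len(word)
    return set([word[:i] + word[i+1:] for i in range(n)] +
               [word[:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +
               [word[:i] + c + word[i+1:] for i in range(n) for c in alphabet] +
               [word[:i] + c + word[i:] for i in range(n+1) for c in alphabet])

def known(ws):
    return set(w for w in ws if w in NWORDS)

def correct(word):
    candidates = (known([word]) or known(edits1(word)) or
                  set(e2 for e1 in edits1(word) for e2 in known(edits1(e1))) or
                  [word])
    return max(candidates, key=lambda w: NWORDS[w])

# Toy corpus: "top" occurs twice, "tip" once.
NWORDS = train(words("the quick brown fox the lazy dog the top top tip"))

print(correct("teh"))  # transposition one edit from "the" -> "the"
print(correct("tlp"))  # one edit from both "top" and "tip"; the higher count wins -> "top"
```

The "tlp" case is exactly the tie from Section 3.1: both "tip" and "top" are one substitution away, so the word-frequency prior P(h) decides.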
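The derivation in Section 4 can be sketched as a tiny word-count classifier. The training emails below are made-up examples, and Laplace (add-one) smoothing is added as an assumption beyond the text so that a word unseen in one class does not zero out the whole product; log-probabilities are summed instead of multiplied to avoid underflow:

```python
import math
from collections import Counter

# Made-up training data: (word list, label) pairs; spam = h+, ham = h-.
emails = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting schedule today".split(), "ham"),
    ("project meeting notes".split(), "ham"),
]

# Priors P(h+), P(h-): the proportion of each class in the corpus.
labels = [y for _, y in emails]
prior = {c: labels.count(c) / len(labels) for c in ("spam", "ham")}

# Per-class word counts, used to estimate P(di | h).
counts = {"spam": Counter(), "ham": Counter()}
for ws, y in emails:
    counts[y].update(ws)
vocab = {w for ws, _ in emails for w in ws}

def score(ws, c):
    # log P(h) + sum of log P(di | h), with add-one smoothing.
    total = sum(counts[c].values())
    s = math.log(prior[c])
    for w in ws:
        s += math.log((counts[c][w] + 1) / (total + len(vocab)))
    return s

def classify(ws):
    # Pick the class with the larger posterior; P(D) cancels out.
    return max(("spam", "ham"), key=lambda c: score(ws, c))

print(classify("free money".split()))     # "spam"
print(classify("meeting today".split()))  # "ham"
```

Note that P(D) is never computed: it is the same for both classes, so comparing the numerators P(h) ∗ P(D|h) is enough, just as the two formulas at the start of Section 4 suggest.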
Tags: Bayes · Bayesian algorithm · naive Bayes · spell checker · spam filtering · machine learning · artificial intelligence · notes · tutorial
Note: this article belongs to its author and may not be reproduced without the author's permission.