
References (Artificial Intelligence)


曹晖

Purpose: to organize the references (including abstracts, reading notes, etc.) for convenient later use. Classification: roughly divided into papers, tutorials, and digests.

Contents

0 Introduction
1 Systems and Surveys
2 Neural Networks
3 Machine Learning
3.1 Analyzing the Effectiveness and Applicability of Co-training
3.2 Bootstrapping for Text Learning Tasks
3.3 ★ Building Domain-Specific Search Engines with Machine Learning Techniques
3.4 Combining Labeled and Unlabeled Data with Co-training
3.5 Combining Statistical and Relational Methods for Learning in Hypertext Domains
3.6 Discovering Test Set Regularities in Relational Domains
3.7 First-Order Learning for Web Mining
3.8 Learning a Monolingual Language Model from a Multilingual Text Database
3.9 Learning to Construct Knowledge Bases from the World Wide Web
3.10 The Role of Unlabeled Data in Supervised Learning
3.11 Using Reinforcement Learning to Spider the Web Efficiently
3.12 ★ Text Learning and Related Intelligent Agents: A Survey
3.13 ★ Learning Approaches for Detecting and Tracking News Events
3.14 ★ Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms
3.15 Using NLP for Machine Learning of User Profiles
4 Pattern Recognition
4.1 Think in Patterns with Java
4.2 Statistical Pattern Recognition: A Review

0 Introduction

The information retrieval references are grouped into the following categories: neural networks, machine learning (including text learning), and pattern recognition.

1 Systems and Surveys

This section introduces several artificial intelligence tools as well as surveys of artificial intelligence.


2 Neural Networks

3 Machine Learning

3.1 Analyzing the Effectiveness and Applicability of Co-training

Title: Analyzing the Effectiveness and Applicability of Co-training

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Analyzing the Effectiveness and Applicability of Co-training.ps

Authors: Kamal Nigam, Rayid Ghani

Notes: Kamal Nigam (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, knigam@cs.cmu.edu)

Rayid Ghani (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, rayid@cs.cmu.edu)

Abstract: Recently there has been significant interest in supervised learning algorithms that combine labeled and unlabeled data for text learning tasks. The co-training setting [1] applies to datasets that have a natural separation of their features into two disjoint sets. We demonstrate that when learning from labeled and unlabeled data, algorithms explicitly leveraging a natural independent split of the features outperform algorithms that do not. When a natural split does not exist, co-training algorithms that manufacture a feature split may out-perform algorithms not using a split. These results help explain why co-training algorithms are both discriminative in nature and robust to the assumptions of their embedded classifiers.
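To make the co-training setting concrete, here is a minimal sketch of the standard two-view co-training loop the abstract refers to; the choice of naive Bayes classifiers, the growth size per round, and the number of rounds are illustrative assumptions, not the paper's settings.

```python
# Two-view co-training sketch: each view's classifier labels the unlabeled
# examples it is most confident about, and those examples (with both views)
# are added to the shared labeled set. All settings here are illustrative.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10, per_round=5):
    """X1_*/X2_* are count-feature rows for the two views
    (e.g. words on a page vs. words in hyperlinks pointing to it)."""
    X1_lab, X2_lab, y_lab = list(X1_lab), list(X2_lab), list(y_lab)
    pool = list(range(len(X1_unlab)))
    clf1, clf2 = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        if not pool:
            break
        clf1.fit(np.array(X1_lab), y_lab)
        clf2.fit(np.array(X2_lab), y_lab)
        for clf, X_view in ((clf1, X1_unlab), (clf2, X2_unlab)):
            if not pool:
                break
            probs = clf.predict_proba(np.array([X_view[i] for i in pool]))
            top = np.argsort(probs.max(axis=1))[::-1][:per_round]  # most confident
            for j in top:
                i = pool[j]
                X1_lab.append(X1_unlab[i])
                X2_lab.append(X2_unlab[i])
                y_lab.append(clf.classes_[int(np.argmax(probs[j]))])
            picked = set(int(j) for j in top)
            pool = [p for k, p in enumerate(pool) if k not in picked]
    return clf1, clf2
```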

3.2 Bootstrapping for Text Learning Tasks

Title: Bootstrapping for Text Learning Tasks

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Bootstrap for Text Learning Tasks.ps

Authors: Rosie Jones, Andrew McCallum, Kamal Nigam, Ellen Riloff

Notes: Rosie Jones (rosie@cs.cmu.edu, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213)

Andrew McCallum (mccallum@justresearch.com, Just Research, 4616 Henry Street, Pittsburgh, PA 15213)

Kamal Nigam (knigam@cs.cmu.edu)

Ellen Riloff (riloff@cs.utah.edu, Department of Computer Science, University of Utah, Salt Lake City, UT 84112)

Abstract: When applying text learning algorithms to complex tasks, it is tedious and expensive to hand-label the large amounts of training data necessary for good performance. This paper presents bootstrapping as an alternative approach to learning from large sets of labeled data. Instead of a large quantity of labeled data, this paper advocates using a small amount of seed information and a large collection of easily-obtained unlabeled data. Bootstrapping initializes a learner with the seed information; it then iterates, applying the learner to calculate labels for the unlabeled data, and incorporating some of these labels into the training input for the learner. Two case studies of this approach are presented. Bootstrapping for information extraction provides 76% precision for a 250-word dictionary for extracting locations from web pages, when starting with just a few seed locations. Bootstrapping a text classifier from a few keywords per class and a class hierarchy provides accuracy of 66%, a level close to human agreement, when placing computer science research papers into a topic hierarchy. The success of these two examples argues for the strength of the general bootstrapping approach for text learning tasks.
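A minimal sketch of the generic bootstrapping loop described in the abstract; the classifier and the per-round growth are illustrative assumptions rather than the paper's actual configuration.

```python
# Generic bootstrapping loop: start from seed labels, repeatedly train a
# learner, label the unlabeled pool, and fold the most confident predictions
# back into the training set. Classifier and growth rate are assumptions.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def bootstrap(X_seed, y_seed, X_unlab, rounds=20, per_round=10):
    X_train, y_train = list(X_seed), list(y_seed)
    pool = list(range(len(X_unlab)))
    clf = MultinomialNB()
    for _ in range(rounds):
        if not pool:
            break
        clf.fit(np.array(X_train), y_train)
        probs = clf.predict_proba(np.array([X_unlab[i] for i in pool]))
        top = np.argsort(probs.max(axis=1))[::-1][:per_round]  # most confident
        for j in top:
            X_train.append(X_unlab[pool[j]])
            y_train.append(clf.classes_[int(np.argmax(probs[j]))])
        picked = set(int(j) for j in top)
        pool = [p for k, p in enumerate(pool) if k not in picked]
    return clf
```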

3.3 ★ Building Domain-Specific Search Engines with Machine Learning Techniques

Title: Building Domain-Specific Search Engines with Machine Learning Techniques

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Building Domain-Specific Search Engines with Machine Learning Techniques.ps

Authors: Andrew McCallum, Kamal Nigam, Jason Rennie, Kristie Seymore

Notes: Andrew McCallum (mccallum@justresearch.com, Just Research, 4616 Henry Street, Pittsburgh, PA 15213)

Kamal Nigam (knigam@cs.cmu.edu, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213)

Jason Rennie (jr6b@andrew.cmu.edu), Kristie Seymore (kseymore@ri.cmu.edu)

Abstract: Domain-specific search engines are growing in popularity because they offer increased accuracy and extra functionality not possible with the general, Web-wide search engines. For example, www.campsearch.com allows complex queries by age-group, size, location and cost over summer camps. Unfortunately these domain-specific search engines are difficult and time-consuming to maintain. This paper proposes the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific search engines. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, identifying informative text segments, and populating topic hierarchies. Using these techniques, we have built a demonstration system: a search engine for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. ...

Notes on the method: the system uses a multinomial naive Bayes text classification model.

A class c_j is represented by its document frequency P(c_j) and by word frequencies: for each word w_t in the vocabulary V, P(w_t|c_j) is the frequency with which w_t occurs in class c_j.

A document d_i is represented as an unordered bag of the words it contains. Given a document and a class, Bayes' rule is used to decide whether the document belongs to the class, and naive Bayes assumes that the occurrences of the individual words in a document are independent of one another.

Let w_{d_i,k} denote the k-th word in document d_i. The classification formula is then:

$$P(c_j \mid d_i) \propto P(c_j)\,P(d_i \mid c_j) \propto P(c_j)\prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)$$

P(c_j) and P(w_t|c_j) are obtained by learning from the training set D.

Computing P(w_t|c_j): roughly, the number of times w_t occurs in class c_j divided by the total number of word occurrences in c_j. To avoid zero probabilities, Laplace smoothing is used (every word is counted as occurring at least once).

N(w_t, d_i): the number of times w_t occurs in document d_i.

P(cj|di){0,1}:若文档di在分类cj中出现则为1,否则为0。

This gives the following formula:

$$P(w_t \mid c_j) = \frac{1 + \sum_{d_i \in D} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D} N(w_s, d_i)\, P(c_j \mid d_i)}$$

where |V| is the number of distinct words in the vocabulary. Reading of the formula:

Numerator = 1 + total number of occurrences of word w_t in class c_j;

Denominator = number of distinct words in the vocabulary + total number of word occurrences in class c_j.

P(cj)1dDP(cj|di)iCD

其中C是分类总数。 公式的说明:

分子 = 1 +分类cj中的文档总数;

分母 = 分类总数 + 文档总数。

Experiments show that classification performance is good when the training set is large (Lewis 1998). For a more complete treatment of the model, see Mitchell (1997) and McCallum & Nigam (1998).
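The estimates above translate directly into code. The following is a minimal sketch of multinomial naive Bayes with Laplace smoothing following the formulas; the function names and the toy data are illustrative.

```python
# Minimal multinomial naive Bayes with Laplace (add-one) smoothing,
# following the P(c_j) and P(w_t|c_j) estimates given above.
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: list of class ids."""
    vocab = {w for d in docs for w in d}
    classes = set(labels)
    word_counts = {c: Counter() for c in classes}   # word occurrences per class
    doc_counts = Counter(labels)                     # documents per class
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    priors = {c: (1 + doc_counts[c]) / (len(classes) + len(docs)) for c in classes}
    cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        cond[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab}
    return vocab, priors, cond

def classify_nb(doc, vocab, priors, cond):
    """Return the class maximizing log P(c_j) + sum_k log P(w_{d,k}|c_j)."""
    scores = {}
    for c, p in priors.items():
        s = math.log(p)
        for w in doc:
            if w in vocab:                 # unseen words are simply skipped
                s += math.log(cond[c][w])
        scores[c] = s
    return max(scores, key=scores.get)

# Toy usage (illustrative data):
docs = [["ball", "goal", "team"], ["election", "vote"], ["team", "win"]]
labels = ["sports", "politics", "sports"]
model = train_nb(docs, labels)
print(classify_nb(["goal", "team", "vote"], *model))
```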

References:

Lewis, D.D. 1998. Naive (Bayes) at forty: The independence assumption in information retrieval. In Proceedings of the 10th European Conference on Machine Learning (ECML-98).

McCallum, A., and Nigam, K. 1998. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization. http://www.cs.cmu.edu/~mccallum.

3.4 Combining Labeled and Unlabeled Data with Co-training

Title: Combining Labeled and Unlabeled Data with Co-training

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Combining Labeled and Unlabeled Data with Co-Training.ps

Authors: Avrim Blum, Tom Mitchell

Notes: Avrim Blum (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, avrim+@cs.cmu.edu)

Tom Mitchell (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, mitchell+@cs.cmu.edu)

Abstract: We consider the problem of using a large unlabeled sample to boost performance of a learning algorithm when only a small set of labeled examples is available. In particular, we consider a problem setting motivated by the task of learning to classify web pages, in which the description of each example can be partitioned into two distinct views. For example, the description of a web page can be partitioned into the words occurring on that page, and the words occurring in hyperlinks that point to that page. We assume that either view of the example would be sufficient for learning if we had enough labeled data, but our goal is to use both views together to allow inexpensive unlabeled data to augment a much smaller set of labeled examples. Specifically, the presence of two distinct views of each example suggests strategies in which two learning algorithms are trained separately on each view, and then each algorithm's predictions on new unlabeled examples are used to enlarge the training set of the other. Our goal in this paper is to provide a PAC-style analysis for this setting, and, more broadly, a PAC-style framework for the general problem of learning from both labeled and unlabeled data. We also provide empirical results on real web-page data indicating that this use of unlabeled examples can lead to significant improvement of hypotheses in practice.

3.5 Combining Statistical and Relational Methods for Learning in Hypertext Domains

Title: Combining Statistical and Relational Methods for Learning in Hypertext Domains

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Combining Statistical and Relational Methods for Learning in Hypertext Domains.ps

Authors: Seán Slattery, Mark Craven

Notes: Seán Slattery and Mark Craven (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, USA, email: <firstname>.<lastname>@cs.cmu.edu)

Abstract: We present a new approach to learning hypertext classifiers that combines a statistical text-learning method with a relational rule learner. This approach is well suited to learning in hypertext domains because its statistical component allows it to characterize text in terms of word frequencies, whereas its relational component is able to describe how neighboring documents are related to each other by hyperlinks that connect them. We evaluate our approach by applying it to tasks that involve learning definitions for (i) classes of pages, (ii) particular relations that exist between pairs of pages, and (iii) locating a particular class of information in the internal structure of pages. Our experiments demonstrate that this new approach is able to learn more accurate classifiers than either of its constituent methods alone.

3.6 Discovering Test Set Regularities in Relational Domains

Title: Discovering Test Set Regularities in Relational Domains

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Discovering Test Set Regularities in Relational Domains.ps

Authors: Sean Slattery, Tom Mitchell

Notes: Sean Slattery (SEAN.SLATTERY@CS.CMU.EDU), Tom Mitchell (TOM.MITCHELL@CS.CMU.EDU); Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract: Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the training set alone. For example, we have previously shown how FOIL, a relational learner, can learn to classify Web pages by discovering training set regularities in the words occurring on target pages, and on other pages related by hyperlinks. Here we show how the classification accuracy of FOIL on this task can be improved by discovering additional regularities on the test set pages that must be classified. Our approach can be seen as an extension to Kleinberg's Hubs and Authorities algorithm that analyzes hyperlink relations among Web pages. We present evidence that this new algorithm leads to better test set precision and recall on three binary Web classification tasks where the test set Web pages are taken from different Web sites than the training set.

3.7 First-Order Learning for Web Mining

Title: First-Order Learning for Web Mining

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\First-Order Learning for Web Mining.ps

Authors: Mark Craven, Seán Slattery, Kamal Nigam

Notes: Mark Craven, Seán Slattery and Kamal Nigam (School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891, USA, email: <firstname>.<lastname>@cs.cmu.edu). To appear in the proceedings of the 10th European Conference on Machine Learning.

Abstract: We present compelling evidence that the World Wide Web is a domain in which applications can benefit from using first-order learning methods, since the graph structure inherent in hypertext naturally lends itself to a relational representation. We demonstrate strong advantages for two applications: learning classifiers for Web pages, and learning rules to discover relations among pages.

3.8 Learning a Monolingual Language Model from a Multilingual Text Database

Title: Learning a Monolingual Language Model from a Multilingual Text Database

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Learning a Monolingual Language Model from a Multilingual Text Database.ps

Authors: Rayid Ghani, Rosie Jones

Notes: Rayid Ghani (Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA, Rayid.Ghani@cs.cmu.edu)

Rosie Jones (Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA, Rosie.Jones@cs.cmu.edu)

Abstract: Language models are of importance in speech recognition, document classification, and database selection algorithms. Traditionally language models are learned from corpora specifically acquired for the purpose. Increasingly, however, there is interest in constructing language models for specific languages from heterogeneous sources such as the web. Query-based sampling has been shown to be effective for gauging the content of monolingual heterogeneous databases. We propose evaluating an extension to this approach by considering the case of learning a monolingual language model from a multilingual database, and extensions to the query-based sampling algorithm to handle this case. We test our approach on a corpus collected from the WWW and show that our proposed methods perform accurately and efficiently for learning a language model of Tagalog, when these documents are only 2.5% of the documents in a collection.

3.9 Learning to Construct Knowledge Bases from the World Wide Web

Title: Learning to Construct Knowledge Bases from the World Wide Web

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Learning to Construct Knowledge Bases from the World Wide Web.ps

Authors: Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom Mitchell, Kamal Nigam, Seán Slattery

Notes: School of Computer Science, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213-3891, USA; Just Research, 4616 Henry Street, Pittsburgh, PA 15213, USA. To appear in Artificial Intelligence, Elsevier, 1999.

Abstract: The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs. The first is an ontology that defines the classes (e.g., company, person, employee, product) and relations (e.g., employed by, produced by) of interest when creating the knowledge base. The second is a set of training data consisting of labeled regions of hypertext that represent instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This article describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system that has created a knowledge base describing university people, courses, and research projects.

3.10 The Role of Unlabeled Data in Supervised Learning

Title: The Role of Unlabeled Data in Supervised Learning

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\The Role of Unlabeled Data in Supervised Learning.ps

Authors: Tom M. Mitchell

Notes: Tom M. Mitchell (School of Computer Science, Carnegie Mellon University, Tom.Mitchell@cmu.edu). In Proceedings of the Sixth International Colloquium on Cognitive Science (ICCS-99), San Sebastian, Spain, May 1999 (invited paper).

Abstract: Most computational models of supervised learning rely only on labeled training examples, and ignore the possible role of unlabeled data. This is true both for cognitive science models of learning such as SOAR [Newell 1990] and ACT-R [Anderson, et al. 1995], and for machine learning and data mining algorithms such as decision tree learning and inductive logic programming (see, e.g., [Mitchell 1997]). In this paper we consider the potential role of unlabeled data in supervised learning. We present an algorithm and experimental results demonstrating that unlabeled data can significantly improve learning accuracy in certain practical problems. We then identify the abstract problem structure that enables the algorithm to successfully utilize this unlabeled data, and prove that unlabeled data will boost learning accuracy for problems in this class. The problem class we identify includes problems where the features describing the examples are redundantly sufficient for classifying the example; a notion we make precise in the paper. This problem class includes many natural learning problems faced by humans, such as learning a semantic lexicon over noun phrases in natural language, and learning to recognize objects from multiple sensor inputs. We argue that models of human and animal learning should consider more strongly the potential role of unlabeled data, and that many natural learning problems fit the class we identify.

3.11 Using Reinforcement Learning to Spider the Web Efficiently

Title: Using Reinforcement Learning to Spider the Web Efficiently

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Using Reinforcement Learning to Spider the Web Efficiently.ps

Authors: Jason Rennie, Andrew Kachites McCallum

Notes: Jason Rennie (jrennie@justresearch.com), Andrew Kachites McCallum (mccallum@justresearch.com); School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; Just Research, 4616 Henry Street, Pittsburgh, PA 15213

Abstract: Consider the task of exploring the Web in order to find pages of a particular kind or on a particular topic. This task arises in the construction of search engines and Web knowledge bases. This paper argues that the creation of efficient web spiders is best framed and solved by reinforcement learning, a branch of machine learning that concerns itself with optimal sequential decision making. One strength of reinforcement learning is that it provides a formalism for measuring the utility of actions that give benefit only in the future. We present an algorithm for learning a value function that maps hyperlinks to future discounted reward using a naive Bayes text classifier. Experiments on two real-world spidering tasks show a three-fold improvement in spidering efficiency over traditional breadth-first search, and up to a two-fold improvement over reinforcement learning with immediate reward only.
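A rough, purely illustrative sketch of the overall idea: order the crawl frontier by an estimate of each hyperlink's future discounted reward predicted from its anchor text. The helper functions and the scoring model below are assumptions; in the paper a naive Bayes text classifier provides the value estimate.

```python
# Best-first spidering ordered by a learned value estimate per hyperlink.
# fetch, extract_links, and predict_value are placeholders supplied by the
# caller; all names and the structure are illustrative assumptions.
import heapq

def crawl(seed_urls, fetch, extract_links, predict_value, budget=100):
    """fetch(url) -> page text; extract_links(text) -> [(anchor_text, url)];
    predict_value(anchor_text) -> estimated discounted future reward."""
    frontier = [(0.0, url) for url in seed_urls]  # (negated score, url)
    heapq.heapify(frontier)
    visited, found = set(), []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        found.append((url, page))
        for anchor, link in extract_links(page):
            if link not in visited:
                # Higher predicted value -> popped earlier (min-heap, so negate).
                heapq.heappush(frontier, (-predict_value(anchor), link))
    return found
```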

3.12 ★ Text Learning and Related Intelligent Agents: A Survey

Title: Text-Learning and Related Intelligent Agents: A Survey

Link: ACM\\IEEE Intelligent Systems\\1999\\x4.zip

Authors: Dunja Mladenic

Notes: J. Stefan Institute; IEEE Intelligent Systems, July/August 1999

Abstract: In developing intelligent text-learning agents, the author focuses on three key criteria: how documents are represented, how features are selected, and which learning algorithm is used. She then describes Personal WebWatcher, a content-based intelligent agent that uses text-learning techniques to personalize Web browsing.

Characteristics of information on the Web: it is massive, distributed, heterogeneous, and mixes multiple media. To help users browse the Web more effectively, recent techniques combine machine learning and information retrieval; applying machine learning techniques to text databases is called text learning. The article surveys supervised text-learning methods used for text classification, and then discusses Personal WebWatcher in detail.

Machine learning for intelligent agents: although intelligent agents have been defined in many ways, the author prefers to view them as user assistants and recommender systems built with machine learning and data-mining techniques. Such an agent helps the user find information or perform simple tasks on the user's behalf, for example finding documents similar to ones the user has already found useful.

Two techniques are commonly used to build machine-learning-based intelligent agents: content-based and collaborative approaches. Table 1 lists some prototype systems.

Content-based approach: rooted in information retrieval and used for text classification, the system looks for items similar to content the user has specified. However, most representation structures capture only some aspects of the content, which can limit system performance.

Table 1. Content-based approaches that use machine learning techniques (agent, where developed, goal, publication).

Antagonomy (NEC). Goal: personalized news; monitors user behavior -> builds a profile -> ranks news. Publication: T. Kamba, H. Sakagami, and Y. Koseki, "Anatagonomy: A Personalized Newspaper on the World Wide Web," Int'l J. Human-Computer Studies, Vol. 46, No. 6, June 1997, pp. 789–803.

Calendar Apprentice (CMU). Goal: meeting scheduling. Publication: T. Mitchell et al., "Experience with a Learning Personal Assistant," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 81–91.

CiteSeer (TX, NEC, UMIACS). Goal: finding research papers on the Web; the user gives keywords -> they are submitted to search engines -> titles, abstracts, and citations are extracted from PostScript files; citations are used to identify similar documents. Publication: K. Bollacker, S. Lawrence, and L. Giles, "CiteSeer: An Autonomous System for Processing and Organizing Scientific Literature on the Web," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

ContactFinder (Andersen Consulting). Goal: finding experts; reads and responds to bulletin-board messages, returning people's names. Publication: B. Krulwich and C. Burkey, "The ContactFinder Agent: Answering Bulletin Board Questions with Referrals," Proc. 13th Nat'l Conf. AI (AAAI 96), AAAI Press, Menlo Park, Calif., 1996, pp. 10–15.

FAQFinder (Univ. of Chicago). Goal: question answering; users ask questions in natural language over distributed data sources (FAQ files). Publications: R. Burke, K. Hammond, and J. Kozlovsky, "Knowledge-Based Information Retrieval for Semi-Structured Text," Working Notes from AAAI Fall Symp. AI Applications in Knowledge Navigation and Retrieval, AAAI Press, Menlo Park, Calif., 1995, pp. 19–24. R. Burke et al., "Question Answering from Frequently Asked Question Files," AI Magazine, Vol. 18, No. 2, Summer 1997, pp. 57–66.

Internet Fish (MIT). Goal: finding information on the Internet; a resource-discovery tool with information extraction, a natural-language interface, and use of existing search engines and user ratings. Publication: B.A. LaMacchia, "Internet Fish, A Revised Version of a Thesis Proposal," MIT, AI Lab and Dept. of Electrical Eng. and Computer Science, Cambridge, Mass., 1996.

Letizia (MIT). Goal: Web browsing; a UI agent that assists browsing without requiring keywords or ratings. Publication: H. Lieberman, "Letizia: An Agent that Assists Web Browsing," Proc. 14th Int'l Joint Conf. AI (IJCAI 95), AAAI Press, Menlo Park, Calif., 1995, pp. 924–929.

Lira (Stanford). Goal: Web browsing; learns the user's browsing habits, finds pages over a time period, ranks them, accepts user ratings, and re-adjusts its search and selection strategy. Publication: M. Balabanovic and Y. Shoham, "Learning Information Retrieval Agents: Experiments with Automated Web Browsing," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

Musag (Hebrew Univ.). Goal: Web browsing; keyword expansion and semantic relatedness between concepts. Publication: C.V. Goldman, A. Langer, and J.S. Rosenschein, "Musag: An Agent That Learns What You Mean," Applied AI, Vol. 11, No. 5, 1997, pp. 413–435.

NewsWeeder (CMU). Goal: Usenet news filtering; uses text classification to build a model of user interests and accepts user feedback. Publication: K. Lang, "NewsWeeder: Learning to Filter Netnews," Proc. 12th Int'l Conf. Machine Learning, Morgan Kaufmann, San Francisco, 1995, pp. 331–339.

Personal WebWatcher (CMU, IJS). Goal: Web browsing; highlights relevant links and builds a user profile from content analysis. Publication: D. Mladenic, Personal WebWatcher: Implementation and Design, Tech. Report IJS-DP-7472, Dept. of Computer Science, J. Stefan Inst., 1996; http://cs.cmu.edu/~TextLearning/pww.

Syskill & Webert (UCI). Goal: Web browsing; user ratings -> a profile per topic -> queries generated and submitted to a search engine. Publications: M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites," Proc. 13th Nat'l Conf. AI (AAAI 96), AAAI Press, Menlo Park, Calif., 1996, pp. 54–61. M. Pazzani and D. Billsus, "Learning and Revising User Profiles: The Identification of Interesting Web Sites," Machine Learning 27, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997, pp. 313–331.

WAWA (Wisconsin). Goal: Web browsing; the user states interest preferences -> they are stored in a neural network -> theory revision refines the learned knowledge. Publication: J. Shavlik and T. Eliassi-Rad, "Building Intelligent Agents for Web-based Tasks: A Theory-Refinement Approach," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

WebWatcher (CMU). Goal: Web browsing. Publication: R. Armstrong et al., "WebWatcher: A Learning Apprentice for the World Wide Web," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.

Collaborative approach: assumes a group of users is using the system (also called social learning). It looks for users with similar interests, computing similarity between users rather than between documents, and does not analyze content. User similarity is computed from whether users have rated the same documents similarly.

Drawbacks: with few users there is a data-sparsity problem, and performance is poor for users with unusual interests.

Figure 3. User ratings of music (7 is the highest score):

        Chopin  Bach  Matheny  Balasevic  Prodigy  Presley  ABBA  Enya
User1   6       7     5        7          1        2        3     7
User2   7       6     6        7          -        1        5     6
User3   2       1     1        -          7        6        6     3

The collaborative approach finds that User1 and User2 have similar interests, so documents that User2 likes are recommended to User1. Collaborative methods are usually applied to non-text data, but can sometimes handle text data as well (e.g., news filtering).
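A minimal sketch of this user-based collaborative idea using the Figure 3 ratings; the cosine similarity over co-rated items is an illustrative choice, not the survey's prescription.

```python
# User-based collaborative filtering over the Figure 3 ratings.
import math

ratings = {  # None marks a missing rating
    "User1": [6, 7, 5, 7, 1, 2, 3, 7],
    "User2": [7, 6, 6, 7, None, 1, 5, 6],
    "User3": [2, 1, 1, None, 7, 6, 6, 3],
}

def similarity(a, b):
    """Cosine similarity over items both users have rated."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    num = sum(x * y for x, y in pairs)
    den = math.sqrt(sum(x * x for x, _ in pairs)) * math.sqrt(sum(y * y for _, y in pairs))
    return num / den if den else 0.0

# User1 is far more similar to User2 than to User3, so items User2 rates
# highly (and User1 has not seen) would be recommended to User1, and vice versa.
print(similarity(ratings["User1"], ratings["User2"]))  # high
print(similarity(ratings["User1"], ratings["User3"]))  # low
```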

Table 2. Some collaborative approaches that use machine learning techniques (agent, where developed, goal, publication).

Firefly, Ringo (MIT). Goal: finding music, movies, books. Publication: P. Maes, "Agents that Reduce Work and Information Overload," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 30–40.

GroupLens (Minnesota). Goal: Usenet news filtering. Publication: J.A. Konstan et al., "GroupLens: Applying Collaborative Filtering to Usenet News," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 77–87.

Phoaks (AT&T Labs). Goal: Web browsing. Publication: T. Terveen et al., "PHOAKS: A System for Sharing Recommendations," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 59–62.

Referral Web (AT&T Labs). Goal: finding experts. Publications: H. Kautz, B. Selman, and M. Shah, "Referral Web: Combining Social Networks and Collaborative Filtering," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 63–65. H. Kautz, B. Selman, and M. Shah, "The Hidden Web," AI Magazine, Vol. 18, No. 2, Summer 1997, pp. 27–36.

Siteseer (Imana). Goal: Web browsing. Publication: J. Rucker and J.P. Marcos, "Siteseer: Personalized Navigation for the Web," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 73–75.

Table 3. Systems that combine content-based and collaborative approaches.

Fab (Stanford). Goal: Web browsing. Publication: M. Balabanovic and Y. Shoham, "Fab: Content-Based, Collaborative Recommendation," Comm. ACM, Vol. 40, No. 3, Mar. 1997, pp. 66–70.

Lifestyle Finder (AgentSoft). Goal: Web browsing. Publication: B. Krulwich, "Lifestyle Finder," AI Magazine, Vol. 18, No. 2, Summer 1997, pp. 37–46.

WebCobra (James Cook Univ.). Goal: Web browsing. Publication: O. de Vel and S.A. Nesbitt, "Collaborative Filtering Agent System for Dynamic Virtual Communities on the Web," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.

Machine learning on text data: applying machine learning techniques to text data involves three key criteria: how documents are represented, how features are selected, and which learning algorithm is used.

Document representation. In information retrieval and text learning, documents are usually represented as vectors: the bag-of-words representation, which ignores word order and document structure (see Figure A). Additional information from the text, such as sentence structure, word position, or neighboring words, can also be used. The question is how much such additional information actually helps text learning, and at what cost.

Some information retrieval research suggests that for long documents the extra information is not worth its cost, while for short documents using word sequences (n-grams) instead of single words can improve classifier performance.

As Figure A shows, existing systems generally use a bag-of-words representation with Boolean features indicating whether a word occurs in the document, or with word frequencies. Some systems also use additional information such as word position or word sequences (n-grams; "machine learning" is a 2-gram, "World Wide Web" a 3-gram). Recent work also suggests that exploiting hypertext structure and the organization of web pages can improve classification performance. No one has yet studied whether particular document representations are especially well suited to particular domains.
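A small sketch of these representations; the tokenizer and the feature variants shown are illustrative.

```python
# Bag-of-words (Boolean or frequency) and word n-gram features for a document.
from collections import Counter

def tokens(text):
    return text.lower().split()

def bag_of_words(text, boolean=True):
    counts = Counter(tokens(text))
    return {w: (1 if boolean else c) for w, c in counts.items()}

def ngrams(text, n=2):
    toks = tokens(text)
    return Counter(" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1))

doc = "machine learning for the world wide web"
print(bag_of_words(doc))            # Boolean bag of words
print(bag_of_words(doc, False))     # word-frequency bag of words
print(ngrams(doc, 2))               # 2-grams such as "machine learning"
```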

Number of features. First, stop words such as a, the, and with are removed, as are very rare words; words are also stemmed (work replaces works, working, and worked).

Many systems use language-independent methods that score words and select feature words, or use latent semantic indexing (LSI) to reduce dimensionality.

In most feature-selection experiments for text classification, good results can sometimes be obtained with a carefully chosen small subset of feature words (<10%), and sometimes by simply keeping all of them. Word scoring generally depends on the domain and on the classification algorithm. Surprisingly, simply using word frequency together with a stop list often works well.
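A minimal sketch of the simple pipeline just described (stop-list removal, crude stemming, frequency-based selection); the stop list, suffix rules, and cutoff are illustrative assumptions.

```python
# Frequency-based feature selection with a stop list and naive suffix stripping.
from collections import Counter

STOP = {"a", "the", "with", "of", "and", "to", "in"}   # illustrative stop list

def stem(word):
    # Very crude stemming; a real system would use something like Porter's stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def select_features(docs, top_fraction=0.1, min_count=2):
    counts = Counter()
    for doc in docs:
        counts.update(stem(w) for w in doc.lower().split() if w not in STOP)
    frequent = [(w, c) for w, c in counts.most_common() if c >= min_count]
    keep = max(1, int(len(frequent) * top_fraction))
    return [w for w, _ in frequent[:keep]]

docs = ["the cat works with the dog", "a dog working in the house", "cats and dogs"]
print(select_features(docs, top_fraction=0.5))
```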

Algorithm. Text classification often uses the TFIDF vector-space method: each document is represented by a TFIDF vector, and the vectors are summed to build the classification model.

$$d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$$

TF (term frequency) is the number of times word w_i occurs in the document, and

$$IDF(w_i) = \log\frac{|D|}{DF(w_i)}$$

where |D| is the total number of documents and DF(w_i) is the number of documents containing w_i.

The distance between TFIDF vectors is usually measured with the cosine measure.
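A compact sketch of TFIDF weighting with the cosine measure, building one prototype vector per class by summing document vectors as described above; the toy data and smoothing details are illustrative.

```python
# TFIDF vectors with cosine similarity; class prototypes are sums of their
# documents' vectors. Details are illustrative choices.
import math
from collections import Counter, defaultdict

def tfidf_vectors(docs):
    """docs: list of token lists -> (list of {word: tf*idf}, idf dict)."""
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {w: math.log(len(docs) / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs], idf

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def class_prototypes(vectors, labels):
    protos = defaultdict(lambda: defaultdict(float))
    for vec, c in zip(vectors, labels):
        for w, x in vec.items():
            protos[c][w] += x
    return protos

docs = [["goal", "team", "goal"], ["vote", "election"], ["team", "coach"]]
labels = ["sports", "politics", "sports"]
vectors, idf = tfidf_vectors(docs)
protos = class_prototypes(vectors, labels)
query_vec = {w: tf * idf.get(w, 0.0) for w, tf in Counter(["team", "goal"]).items()}
print(max(protos, key=lambda c: cosine(query_vec, protos[c])))  # -> "sports"
```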

In general, the TFIDF approach performs worse than machine learning methods. Probabilistic TFIDF is an extension of TFIDF whose performance is comparable to a naive Bayes classifier. Naive Bayes and k-nearest neighbor are two of the most widely used and best-performing classifiers; decision trees and decision rules have also been used.

Table A. Document representation, feature selection, and learning algorithms in text learning (the bag-of-words representation uses Boolean features unless word frequency is noted, abbreviated frq). The works surveyed in the table:

C. Apte, F. Damerau, and S.M. Weiss, "Toward Language Independent Automated Learning of Text Categorization Models," Proc. Seventh Ann. Int'l ACM-SIGIR Conf. Research and Development in Information Retrieval, ACM Press, New York, 1994, pp. 23–30.
C. Apte, F. Damerau, and S.M. Weiss, "Text Mining with Decision Rules and Decision Trees," Working Notes of Learning from Text and the Web, Conf. Automated Learning and Discovery (CONALD-98), Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.
R. Armstrong et al., "WebWatcher: A Learning Apprentice for the World Wide Web," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.
M. Balabanovic and Y. Shoham, "Learning Information Retrieval Agents: Experiments with Automated Web Browsing," AAAI 1995 Spring Symp. Information Gathering from Heterogeneous, Distributed Environments, AAAI Press, Menlo Park, Calif., 1995.
B.T. Bartell, G.W. Cottrell, and R.K. Belew, "Latent Semantic Indexing is an Optimal Special Case of Multidimensional Scaling," Proc. ACM SIGIR, ACM Press, New York, 1992, pp. 161–167.
M.W. Berry, S.T. Dumais, and G.W. O'Brien, "Using Linear Algebra for Intelligent Information Retrieval," SIAM Review, Vol. 37, No. 4, Dec. 1995, pp. 573–595.
P.W. Foltz and S.T. Dumais, "Personalized Information Delivery: An Analysis of Information-Filtering Methods," Comm. ACM, Vol. 35, No. 12, 1992, pp. 51–60.
R.M. Creecy et al., "Trading MIPS and Memory for Knowledge Eng.," Comm. ACM, Vol. 35, No. 8, Aug. 1992, pp. 48–64.
W.W. Cohen, "Learning to Classify English Text with ILP Methods," Workshop on Inductive Logic Programming, CS Dept., K.U. Leuven, 1995, pp. 3–24.
W.W. Cohen and Y. Singer, "Context-Sensitive Learning Methods for Text Categorization," Proc. 19th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '96), ACM Press, New York, 1996, pp. 307–315.
B. Gelfand, M. Wulfekuhler, and W.F. Punch III, "Automated Concept Extraction from Plain Text," Working Notes of Learning from Text and the Web, CONALD-98, Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.
T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. 14th Int'l Conf. Machine Learning (ICML 97), Morgan Kaufmann, San Francisco, 1997, pp. 143–151.
T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proc. 10th European Conf. Machine Learning (ECML 98), Springer-Verlag, Berlin, 1998, pp. 137–142.
W. Lam, K.F. Low, and C.Y. Ho, "Using a Bayesian Network Induction Approach for Text Categorization," Proc. 15th Int'l Joint Conf. Artificial Intelligence (IJCAI 97), AAAI Press, Menlo Park, Calif., 1997, pp. 745–750.
W. Lam and C.Y. Ho, "Using a Generalized Instance Set for Automatic Text Categorization," Proc. 21st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '98), ACM Press, New York, 1998, pp. 81–89.
D.D. Lewis and M. Ringuette, "Comparison of Two Learning Algorithms for Text Categorization," Proc. Third Ann. Symp. Document Analysis and Information Retrieval, Information Sciences Research Inst., Las Vegas, 1994, pp. 81–93.
D.D. Lewis and W.A. Gale, "A Sequential Algorithm for Training Text Classifiers," Proc. Seventh Ann. Int'l ACM-SIGIR Conf. Research and Development in Information Retrieval, ACM Press, New York, 1994.
D.D. Lewis et al., "Training Algorithms for Linear Text Classifiers," Proc. 19th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR '96), ACM Press, New York, 1996, pp. 298–306.
R. Liere and P. Tadepalli, "Active Learning with Committees: Preliminary Results in Comparing Winnow and Perceptron in Text Categorization," Working Notes of Learning from Text and the Web, CONALD-98, Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.
P. Maes, "Agents that Reduce Work and Information Overload," Comm. ACM, Vol. 37, No. 7, July 1994, pp. 30–40.
D. Mladenic, Personal WebWatcher: Implementation and Design, Tech. Report IJS-DP-7472, Dept. of Computer Science, J. Stefan Inst., 1996; http://www.cs.cmu.edu/~TextLearning/pww.
D. Mladenic and M. Grobelnik, "Feature Selection for Classification Based on Text Hierarchy," Working Notes of Learning from Text and the Web, CONALD-98, Carnegie Mellon Univ., Pittsburgh, 1998.
D. Mladenic and M. Grobelnik, "Word Sequences as Features in Text-Learning," Proc. Seventh Electrotechnical and Computer Science Conf. (ERK '98), IEEE Region 8, Slovenia Section IEEE, Ljubljana, Slovenia, 1998, pp. 145–148.
I. Moulinier and J.-G. Ganascia, "Applying an Existing Machine Learning Algorithm to Text Categorization," Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, S. Wermter, E. Riloff, and G. Scheler, eds., Springer-Verlag, Berlin, 1996, pp. 343–354.
K. Nigam and A. McCallum, "Pool-Based Active Learning for Text Classification," Working Notes of Learning from Text and the Web, CONALD-98, Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.
M. Pazzani, J. Muramatsu, and D. Billsus, "Syskill & Webert: Identifying Interesting Web Sites," Proc. 13th Nat'l Conf. Artificial Intelligence (AAAI 96), AAAI Press, Menlo Park, Calif., 1996, pp. 54–61.
M. Pazzani and D. Billsus, "Learning and Revising User Profiles: The Identification of Interesting Web Sites," Machine Learning 27, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997, pp. 313–331.
J. Shavlik and T. Eliassi-Rad, "Building Intelligent Agents for Web-Based Tasks: A Theory-Refinement Approach," Working Notes of Learning from Text and the Web, CONALD-98, Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.
S. Slattery and M. Craven, "Learning to Exploit Document Relationships and Structure: The Case for Relational Learning on the Web," Working Notes of Learning from Text and the Web, CONALD-98, Carnegie Mellon Univ., Pittsburgh, 1998; http://www.cs.cmu.edu/~conald/conald.shtml.
M. McElligott and H. Sorensen, "An Emergent Approach to Information Filtering," Abakus, U.C.C. Computer Science J., Vol. 1, No. 4, Dec. 1993, pp. 1–19.
H. Sorensen and M. McElligott, "PSUN: A Profiling System for Usenet News," CIKM '95 Intelligent Information Agents Workshop, 1995.
E. Wiener, J.O. Pedersen, and A.S. Weigend, "A Neural Network Approach to Topic Spotting," Proc. Fourth Ann. Symp. Document Analysis and Information Retrieval (SDAIR '95), Information Science Research Inst., Las Vegas, 1995; http://www.stern.nyu.edu/~aweigend/Research/Papers/TextCategorization.
Y. Yang, "Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval," Proc. Seventh Ann. Int'l ACM-SIGIR Conf. Research and Development in Information Retrieval, ACM Press, New York, 1994, pp. 13–22.
Y. Yang, "An Evaluation of Statistical Approaches to Text Categorization," Information Retrieval J., May 1999.


3.13 ★ Learning Approaches for Detecting and Tracking News Events

Title: Learning Approaches for Detecting and Tracking News Events

Link: ACM\\IEEE Intelligent Systems\\1999\\x4.zip

Authors: Yiming Yang, Jaime G. Carbonell, Ralf D. Brown, Thomas Pierce, Brian T. Archibald, and Xin Liu

Notes: Language Technologies Institute, Carnegie Mellon University; IEEE Intelligent Systems, July/August 1999

Abstract: The goal of event detection and tracking is to detect new events in a time-ordered stream of news stories and to keep tracking events of interest. Yang and her colleagues survey the relevant information retrieval and machine learning techniques and extend existing supervised learning and unsupervised clustering algorithms so that documents can be classified by content and by the time window of an event. Evaluating their algorithms on Reuters and CNN news, they find that agglomerative document clustering is effective for retrospective event detection, while single-pass clustering with a time window is effective for detecting new events. For event tracking with only a few training examples, k-nearest neighbor classification and decision trees are effective.
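A rough sketch of single-pass clustering with a time window as described in the summary; the similarity function, threshold, and window size are illustrative assumptions, not the paper's settings.

```python
# Single-pass clustering with a time window for new-event detection:
# a story starts a new event unless it is similar enough to a recent cluster.
import math

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(stories, threshold=0.3, window_days=7):
    """stories: list of (day, {word: weight}) in time order. Returns cluster ids."""
    clusters = []   # (last_day, centroid) per cluster
    labels = []
    for day, vec in stories:
        best, best_sim = None, 0.0
        for cid, (last_day, centroid) in enumerate(clusters):
            if day - last_day <= window_days:        # only recent clusters compete
                sim = cosine(vec, centroid)
                if sim > best_sim:
                    best, best_sim = cid, sim
        if best is not None and best_sim >= threshold:
            last_day, centroid = clusters[best]
            for w, x in vec.items():                  # fold the story into the centroid
                centroid[w] = centroid.get(w, 0.0) + x
            clusters[best] = (day, centroid)
            labels.append(best)
        else:
            clusters.append((day, dict(vec)))         # new event detected
            labels.append(len(clusters) - 1)
    return labels
```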

3.14 ★ Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms

Title: Machine Learning for IR - Neural Networks, Symbolic Learning, and Genetic Algorithms

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Machine Learning for IR - Neural Networks, Symbolic Learning, and Genetic Algorithms.ps

Authors: Hsinchun Chen

Notes: July 5, 1994; MIS Dept., College of Business and Public Administration, Univ. of Arizona

Abstract: Information retrieval using probabilistic techniques has attracted significant attention on the part of researchers in information and computer science over the past few decades. In the 1980s knowledge-based techniques also made an impressive contribution to ``intelligent'' information retrieval and indexing. More recently, information science researchers have turned to other newer artificial-intelligence based inductive learning techniques including neural networks, symbolic learning, and genetic algorithms. These newer techniques, which are grounded on diverse paradigms, have provided great opportunities for researchers to enhance the information processing and retrieval capabilities of current information storage and retrieval systems.

In this article we first provide an overview of these newer techniques and their use in information science research. In order to familiarize readers with these techniques, we present three popular methods: the connectionist Hopfield network, the symbolic ID3/ID5R, and evolution-based genetic algorithms. We discuss their knowledge representations and algorithms in the context of information retrieval. Sample implementation and testing results from our own research are also provided for each technique. We believe these techniques are robust in their ability to analyze user queries, identify users' information needs, and suggest alternatives for search. With proper user-system interactions, these methods can greatly complement the prevailing full-text, keyword-based, probabilistic, and knowledge-based techniques.

3.15 Using NLP for Machine Learning of User Profiles

Title: Using NLP for Machine Learning of User Profiles

Link: Papers 论文集\\AI 人工智能\\Machine Learning 机器学习\\Using NLP for Machine Learning of User Profiles.mht

Authors: Eric Bloedorn and Inderjeet Mani

Notes: 1997

Abstract: As more information becomes available electronically, tools for finding information of interest to users becomes increasingly important. The goal of the research described here is to build a system for generating comprehensible user profiles that accurately capture user interest with minimum user interaction. The research focuses on the importance of a suitable generalization hierarchy and representation for learning profiles which are predictively accurate and comprehensible. In our experiments we evaluated both traditional features based on weighted term vectors as well as subject features corresponding to categories which could be drawn from a thesaurus. Our experiments, conducted in the context of a content-based profiling system for on-line newspapers on the World Wide Web (the IDD News Browser), demonstrate the importance of a generalization hierarchy and the promise of combining natural language processing techniques with machine learning (ML) to address an information retrieval (IR) problem.

4 Pattern Recognition

4.1 Think in Patterns with Java

Title: Think in Patterns with Java

Link: Papers 论文集\\AI 人工智能\\Pattern Recognition 模式识别\\Think In Patterns with Java.pdf

Authors: Bruce Eckel

Notes: President, MindView, Inc.; July 2000

Abstract: This book introduces the important and yet non-traditional "patterns" approach to program design.

4.2 Statistical Pattern Recognition: A Review

Title: Statistical Pattern Recognition: A Review

Link: Papers 论文集\\AI 人工智能\\Pattern Recognition 模式识别\\统计模式识别(回顾).doc

Authors:

Notes:

Abstract: Supervised and unsupervised classification are the primary goals of pattern recognition. Among the various frameworks in which pattern recognition has traditionally been formulated, the statistical approach has been the most intensively studied and applied. More recently, neural network techniques and methods imported from statistical learning theory have received increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, the sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. Although research and development in this field has gone on for nearly 50 years, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The purpose of this review is to summarize and compare some of the well-known methods used in the various stages of a pattern recognition system, and to describe frontier research topics and applications in this exciting and challenging field.

