1. Basics of a full-text search system
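A full-text search system is one that can search the entire text content of its source documents. This involves four steps: extracting and indexing the text of the source documents, analyzing the query, matching against the index, and ranking and outputting the matches. The hard part is ensuring relevance, that is, that the entries returned really match the meaning of the query. Clearly, correctly understanding the user's query is a prerequisite, and this "understanding" passes through three stages in order: lexical, syntactic, and semantic analysis.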
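The four steps can be sketched with a toy inverted index. This is only an illustration under simplifying assumptions: the names `tokenize`, `build_index` and `search` are hypothetical, tokens are whitespace-separated, and ranking is a plain sum of term frequencies rather than any real relevance model.

```python
from collections import defaultdict


def tokenize(text):
    # Assumption: lowercase, whitespace-separated tokens (no real lexical analysis).
    return text.lower().split()


def build_index(docs):
    # Step 1: extract text and build the index, mapping term -> {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    # Step 2: analyze the query; step 3: match its terms against the index.
    scores = defaultdict(int)
    for term in tokenize(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Step 4: rank by descending score, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


docs = {1: "full text search", 2: "text text index", 3: "database"}
index = build_index(docs)
print(search(index, "text index"))
```

Document 2 ranks first here because it contains "text" twice and "index" once; document 3 shares no term with the query and is not returned.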
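Lexical analysis identifies the basic linguistic units in a continuous input string, and is usually called word segmentation. In English, words are separated by explicit delimiters, so segmentation is simple and direct. Chinese has no explicit word delimiters, so deciding which characters form a word is harder. Two approaches to Chinese segmentation are in common use: fixed-length splitting (N-gram) and dictionary-based matching. Fixed-length splitting is simple: the text is cut into overlapping pieces of a fixed length, so with a length of 2, "全文检索" yields three tokens, "全文/文检/检索". Dictionary-based matching is comparatively accurate: the system keeps a dictionary of all words and their frequencies, and during segmentation it takes fragments of the input string in turn, according to some strategy, and looks them up to decide whether each is a word. Parameters such as part of speech can then be used to improve accuracy further.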
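Fixed-length splitting can be sketched in a few lines. The function name `ngram_split` is hypothetical, and the handling of inputs shorter than the window (returned as a single token) is an assumption made for this example.

```python
def ngram_split(text, n=2):
    """Cut text into overlapping pieces of length n (fixed-length splitting)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(text) <= n:
        # Assumption: input no longer than the window becomes one token.
        return [text] if text else []
    # Slide a window of width n one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print("/".join(ngram_split("全文检索", 2)))  # 全文/文检/检索
print("/".join(ngram_split("全文检索", 1)))  # 全/文/检/索
```

With `n=1` the same function gives single-character splitting.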
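The final search results produced by fixed-length splitting are not as bad as one might expect, and it is still the main algorithm in practical use.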
Syntactic and semantic analysis of natural language is still some way from practical use.
2. Full-text search products
Popular full-text search products at present include:
Author: Doug Cutting cutting@apache.org
License: Apache License
Doug Cutting has many years of research and practical experience in full-text search. Some of his papers and patents are listed at http://lucene.sourceforge.net/publications.html.
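Lucene is a full-text search engine that implements the functionality needed for the four steps described above well, but it is not yet a complete product: it does not provide an index management interface, parsers for source documents, and so on.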
Lucene has been under development for many years and continues to evolve.
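Nutch is a full-text search product whose core is still Lucene, but it adds a large set of peripheral features such as parsers for various file formats. Nutch only recently graduated from the Apache Incubator, so its stability remains to be seen.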
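Lucene encapsulates the segmentation logic in an object called an Analyzer. Early versions of Lucene shipped without a Chinese segmentation implementation, but the official code repository now contains two analyzers that can handle Chinese: org.apache.lucene.analysis.cn.ChineseAnalyzer and org.apache.lucene.analysis.cjk.CJKAnalyzer. The difference is that ChineseAnalyzer turns each Chinese character into its own token (uni-gram), while CJKAnalyzer turns every two adjacent characters into a token (bi-gram). The former was developed by yysun, founder of the Delphibbs forum (大富翁论坛, http://www.delphibbs.com/); the latter was written by che dong (quite a few people seem to share this name, and it is unclear which one this is). From version 1.4 onward, StandardAnalyzer also handles double-byte characters correctly, and in practice it likewise splits text into single characters.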
Source code:
http://svn.apache.org/viewvc/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/
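Resources for dictionary-based segmentation are also available, for example Erik Peterson's segmenter (http://www.mandarintools.com/segmenter.html) and ICTCLAS from the Institute of Computing Technology, Chinese Academy of Sciences (http://www.nlp.org.cn/project/project.php?proj_id=6).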
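As a rough illustration of dictionary-based segmentation, the sketch below uses forward maximum matching, one common lookup strategy. It is an assumption chosen for this example and is not taken from the tools listed above. The function name `forward_max_match`, the tiny dictionary and the `max_len` parameter are all hypothetical, and word frequencies and part-of-speech information are ignored.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by always taking the longest dictionary word at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is accepted even when it is not in the dictionary,
            # so the loop always advances.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
words = {"全文", "检索", "系统"}
print("/".join(forward_max_match("全文检索的系统", words)))  # 全文/检索/的/系统
```

Unlike fixed-length splitting, the tokens here do not overlap, and the quality of the result depends entirely on how complete the dictionary is.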
Lucene itself is written in Java. There is now an official Lucene4C project that supports C/C++, and a .NET (C#) implementation called dotLucene or Lucene.net (http://www.dotlucene.net/).
Author: XWare http://www.astronet.ru/xware/
License: GPL v2
This is an engine built on PostgreSQL's GiST feature. It is implemented in Perl/Tcl and also provides a Python interface.
2.3 Commercial products
ICTCLAS http://www.nlp.org.cn/project/project.php?proj_id=6
Hylanda (海量) http://www.hylanda.com/
TRS http://www.trs.com.cn/
dtSearch http://www.dtsearch.com/
FindinSite http://www.phdcc.com/fis/
2.4 Lesser-known ones
Java (quoted from
http://www.manageability.org/blog/stuff/full-text-lucene-jxta-search-engine-java-xml/view
)
- Egothor - Impressive demo is worth a look. Key features include:
HTML, PDF, PS, and Microsoft's DOC and XLS indexing; Golomb,
Elias-Gamma and Block coding; Universal stemmer that can process almost
any language; Boolean model and Vector model.
- Carrot2 - Carrot2 is a research framework for experimenting with
automated querying of various data sources (such as search engines),
processing search results and their visualization.
- BDDBot - BDDBot is a web robot, search engine, and web server written
entirely in Java. It was written by Tim Macinta for his book
(co-authored with Wes Sonnenreich), a Web Developer's Guide to Search
Engines. It was written as an example for a chapter on how to write
your search engines, and as such it is very simplistic.
- MG4J - MG4J lets you build compressed full-text indices for large
collections of documents using sophisticated techniques such as
interpolative coding. Moreover, it provides utility classes that are
essential in any serious text-processing activity.
- eXist - Primarily designed as an XML database; however, it includes an
inverted index that speeds up XPath-based queries. The author describes
it as follows: "Indexing in eXist is based on a numbering scheme which
supports quick identification of structural relationships between
nodes, such as parent-child, ancestor-descendant or
previous-/next-sibling. This way, a wide range of common path
expressions is processed only using indexing information".
- XQEngine - A full-text search engine for XML documents. Utilizes
XQuery as its front-end query language. XPath expressions let you
specify constraints on attributes and element hierarchies, in addition
to the specific word content.
- Zilverline - Search a collection of files and directories in a
directory. PDF, Word, txt, java, CHM and HTML are supported, as well as
zip and rar files. Search results can be retrieved from local disk or
remotely, if you run a webserver on your machine. Files inside zip, rar
and chm files are extracted, indexed and can be cached. The cache can
be mapped to sit behind your webserver as well.
- Red Piranha - Red Piranha combines Lucene (Searching Ability), XML-RDF
(ability to learn), Tomcat (for P2P Power) and Spring (Ease of use) to
not only let you find anything, anywhere, but to actually understand
what you are looking for.