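As for the comparison with Lucene: Sphinx offers the following ranking (weight computation) modes, which correspond to those in Lucene, and that PDF document already compares Sphinx against Lucene for each type: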
SPH_RANK_PROXIMITY_BM25, default ranking mode which uses and combines both phrase proximity and BM25 ranking.
SPH_RANK_BM25, statistical ranking mode which uses BM25 ranking only (similar to most other full-text engines). This mode is faster but may result in worse quality on queries which contain more than 1 keyword.
SPH_RANK_NONE, disabled ranking mode. This mode is the fastest. It is essentially equivalent to boolean searching. A weight of 1 is assigned to all matches.
SPH_RANK_WORDCOUNT, ranking by keyword occurrences count. This ranker computes the amount of per-field keyword occurrences, then multiplies the amounts by field weights, then sums the resulting values for the final result.
SPH_RANK_PROXIMITY, added in version 0.9.9, returns raw phrase proximity value as a result. This mode is internally used to emulate SPH_MATCH_ALL queries.
SPH_RANK_MATCHANY, added in version 0.9.9, returns rank as it was computed in SPH_MATCH_ANY mode earlier, and is internally used to emulate SPH_MATCH_ANY queries.
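To make the SPH_RANK_WORDCOUNT description above concrete, here is a minimal Python sketch of that computation: count keyword occurrences per field, multiply by the field weight, and sum. It is only an illustration of the arithmetic, not Sphinx's actual code; the function name wordcount_rank, the whitespace tokenization, and the sample weights are all assumptions made for this example.

```python
from collections import Counter


def wordcount_rank(doc_fields, keywords, field_weights):
    """Sum over fields of (keyword occurrences in the field) * (field weight).

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    which is far simpler than what a real engine does.
    """
    wanted = {k.lower() for k in keywords}
    total = 0
    for field, text in doc_fields.items():
        counts = Counter(text.lower().split())
        hits = sum(counts[k] for k in wanted)
        # Fields without an explicit weight default to 1 (an assumption).
        total += hits * field_weights.get(field, 1)
    return total


doc = {
    "title": "sphinx search engine",
    "body": "sphinx is a full-text search engine and sphinx is fast",
}
weights = {"title": 10, "body": 1}
score = wordcount_rank(doc, ["sphinx", "search"], weights)
print(score)
```

With these sample weights a hit in the title counts ten times as much as a hit in the body, which is the only "ranking" this mode does: no proximity, no BM25.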
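ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System) is a Chinese lexical analysis system developed by the Institute of Computing Technology of the Chinese Academy of Sciences. It builds on years of research there and is based on a multi-layer hidden Markov model. Its main features are Chinese word segmentation, part-of-speech tagging, named entity recognition, and new word recognition, and it also supports user dictionaries. After five years of development and six kernel upgrades, ICTCLAS is now at version 3.0, with a segmentation accuracy of 98.45% and dictionary data that totals less than 3 MB when compressed. ICTCLAS took first place in an evaluation organized by the national 973 expert group in China and won several first places in the first evaluation organized by SIGHAN, the international Chinese language processing research body, and it is described as the best Chinese lexical analyzer in the world today.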
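2. Search engine architecture design approach: (1) Keep the calling convention as simple as possible: make life easy for front-end web engineers, so that one simple SQL statement, "SELECT ... FROM myisam_table JOIN sphinx_table ON (sphinx_table.sphinx_id=myisam_table.id) WHERE query='...';", is enough to run an efficient search.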
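② Tests by the expert team behind the book High Performance MySQL show that primary-key lookups of the form "SELECT ... FROM ... WHERE id = ..." (where id is the PRIMARY KEY) can handle more than 10,000 queries per second, whereas ordinary SELECT queries manage only tens to hundreds per second: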
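Under highly concurrent connections and with a large number of rows, full-text search in MySQL via SELECT ... WHERE ... LIKE '%...%' is not only slow: a pattern that begins with a wildcard (% or _) cannot use an index and forces a full table scan, which puts heavy load on the database. MySQL provides a full-text index to address this, which both improves performance and efficiency (because MySQL indexes these columns to optimize the search) and delivers higher-quality search results. To date, however, MySQL's full-text index does not handle Chinese correctly.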
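The point about leading wildcards can be observed with a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in, purely so that it runs without a server; SQLite is not MySQL, and its planner differs, but it shows the same general effect: a LIKE pattern that starts with % falls back to a scan, while an equality lookup on the indexed column uses the index. The table name docs, the index name idx_title, and the sample rows are made up for this example.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.execute("CREATE INDEX idx_title ON docs (title)")
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [(f"title {i}", f"body text {i}") for i in range(1000)],
)


def plan(sql):
    """Return the human-readable detail strings of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]


# Leading wildcard: the index on title cannot narrow the search.
like_plan = plan("SELECT * FROM docs WHERE title LIKE '%42%'")
# Equality on the indexed column: the planner can seek into the index.
exact_plan = plan("SELECT * FROM docs WHERE title = 'title 42'")

print(like_plan)
print(exact_plan)
```

The exact wording of the plan text varies between SQLite versions, but the first plan should report a scan and the second a search using idx_title. In MySQL the equivalent check would be running EXPLAIN on the two statements.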
ttserver cannot unserialize PHP content and does not support compression; those two points are also annoying.
If only it didn't have these problems.
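One person writes a record to the cache and the database and updates the index;
another person reads the record from the cache or the database by way of the index.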
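Search engine cache hit rates are generally a little above 60%, and the memory used for the indexes runs to hundreds and hundreds of GB.
Yours only responds to incremental additions; it looks like deletions can't update the index, right?
Still, I have to say yours is pretty impressive too :)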
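So, relatively speaking, this index setup of mine supports 100 million indexed records on a single machine and reaches the query speed of Google/Baidu/Sogou, which is not bad.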
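Incremental indexing handles additions and updates to the index. Deleting from the index is even simpler, because Sphinx supports attribute flags: if the is_delete attribute is 0 in the normal state, a deletion just sets is_delete to 1. Attribute updates happen in memory and are written to disk automatically when Sphinx stops, so they are very fast, and deletion from the index is effectively real-time. When the indexes are merged, the --merge-dst-range parameter excludes the entries that have been flagged as deleted.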
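The flag-then-filter idea described above can be modeled in a few lines of Python. This is a toy in-memory model under stated assumptions, not the Sphinx API: the dictionaries standing in for the main and delta indexes, and the helper names mark_deleted, search, and merge, are all hypothetical. The keep_range argument plays the role that --merge-dst-range plays in the text, keeping only destination documents whose is_delete value falls inside the range.

```python
def mark_deleted(index, doc_id):
    """Flag a document as deleted by updating its attribute in place."""
    index[doc_id]["is_delete"] = 1


def search(index, keyword):
    """Return ids of live (unflagged) documents whose text contains keyword."""
    return sorted(
        doc_id
        for doc_id, doc in index.items()
        if doc["is_delete"] == 0 and keyword in doc["text"]
    )


def merge(main, delta, keep_range=(0, 0)):
    """Merge delta into main, keeping only main docs with is_delete in range."""
    low, high = keep_range
    merged = {
        doc_id: doc
        for doc_id, doc in main.items()
        if low <= doc["is_delete"] <= high
    }
    merged.update(delta)  # delta entries add new docs or replace updated ones
    return merged


main = {
    1: {"text": "sphinx search", "is_delete": 0},
    2: {"text": "mysql search", "is_delete": 0},
}
delta = {3: {"text": "lucene search", "is_delete": 0}}

mark_deleted(main, 2)            # "delete" is only an attribute update
live = search(main, "search")    # flagged doc is hidden immediately
merged = merge(main, delta)      # flagged doc is physically dropped on merge
print(live, sorted(merged))
```

The design point is that the delete itself never touches the index structure; the flagged document only disappears for good at the next merge.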
I've found that anything containing [ ] brackets seriously interferes with the search results.