Full-text search with django-haystack + jieba + whoosh

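Overview of the full-text search components

  1. What is haystack?

    1. haystack (django-haystack) is an open-source search framework for Django. It supports the Solr, Elasticsearch, Whoosh, and Xapian search engines behind one API, so you can switch engines through configuration instead of changing application code.

    2. The search engine used here is Whoosh, a full-text search engine written in pure Python with no binary dependencies. It is small and simple to configure, at the cost of somewhat lower performance.

    3. Chinese word segmentation is handled by jieba. Whoosh's built-in analyzer is designed for English and does not segment Chinese well, so jieba replaces Whoosh's analyzer.

  2. What is jieba?

    1. Many search engines handle Chinese poorly. jieba is a Chinese word segmenter: it splits Chinese text into words, which makes Chinese content searchable.

  3. What is Whoosh?

    1. Whoosh is a Python full-text search library: a set of classes and functions for indexing text and searching the index.

    2. Whoosh's built-in analyzer targets English and handles Chinese poorly, which is why jieba replaces it in this setup.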
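To see why an English-oriented tokenizer falls short on Chinese, here is a minimal standard-library sketch. The regex `\w+` is only a stand-in for a word-boundary tokenizer (an assumption for illustration, not Whoosh's actual analyzer), and the sample sentences are made up. English splits on spaces, while an unspaced Chinese sentence comes back as a single token, so a search for a word inside it cannot match. Splitting that sentence into words is the job a segmenter such as jieba takes over.

```python
import re


def naive_tokenize(text):
    """Split text into runs of word characters (stand-in for an English tokenizer)."""
    return re.findall(r"\w+", text)


english_tokens = naive_tokenize("I came to Tsinghua University")
chinese_tokens = naive_tokenize("我来到北京清华大学")

print(english_tokens)  # five separate words
print(chinese_tokens)  # one undivided token
```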

 

Configuring and using haystack (decoupled front end and back end)

Install the packages

pip install django-haystack
pip install whoosh
pip install jieba

Configure settings.py

    import os

    # Register the apps
    INSTALLED_APPS = [
        'django.contrib.admin',
        'django.contrib.auth',
        'django.contrib.contenttypes',
        'django.contrib.sessions',
        'django.contrib.messages',
        'django.contrib.staticfiles',
        # haystack must be listed before your own apps
        'haystack',
        'jsapp',  # jsapp is the app created for this project
    ]

    # Template path (only the DIRS entry is shown)
    TEMPLATES = [
        {
            'DIRS': [os.path.join(BASE_DIR, 'templates')],
        },
    ]

    # Full-text search framework (haystack) configuration
    HAYSTACK_CONNECTIONS = {
        'default': {
            # Use the Whoosh engine
            'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
            # 'ENGINE': 'jsapp.whoosh_cn_backend.WhooshEngine',  # whoosh_cn_backend is a renamed copy of haystack's whoosh_backend.py, modified to use jieba
            # Path of the index files
            'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
        }
    }

    # Update the index automatically whenever a model instance is saved or deleted
    HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
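Because the engine is just a dotted path inside `HAYSTACK_CONNECTIONS`, switching backends can be reduced to a single flag. The helper below is a hypothetical sketch: the function name and the `use_jieba` flag are not part of haystack, and `/srv/project` is a made-up base directory. The two engine paths are the ones used in this article.

```python
import os

# Engine paths taken from this article; the helper itself is hypothetical.
STOCK_ENGINE = "haystack.backends.whoosh_backend.WhooshEngine"
JIEBA_ENGINE = "jsapp.whoosh_cn_backend.WhooshEngine"


def build_haystack_connections(base_dir, use_jieba=False):
    """Return a HAYSTACK_CONNECTIONS dict for the stock or the jieba-enabled Whoosh backend."""
    return {
        "default": {
            "ENGINE": JIEBA_ENGINE if use_jieba else STOCK_ENGINE,
            "PATH": os.path.join(base_dir, "whoosh_index"),
        }
    }


# In settings.py this could read:
#   HAYSTACK_CONNECTIONS = build_haystack_connections(BASE_DIR, use_jieba=True)
connections = build_haystack_connections("/srv/project", use_jieba=True)
print(connections["default"]["ENGINE"])
```

After changing the engine, rebuild the index so that existing documents are tokenized by the new analyzer.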

Define the models (jsapp/models.py)

    from django.db import models


    class UserInfo(models.Model):
        name = models.CharField(max_length=254)
        age = models.IntegerField()


    class ArticlePost(models.Model):
        author = models.ForeignKey(UserInfo, on_delete=models.CASCADE)
        title = models.CharField(max_length=200)
        desc = models.SlugField(max_length=500)
        body = models.TextField()

Generating the index

 1) Create the index file in the app

  In the app's directory, create a file named search_indexes.py (here jsapp/search_indexes.py).

    from haystack import indexes

    from .models import ArticlePost


    # By convention the class is named <Model>Index (for a model GoodsInfo, GoodsInfoIndex),
    # but any name works.
    class ArticlePostIndex(indexes.SearchIndex, indexes.Indexable):
        # text is the index field.
        # document=True: haystack and the search engine use this field's content
        # as the primary content to search in.
        # use_template=True: the model fields that make up this field are listed
        # in a separate template file.
        text = indexes.CharField(document=True, use_template=True)

        def get_model(self):
            # Required override: return the model this index covers.
            return ArticlePost

        def index_queryset(self, using=None):
            # Whatever this returns is what gets indexed; here, every ArticlePost row.
            return self.get_model().objects.all()

2) Specify the index template file

# The template path must follow this convention: templates/search/indexes/<app name>/<lowercase model class name>_text.txt
# templates/search/indexes/jsapp/articlepost_text.txt
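The naming rule can be written as a small helper. This is a sketch with a hypothetical function name; it assumes the model name is lowercased and that the suffix is the name of the index field (`text` here), which matches the example path above.

```python
import posixpath


def index_template_path(app_label, model_name, field_name="text"):
    """Build templates/search/indexes/<app>/<model>_<field>.txt for an index field."""
    filename = "{}_{}.txt".format(model_name.lower(), field_name)
    return posixpath.join("templates", "search", "indexes", app_label, filename)


path = index_template_path("jsapp", "ArticlePost")
print(path)  # templates/search/indexes/jsapp/articlepost_text.txt
```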

 

templates/search/indexes/jsapp/articlepost_text.txt

    {{ object.title }}
    {{ object.author.name }}
    {{ object.body }}
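As a rough illustration of what ends up in the `text` field, the sketch below mimics the template with plain string formatting instead of Django's template engine. The `SimpleNamespace` objects and the sample values are made-up stand-ins for an `ArticlePost` row, and `render_index_document` is a hypothetical name.

```python
from types import SimpleNamespace

# Same three lines as articlepost_text.txt, written as a format string.
TEMPLATE = "{obj.title}\n{obj.author.name}\n{obj.body}\n"


def render_index_document(obj):
    """Return the text that would be indexed for one article."""
    return TEMPLATE.format(obj=obj)


article = SimpleNamespace(
    title="Full-text search in Django",
    author=SimpleNamespace(name="Alice"),
    body="haystack with Whoosh and jieba",
)
document = render_index_document(article)
print(document)
```

Only the fields named in the template become searchable through `text`; `desc`, for example, is not in the template, so it is not part of the indexed document.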

3) Build the index with the management command

python manage.py rebuild_index  # build the index files

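Switching to jieba word segmentation

  1) Copy the haystack backend source into the project and rename it

'''1. Copy the source file and rename it'''
Copy C:\python37\Lib\site-packages\haystack\backends\whoosh_backend.py into the project,
rename whoosh_backend.py to whoosh_cn_backend.py, and place it in the app, e.g. jsapp\whoosh_cn_backend.py

'''2. Edit the copied file'''
# After the last module-level import, import jieba's analyzer
from jieba.analyse import ChineseAnalyzer

# Switch to the Chinese analyzer
Find
analyzer=StemmingAnalyzer()
and change it to
analyzer=ChineseAnalyzer()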
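The two manual edits can also be expressed as a string transformation. The function below is a hypothetical sketch, not a haystack tool: it assumes the copied file still contains the line `from whoosh.analysis import StemmingAnalyzer`, places the jieba import right after it, and swaps only the `analyzer=StemmingAnalyzer()` keyword argument.

```python
IMPORT_ANCHOR = "from whoosh.analysis import StemmingAnalyzer"
JIEBA_IMPORT = "from jieba.analyse import ChineseAnalyzer"


def patch_backend_source(source):
    """Return backend source with ChineseAnalyzer swapped in as the schema analyzer."""
    if IMPORT_ANCHOR not in source:
        raise ValueError("unexpected backend source: Whoosh analyzer import not found")
    # Add the jieba import right after the existing analyzer import.
    patched = source.replace(IMPORT_ANCHOR, IMPORT_ANCHOR + "\n" + JIEBA_IMPORT, 1)
    # Swap the analyzer passed to TEXT fields in build_schema().
    return patched.replace("analyzer=StemmingAnalyzer()", "analyzer=ChineseAnalyzer()")


sample = (
    "from whoosh.analysis import StemmingAnalyzer\n"
    "field = TEXT(stored=True, analyzer=StemmingAnalyzer())\n"
)
patched_sample = patch_backend_source(sample)
print(patched_sample)
```

To apply it, read the installed whoosh_backend.py as text, pass it through the function, and write the result to jsapp/whoosh_cn_backend.py; the location of site-packages depends on your environment.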

whoosh_cn_backend.py

  1. # encoding: utf-8
  2. from __future__ import absolute_import, division, print_function, unicode_literals
  3. import json
  4. import os
  5. import re
  6. import shutil
  7. import threading
  8. import warnings
  9. from django.conf import settings
  10. from django.core.exceptions import ImproperlyConfigured
  11. from django.utils import six
  12. from django.utils.datetime_safe import datetime
  13. from django.utils.encoding import force_text
  14. from haystack.backends import BaseEngine, BaseSearchBackend, BaseSearchQuery, EmptyResults, log_query
  15. from haystack.constants import DJANGO_CT, DJANGO_ID, ID
  16. from haystack.exceptions import MissingDependency, SearchBackendError, SkipDocument
  17. from haystack.inputs import Clean, Exact, PythonData, Raw
  18. from haystack.models import SearchResult
  19. from haystack.utils import log as logging
  20. from haystack.utils import get_identifier, get_model_ct
  21. from haystack.utils.app_loading import haystack_get_model
  22. try:
  23. import whoosh
  24. except ImportError:
  25. raise MissingDependency(
  26. "The 'whoosh' backend requires the installation of 'Whoosh'. Please refer to the documentation.")
  27. # Handle minimum requirement.
  28. if not hasattr(whoosh, '__version__') or whoosh.__version__ < (2, 5, 0):
  29. raise MissingDependency("The 'whoosh' backend requires version 2.5.0 or greater.")
  30. # Bubble up the correct error.
  31. from whoosh import index
  32. from whoosh.analysis import StemmingAnalyzer
  33. from whoosh.fields import ID as WHOOSH_ID
  34. from whoosh.fields import BOOLEAN, DATETIME, IDLIST, KEYWORD, NGRAM, NGRAMWORDS, NUMERIC, Schema, TEXT
  35. from whoosh.filedb.filestore import FileStorage, RamStorage
  36. from whoosh.highlight import highlight as whoosh_highlight
  37. from whoosh.highlight import ContextFragmenter, HtmlFormatter
  38. from whoosh.qparser import QueryParser
  39. from whoosh.searching import ResultsPage
  40. from whoosh.writing import AsyncWriter
  41. DATETIME_REGEX = re.compile(
  42. '^(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})T(?P<hour>\d{2}):(?P<minute>\d{2}):(?P<second>\d{2})(\.\d{3,6}Z?)?$')
  43. LOCALS = threading.local()
  44. LOCALS.RAM_STORE = None
  45. class WhooshHtmlFormatter(HtmlFormatter):
  46. """
  47. This is a HtmlFormatter simpler than the whoosh.HtmlFormatter.
  48. We use it to have consistent results across backends. Specifically,
  49. Solr, Xapian and Elasticsearch are using this formatting.
  50. """
  51. template = '<%(tag)s>%(t)s</%(tag)s>'
  52. class WhooshSearchBackend(BaseSearchBackend):
  53. # Word reserved by Whoosh for special use.
  54. RESERVED_WORDS = (
  55. 'AND',
  56. 'NOT',
  57. 'OR',
  58. 'TO',
  59. )
  60. # Characters reserved by Whoosh for special use.
  61. # The '\\' must come first, so as not to overwrite the other slash replacements.
  62. RESERVED_CHARACTERS = (
  63. '\\', '+', '-', '&&', '||', '!', '(', ')', '{', '}',
  64. '[', ']', '^', '"', '~', '*', '?', ':', '.',
  65. )
  66. def __init__(self, connection_alias, **connection_options):
  67. super(WhooshSearchBackend, self).__init__(connection_alias, **connection_options)
  68. self.setup_complete = False
  69. self.use_file_storage = True
  70. self.post_limit = getattr(connection_options, 'POST_LIMIT', 128 * 1024 * 1024)
  71. self.path = connection_options.get('PATH')
  72. if connection_options.get('STORAGE', 'file') != 'file':
  73. self.use_file_storage = False
  74. if self.use_file_storage and not self.path:
  75. raise ImproperlyConfigured(
  76. "You must specify a 'PATH' in your settings for connection '%s'." % connection_alias)
  77. self.log = logging.getLogger('haystack')
  78. def setup(self):
  79. """
  80. Defers loading until needed.
  81. """
  82. from haystack import connections
  83. new_index = False
  84. # Make sure the index is there.
  85. if self.use_file_storage and not os.path.exists(self.path):
  86. os.makedirs(self.path)
  87. new_index = True
  88. if self.use_file_storage and not os.access(self.path, os.W_OK):
  89. raise IOError("The path to your Whoosh index '%s' is not writable for the current user/group." % self.path)
  90. if self.use_file_storage:
  91. self.storage = FileStorage(self.path)
  92. else:
  93. global LOCALS
  94. if getattr(LOCALS, 'RAM_STORE', None) is None:
  95. LOCALS.RAM_STORE = RamStorage()
  96. self.storage = LOCALS.RAM_STORE
  97. self.content_field_name, self.schema = self.build_schema(
  98. connections[self.connection_alias].get_unified_index().all_searchfields())
  99. self.parser = QueryParser(self.content_field_name, schema=self.schema)
  100. if new_index is True:
  101. self.index = self.storage.create_index(self.schema)
  102. else:
  103. try:
  104. self.index = self.storage.open_index(schema=self.schema)
  105. except index.EmptyIndexError:
  106. self.index = self.storage.create_index(self.schema)
  107. self.setup_complete = True
  108. def build_schema(self, fields):
  109. schema_fields = {
  110. ID: WHOOSH_ID(stored=True, unique=True),
  111. DJANGO_CT: WHOOSH_ID(stored=True),
  112. DJANGO_ID: WHOOSH_ID(stored=True),
  113. }
  114. # Grab the number of keys that are hard-coded into Haystack.
  115. # We'll use this to (possibly) fail slightly more gracefully later.
  116. initial_key_count = len(schema_fields)
  117. content_field_name = ''
  118. for field_name, field_class in fields.items():
  119. if field_class.is_multivalued:
  120. if field_class.indexed is False:
  121. schema_fields[field_class.index_fieldname] = IDLIST(stored=True, field_boost=field_class.boost)
  122. else:
  123. schema_fields[field_class.index_fieldname] = KEYWORD(stored=True, commas=True, scorable=True,
  124. field_boost=field_class.boost)
  125. elif field_class.field_type in ['date', 'datetime']:
  126. schema_fields[field_class.index_fieldname] = DATETIME(stored=field_class.stored, sortable=True)
  127. elif field_class.field_type == 'integer':
  128. schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, numtype=int,
  129. field_boost=field_class.boost)
  130. elif field_class.field_type == 'float':
  131. schema_fields[field_class.index_fieldname] = NUMERIC(stored=field_class.stored, numtype=float,
  132. field_boost=field_class.boost)
  133. elif field_class.field_type == 'boolean':
  134. # Field boost isn't supported on BOOLEAN as of 1.8.2.
  135. schema_fields[field_class.index_fieldname] = BOOLEAN(stored=field_class.stored)
  136. elif field_class.field_type == 'ngram':
  137. schema_fields[field_class.index_fieldname] = NGRAM(minsize=3, maxsize=15, stored=field_class.stored,
  138. field_boost=field_class.boost)
  139. elif field_class.field_type == 'edge_ngram':
  140. schema_fields[field_class.index_fieldname] = NGRAMWORDS(minsize=2, maxsize=15, at='start',
  141. stored=field_class.stored,
  142. field_boost=field_class.boost)
  143. else:
  144. schema_fields[field_class.index_fieldname] = TEXT(stored=True, analyzer=ChineseAnalyzer(),
  145. field_boost=field_class.boost, sortable=True)
  146. if field_class.document is True:
  147. content_field_name = field_class.index_fieldname
  148. schema_fields[field_class.index_fieldname].spelling = True
  149. # Fail more gracefully than relying on the backend to die if no fields
  150. # are found.
  151. if len(schema_fields) <= initial_key_count:
  152. raise SearchBackendError(
  153. "No fields were found in any search_indexes. Please correct this before attempting to search.")
  154. return (content_field_name, Schema(**schema_fields))
  155. def update(self, index, iterable, commit=True):
  156. if not self.setup_complete:
  157. self.setup()
  158. self.index = self.index.refresh()
  159. writer = AsyncWriter(self.index)
  160. for obj in iterable:
  161. try:
  162. doc = index.full_prepare(obj)
  163. except SkipDocument:
  164. self.log.debug(u"Indexing for object `%s` skipped", obj)
  165. else:
  166. # Really make sure it's unicode, because Whoosh won't have it any
  167. # other way.
  168. for key in doc:
  169. doc[key] = self._from_python(doc[key])
  170. # Document boosts aren't supported in Whoosh 2.5.0+.
  171. if 'boost' in doc:
  172. del doc['boost']
  173. try:
  174. writer.update_document(**doc)
  175. except Exception as e:
  176. if not self.silently_fail:
  177. raise
  178. # We'll log the object identifier but won't include the actual object
  179. # to avoid the possibility of that generating encoding errors while
  180. # processing the log message:
  181. self.log.error(u"%s while preparing object for update" % e.__class__.__name__,
  182. exc_info=True, extra={"data": {"index": index,
  183. "object": get_identifier(obj)}})
  184. if len(iterable) > 0:
  185. # For now, commit no matter what, as we run into locking issues otherwise.
  186. writer.commit()
  187. def remove(self, obj_or_string, commit=True):
  188. if not self.setup_complete:
  189. self.setup()
  190. self.index = self.index.refresh()
  191. whoosh_id = get_identifier(obj_or_string)
  192. try:
  193. self.index.delete_by_query(q=self.parser.parse(u'%s:"%s"' % (ID, whoosh_id)))
  194. except Exception as e:
  195. if not self.silently_fail:
  196. raise
  197. self.log.error("Failed to remove document '%s' from Whoosh: %s", whoosh_id, e, exc_info=True)
  198. def clear(self, models=None, commit=True):
  199. if not self.setup_complete:
  200. self.setup()
  201. self.index = self.index.refresh()
  202. if models is not None:
  203. assert isinstance(models, (list, tuple))
  204. try:
  205. if models is None:
  206. self.delete_index()
  207. else:
  208. models_to_delete = []
  209. for model in models:
  210. models_to_delete.append(u"%s:%s" % (DJANGO_CT, get_model_ct(model)))
  211. self.index.delete_by_query(q=self.parser.parse(u" OR ".join(models_to_delete)))
  212. except Exception as e:
  213. if not self.silently_fail:
  214. raise
  215. if models is not None:
  216. self.log.error("Failed to clear Whoosh index of models '%s': %s", ','.join(models_to_delete),
  217. e, exc_info=True)
  218. else:
  219. self.log.error("Failed to clear Whoosh index: %s", e, exc_info=True)
  220. def delete_index(self):
  221. # Per the Whoosh mailing list, if wiping out everything from the index,
  222. # it's much more efficient to simply delete the index files.
  223. if self.use_file_storage and os.path.exists(self.path):
  224. shutil.rmtree(self.path)
  225. elif not self.use_file_storage:
  226. self.storage.clean()
  227. # Recreate everything.
  228. self.setup()
  229. def optimize(self):
  230. if not self.setup_complete:
  231. self.setup()
  232. self.index = self.index.refresh()
  233. self.index.optimize()
  234. def calculate_page(self, start_offset=0, end_offset=None):
  235. # Prevent against Whoosh throwing an error. Requires an end_offset
  236. # greater than 0.
  237. if end_offset is not None and end_offset <= 0:
  238. end_offset = 1
  239. # Determine the page.
  240. page_num = 0
  241. if end_offset is None:
  242. end_offset = 1000000
  243. if start_offset is None:
  244. start_offset = 0
  245. page_length = end_offset - start_offset
  246. if page_length and page_length > 0:
  247. page_num = int(start_offset / page_length)
  248. # Increment because Whoosh uses 1-based page numbers.
  249. page_num += 1
  250. return page_num, page_length
  251. @log_query
  252. def search(self, query_string, sort_by=None, start_offset=0, end_offset=None,
  253. fields='', highlight=False, facets=None, date_facets=None, query_facets=None,
  254. narrow_queries=None, spelling_query=None, within=None,
  255. dwithin=None, distance_point=None, models=None,
  256. limit_to_registered_models=None, result_class=None, **kwargs):
  257. if not self.setup_complete:
  258. self.setup()
  259. # A zero length query should return no results.
  260. if len(query_string) == 0:
  261. return {
  262. 'results': [],
  263. 'hits': 0,
  264. }
  265. query_string = force_text(query_string)
  266. # A one-character query (non-wildcard) gets nabbed by a stopwords
  267. # filter and should yield zero results.
  268. if len(query_string) <= 1 and query_string != u'*':
  269. return {
  270. 'results': [],
  271. 'hits': 0,
  272. }
  273. reverse = False
  274. if sort_by is not None:
  275. # Determine if we need to reverse the results and if Whoosh can
  276. # handle what it's being asked to sort by. Reversing is an
  277. # all-or-nothing action, unfortunately.
  278. sort_by_list = []
  279. reverse_counter = 0
  280. for order_by in sort_by:
  281. if order_by.startswith('-'):
  282. reverse_counter += 1
  283. if reverse_counter and reverse_counter != len(sort_by):
  284. raise SearchBackendError("Whoosh requires all order_by fields"
  285. " to use the same sort direction")
  286. for order_by in sort_by:
  287. if order_by.startswith('-'):
  288. sort_by_list.append(order_by[1:])
  289. if len(sort_by_list) == 1:
  290. reverse = True
  291. else:
  292. sort_by_list.append(order_by)
  293. if len(sort_by_list) == 1:
  294. reverse = False
  295. sort_by = sort_by_list
  296. if facets is not None:
  297. warnings.warn("Whoosh does not handle faceting.", Warning, stacklevel=2)
  298. if date_facets is not None:
  299. warnings.warn("Whoosh does not handle date faceting.", Warning, stacklevel=2)
  300. if query_facets is not None:
  301. warnings.warn("Whoosh does not handle query faceting.", Warning, stacklevel=2)
  302. narrowed_results = None
  303. self.index = self.index.refresh()
  304. if limit_to_registered_models is None:
  305. limit_to_registered_models = getattr(settings, 'HAYSTACK_LIMIT_TO_REGISTERED_MODELS', True)
  306. if models and len(models):
  307. model_choices = sorted(get_model_ct(model) for model in models)
  308. elif limit_to_registered_models:
  309. # Using narrow queries, limit the results to only models handled
  310. # with the current routers.
  311. model_choices = self.build_models_list()
  312. else:
  313. model_choices = []
  314. if len(model_choices) > 0:
  315. if narrow_queries is None:
  316. narrow_queries = set()
  317. narrow_queries.add(' OR '.join(['%s:%s' % (DJANGO_CT, rm) for rm in model_choices]))
  318. narrow_searcher = None
  319. if narrow_queries is not None:
  320. # Potentially expensive? I don't see another way to do it in Whoosh...
  321. narrow_searcher = self.index.searcher()
  322. for nq in narrow_queries:
  323. recent_narrowed_results = narrow_searcher.search(self.parser.parse(force_text(nq)),
  324. limit=None)
  325. if len(recent_narrowed_results) <= 0:
  326. return {
  327. 'results': [],
  328. 'hits': 0,
  329. }
  330. if narrowed_results:
  331. narrowed_results.filter(recent_narrowed_results)
  332. else:
  333. narrowed_results = recent_narrowed_results
  334. self.index = self.index.refresh()
  335. if self.index.doc_count():
  336. searcher = self.index.searcher()
  337. parsed_query = self.parser.parse(query_string)
  338. # In the event of an invalid/stopworded query, recover gracefully.
  339. if parsed_query is None:
  340. return {
  341. 'results': [],
  342. 'hits': 0,
  343. }
  344. page_num, page_length = self.calculate_page(start_offset, end_offset)
  345. search_kwargs = {
  346. 'pagelen': page_length,
  347. 'sortedby': sort_by,
  348. 'reverse': reverse,
  349. }
  350. # Handle the case where the results have been narrowed.
  351. if narrowed_results is not None:
  352. search_kwargs['filter'] = narrowed_results
  353. try:
  354. raw_page = searcher.search_page(
  355. parsed_query,
  356. page_num,
  357. **search_kwargs
  358. )
  359. except ValueError:
  360. if not self.silently_fail:
  361. raise
  362. return {
  363. 'results': [],
  364. 'hits': 0,
  365. 'spelling_suggestion': None,
  366. }
  367. # Because as of Whoosh 2.5.1, it will return the wrong page of
  368. # results if you request something too high. :(
  369. if raw_page.pagenum < page_num:
  370. return {
  371. 'results': [],
  372. 'hits': 0,
  373. 'spelling_suggestion': None,
  374. }
  375. results = self._process_results(raw_page, highlight=highlight, query_string=query_string,
  376. spelling_query=spelling_query, result_class=result_class)
  377. searcher.close()
  378. if hasattr(narrow_searcher, 'close'):
  379. narrow_searcher.close()
  380. return results
  381. else:
  382. if self.include_spelling:
  383. if spelling_query:
  384. spelling_suggestion = self.create_spelling_suggestion(spelling_query)
  385. else:
  386. spelling_suggestion = self.create_spelling_suggestion(query_string)
  387. else:
  388. spelling_suggestion = None
  389. return {
  390. 'results': [],
  391. 'hits': 0,
  392. 'spelling_suggestion': spelling_suggestion,
  393. }
  394. def more_like_this(self, model_instance, additional_query_string=None,
  395. start_offset=0, end_offset=None, models=None,
  396. limit_to_registered_models=None, result_class=None, **kwargs):
  397. if not self.setup_complete:
  398. self.setup()
  399. field_name = self.content_field_name
  400. narrow_queries = set()
  401. narrowed_results = None
  402. self.index = self.index.refresh()
  403. if limit_to_registered_models is None:
  404. limit_to_registered_models = getattr(settings, 'HAYSTACK_LIMIT_TO_REGISTERED_MODELS', True)
  405. if models and len(models):
  406. model_choices = sorted(get_model_ct(model) for model in models)
  407. elif limit_to_registered_models:
  408. # Using narrow queries, limit the results to only models handled
  409. # with the current routers.
  410. model_choices = self.build_models_list()
  411. else:
  412. model_choices = []
  413. if len(model_choices) > 0:
  414. if narrow_queries is None:
  415. narrow_queries = set()
  416. narrow_queries.add(' OR '.join(['%s:%s' % (DJANGO_CT, rm) for rm in model_choices]))
  417. if additional_query_string and additional_query_string != '*':
  418. narrow_queries.add(additional_query_string)
  419. narrow_searcher = None
  420. if narrow_queries is not None:
  421. # Potentially expensive? I don't see another way to do it in Whoosh...
  422. narrow_searcher = self.index.searcher()
  423. for nq in narrow_queries:
  424. recent_narrowed_results = narrow_searcher.search(self.parser.parse(force_text(nq)),
  425. limit=None)
  426. if len(recent_narrowed_results) <= 0:
  427. return {
  428. 'results': [],
  429. 'hits': 0,
  430. }
  431. if narrowed_results:
  432. narrowed_results.filter(recent_narrowed_results)
  433. else:
  434. narrowed_results = recent_narrowed_results
  435. page_num, page_length = self.calculate_page(start_offset, end_offset)
  436. self.index = self.index.refresh()
  437. raw_results = EmptyResults()
  438. searcher = None
  439. if self.index.doc_count():
  440. query = "%s:%s" % (ID, get_identifier(model_instance))
  441. searcher = self.index.searcher()
  442. parsed_query = self.parser.parse(query)
  443. results = searcher.search(parsed_query)
  444. if len(results):
  445. raw_results = results[0].more_like_this(field_name, top=end_offset)
  446. # Handle the case where the results have been narrowed.
  447. if narrowed_results is not None and hasattr(raw_results, 'filter'):
  448. raw_results.filter(narrowed_results)
  449. try:
  450. raw_page = ResultsPage(raw_results, page_num, page_length)
  451. except ValueError:
  452. if not self.silently_fail:
  453. raise
  454. return {
  455. 'results': [],
  456. 'hits': 0,
  457. 'spelling_suggestion': None,
  458. }
  459. # Because as of Whoosh 2.5.1, it will return the wrong page of
  460. # results if you request something too high. :(
  461. if raw_page.pagenum < page_num:
  462. return {
  463. 'results': [],
  464. 'hits': 0,
  465. 'spelling_suggestion': None,
  466. }
  467. results = self._process_results(raw_page, result_class=result_class)
  468. if searcher:
  469. searcher.close()
  470. if hasattr(narrow_searcher, 'close'):
  471. narrow_searcher.close()
  472. return results
  473. def _process_results(self, raw_page, highlight=False, query_string='', spelling_query=None, result_class=None):
  474. from haystack import connections
  475. results = []
  476. # It's important to grab the hits first before slicing. Otherwise, this
  477. # can cause pagination failures.
  478. hits = len(raw_page)
  479. if result_class is None:
  480. result_class = SearchResult
  481. facets = {}
  482. spelling_suggestion = None
  483. unified_index = connections[self.connection_alias].get_unified_index()
  484. indexed_models = unified_index.get_indexed_models()
  485. for doc_offset, raw_result in enumerate(raw_page):
  486. score = raw_page.score(doc_offset) or 0
  487. app_label, model_name = raw_result[DJANGO_CT].split('.')
  488. additional_fields = {}
  489. model = haystack_get_model(app_label, model_name)
  490. if model and model in indexed_models:
  491. for key, value in raw_result.items():
  492. index = unified_index.get_index(model)
  493. string_key = str(key)
  494. if string_key in index.fields and hasattr(index.fields[string_key], 'convert'):
  495. # Special-cased due to the nature of KEYWORD fields.
  496. if index.fields[string_key].is_multivalued:
  497. if value is None or len(value) is 0:
  498. additional_fields[string_key] = []
  499. else:
  500. additional_fields[string_key] = value.split(',')
  501. else:
  502. additional_fields[string_key] = index.fields[string_key].convert(value)
  503. else:
  504. additional_fields[string_key] = self._to_python(value)
  505. del (additional_fields[DJANGO_CT])
  506. del (additional_fields[DJANGO_ID])
  507. if highlight:
  508. sa = StemmingAnalyzer()
  509. formatter = WhooshHtmlFormatter('em')
  510. terms = [token.text for token in sa(query_string)]
  511. whoosh_result = whoosh_highlight(
  512. additional_fields.get(self.content_field_name),
  513. terms,
  514. sa,
  515. ContextFragmenter(),
  516. formatter
  517. )
  518. additional_fields['highlighted'] = {
  519. self.content_field_name: [whoosh_result],
  520. }
  521. result = result_class(app_label, model_name, raw_result[DJANGO_ID], score, **additional_fields)
  522. results.append(result)
  523. else:
  524. hits -= 1
  525. if self.include_spelling:
  526. if spelling_query:
  527. spelling_suggestion = self.create_spelling_suggestion(spelling_query)
  528. else:
  529. spelling_suggestion = self.create_spelling_suggestion(query_string)
  530. return {
  531. 'results': results,
  532. 'hits': hits,
  533. 'facets': facets,
  534. 'spelling_suggestion': spelling_suggestion,
  535. }
  536. def create_spelling_suggestion(self, query_string):
  537. spelling_suggestion = None
  538. reader = self.index.reader()
  539. corrector = reader.corrector(self.content_field_name)
  540. cleaned_query = force_text(query_string)
  541. if not query_string:
  542. return spelling_suggestion
  543. # Clean the string.
  544. for rev_word in self.RESERVED_WORDS:
  545. cleaned_query = cleaned_query.replace(rev_word, '')
  546. for rev_char in self.RESERVED_CHARACTERS:
  547. cleaned_query = cleaned_query.replace(rev_char, '')
  548. # Break it down.
  549. query_words = cleaned_query.split()
  550. suggested_words = []
  551. for word in query_words:
  552. suggestions = corrector.suggest(word, limit=1)
  553. if len(suggestions) > 0:
  554. suggested_words.append(suggestions[0])
  555. spelling_suggestion = ' '.join(suggested_words)
  556. return spelling_suggestion
  557. def _from_python(self, value):
  558. """
  559. Converts Python values to a string for Whoosh.
  560. Code courtesy of pysolr.
  561. """
  562. if hasattr(value, 'strftime'):
  563. if not hasattr(value, 'hour'):
  564. value = datetime(value.year, value.month, value.day, 0, 0, 0)
  565. elif isinstance(value, bool):
  566. if value:
  567. value = 'true'
  568. else:
  569. value = 'false'
  570. elif isinstance(value, (list, tuple)):
  571. value = u','.join([force_text(v) for v in value])
  572. elif isinstance(value, (six.integer_types, float)):
  573. # Leave it alone.
  574. pass
  575. else:
  576. value = force_text(value)
  577. return value
  578. def _to_python(self, value):
  579. """
  580. Converts values from Whoosh to native Python values.
  581. A port of the same method in pysolr, as they deal with data the same way.
  582. """
  583. if value == 'true':
  584. return True
  585. elif value == 'false':
  586. return False
  587. if value and isinstance(value, six.string_types):
  588. possible_datetime = DATETIME_REGEX.search(value)
  589. if possible_datetime:
  590. date_values = possible_datetime.groupdict()
  591. for dk, dv in date_values.items():
  592. date_values[dk] = int(dv)
  593. return datetime(date_values['year'], date_values['month'], date_values['day'], date_values['hour'],
  594. date_values['minute'], date_values['second'])
  595. try:
  596. # Attempt to use json to load the values.
  597. converted_value = json.loads(value)
  598. # Try to handle most built-in types.
  599. if isinstance(converted_value, (list, tuple, set, dict, six.integer_types, float, complex)):
  600. return converted_value
  601. except:
  602. # If it fails (SyntaxError or its ilk) or we don't trust it,
  603. # continue on.
  604. pass
  605. return value
  606. class WhooshSearchQuery(BaseSearchQuery):
  607. def _convert_datetime(self, date):
  608. if hasattr(date, 'hour'):
  609. return force_text(date.strftime('%Y%m%d%H%M%S'))
  610. else:
  611. return force_text(date.strftime('%Y%m%d000000'))
  612. def clean(self, query_fragment):
  613. """
  614. Provides a mechanism for sanitizing user input before presenting the
  615. value to the backend.
  616. Whoosh 1.X differs here in that you can no longer use a backslash
  617. to escape reserved characters. Instead, the whole word should be
  618. quoted.
  619. """
  620. words = query_fragment.split()
  621. cleaned_words = []
  622. for word in words:
  623. if word in self.backend.RESERVED_WORDS:
  624. word = word.replace(word, word.lower())
  625. for char in self.backend.RESERVED_CHARACTERS:
  626. if char in word:
  627. word = "'%s'" % word
  628. break
  629. cleaned_words.append(word)
  630. return ' '.join(cleaned_words)
  631. def build_query_fragment(self, field, filter_type, value):
  632. from haystack import connections
  633. query_frag = ''
  634. is_datetime = False
  635. if not hasattr(value, 'input_type_name'):
  636. # Handle when we've got a ``ValuesListQuerySet``...
  637. if hasattr(value, 'values_list'):
  638. value = list(value)
  639. if hasattr(value, 'strftime'):
                is_datetime = True

            if isinstance(value, six.string_types) and value != ' ':
                # It's not an ``InputType``. Assume ``Clean``.
                value = Clean(value)
            else:
                value = PythonData(value)

        # Prepare the query using the InputType.
        prepared_value = value.prepare(self)

        if not isinstance(prepared_value, (set, list, tuple)):
            # Then convert whatever we get back to what pysolr wants if needed.
            prepared_value = self.backend._from_python(prepared_value)

        # 'content' is a special reserved word, much like 'pk' in
        # Django's ORM layer. It indicates 'no special field'.
        if field == 'content':
            index_fieldname = ''
        else:
            index_fieldname = u'%s:' % connections[self._using].get_unified_index().get_index_fieldname(field)

        filter_types = {
            'content': '%s',
            'contains': '*%s*',
            'endswith': "*%s",
            'startswith': "%s*",
            'exact': '%s',
            'gt': "{%s to}",
            'gte': "[%s to]",
            'lt': "{to %s}",
            'lte': "[to %s]",
            'fuzzy': u'%s~',
        }

        if value.post_process is False:
            query_frag = prepared_value
        else:
            if filter_type in ['content', 'contains', 'startswith', 'endswith', 'fuzzy']:
                if value.input_type_name == 'exact':
                    query_frag = prepared_value
                else:
                    # Iterate over terms & incorporate the converted form of each into the query.
                    terms = []

                    if isinstance(prepared_value, six.string_types):
                        possible_values = prepared_value.split(' ')
                    else:
                        if is_datetime is True:
                            prepared_value = self._convert_datetime(prepared_value)
                        possible_values = [prepared_value]

                    for possible_value in possible_values:
                        terms.append(filter_types[filter_type] % self.backend._from_python(possible_value))

                    if len(terms) == 1:
                        query_frag = terms[0]
                    else:
                        query_frag = u"(%s)" % " AND ".join(terms)
            elif filter_type == 'in':
                in_options = []

                for possible_value in prepared_value:
                    is_datetime = False
                    if hasattr(possible_value, 'strftime'):
                        is_datetime = True

                    pv = self.backend._from_python(possible_value)
                    if is_datetime is True:
                        pv = self._convert_datetime(pv)

                    if isinstance(pv, six.string_types) and not is_datetime:
                        in_options.append('"%s"' % pv)
                    else:
                        in_options.append('%s' % pv)

                query_frag = "(%s)" % " OR ".join(in_options)
            elif filter_type == 'range':
                start = self.backend._from_python(prepared_value[0])
                end = self.backend._from_python(prepared_value[1])

                if hasattr(prepared_value[0], 'strftime'):
                    start = self._convert_datetime(start)
                if hasattr(prepared_value[1], 'strftime'):
                    end = self._convert_datetime(end)

                query_frag = u"[%s to %s]" % (start, end)
            elif filter_type == 'exact':
                if value.input_type_name == 'exact':
                    query_frag = prepared_value
                else:
                    prepared_value = Exact(prepared_value).prepare(self)
                    query_frag = filter_types[filter_type] % prepared_value
            else:
                if is_datetime is True:
                    prepared_value = self._convert_datetime(prepared_value)
                query_frag = filter_types[filter_type] % prepared_value

        if len(query_frag) and not isinstance(value, Raw):
            if not query_frag.startswith('(') and not query_frag.endswith(')'):
                query_frag = "(%s)" % query_frag

        return u"%s%s" % (index_fieldname, query_frag)


class WhooshEngine(BaseEngine):
    backend = WhooshSearchBackend
    query = WhooshSearchQuery
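
To make the tail of `build_query_fragment` easier to follow, here is a simplified, standalone sketch of how a filter is turned into a query-string fragment. It is not the backend's code: `sketch_fragment` is a hypothetical helper, it assumes plain string or number values, and it skips the `Clean`/`Exact` input types and the datetime conversion that the real method performs.

```python
# Simplified re-creation of the fragment-building rules shown in the listing.
# Assumptions: no Clean() escaping, no datetime handling, and the 'exact'
# filter is left out because the listing routes it through Exact(...).
FILTER_TYPES = {
    'content': '%s',
    'contains': '*%s*',
    'endswith': '*%s',
    'startswith': '%s*',
    'gt': '{%s to}',
    'gte': '[%s to]',
    'lt': '{to %s}',
    'lte': '[to %s]',
    'fuzzy': '%s~',
}

MULTI_TERM = ('content', 'contains', 'startswith', 'endswith', 'fuzzy')


def sketch_fragment(field, filter_type, value):
    """Return the query-string fragment for one filter (hypothetical helper)."""
    # 'content' means "no specific field", so no "field:" prefix is emitted.
    prefix = '' if field == 'content' else '%s:' % field

    if filter_type in MULTI_TERM:
        # Each space-separated term gets the template; terms are ANDed together.
        terms = [FILTER_TYPES[filter_type] % term for term in value.split(' ')]
        frag = terms[0] if len(terms) == 1 else '(%s)' % ' AND '.join(terms)
    elif filter_type == 'in':
        frag = '(%s)' % ' OR '.join('"%s"' % v for v in value)
    elif filter_type == 'range':
        frag = '[%s to %s]' % (value[0], value[1])
    else:
        frag = FILTER_TYPES[filter_type] % value

    # Wrap bare fragments in parentheses, as the listing does at the end.
    if frag and not frag.startswith('(') and not frag.endswith(')'):
        frag = '(%s)' % frag
    return prefix + frag


print(sketch_fragment('title', 'startswith', 'hello world'))  # title:(hello* AND world*)
print(sketch_fragment('content', 'content', 'django'))        # (django)
print(sketch_fragment('age', 'gte', 18))                      # age:([18 to])
```

Note that multi-word input is split on spaces only, which is one reason the analyzer used at indexing time matters for Chinese text.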
'''2. Modify the copied source file'''
# After the last module-level import, add the jieba analyzer:
from jieba.analyse import ChineseAnalyzer

# Switch the text analyzer to Chinese word segmentation.
# Find:
#     analyzer=StemmingAnalyzer()
# and change it to:
#     analyzer=ChineseAnalyzer()
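
The reason for swapping the analyzer can be shown without installing anything. The sketch below is only an illustration: `english_style_tokens` stands in for a regex-based English tokenizer, and `toy_segment` is a hypothetical greedy dictionary matcher with a made-up two-word vocabulary. It is not jieba's algorithm, which relies on a much larger dictionary.

```python
import re

# Rough stand-in for an English-oriented tokenizer: runs of word characters.
WORD_RE = re.compile(r'\w+')


def english_style_tokens(text):
    """Split text the way a whitespace/punctuation-based tokenizer would."""
    return WORD_RE.findall(text)


def toy_segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching over a tiny hand-made vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


sentence = '我爱北京天安门'
vocab = {'北京', '天安门'}  # hypothetical mini dictionary

print(english_style_tokens('I love Beijing'))  # ['I', 'love', 'Beijing']
print(english_style_tokens(sentence))          # ['我爱北京天安门']
print(toy_segment(sentence, vocab))            # ['我', '爱', '北京', '天安门']
```

Because Chinese has no spaces between words, the English-style tokenizer indexes the whole sentence as a single term, so a search for 北京 would not match it. A segmenter produces separate terms that can be matched individually.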



2) In the Django settings, point haystack at the renamed backend file
(settings.py)

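# Full-text search framework configuration
HAYSTACK_CONNECTIONS = {
    'default': {
        # Use the jieba-enabled copy of the Whoosh backend
        # (jsapp/whoosh_cn_backend.py is the file you just added)
        'ENGINE': 'jsapp.whoosh_cn_backend.WhooshEngine',
        # Stock Whoosh engine, without Chinese word segmentation:
        # 'ENGINE': 'haystack.backends.whoosh_backend.WhooshEngine',
        # Directory where the index files are stored
        'PATH': os.path.join(BASE_DIR, 'whoosh_index'),
    }
}
# With this setting, the index is updated automatically whenever a model instance is saved or deleted
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'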
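
The `ENGINE` value is a dotted import path, so `jsapp.whoosh_cn_backend.WhooshEngine` only works if `whoosh_cn_backend.py` sits inside the `jsapp` package and defines `WhooshEngine`. The following is a rough sketch of how such a path is typically resolved. `resolve_dotted_path` is a hypothetical helper, not haystack's actual loader, and it uses a standard-library class as a stand-in target so that it runs outside a Django project.

```python
import importlib


def resolve_dotted_path(path):
    """Split 'package.module.ClassName' and import the named attribute."""
    module_path, _, attr_name = path.rpartition('.')
    if not module_path:
        raise ValueError('%r is not a dotted path' % path)
    # ImportError here would mean the module part of the path is wrong;
    # AttributeError below would mean the class name is wrong.
    module = importlib.import_module(module_path)
    return getattr(module, attr_name)


# Stand-in target from the standard library, so the sketch runs anywhere.
cls = resolve_dotted_path('collections.OrderedDict')
print(cls)
```

If the path in the settings is misspelled, or the file was placed in a different app, the lookup fails at the import step, which is usually the first thing to check when the engine cannot be loaded.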

Using the index

jsapp/urls.py

# Note: django.conf.urls.url was removed in Django 4.0;
# on newer versions use django.urls.re_path instead.
from django.conf.urls import url

from . import views as view

urlpatterns = [
    url(r'abc/$', view.basic_search),
]

 jsapp/views.py

import json

from django.conf import settings
from django.core.paginator import InvalidPage, Paginator
from django.http import HttpResponse, JsonResponse
from haystack.forms import ModelSearchForm
from haystack.query import EmptySearchQuerySet

RESULTS_PER_PAGE = getattr(settings, 'HAYSTACK_SEARCH_RESULTS_PER_PAGE', 20)


def basic_search(request, load_all=True, form_class=ModelSearchForm, searchqueryset=None, extra_context=None, results_per_page=None):
    query = ''
    results = EmptySearchQuerySet()

    # Only run a search when the request carries a non-empty ?q= parameter.
    if request.GET.get('q'):
        form = form_class(request.GET, searchqueryset=searchqueryset, load_all=load_all)
        if form.is_valid():
            query = form.cleaned_data['q']
            results = form.search()
    else:
        form = form_class(searchqueryset=searchqueryset, load_all=load_all)

    paginator = Paginator(results, results_per_page or RESULTS_PER_PAGE)
    try:
        page = paginator.page(int(request.GET.get('page', 1)))
    except (InvalidPage, ValueError):
        # ValueError covers a non-numeric ?page= value.
        result = {"code": 404, "msg": 'No file found!', "data": []}
        return HttpResponse(json.dumps(result), content_type="application/json")

    # Template-style context; it is built here but not part of the JSON response.
    context = {
        'form': form,
        'page': page,
        'paginator': paginator,
        'query': query,
        'suggestion': None,
    }
    if results.query.backend.include_spelling:
        context['suggestion'] = form.get_suggestion()
    if extra_context:
        context.update(extra_context)

    # Serialize the current page of results for the front end.
    jsondata = []
    for result in page.object_list:
        data = {
            'pk': result.object.pk,
            'title': result.object.title,
            'content': result.object.body,
        }
        jsondata.append(data)

    result = {"code": 200, "msg": 'Search successfully!', "data": jsondata}
    return JsonResponse(result, content_type="application/json")
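
Since the front end and back end are separated, the client only needs to send `q` and `page` as query parameters and read the JSON that comes back. The sketch below shows one way to do that with the standard library. `SEARCH_URL` is a hypothetical address (the real prefix depends on the project's root `urls.py`), and the sample response body is made up to match the keys the view emits.

```python
import json
from urllib.parse import urlencode

# Hypothetical mount point; adjust to wherever jsapp's urls are included.
SEARCH_URL = 'http://127.0.0.1:8000/abc/'


def build_search_url(keyword, page=1):
    """Return the GET URL for one page of results; urlencode escapes non-ASCII text."""
    return '%s?%s' % (SEARCH_URL, urlencode({'q': keyword, 'page': page}))


def extract_titles(body):
    """Pull the titles out of a response body shaped like the view's JSON."""
    payload = json.loads(body)
    if payload['code'] != 200:
        return []
    return [item['title'] for item in payload['data']]


url = build_search_url('全文检索', page=2)
print(url)

# Made-up response body, matching the keys the view emits.
sample = '{"code": 200, "msg": "Search successfully!", "data": [{"pk": 1, "title": "demo", "content": "..."}]}'
print(extract_titles(sample))  # ['demo']
```

The view reports an invalid page through the `code` field in the body, so a client along these lines should check `code` rather than rely on the HTTP status alone.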

 
