YUDA't


whoosh

Overview

pure-Python search engine

  • Uses the Okapi BM25F ranking function
  • No need to deal with a Java environment, unlike Lucene
  • All indexed text must be unicode

Glossary

  • Analysis
    • The process of breaking the text of a field into individual terms to be indexed. This consists of tokenizing the text into terms, and then optionally filtering the tokenized terms (for example, lowercasing and removing stop words). Whoosh includes several different analyzers.
  • Corpus
    • The set of documents you are indexing.
  • Documents
    • The individual pieces of content you want to make searchable. The word “documents” might imply files, but the data source could really be anything – articles in a content management system, blog posts in a blogging system, chunks of a very large file, rows returned from an SQL query, individual email messages from a mailbox file, or whatever. When you get search results from Whoosh, the results are a list of documents, whatever “documents” means in your search engine.
  • Fields
    • Each document contains a set of fields. Typical fields might be “title”, “content”, “url”, “keywords”, “status”, “date”, etc. Fields can be indexed (so they’re searchable) and/or stored with the document. Storing the field makes it available in search results. For example, you typically want to store the “title” field so your search results can display it.
  • Forward index
    • A table listing every document and the words that appear in the document. Whoosh lets you store term vectors that are a kind of forward index.
  • Indexing
    • The process of examining documents in the corpus and adding them to the reverse index.
  • Postings
    • The reverse index lists every word in the corpus, and for each word, a list of documents in which that word appears, along with some optional information (such as the number of times the word appears in that document). These items in the list, containing a document number and any extra information, are called postings. In Whoosh the information stored in postings is customizable for each field.
  • Reverse index
    • Basically a table listing every word in the corpus, and for each word, the list of documents in which it appears. It can be more complicated (the index can also list how many times the word appears in each document, the positions at which it appears, etc.) but that’s how it basically works.
  • Schema
    • Whoosh requires that you specify the fields of the index before you begin indexing. The Schema associates field names with metadata about the field, such as the format of the postings and whether the contents of the field are stored in the index.
  • Term vector
    • A forward index for a certain field in a certain document. You can specify in the Schema that a given field should store term vectors.
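
The glossary terms fit together as one small pipeline, which the following plain-Python sketch illustrates. It is not how Whoosh implements its analyzers or index; the stop-word list, the `analyze` helper, and the three-document corpus are assumptions made up for the illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "is", "a", "this", "my", "are"}

def analyze(text):
    """Analysis: tokenize, lowercase, then filter out stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

# Corpus: docnum -> document text.
corpus = {
    0: "This is my document!",
    1: "This is the second example.",
    2: "Examples are many.",
}

# Reverse index: term -> list of postings, each posting = (docnum, frequency).
reverse_index = defaultdict(list)
# Forward index: docnum -> terms in that document (like a term vector).
forward_index = {}

# Indexing: examine each document and add its terms to the reverse index.
for docnum, text in corpus.items():
    terms = analyze(text)
    forward_index[docnum] = terms
    counts = defaultdict(int)
    for term in terms:
        counts[term] += 1
    for term, freq in counts.items():
        reverse_index[term].append((docnum, freq))

print(reverse_index["document"])  # postings for the term "document"
print(forward_index[1])           # terms of document 1
```

Looking up a word in `reverse_index` answers "which documents contain this term", while `forward_index` answers "which terms does this document contain".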

Designing a schema

Schema and fields

  • The Schema specifies the fields of the documents in an index.
  • Each document can have multiple fields, such as title, content, url, date, etc.
  • A field can be indexed, stored, or both.
  • The schema is the set of all possible fields in a document; each individual document may use only a subset of the fields in the schema.
    > e.g. when indexing emails, the fields might be from_addr, to_addr, subject, body, and attachments.
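
The "set of all possible fields" idea can be pictured with a small plain-Python sketch. This is only an illustration of the rule, not Whoosh code; `SCHEMA_FIELDS` and `check_document` are hypothetical names.

```python
# Hypothetical schema for the email example: every field a document may use.
SCHEMA_FIELDS = {"from_addr", "to_addr", "subject", "body", "attachments"}

def check_document(**fields):
    """Accept a document only if all of its field names are in the schema."""
    unknown = set(fields) - SCHEMA_FIELDS
    if unknown:
        raise ValueError("fields not in schema: %s" % ", ".join(sorted(unknown)))
    return fields

# A document may leave fields out (no to_addr or attachments here).
doc = check_document(from_addr="a@example.com", subject="Hi", body="Hello there")

# A field the schema does not know about is rejected.
try:
    check_document(subject="Hi", priority="high")
except ValueError as exc:
    error = str(exc)
    print(error)
```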

Built-in field types

Whoosh provides several built-in field types.

  • whoosh.fields.TEXT
    • body text
    • indexes (and optionally stores) the text and stores term positions to allow phrase searching
  • whoosh.fields.KEYWORD
    • space- or comma-separated keywords
  • whoosh.fields.ID
    • indexes the entire value of the field as a single unit
    • recommended for things like url, file path, date, category
    • the default is stored=False, so use ID(stored=True) if you want the value stored
  • whoosh.fields.STORED
    • stored with the document, but not indexed and not searchable
  • whoosh.fields.NUMERIC
    • int, long, or floating point numbers in a compact, sortable format
  • whoosh.fields.DATETIME
    • datetime objects
    • sortable
  • whoosh.fields.BOOLEAN
    • boolean values (searchable as yes, no, true, false, 1, 0, t, f)
  • whoosh.fields.NGRAM
    • TBD

Creating a Schema

from whoosh.fields import Schema, TEXT, KEYWORD, ID, STORED
from whoosh.analysis import StemmingAnalyzer

schema = Schema(from_addr=ID(stored=True),
                to_addr=ID(stored=True),
                subject=TEXT(stored=True),
                body=TEXT(analyzer=StemmingAnalyzer()),
                tags=KEYWORD)

or

from whoosh.fields import SchemaClass, TEXT, KEYWORD, ID, STORED

class MySchema(SchemaClass):
    path = ID(stored=True)
    title = TEXT(stored=True)
    content = TEXT
    tags = KEYWORD

Modifying the schema after indexing

  • add_field() and remove_field()
from whoosh import fields

# ix is an existing Index object
writer = ix.writer()
writer.add_field("fieldname", fields.TEXT(stored=True))
writer.remove_field("content")
writer.commit()

Dynamic fields

Dynamic fields let you associate a field type with any field name that matches a given “glob” (a name pattern containing *, ?, and/or [abc] wildcards).

schema = fields.Schema(...)
# Any name ending in "_d" will be treated as a stored
# DATETIME field
schema.add("*_d", fields.DATETIME(stored=True), glob=True)
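
To get a feel for which names a glob covers, the standard library's `fnmatch` module can be used to try patterns out. This is a sketch of glob semantics only, not a claim about how Whoosh resolves dynamic fields internally; the pattern table and `resolve_field_type` are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical table: glob pattern -> name of the field type it should get.
dynamic_fields = {
    "*_d": "DATETIME",    # * matches any run of characters
    "tag_?": "KEYWORD",   # ? matches exactly one character
    "[abc]_id": "ID",     # [abc] matches one of a, b, or c
}

def resolve_field_type(name):
    """Return the type of the first pattern matching name, or None."""
    for pattern, field_type in dynamic_fields.items():
        if fnmatchcase(name, pattern):
            return field_type
    return None

for name in ["created_d", "tag_1", "tag_12", "a_id", "d_id"]:
    print(name, "->", resolve_field_type(name))
```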

Advanced schema setup

  • Field boosts
schema = Schema(title=TEXT(field_boost=2.0), body=TEXT)
  • Field types
  • Formats
  • Vectors

How to index documents

Creating an Index object

  • index.create_in : creates an index in a directory
    If the directory already contains an index, its current contents are cleared.
import os, os.path
from whoosh import index

if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

ix = index.create_in("indexdir", schema)
  • index.open_dir : open an existing index
import whoosh.index as index

ix = index.open_dir("indexdir")

Indexing documents

Once you have an Index object, you can add documents to the index with an IndexWriter.

ix = index.open_dir("index")
writer = ix.writer()

NOTE

  • Because opening a writer locks the index for writing, in a multi-threaded or multi-process environment your code needs to be aware that opening a writer may raise an exception (whoosh.store.LockError) if a writer is already open. Whoosh includes a couple of example implementations (whoosh.writing.AsyncWriter and whoosh.writing.BufferedWriter) of ways to work around the write lock.

The IndexWriter’s add_document(**kwargs) method accepts keyword arguments where the field name is mapped to a value:

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my document!",
                    path=u"/a", tags=u"first short", icon=u"/icons/star.png")
writer.add_document(title=u"Second try", content=u"This is the second example.",
                    path=u"/b", tags=u"second short", icon=u"/icons/sheep.png")
writer.add_document(title=u"Third time's the charm", content=u"Examples are many.",
                    path=u"/c", tags=u"short", icon=u"/icons/book.png")
writer.commit()

NOTE

  • You don't have to fill in a value for every field.

Finishing adding documents

Call commit() on the IndexWriter object.

writer.commit()

To close the writer without committing the changes, call cancel() instead of commit().

writer.cancel()

Merging segments

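Using several segments is more efficient than rewriting the entire index every time you add documents.
However, searching many segments can slow searches down, so keep an eye on the segment count.
For that reason, Whoosh has an algorithm that merges small segments into larger ones each time you call commit().
If you don't want that: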

writer.commit(merge=False)

To merge all segments, use optimize to optimize the index into a single segment.

writer.commit(optimize=True)

Deleting documents

The following IndexWriter methods deal with deleting documents. After deleting, you must call commit().

delete_document(docnum)
is_deleted(docnum)
delete_by_term(fieldname, termtext)
delete_by_query(query)
writer = ix.writer()
# Delete document by its path -- this field must be indexed
writer.delete_by_term('path', u'/a/b/c')
# Save the deletion to disk
writer.commit()

Updating documents

from whoosh import index
from whoosh.fields import Schema, ID, TEXT

schema = Schema(path=ID(unique=True), content=TEXT)

# The "index" directory must already exist
ix = index.create_in("index", schema)
writer = ix.writer()
writer.add_document(path=u"/a", content=u"The first document")
writer.add_document(path=u"/b", content=u"The second document")
writer.commit()

writer = ix.writer()
# Because "path" is marked as unique, calling update_document with path="/a"
# will delete any existing documents where the "path" field contains "/a".
writer.update_document(path=u"/a", content=u"Replacement for the first document")
writer.commit()
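
The "delete, then add" behaviour of update_document can be sketched in plain Python. This models the semantics described in the comment above on a list of dicts; it is not Whoosh's implementation, and the `update_document` helper and its `unique_fields` argument are hypothetical.

```python
def update_document(docs, unique_fields, **fields):
    """Drop docs that share a unique field value with the new doc, then add it."""
    def clashes(doc):
        return any(
            name in fields and doc.get(name) == fields[name]
            for name in unique_fields
        )
    kept = [doc for doc in docs if not clashes(doc)]
    kept.append(dict(fields))
    return kept

docs = [
    {"path": "/a", "content": "The first document"},
    {"path": "/b", "content": "The second document"},
]

# "path" is the unique field, so the old "/a" document is replaced.
docs = update_document(docs, {"path"}, path="/a",
                       content="Replacement for the first document")

# With a path that does not exist yet, the call behaves like a plain add.
docs = update_document(docs, {"path"}, path="/c", content="A third document")

for doc in docs:
    print(doc["path"], "->", doc["content"])
```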

Clearing the index

from whoosh import writing

with myindex.writer() as mywriter:
    # You can optionally add documents to the writer here
    # e.g. mywriter.add_document(...)

    # Using mergetype=CLEAR clears all existing segments so the index will
    # only have any documents you've added to this writer
    mywriter.mergetype = writing.CLEAR

How to search

The Searcher object

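Get a whoosh.searching.Searcher object by calling searcher() on the Index. Close the searcher when you are done with it, or use it as a context manager.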

searcher = myindex.searcher()
#or
with ix.searcher() as searcher:
    ...
#or
try:
    searcher = ix.searcher()
    ...
finally:
    searcher.close()

methods

  • lexicon(fieldname)
    • lists all the terms in a field
>>> list(searcher.lexicon("content"))
[u"document", u"index", u"whoosh"]
  • search()
    • very important
    • takes a whoosh.query.Query object and returns a Results object
from whoosh.qparser import QueryParser

qp = QueryParser("content", schema=myindex.schema)
q = qp.parse(u"hello world")

with myindex.searcher() as s:
    results = s.search(q)
  • limit
    • use with search()
    • By default, search() returns only the first 10 matching documents.
results = s.search(q, limit=20)
  • search_page
    • retrieves a specific page of results (page numbers start at 1)
results = s.search_page(q, 1)
  • pagelen
    • the number of results per page; defaults to 10
results = s.search_page(q, 5, pagelen=20)
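
The relationship between a page number, pagelen, and the hits on that page is simple arithmetic. The sketch below shows it on a plain list; `page_bounds` and the list of 27 hit numbers are hypothetical, and this is not Whoosh code.

```python
def page_bounds(pagenum, pagelen=10):
    """Return the (start, end) slice bounds for a 1-based page number."""
    if pagenum < 1:
        raise ValueError("pagenum must be >= 1")
    start = (pagenum - 1) * pagelen
    return start, start + pagelen

# 27 hypothetical hits, numbered 1..27 in ranked order.
hits = list(range(1, 28))

start, end = page_bounds(2)            # second page, 10 per page
second_page = hits[start:end]          # hits 11..20

start, end = page_bounds(3)            # last page is shorter
third_page = hits[start:end]           # hits 21..27

print(second_page)
print(third_page)
print(page_bounds(5, pagelen=20))      # page 5 with pagelen=20 covers hits 81..100
```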

Results object

>>> results[0]
{"title": u"Hello World in Python", "path": u"/a/b/c"}
>>> results[0:2]
[{"title": u"Hello World in Python", "path": u"/a/b/c"},
{"title": u"Foo", "path": u"/bar"}]
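  • By default, Searcher.search(myquery) limits the number of hits to 10.
    So the number of scored hits in the Results object may be less than the number of matching documents in the index.
    In other words, len(results) counts every matching document, while results.scored_length() counts only the hits that were actually scored and returned.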
>>> # How many documents in the entire index would have matched?
>>> len(results)
27
>>> # How many scored and sorted documents in this Results object?
>>> # This will often be less than len() if the number of hits was limited
>>> # (the default).
>>> results.scored_length()
10
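  • Calling len(results) may have to run an unscored version of the query again to count every matching document, which can cause a noticeable delay on a very large index. To avoid that delay, use has_exact_length(), estimated_min_length(), and estimated_length():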
found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()
    high = results.estimated_length()

    print("Scored", found, "of between", low, "and", high, "documents")
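
Why the number of scored hits and the total number of matches differ can be shown with a plain-Python top-N sketch. This is an illustration of the idea only, not how Whoosh collects results; the `search` helper and the made-up scores are assumptions.

```python
import heapq

def search(scores, limit=10):
    """scores maps docnum -> score for every matching document.

    Returns the top `limit` (docnum, score) pairs and the total match count.
    """
    top = heapq.nlargest(limit, scores.items(), key=lambda item: item[1])
    return top, len(scores)

# 27 hypothetical matching documents with made-up scores.
matches = {docnum: 1.0 / (docnum + 1) for docnum in range(27)}

top, total = search(matches)
print("Scored", len(top), "of exactly", total, "documents")
```

Only the `limit` best documents are kept and sorted; the rest matched the query but never make it into the scored list.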

Scoring and sorting

  • Normally the list of result documents is sorted by score.
    The whoosh.scoring module contains implementations of various scoring algorithms.
    The default is BM25F.
  • weighting
    • sets the scoring object the searcher uses
from whoosh import scoring

with myindex.searcher(weighting=scoring.TF_IDF()) as s:
    ...
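
For intuition about what a BM25-style score rewards, here is a simplified single-field BM25 term score in plain Python. It is a textbook-style sketch, not Whoosh's BM25F implementation (BM25F additionally combines per-field weights); the function name and the k1 and b values are assumptions.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq,
                    k1=1.2, b=0.75):
    """Score one term in one document.

    tf: times the term occurs in the document
    doc_len / avg_doc_len: this document's length vs. the corpus average
    doc_count: documents in the corpus; doc_freq: documents containing the term
    """
    # Rare terms get a higher inverse document frequency.
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # Term frequency saturates, and long documents are penalised via b.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

base = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                       doc_count=1000, doc_freq=10)
more_hits = bm25_term_score(tf=3, doc_len=100, avg_doc_len=100,
                            doc_count=1000, doc_freq=10)
longer_doc = bm25_term_score(tf=1, doc_len=400, avg_doc_len=100,
                             doc_count=1000, doc_freq=10)
common_term = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                              doc_count=1000, doc_freq=500)

print(base, more_hits, longer_doc, common_term)
```

More occurrences raise the score, while a longer document or a more common term lowers it.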

Sorting

  • Mark the fields you want to sort on with sortable=True in the schema
schema = fields.Schema(title=fields.TEXT(sortable=True),
                       content=fields.TEXT,
                       modified=fields.DATETIME(sortable=True)
                       )