Pylucene 的构建与基本使用

期末考完了。结合一下自己的使用经验，开整。

简介

Lucene

Lucene 是一个开源的全文检索引擎工具包，由 Apache 软件基金会支持和提供。Lucene 提供了完整的查询引擎和索引引擎，部分文本分析引擎。Lucene 的目的是为软件开发人员提供一个简单易用的工具包，以方便的在目标系统中实现全文检索的功能，或者是以此为基础建立起完整的全文检索引擎。

Pylucene

Pylucene 是 Lucene 的 Python 接口，可以在 Python 中使用 Lucene 的功能。

从源码安装

系统环境

Ubuntu 22.04.1 LTS

下载 Pylucene 源码

在这里下载 Pylucene 的源码，我选择下载的是 pylucene-9.6.0-src.tar.gz。源码包里含有所有需要手动构建的软件包。

将源码包提取到你想要的目录中，提取出来的内容应该是这样的：

安装依赖

构建 Pylucene 以及 JCC 需要以下软件包¹：

Java JDK（官方推荐使用 Temurin Java）
Python 3.9+
gcc
g++
make

安装 Temurin Java (Adoptium):

  
sudo -s
apt install wget apt-transport-https gnupg
wget -O - https://packages.adoptium.net/artifactory/api/gpg/key/public | apt-key add -
echo "deb https://packages.adoptium.net/artifactory/deb $(awk -F= '/^VERSION_CODENAME/{print$2}' /etc/os-release) main" | tee /etc/apt/sources.list.d/adoptium.list
apt update
apt install temurin-17-jdk

安装 gcc、g++ 和 make：

sudo apt install gcc g++ make

安装 Python3.9+:

sudo apt install python3-dev python3-pip python3-venv python3-setuptools

构建 JCC

JCC 是一个用于将 Java 类转换为 Python 类的 C++ 程序。Pylucene 的构建需要 JCC。前面下载的源码包里已经包含了 JCC 的源码，我们只需要构建它就行了。

  
cd pylucene-9.6.0/jcc
python3 setup.py build
sudo python3 setup.py install

终端最后输出如下内容就说明构建完成。

构建 PyLucene

回到目录 ~/pylucene-9.6.0，编辑 Makefile。在文件前半段可以看到有很多被注释掉的不同系统的配置，我们需要找到自己的系统对应的配置，取消注释，然后根据实际把配置改成正确无误的。~~实际上的工作就是确认 Python 的路径。~~在这里，我们找到 Linux + Python3 的配置代码，取消注释并确认如下。

# Linux     (Debian Bullseye 64-bit, Python 3.9.2, Temurin Java 17
PREFIX_PYTHON=/usr
PYTHON=$(PREFIX_PYTHON)/bin/python3
JCC=$(PYTHON) -m jcc --shared
NUM_FILES=16

然后就是喜闻乐见的 make 环节了。

  
make
sudo make test
sudo make install

正常来说安装成功后终端也会输出以下内容。

运行测试

创建一个 Python 文件 test.py，写入以下内容：

  
import lucene
from java.io import StringReader
from org.apache.lucene.analysis.standard import StandardTokenizer
from org.apache.lucene.analysis.tokenattributes import CharTermAttribute
lucene.initVM()

test = "To be, or not to be, that is the question."

tokenizer = StandardTokenizer()
tokenizer.setReader(StringReader(test))
charTermAttrib = tokenizer.getAttribute(CharTermAttribute.class_)
tokenizer.reset()
tokens = []
while tokenizer.incrementToken():
    tokens.append(charTermAttrib.toString())
print(tokens)

这段代码的作用是调用 Lucene 的标准分词器对字符串进行分词。运行这个 Python 文件，如果正确输出了分词结果，就说明 Pylucene 已经安装成功了。

基本使用

下面简单讲解使用 Pylucene 从一个数据集中提取内容、创建索引，并在终端实现简单的命令行查找的过程²。~~其实是我某个专业课的大作业。~~

创建索引

我使用 TDT3 英文语料库作为数据集进行内容提取和索引构建。TDT3 英语音频语料库包含在三个月的收集期间内（1998 年 10 月至 12 月）每天从六个美式英语新闻源收集的新闻广播音频文档，根据日期整合为文件夹，共含有 37526 个文本文档。文档的示例内容如下。

<DOC>
<DOCNO> APW19981010.0182 </DOCNO>
<DOCTYPE> NEWS </DOCTYPE>
<TXTTYPE> NEWSWIRE </TXTTYPE>
<TEXT>
Aki Takamura fired a 3-under-par 69 Saturday to share a one-stroke 
lead with overnight leader Hiromi Takamura after three rounds in the 
80 million yen (dlrs 650,000) Takara World Invitational. Aki Takamura, 
the 1996 Japan Women's Open champion, scored four birdies against 
a bogey for a 1-under 215 total at the 6,226-yard, par-72 Caledonian 
Golf Club course. Hiromi Takamura, winless since her seventh career 
victory in the 1994 Toyo Suisan Ladies, went around in 71. The two 
co-leaders are not related. Nayoko Yoshikawa, seeking her first title 
in three years and 30th overall, shot the day's best round of 65 to 
share third place at even-par 216 with Taiwan's Cheng Mei-chi, who 
turned in a 70. Thirty wins will give Yoshikawa a permanent exempt 
status on the Japan LPGA Tour. South Korean Ku Ok-hee shot a 75 and 
was alone in fifth at 220. Another shot behind at 221 came Michiko 
Hattori, who is leading the money list this season, Natsuko Noro and 
Ko Woo-soon of South Korea. The four-round event offers a top prize 
of 14.4 million yen (dlrs 117,000). 
</TEXT>
</DOC>

首先，我们为索引定义一个类 Indexer，在初始化时将定义索引的存储位置、分词器类型、索引器配置以及建立索引器。然后，定义一个 newIndex 方法，这个方法在调用时将遍历数据集中的所有文档，提取出文档的 TEXT 标签内的内容，用其创建索引。

简单来说，创建索引的基本步骤如下：

指定索引器的一些基本信息并创建索引器
创建文档对象
创建信息字段对象，将提取的文档内容包装到信息字段对象中
将信息字段对象添加到文档对象中
将文档对象添加到索引器中
关闭索引器

示例代码如下。

  
class Indexer:
    def __init__(self, doClear=True, computeLengthNorm=False):
        lucene.getVMEnv().attachCurrentThread()
        self.dir = FSDirectory.open(Paths.get(INDEX_PATH))  # 索引存储位置
        self.analyzer = StandardAnalyzer()                  # 分词器类型
        self.config = IndexWriterConfig(self.analyzer)      # 索引器配置
        self.writer = IndexWriter(self.dir, self.config)    # 建立索引器

    def newIndex(self):
        for i in range(len(os.listdir(DOC_PATH))):  # 遍历数据集中的所有文档
            dir = os.path.join(DOC_PATH, os.listdir(DOC_PATH)[i])
            for j in range(len(os.listdir(dir))):
                fname = os.listdir(dir)[j]
                title = fname[0:-4]             # 提取文档标题
                text = ""
                with open(os.path.join(dir, fname), "r") as f:  # 提取文档内容
                    for line in f:
                        if line[0] != "<" and line[-1] != ">":
                            text += line[0:-1]
                    text += "\n"
                doc = Document()                # 创建文档对象
                titleField = TextField("title", title, Field.Store.YES)
                # 为 title 字段创建信息字段对象，选择完全存储
                textField = TextField("text", text, Field.Store.NO)
                # 为 text 字段创建信息字段对象，选择不存储（为了减少索引文件的大小，不存储的内容将来无法从索引中直接获取）
                doc.add(titleField)             # 将信息字段对象添加到文档对象中
                doc.add(textField)              # 将信息字段对象添加到文档对象中
                self.writer.addDocument(doc)    # 将文档对象添加到索引器中
        self.writer.close()                     # 关闭索引器

~~待完成~~