Commit bbf36db

junhewk and claude committed
v0.0.1.5: mecab-ko 0.999 support, Windows build fix, dict_index()

- Fix Windows build: add $(SHLIB):$(MECAB_OBJS) dependency so MeCab .o files are compiled before linking; patch dllimport, _stdcall, size_t overload, progress_bar gutting, and HAVE_WINDOWS_H for MinGW/Rtools45 compatibility.
- Korean builds now use mecab-ko-msvc 0.999 (Pusnow/mecab-ko-msvc) instead of mecab-ko 0.9.2. Japanese builds continue using taku910/mecab 0.996. Selected via MECAB_LANG env var (default: ko).
- Add dict_index() R function wrapping mecab_dict_index, allowing user dictionary compilation directly from R without the external mecab-dict-index command-line tool.
- Update CI to install mecab-ko 0.999 pre-built binaries on Ubuntu and macOS for Korean builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 614b7ce commit bbf36db

17 files changed

Lines changed: 342 additions & 156 deletions

.github/workflows/R-CMD-check.yml

Lines changed: 20 additions & 20 deletions
@@ -29,32 +29,32 @@ jobs:
           sudo apt-get update -y -qq
           sudo apt-get install -y mecab libmecab-dev mecab-ipadic-utf8
           sudo ldconfig
-      - name: Install MeCab (Korean, Ubuntu)
+      - name: Install MeCab-Ko (Korean, Ubuntu)
         if: ${{ matrix.platform == 'ubuntu-latest' && matrix.mecab_lang == 'ko' }}
         run: |
-          sudo apt-get update -y -qq
-          sudo apt-get install -y autoconf automake
-          MECAB_KO_BUILD=$(mktemp -d)
-          cd "$MECAB_KO_BUILD"
-          curl -fsSL "https://github.com/junhewk/RcppMeCab/releases/download/0.0.1.0/mecab-0.996-ko-0.9.2.tar.gz" -o mecab-ko.tar.gz
-          tar xzf mecab-ko.tar.gz
-          cd mecab-0.996-ko-0.9.2
-          ./configure
-          make
-          sudo make install
+          curl -fsSL "https://github.com/Pusnow/mecab-ko-msvc/releases/download/release-0.999/mecab-ko-linux-x86_64.tar.gz" -o mecab-ko.tar.gz
+          sudo tar xzf mecab-ko.tar.gz -C /usr/local --strip-components=1
           sudo ldconfig
-          cd "$MECAB_KO_BUILD"
-          curl -fsSL "https://github.com/junhewk/RcppMeCab/releases/download/0.0.1.0/mecab-ko-dic-2.1.1-20180720.tar.gz" -o mecab-ko-dic.tar.gz
-          tar xzf mecab-ko-dic.tar.gz
-          cd mecab-ko-dic-2.1.1-20180720
-          ./autogen.sh
-          ./configure
-          make
-          sudo make install
-          rm -rf "$MECAB_KO_BUILD"
+          curl -fsSL "https://github.com/Pusnow/mecab-ko-msvc/releases/download/release-0.999/mecab-ko-dic.tar.gz" -o mecab-ko-dic.tar.gz
+          sudo mkdir -p /usr/local/lib/mecab/dic
+          sudo tar xzf mecab-ko-dic.tar.gz -C /usr/local/lib/mecab/dic
       - name: Install MeCab (Japanese, macOS)
         if: ${{ matrix.platform == 'macos-latest' && matrix.mecab_lang == 'ja' }}
         run: brew install mecab mecab-ipadic
+      - name: Install MeCab-Ko (Korean, macOS)
+        if: ${{ matrix.platform == 'macos-latest' && matrix.mecab_lang == 'ko' }}
+        run: |
+          ARCH=$(uname -m)
+          if [ "$ARCH" = "arm64" ]; then
+            VARIANT="macos-arm64"
+          else
+            VARIANT="macos-x86_64"
+          fi
+          curl -fsSL "https://github.com/Pusnow/mecab-ko-msvc/releases/download/release-0.999/mecab-ko-${VARIANT}.tar.gz" -o mecab-ko.tar.gz
+          sudo tar xzf mecab-ko.tar.gz -C /usr/local --strip-components=1
+          curl -fsSL "https://github.com/Pusnow/mecab-ko-msvc/releases/download/release-0.999/mecab-ko-dic.tar.gz" -o mecab-ko-dic.tar.gz
+          sudo mkdir -p /usr/local/lib/mecab/dic
+          sudo tar xzf mecab-ko-dic.tar.gz -C /usr/local/lib/mecab/dic
       - name: Install dependencies
         run: |
           install.packages(c("remotes", "rcmdcheck"))
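
Reviewer note: the Ubuntu and macOS Korean steps differ only in which release asset they download. A minimal sketch of that selection as a single shell function is shown here. `mecab_ko_variant` and `mecab_ko_asset_url` are hypothetical names that do not appear in this commit, and the sketch assumes only the three asset names referenced in the workflow exist.

```sh
# Hypothetical helper, not part of the commit: map OS and CPU architecture
# to the asset variant names used in the workflow steps.
mecab_ko_variant() {
  os="${1:-$(uname -s)}"     # defaults to the current OS
  arch="${2:-$(uname -m)}"   # defaults to the current architecture
  case "$os/$arch" in
    Linux/x86_64) echo "linux-x86_64" ;;
    Darwin/arm64) echo "macos-arm64" ;;
    Darwin/*)     echo "macos-x86_64" ;;  # same fallback as the workflow's else branch
    *) echo "no known mecab-ko asset for $os/$arch" >&2; return 1 ;;
  esac
}

# Build the download URL from the release tag referenced in the workflow.
mecab_ko_asset_url() {
  variant="$(mecab_ko_variant "$@")" || return 1
  echo "https://github.com/Pusnow/mecab-ko-msvc/releases/download/release-0.999/mecab-ko-${variant}.tar.gz"
}
```

Called without arguments, `mecab_ko_asset_url` uses `uname` for the current machine; passing an OS and architecture explicitly makes the mapping easy to check without a matching runner.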

DESCRIPTION

Lines changed: 4 additions & 3 deletions
@@ -1,6 +1,6 @@
 Package: RcppMeCab
 Title: 'rcpp' Wrapper for 'mecab' Library
-Version: 0.0.1.4
+Version: 0.0.1.5
 Authors@R: c(person("Junhewk", "Kim", role = c("aut", "cre"),
                     email = "junhewk.kim@gmail.com"),
              person("Taku", "Kudo", role = c("aut"),
@@ -17,7 +17,7 @@ Depends: R (>= 3.4.0)
 License: GPL
 Encoding: UTF-8
 BugReports: https://github.com/junhewk/RcppMeCab/issues
-RoxygenNote: 7.1.1
+RoxygenNote: 7.3.3
 Language: en-US
 LinkingTo:
     Rcpp,
@@ -30,4 +30,5 @@ Suggests:
     testthat,
     spelling
 SystemRequirements:
-    MeCab 0.996 (or mecab-ko 0.9.2) or higher (libmecab-dev (deb), mecab-devel (rpm))
+    MeCab 0.996 or higher for Japanese (libmecab-dev (deb), mecab-devel (rpm)),
+    mecab-ko 0.999 (https://github.com/Pusnow/mecab-ko-msvc) for Korean

NAMESPACE

Lines changed: 1 addition & 0 deletions
@@ -1,5 +1,6 @@
 # Generated by roxygen2: do not edit by hand
 
+export(dict_index)
 export(pos)
 export(posParallel)
 import(Rcpp)

R/RcppExports.R

Lines changed: 4 additions & 0 deletions
@@ -1,6 +1,10 @@
 # Generated by using Rcpp::compileAttributes() -> do not edit by hand
 # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
 
+dictIndexRcpp <- function(args) {
+    .Call(`_RcppMeCab_dictIndexRcpp`, args)
+}
+
 posParallelJoinRcpp <- function(text, sys_dic, user_dic) {
     .Call(`_RcppMeCab_posParallelJoinRcpp`, text, sys_dic, user_dic)
 }

R/dict.R

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+#' Compile a MeCab user dictionary
+#'
+#' \code{dict_index} compiles a user dictionary CSV file into a binary
+#' dictionary that can be used with \code{pos} and \code{posParallel}.
+#'
+#' This function wraps MeCab's \code{mecab-dict-index} internally, so you
+#' do not need the command-line tool installed separately.
+#'
+#' @param dic_csv Character scalar. Path to the user dictionary CSV file(s).
+#'   Multiple CSV files can be provided as a character vector.
+#' @param out_dic Character scalar. Path for the output compiled dictionary file.
+#' @param dic_dir Character scalar. Path to the system dictionary directory.
+#'   This is required so that MeCab can reference the system dictionary
+#'   configuration during compilation.
+#' @param dic_charset Character scalar. Charset of the input CSV file.
+#'   Default is \code{"utf-8"}.
+#' @param out_charset Character scalar. Charset of the output dictionary.
+#'   Default is \code{"utf-8"}.
+#' @return Invisible \code{TRUE} on success.
+#'
+#' @examples
+#' \dontrun{
+#' dict_index(
+#'   dic_csv = "user_words.csv",
+#'   out_dic = "user.dic",
+#'   dic_dir = "/usr/local/lib/mecab/dic/ipadic"
+#' )
+#'
+#' # Then use the compiled dictionary:
+#' pos("some text", user_dic = "user.dic")
+#' }
+#'
+#' @export
+dict_index <- function(dic_csv, out_dic, dic_dir,
+                       dic_charset = "utf-8", out_charset = "utf-8") {
+  if (!is.character(dic_csv) || length(dic_csv) < 1)
+    stop("dic_csv must be a character vector of CSV file path(s)")
+  if (!is.character(out_dic) || length(out_dic) != 1)
+    stop("out_dic must be a single output file path")
+  if (!is.character(dic_dir) || length(dic_dir) != 1)
+    stop("dic_dir must be a single directory path")
+
+  for (f in dic_csv) {
+    if (!file.exists(f))
+      stop("CSV file not found: ", f)
+  }
+  if (!dir.exists(dic_dir))
+    stop("System dictionary directory not found: ", dic_dir)
+
+  out_dir <- dirname(out_dic)
+  if (!dir.exists(out_dir))
+    stop("Output directory does not exist: ", out_dir)
+
+  args <- c("mecab-dict-index",
+            "-d", normalizePath(dic_dir, mustWork = TRUE),
+            "-u", normalizePath(out_dic, mustWork = FALSE),
+            "-f", dic_charset,
+            "-t", out_charset,
+            normalizePath(dic_csv, mustWork = TRUE))
+
+  dictIndexRcpp(args)
+
+  invisible(TRUE)
+}
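
Reviewer note: for readers more familiar with the command-line tool, the `args` vector built in `dict_index()` mirrors a `mecab-dict-index` invocation. The sketch below prints that argument vector one element per line. `dict_index_argv` is a hypothetical helper for illustration only; it hard-codes the default `utf-8` charsets and, unlike the R function, neither normalizes paths nor checks that the files exist.

```sh
# Hypothetical helper, not part of the commit: print the argv that
# dict_index() assembles, given a system dictionary directory, an output
# path, and one or more CSV files.
dict_index_argv() {
  dic_dir="$1"
  out_dic="$2"
  shift 2   # remaining arguments are the CSV files
  printf '%s\n' mecab-dict-index \
    -d "$dic_dir" \
    -u "$out_dic" \
    -f utf-8 \
    -t utf-8 \
    "$@"
}

# Same inputs as the roxygen example in R/dict.R.
dict_index_argv /usr/local/lib/mecab/dic/ipadic user.dic user_words.csv
```

Note that the `-m model_file` flag from the old README command is absent: the R wrapper passes only `-d`, `-u`, `-f`, and `-t` ahead of the CSV paths.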

README.md

Lines changed: 25 additions & 19 deletions
@@ -10,6 +10,17 @@ This package, RcppMeCab, is a `Rcpp` wrapper for the part-of-speech morphologica
 
 __Please see [this](README_kr.md) for easy installation and usage examples in Korean.__
 
+## MeCab backends
+
+RcppMeCab builds MeCab from source at install time. The MeCab variant is selected by the `MECAB_LANG` environment variable:
+
+| `MECAB_LANG` | Backend | Version | Source |
+|---|---|---|---|
+| `ko` (default) | [mecab-ko-msvc](https://github.com/Pusnow/mecab-ko-msvc) | 0.999 | Pusnow/mecab-ko-msvc |
+| `ja` | [MeCab](http://taku910.github.io/mecab/) | 0.996 | taku910/mecab |
+
+On Linux and macOS, if MeCab is already installed system-wide (detected via `mecab-config`), RcppMeCab uses the system installation regardless of `MECAB_LANG`.
+
 ## Installation
 
 ### Linux, macOS, and Windows
@@ -26,9 +37,9 @@ devtools::install_github("junhewk/RcppMeCab")
 
 If you already have MeCab installed (e.g. via `brew install mecab` on macOS, or `apt install libmecab-dev` on Linux), RcppMeCab will use your system installation.
 
-### Language selection (Windows)
+### Language selection
 
-On Windows, set `MECAB_LANG` before installation to choose the MeCab language variant. The default is `ko` (Korean).
+Set `MECAB_LANG` before installation to choose the MeCab variant:
 
 ```r
 # Korean (default)
@@ -44,7 +55,7 @@ install.packages("RcppMeCab", type = "source")
 You need a MeCab dictionary for your target language:
 
 + **Japanese**: Install [MeCab](http://taku910.github.io/mecab/) and IPAdic, or on macOS: `brew install mecab mecab-ipadic`
-+ **Korean**: Install [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko) and [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic). Mirrors of these files are also available on the [RcppMeCab releases page](https://github.com/junhewk/RcppMeCab/releases/tag/0.0.1.0) in case Bitbucket is unavailable. On Windows: install [mecab-ko-msvc](https://github.com/Pusnow/mecab-ko-msvc) and [mecab-ko-dic-msvc](https://github.com/Pusnow/mecab-ko-dic-msvc) in `C:\mecab`
++ **Korean**: Install [mecab-ko-dic](https://github.com/Pusnow/mecab-ko-msvc/releases) (available as `mecab-ko-dic.zip`/`mecab-ko-dic.tar.gz` from mecab-ko-msvc releases)
 + **Chinese**: Install MeCab with [MeCab Chinese Dic](http://www.52nlp.cn/%E7%94%A8mecab%E6%89%93%E9%80%A0%E4%B8%80%E5%A5%97%E5%AE%9E%E7%94%A8%E7%9A%84%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D%E7%B3%BB%E7%BB%9F%E4%B8%89%EF%BC%9Amecab-chinese)
 
 ## Usage
@@ -65,32 +76,27 @@ posParallel(sentence) # parallelized, faster for large inputs
 + `join`: if `TRUE` (default), output is `morpheme/tag`; if `FALSE`, output is `morpheme` with tag as attribute
 + `format`: `"list"` (default) or `"data.frame"`
 + `sys_dic`: directory containing `dicrc`, `model.bin`, etc. Set a default with `options(mecabSysDic = "/path/to/dic")`
-+ `user_dic`: path to a user dictionary compiled by `mecab-dict-index`
++ `user_dic`: path to a user dictionary compiled by `dict_index()`
 
 Note: provide full paths for `sys_dic` and `user_dic` (no tilde `~/` expansion).
 
 ## Compiling a user dictionary
 
-MeCab's `DictionaryCompiler` API calls `die()`, which would crash the R session, so it is not exposed through RcppMeCab. Use the `mecab-dict-index` command-line tool instead.
-
-You need a `model_file` for automatic cost estimation:
-
-+ Japanese: [model_file in ipadic](https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7bnc5aFZSTE9qNnM)
-+ Korean: `model.bin` in [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic)
+RcppMeCab provides the `dict_index()` function to compile user dictionaries directly from R, without needing the `mecab-dict-index` command-line tool.
 
 Prepare your entries as a CSV file ([Japanese format](http://taku910.github.io/mecab/dic.html), [Korean format](https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/final/user-dic/README.md)), then compile:
 
-```sh
-/usr/local/libexec/mecab/mecab-dict-index \
-  -m /path/to/model.bin \
-  -d /path/to/mecab-dic \
-  -u userdic.dic \
-  -f utf8 -t utf8 \
-  entries.csv
+```r
+dict_index(
+  dic_csv = "entries.csv",
+  out_dic = "userdic.dic",
+  dic_dir = "/path/to/mecab-dic"
+)
+
+# Then use the compiled dictionary:
+pos("some text", user_dic = "userdic.dic")
 ```
 
-On Windows, use `mecab-dict-index.exe` bundled with [mecab-ko-msvc](https://github.com/Pusnow/mecab-ko-msvc) or the MeCab binary installer.
-
 ## Author
 
 Junhewk Kim (junhewk.kim@gmail.com)
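
Reviewer note: the backend table added to the README can be read as a lookup keyed on `MECAB_LANG`, with `ko` as the fallback when the variable is unset. A minimal sketch of that lookup follows. `select_mecab_backend` is a hypothetical name that the package does not define, and rejecting unknown values is an assumption, since the commit does not show how the build scripts treat them.

```sh
# Hypothetical helper, not part of the commit: resolve MECAB_LANG to the
# backend and version listed in the README table. Unset or empty means "ko".
select_mecab_backend() {
  lang="${MECAB_LANG:-ko}"
  case "$lang" in
    ko) echo "mecab-ko-msvc 0.999" ;;
    ja) echo "mecab 0.996" ;;
    *)  # Assumption: anything else is treated as an error.
        echo "unknown MECAB_LANG: $lang" >&2
        return 1 ;;
  esac
}

# The README says a system-wide install, detected via mecab-config on
# Linux and macOS, takes precedence over MECAB_LANG.
has_system_mecab() {
  command -v mecab-config >/dev/null 2>&1
}
```

A build script following the README would check `has_system_mecab` first and consult `select_mecab_backend` only when no system installation is found.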
