org.aminds.lucene.analysis
Class CJKSubReader
java.lang.Object
java.io.Reader
org.apache.lucene.analysis.CharStream
org.apache.lucene.analysis.CharFilter
org.aminds.lucene.analysis.SubReader
org.aminds.lucene.analysis.CodePointBasedSubReader
org.aminds.lucene.analysis.CJKSubReader
- All Implemented Interfaces:
- Closeable, Readable, ReusableCharFilter
public class CJKSubReader
- extends CodePointBasedSubReader
A SubReader that ignores whitespace between CJK characters. This lets Japanese
text that spans multiple lines or pages be tokenized cleanly.
- Author:
- Masashi Nakanishi
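The whitespace handling described above can be illustrated with a small standalone sketch. It is not the CJKSubReader implementation: it works on a plain String rather than as a CharFilter, and the names CjkWhitespaceSketch, dropWhitespaceBetweenCjk, and simpleCjk are hypothetical. The predicate simpleCjk is a simplified stand-in (an assumption) that covers only Hiragana, Katakana, and CJK Unified Ideographs.

```java
import java.util.function.IntPredicate;

final class CjkWhitespaceSketch {

    /** Removes every whitespace run whose neighbours on both sides satisfy isCjk. */
    static String dropWhitespaceBetweenCjk(String text, IntPredicate isCjk) {
        StringBuilder out = new StringBuilder(text.length());
        int prev = -1; // code point just before the current position, -1 at start
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (!Character.isWhitespace(cp)) {
                out.appendCodePoint(cp);
                prev = cp;
                i += Character.charCount(cp);
                continue;
            }
            // Locate the end of this whitespace run.
            int j = i;
            while (j < text.length() && Character.isWhitespace(text.codePointAt(j))) {
                j += Character.charCount(text.codePointAt(j));
            }
            boolean cjkBefore = prev != -1 && isCjk.test(prev);
            boolean cjkAfter = j < text.length() && isCjk.test(text.codePointAt(j));
            if (!(cjkBefore && cjkAfter)) {
                out.append(text, i, j); // keep the run unchanged
            }
            i = j;
        }
        return out.toString();
    }

    /** Simplified stand-in predicate (assumption, not the isCJK of this class). */
    static boolean simpleCjk(int cp) {
        return (cp >= 0x3040 && cp <= 0x30FF)   // Hiragana, Katakana
            || (cp >= 0x4E00 && cp <= 0x9FFF);  // CJK Unified Ideographs
    }
}
```

With this sketch, a line break inside a Japanese word is dropped while the space before a Latin word is kept, so "日本", a newline, and "語 text" become "日本語 text".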
Method Summary
protected boolean accept(int codePoint)
static boolean isCJK(int codePoint)
Reports whether a code point belongs to the CJK ranges listed under isCJK below.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
CJKSubReader
public CJKSubReader()
accept
protected boolean accept(int codePoint)
- Specified by:
- accept in class CodePointBasedSubReader
isCJK
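public static boolean isCJK(int codePoint)
Reports whether the code point falls in one of the Unicode blocks below. Blocks marked "* not included" are listed for reference only and are not treated as CJK.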
- U+02E80-02EFF : CJK Radicals Supplement * not included
- U+02F00-02FDF : Kangxi Radicals * not included
- U+02FF0-02FFF : Ideographic Description Characters * not included
- U+03000-0303F : CJK Symbols and Punctuation * not included
- U+03040-0309F : Hiragana
- U+030A0-030FF : Katakana
- U+03100-0312F : Bopomofo
- U+031F0-031FF : Katakana Phonetic Extensions
- U+03200-032FF : Enclosed CJK Letters and Months
- U+03300-033FF : CJK Compatibility ; extended (㌀-㍿)
- U+03400-04DBF : CJK Unified Ideographs Extension A
- U+04E00-09FFF : CJK Unified Ideographs
- U+0F900-0FAFF : CJK Compatibility Ideographs
- U+0FE30-0FE4F : CJK Compatibility Forms
- U+0FF00-0FF9F : Halfwidth and Fullwidth Forms ; extended (・-゚)
- U+0FFE0-0FFEF : Halfwidth and Fullwidth Forms ; excluding Hangul
- U+20000-2A6DF : CJK Unified Ideographs Extension B
- U+2A700-2B73F : CJK Unified Ideographs Extension C
- U+2F800-2FA1F : CJK Compatibility Ideographs Supplement
In the future, the following may be added:
- U+2B740-U+2B81F : Ext-D: CJK Unified Ideographs Extension D
- U+2B820-U+2CEAF : Ext-E: CJK Unified Ideographs Extension E
This set differs in part from StandardTokenizerImpl.jflex, which also includes Hangul Compatibility Jamo (U+3130-318F).
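As an illustration of the list above, the range test can be sketched as a table lookup. This is a minimal sketch, not the source of isCJK: the class name CjkRangeSketch and the method isCjkSketch are hypothetical, and the table simply copies the listed blocks, leaving out those marked "* not included" and those that may be added in the future.

```java
final class CjkRangeSketch {

    // Inclusive {first, last} code point pairs copied from the list above.
    private static final int[][] RANGES = {
        {0x3040, 0x309F},   // Hiragana
        {0x30A0, 0x30FF},   // Katakana
        {0x3100, 0x312F},   // Bopomofo
        {0x31F0, 0x31FF},   // Katakana Phonetic Extensions
        {0x3200, 0x32FF},   // Enclosed CJK Letters and Months
        {0x3300, 0x33FF},   // CJK Compatibility
        {0x3400, 0x4DBF},   // CJK Unified Ideographs Extension A
        {0x4E00, 0x9FFF},   // CJK Unified Ideographs
        {0xF900, 0xFAFF},   // CJK Compatibility Ideographs
        {0xFE30, 0xFE4F},   // CJK Compatibility Forms
        {0xFF00, 0xFF9F},   // Halfwidth and Fullwidth Forms (up to U+FF9F)
        {0xFFE0, 0xFFEF},   // Halfwidth and Fullwidth Forms (signs)
        {0x20000, 0x2A6DF}, // CJK Unified Ideographs Extension B
        {0x2A700, 0x2B73F}, // CJK Unified Ideographs Extension C
        {0x2F800, 0x2FA1F}, // CJK Compatibility Ideographs Supplement
    };

    /** Returns true if codePoint lies inside any range in the table. */
    static boolean isCjkSketch(int codePoint) {
        for (int[] range : RANGES) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

Under this table, U+3042 (Hiragana) and U+20000 (Extension B) are accepted, while U+3000 (CJK Symbols and Punctuation) and U+3131 (Hangul Compatibility Jamo) are not. Taking an int code point rather than a char keeps supplementary-plane ideographs testable.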
Copyright (c) 2008-2011 Masashi Nakanishi.