Cross-domain technology application applies technology from one field to another, creating a wide range of application opportunities. Successfully identifying emerging cross-section technology applications in patent documents is vital to the competitive advantage of companies, and even nations. An automatic process is needed to save the scarce time of human experts and to exploit the huge number of available patent documents. In this study, an identification algorithm based on a cross-collection mixture model was developed to identify cross-section and emerging technology from patents written in Chinese, which are the source data of our experiment. To verify the algorithm’s effectiveness, documents in three transmission-related technology subclasses and one application technology subclass were collected from WEBPAT Taiwan. The former consist of H04B (transmission), H04L (transmission of digital information), and H04N (image communication); the latter is G06Q (patents for administration, management, commerce, operation, supervision, or prediction using data processing systems or methods). Growth rate detection has been the most popular approach to forecasting emerging technologies; this study defines the growth rate as the difference between the numbers of technology-containing documents published in different time periods. The emerging technology identified using the proposed method exhibited an average growth rate of 95.08%. By comparison, two benchmark methods identified emerging technology with average growth rates of 9.57% and 51.49%.
Cross-domain technology application uses technology from one field in another, with impacts on human lives and commercial undertakings. Examples include the global positioning system (GPS), radio frequency identification (RFID), light-emitting diode (LED), and financial technology (fintech). The GPS project was launched by the U.S. Department of Defense in 1973 for military use, and became fully operational in 1995 [
Patent data banks are valuable resources for technological research and documentation. They record more than 80% of all developed technology [
Since China’s accession to the World Trade Organization (WTO) in 2001, the number of patents written in Chinese has increased exponentially [
In summary, successfully identifying emerging technological applications across classes of patent documents is crucial to the competitive advantage of corporations, and even countries. An automatic process that handles the language issue is critical to reduce the need for human experts and to take advantage of the large number of available patent documents. However, automating this process faces several obstacles:
No automatic methodology has been proposed to identify IPC cross-class technology applications.
The methodology developed must be able to automatically identify emerging applications of technologies.
A novel method is necessary to accommodate the clumsy results of current Chinese word segmentation technologies.
To resolve these issues, this study proposes a methodology based on common and specific theme analysis of patent documents. To avoid litigation, patent documents tend to use different words to describe similar technologies. A keyword-based approach is therefore inadequate. The cross-collection mixture model (CCMM) has been developed to identify common and specific themes [
To verify the effectiveness of the developed method, three subclasses belonging to class H04 were treated as source classes. These were H04B (transmission), H04L (transmission of digital information), and H04N (image communication). Subclass G06Q, belonging to another section, was selected as the application class. G06Q collects patents for administration, management, commerce, operation, supervision, and prediction using data processing systems or methods. The patents were collected from WEBPAT Taiwan [
A technology application map was developed to visualize the identified source and application technologies. In this study, three application technologies were identified as emerging technologies. Growth rate detection has been the most popular approach to forecasting emerging technologies; this study defines the growth rate as the difference between the numbers of technology-containing documents published in different time periods. The average growth rate of these technologies was 95.08%, whereas those of the technologies identified using two benchmark methods [
The remainder of this article is organized as follows. Section 2 introduces the IPC and reviews research on the identification of emerging technological terminologies, the identification of technological terms in Chinese patents, and cross-collection mixture (theme) models. The proposed model, research design, and methods are detailed in Section 3. The data acquisition process, experiments, visualization of cross-class technology, and identified emerging technologies are discussed in Section 4. Section 5 describes the contributions and limitations of this study and offers recommendations for future research.
The world’s most widely used patent classification system, the IPC was established in 1954 with the condition that it be updated every five years [
Section | Classification | Number of main classes |
---|---|---|
A | Human living necessities | 21 |
B | Operation, transportation | 37 |
C | Chemistry, metallurgy, combination technology | 21 |
D | Textile, paper manufacturing | 9 |
E | Fixed building | 8 |
F | Mechanical engineering, illuminating, heat supply, weapons, blasting | 18 |
G | Physics | 14 |
H | Electrical science | 6 |
An emerging technology is one that shows high potential but whose value has not yet been demonstrated or agreed upon by a community of users [
Most studies turn to text mining to identify the terminologies representing or symbolizing emerging technologies. This involves identifying n-grams (contiguous sequences of n items from a given sequence of text or speech) [
This study uses momentum and frequency to identify terminologies from the set of words included in specific themes, as explained below. Note that words in specific themes must be processed further because Chinese word segmentation systems often split phrases inappropriately.
Research has shown that even among native Chinese speakers, only approximately 75% agreement can be achieved with regard to correct segmentation, and the percentage of agreement decreases as the number of people involved increases [
CCMM [
Emerging technology application involves transferring popular technology from a source field to an application field to create new applications; therefore, the technology terminology identified in the source field should be associated with common themes, and that in the application field with specific themes related to a particular temporal interval to reflect the freshness of the emerging technology.
In the proposed method, patents are collected from documents in IPC sections. At least one section should provide the source technology, and one should provide the application technology. The collected patent documents should have been published over at least two consecutive years. If the documents are written in Chinese, then a Chinese word segmentation system such as CKIP [
A theme is a concept derived from a collection of documents and represented by a set of words [
Words highly associated with a common theme are considered popular technologies in the IPC-classified section. Cross section analysis entails the identification of technology that is popular in its own (source) section and emerging in another section. A specific theme within a common theme represents a subconcept derived in a particular year from patent documents published in that year. Terminology for specific themes is considered to denote popular technology in a specific year. Such terminology potentially represents emerging technology in that year.
We define the following based on Zhai’s definitions of themes [
A theme is a concept shared by a collection of documents. More than one document can share a theme, and a document can address several themes.
A set of document collections can address a set of common themes, denoted by
A background theme is a special theme
Given a collection of documents published at time
In a theme model, the main purpose of the background theme is to collect and remove words that appear too often as representative words. Words collected under the common theme are those with high probability in the document collections for the entire time frame. A specific theme collects only words with high probability in a certain time period, whose probabilities represent their intensities in a collection.
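The roles of the three kinds of themes can be illustrated with a small sketch. It assumes the general form of a cross-collection mixture, in which a word is drawn from the background theme with weight λB and otherwise from a blend of a common theme and its time-specific counterpart weighted by λS. The exact formula used in this study is not reproduced here, so the function `word_probability`, its parameter names, and the toy distributions are all illustrative assumptions.

```python
from typing import Dict, List


def word_probability(
    word: str,
    background: Dict[str, float],
    common: List[Dict[str, float]],
    specific: List[Dict[str, float]],
    theme_weights: List[float],
    lambda_b: float,
    lambda_s: float,
) -> float:
    """Probability of `word` under an assumed background/common/specific mixture.

    Assumed form (not taken from the paper):
      p(w) = lambda_b * p(w | background)
           + (1 - lambda_b) * sum_j pi_j * [(1 - lambda_s) * p(w | common_j)
                                            + lambda_s * p(w | specific_j)]
    """
    themed = 0.0
    for pi_j, theta_j, theta_jt in zip(theme_weights, common, specific):
        themed += pi_j * (
            (1.0 - lambda_s) * theta_j.get(word, 0.0)
            + lambda_s * theta_jt.get(word, 0.0)
        )
    return lambda_b * background.get(word, 0.0) + (1.0 - lambda_b) * themed


# Toy distributions over a three-word vocabulary (hypothetical values).
vocab = ["the", "signal", "rfid"]
background = {"the": 0.8, "signal": 0.1, "rfid": 0.1}
common = [{"the": 0.1, "signal": 0.7, "rfid": 0.2}]    # one common theme
specific = [{"the": 0.1, "signal": 0.2, "rfid": 0.7}]  # its specific theme for one year
probs = {
    w: word_probability(w, background, common, specific, [1.0], 0.95, 0.5)
    for w in vocab
}
```

In this toy setting the frequent word "the" receives most of its probability mass from the background theme, which is the filtering effect described above.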
A document uses a sequence of words from a vocabulary set to describe concepts. Therefore, each document should include several themes. Each theme is associated with a set of words annotated with an intensity probability. Based on Zhai et al.’s research [
Given a document
Given a theme
A model of words can be derived based on the distribution models of themes.
Given a document
where
The values of
Given
We discuss the tuning of parameters to maximize data distribution likelihood. Three parameters are fixed before the estimation:
An expectation maximization algorithm [
Two hidden variables,
We obtained the representative words of common and specific themes defined in Definition 4 to identify popular and potential technology in a given source and application section, respectively.
Given a common theme
Given a specific theme
where
This study proposes to identify cross section technology terminology using representative words of common and specific themes according to the following observations.
Because a specific theme represents a subconcept that is popular during a particular time period, representative words that have high distribution values in a specific theme are candidates for emerging technology terminology.
Numerous cross section technology developments integrate popular technology in one section with technology in another section to improve products or services.
An n-gram method is applied to divide Chinese representative words into terms. Terms that appear in sufficient numbers of representative words are considered to denote popular technology. Popular technology terms are then used to identify cross section technology terminology.
Assuming that a word is a sequence of characters, a set of n-gram terms is defined as follows.
A term
Given a set of words
Given an n-gram term
Given a common theme
Cross section technology terminology can be identified using the top (most popular) terms. Representative words of specific application section themes that include popular terms are identified as cross section technology terminology.
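The matching step described above can be sketched as a simple substring test. This is a minimal illustration, assuming that "includes" means the popular term occurs as a contiguous substring of the representative word; the function name `cross_section_terms` and the toy input are hypothetical.

```python
from typing import Dict, List


def cross_section_terms(
    popular_terms: List[str],
    specific_words_by_year: Dict[int, List[str]],
) -> Dict[str, Dict[str, List[int]]]:
    """Map each popular source term to the application-section words containing it.

    The result records, for every matching word, the years of the specific
    themes in which the word appears.
    """
    result: Dict[str, Dict[str, List[int]]] = {}
    for term in popular_terms:
        for year, words in sorted(specific_words_by_year.items()):
            for word in set(words):  # ignore duplicates within one year
                if term in word:
                    result.setdefault(term, {}).setdefault(word, []).append(year)
    return result


# Toy input: popular source terms and representative words of the
# application section's specific themes, keyed by year (illustrative only).
popular = ["無線", "數位"]
g06q_specific = {
    2007: ["無線標籤", "數位資料庫", "交易"],
    2008: ["無線標籤", "無線識別卡", "會員"],
}
found = cross_section_terms(popular, g06q_specific)
```

Words such as 交易 and 會員 contain no popular source term and are therefore not reported.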
Given a specific theme
Given a cross section technology terminology structure
Cross section technology terminology that appears in years beyond the specified number of consecutive years is considered emerging. Cross section emerging technology application is defined in Definition 7, and
Given a momentum threshold
Transmission and communication technologies critically affect daily life. Cell phones and other means of wireless communication have given rise to a generation with entirely new information technology consumption behaviors. Therefore, for this study, we chose the subclasses of transmission (H04B), transmission of digital information (H04L), and image communication (H04N) as source classes. To win or maintain competitive advantages, corporations have raced to adapt these technologies and create novel applications. Subclass G06Q includes patents for administration, management, commerce, operation, supervision, and predictions based on data processing systems or methods, and was therefore designated as the application class where the application of cross-class technology should be found.
In total, 1,562 abstracts of patent documents belonging to the four subclasses and published between 2006 and 2011 were collected. Among them, 378, 253, 324, 275, 265, and 67 were published in 2006, 2007, 2008, 2009, 2010, and 2011, respectively. The number of cases published in 2011 is low because only a portion of that year’s patents had been submitted when data were being collected for the IPC. Subclasses H04B, H04L, H04N, and G06Q had 339, 448, 396, and 379 cases, respectively, from WEBPAT Taiwan [
In CCMM, the number of common themes, K, equals the number of patent classes investigated, and the number of specific themes, M, corresponds to the number of years from which the patents were collected. With four subclasses and six years of data, K = 4 and M = 6.
The theoretical value of
where
The initial value of the distribution model of each word (
The initial value of the distribution model of each word (
λS and λB, respectively, sort words into specific and background themes. Studies have set λB as 0.95 [
H04B | p(w/θ1 ) | H04L | p(w/θ2 ) | H04N | p(w/θ3 ) | G06Q | p(w/θ4 ) |
---|---|---|---|---|---|---|---|
signal(信號) | 0.0642848 | network(網路) | 0.06555026 | image(影像) | 0.02552709 | transaction(交易) | 0.0223248 |
signal(訊號) | 0.0350485 | packet(封包) | 0.03553849 | shot(鏡頭) | 0.02288817 | commodity(商品) | 0.0151396 |
antenna(天線) | 0.0198204 | message(訊息) | 0.02628243 | pixel(像素) | 0.02180988 | member(會員) | 0.0142142 |
frequency(頻率) | 0.018826 | server(伺服器) | 0.02122616 | fragment(片段) | 0.02123295 | order form(訂單) | 0.0118526 |
power(功率) | 0.0178861 | media(媒體) | 0.019644 | sensor(感測器) | 0.01929434 | capacity(產能) | 0.0090262 |
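
A ranking like the one in the table above can be obtained by sorting the words of a theme by their estimated probability and keeping the top entries. The sketch below assumes the theme is available as a word-to-probability mapping; the function name `representative_words` is hypothetical, and the input reuses the H04N values from the table in shuffled order.

```python
from typing import Dict, List, Tuple


def representative_words(theme: Dict[str, float], n: int) -> List[Tuple[str, float]]:
    """Return the n words with the highest probability p(w | theme)."""
    return sorted(theme.items(), key=lambda item: item[1], reverse=True)[:n]


# H04N values copied from the table above, deliberately out of order.
h04n = {
    "pixel(像素)": 0.02180988,
    "image(影像)": 0.02552709,
    "sensor(感測器)": 0.01929434,
    "shot(鏡頭)": 0.02288817,
    "fragment(片段)": 0.02123295,
}
top3 = representative_words(h04n, 3)
```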
We used the representative words of common themes to identify cross section technology terminology. Therefore, before identifying the terminology, the quality of the identified word distribution among the common themes was evaluated. Paradimitriou et al. [
Subclass | H04B | H04L | H04N | G06Q |
---|---|---|---|---|
H04B | 0.0000% | 0.1584% | 0.0204% | 0.0002% |
H04L | 0.0274% | 0.0000% | 0.0047% | 0.0112% |
H04N | 0.0022% | 0.0007% | 0.0000% | 0.0002% |
G06Q | 0.0003% | 0.0060% | 0.0002% | 0.0000% |
The representative words in themes indicate significant terms in the corresponding section. In this experiment,
The n-gram approach was adopted to divide representative words into shorter Chinese words. Most Chinese words are one to four characters long [
Each term was counted to filter out meaningless or unpopular n-gram terms. Each term was associated with a count of the representative words that include it. We used a top-h methodology to select the h terms with the highest counts.
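The n-gram division and top-h selection can be sketched as follows. The sketch assumes character-level n-grams of length two to four and breaks ties alphabetically; both choices, along with the function names `char_ngrams` and `top_h_terms` and the toy word list, are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def char_ngrams(word: str, min_n: int = 2, max_n: int = 4) -> Set[str]:
    """All contiguous character n-grams of `word` with min_n <= n <= max_n."""
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }


def top_h_terms(
    words: Iterable[str], h: int, min_n: int = 2, max_n: int = 4
) -> List[Tuple[str, int]]:
    """Return the h n-gram terms contained in the most representative words.

    The count (support) of a term is the number of distinct words including it.
    """
    support: Counter = Counter()
    for word in set(words):
        # Each word contributes at most 1 to the support of any term.
        support.update(char_ngrams(word, min_n, max_n))
    # Sort by descending support, then by term for a deterministic order.
    return sorted(support.items(), key=lambda kv: (-kv[1], kv[0]))[:h]


# Toy representative words (illustrative only).
words = ["無線網路", "無線標籤", "無線射頻", "數位訊號"]
top1 = top_h_terms(words, 1)
```

Here 無線 is contained in three of the four words, so it is selected as the single most popular term.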
After the top popular terms were identified (
h (Top) | H04B | Support | H04L | Support | H04N | Support |
---|---|---|---|---|---|---|
1 | Receiving(接收) | 11 | Wireless(無線) | 22 | Pixels(像素) | 12 |
2 | Wireless(無線) | 10 | Data(資料) | 18 | Image(影像) | 11 |
3 | Digit(數位) | 10 | Network(網路) | 16 | Digit(數位) | 8 |
Popular terms | Cross section technology terminology | Year |
---|---|---|
Wireless(無線) | wireless ID card(無線識別卡) | 2008 |
Wireless(無線) | wireless tag(無線標籤) | 2007,2008 |
Wireless(無線) | RFID(無線射頻) | 2007,2008,2009,2010 |
Digit(數位) | digital right(數位權利) | 2006,2007,2008 |
Digit(數位) | digital database(數位資料庫) | 2007 |
Image(影像) | Image line(影像線) | 2008 |
Network(網路) | network version(網路版型) | 2008 |
Network(網路) | primary network(主要網路) | 2011 |
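
One way to turn the year lists in the table above into an emerging-technology filter is to measure how many consecutive years a terminology appears. The sketch below assumes that momentum is the length of the longest run of consecutive years and that a term qualifies when this length reaches a threshold; the definition used in this study is not reproduced here, and the threshold value of 2 is a hypothetical choice.

```python
from typing import Dict, List


def longest_consecutive_run(years: List[int]) -> int:
    """Length of the longest run of consecutive years in `years`."""
    best = 0
    run = 0
    previous = None
    for year in sorted(set(years)):
        run = run + 1 if previous is not None and year == previous + 1 else 1
        best = max(best, run)
        previous = year
    return best


def emerging_terms(term_years: Dict[str, List[int]], threshold: int) -> List[str]:
    """Terms whose longest consecutive-year run is at least `threshold`."""
    return [
        term
        for term, years in term_years.items()
        if longest_consecutive_run(years) >= threshold
    ]


# Years copied from the table above.
term_years = {
    "wireless ID card": [2008],
    "wireless tag": [2007, 2008],
    "RFID": [2007, 2008, 2009, 2010],
    "digital right": [2006, 2007, 2008],
    "digital database": [2007],
    "image line": [2008],
    "network version": [2008],
    "primary network": [2011],
}
emerging = emerging_terms(term_years, threshold=2)  # threshold is an assumption
```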
The map shows the development of cross section technology through the following features.
The source and application sections are identified at the top of the map.
The representative terms are shown on the left side of the map.
In the main part of the figure, the evolution of the cross section technology terminology is shown with the years marked at the top.
Method | Average growth rate |
---|---|
Our method | 95.08% |
3-gram method | 9.57% |
tf-idf method | 51.49% |
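
An average growth rate of the kind compared in the table above can be computed as sketched below. Because the results are reported as percentages, the sketch assumes a relative change in the number of technology-containing documents between an earlier and a later period; this formula, the function names, and the document counts are assumptions for illustration, not the study's data.

```python
from typing import Dict, Tuple


def growth_rate(earlier: int, later: int) -> float:
    """Relative change in document counts between two periods (assumed form)."""
    if earlier <= 0:
        raise ValueError("earlier count must be positive")
    return (later - earlier) / earlier


def average_growth_rate(counts: Dict[str, Tuple[int, int]]) -> float:
    """Mean growth rate over all terms; `counts` maps term -> (earlier, later)."""
    rates = [growth_rate(earlier, later) for earlier, later in counts.values()]
    return sum(rates) / len(rates)


# Hypothetical document counts for two terms in two periods.
counts = {"term_a": (10, 20), "term_b": (8, 12)}
avg = average_growth_rate(counts)  # rates 1.0 and 0.5 average to 0.75, i.e. 75%
```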
Two other methods have also been proposed to identify emerging terminologies. Corrocher et al. [
In this study, the growth rate of a term
This study developed a methodology to systematically identify emerging cross-section technological applications between source sections and an application section. Concepts (themes) and representative words in each section were captured using methods adapted from CCMM. In addition to common themes, CCMM captures the specific themes developed in each year. Representative words in the common themes of the source sections denote technologies that had been developed in those sections and could be adopted in the application section. The representative words of specific themes denote technologies being developed in the application section. The representative words from the source sections were segmented into terms and compared against the representative words in the application section to identify cross-section technological terminologies. Those with high momentum values were further identified as emerging technologies. The proposed method also generated a technology map to illustrate the adoption of technological terminology in the application section.
To verify the effectiveness of the developed method, four subclasses of patent documents (IPC codes H04B, H04L, H04N, and G06Q) were collected from WEBPAT Taiwan [
This study was exploratory and had several limitations. First, the sections covered were limited; future research should collect patent documents from more varied sections to verify the method’s applicability. Second, CCMM is a traditional model, and more refined topic models could be applied to identify word and document distributions. Third, although the proposed model was designed for documents written in Chinese, it need not be restricted to a specific language.