Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.

Design/methodology/approach: The paper centers on cleaning datasets gathered from publishers and online resources through searches on specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was tested against manual coding in order to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) on two measures: accuracy and recall. We assessed not only each algorithm individually but also their combinations through a voting scheme, and we tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.

Findings: We found that the performance of the algorithms varies with the size of the training sample. For the classification exercise in this paper, however, the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well even with a small training dataset (10%), which may reduce the manual work needed for classification tasks.

Research limitations: The dataset gathered has significantly more records related to the topic of interest than unrelated records. This may affect the performance of some algorithms, especially their identification of unrelated papers.

Practical implications: Although the classification achieved by this means is not complete …
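As an illustrative aside, the two-algorithm voting scheme and the coverage indicator described above can be sketched in a few lines. The snippet below is a minimal sketch, assuming scikit-learn estimators (SVC standing in for SVM, GradientBoostingClassifier for Boosting) and a hypothetical load_corpus() loader; it does not reproduce the paper's actual toolchain or data.

```python
# Minimal sketch of a two-classifier voting scheme with a coverage indicator.
# Assumptions (not from the paper): scikit-learn estimators stand in for the
# SVM and Boosting implementations, and load_corpus() is a hypothetical loader.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, recall_score

texts, labels = load_corpus()   # hypothetical: abstracts + 0/1 relevance codes
labels = np.asarray(labels)
X = TfidfVectorizer(max_features=5000).fit_transform(texts)

# Mirror the paper's setting: train on a small (10%) manually coded sample.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, train_size=0.10, stratify=labels, random_state=0)

svm = SVC(kernel="linear").fit(X_train, y_train)
boost = GradientBoostingClassifier().fit(X_train, y_train)

pred_svm = svm.predict(X_test)
pred_boost = boost.predict(X_test)

# Coverage: share of records on which the two classifiers agree; accuracy and
# recall are then computed on that agreed-upon subset only.
agree = pred_svm == pred_boost
coverage = agree.mean()
accuracy = accuracy_score(y_test[agree], pred_svm[agree])
recall = recall_score(y_test[agree], pred_svm[agree])
print(f"coverage={coverage:.2f} accuracy={accuracy:.2f} recall={recall:.2f}")
```

Records on which the two classifiers disagree fall outside the coverage and would be routed to manual coding, which is the sense in which such a scheme reduces, rather than eliminates, manual work.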
Purpose: This paper seeks to understand the dynamics of scientific communication systems during crises by investigating, as a case study, the blogging activity that took place during the 2011 earthquake and related events in Japan. Interactions between bloggers and registered users are studied quantitatively and qualitatively at Sciencenet.cn, an influential science-related blogosphere in China.

Design/methodology/approach: The editors of Sciencenet.cn compiled a special issue of science blog articles under the title "Analysis of the Japanese Earthquake". We developed a spider program and downloaded from this special issue the metadata on title, content, publishing time, total read count, reply count, and recommendation count, and we further collected information about bloggers and recommenders. We then sent a short message to the bloggers who wrote articles on these emergencies, asking about their educational and professional background.

Findings: We found that the knowledge reflected in the blog articles is strongly related to the educational and professional background of the bloggers. Knowledge diffusion is facilitated by interactions such as recommendations, comments, and answers. Interactions via comments and recommendations are assortative in nature: a blog article is more likely to be commented on and recommended by bloggers who write on the same or a similar topic than by those writing on a different one. Registered users tend to comment on articles dealing with the topics they recommend, and vice versa. Interaction at the intersection of two or three topics is more intense than interaction within a single topic. The impact of blog articles is also influenced by other factors, such as the reputation of the blogger and the type of information the articles contain.

Implications and limitations: The study confirms that analyzing blogs is a valid approach within informetric studies. Yet we only studied one triple (earthquake, tsunami, nuclear disaster) event based on data originating from one Chinese blog website. More …
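As a hedged illustration of the kind of spider program mentioned above, the sketch below collects the listed metadata fields from an index page. The URL and all CSS selectors are hypothetical placeholders, not Sciencenet.cn's actual markup, and the paper's own spider is not reproduced here.

```python
# Minimal sketch of a metadata spider for a special-issue index page.
# Assumptions (not from the paper): the index is a plain HTML page, and the
# URL and CSS selectors below are illustrative placeholders.
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://blog.sciencenet.cn/special/earthquake"  # hypothetical URL

def fetch_articles(index_url):
    """Collect per-article metadata: title, publishing time, and counts."""
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("div.article"):  # placeholder selector
        records.append({
            "title": item.select_one("a.title").get_text(strip=True),
            "published": item.select_one("span.time").get_text(strip=True),
            "reads": int(item.select_one("span.reads").get_text(strip=True)),
            "replies": int(item.select_one("span.replies").get_text(strip=True)),
            "recommends": int(item.select_one("span.recs").get_text(strip=True)),
        })
    return records

if __name__ == "__main__":
    for rec in fetch_articles(INDEX_URL):
        print(rec["title"], rec["reads"], rec["replies"], rec["recommends"])
```

In a study like the one described, the resulting records would then be joined with blogger and recommender information to analyze the assortativity of comments and recommendations across topics.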