suika.fam.cx:/www/2007/2ch-entities/

Character entity reference-like strings in 2ch threads
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This directory contains the results of a quick survey of the
usage of character entity reference-like strings
(/&[0-9A-Za-z]+;?/) in a *biased* collection of 28,085 dat files,
as of August 2007, containing 11,332,194 res,
from 2ch and similar BBS Web sites.
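
As a rough illustration of what the pattern above picks up, here is
a minimal Python sketch.  The function name find_entity_like and the
sample string are made up for this example; the survey itself was
presumably run with Perl, so this is not the original tooling.

```python
import re

# The survey's pattern: "&", one or more ASCII alphanumerics, optional ";".
ENTITY_LIKE = re.compile(r"&[0-9A-Za-z]+;?")

def find_entity_like(text):
    """Return every string in text that matches the pattern, in order."""
    return ENTITY_LIKE.findall(text)

# Hypothetical sample: two "&gt;", a URL query string, a reference
# without ";", and a numeric character reference.
sample = "&gt;&gt;1 see a.cgi?x=1&y=2 &amp &#169;"
print(find_entity_like(sample))
```

Note that the pattern also matches strings that are not meant as
entity references (such as "&y" in the query string), and that it
does not match numeric character references such as "&#169;",
because "#" is not in the character class.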

* Files

result-all.txt
  List of entities sorted by occurrence.
result-res.txt
  List of entities sorted by occurrence, counting multiple
  occurrences of an entity within a single res as one.
result-dat.txt
  List of entities sorted by occurrence, counting multiple
  occurrences of an entity within a single thread as one.
all-result.txt
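  Source data for the result files above, in Perl Data::Dumper output
  format.  It is a Perl array reference representing:
    [number_of_res, {entity => occurrence_in_res_number},
     number_of_threads, {entity => occurrence_in_thread_number},
     {entity => occurrence}].
  The second and fourth elements count the res and the threads in
  which each entity occurs; the fifth counts every occurrence.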
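
The three ways of counting described above can be sketched in Python
as follows.  This is only an illustration under assumptions: the
input shape (a list of threads, each a list of res strings), the
function name tally, and the choice to use the matched string
verbatim as the key (so "&amp" and "&amp;" would be counted
separately) are not taken from the original survey scripts.

```python
import re
from collections import Counter

ENTITY_LIKE = re.compile(r"&[0-9A-Za-z]+;?")

def tally(threads):
    """Count matches per res, per thread, and in total.

    threads: iterable of threads, each a list of res strings
    (a hypothetical input shape, not the real dat file format).
    Returns a list shaped like the array in all-result.txt.
    """
    per_res = Counter()     # number of res containing each entity
    per_thread = Counter()  # number of threads containing each entity
    total = Counter()       # every single occurrence
    n_res = 0
    n_threads = 0
    for thread in threads:
        n_threads += 1
        seen_in_thread = set()
        for res in thread:
            n_res += 1
            found = ENTITY_LIKE.findall(res)
            total.update(found)
            per_res.update(set(found))  # repeats within a res count once
            seen_in_thread.update(found)
        per_thread.update(seen_in_thread)  # repeats within a thread count once
    return [n_res, dict(per_res), n_threads, dict(per_thread), dict(total)]

# Two made-up threads with three res in total.
example = [["&gt;&gt;1 &amp;", "&gt;&gt;2"], ["&amp; &amp;"]]
print(tally(example))
```

In this example "&gt;" occurs four times in total, in two res, but
in only one thread, which is the kind of difference the three result
files expose.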

* Glossary

Dat file
  A file representing a thread, which consists of a number of "res".
  Formatted HTML documents provided for Web browsers are generated
  from dat files.  Dat files might contain some HTML markup including
  character entity references.
Thread
  A sequential collection of messages on 2ch or a similar BBS,
  discussing a single topic.  A thread is part of a board.
Res
  A message posted by a user to 2ch or a similar BBS.  A res belongs
  to a thread.