Forensic Linguistic Application in Author Identification







I confirm that this dissertation is in its entirety my own work and that it has not already been accepted in substance for any degree and is not concurrently submitted in candidature for any degree. It is the result of my own independent research except where otherwise stated.




This thesis has been submitted with our approval as University supervisors.






There is growth in the fields of computational linguistics, stylistics, and other non-traditional authorship attribution methods aimed at developing a platform to identify text authorship. Studies are ongoing, and milestones have been reached, in fields such as text classification, software forensics, machine learning, and forensic linguistics (David, 1998). In light of developments in technology, these advances can be built into software. Stylometric authorship attribution software can be designed to ride on platforms like those that have been used by developers of plagiarism detection software.

According to Grant and Baker (2001), authorship attribution and author characterization are the basis of stylometry in forensic linguistics. To understand these quite distinct problems, this study interrogates linguistic elements to understand group language use. This is used to create a loose paradigm on which to base the identification of authors by gender. Authorship attribution literature normally uses interpretive, external, and linguistic evidence to establish author categories like gender and age.

The external evidence that can be used includes handwriting or signed manuscripts. Interpretive evidence is based on the study of documents: when they were written, what the intended meaning was, and how they compare to other literary works. Linguistic evidence, which is the topic of this paper, focuses on patterns of words and the diction used in the document. There are several statistical techniques which can be applied successfully for purposes of author identification.

Stylometric analysis is important for social scientists, analysts and marketers since it provides raw and direct demographic data. Such studies therefore become important for the purposes of identifying and authenticating authorship, and their input is the basis of the data used herein to inform the creation of a software tool that easily identifies an author by gender (Grant, 2008b).

Forensic stylistics is a sub-field of forensic linguistics that deals with author identification by application of stylistics. Stylistic analysis is based on these premises:

  • There are no two writers who write in the same pattern or style

  • The writers themselves do not write in the same style/pattern all the time.

The paper then categorises stylistic analysis into two different approaches: qualitative and quantitative.

The qualitative approach dwells on an assessment of the author's errors and personal behaviour, while the quantitative approach mainly focuses on language features that are readily computable and countable, like the length of words used, the length of the sentences, the length of the phrases, differences in the distribution of words of different lengths, and frequency of vocabulary.

Since men and women technically speak the same language, it is not easy to identify how they use language differently. Indeed, studies have tried to ascertain the relationship between gender and language use. Author gender identification is a binary classification problem. This means that there are two distinct classes: male and female.

Computationally supported authorship attribution is a statistical exercise, based on measurements of certain textual features. Automatic determination of the gender of the author would appear more subtle than categorization based on author attribution or topic. Nevertheless, text categorization techniques exploit combinations of simple lexical and syntactic features to infer author gender in an unseen formal written document at an approximate accuracy of 80% (Grant, 2008a).
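As an illustration only, the binary classification described above can be sketched with a small multinomial Naive Bayes classifier over token counts. This is not the tool developed in this study: the function names, the abstract feature tokens (`f1` to `f5`) and the four training documents are invented placeholders, and no claim is made that such a toy reaches the accuracy reported in the literature.

```python
import math
from collections import Counter

# Invented toy data: each document is a list of abstract feature tokens
# (stand-ins for lexical or syntactic features) with a class label.
TOY_TRAINING = [
    (["f1", "f1", "f2"], "female"),
    (["f1", "f3"], "female"),
    (["f4", "f4", "f2"], "male"),
    (["f4", "f5"], "male"),
]


def train_naive_bayes(labelled_docs):
    """Count documents per class and tokens per class."""
    class_counts = Counter(label for _, label in labelled_docs)
    token_counts = {label: Counter() for label in class_counts}
    for tokens, label in labelled_docs:
        token_counts[label].update(tokens)
    vocabulary = {t for counts in token_counts.values() for t in counts}
    return {
        "class_counts": class_counts,
        "token_counts": token_counts,
        "vocabulary": vocabulary,
        "n_docs": len(labelled_docs),
    }


def predict(model, tokens):
    """Return the class with the highest log posterior for the tokens."""
    vocab_size = len(model["vocabulary"])
    scores = {}
    for label, n_label in model["class_counts"].items():
        counts = model["token_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / model["n_docs"])  # log prior
        for token in tokens:
            if token in model["vocabulary"]:  # skip tokens never seen in training
                # Laplace (add-one) smoothing avoids log(0) for tokens
                # that were seen only in the other class.
                score += math.log((counts[token] + 1) / (total + vocab_size))
        scores[label] = score
    return max(scores, key=scores.get)
```

With the toy data, `predict(train_naive_bayes(TOY_TRAINING), ["f1", "f1"])` selects the class whose training documents contain `f1`. A real system would replace the placeholder tokens with measured lexical and syntactic features and would be evaluated on held-out documents.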

This research will interrogate the possibility of creating an automatic tool for classification of formal written texts anchored on the gender of the author. Such automatic categorization deals with a more complex problem than typical text categorization, which mainly focuses on topic, or stylometric problems, which focus on authorship attribution. However, ideas from stylometric studies and text categorization will be integral to achieving a credible result. The ideal software will make use of the substantial literature that isolates the distinguishing characteristics of male and female linguistic styles.


This research is a culmination of support, guidance and intellectual interaction with various individuals and institutions. Special thanks to everyone who helped in the research and eventual write-up of this paper. Special thanks to my supervisor for his selfless guidance and undeterred pursuit of my excellence. To the entire institution of (University) for continuously supporting this endeavor and providing resources to base the pithy knowledge herein contained. To all the participants of the intellectually fulfilling process I am grateful for the support, without which this dissertation would lack basis in content and context.

Table of Contents



Acknowledgements

1.1 Background to the Study
1.2 Statement of the problem
1.3 Purpose of the study
1.4 Objectives of the Study
1.5 Justification of the study
1.6 Limitations of the study
1.7 Assumptions of the Study
1.8 Research Questions
1.9 Conceptual Framework

2.1 Introduction
2.2 Forensic linguistics
2.3 History of Forensic Linguistics
2.3.1 Forensic Linguistics in the U.K.
2.3.2 Forensic Linguistics in the U.S.
2.3.3 Forensic Linguistics in Australia
2.3.4 Forensic Linguistics in Germany
2.4 Development of Forensic Linguistics
2.5 Discourse Analysis
2.6 Forensic Linguistic Analysis
2.6.1 Stylometric Features
2.6.2 Lexical Features
2.6.3 Character Features
2.6.4 Syntactic Features
2.6.5 Semantic Features
2.6.6 Application-Specific Features
2.7 Author Gender Software Development
2.7.1 Text Categorization
2.7.2 Review of Similar Software
  JVOCALYSE V 2.05
  COPYCATCH GOLD V 2
  SIGNATURE STYLOMETRIC SYSTEM V 1.0
  WORDSMITH TOOL V. 4.0

3.1 Introduction
3.2 Research Design
3.3 Data Presentation, Discussion and Analysis

4.1 Introduction

5.1 Gender analysis
5.2 Bayesian algorithms
5.3 Testing the gender software

References



There has been growth in forensic linguistics in recent times due to changes in the nature of legal content and context. However, a holistic definition of "forensic linguistics" as a term is elusive. As a branch of applied linguistics, it can loosely be described as the application of linguistic knowledge, insight and methods in the context of the law, criminal investigation and judicial procedure.

Douglas et al. (1986) confirm that forensic linguistics can be used by the defence in criminal cases as well as by crime investigators. Expert submissions based on forensic linguistics can be made by both sides, from individual experts or independent laboratories. Inasmuch as it does not solve the crime, it is used to reinforce arguments.

According to Martin (1994), forensic linguistics is a combination of processes and the application of linguistic knowledge in a given social setting in the context of the law. As an application of science, it employs numerous linguistic theories to analyse language samples. These diverse fields include Speech Act Theory, Cognitive Linguistics, Theory of Grammar, Language and Memory Studies, Discourse Analysis and Conversation Analysis. A broad spectrum of linguistic analysis is required in forensic application, since it requires interrogating language use, sentence construction, moves made by the speaker or writer, and phrase and sentence structures (Coulthard and Johnson, 2010).

Whereas linguistic analysis has widely been done, and research milestones achieved, there is more work yet to be done on the analysis of written material. The issue of authorship has been contentious over the centuries, with Greek playwrights infrequently accusing each other of plagiarism. There is an increased need to identify the authors of written material to enhance security. According to John (2004), the rise in cybercrime through anonymous publication of online content that compromises individual rights or state security makes it harder to map and narrow down on perpetrators.

Zheng, Li, Chen & Huang (2006) highlight that to identify the author of a given material, the analysis must demystify the idiolect and patterns of language used, such as collocations, spelling, vocabulary and grammar. This rides on the theoretical construct that, as there is variation in linguistic use at the group level, variation is also present at the individual level. However, there is no homogeneous data on idiolects, as providing such evidence is difficult. There is also a limitation in the paucity of text: the documents used in forensic settings for most criminal cases are short, and therefore there is less text on which to base reliable identification. Such analysis could, however, be used for elimination purposes.

However, with the knowledge that certain groups use language in a different way from others, it is possible to form a construct from idiolects that can narrow down on the author based on factors like race, culture, gender, religion or age. This is because language is socially acquired as opposed to being an inherited property.

Based on measurement of the number of syllables per word, word length, punctuation and unique words, we can identify the group level of the author. Using statistical approaches such as factor analysis, the Poisson distribution and multivariate analysis, the gender of the author may be identified. Identification of gender is a plausible way of narrowing down on a suspect where there are suspects of both genders (Coulthard, 2004).

This can be achieved based on analysis of how men and women inherently use language of different classes and styles, through interrogation of linguistic features that indicate gender. Psychological analysis of psycholinguistic cues, as well as gender-preferential cues, can be used along with stylometric features to identify the gender of the user (David, 1998). Once the correct set of features indicating gender is identified, the software may be developed. With such tools, riding on existing technology platforms, gender identification in authorship can easily be achieved.
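The measurements named in this section (syllables per word, word length, punctuation and unique words) can be computed with a few lines of code. The sketch below is a minimal illustration under stated assumptions rather than the study's software: the function names are hypothetical, and the syllable count is a rough heuristic that counts groups of consecutive vowels, so it will miscount words with silent vowels.

```python
import re
import string


def count_syllables(word):
    """Rough syllable estimate: the number of vowel groups, at least one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def group_level_features(text):
    """Return the four simple measurements for one text."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text must contain at least one word")
    n_words = len(words)
    # Count ASCII punctuation marks anywhere in the text.
    n_punctuation = sum(1 for ch in text if ch in string.punctuation)
    distinct = {w.lower() for w in words}
    return {
        "syllables_per_word": sum(count_syllables(w) for w in words) / n_words,
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "punctuation_per_word": n_punctuation / n_words,
        "unique_word_ratio": len(distinct) / n_words,
    }
```

Feature values of this kind would then serve as the input to the statistical approaches mentioned above, which are not implemented here.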

1.2 Statement of the problem

There is an increased need to identify the authors of written material to enhance security. The rise in cybercrime through anonymous publication of online content that compromises individual rights or state security makes it harder to map and narrow down on perpetrators. However, with modern forensics, there is increased zeal in identification of individuals based on different categories such as age, race, religion and gender. The solution for those issues lies in the integration of forensic linguistics and computer software that will answer various categories of questions in legal and investigative settings.

For instance, inquiries such as identification of the speaker, language and author based on gender can be resolved. Eagleson (1994) contends that, through the relationship and intertextuality of the text and through linguistic profiling, the investigation of the dialect of the author, age, gender, native language and educational level, and the classification of typed texts such as suicide notes, threats or predatory chats, can be done. Factors like criminal activity and anonymity therefore increase the need to identify the authors of different texts positively. This can be done based on a number of distinguishing factors, and since gender narrows the field of candidates by a large ratio, the development of software that aids in the identification of the author by gender is important.

1.3 Purpose of the study

The purpose of this study is to understand the application of forensic linguistics in author identification. Further, the study aims at coming up with a paradigm that marries linguistic theories that can identify authors of the female gender to developed software that clearly identifies the author based on gender. It therefore intends to help in understanding the factors of author identification theoretically and the practical use of the developed software.

1.4 Objectives of the Study

This study is guided by the following objectives:

  • Study of the language of the law, including the language of legal documents and the courts, the police, and prisons.

  • Investigate the use of linguistic evidence (phonological, morpho-syntactic, and discourse-pragmatic) in the analysis of authorship and plagiarism, linguistic profiling and suicide notes.

  • Research into the practice, improvement, and ethics of expert testimony and the presentation of linguistic evidence, as well as legal interpreting and translation.

  • Better public understanding of the interaction between language and the law.

  • Study the use of linguistic theories to identify the gender of the author of the documents.

  • Investigate the possibility of developing software that positively identifies an author by gender.

1.5 Justification of the study

The integration of best practices in forensic linguistics and application software is important in propelling the field of authorship detection from law enforcement or academics to an actual forensic science that will be beneficial to the judicial systems. This study will develop reliable methods for author identification independent of any litigation, test for limits that correlate to a certain level of accuracy, and test errors of other techniques that could cause accumulated errors. In addition, the research offers a conclusive analysis of linguistic applications in forensics and suggests how to improve them. It equips interested parties with intelligence on how author identification works and how the theories can be translated through technological advancements to identify the gender of the author.

1.6 Limitations of the study


Access to information was a limitation. There is little information on the internet concerning the research question, and what exists is biased. Information is unreliable as it is sourced from unverified content in blogs and sites. Email information is also prone to respondent bias.

· Sample size

Due to the size of the research location and the means of data collection, the sample was of medium size and should have been bigger to give a more significant relationship from the data collected. However, since the research was meant for a rather small readership within the legal and forensic world, a representative distribution of the population was achieved.

· Lack of available data and prior research studies

Not much prior research had been done on the research problem. Basing the research on secondary data from previous papers on the same problem in the same target area was unachievable. Surveys done by individual experts and legal entities could have been biased and unreliable.

· Self-reported data

The respondents are experts in forensic linguistics, and the responses were opinion rather than fact. As a result, the research is based on self-reported data that cannot be independently verified. The responses are taken as credible at face value and used for analysis without further interrogation. The literature review is also based on previous attempts at studies in the same line but without verified validity. Owing to the potential biases of selective memory, telescoping, attribution and exaggeration, self-reported data may project findings that are totally different from the real situation.

· Longitudinal effects

The time available for investigating the research problem was short. As a result, constant and spontaneous changes in the population could not be measured for change and stability over a period long enough to base a long-term study on. There is a constraint of time and resources towards the achievement of a credible goal.

· Ethical issues

As in the above submission on the limiting factor within the scope of research based on prior investigative studies, the paper is based on interpretation and application of secondary data and the assumption of reliability in research instruments and respondents’ submissions.

1.7 Assumptions of the Study

The study was based on the following assumptions:

  • The sample population was a true representation of the entire target population.

  • The target sample provided relevant, honest, accurate and reliable information upon which the findings and consequent recommendations were based.

1.8 Research Questions

  1. What are the theoretical constructs of forensic linguistics?

  2. How can the theories be used to develop a paradigm that identifies the gender of the author?

  3. How can the theories developed be used to develop software that eases author identification by gender based on idealized models of existing software platforms?

1.9 Conceptual Framework

The conceptual framework used herein is the creation of the researcher, based on a number of models of conceptual frameworks. It combines working hypotheses with descriptive categories and formal hypotheses. The overall recommendations are based on the model of operations research. This helps in conducting conclusive research that explores and describes the problem, and makes an explanation and prediction that can be used in decision making (Shields and Rangarjan, 2004). The conceptual framework developed comprises independent variables, which are the factors affecting language use; intervening variables, which are the language use of the female gender; and dependent variables, which are the measurable outcome of language use.

Independent Variables

Factors affecting language use

Intervening Variables

Language use by the female gender

Dependent Variables

Measurable outcome of language use


This chapter concerns itself with the literature that is relevant to the research problem based on previous research and findings, journals, books and blogs. It explores the field of forensic linguistics, focusing on gender identification in authors and its potency as well as the challenges and opportunities it offers. It also concerns itself with previous research on identification software, the platforms it rides on and the factors contributing to its failure or success.

2.2 Forensic linguistics

According to Coulthard and Johnson (2010), Forensic Linguistics can be described as the application of theories of linguistics in the law and legal issues. However, to understand the application of these theories and work out a paradigm of how they can be adapted to suit the purpose of defining the identity of authors, it is important to understand the process and theories themselves.

In the light of the working definition of the term, the application of forensic linguistic knowledge to the legal setup forms an interface between crime, law and language. Judicial entities like law enforcement, legislation and proceedings can be interpreted through language to give a deeper understanding of submissions and hence make more profound judgments (Coulthard and Johnson, 2007).

Despite the fact that language has been a central and integral part of society for a long time, it is surprising that the use of forensic linguistics for the law in particular has not been fully explored and is still considered relatively new. There is still a milestone to be reached before the field is as acclaimed and widely used in the judicial process as other processes such as shoeprint analysis and fingerprint identification.

To achieve such goals, the field of forensic linguistics must be further interrogated and researched, with the tools and theories being developed to suit purpose and demand. There are various linguistic theories that may be applied in the process of analysing language samples. These theories may be drawn from separate and distinct methods and theoretical constructs such as Conversation Analysis, language and memory studies, Discourse Analysis, Cognitive Linguistics, theory of grammar, and Speech Act Theory (Shuy, 2001).

Relying on a broad spectrum of linguistic fields helps the researcher, since the data received by the linguist for analysis mostly requires deep interrogation of how language is remembered, the construction of conversations, the relationship between moves in a conversation or written text of the speakers or writers, and aspects of phrase and sentence structure. With such, forensic linguists apply linguistic knowledge, techniques, processes and theories to the language implicated in the legal proceedings.

2.3 History of Forensic Linguistics

There are no clear dates for the specific moments that work on forensic linguistics began. Specifically for authorship, there have been questions on the authors of different material over the decades. The Greek playwrights continuously accused each other of plagiarism, for instance. At least since the 18th century there has been interest in the authorship of different material like sacred texts, poetry and plays from both scholars and readers (John, 2008).

McGehee (1937) points out that in the 19th century, studies on the identification of authors had begun. Attempts at development of authorship attribution had begun, with British and American statisticians and mathematicians making the attempts. Augustus de Morgan made the first attempt in 1851, with later input by others such as T. C. Mendenhall and Udny Yule. Due to technological limitations and a lack of theoretical constructs at the time, the scholars mainly took a statistical approach and defined identity based on measurable attributes like average word length and mean sentence length. The result could not be reliable as the process was hardly forensic and did not go into detail on linguistics.

John (2008) affirms that the connection between the two was lacking for a long time, since it was not until 1968 that the term Forensic Linguistics was used for the first time. Jan Svartvik, a linguistics professor, was commissioned to make an analysis of statements given to police by Timothy John Evans, who had been accused of murdering his wife and baby in 1949. Through systematic and methodical analysis of the language used, he noted that two styles were used in the statements. He quantified the difference and demonstrated the use of a marked spoken style and an educated, written one. The analysis concluded that Evans had not dictated the police statements that were attributed to him.
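The purely statistical approach of the early scholars, which relied on average word length and mean sentence length, can be illustrated with a short sketch. This is a hypothetical reconstruction for illustration, not a method taken from any of the cited works: the function names are invented, and sentences are split naively on full stops, question marks and exclamation marks, so abbreviations would be miscounted.

```python
import re


def length_profile(text):
    """Return (mean word length, mean sentence length in words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        raise ValueError("text must contain at least one word")
    mean_word_length = sum(len(w) for w in words) / len(words)
    mean_sentence_length = len(words) / len(sentences)
    return mean_word_length, mean_sentence_length


def profile_gap(text_a, text_b):
    """Absolute difference between two texts on each of the two measures."""
    a = length_profile(text_a)
    b = length_profile(text_b)
    return abs(a[0] - b[0]), abs(a[1] - b[1])
```

Two averages say very little about how a person actually uses language, which is consistent with the observation above that results obtained this way could not be relied upon.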

2.3.1 Forensic Linguistics in the U.K.

There was an established set of laws in English law regarding how witnesses were interrogated. These laws, known as the Judges’ Rules, dictated how statements should be taken from witnesses by the police. The rules dictated that suspects were supposed to dictate their narrative to the police as the police officers recorded the statement without interruption or asking questions, except where minor clarifications were needed (Douglas et al., 1986).

The practicability of these rules was questioned since the stipulations were never followed. The norm was that the police officers took down notes as they asked a series of questions. They then wrote the statements, but not in the words of the suspect. They followed a form and pattern long dictated by police customs. The result was that the type of phrasing that was evident in the statements was far removed from the typical way of speech. The phrasing came to be known as ‘police register’, which has grown to be an area of study in forensic linguistics (Peter, 2004). The creators of the Judges’ Rules overlooked the difficulty of dictating and transcribing statements.

The judges who formulated these rules for taking statements seemed not to be aware that dictation of a statement and its verbatim transcription is difficult. It has been argued to be an impossible task for the average speaker. Coherent, sequential, articulate dictation of a narrative is extremely hard, but the harder task falls to the person recording the statement if the speaker has no skill at pacing the delivery.

Usually, the delivery of statements is not coherent and ordered. Natural speech is either too fast or too slow, and it involves omission of important details, speculating aloud, backtracking, and so on. In a word, the Judges’ Rules were plausible, but in effect they were just not workable. This is given as the reason the police officers created their own way of recording statements, but regrettably it involved making them up in some cases.

For such reasons, early Forensic Linguistics mainly involved the questioning of the authenticity of statements by defence attorneys. For example, expert evidence was first given from the witness box at a murder trial in 1989, at the Old Bailey. Peter French showed the presence of some police register in a statement that the prosecution had claimed was entirely in the words of the defendant (Peter, 2004).

According to Coulthard (2000), other notable cases include the appeal against the conviction of Derek Bentley, who was pardoned posthumously, the Guildford Four, the Birmingham Six, and the Bridgewater Three, among others. The work of Professor Malcolm Coulthard, a distinguished forensic linguist from Birmingham University, was relied upon heavily in these cases. His input as a discourse analyst proved important in establishing the truth and in the eventual overturning of the convictions.

2.3.2 Forensic Linguistics in the U.S.

Forensic Linguistics in the U.S. began differently. However, it was also mainly concerned with individual rights throughout the interrogation process. In 1963, Ernesto Miranda, who had been convicted of armed robbery, appealed against his conviction on the grounds that he had not understood his right to remain silent during arrest or his right to have a lawyer present during questioning. In 1966, the US Supreme Court overturned his conviction. Such cases came to be known as Miranda cases (Ayres, 1988).

Miranda rights are based on the simple provision that police officers have an obligation to advise arrestees that they have the right not to speak unless they wish to. Arrestees are also advised of their entitlement to a lawyer, and that anything they say that could be self-incriminating can be used against them in a court of law (Shuy, 2001).

Many issues arose, however, according to Professor Roger Shuy, who noted the following:

(i) That the confession should and must be voluntary.

(ii) The questioning process should not be coercive.

(iii) The arrestee must portray an understanding of their rights.

According to Shuy (2001), on the first point, it is hard to find an arrestee who voluntarily agrees to the questioning process. Arguably, then, such questioning is deemed coercive, as the US Supreme Court pointed out. On this basis, Shuy (2001) gave credible examples of how coercion during the interrogation process happens. He described how a suspect who had declined to speak after his Miranda rights were read was escorted to the station by two police officers in the back of a police car. The two officers then began a conversation about the possibility of a shotgun, a murder weapon involved in the case, being accidentally found by children at a school nearby. The suspect immediately waived his Miranda rights and led the officers to where the murder weapon was. The suspect was later convicted of murder. The issue of contention before the appeal court was the possibility that the suspect had been coerced into making a confession.

The judges and lawyers had to reconsider the meaning of ‘interrogation’ as a legal entity. The Rhode Island Supreme Court concluded that the interrogation process did not necessarily need to involve the asking of questions. The judges observed that there was subtle coercion, hence it was “a functional equivalent of interrogation”. Thereafter, the US Supreme Court held that interrogation is not necessarily in the form of a question, as it may be based on psychological ploys creatively used by the interrogator. A forensic audit of the conversation by the officers shows that what could have passed as casual remarks could have been a deliberate ploy (Coulthard & Johnson, 2007).

Shuy (2001) therefore raised numerous important queries about the Miranda laws, and questioned their obvious assumptions of simplicity. In a case study he cites, a fifteen-year-old boy from Houston, Texas had his rights read to him and consequently confessed to a murder. After forensic analysis of the tape-recorded interviews during interrogation, the conclusion was that even though the boy might have said that he understood the questions that were asked, his level of comprehension was extremely low. This was later confirmed by the school he attended, which said that his comprehension ability was equivalent to that of an eight-year-old.

These cases show that there is a need to query the very basic premises of the Miranda rights, and any other legal provisions like them. According to Shuy (2001), forensic linguists should not take any concept or word, even the ‘simplest’, for granted. There is a need to query legal provisions on a word-by-word level to understand their premise and therefore interpret them correctly. This should be taken into account in the development of a forensic linguistic tool that identifies an author correctly by gender, as the software should be as accurate as possible.

Shuy (2001) and other linguists in the US have had their works used for clarification and analysis in both civil and criminal practice. There was a realization that the law has always been subject to questioning since its inception. Forensic linguistics has been integral in trying to make judgements and to understand the concepts of the law. The middle era in the development of forensic linguistics, according to Shuy’s school of thought, dwelt on trying to understand what the law means and how different people perform in light of the question of whether they ‘understand’ their rights.

Levi(1994) created a credible review of the early era of ForensicLinguistics in the US. She recalls an analysis of ‘bad news aboutyour social benefits,` a letter that was written by the Department ofPublic Aid of Illinois to recipients of child benefit payments. Thedepartment had categorised these recipients as ‘non-cooperative’.Levi based her analysis on an interrogation of the vocabulary used bythe drafters of that letter to determine whether they had usedbureaucratic and technical language instead of the ordinary languagethat the recipients used every day. The analysis also includedpragmatic questions such as whether inferences made by the recipientswere justified by the case’s facts. She also was concerned aboutwhether the writers of that letter provided information that wasincomplete which could have led to misleading inferences being made.The linguist’s concern also stretched to whether the reader forcedto infer the information that should otherwise have been madeexplicit (Levi, 1994).

Theresult of her analysis was a stepping stone for the success of therecipients. The recipients of these benefits most of whom were singlemothers had suffered hardships as a consequence of the State’sactions. The judge awarded them twenty million dollars while orderingthe State to re-examine its classification of those that were ‘noncooperating’ to comply fully with the consent order. Based on theanalysis made by Levi, the State was also ordered to rewrite itsletter using language that was comprehensible for the beneficiaries.A fundamental point noted by Levi is the comment by a linguist inanother case, stated as the legal system is “linguistically naïveand vulnerable” (Levi, 1994). This point is highlighted in the nextsection.

Also,notable is that forensic linguistics has been applied in the US inthe early years of the study in cases dealing with trademarks. Forinstance, there was a dispute that involved McDonald’s as a brandname of the multi national fast food franchise. According to Levi(1994), forensic linguists Roger Shuy and Genine Lentine were calledto settle the dispute in this case.

QualityInns International had announced that they intended to open aneconomy hotels chain that would be called ‘McSleep.` According tothe proprietors of ‘McDonald’s’ the intended attachment of theprefix ‘Mc’ to several unprotected nouns, for example, ‘Nuggets’in ‘McNuggets’ or ‘Fries’ in ‘McFries’ barred the ownersof Quality Inns from using the prefix ‘Mc.` The plaintiff’s casewas not just a claim of implicit ownership of the name it was basedon the underlying morphological principle. It dwelt on the attachmentof the prefix ‘Mc’ to any noun (Ayres, 1988).

The case needed forensic linguists because the claim concerned the formula for combination, and McDonald’s was trying to invoke protection for that formula. In their claim, ‘McDonald’s’ explained that they had originated the combination by attaching unprotected words to ‘their’ prefix ‘Mc’, and that they had previously run advertising campaigns that exemplified this. In their analysis, Lentine and Shuy observed that the ‘Mc’ prefix had previously been used in other commercial applications. Ayres (1988) confirms that McDonald’s had not objected to these earlier uses and, as such, had no grounds for objecting in the case against Quality Inns International.

2.3.3 Forensic Linguistics in Australia

In Australia, forensic linguists began in the 1980s to speculate on the possibility of applying linguistics and sociolinguistics to legal issues. Of concern to them were the rights of individuals during legal processes. At that time, Aboriginal suspects in particular faced difficulties during police questioning. There was a quick realization that even such a simple phrase as ‘the same language’ was open to question. This was fundamental when considering ‘Aboriginal English’, the dialect spoken by the majority of Aboriginal people. Most white Australians wrongly thought it was a defective form of the Standard English spoken by whites (Gibbons, 2003).

As the language used by Aboriginal people, it is in fact a lingua franca in its own right. Thus, during police questioning, Aboriginal people brought their own understanding and use of English to the process. The police, who are speakers of Standard English, did not always appreciate this.

Gibbons and Teresa (eds.) (2008) confirm that interactional styles are mainly culturally based, and that Aboriginal people brought their own styles to the questioning process. As forensic linguists have observed in sociolinguistics, an individual’s interactional style, when and where it is perceived as being at variance with the dominant culture, shapes the responses given to questions. The Aboriginal style in particular is non-confrontational, and it could lead the police to assume falsely that a suspect is either being evasive or making an admission of guilt.

There was further Australian research that focused on how Aboriginal witnesses, as well as defendants, understood legal processes in a number of land claim hearings. Researchers examined how cross-cultural differences affected the presentation and eventual outcome of the cases. On this, Gibbons (1994) observed that the system on which interrogations in court are based is alien to Aboriginal culture. In ‘Language and the Law’ and ‘Forensic Linguistics: An Introduction to Language in the Justice System’, Gibbons summarises considerable experience in the judicial system and details the history of the growth and development of Forensic Linguistics.

2.3.4 Forensic Linguistics in Germany

In Germany, one of the earliest cases involved alleged slander by a tenant in an apartment complex. At issue was whether the word ‘concubine’ could be considered an insult. According to sociolinguists, some might have found the word amusing and addressed each other with it jokingly, while others might have found it very insulting (Coulthard, 2004). They established that it was impossible to classify a word or phrase as an insult, or to attribute verbal injury to it, in isolation. Whether it is insulting depends on the relationship between the speaker and the hearer. It is also influenced by the situational context and, to some extent, by the speaker’s level of education. The conclusion was that no word has a single, universally agreed meaning within any speech community.

The early days of German Forensic Linguistics also involved issues of authorship attribution and, importantly, methodologies for authorship attribution were developed. In one such case, there was contention over the theses presented by twin sisters. According to their university, their previous academic performance had been at a much lower level than that shown in the final presentation of their theses. Forensic linguists ruled out authorship attribution in that case, since the theses were written essentially in metalanguage (Grant, 2008a), and such language could not be attributed to any given individual. Their suggestion was that the students sit an examination, set by the university authorities, testing their knowledge of the concepts. This was more plausible than a subjective comparison of their work.

2.4 Development of Forensic Linguistics

Since the establishment of Forensic Linguistics as a discipline, there has been considerable growth in its scope. It began as a means of interrogating statements by witnesses and defendants, with linguists being called in to give expert evidence in different cases. These cases have included authorship attribution in terrorism cases, product contamination cases and suspicious deaths. Many cases have been solved through the analysis of text and the interpretation of the meaning of documents. New and better methods and theories are continually being researched. Forensic Phonetics has proven to be an important area of Forensic Linguistics: the auditory and acoustic analysis of speech, and its eventual application in both legal and criminal areas, has proven vital (Baldwin and French, 1990).

The growth of Forensic Linguistics has been characterised by two critical issues:

• The growing need to widen the scope and increase the effectiveness of Forensic Linguistics as expert testimony in the court system.

• The need to improve the application methodologies of Forensic Linguistics and to develop easy-to-use tools for non-linguists.

According to Coulthard and Johnson (2010), the field of forensic linguistics has many different strands, which are interconnected by the question of authorship. Whether considering if an individual dictated a statement, or if a statement is in the words of its alleged speaker, the fundamental question analysts ask is ‘Who is the real author of the statement attributed to X?’ There is also concern about whether such statements are made knowingly, voluntarily and in full knowledge of the suspect’s rights and privileges.

Therefore, the conditions of authorship are themselves interrogated. Forensic linguists can establish the events and structures that distort the narrative during questioning, as revealed by statements that seem vague and reluctantly given. Beyond language, there is the fundamental consideration of the asymmetric relationship between the police and the suspect, which could result in statements that are at variance with the suspect’s own submissions (Peter, 2004).

The broadest description of authorship rests on the theoretical premise that being an author means possessing the language you are using. The author has control over the use of language to produce text and to direct its course freely. However, this control is compromised for minority speakers, such as the disabled, the illiterate and the young, when they give statements to powerful authorities. Linguists can realistically challenge most texts that are produced under duress, regardless of whether the duress was intentional on the part of the questioning police officers (Coulthard and Johnson, 2007).

When a suspect’s use of language is far removed from the standard form of the language that the questioning officers use, there is potential for authorship to be distorted during questioning (Douglas et al., 1986). This distortion is exacerbated in proportion to the differences in perspective and in interactional styles based on cultural norms. However, this does not suggest malice or a lack of fairness in the judicial process. The established institutional structures may simply not always be as conducive to taking statements as they should be.

Author identification is a field that holds promise and, if further studies are done, it will prove very useful. However, its potential is hampered by the often very short length of the documents subjected to forensic analysis. For instance, suicide notes, threatening letters and ransom notes might not be long enough to provide credible information. Linguistic features that can be relied upon as indicators of authorship are yet to be established (Grant, 2008a).

Research on these questions is, however, on-going, and the availability of corpora of speech and writing samples suggests advancements in the future. With these developments, identification of authorship, even for short documents, should become easier. This will help to eliminate potential authors while selecting the author from the target group.

2.5 Discourse Analysis

Discourse analysis is a very broad field. The credibility of its conclusions depends entirely on the methodology used and on how the conclusions are described. A discourse analyst should provide helpful information through close analysis of text. For instance, a suspect’s use of “I” instead of “we” might show that there was no complicity in a conspiracy (Shuy, 2001).

Linguists also point out that a suspect’s use of terms like “uh-huh” or “yeah” in response to suggestions does not necessarily show that the suspect agrees with the suggestion made. Such answers could simply be feedback markers indicating understanding of the context. This conclusion is based on routine human behaviour in ordinary conversation. Courts around the world are divided on whether to allow discourse analysts to testify as experts. However, even where they are not allowed to testify in court, they are useful in the preparation of cases.

Pennycook (1996) highlights that dialectology and proficiency testing are time-tested areas of linguistics that attract relatively little controversy. However, there are more varieties of the same language, and growing sociolinguistic influences on language use due to mass media and population mobility, so people often mix dialect features. This poses a serious problem when linguistic origin is being analysed. Determining an author’s origin from their dialect or the language they use is complicated by the many languages spoken in one country. Such determinations should be made by qualified experts who understand the limitations inherent in such an approach.

All of these methodologies should be integral to the criminal justice system, as veracity analysis is, even though such testimony might not be used in the courtroom. Where authorship is in question, well-trained linguists should be called in to assist the jury. A forensic linguist can identify the author of a document accurately with relative confidence (Gibbons, 2003).

It is easier to eliminate a suspect with confidence than to prove them guilty beyond reasonable doubt; hence, linguistic expertise is often sought by the defense, for whom the reliability of identification is less essential. However, linguistic expertise is important for the prosecution too, during the investigatory stage. On-going research is making approaches in forensic linguistics more reliable, and the field is hence becoming increasingly useful to investigators and prosecutors. Countries like Germany and the Netherlands have forensic linguists in their criminal laboratories: Germany has the Bundeskriminalamt, while the Netherlands has the Nederlands Forensisch Instituut.

2.6 Forensic Linguistic Analysis

2.6.1 Stylometric Features

Stylometry is the study of writing style as a means of drawing conclusions about an author. Previous studies on authorship attribution have proposed taxonomies of features that quantify writing style. These features are called style markers and are grouped under distinct labels and criteria (Holmes, 1994; Stamatatos, Fakotakis, & Kokkinakis, 2000; Zheng et al., 2006).

Current reviews of text representation features for stylistic purposes focus on the computational requirements for measuring them. In lexical and character analysis, a text is considered as just a sequence of words or characters. It is noteworthy that, despite being more complex than character features, lexical features are traditionally the starting point of an analysis. Syntactic and semantic features then require deeper linguistic analysis, while application-specific features are defined only for given text domains or languages (Holmes, 1994).

2.6.2 Lexical Features

The simplest and most natural way of viewing a text is as a sequence of tokens grouped into sentences. Each token corresponds to a word, number, or punctuation mark. The earliest attempts at authorship attribution were based on simple lexical measures such as sentence length and word length counts (Mendenhall, 1887).

Such features have the significant advantage that they can easily be applied to any language and any corpus without additional requirements. They only require a tokenizer, that is, a tool that segments the text into tokens. However, for some natural languages, like Chinese, this is not a trivial task. Where sentential information is used, there should also be a tool that detects sentence boundaries. In text domains that make heavy use of abbreviations or acronyms, such as e-mail messages, this procedure might have considerable difficulty producing reliable measures.

Vocabulary richness functions attempt to quantify the diversity of the vocabulary within a text. An example is the type-token ratio V/N, where V represents the size of the vocabulary (the number of unique tokens) and N represents the total number of tokens in the text. Another example is the number of hapax legomena, which are the words that occur only once (de Vel, Anderson, Corney, & Mohay, 2002). However, the vocabulary size V depends heavily on the length of the text: the longer the text, the larger the vocabulary. New vocabulary appears quickly at the beginning of a text, but the rate at which it grows decreases as the text gets longer. A number of functions have been proposed in an attempt to achieve stability over text length (Yule, 1994). Since the results are questionable, such measures are normally considered unreliable when used alone.
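The two measures named above can be illustrated with a minimal Python sketch. The tokenizer, the function names and the sample sentence are assumptions made for illustration only; they are not taken from any of the cited studies.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def vocabulary_richness(text):
    """Return N, V, the type-token ratio V/N and the number of hapax legomena."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)   # N: total number of tokens
    v = len(counts)   # V: number of unique tokens (vocabulary size)
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring once
    return {
        "N": n,
        "V": v,
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
    }


# Hypothetical sample text, used only to exercise the function.
sample = "the cat sat on the mat and the dog sat on the rug"
richness = vocabulary_richness(sample)
print(richness)
```

Because V grows more slowly than N, running the same function on a longer sample by the same author will generally give a lower ratio, which is the instability the paragraph above describes.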

The most straightforward approach to representing texts is by vectors of word frequencies. Most authorship attribution analysis is based on a number of lexical features that represent style.

Traditionally, one may use the bag-of-words text representation that researchers follow for topic-based text classification. The text in question is considered as a set of words, each with a frequency of occurrence, disregarding contextual information. However, style-based text classification differs significantly. The most common words, such as articles, prepositions and pronouns, are among the best features for discriminating between authors. It is notable that these words are usually excluded from the feature set of topic-based text-classification methods, because they carry no semantic information; they are hence called “function” words. Consequently, style-based text classification that uses lexical features may require much lower dimensionality than topic-based text classification. This essentially means that far fewer words are sufficient for authorship attribution than for a thematic text categorization task, which involves several thousand words (Hoover, 2001). Most authors use function words largely unconsciously, and these words are topic-independent. Based on them, linguists can analyse the stylistic choices that authors make across different topics.

The choice of the specific function words used as features is usually based on arbitrary criteria and requires language-dependent expertise. Different sets of function words have been used for English, but limited information is provided on how they were selected. Abbasi and Chen used 150 function words, while Argamon, Saric, and Stein used a set of 303 words. Zhao and Zobel used 365 function words, while Koppel and Schler proposed 480 function words, as cited by Eagleson (1994).

For simplicity and good results, the lexical features for authorship attribution can be defined by extracting the most frequent words in the corpus in question. A decision is then taken on the number of frequent words to use as features. Earlier studies considered sets of at most 100 frequent words to be enough to represent an author’s style (Burrows, 1987). The feature-set size is also affected by the classification algorithm used, since many algorithms overfit the training data as the dimensionality of the problem increases. Powerful machine learning algorithms that can deal with many features, such as support vector machines, have enabled researchers to increase the feature-set size in this method. This has seen the number of words used grow over time. For instance, Koppel, Schler, and Bonchek-Dokow used the 250 most frequent words in 2003, while Stamatatos extracted the 1,000 most frequent words in 2006, as cited by Pennycook (1996).
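The frequent-word approach described above can be sketched in a few lines of Python. This is only an illustrative sketch: the two-text corpus, the helper names and the choice of relative frequencies are assumptions, and the cited studies may have normalised their counts differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def most_frequent_words(corpus, k):
    """Return the k most frequent words over the whole corpus.

    Ties are broken alphabetically so that the feature list is deterministic.
    """
    counts = Counter()
    for text in corpus:
        counts.update(tokenize(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


def frequency_vector(text, features):
    """Represent one text as the relative frequency of each feature word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [counts[word] / total for word in features]


# Hypothetical two-text corpus, used only for illustration.
corpus = [
    "the report of the board was sent to the court",
    "a letter to the court and a note to the board",
]
features = most_frequent_words(corpus, 3)
vectors = [frequency_vector(text, features) for text in corpus]
print(features)
print(vectors)
```

In practice k would be set to the feature-set sizes discussed above (for example 100, 250 or 1,000), and the resulting vectors would be passed to a classifier.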

2.6.3 Character Features

In this family of measures, the text is treated as a sequence of characters. This way, we can define a number of character-level measures, including counts of alphabetic characters, digit characters, lowercase and uppercase characters, and punctuation marks, as well as the frequencies of certain letters. Such information is easily available for any natural language and corpus, and it is vital in quantifying the style an author uses in writing.

Zheng, Li, Chen, and Huang (2006) claim that an approach based on extracting the frequencies of character-level n-grams gives a more elaborate analysis that is nonetheless computationally simple. For example, the character 4-grams at the beginning of a sentence like ‘A more elaborate design…’ would be: |A_mo|, |_mor|, |more|, |ore_|, |re_e|, etc. Using this approach, a linguist is able to capture nuances of style. These include lexical information, for example |_in_| and |text|, hints of contextual information like |in_t|, and the use of capitalization and punctuation. Where the selected texts contain grammatical errors or strange punctuation, which is common in e-mails, the character n-gram representation is not adversely affected.
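The 4-gram example above can be reproduced with a short Python sketch. The function names are hypothetical, and rendering spaces as underscores is simply the display convention used in the example, not a requirement of the method.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams, with spaces shown as '_'."""
    marked = text.replace(" ", "_")
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


def ngram_profile(text, n):
    """Count how often each character n-gram occurs in the text."""
    return Counter(char_ngrams(text, n))


grams = char_ngrams("A more elaborate design", 4)
print(grams[:5])

profile = ngram_profile("A more elaborate design", 4)
print(profile.most_common(3))
```

No tokenizer or sentence splitter is needed, which is why the representation remains usable on noisy text such as e-mail.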

A misspelt word and its correct form are treated as two different words in a lexically based representation, even though they share most of their character trigrams. Where the text categorization is style-based, such errors are often considered personal traits of the author, and this information can be captured by character n-grams, for example in uncommon trigrams like |stc| and |tc_|. Finally, for oriental languages, in which tokenization is a hard procedure, character n-grams offer a suitable solution (Zheng, Li, Chen, and Huang, 2006).

2.6.4 Syntactic Features

Employing syntactic information gives a more elaborate representation of style in text. This approach rests on the idea that authors unconsciously make use of similar syntactic patterns. Syntactic information is, therefore, considered a more reliable authorial fingerprint than lexical information. The success of function words in representing style is itself an indicator of how useful syntactic information is, since such words are usually encountered in specific syntactic structures (Baayen, Van, Neijit, & Tweedie, 2002).

Such information, however, requires robust NLP tools that can perform accurate syntactic analysis of texts. Consequently, the extraction of syntactic measures is a language-dependent procedure, as it relies heavily on the availability of a parser that can analyse a given natural language accurately. Such features often produce faulty datasets because of unavoidable errors made by the parser.

Baayen, van Halteren, and Tweedie were the first to use syntactic information as measures in authorship attribution, in 1996. They extracted rewrite rule frequencies from a syntactically annotated English corpus, which comprised a full, semi-automatically produced parse tree for each sentence (Maley, 1994).

The rewrite rules express part of the syntactic analysis. Consider, for example, the rewrite rule below:

A:PP → P:PREP + PC:NP

According to McMenamin (1993), this rule means that an adverbial prepositional phrase is constituted by a preposition followed by a noun phrase as the complement of the preposition. Such detailed information describes both the syntactic class of every word and the way in which words are combined to form phrases or other structures. Experimental results have shown that this type of measure performs better than vocabulary richness measures and other lexical measures. However, it requires a sophisticated and accurate fully automated parser that can provide a detailed syntactic analysis of English sentences. Gamon (2004) also based an analysis on the output of a syntactic parser to measure rewrite rule frequencies. On their own, the proposed syntactic features performed worse than the lexical features, but a combination of the two improved the results considerably.
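To make the idea of rewrite rule frequencies concrete, the following Python sketch counts the rules found in a single parse tree. The tree is hand-built and hypothetical, standing in for the output of a parser; the nested-tuple encoding and the function name are assumptions for illustration, not the representation used in the studies cited above.

```python
from collections import Counter


def rewrite_rules(tree):
    """Yield one 'PARENT -> CHILD + CHILD' rule per phrasal node.

    A tree is a tuple (label, child, child, ...); a leaf word is a plain string.
    Nodes whose children are all words (the word-class level) yield no rule.
    """
    label, children = tree[0], tree[1:]
    child_labels = [child[0] for child in children if isinstance(child, tuple)]
    if child_labels:
        yield f"{label} -> {' + '.join(child_labels)}"
    for child in children:
        if isinstance(child, tuple):
            yield from rewrite_rules(child)


# Hypothetical hand-annotated parse of "the letter arrived in the morning".
tree = (
    "S",
    ("NP", ("DET", "the"), ("N", "letter")),
    ("VP",
     ("V", "arrived"),
     ("A:PP",
      ("P:PREP", "in"),
      ("PC:NP", ("DET", "the"), ("N", "morning")))),
)

rule_counts = Counter(rewrite_rules(tree))
for rule, count in sorted(rule_counts.items()):
    print(count, rule)
```

Summing such counts over every sentence in a text gives a frequency vector over rules, which can then be used in the same way as a word frequency vector.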

2.6.5 Semantic Features

The conclusion from the above is that the more detailed the text analysis required to extract stylometric features, the less accurate the resulting measures are. NLP tools can be applied successfully to low-level tasks such as sentence splitting, POS tagging, text chunking and partial parsing, so the relevant features are measured accurately and the error in the corresponding datasets is kept low (Corney, Vel, Anderson and Mohay, 2002).

More complicated tasks, such as full syntactic parsing, semantic analysis and pragmatic analysis, cannot yet be handled accurately by NLP technology for all texts. Consequently, very few attempts have been made to exploit high-level features for stylometric purposes.

Gamon used a tool that was able to produce semantic dependency graphs, but did not provide any information on the accuracy of that tool. Two kinds of information were extracted: binary semantic features and semantic modification relations. The binary semantic features concerned the number and person of nouns, as well as the tense and aspect of verbs. The semantic modification relations described the syntactic and semantic relations between a node of the graph and its daughters; an example is a nominal node with a nominal modifier indicating location. The reported results showed that combining semantic information with lexical and syntactic information helped to improve the accuracy of classification.

Linguistic scholars McCarthy, Lewis, Dufty, and McNamara also described another approach to the extraction of semantic measures.

They estimated information on the synonyms and hyponyms of words and identified causal verbs. They also applied latent semantic analysis to lexical features as a means of automatically detecting semantic similarities between words. However, these features are not described in detail, and the contribution of the semantic information to the classification model is not clarified in the evaluation procedure. Eagleson (1994) described perhaps the most important method of exploiting semantic information, which was inspired by the theory of Systemic Functional Grammar. It defines a set of functional features that associate given words and phrases with certain semantic information. In more detail, the “CONJUNCTION” scheme denotes how a clause expands on its preceding context. The expansion could be an “ELABORATION”, which is exemplification or refocusing, an “EXTENSION”, which entails adding new information, or an “ENHANCEMENT”, which is a qualification.

There are specific words or phrases that indicate the modalities of the “CONJUNCTION” scheme. For instance, the word “specifically” can be used to identify a “CLARIFICATION”, an “ELABORATION”, and a “CONJUNCTION”. A phrase like “in other words” can be used to identify an “APPOSITION”, an “ELABORATION” or a “CONJUNCTION” (Eisenman, 1997).

The detection of such semantic information is based on a lexicon of words and phrases that is produced semi-automatically from online thesauruses. Each entry in the lexicon associates a word or phrase with a set of syntactic constraints and semantic properties. The set of functional measures then contains measures showing, for instance, how many “CONJUNCTIONs” are expanded to “ELABORATIONs”, and so on. No information is provided on the accuracy of those measures. An experiment on authorship identification in a corpus of 19th-century English novels showed that functional features can reasonably improve the classification results when they are combined with traditional function-word features.

2.6.6 Application-Specific Features

The features described so far are application-independent, since they can be extracted from any textual data, provided that the appropriate NLP tools and resources required for their measurement are available. Application-specific measures may additionally be defined in order to represent the nuances of style within a given text domain. This section reviews the most important of these measures.

According to Zheng, Li, Chen, and Huang (2006), applying authorship attribution technology in domains like e-mail and online messages reveals that structural measures can be defined to quantify the author’s style. These structural measures may include the use of greetings and farewells, signatures, indentation and paragraph length. Where the texts are in HTML format, we can also define HTML-related measures such as tag distribution, font-colour counts and font-size counts. Such features can, however, only be defined for a given text genre. These measures become very important for very short texts, whose stylistic properties are not adequately represented by application-independent methods. Accurate tools are, however, required to extract such properties. Zheng et al. (2006) reported the difficulties that they faced when trying to measure structural features accurately.
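A few of the structural measures listed above can be approximated with simple string handling, as in the Python sketch below. The greeting and farewell word lists, the feature names and the sample message are all hypothetical; they are not the feature definitions used by Zheng et al. (2006), and a real system would need far more robust detection.

```python
import re

# Hypothetical, deliberately short word lists (assumptions for illustration).
GREETINGS = ("hi", "hello", "dear")
FAREWELLS = ("regards", "thanks", "cheers", "sincerely")


def structural_features(message):
    """Extract a few layout-based style measures from a plain-text message."""
    body = message.strip()
    lines = [line for line in body.splitlines() if line.strip()]
    # Paragraphs are blocks of text separated by at least one blank line.
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    first_line = lines[0].strip().lower() if lines else ""
    closing_lines = [line.strip().lower() for line in lines[-2:]]
    words_per_paragraph = [len(p.split()) for p in paragraphs]
    return {
        "has_greeting": first_line.startswith(GREETINGS),
        "has_farewell": any(l.startswith(FAREWELLS) for l in closing_lines),
        "paragraphs": len(paragraphs),
        "mean_paragraph_length": (
            sum(words_per_paragraph) / len(paragraphs) if paragraphs else 0.0
        ),
        "indented_lines": sum(1 for l in lines if l[0] in " \t"),
    }


# Hypothetical message, used only to exercise the function.
sample = (
    "Hi Anna,\n\n"
    "The item is still for sale.\nCall me tonight.\n\n"
    "    Price is 50 obo.\n\n"
    "Regards,\nTom"
)
feats = structural_features(sample)
print(feats)
```

Even this small example shows why accurate tools are hard to build: a greeting list that works for one newsgroup will miss the conventions of another.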

A text’s style is a factor orthogonal to its topic. Consequently, stylometric features attempt to avoid content-specific information so as to be more reliable on cross-topic texts. However, where all the available texts by the different authors are on the same thematic area, carefully selected content-based information is helpful in revealing authorial choices.

To capture the properties of an author’s style better within a text domain, an analyst can use content-specific keywords. In more detail, given that the texts deal with given topics and are of the same genre, one can define certain words that are frequently used within that topic or genre. For instance, in their framework for the analysis of online texts from a newsgroup, Zheng et al. (2006) defined content-specific keywords which included “deal,” “sale,” or “obo” (which means “or best offer”). These differ from function words in that they carry semantic information and are characteristic of certain topics and genres. However, the process of selecting such features for a text domain remains unclear.

2.7 Author Gender Software Development

2.7.1 Text Categorization

Text categorization involves a number of stages. They include the following:

• Choose a large set of text features that is potentially useful for categorizing a text, for example words that are neither too common nor too rare, then represent each text as a vector whose entries represent the frequency of each feature in that text.

• Optionally, use different criteria to reduce the dimension of these vectors by eliminating the features that do not seem correlated with any of the categories. This can be done by latent semantic indexing or by stepwise iteration of the learning algorithm.

• Use machine learning methods to construct one or more models of the categories. Promising algorithms that have been compared and assessed include neural nets, k-nearest-neighbour, SVM and Winnow. Where multiple models are learned, they can be combined through methods such as bagging and boosting (Coulthard, 2000).

• Use bootstrapping or k-fold cross-validation to estimate the system’s reliability.
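The last two stages can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a nearest-centroid classifier is swapped in as a simple stand-in for the algorithms named above (neural nets, k-nearest-neighbour, SVM and Winnow), the folds are contiguous rather than shuffled, and the feature vectors and the labels "A" and "B" are hypothetical.

```python
def k_fold_indices(n_items, k):
    """Split the indices 0..n_items-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train_x, train_y, x):
    """Nearest-centroid stand-in classifier: pick the closest class mean."""
    centroids = {
        label: centroid([v for v, y in zip(train_x, train_y) if y == label])
        for label in set(train_y)
    }
    return min(sorted(centroids), key=lambda lab: squared_distance(centroids[lab], x))


def cross_validate(xs, ys, k):
    """Estimate accuracy by training on k-1 folds and testing on the held-out fold."""
    correct = 0
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        correct += sum(1 for i in fold if predict(train_x, train_y, xs[i]) == ys[i])
    return correct / len(xs)


# Hypothetical relative frequencies of two feature words in six texts.
xs = [[0.10, 0.02], [0.03, 0.09], [0.11, 0.03],
      [0.02, 0.10], [0.09, 0.01], [0.04, 0.08]]
ys = ["A", "B", "A", "B", "A", "B"]
accuracy = cross_validate(xs, ys, k=3)
print(accuracy)
```

With real data the texts would normally be shuffled before the folds are formed, so that every training split contains examples of each category.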

2.7.2 Review of Similar Software

The development of software that easily identifies the gender of an author should be guided by the platforms on which similar existing software is based. To understand them, we ought to examine such software.

The review concerns itself with software for plagiarism detection and software that has historically been used for authorship detection.

JVOCALYSE V 2.05

JVocalyse v 2.05 was developed by David Woolls together with a team from the Corpus Forensic Linguists group (Wools, 2003). It has access to 450 English function words and can also work with other languages. It uses lexically based measures, such as the vocabulary range used by the author of a document. Despite not being designed for statistical identification, it supplies measurable data that linguists can use to give a quantitative analysis of style.

The analysis of the data is rapid, and the user can see the different ratios of content to words at two levels, the full-text level and the vocabulary level. In addition, it facilitates the identification of word strings that may reveal linguistic patterns, which allows the regularity of patterns in long texts to be examined. The different colours used for the full-text mark-up help to give a visual representation of the vocabulary frequency distribution through the text (Wools, 2003).

However, as one of its limitations, JVocalyse uses only lexical measures; hence other linguistic style markers that would have been relevant for clarification are overlooked. These markers include spelling, punctuation, errors and omissions, text format, and numbers and symbols.

COPYCATCH GOLD V 2

CopyCatch Gold is plagiarism detection software, also developed by David Woolls, that has recently been bundled with JVocalyse to create the CopyCatch Suite. The software loads function words as well as technical words that behave as function words in certain subjects. It comes with instructions in a simple, user-friendly English manual. It applies a pre-defined similarity threshold between two sets of texts. It is noteworthy that CopyCatch Gold’s approach to the detection of plagiarism also relies on lexically based measures. It allows the identification and quantification of the content words, function words, phrases and sentences that a set of texts has in common (Wools, 2003).

By virtue of its function, CopyCatch Gold in particular can be used in the historical investigation of texts by anonymous authors. Since it shows all matching text, it can be used to attribute certain works to certain authors accurately. Since it includes statistics, it might also be used by forensic linguists as a secondary tool for the statistical analysis of data, as the quantity and frequency of lexical entities like words and phrases are measured (Wools, 2003).

However, as software for forensic authorship identification, it has limitations, since it was designed to examine texts structurally through phrasal and vocabulary analysis. There are fundamental differences between platforms meant for author identification and those meant for plagiarism detection. CopyCatch Gold may show similarities in style, but it has no set measures that point to the reasons for the similarities.

SIGNATURE STYLOMETRIC SYSTEM V 1.0

The Signature stylometric system was designed by Peter Millian of the University of Leeds. It is freeware intended for educational purposes. It is meant to facilitate stylometric analysis focusing on author identification. It helps researchers compare styles by different authors and analyse disputed material while exploring author identification (Millian, 2013).

With the Signature stylometric system, you can load three different files at the same time to create a large but single corpus. Single texts can also be divided in half. It projects two- or three-dimensional graphs with the results that correspond to the measurement of style markers such as lexical entity lengths. Its statistics option performs Chi-square significance tests that evaluate the relative homogeneity of multiple variables that are expressed as actual frequencies.
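The Chi-square comparison of frequencies described above can be illustrated with a short sketch. It is not the Signature system's own code: the function name, the word-length bands and the counts are hypothetical, and it computes only the standard Pearson statistic for a table of observed frequencies.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a table of observed frequencies.

    observed: one row per text, each row listing counts per category
    (for example, counts of short, medium and long words).
    Returns the statistic and its degrees of freedom.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if both texts shared one distribution.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    degrees_of_freedom = (len(observed) - 1) * (len(col_totals) - 1)
    return statistic, degrees_of_freedom


# Hypothetical counts of words with 1-3, 4-6 and 7+ letters in two texts.
text_a = [120, 90, 40]
text_b = [100, 95, 55]
stat, dof = chi_square_statistic([text_a, text_b])
print(f"chi-square = {stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared with the critical value for the given degrees of freedom (5.991 at the 0.05 level for two degrees of freedom); a value below the critical value means the two texts cannot be distinguished on this marker alone.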

Apart from lexical entities, the software also measures other relevant style markers such as letter and punctuation frequencies. Its analytical approach provides comparison of the results through graphical output.

WORDSMITH TOOL V. 4.0

The Wordsmith tool v. 4.0 is a suite of computer programs that was developed by Mike Scott of the University of Liverpool. It ranks as a quantitative analysis program that explores how lexical and grammatical features behave within their natural setting, which is the text. It has a text converter that works as a search-and-replace tool. Its word list generates words in alphabetical order as well as frequency order. This way, the forensic linguist may compare the texts at a lexical level (Scott, 2004).
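A word list of the kind described, ordered both alphabetically and by frequency, can be approximated in a few lines. This is only an illustrative sketch and not the Wordsmith implementation; the function name and the simple tokenisation rule (lower-cased runs of letters and apostrophes) are assumptions.

```python
import re
from collections import Counter


def word_list(text):
    """Return (alphabetical, by_frequency) lists of (word, count) pairs."""
    # Assumed tokenisation: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    alphabetical = sorted(counts.items())
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return alphabetical, by_frequency


alphabetical, by_frequency = word_list("The cat and the dog and the bird")
print(alphabetical)
print(by_frequency)
```

Running the same function over a questioned text and a known text gives two frequency lists that can be compared side by side at the lexical level.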


The study involved collection of information from previous studies and collating them to create a credible paradigm for the development of working software.

3.2 Research Design

The study adopted and used a descriptive survey design as guided by the conceptual framework. The descriptive survey design is concerned with gathering facts and obtaining pertinent and precise information concerning the current status of a phenomenon and drawing possible conclusions (Gay, 1989). A descriptive survey describes, reports and analyzes the conditions that exist or existed. The descriptive design is appropriate as it enabled the researcher to collect and analyze data from a wide range of secondary sources without manipulating the conditions. Both qualitative and quantitative data were generated; hence, both quantitative and qualitative techniques were used to analyze the data obtained. Quantitative data was analyzed using descriptive statistics.

3.3 Data Presentation, Discussion and Analysis

This study mainly purposed to develop a credible framework that utilizes linguistic techniques in forensics not only to help in authorship identification, but to do so by gender. Based on lexical analysis, there are individual characteristics like size of the letters, arrangement, words and lines, spacing between the letters, pen pause and lift, hesitation and connecting strokes that are basic for the linguistic analysis. Others include class characteristics like pictorial effect, alignment of letters, movement of writing, style, writing speed, and the line quality in the written text. They are determined and compared to the results of linguistic analysis. Based on earlier reviews of related studies, as well as analysis of the texts in question, three types of linguistic feature set were considered. They are syntactic, lexical, and structural features.

This study includes the lexical features used in de Vel (2000), vocabulary richness (Yule, 1938), syntactic features like function words (Mosteller and Wallace, 1964), the incorporation of parts of speech (Stamatatos et al., 2001) and characteristics of punctuation (Baayen et al., 2002). The structural features are a representation of the writer's style of organization of the text layout (Zheng, 2006).
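To make the feature types above concrete, the following sketch computes a few simple measures from a text. It is a minimal illustration under stated assumptions: the type-token ratio stands in for vocabulary richness (it is not the measure of any cited author), the function-word set is a small hypothetical subset, and the names `FUNCTION_WORDS`, `PUNCTUATION` and `style_features` are invented for this example.

```python
import re

# Hypothetical, deliberately small subset of English function words.
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "it", "is", "was"}
PUNCTUATION = ",.;:!?"


def style_features(text):
    """Return a small dictionary of lexical, syntactic and punctuation features."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    if n == 0:
        return {
            "type_token_ratio": 0.0,
            "function_word_rate": 0.0,
            "punctuation_per_word": 0.0,
            "mean_word_length": 0.0,
        }
    return {
        # Vocabulary richness proxy: distinct words divided by all words.
        "type_token_ratio": len(set(tokens)) / n,
        # Share of tokens that belong to the function-word subset.
        "function_word_rate": sum(t in FUNCTION_WORDS for t in tokens) / n,
        # Punctuation marks per word token.
        "punctuation_per_word": sum(text.count(p) for p in PUNCTUATION) / n,
        # Average number of characters per word token.
        "mean_word_length": sum(len(t) for t in tokens) / n,
    }


print(style_features("The cat sat. The dog sat!"))
```

Structural features such as paragraph length or layout would need the original formatting of the text and are left out of this sketch.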

Ground-truth data is too often overlooked and undervalued in stylometric computing. One of the studies on the "write-print" showed a high degree of accuracy in identifying the authors of texts, with over 97% accuracy for English and over 92% for Chinese (Zheng et al., 2006). This was regarded as an impressive result; however, it was undermined since the dataset was not based on ground-truth data. This was revealed in the researchers' comments on a sub-study by three authors in their data set. Ground-truth data, however, must be verified. Getting data from the web is a convenient and fast way of gathering and collecting data. However, the data that is collected is not easily verifiable.

Traditional literary research, as well as recent computer-science-based stylometry, has focused on literary and religious texts and scholarly publications. All of these text types contain language that is highly stylized, edited, rhetorically sophisticated, and formulaic. The texts are typically long, with tens of thousands of words (Eagleson, 1994).

There are stylometric computing methods used for literary texts as well as large collections of electronic text such as electronic librarianship. However, their feasibility is still widely untested on forensic data. These methods can be brought into computational forensic author identification. However, this is not the same as establishing a protocol empirically by using the methods on data that is forensically feasible. There must be forensically feasible tests on stylometric computing methods to establish whether they really work on ground-truth data (Bunz and Campbell, 2003).

Forensic computational linguistics is grounded in linguistic theory, implements linguistic analysis in software, and uses a standard linguistic methodology not only for analytical techniques but also for data collection and research methodology. Neither forensic stylistics nor stylometric computing is grounded in linguistics (Zheng et al., 2006).


There is substantial evidence of gender differences in day-to-day communication, and these differences are similarly present in electronic communication. There are studies that have been designed to interrogate gender preferences in style of language in the context of electronic discourse. Three experiments were carried out to ascertain the preferences of language styles used by genders in electronic messages. In the initial experiment, the participants were asked to send electronic messages to a designated 'netpal'. A discriminant analysis showed that the gender of the participants could be determined at an accuracy level of 91.4% (Pennebaker, 1990).

The second and third experiments were performed to determine how accurately readers of emails identify gender. A selection of messages was given to the participants of the study for the prediction of the gender identity of the authors. In 14 of the 16 messages, the gender was accurately predicted. The third experiment had six messages on gender-neutral topics. From a subset of variables that were identified in the first experiment, female and male versions of each of the messages were created. When participants made their predictions on the gender of the authors, there was a difference in the ratings as a function of the message versions (Swallowe, 2003).

From the findings, it is evident that gender-preferential language is used in electronic discourse too. The readers base their predictions of the identity of the authors of such texts on the evident gender-linked language differences.

In conclusion, according to the results from the first experiment, there is evidence of gender-preferential language in electronic communication, and these expressions are similarly present in face-to-face communication. From the second experiment, there is an indication that readers can discriminate and identify the gender of the author on a number of gender-preferential language features. Since the changes involved in creating the male and female versions were minor, it is interesting that the results suggest people are as naturally sensitive to gender cues in electronic discourse as they are in face-to-face communication.

There is a salient issue of a growing gender identity crisis in the social ranks. Gender identity crisis is based on psychological factors that lead to people identifying as of the opposite gender or of the neuter gender. There are therefore implications, as suggested by the results of the experiment, for people who have adopted a different gender online. There is the possibility of identifying the gender of language users online even when they assume gender-neutral identities. This possibility rides on the ability of people to be sensitive to the gender-linked effects on language.

According to the outcome of the study, there is an implication that social cues are also present in electronic discourse. As a matter of fact, there is more emphasis on such cues than on the name of the author when inferring the gender. 'There are actual instances of queries on the gender of the writer due to a mismatch in the pseudonym used and the choice of style in writing' (Herring and Paolillo 2006, p. 47).

The sample used in the study could lead to underestimation of the extent to which gender-preferential language is used in electronic communication. The sample was based on university students only, and they might not have been representative of the cyber-community in the general population. Previous research on university students suggested that they demonstrate fewer gender-linked differences in language and, importantly, that they are less likely to display stereotypical sex roles than the general population.

However, extensive research points out that there are cues of social identity provided in electronic discourse nonetheless. The extent of use of these social cues is, however, dependent on the social context of electronic communication. In social contexts that are primarily informal, gender is more evident in the discourse. Such evidence could, however, be weaker should the context be the factual exchange of important and formal communication. Behaviour and discrimination based on gender could be more likely to occur where gender happens to be a salient category in social identity.

Earlier, there was an expectation of equal gender voice provided by computer-mediated communication (Koenig, 1986). The general belief was that it would offer anonymity that would free users of their stereotypical social roles. However, the creators of such thought overlooked the propensity and ability that the human reader has in constructing social realities.

According to Tannen (1990), these findings raised questions on how readers will reply to messages that are categorized as male or female based on style characteristics. A theoretical perspective on the influences of communication is based on conversation accommodation theory. The general expectation is that computer-mediated communication will force users to adapt their communication styles for more consistency with the target audience. Face-to-face communication, according to past studies, has evidenced accommodation to gender-preferential styles. The researchers have suggested that individual behaviour is based more on expectations of others than on their actual behaviour. This is given as the reason youth may over-accommodate during communication with the old, due to social stereotypes that govern such interactions. Such over-accommodation has not been researched based on gender stereotypes, but should be further explored. It is noteworthy, however, that authors often write for large target groups, and the characteristics of such a group might influence the writer's style more than individual characteristics. As such, research should focus on mutual influences and reactions during communication rather than just the individuals communicating.

In conclusion, electronic discourse shows gender-linked language differences just like written essays and face-to-face communications do. People are also very sensitive to such differences even where gender specification and physical gender indicators are absent, and they can identify the gender of the author of electronic text accurately.

This identification is made possible by the difference in emotional expressiveness of the genders. Significant research has shown that women happen to be more emotionally expressive in face-to-face communication. Further research quantifies stereotypes that associate particular emotions with certain gender roles. These stereotypes are inherent since they have been observed in children as early as pre-school age (Mulac, Bardac, and Gibbons, 2001). Studies have suggested that women are characterized by stereotypic emotions like sadness, fear and happiness. On the other hand, men are characteristically stereotyped as angry. Such stereotypes are a basis of society's qualifications of what is and is not socially acceptable for gender roles in emotional displays.

The general anonymity offered by social media and networking sites brought a belief that it would weaken the stereotypic profiling of women as more emotionally expressive. This was interrogated in a study carried out on fifty native Australians on their use of expressional markers in their online communication. It hypothesized that, regardless of how communication has been changed by social networking, the gender stereotypes that the sample was exposed to from early childhood, and that they deemed socially acceptable for the genders, are still dominant. The study shows that women are still the more emotionally expressive of the two genders. The data that supported this hypothesis indicated that the frequency of use of prosodic expressional markers by males and females was at times close, but women remained the more expressive gender (Kelly and Hutson-Comeaux, 2002).

These emotional expression markers include punctuation, full stops, capitals, additional letters, emotions and laughter as shown in the figure below.

(Kelly and Hutson-Comeaux, 2002)
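As an illustration of how such expression markers might be counted automatically, the sketch below uses regular expressions. The patterns, their names and the thresholds (for example, three or more repeated letters counting as "additional letters") are assumptions made for this example and are not taken from the cited study; emoticons are included on the assumption that they are among the markers meant.

```python
import re

# Hypothetical patterns; each one approximates a single marker type.
MARKER_PATTERNS = {
    "repeated_punctuation": re.compile(r"[!?]{2,}"),
    "multiple_full_stops": re.compile(r"\.{2,}"),
    "capitalised_words": re.compile(r"\b[A-Z]{2,}\b"),
    "elongated_words": re.compile(r"\b\w*(\w)\1{2,}\w*\b"),
    "emoticons": re.compile(r"[:;]-?[)(DPp]"),
    "laughter": re.compile(r"\b(?:a?(?:ha){2,}h?|lol)\b", re.IGNORECASE),
}


def count_markers(text):
    """Count occurrences of each expressive marker type in the text."""
    return {name: len(pattern.findall(text))
            for name, pattern in MARKER_PATTERNS.items()}


message = "I am SO happy!!! Sooooo good... hahaha :)"
print(count_markers(message))
```

Counts like these would normally be divided by message length before male and female samples are compared, so that longer messages do not appear more expressive simply because they contain more text.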


The role played by gender in our lives as humans is crucial right from birth. As part of growth and development, there is the intrinsic development of different behaviour in the two genders, compatible with societal expectations and stipulations and with gender identification. According to most linguists and sociolinguists, the way of speaking and writing of the two genders varies due to their different experiences in life (Labov, 1972). The identification of these differences is the basis of research on gender identification. Linguistic researchers have based their research on phonological and lexical differences to differentiate male and female language use (Stamatatos et al., 2001). This is based on submissions of their spoken and written media. This, however, does not focus on creative language as used in novels, plays or poetry, since it is highly embellished with deep-rooted artistic input on stylistics. Therefore, based on the belief that there are differences in male and female writing, there is a perception that identifies differences in linguistic use.

With developments in technology and the shift of the world to a global village, there are numerous changes in interactions that are shaping up every day. However, despite these changes, text has remained the most prevalent medium of communication over internet protocols. There are numerous applications that are used as multimedia channels, but social network applications like Facebook and Twitter are widely accepted as text-based applications. This is also apparent in other communication tools like email accounts and blogs. According to the Oxford Internet Institute's Bernie Hogan, who is a specialist in social network technology in the UK, such technology has a useful role. The ability to provide extra cues such as the gender of the writer will be helpful in this technological era (Uchida, 1992).

However, there has been speculation in the process of identification of the authors of internet text, given that their scope is short, and their origin is widely unspecified and unverified. Most of the online information is attributed to anonymous sources. However, within internet-based forensic linguistics, the need to identify authors of documents is apparent. Further still, it is important to accurately identify authors based on their gender, as exemplified by the recent events of faked author gender over the internet. The core problem shifts from authorship attribution to author gender identification (Miller, 1984).

There has been tremendous growth in language and gender studies over the past four decades. Indeed, there has been a raging debate among sociolinguists over the past few years on the style of writing and the gender of the writer. Their research on the language differences between men and women has mainly had two approaches. The first is the dominance approach while the second is the difference approach (Hamdan and Al-Jallad 2008).

The first theory states that the differences in language use between the male and female genders are a result of male dominance over females and the consequent female subordination and submission. The assumption is mainly that the life experiences of the two genders will affect their linguistic output.

According to Lakoff (1975), the second theory, however, is based on the belief that men and women belong to different sub-cultures. As a result, their cultural differences affect their linguistic submissions. For example, women are more careful in speech, unlike men, whose speech shows aggression, power and strength. Coats (1993) stated that there is an inextricable link between language and gender, and both are developed through daily participation in social activities and practices.

There was an extensive review provided by Kyratzis and Cook-Gumperz (2008) in explaining the development of gendered practices in children based on their cultures. In their first review, they claimed that, through language socialization theory, children learn cultural gender-related values through participation and linguistic interactions.

The second theory is the Separate World Hypothesis. The theory claims that there are separate sub-cultures in the growth of boys and girls. They, therefore, grow up in different social roles, culturally, and thus develop different linguistic uses as conditioned by their respective social subsets.

The third theory argues that there are two different cultures that children and adults belong to. An overview by Bucholtz in 2003 gave some discourse analysis approaches and their utilization in gender studies. Such studies were based on critical traditions and sociological and anthropological perspectives. These forms, however, as per the researcher's conclusions, were not comprehensive. There exist other forms that can be utilized in the analysis of discourse where it is studied as a social phenomenon (Coats, 1986).

There are a number of linguistic features identified as used by women, such as tag questions, qualifiers and intensifiers. There were also hedging devices, empty adjectives, trivial lexis and the use of rising intonation on declaratives. Similar studies reported that females used the first person singular pronoun more in comparison to males.

In writing, gender differences in the studies were revealed to be based on different approaches. Females focus on personal experiences in their writing, while male writing is more about aggression and competition. Women also dwell on relationships more than men, with frequent use of apologies and compliments.

Gender studies have attempted to create tools and software programs that enable readers to identify the author by gender. Chao-Yue (2010) collected two different sets of literature. They were casual and formal writings from books and blogs. The 47 books of over 3 million words were from 23 male authors and 24 female authors. There were also 48 blogs of 4.22 million words. Using the Naïve Bayes classifier and the NLTK toolkit, the researcher concluded that female authors use first person pronouns more. Male authors used the second and third person pronouns more. They also used semicolons more, with females using the most expressive exclamation marks. While females used the verb love more to show affection, males used like, could, would and think.
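The pronoun comparison reported above rests on a simple measurement: how often first, second and third person pronouns occur per unit of text. The sketch below shows one way to compute such rates. It is not the cited researcher's code; the pronoun sets, the function name and the per-1,000-words normalisation are assumptions chosen for illustration.

```python
import re

# Hypothetical pronoun inventories grouped by grammatical person.
PRONOUNS = {
    "first": {"i", "me", "my", "mine", "myself",
              "we", "us", "our", "ours", "ourselves"},
    "second": {"you", "your", "yours", "yourself", "yourselves"},
    "third": {"he", "him", "his", "himself", "she", "her", "hers", "herself",
              "it", "its", "itself", "they", "them", "their", "theirs",
              "themselves"},
}


def pronoun_rates(text, per=1000):
    """Return pronoun occurrences per `per` word tokens, by person."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {person: 0.0 for person in PRONOUNS}
    return {
        person: per * sum(token in words for token in tokens) / len(tokens)
        for person, words in PRONOUNS.items()
    }


print(pronoun_rates("I love my dog and you love your cat"))
```

Rates of this kind, computed separately for male-authored and female-authored samples, are the sort of feature a classifier such as Naïve Bayes could then be trained on.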

In Herring and Paolillo's (2006) investigation of gender and genre variation in weblogs, the corpus comprised a balanced sample of weblogs by single male or female authors. The corpus included 35,721 words, of which 22,134 were written by women and 13,587 were written by men. In examining whether language use variations in weblogs could be attributed to the gender of the author or the type of genre, the researchers used a constant blog type. Sixteen stylistic features that had previously been identified in machine learning research were investigated. According to the research, females preferred using all the personal pronouns in the three persons, in both singular and plural. Males preferred to use determiners and demonstratives, such as numbers and quantifiers, and the possessive pronoun its. The diary entries, as per the results, were based on female-preferred stylistic features, while filter entries had more stylistic features that were preferred by males.

Research on readers' perceptions of the gender of authors by Janssen and Murachver showed that participants judged the gender more accurately than the discriminant analysis did. Therefore, the identification of the author's gender is only natural for the reader. This is based on the premise that there is a relationship between the creator of the text and the text itself. This natural prediction is a result of informal spoken interactions as well as formal written texts. Researchers have concluded that the gender of the author is not easily concealed from audiences, and readers too cannot ignore it. Consequently, there is a need to develop automated tools that can accurately identify the gender of the author (Janssen and Murachver, 2004).

To investigate the gender of authors, there should be an interrogation of the text from different angles. The text should be of different lengths, genres and content. Internet content is mainly short, mainly multi-genre and usually content free. To develop a tool that identifies the author's gender, there should be the basic query on whether and how men and women use different language classes and styles differently. This can take place with the knowledge of the linguistic features that are gender indicators.

Basing this interrogation on past research in human psychology, 545 psycholinguistic features are proposed. These include gender-preferential cues that, when used alongside stylometric features, build a sound basis for gender identification. It should be noted, however, that identification of the correct set of gender indication features is still an open research problem (Janssen and Murachver, 2004).

Author gender identification software that is used to 'guess' the gender of a writer could have played an important role in helping the world realize that a blog that was opposed to the Syrian government and stood for gay rights was not written by the purported young lesbian living within the country.

As was discovered later, the author of that blog, "Gay Girl in Damascus," was actually a man. The online gender checker could have reached this conclusion earlier. When the text of the last blog post was fed into the software, it identified a 63.2 per cent likelihood that the author was male (Michaelson and Margil, 2001).

The software program was developed at Stevens Institute of Technology in Hoboken, New Jersey, by Na Cheng and colleagues. The software has had numerous improvements over time, and it is soon going to be used in revealing the gender of writers of online content. This is regardless of whether they are bloggers, or whether their content is in emails or on social media like Facebook or Twitter. The team of developers envisage the software helping in the protection of children as well as unsuspecting adults from possible grooming by internet predators who might have concealed their gender online (Grey, 1998).

The fake blog event highlights the problem posed by people masking their true identity online. The actual truth about the alleged blogger, Amina Abdullah, was only evident after the blogger had disappeared. While the online contacts soon realised that they had never met Amina, her blog photo turned out to have been stolen from someone's Facebook page. The real blogger masquerading as Amina was a 40-year-old American identified as Tom MacMaster, who was living in Edinburgh, UK. He later confessed to writing the blog content all along.


With the software as programmed by Cheng and colleagues, determining the gender of a blogger or online writer requires a text file containing a paragraph of 50 or more words to be uploaded or pasted into the program for gender analysis.

Making a judgement on whether the gender is male or female only takes a few minutes. The program also provides a gender judgement for the neutral gender. This option shows how much the text is stripped of any gender indications. This is particularly prevalent for scientific and content-specific texts, according to the researchers (Hollien, 2002).

To write such a program, there is a need for vast by-lined text from archives and email databases. These documents are then trawled for "psycho-linguistic" factors identified in earlier chapters of this research. These include punctuation styles and specific words. When these factors are identified and studied in detail, they are then supposed to be honed down to a significant number of gender-significant ones. These factors should include differences in punctuation as well as other style markers like the differences in paragraph length between men and women (Nolan and Grabe, 1996).

There should be a consideration of other gender-significant factors, which include words indicating mood. They should also consider the sentiments of the author as well as the degree of use of emotionally intensive adverbs and affective adjectives. These include charming, really, or lovely. These, for instance, are used more often by women.

There are three machine learning algorithms designed for gender identification. These include Bayesian logistic regression, the support vector machine and the AdaBoost decision tree, which are based on the initially proposed features.

5.2 Bayesian algorithms

Finally, the software should combine the identified cues using a Bayesian algorithm. The algorithm guesses the gender of the author based on a balance of probabilities as suggested by the tell-tale factors. When text is fed into the software, the judgement it makes on whether the writer is male or female is expected to be 85 per cent accurate. However, this percentage improves with more use by more people. This is because the users give feedback on what the system guessed incorrectly. The developer of the algorithm hence learns (Mulac, Bardac and Gibbons, 2001). Even when the system gives a "neutral" decision on the prediction of the author's gender, the developer should make observations and collate them with the known facts. This is because such identification could be an indication of writers trying to write in a gender voice that is not natural to them.
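One way to picture how cues might be combined on a balance of probabilities is a small naive Bayes model with a "neutral" band. The sketch below is an assumption-laden illustration, not the algorithm used by the software discussed in this study: the class name `CueNaiveBayes`, the add-one smoothing, the width of the neutral band and the toy training cues are all hypothetical.

```python
import math
from collections import Counter


class CueNaiveBayes:
    """Toy naive Bayes over cue tokens, with add-one smoothing."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.cue_counts = {}          # label -> Counter of cues seen
        self.doc_counts = Counter()   # label -> number of training texts
        self.vocabulary = set()

    def train(self, cues, label):
        """Add one labelled text; also usable for later user feedback."""
        self.doc_counts[label] += 1
        self.cue_counts.setdefault(label, Counter()).update(cues)
        self.vocabulary.update(cues)

    def posterior(self, cues):
        """Return the normalised probability of each label given the cues."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, counts in self.cue_counts.items():
            denominator = (sum(counts.values())
                           + self.smoothing * len(self.vocabulary))
            score = math.log(self.doc_counts[label] / total_docs)
            for cue in cues:
                if cue in self.vocabulary:  # unseen cues carry no evidence
                    score += math.log((counts[cue] + self.smoothing)
                                      / denominator)
            log_scores[label] = score
        peak = max(log_scores.values())
        weights = {label: math.exp(score - peak)
                   for label, score in log_scores.items()}
        norm = sum(weights.values())
        return {label: weight / norm for label, weight in weights.items()}

    def judge(self, cues, neutral_band=0.10):
        """Return the likelier label, or 'neutral' if the evidence is weak."""
        probabilities = self.posterior(cues)
        best = max(probabilities, key=probabilities.get)
        if probabilities[best] < 0.5 + neutral_band:
            return "neutral", probabilities
        return best, probabilities


# Hypothetical training cues, for illustration only.
model = CueNaiveBayes()
model.train(["lovely", "really", "!"], "female")
model.train(["its", "the", "the"], "male")
print(model.judge(["lovely"]))
print(model.judge([]))
```

Calling `train` again with a text the system misjudged, together with its correct label, is a simple stand-in for the feedback loop described above, in which accuracy improves as users report incorrect guesses.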


Finally, there should be extensive experiments on large corpora of text to indicate the accuracy levels scored in identification of the gender. Such experiments are also important in indicating the function words, word features and structural features that are significant as gender discriminators within the tool. Software has been tested before on known authors to assess its accuracy. For instance, in the case of Mary Ann Evans, a female novelist who wrote under the nom de plume George Eliot, software analysis of the gender of the writer based on only the first paragraphs of her novel Middlemarch gave a 94.6 per cent likelihood that the author was female (McMenamin, 1993).


Linguistic research is mainly based on descriptive survey designs that use quantitative methodologies for an elaborate presentation of findings and analysis. The descriptive survey design concerns itself with gathering facts and obtaining pertinent and precise information that is used quantitatively for purposes of drawing possible conclusions. Such research, therefore, describes, analyses and reports conditions that exist or existed on the research topic (Coulthard and Johnson, 2007).

However, in most linguistic research, the quantitative data that is analysed using descriptive statistics from the training data (dataset) used for feature extraction is queried. There are concerns that the methodology lacks consistency and that, should such instruments be used with different sets of variables, the findings could be different.

This, however, is argued against by researchers in linguistics for a number of reasons. According to the researchers, linguistic studies and research are still new, and therefore there should be more time to develop instruments that are accurate for the purpose. The chosen methodology is arguably recommendable despite the clear limitations that come with its lack of consistency. The testing and validation are for theories that are already constructed, and therefore the output of such processes is dependent on the theories. Such hypotheses are usually constructed well before the actual data is collected, and as such the reliability of the methodology cannot be queried based on substitution of data (Canary and Dindia eds., 2011).

Researchers also submit that research findings can be generalized where the data is from random samples of sufficient size; therefore, where the sample is actually representative, there should be no reason to question the validity of the methodology.

The methodologies are also used to eliminate the confounding influence of many variables, so that an actual cause-and-effect relationship is credibly established between the data set and the features. Where tools are aptly developed, quantitative and precise numerical data are analysed in a relatively short time. The results that are arrived at are independent of researcher bias too.

According to Coulthard (2000), the methodology also provides for clear documentation based on the content of the data set and the survey instruments; hence the validity of the findings is not pegged on the process, but on the data collected. Such methodologies have standardized approaches that allow for the replication of the study in different settings to give comparable findings. Where there is a challenge for the natural settings of evaluations due to effects of extraneous variables, the methodologies employed in linguistic research also help in controlling such effects.

However, there are several limitations that arise from the methodologies employed from time to time in building word lists and feature sets for the purposes of author gender identification. These are the weaknesses that should be highlighted and addressed to ensure that future research is done with minimal possibilities of errors.

Among the weaknesses that are given for the methodologies are factors that involve the validity of the results more than the process itself. The quantitative methodologies used might produce categories that do not represent the actual situation. This could be due to the instruments used or the data fed into the analysis tools. The research process might also be more focused on the theories and hypotheses being tested than on the process of generating the hypotheses, which leads to a confirmation bias in the eventual findings (David, 1998).

The methodologies involved in the acquisition of features from data sets bring out results that are too abstract for direct application by some individuals in their original form. The process, therefore, limits the scope of its audience considerably. The data collection method involved is highly structured for most features that are analysed, and it is hard to create apt instruments that can collect information on such datasets effectively. The research process is also speculative, as the entities that are used to create the dataset are from secondary data based on other research inputs and the discretion of the researcher. The data collection method is, therefore, suspect and risks bias from the author.

Linguistic entities like speech markers are not entirely measurable on usage, and therefore the representation of such entities as features fails to meet the threshold of an entirely statistical process and can rather be seen as a product of computation. The mechanisation of this process makes the features lack feasibility. The features arrived at could, therefore, be a result of self-reported information that is obtained from a researcher's understanding of previous research, literature reviews and linguistics. The dataset may, therefore, be inaccurate or incomplete (Dubois and Crouch, 1975).

The relationship between contextual factors and the interpreted result is often viewed as speculative, as the variations in behaviour of the dataset that produce the results are not easily explained to or understood by many people other than the creators of the tools that arrive at the results. Having chosen to use secondary data to arrive at different entities as function words that create the dataset, there is an unusual situation created that alienates interested parties. Some would have given varying datasets on their part, and they, therefore, might lack the subjective element in such research.

The use of purpose-built tools makes such research expensive and places it beyond the reach of ordinary people who might be interested in carrying it out but lack the technical ability and knowledge to use the tools. The research methods are also seen as inflexible, because the instruments used cannot be modified easily during the study.

Since the research is supposed to be descriptive and attributive, information is lost when the data is reduced to numbers in the results. Such information is not captured in the graphical representations. The correlations between the input in the datasets and the features produced may ignore a number of causes, and many variables are left untested that may account for problematic effects on the result.

Errors in the hypotheses tested may yield false impressions, raising understandable questions about the quality of the program or the dataset. Any error in the initial selection of the procedures for determining statistical significance results in erroneous findings that disregard the intended impact (Koppel et al., 2002). Consequently, the researcher should be careful to establish a process that is ethical, valid and unbiased.


This paper has interrogated the viability of the research methodologies that have attempted to identify features from datasets to help create speech markers that can be used to identify the gender of the author. However, the accuracy of these results is in doubt, since the positive identification rates, despite being high, have not been verified on second and subsequent attempts. Creating software that is apt, accurate and feasible is therefore a hard task that must take into account various defining factors and processes.

The factors that lead to identification are many and distinct, and creating a single tool that combines them to identify the gender of the author is difficult. The considerations made about the language used in the text are based on certain characteristic features that help to identify the gender of the author. When combined, these factors form a preferred language for each gender based on sociolinguistic influences. Users of language then use given forms of language more than those of the opposite gender because of these influences and biases.

The software should include tools that combine these underlying factors, which are reflected in the language, to identify the gender of the author. There has been much research on authorship identification. A number of studies deal with gender identification, and some have studied gender identification in electronic discourse. These researchers have proposed many classification techniques based on a multi-disciplinary approach for accurately identifying the gender of the author. However, not much has been studied for the purpose of developing gender identification software (Gail and Hawisher eds., 2012).

Creating software that can identify gender accurately from the many contributing factors is a challenge in its own right, given the scope of the input required. For anonymous text in particular, building linguistic analysis software that is accurate enough to identify gender from electronic discourse is a challenge. A tool that applies linguistics while extending the principles of psycholinguistic techniques helps to create a profile that can positively identify the author of a document. According to Leet-Peregrini (1980), developing working software requires combining psycholinguistics, which relates language use to the psychological processes behind it, with the need to identify an author accurately.

A framework for developing author gender identification can be based on earlier related studies and analyses (Ellis, 1994). It considers three linguistic feature sets: lexical, syntactic and structural features. However, no established paradigm gives a fixed, clear list for each feature set. This means that while the software may be developed using a certain set, it might leave out a considerable number of function features and hence be of questionable validity.
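The three feature sets can be illustrated with a minimal Python sketch. It is not the framework of any cited study: the function name `extract_features`, the short function-word list and the choice of individual features are assumptions made only for illustration, since no fixed list of features is established.

```python
import re
import string

# Hypothetical function-word list (an assumption); no fixed list is established.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "i")


def extract_features(text):
    """Return a few lexical, syntactic and structural features of a text."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    n_words = len(words) or 1  # avoid division by zero on empty input

    features = {
        # Lexical: word-level statistics
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "type_token_ratio": len(set(words)) / n_words,
        # Syntactic: punctuation use (function words are added below)
        "punctuation_per_word": sum(c in string.punctuation for c in text) / n_words,
        # Structural: layout of the text
        "sentence_count": len(sentences),
        "paragraph_count": len(paragraphs),
    }
    # Syntactic: relative frequency of each function word
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = words.count(fw) / n_words
    return features
```

Changing `FUNCTION_WORDS` changes the feature vector that any later classifier sees, which is one concrete way in which the choice of feature set affects validity.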

For example, the list of emotional words arrived at by a given study could be treated as identifiers of the female gender. However, such words could be amply present in a text authored by a man, giving a false identification. Moreover, the choice of emotional words is better explained by the context of the situation than by gender. There are also men who identify as feminine and are prone to use these words, hence the identification it gives is not complete.

Furthermore, the software might not identify the gender of an author who writes in a gender-neutral voice, whether on purpose or not. The essence of the software is to identify the gender of the author, and here it fails in its mandate, since a neutral result is far removed from the identification the user expects.

Linguistic analysts have established keywords as markers of either gender. However, these words are identified not by meaning but by occurrence (Fairclough, 1989). The preference of each gender for these words is what qualifies them as keywords. Since their choice is not attributive or descriptive, the software does not understand them and could mistake them depending on the context of use. For instance, where it identifies a noun that is a keyword only when used as a verb, an error is bound to be made. The use of a word as a different part of speech is not captured in the basic design of the software, and any use, in either context, is registered and analysed by the software.
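The occurrence-based counting described here can be sketched in a few lines of Python. The marker list is invented for illustration and is not taken from any study; "love" is assumed, purely for the example, to be a marker only in its verb sense.

```python
import re
from collections import Counter

# Hypothetical marker list, invented for illustration only.
MARKERS = {"love", "feel"}


def marker_counts(text):
    """Count marker words by surface form, ignoring part of speech."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in MARKERS)


# "love" is a verb in the first sentence and a noun in the second,
# yet both register as the same marker.
verb_use = marker_counts("I love this place")
noun_use = marker_counts("My love of football is well known")
```

Both `verb_use` and `noun_use` record one occurrence of "love", so a counter of this kind cannot tell the two uses apart. Separating them would need part-of-speech information, which this sketch deliberately leaves out.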

Moreover, there are words that are used by both genders, and when used they could register for either of the two in the software. Identifying the gender of the user of a contentious word and its intended meaning would require the construction of a parse tree. Such a construct would be laborious and costly in time and effort. It would be even harder to build the construct into the software (Fitzgerald, 2004).

John (2004) observes that positively identifying the gender of the author of a relatively short text, one that does not offer enough material for in-depth analysis, could also be a limitation for software development. Success rates have been recorded to fall as the length of the text decreases. The design of the software therefore becomes a shortcoming where it is supposed to identify gender from text that is short and does not contain words that act as markers and identifiers.
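One way a tool might handle short or marker-free text is to decline to answer rather than guess. The sketch below assumes a hypothetical minimum length of 50 tokens and two placeholder marker sets made of nonsense words, so that no claim about real gendered vocabulary is implied. None of these values comes from the cited studies.

```python
import re

# Placeholder marker sets (assumptions); real lists would come from a study.
FEMALE_MARKERS = {"alpha", "beta"}
MALE_MARKERS = {"gamma", "delta"}
MIN_TOKENS = 50  # hypothetical minimum length for a usable sample


def guess_gender(text, min_tokens=MIN_TOKENS):
    """Return 'female', 'male' or 'undetermined' from marker counts."""
    tokens = re.findall(r"[a-z']+", text.lower())
    female = sum(t in FEMALE_MARKERS for t in tokens)
    male = sum(t in MALE_MARKERS for t in tokens)
    # Abstain when the text is too short or the markers give no majority.
    if len(tokens) < min_tokens or female == male:
        return "undetermined"
    return "female" if female > male else "male"
```

Returning "undetermined" makes the limitation visible to the user instead of hiding it behind a forced label, at the cost of giving no answer for exactly the short texts discussed above.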

In forensic linguistics, there is a dire need for software tools that are accurate and whose results can be presented in a court of law as evidence. However, such evidence should be of an integrity that cannot be challenged or contested. The software is bound to make errors, and while experimental identification has room for error, the judicial system does not offer such luxury. It might therefore take some time before software is accepted for identifying the gender of authors as evidence in a court of law (Gibbons, Prakasam, Tirumalesh and Nagarajan eds., 2004).

The predictions made by the software tools are driven entirely by linguistic approaches, and their development is a multiphase process that involves the application of theories, the creation of features and the eventual development of the software itself. However, it is hard to find software developers who are well placed to understand the depths of linguistics, just as very few linguists understand software development (Maurer, 2006). Either group therefore needs a great deal of research and consultation to develop the software. Failure to understand either of these inputs is a limitation that will hamper the development of an accurate tool. Should there be a way to address these limitations and challenges, the development of software that is accurate in prediction can be an achievable task.


Ayres, Jr, B. Drummond (22 July 1988). McDonald’s, to Court: ‘Mc’ Is Ours. New York: The New York Times. Retrieved 19 March 2012.

Baayen, R.H., Van Halteren, H., Neijt, A., & Tweedie, F. 2002, An experiment in authorship attribution. In Proceedings of the 6th International Conference on the Statistical Analysis of Textual Data, St. Malo, France.

Baldwin, J & French, P 1990, Forensic phonetics, London, Pinter Publishers.

Boston, U 1991, “Panel Finds Plagiarism by Dr. King”. The New York Times. 11 October 1991. Retrieved 2008-06-14.

Bunz, U & Campbell, S 2003, Accommodating politeness indicators in personal electronic mail messages. Presented at the Association of Internet Researchers’ 3rd Annual Conference, Maastricht, The Netherlands.

“Centre for Forensic Linguistics”. Aston University.

Cheng, N, Cheng, X, Chandramouli, R & Subbalakshmi, K.P 2009, “Gender identification from e-mails,” in IEEE Symposium on Computational Intelligence and Data Mining proceedings, pp. 154-158.

Climate, C 1997, Men and women talking: The differential use of speech and language by gender. Retrieved December 14, 2006

Corney, M, de Vel, O, Anderson, A & Mohay, G 2002, Gender-Preferential Text Mining of E-mail Discourse. In Proceedings of the 18th Annual Computer Security Applications Conference, Las Vegas, NV

Coates, J 1986, Women, Men and Language. First Edition. New York: Longman

Cortes, C & Vapnik, V 1995, Support-vector networks. In: Machine Learning 1995. pp. 273-297

Coulthard, M 2004, ‘Author identification, idiolect and linguistic uniqueness’, Applied Linguistics, 25(4), 431-447.

Coulthard, M. & Johnson, A 2007, Forensic linguistics: An Introduction to Language, Crime and Law, Tehran, Jahad Daneshgahi Publication.

Coulthard, M. and Johnson, A 2010, A Handbook of Forensic Linguistics: Language in Evidence, London, Routledge

Coulthard, M., and Johnson, A 2007, ‘An introduction to forensic linguistics: Language in evidence’, Oxford: Routledge: 162-3.

Coulthard, R.M 2000, “Whose text is it? On the linguistic investigation of authorship”, in S. Sarangi and R.M. Coulthard: Discourse and Social Life. London, Longman.

David I. H 1998, “The Evolution of Stylometry in Humanities Scholarship,” Literary and Linguistic Computing 13/3: pp. 111-117.

Douglas et al. 1986, “Criminal Profiling from Crime Scene Analysis”. Behavioural Sciences & the Law 4, 401-421.

Dubois, B. L., & Crouch, I 1975. ‘The question of tag question in women’s speech: They don’t really use more of them, do they?’ Language in Society, 4, 289-294.

Eagleson, R 1994, ‘Forensic analysis of personal written texts: a case study’, John Gibbons (ed.), Language and the Law, London: Longman, 362–373.

Eisenman, R 1997, Men, women and gender differences: the attitudes of three feminists – Gloria Steinem, Gloria Allred and Bella Abzug. Retrieved November 21, 2006, from

Ellis, S 1994, ‘Case report: The Yorkshire Ripper enquiry, Part 1’, Forensic Linguistics 1, ii, 197-206.

Fairclough, N 1989, Language and Power, London: Longman.

Fitzgerald, J. R. 2004. “Using a forensic linguistic approach to tracking the Unabomber.” In J. Campbell, & D. DeNevi (Eds.) Profilers: Leading investigators take you inside the criminal mind (pp. 193-222). New York: Prometheus Books.

Freund, Y & Schapire, RE 1995, A decision-theoretic generalization of online learning and an application to boosting, MA: Addison-Wesley Publishing Company.

Gail and Hawisher eds., 2012, Literacy, technology, and society: Confronting the issues (pp. 424-441). Upper Saddle River, NJ: Prentice Hall.

Gibbons, J 2003, Forensic Linguistics: an introduction to language in the Justice System, Blackwell, London.

Gibbons, J. and M. Teresa Turell (eds) 2008, Dimensions of Forensic Linguistics, Amsterdam, John Benjamins.

Gibbons, J., V Prakasam, K V Tirumalesh, & H Nagarajan (Eds) 2004, Language in the Law, New Delhi, Orient Longman.

Grant, T & Baker, K 2001, ‘Reliable, valid markers of authorship’, Forensic Linguistics VIII(1): 66-79.

Grant, T 2008, “Quantifying evidence in forensic authorship analysis”, Journal of Speech, Language and the Law 14 (1).

Grant, T. D 2008, Approaching questions in forensic authorship analysis. In J. Gibbons & M. T. Turell (Eds.), Dimensions of Forensic Linguistics. Amsterdam, John Benjamins.

Grey, C 1998, Towards an overview on gender and language variation. Retrieved November 21, 2006,

Hamdan, J & Al-Jallad, N (2008). “The Semantics of –ship and –hood from a Foreign Language Learner’s Perspective”. Int. J. Arabic English Stud. pp. 107-122.

Herring, S & Paolillo, J (2006). “Gender and Genre variation in Weblogs.” J. Sociolinguist. 10(4):439-459.

Hollien, H 2002, Forensic Voice Identification. New York, Harcourt.

Holmes, J 1986, Functions of ‘you know’ in women’s and men’s speech. Language

Holmes, J 1993, An introduction to sociolinguistics. London, UK, Longman.

Hoover, D. L 2001, “Statistical stylistics and authorship attribution: an empirical investigation”, Literary and Linguistic Computing, XIV (4), 421-44

Janssen, A and Murachver, T 2004, “The Role of Gender in New Zealand Literature: Comparisons Across Periods and Styles of Writing”. J. Lang. Soc. Psychol. 23(2):180-203.

John, O 2004, An Introduction to Language Crime and the Law. London, Continuum International Publishing Group

John, O 2008, Forensic Linguistics, Second Edition. London: Continuum ISBN 978-0-8264-6109-4.

Kelly, J.R. and Hutson‐Comeaux, S.L. 2002. Gender Stereotypes of Emotional Reactions: How we Judge an Emotion as Valid. Sex Roles 47:1‐10.

Koenig, B.J 1986, ‘Spectrographic voice identification: a forensic survey’, letter to the editor of J. Acoustic Soc. Am., 79, 6, 2088-90.

Koppel et al., 2002, “Automatically categorizing written text by author gender”. Literary and Linguistic Computing vol. 17, no. 4, pp. 401-412.

Labov, W 1972, Sociolinguistic patterns. Philadelphia, PA, University of Pennsylvania Press.

Lakoff, R 1975, Language and women’s place. New York, Harper and Row.

Levi, J 1994, Forensic Linguistics in the US: ‘Bad news about your social benefits,’ a letter that was written, New York, Harcourt.

Leet-Peregrini, H. M 1980, Conversational dominance as a function of gender and expertise. Language: Social psychological perspectives. Oxford, Pergamon.

Maley, Y 1994. ‘The language of the law’, in J. Gibbons (ed.), Language and the Law, London, Longman.

Martin, F 1994, The Chronicle of Crime: The infamous felons of modern history and their hideous crimes, New York, Harper and Row.

Maurer, F. K 2006, “Plagiarism – a survey”, Journal of Universal Computer Science, vol. 12, no. 8, pp. 1050-1084.

McGehee, F 1937, ‘The reliability of the identification of the human voice’, Journal of General Psychology, 17, 249-71.

McMenamin, G 1993, Forensic Stylistics. Amsterdam, Elsevier.

Michaelson, G., & Margil, P. 2001, Virtual gender: Gender in e-mail based cooperative problem solving, London, UK: Routledge.

Millian, P 2013, signature stylometric system retrieved from

Miller, C 1984, “Genre as social action.” Quarterly Journal of Speech, pp. 151-167

Morton, A.Q & Michaelson, S 1990, The Qsum Plot. Internal Report CSR-3-90, Department of Computer Science, UK, University of Edinburgh.

Mosteller, F & Wallace, DL 1964, Inference and disputed authorship: The Federalist. Reading, MA: Addison-Wesley Publishing Company, Inc.

Mosteller, F et al. 1964. Inference and disputed authorship: The Federalist. Reading, MA: Addison-Wesley

Mulac, A 1998, The gender-linked language effect: Do language differences really make a difference? In Canary, D & Dindia, K (Eds.), Gender differences and similarities in communication: Critical essays and empirical investigations of gender and gender in interaction (pp. 127-153). Mahwah, NJ: Lawrence Erlbaum.

Mulac, A., Bradac, J. J., & Gibbons, P 2001, Empirical support for the ‘gender as culture’ hypothesis: An intercultural analysis of male/female language differences. Human Communication Research, 27, 121-152.

Nolan, F. and Grabe, E 1996, ‘Preparing a voice lineup’, Forensic Linguistics, 3 i, 74-94

Pennebaker, J. W 1990, ‘Physiological factors influencing the reporting of physical symptoms’. The Science of Self-report: Implications for Research and Practice. Mahwah, NJ: Erlbaum Publishers, pp. 299-316

Pennycook, A 1996, ‘Borrowing others’ words: text, ownership, memory and plagiarism’, TESOL Quarterly, 30, 201-30.

Peter, T 2004, “What is Forensic Linguistics? Recording Police Questioning”. The New York Times. 15 June 2004.

Shields, P. and Rangarajan, N 2013, A Playbook for Research Methods: Integrating Conceptual Frameworks and Project Management. Stillwater, OK: New Forums Press

Shuy, R. W 2001, ‘Discourse Analysis in the Legal Context.’ In The Handbook of Discourse Analysis. Eds. Deborah Schiffrin, Deborah Tannen, and Heidi E. Hamilton. Oxford: Blackwell Publishing. pp. 437–452.

Speech Patterns in Messages Betray a Killer, Elizabeth Svoboda, New York Times May 11, 2009.

Stamatatos et al. 2001, “Computer based authorship attribution without lexical measures”. Computers and the Humanities vol. 35 no. 2, pp. 193-214.

Swallowe, J 2003, A critical review of research into differences between men and women

T.C. Mendenhall, “The Characteristic Curves of Composition,” Science 214, pp. 237-246, 1887.

Tannen, D 1990, You just don’t understand: Women and men in conversation. New York, NY: William Morrow.

Thomson, R and Murachver, T., 2001, Predicting Gender From Electronic Discourse, British Journal of Social Psychology, University of Otago, New Zealand.

Trial of Rehan Asghar, Central Criminal Court, London, January 2008.

Scott, M 2004, Wordsmith tool v. 4.0 accessed from

Uchida, A 1992, When difference is dominance: A critique of the anti-power-based cultural approach to gender differences. Language in Society, 21, 547-568.

Woolls, D 2003, Better tools for the trade and how to use them. The International Journal of Speech Language and Law: Forensic Linguistics, Vol 10 no. 1, 102-112.

Yule, G.U. “On sentence-length as a statistical characteristic of style in prose”. Biometrika 30 (1938): 363-390

Zheng, R., Li, J., Chen, H., & Huang, Z 2006, “A Framework for Authorship Identification of Online Messages: Writing-Style Features and Classification Techniques”. Journal of the American Society for Information Science and Technology, vol. 57 no. 3, pp. 378-393.