Need Help with Programming Science Project

Started by	theguy <kvxdelta@gmail.com>
First post	2014-01-24 02:05 -0800
Last post	2014-01-28 17:31 +1000
Articles	17 — 14 participants

  Need Help with Programming Science Project theguy <kvxdelta@gmail.com> - 2014-01-24 02:05 -0800
    Re: Need Help with Programming Science Project Peter Otten <__peter__@web.de> - 2014-01-24 12:07 +0100
    Re: Need Help with Programming Science Project bob gailer <bgailer@gmail.com> - 2014-01-24 18:38 -0500
    Re: Need Help with Programming Science Project Chris Angelico <rosuav@gmail.com> - 2014-01-25 11:34 +1100
    Re: Need Help with Programming Science Project Ben Finney <ben+python@benfinney.id.au> - 2014-01-25 11:59 +1100
      Re: Need Help with Programming Science Project Roy Smith <roy@panix.com> - 2014-01-24 20:38 -0500
    Re: Need Help with Programming Science Project Terry Reedy <tjreedy@udel.edu> - 2014-01-24 20:10 -0500
    Re: Need Help with Programming Science Project kvxdelta@gmail.com - 2014-01-24 18:42 -0800
      Re: Need Help with Programming Science Project Rustom Mody <rustompmody@gmail.com> - 2014-01-24 19:06 -0800
        Re: Need Help with Programming Science Project theguy <kvxdelta@gmail.com> - 2014-01-24 20:58 -0800
          Re: Need Help with Programming Science Project Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-01-25 20:30 +1300
          Re: Need Help with Programming Science Project Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-25 11:31 +0000
        Re: Need Help with Programming Science Project Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2014-01-25 09:42 -0500
          Re: Need Help with Programming Science Project Rustom Mody <rustompmody@gmail.com> - 2014-01-25 08:15 -0800
      Re: Need Help with Programming Science Project Dave Angel <davea@davea.name> - 2014-01-25 01:38 -0500
    Re: Need Help with Programming Science Project Gregory Ewing <greg.ewing@canterbury.ac.nz> - 2014-01-25 20:25 +1300
    Re: Need Help with Programming Science Project alex23 <wuwei23@gmail.com> - 2014-01-28 17:31 +1000

#64673 — Need Help with Programming Science Project

From	theguy <kvxdelta@gmail.com>
Date	2014-01-24 02:05 -0800
Subject	Need Help with Programming Science Project
Message-ID	<b1831c17-d9f9-4576-9488-7463e76ccf3b@googlegroups.com>

I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given. I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding. I would post the code, but I don't know if it's fine to put it here, as it contains pieces from books. I do believe that would go against copyright laws. If I can figure out a way to put it in without the bits from the stories, then I'll do so, but as of now, any help is appreciated. I understand I'm not exactly making it easy since I'm not putting up any code, but I'm kind of desperate for help here, as I can't seem to find anybody or anything else helpful in any way. Thank you.

#64674

From	Peter Otten <__peter__@web.de>
Date	2014-01-24 12:07 +0100
Message-ID	<mailman.5934.1390561622.18130.python-list@python.org>
In reply to	#64673

theguy wrote:

> I have a science project that involves designing a program which can
> examine a bit of text with the author's name given, then figure out who
> the author is if another piece of example text without the name is given.
> I so far have three different authors in the program and have already put
> in the example text but for some reason, the program always leans toward
> one specific author, Suzanne Collins, no matter what insane number I try
> to put in or how much I tinker with the coding. I would post the code, but
> I don't know if it's fine to put it here, as it contains pieces from
> books. I do believe that would go against copyright laws. If I can figure
> out a way to put it in without the bits from the stories, then I'll do so,
> but as of now, any help is appreciated. I understand I'm not exactly
> making it easy since I'm not putting up any code, but I'm kind of
> desperate for help here, as I can't seem to find anybody or anything else
> helpful in any way. Thank you.

If I were to speculate what your program might look like:

text_samples = {
    "Suzanne Collins": "... some text by collins ...",
    "J. K. Rowling": "... some text by rowling ...",
    #...
}

unknown = "... sample text by unknown author ..."

import random

def calc_match(text1, text2):
    # Placeholder similarity: ignores both texts and returns a random score.
    return random.random()

guessed_author = None
guessed_match = None

for author, text in text_samples.items():
    match = calc_match(unknown, text)
    print(author, match)
    if guessed_author is None or match > guessed_match:
        guessed_author = author
        guessed_match = match

print("The author is", guessed_author)

The important part of this script is not the text samples or the loop to
determine the best match -- it's the algorithm used to determine how well
two texts match.
In the above example that algorithm is encapsulated in the calc_match()
function, and it's really bad: it gives you random numbers between 0 and 1.

For us to help you, it should be sufficient to post the analog of this
function in your code together with a description in plain English of how it
is meant to calculate the similarity between two texts.

Alternatively, instead of the copyrighted texts, grab text samples from
Project Gutenberg with expired copyright.

Make sure that the resulting post is as short as possible -- long text 
samples don't make the post clearer than short ones.
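
As an illustration of what a less random calc_match() could look like, here is a minimal sketch. It is not the OP's method: the two features (average letters per word, average words per sentence), the helper name text_features(), and the scoring formula are all assumptions chosen only to show the shape of such a function.

```python
import re

def text_features(text):
    """Return (average letters per word, average words per sentence)."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    letters = sum(len(w.replace("'", "")) for w in words)
    return (letters / float(len(words)), len(words) / float(len(sentences)))

def calc_match(text1, text2):
    """Higher means more similar; 1.0 means identical feature values."""
    f1 = text_features(text1)
    f2 = text_features(text2)
    # Sum of absolute differences, so the sign of a difference never matters.
    distance = sum(abs(a - b) for a, b in zip(f1, f2))
    return 1.0 / (1.0 + distance)
```

This drops into the loop above unchanged, since it has the same signature and still returns larger numbers for better matches.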

#64707

From	bob gailer <bgailer@gmail.com>
Date	2014-01-24 18:38 -0500
Message-ID	<mailman.5956.1390609805.18130.python-list@python.org>
In reply to	#64673

On 1/24/2014 5:05 AM, theguy wrote:
> I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given. I so far have three different authors in the program and have already put in the example text but for some reason, the program always leans toward one specific author, Suzanne Collins, no matter what insane number I try to put in or how much I tinker with the coding. I would post the code, but I don't know if it's fine to put it here, as it contains pieces from books. I do believe that would go against copyright laws.
AFAIK copyright laws apply to reproducing something for profit. I doubt 
that posting it here will matter.

In any case do post your code; you could trim the fat out of the text if
you need to.

#64708

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-25 11:34 +1100
Message-ID	<mailman.5957.1390610061.18130.python-list@python.org>
In reply to	#64673

On Sat, Jan 25, 2014 at 10:38 AM, bob gailer <bgailer@gmail.com> wrote:
> On 1/24/2014 5:05 AM, theguy wrote:
>>
>> I have a science project that involves designing a program which can
>> examine a bit of text with the author's name given, then figure out who the
>> author is if another piece of example text without the name is given. I so
>> far have three different authors in the program and have already put in the
>> example text but for some reason, the program always leans toward one
>> specific author, Suzanne Collins, no matter what insane number I try to put
>> in or how much I tinker with the coding. I would post the code, but I don't
>> know if it's fine to put it here, as it contains pieces from books. I do
>> believe that would go against copyright laws.
>
> AFAIK copyright laws apply to reproducing something for profit. I doubt that
> posting it here will matter.

Incorrect; posting not-for-profit can still be a violation of
copyright. But as Peter said, the text itself isn't critical. Post
with placeholder text, as he suggested, and we can look at the code.

ChrisA

#64711

From	Ben Finney <ben+python@benfinney.id.au>
Date	2014-01-25 11:59 +1100
Message-ID	<mailman.5959.1390611612.18130.python-list@python.org>
In reply to	#64673

bob gailer <bgailer@gmail.com> writes:

> On 1/24/2014 5:05 AM, theguy wrote:
> > I would post the code, but I don't know if it's fine to put it here,
> > as it contains pieces from books. I do believe that would go against
> > copyright laws.

> AFAIK copyright laws apply to reproducing something for profit.

That's a common misconception that has never been true.

<URL:http://www.faqs.org/faqs/law/copyright/myths/part1/>

Copyright is a legal monopoly in a work, reserving a large set of
actions to the copyright holders. Without license from the copyright
holders, or an exemption under the law, you cannot legally perform those
actions.

Paying money may sometimes help one acquire a license to perform some
reserved actions (though frequently the license is severely restricted,
and frequently the license you need isn't available for any price).

But neither “I'm not seeking a profit” nor “I didn't get any money for it”
is grounds for a copyright exemption in any jurisdiction I've ever
heard of.

-- 
 \           “People are very open-minded about new things, as long as |
  `\         they're exactly like the old ones.” —Charles F. Kettering |
_o__)                                                                  |
Ben Finney

#64715

From	Roy Smith <roy@panix.com>
Date	2014-01-24 20:38 -0500
Message-ID	<roy-E86E5A.20385224012014@news.panix.com>
In reply to	#64711

In article <mailman.5959.1390611612.18130.python-list@python.org>,
 Ben Finney <ben+python@benfinney.id.au> wrote:

> bob gailer <bgailer@gmail.com> writes:
> 
> > On 1/24/2014 5:05 AM, theguy wrote:
> > > I would post the code, but I don't know if it's fine to put it here,
> > > as it contains pieces from books. I do believe that would go against
> > > copyright laws.
> 
> > AFAIK copyright laws apply to reproducing something for profit.
> 
> That's a common misconception that has never been true.
> 
> <URL:http://www.faqs.org/faqs/law/copyright/myths/part1/>
> 
> Copyright is a legal monopoly in a work, reserving a large set of
> actions to the copyright holders. Without license from the copyright
> holders, or an exemption under the law, you cannot legally perform those
> actions.

[The rest of this post is based on my "I am not a lawyer" understanding 
of the law.  Also, this is based on US copyright law; things may be 
different elsewhere, and I haven't the foggiest idea what law applies to 
an international forum such as this]

On the other hand (where Ben Finney's post is the first hand), there is 
the Fair Use Doctrine (FUD), which grants certain exemptions.  The US 
Copyright Office has a page (http://www.copyright.gov/fls/fl102.html) 
about this.

As a real-life example, I believe I can safely invoke the FUD to quote
the leading paragraphs from today's New York Times and New York Post
articles about the same event and give their Flesch-Kincaid Reading Ease
and Grade Level scores, if I were comparing the writing style of the two
newspapers:

----------------------------------------------

NY Times:

The crime gripped the public’s imagination, for both its magnitude and 
its moxie: In the predawn hours of Dec. 11, 1978, a group of masked 
gunmen seized about $6 million in cash and jewels from a cargo building 
at Kennedy International Airport.

Reading Ease Score: 56.6
Grade Level: 10.6

----------------------------------------------

NY Post:

On Dec. 11, 1978, armed mobsters stole $5 million in cash and nearly $1 
million in jewels from a Lufthansa airlines vault at JFK Airport, in 
what would be for decades the biggest-ever heist on US soil.

Reading Ease Score: 76.2
Grade Level: 7.3

----------------------------------------------

The scores above were computed by http://www.readability-score.com/
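
For reference, both scores are simple functions of three counts. The following is a minimal sketch of the standard published Flesch formulas, assuming the word, sentence, and syllable counts are already known. The function names are made up for illustration, and the site above may count syllables or sentences differently, so its numbers need not match exactly.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return (206.835
            - 1.015 * (words / float(sentences))
            - 84.6 * (syllables / float(words)))

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return (0.39 * (words / float(sentences))
            + 11.8 * (syllables / float(words))
            - 15.59)
```

Longer sentences and longer words both push the reading ease down and the grade level up.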

In my opinion, this meets all of the requirements of the FUD.  I'm 
quoting short passages, and using them to critique the writing styles of 
the two papers.

In the OP's case, he's analyzing published works as input to a text 
analysis algorithm.  In my personal opinion, posting samples of those 
texts, for the purpose of discussing how his algorithm works, would be 
well within the bounds of Fair Use.

#64714

From	Terry Reedy <tjreedy@udel.edu>
Date	2014-01-24 20:10 -0500
Message-ID	<mailman.5962.1390612504.18130.python-list@python.org>
In reply to	#64673

On 1/24/2014 7:34 PM, Chris Angelico wrote:
> On Sat, Jan 25, 2014 at 10:38 AM, bob gailer <bgailer@gmail.com> wrote:
>> On 1/24/2014 5:05 AM, theguy wrote:
>>>
>>> I have a science project that involves designing a program which can
>>> examine a bit of text with the author's name given, then figure out who the
>>> author is if another piece of example text without the name is given. I so
>>> far have three different authors in the program and have already put in the
>>> example text but for some reason, the program always leans toward one
>>> specific author, Suzanne Collins, no matter what insane number I try to put
>>> in or how much I tinker with the coding. I would post the code, but I don't
>>> know if it's fine to put it here, as it contains pieces from books. I do
>>> believe that would go against copyright laws.
>>
>> AFAIK copyright laws apply to reproducing something for profit. I doubt that
>> posting it here will matter.
>
> Incorrect; posting not-for-profit can still be a violation of
> copyright. But as Peter said, the text itself isn't critical. Post
> with placeholder text, as he suggested, and we can look at the code.

In the US, short quotations are allowed for 'fair use'.

-- 
Terry Jan Reedy

#64717

From	kvxdelta@gmail.com
Date	2014-01-24 18:42 -0800
Message-ID	<ba404807-08a2-48e6-bfa2-dd8ea8d1acca@googlegroups.com>
In reply to	#64673

Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:

#D.J. Machale - Pendragon
#Pendragon: Book Six - The Rivers of Zadaa
#Page 98
#The sample sentences for this author. I put each sentence into a separate variable because I knew no other way to divide the sentence. I also removed spaces so they wouldn't be counted.
djmachale_1 = 'WheretonowIaskedLoor'
djmachale_2 = 'ToaplacewherewewillnotbedisturbedbyBatuorRokadorsheanswered'
djmachale_3 = 'WelefttheroomfollowingLoorthroughthetwistingtunnelthatIhadwalkedthroughseveraltimesbeforeonvisitingtoZadaa'
djmachale_4 = 'Shortlyweleftthesmallertunneltoenterthehugecavernthatonceheldanundergroundriver'
djmachale_5 = 'WhenSpaderandIwerefirstheretherewasafour-storywaterfallononesideoftheimmensecavernthatfedadeepragingriver'
djmachale_6 = 'Nowtherewasonlyadribbleofwaterthatfellfromarockymouthintoapathetictrickleofastreamatthebottomofthemostlydryriverbed'
djmachale_7 = 'WhathappenedhereAlderasked'
djmachale_8 = 'ThereisalottotellLooranswered'
djmachale_9 = 'Later'
djmachale_10 = 'Alderacceptedthat'
djmachale_11 = 'Hewasaneasyguy'
djmachale_12 = 'Loorledustotheopeningthatwasoncehiddenbehindthewaterfallbutwasnowinplainsight'
djmachale_13 = 'Weclimbedafewstonestairssteppedthroughtheportalandenteredaroomthatheldthewater-controldeviceIhavedescribedtoyoubefore'
djmachale_14 = 'Toremindyouguysthisthinglookedlikeoneofthosegiantpipe-organsthatyouseeinchurch'
djmachale_15 = 'Butthesepipesranhorizontallydisappearingintotherockwalloneithersideoftheroom'
djmachale_16 = 'Therewasaplatforminfrontofitthatheldanamazingarrayofswitchesandvalves'
djmachale_17 = 'WhenIfirstcameheretherewasaRokadorengineeronthatplatformfeverishlyworkingthecontrolslikeanexpert'
djmachale_18 = 'Ihadnoideawhatthedevicedidotherthanknowingithadsomethingtodowithcontrollingtheflowofwaterfromtherivers'
djmachale_19 = 'Theguyhadmapsanddiagramsthathereferredtowhilehequicklymadeadjustmentsandtoggledswitches'
djmachale_20 = 'Nowtheplatformwasempty'

#djmwords contains the number of words in each sentence
#djmwords_total is the total word count between all the samples
djmwords = [6, 15, 22, 17, 26, 29, 5, 8, 1, 3, 5, 19, 25, 18, 16, 17, 20, 25, 18, 5]
djmwords_total = sum(djmwords)
avgWORDS_per_SENTENCE_DJMACHALE = (djmwords_total/20)

#Each variable becomes the total number of letters in each sentence
djmachale_1 = len(djmachale_1)
djmachale_2 = len(djmachale_2)
djmachale_3 = len(djmachale_3)
djmachale_4 = len(djmachale_4)
djmachale_5 = len(djmachale_5)
djmachale_6 = len(djmachale_6)
djmachale_7 = len(djmachale_7)
djmachale_8 = len(djmachale_8)
djmachale_9 = len(djmachale_9)
djmachale_10 = len(djmachale_10)
djmachale_11 = len(djmachale_11)
djmachale_12 = len(djmachale_12)
djmachale_13 = len(djmachale_13)
djmachale_14 = len(djmachale_14)
djmachale_15 = len(djmachale_15)
djmachale_16 = len(djmachale_16)
djmachale_17 = len(djmachale_17)
djmachale_18 = len(djmachale_18)
djmachale_19 = len(djmachale_19)
djmachale_20 = len(djmachale_20)

#DJMACHALE_TOTAL is the total letter count between all the samples
DJ_Machale = [djmachale_1, djmachale_2, djmachale_3, djmachale_4, djmachale_5, djmachale_6, djmachale_7, djmachale_8, djmachale_9, djmachale_10, djmachale_11, djmachale_12, djmachale_13, djmachale_14, djmachale_15, djmachale_16, djmachale_17, djmachale_18, djmachale_19, djmachale_20]
DJMACHALE_TOTAL = (djmachale_1+djmachale_2+djmachale_3+djmachale_4+djmachale_5+djmachale_6+djmachale_7+djmachale_8+djmachale_9+djmachale_10+djmachale_11+djmachale_12+djmachale_13+djmachale_14+djmachale_15+djmachale_16+djmachale_17+djmachale_18+djmachale_19+djmachale_20)
avgLETTERS_per_SENTENCE_DJMACHALE = (DJMACHALE_TOTAL/20)

avgLETTERS_per_WORD_DJMACHALE = (DJMACHALE_TOTAL/djmwords_total)

#----------------------------------------------------------------------------------------------------------------------------------------------------------------------
#Suzanne Collins - The Hunger Games
#The Hunger Games
#Page 103
suzannecollins_1 = 'AsIstridetowardtheelevatorIflingmybowtoonesideandmyquivertotheother'
suzannecollins_2 = 'IbrushpastthegapingAvoxeswhoguardtheelevatorsandhitthenumbertwelvebuttonwithmyfist'
suzannecollins_3 = 'ThedoorsslidetogetherandIzipupward'
suzannecollins_4 = 'Iactuallymakeitbacktomyfloorbeforethetearsstartrunningdownmycheeks'
suzannecollins_5 = 'IcanheartheotherscallingmefromthesittingroombutIflydownthehallintomyroomboltthedoorandflingmyselfontomybed'
suzannecollins_6 = 'ThenIreallybegintosob'
suzannecollins_7 = 'NowIvedoneit'
suzannecollins_8 = 'NowIveruinedeverything'
suzannecollins_9 = 'IfIdevenstoodaghostofachanceitvanishedwhenIsentthatarrowflyingattheGamemakers'
suzannecollins_10 = 'Whatwilltheydotomenow'
suzannecollins_11 = 'Arrestme'
suzannecollins_12 = 'Executeme'
suzannecollins_13 = 'CutmytongueandturnintoanAvoxsoIcanwaitonthefutretributesofPanem'
suzannecollins_14 = 'WhatwasIthinkingshootingattheGamemakers'
suzannecollins_15 = 'OfcourseIwasntIwasshootingatthatapplebecauseIwassoangryatbeingignored'
suzannecollins_16 = 'Iwasnttryingtokilloneofthem'
suzannecollins_17 = 'IfIweretheydbedead'
suzannecollins_18 = 'Ohwhatdoesitmatter'
suzannecollins_19 = 'ItsnotlikeIwasgoingtowintheGamesanyway'
suzannecollins_20 = 'Whocareswhattheydotome'

suzcwords = [19, 19, 8, 16, 6, 4, 4, 20, 7, 2, 2, 19, 8, 18, 8, 6, 5, 11, 7]
suzcwords_total = (19+19+8+16+6+4+4+20+7+2+2+19+8+18+8+6+5+11+7)
avgWORDS_per_SENTENCE_SUZANNECOLLINS = (suzcwords_total/20)

suzannecollins_1 = len(suzannecollins_1)
suzannecollins_2 = len(suzannecollins_2)
suzannecollins_3 = len(suzannecollins_3)
suzannecollins_4 = len(suzannecollins_4)
suzannecollins_5 = len(suzannecollins_5)
suzannecollins_6 = len(suzannecollins_6)
suzannecollins_7 = len(suzannecollins_7)
suzannecollins_8 = len(suzannecollins_8)
suzannecollins_9 = len(suzannecollins_9)
suzannecollins_10 = len(suzannecollins_10)
suzannecollins_11 = len(suzannecollins_11)
suzannecollins_12 = len(suzannecollins_12)
suzannecollins_13 = len(suzannecollins_13)
suzannecollins_14 = len(suzannecollins_14)
suzannecollins_15 = len(suzannecollins_15)
suzannecollins_16 = len(suzannecollins_16)
suzannecollins_17 = len(suzannecollins_17)
suzannecollins_18 = len(suzannecollins_18)
suzannecollins_19 = len(suzannecollins_19)
suzannecollins_20 = len(suzannecollins_20)

Suzanne_Collins = [suzannecollins_1, suzannecollins_2, suzannecollins_3, suzannecollins_4, suzannecollins_5, suzannecollins_6, suzannecollins_7, suzannecollins_8, suzannecollins_9, suzannecollins_10, suzannecollins_11, suzannecollins_12, suzannecollins_13, suzannecollins_14, suzannecollins_15, suzannecollins_16, suzannecollins_17, suzannecollins_18, suzannecollins_19, suzannecollins_20]
SUZANNECOLLINS_TOTAL = (suzannecollins_1+suzannecollins_2+suzannecollins_3+suzannecollins_4+suzannecollins_5+suzannecollins_6+suzannecollins_7+suzannecollins_8+suzannecollins_9+suzannecollins_10+suzannecollins_11+suzannecollins_12+suzannecollins_13+suzannecollins_14+suzannecollins_15+suzannecollins_16+suzannecollins_17+suzannecollins_18+suzannecollins_19+suzannecollins_20)
avgLETTERS_per_SENTENCE_SUZANNECOLLINS = (SUZANNECOLLINS_TOTAL/20)

avgLETTERS_per_WORD_SUZANNECOLLINS = (SUZANNECOLLINS_TOTAL/suzcwords_total)

#-----------------------------------------------------------------------------------------------------------------------------------------
#Richard Peck - The Last Safe Place on Earth
#The Last Safe Place on Earth
#Page 1-2

richardpeck_1 = 'HalloweensaweekandahalfawayHomecomingtheweekendafter'
richardpeck_2 = 'ItsthattimeofyearandcominghomeImthinkingWhatagreateveningtobegoingsomewherewithagirlmyarmdrapedoverhersoftshoulderthetwoofusscuffingthroughtheleaves'
richardpeck_3 = 'ImseeinggirlseverywhereIlooksomeofthemrealmostnot'
richardpeck_4 = 'Iseegirlsintheshapesthetreetrunksmakeandintheformationsoftheclouds'
richardpeck_5 = 'Iseealotofgirlsthisfall'
richardpeck_6 = 'Imnotobsessed'
richardpeck_7 = 'Imintenthgrade'
richardpeck_8 = 'SoIwascominghomeonfoot'
richardpeck_9 = 'Therewereacoupleofbooksinmybackpack'
richardpeck_10 = 'OnewasRayBradburysFahrenheit451whichweweresupposedtobereadingforMrsLenkysclass'
richardpeck_11 = 'Iplannedtobuckledownonschoolworkandreallyhitthebooksnextyearsenioryearatthelatest'
richardpeck_12 = 'MeanwhileIwastakingeverydayasitcametryingtogetatoeholdonhighschool'
richardpeck_13 = 'ButthefactisIdidntreallythinkhighschoolwashappeninguntilIfoundagirl'
richardpeck_14 = 'ItwasapostcardeveningalongTranquilyLanetheactualnameofourstreet'
richardpeck_15 = 'Thehazewaslikebonfiresmoke,thoughwecantburnleaveswithinthevillagelimits'
richardpeck_16 = 'Itwasared-and-goldworldwithpurpleeveningcomingon'
richardpeck_17 = 'OurhouseisthebigwhitebrickwiththegreenshutterslikeahouseonaChristmascard'
richardpeck_18 = 'Weusedtoliveinthewesternsuburbs'
richardpeck_19 = 'ButwhenDianaandIwereinsixthgradethejuniorhighouttherehadacoupleofknifefightsthatmadethenews'
richardpeck_20 = 'Thegangsweremovinginsowemovedout'

richwords = [11, 36, 12, 17, 8, 3, 4, 7, 9 , 17, 19, 17, 17, 14, 15, 12, 18, 9, 23, 9]
richwords_total = (11+36+12+17+8+3+4+7+9+17+19+17+17+14+15+12+18+9+23+9)
avgWORDS_per_SENTENCE_RICHARDPECK = (richwords_total/20)

richardpeck_1 = len(richardpeck_1)
richardpeck_2 = len(richardpeck_2)
richardpeck_3 = len(richardpeck_3)
richardpeck_4 = len(richardpeck_4)
richardpeck_5 = len(richardpeck_5)
richardpeck_6 = len(richardpeck_6)
richardpeck_7 = len(richardpeck_7)
richardpeck_8 = len(richardpeck_8)
richardpeck_9 = len(richardpeck_9)
richardpeck_10 = len(richardpeck_10)
richardpeck_11 = len(richardpeck_11)
richardpeck_12 = len(richardpeck_12)
richardpeck_13 = len(richardpeck_13)
richardpeck_14 = len(richardpeck_14)
richardpeck_15 = len(richardpeck_15)
richardpeck_16 = len(richardpeck_16)
richardpeck_17 = len(richardpeck_17)
richardpeck_18 = len(richardpeck_18)
richardpeck_19 = len(richardpeck_19)
richardpeck_20 = len(richardpeck_20)

Richard_Peck = [richardpeck_1, richardpeck_2, richardpeck_3, richardpeck_4, richardpeck_5, richardpeck_6, richardpeck_7, richardpeck_8, richardpeck_9, richardpeck_10, richardpeck_11, richardpeck_12, richardpeck_13, richardpeck_14, richardpeck_15, richardpeck_16, richardpeck_17, richardpeck_18, richardpeck_19, richardpeck_20]
RICHARDPECK_TOTAL = (richardpeck_1+richardpeck_2+richardpeck_3+richardpeck_4+richardpeck_5+richardpeck_6+richardpeck_7+richardpeck_8+richardpeck_9+richardpeck_10+richardpeck_11+richardpeck_12+richardpeck_13+richardpeck_14+richardpeck_15+richardpeck_16+richardpeck_17+richardpeck_18+richardpeck_19+richardpeck_20)
avgLETTERS_per_SENTENCE_RICHARDPECK = (RICHARDPECK_TOTAL/20)

avgLETTERS_per_WORD_RICHARDPECK = (RICHARDPECK_TOTAL/richwords_total)

#---------------------------------------------------------------------------------------------------------
#EXAMPLE SLOT
example1 = 'Wepulledthefilmfortheten-thirtynewstohearhowtheWarriorshaddoneagainsttheLakeVillaVikinsontheVikingshomefield'
example2 = 'WedlostbutitwascloseandC.E.andIwentbacktotheDracula'
example3 = 'Itwasgettinglatewhenthephonerang'
example4 = 'DeepinhispopcornworldDaddidntanswerit'
example5 = 'Ipickedupinthedenanditwasawoman'
example6 = 'IwavedatC.E.toturndownthesoundsbecausethewomanwascrying'
example7 = 'Whoisthis'
example8 = 'ItwasMrsCunningham'
example9 = 'Icantfindmydaughtershesaid'
example10 = 'IcantfindPace'
example11 = 'SheshereIsaid'
example12 = 'Shesupstairswithmysister'
example13 = 'AmomentofsilencethenandMrsCunninghamsvoiceshuddered'
example14 = 'IssheYoutellhertostayrightthereImcomingover'
example15 = 'SoweneverdidseehowtheDraculafilmended'
example16 = 'HeyPaceIsaidupthestairs'
example17 = 'Yourmomscomingover'
example18 = 'ThisbroughteverybodytothefronthallPacefirst'
example19 = 'DianawasbehindherandMominherrobeandMarnieinherpajamas'
example20 = 'BeforeDrandMrsCunninghamgothereDadwasinthefronthalltooinhisapron'

examplewords = [25, 15, 8, 9, 11, 14, 3, 4, 6, 4, 4, 5, 10, 7, 4, 9, 14, 17]
examplewords_total = sum(examplewords)
avgWORDS_per_SENTENCE_EXAMPLE = (examplewords_total/20)

example1 = len(example1)
example2 = len(example2)
example3 = len(example3)
example4 = len(example4)
example5 = len(example5)
example6 = len(example6)
example7 = len(example7)
example8 = len(example8)
example9 = len(example9)
example10 = len(example10)
example11 = len(example11)
example12 = len(example12)
example13 = len(example13)
example14 = len(example14)
example15 = len(example15)
example16 = len(example16)
example17 = len(example17)
example18 = len(example18)
example19 = len(example19)
example20 = len(example20)

example = [example1, example2, example3, example4, example5, example6, example7, example8, example9, example10, example11, example12, example13, example14, example15, example16, example17, example18, example19, example20]
EXAMPLE_TOTAL = (example1+example2+example3+example4+example5+example6+example7+example8+example9+example10+example11+example12+example13+example14+example15+example16+example17+example18+example19+example20)
avgLETTERS_per_SENTENCE_EXAMPLE = (EXAMPLE_TOTAL/20)

avgLETTERS_per_WORD_EXAMPLE = (EXAMPLE_TOTAL/examplewords_total)

#------------------------------------------------------------------------------------------------------------------------------
#Tests for similarities and prints (displays) the author whom the program believes to have written the example text

#I used a scoreboard system of sorts to determine which author was most similar to the example. Each time the program finds a match to one in each of the tests, it adds a point to that author here.
DJMachalePossibility = 0
SuzanneCollinsPossibility = 0
RichardPeckPossibility = 0

#Matches average letters/sentence in example with most likely author
#I attempted to find the closest value by subtracting the example's value from each of the authors. The author with the smallest distance from the example would be marked up one point.

avgLPS_DJ_EXAMPLE = (avgLETTERS_per_SENTENCE_DJMACHALE-avgLETTERS_per_SENTENCE_EXAMPLE)

avgLPS_SUZC_EXAMPLE = (avgLETTERS_per_SENTENCE_SUZANNECOLLINS-avgLETTERS_per_SENTENCE_EXAMPLE)

avgLPS_RICH_EXAMPLE = (avgLETTERS_per_SENTENCE_RICHARDPECK-avgLETTERS_per_SENTENCE_EXAMPLE)

LPS_Comparisons = [avgLPS_DJ_EXAMPLE, avgLPS_SUZC_EXAMPLE, avgLPS_RICH_EXAMPLE]
avgLPS_Match = min(LPS_Comparisons)

if avgLPS_Match == avgLPS_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgLPS_Match == avgLPS_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgLPS_Match == avgLPS_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

#Matches average words/sentence in example with most likely author 


avgWPS_DJ_EXAMPLE = (avgWORDS_per_SENTENCE_DJMACHALE-avgWORDS_per_SENTENCE_EXAMPLE)

avgWPS_SUZC_EXAMPLE = (avgWORDS_per_SENTENCE_SUZANNECOLLINS-avgWORDS_per_SENTENCE_EXAMPLE)

avgWPS_RICH_EXAMPLE = (avgWORDS_per_SENTENCE_RICHARDPECK-avgWORDS_per_SENTENCE_EXAMPLE)


WPS_Comparisons = [avgWPS_DJ_EXAMPLE, avgWPS_SUZC_EXAMPLE, avgWPS_RICH_EXAMPLE]
avgWPS_Match = min(WPS_Comparisons)

if avgWPS_Match == avgWPS_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgWPS_Match == avgWPS_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgWPS_Match == avgWPS_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

#Matches average letters/word in example with most likely author


avgLPW_DJ_EXAMPLE = (avgLETTERS_per_WORD_DJMACHALE-avgLETTERS_per_WORD_EXAMPLE)

avgLPW_SUZC_EXAMPLE = (avgLETTERS_per_WORD_SUZANNECOLLINS-avgLETTERS_per_WORD_EXAMPLE)

avgLPW_RICH_EXAMPLE = (avgLETTERS_per_WORD_RICHARDPECK-avgLETTERS_per_WORD_EXAMPLE)

LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
avgLPW_Match = min(LPW_Comparisons)

if avgLPW_Match == avgLPW_DJ_EXAMPLE:
    DJMachalePossibility = (DJMachalePossibility+1)

if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
    SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)

if avgLPW_Match == avgLPW_RICH_EXAMPLE:
    RichardPeckPossibility = (RichardPeckPossibility+1)

AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]

#The author with the most points on them would be considered the program's guess.
Match = max(AUTHOR_SCOREBOARD)

print AUTHOR_SCOREBOARD

if Match == DJMachalePossibility:
    print "The author should be D.J. Machale."

if Match == SuzanneCollinsPossibility:
    print "The author should be Suzanne Collins."

if Match == RichardPeckPossibility:
    print "The author should be Richard Peck."


------------------------------------------------------------------------------
Hopefully, there won't be any copyright issues. Like someone said, this should be fair use. The problem I'm having is that it always gives Suzanne Collins, no matter what example is put in. I'm really sorry that the code isn't very clean. Like I said, it was rushed and I have little experience. I'm just desperate for help as it's a bit too late to change projects, so I have to stick with this. Also, if it's of any importance, I have to be able to remove or add any of the "average letters per word/average letters per sentence/average words per sentence things" to test the program at different levels of strictness. I would GREATLY appreciate any help with this. Thank you!
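
The "closest value" step described in the code comments can be sketched in isolation as follows. This is a minimal sketch rather than a fix checked against the full program: it assumes "closest" means the smallest absolute difference, and the function name closest_author() and the sample numbers are hypothetical. It also shows that min() over signed differences picks the most negative difference rather than the one nearest zero, which may be related to one author winning regardless of the input.

```python
def closest_author(example_value, author_values):
    """Return the author whose value is nearest to example_value."""
    return min(author_values,
               key=lambda name: abs(author_values[name] - example_value))

# Hypothetical averages, for illustration only.
avg_letters_per_sentence = {
    "D.J. Machale": 70.0,
    "Suzanne Collins": 40.0,
    "Richard Peck": 55.0,
}
example = 52.0

# Signed differences: 18.0, -12.0 and 3.0 respectively.
signed = dict((name, value - example)
              for name, value in avg_letters_per_sentence.items())

print(min(signed, key=signed.get))                        # Suzanne Collins
print(closest_author(example, avg_letters_per_sentence))  # Richard Peck
```

With signed differences, whichever author has the smallest average comes out on top for any example, so the two approaches only agree by accident.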

#64718

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-24 19:06 -0800
Message-ID	<d87429ec-7143-40a4-8079-456a2e33a040@googlegroups.com>
In reply to	#64717

On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:

<snipped>

Ewwww!

If you (or anyone with basic Python experience) rewrite that code, it will become
1/50th the size and all that you call 'code' will reside in data files.

That can mean one of json, xml, yml, ini, pickle, csv, etc.

If you need further help in understanding/choosing, post back
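
A minimal sketch of the data-file idea, assuming JSON is the chosen format. The statistic names and every number below are placeholders invented for illustration, not values from the original program, and the JSON is held in a string only so the example runs on its own; in practice it would be read from a file.

```python
import json

# Placeholder numbers, invented for illustration only; they are not
# measurements taken from the program discussed in this thread.
AUTHOR_STATS_JSON = """
{
    "D.J. Machale": {"letters_per_word": 4.2, "words_per_sentence": 12.0},
    "Suzanne Collins": {"letters_per_word": 4.4, "words_per_sentence": 14.5},
    "Richard Peck": {"letters_per_word": 4.0, "words_per_sentence": 11.0}
}
"""

# json.loads turns the text into ordinary dictionaries and floats.
author_stats = json.loads(AUTHOR_STATS_JSON)

for author in sorted(author_stats):
    stats = author_stats[author]
    print("%s: %.1f letters/word, %.1f words/sentence"
          % (author, stats["letters_per_word"], stats["words_per_sentence"]))
```

Adding a fourth author then means adding one line of data rather than another block of code.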

#64720

From	theguy <kvxdelta@gmail.com>
Date	2014-01-24 20:58 -0800
Message-ID	<1eeb0e4b-ff9a-4b5e-86ec-773ca98fbf1b@googlegroups.com>
In reply to	#64718

On Friday, January 24, 2014 7:06:55 PM UTC-8, Rustom Mody wrote:
> On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
> > Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
>
> <snipped>
>
> Ewwww!
>
> If you (or anyone with basic python experience) rewrites that code, it will become
> 1/50th the size and all that you call 'code' will reside in data files.
>
> That can mean one of json, xml, yml, ini, pickle, ini, csv  etc
>
> If you need further help in understanding/choosing, post back

I know. I'm kind of ashamed of the code, but it does the job I need it to up to a certain point, where it for some reason continually gives me Suzanne Collins as the author. It always gives three points to her name in the AUTHOR_SCOREBOARD list. The code, though, is REALLY bad. I'm trying to simply get it to do the things needed for the program. If I could get it to actually calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems would be solved. Luckily, I'm not being graded on the elegance or conciseness of my code. Thank you for the constructive criticism, though I am really seeking help with my little problem involving that dang scoreboard. Thank you.

#64730

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2014-01-25 20:30 +1300
Message-ID	<bkh7hiFkehfU1@mid.individual.net>
In reply to	#64720

theguy wrote:
> If I could get it to actually
> calculate the "points" for AUTHOR_SCOREBOARD properly, then all my problems
> would be solved.

Have you tried getting it to print out the values
it's getting for the scores, and comparing them
with what you calculate by hand?
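
One way to do that, as a rough sketch: print every intermediate value next to the author it belongs to, so each can be checked against a hand calculation. The helper name `difference_report` and all the numbers are made up for illustration.

```python
# Placeholder averages, invented for illustration only.
averages = {
    "D.J. Machale": 4.1,
    "Suzanne Collins": 4.4,
    "Richard Peck": 4.0,
}
example_average = 4.05


def difference_report(averages, example_average):
    """Print and return (author, average, signed difference) rows."""
    rows = []
    for author, avg in sorted(averages.items()):
        diff = avg - example_average
        rows.append((author, avg, diff))
        # The sign is printed explicitly so negative differences stand out.
        print("%-16s avg=%.3f diff=%+.3f" % (author, avg, diff))
    return rows


rows = difference_report(averages, example_average)
```

If the printed rows never mention one of the authors, or the differences carry unexpected signs, that narrows down where the scoring goes wrong.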

-- 
Greg

#64740

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2014-01-25 11:31 +0000
Message-ID	<lc079s$5fh$1@dont-email.me>
In reply to	#64720

On Fri, 24 Jan 2014 20:58:50 -0800, theguy wrote:

> I know. I'm kind of ashamed of the code, but it does the job I need it
> to up to a certain point

OK, well first of all take a step back and look at the problem.

You have n exemplars, each from a known author.

You analyse each exemplar, and determine some statistics for it.

You then take your unknown sample, determine the same statistics for the 
unknown sample.

Finally, you compare each exemplar's stats with the sample's stats to try 
and find a best match.

So, perhaps you want a dictionary of { author: statistics }, and a 
function to analyse a piece of text, which might call other functions to 
get eg avg words / sentence, avg letters / sentence, avg word length, and 
the sd in each, and the short word ratio (words <= 3 chars vs words >= 4 
chars) and some other statistics.

Given the statistics for each exemplar, you might store these in your 
dictionary as a tuple.

this isn't python, it's a description of an algorithm, it just looks a 
bit pythonic:

# tuple of weightings applied to different stats
stat_weightings = ( 1.0, 1.3, 0.85, ...... )

def get_some_stat( t ):
	# calculate some numerical statistic on a block of text
	# return it

def analyse( f ):
	text = read_file( f )
	return ( get_some_stat( text ), ...... )

exemplar_data = {}

for exemplar_file in exemplar_files:
	exemplar_data[author] = analyse( exemplar_file )

sample_data = analyse( sample_file )

scores = {}

tmp = 0
x = 0

# score for a piece of work is sum of ( abs diff of stat * weighting )
# for all the stats, lower score = closer match
for author in keys( exemplar_data ):
	tmp = 0
	for i in range( len( exemplar_data[ author ] ) ):
		tmp = tmp + abs( exemplar_data[ author ][ i ] - sample_data[ i ] ) * stat_weightings[ i ]
	scores[ author ] = tmp
	if tmp > x:
		x = tmp

names = []

for author in keys( scores ):
	if scores[ author ] < x:
		x = scores[ author ]
		names = [ author ]
	elif scores[ author ] == x:
		names.append( author )

print "the best matching author(s) is/are: ", names

Then all you have to do is find enough ways to calculate stats, and the 
magic coefficients to use in the stat_weightings
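
For readers who want something they can run, here is a small sketch of the same outline. It is only one possible reading of the description above: it uses two statistics (average word length and average words per sentence), takes the absolute difference for each, and uses equal weightings. The author names, the texts and the weightings are all invented placeholders.

```python
import re


def analyse(text):
    """Return (average word length, average words per sentence)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_word_length = sum(len(w) for w in words) / float(len(words))
    avg_words_per_sentence = len(words) / float(len(sentences))
    return (avg_word_length, avg_words_per_sentence)


def best_matches(exemplar_data, sample_stats, weightings):
    """Score every author; a lower score means a closer match."""
    scores = {}
    for author, stats in exemplar_data.items():
        scores[author] = sum(
            abs(e - s) * w
            for e, s, w in zip(stats, sample_stats, weightings)
        )
    lowest = min(scores.values())
    # Keep every author tied for the lowest score.
    names = sorted(a for a, score in scores.items() if score == lowest)
    return names, scores


# Invented placeholder texts; real exemplars would be read from files.
exemplar_texts = {
    "Author A": "Short words win. We run. We hop.",
    "Author B": "Considerably lengthier vocabulary characterises "
                "this particular exemplar sentence.",
}
exemplar_data = dict((a, analyse(t)) for a, t in exemplar_texts.items())
sample_stats = analyse("We sit. We nap. Cats purr.")

names, scores = best_matches(exemplar_data, sample_stats, (1.0, 1.0))
print("the best matching author(s) is/are:", names)
```

Adding a statistic means returning one more value from `analyse` and one more entry in the weightings tuple; the scoring loop does not change.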

-- 
Denis McMahon, denismfmcmahon@gmail.com

#64745

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2014-01-25 09:42 -0500
Message-ID	<mailman.5977.1390660939.18130.python-list@python.org>
In reply to	#64718

On Fri, 24 Jan 2014 19:06:55 -0800 (PST), Rustom Mody
<rustompmody@gmail.com> declaimed the following:

>On Saturday, January 25, 2014 8:12:41 AM UTC+5:30, kvxd...@gmail.com wrote:
>> Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
>
><snipped>
>
>Ewwww!

	I think my reaction was more guttural -- <barf!>

>
>If you (or anyone with basic python experience) rewrites that code, it will become
>1/50th the size and all that you call 'code' will reside in data files.
>
>That can mean one of json, xml, yml, ini, pickle, ini, csv  etc
>
>If you need further help in understanding/choosing, post back

	Heck, at the very least turn all those xxxx_99 variables into single
lists.... The posted code looks like something from 1968 K&K BASIC.

-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

#64747

From	Rustom Mody <rustompmody@gmail.com>
Date	2014-01-25 08:15 -0800
Message-ID	<2ebec4b9-66dd-4ddc-94a5-1b431e7b0edf@googlegroups.com>
In reply to	#64745

On Saturday, January 25, 2014 8:12:20 PM UTC+5:30, Dennis Lee Bieber wrote:
> 
> 	Heck, at the very least turn all those xxxx_99 variables into single
> lists.... The posted code looks like something from 1968 K&K BASIC.

Yes, that's correct.

My suggestion of data-files is a second step.

A first step is just converting to using internal (python) data structures.
[And not 1968 BASIC scalars!]
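
A rough sketch of that first step, assuming the numbered variables held individual words. The dictionary name `sample_words` and the placeholder words are hypothetical; the point is only that one dictionary of lists plus one function can replace a long run of separately named scalars.

```python
# Hypothetical stand-in for variables such as word_01, word_02, ...
# The words themselves are placeholders, not text from any real book.
sample_words = {
    "D.J. Machale": ["placeholder", "words", "go", "here"],
    "Suzanne Collins": ["more", "placeholder", "words"],
    "Richard Peck": ["and", "some", "others"],
}


def average_letters_per_word(words):
    """Average word length, using float division so it also works on Python 2."""
    return sum(len(w) for w in words) / float(len(words))


# One loop replaces a separate hand-written calculation per author.
averages = dict(
    (author, average_letters_per_word(words))
    for author, words in sample_words.items()
)

for author in sorted(averages):
    print("%s: %.2f" % (author, averages[author]))
```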

#64724

From	Dave Angel <davea@davea.name>
Date	2014-01-25 01:38 -0500
Message-ID	<mailman.5966.1390631789.18130.python-list@python.org>
In reply to	#64717

 kvxdelta@gmail.com Wrote in message:
> Alright. I have the code here. Now, I just want to note that the code was not designed to work "quickly" or be very well-written. It was rushed, as I only had a few days to finish the work, and by the time I wrote the program, I hadn't worked with Python (which I never had TOO much experience with anyways) for a while. (About a year, maybe?) It was a bit foolish to take up the project, but here's the code anyways:
.........
>
> 
> LPW_Comparisons = [avgLPW_DJ_EXAMPLE, avgLPW_SUZC_EXAMPLE, avgLPW_SUZC_EXAMPLE]
> avgLPW_Match = min(LPW_Comparisons)
> 
> if avgLPW_Match == avgLPW_DJ_EXAMPLE:
>     DJMachalePossibility = (DJMachalePossibility+1)
> 
> if avgLPW_Match == avgLPW_SUZC_EXAMPLE:
>     SuzanneCollinsPossibility = (SuzanneCollinsPossibility+1)
> 
> if avgLPW_Match == avgLPW_RICH_EXAMPLE:
>     RichardPeckPossibility = (RichardPeckPossibility+1)
> 
> AUTHOR_SCOREBOARD = [DJMachalePossibility, SuzanneCollinsPossibility, RichardPeckPossibility]
> 
> #The author with the most points on them would be considered the program's guess.
> Match = max(AUTHOR_SCOREBOARD)
> 
> print AUTHOR_SCOREBOARD
> 
> if Match == DJMachalePossibility:
>     print "The author should be D.J. Machale."
> 
> if Match == SuzanneCollinsPossibility:
>     print "The author should be Suzanne Collins."
> 
> if Match == RichardPeckPossibility:
>     print "The author should be Richard Peck."
> 
> 
> ------------------------------------------------------------------------------
> Hopefully, there won't be any copyright issues. Like someone said, this should be fair use. The problem I'm having is that it always gives Suzanne Collins, no matter what example is put in. I'm really sorry that the code isn't very clean. Like I said, it was rushed and I have little experience. I'm just desperate for help as it's a bit too late to change projects, so I have to stick with this. Also, if it's of any importance, I have to be able to remove or add any of the "average letters per word/average letters per sentence/average words per sentence things" to test the program at different levels of strictness. I would GREATLY appreciate any help with this. Thank you!
> 

1. When you calculate averages, you should be using floating
 point divide.
         avg = float(a) / b

2. When you subtract two values, you need an abs, because
 otherwise min() will home in on the negative values.

3. Realize that having Match agree with more than one is not
 that unlikely.

4. If you want to vary what you call strictness, you're really
 going to need to learn about functions.
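
The four points can be sketched together as below. This is an illustration under stated assumptions, not the original program: the function names and every number are invented placeholders. Note also that the quoted LPW_Comparisons list contains avgLPW_SUZC_EXAMPLE twice and never avgLPW_RICH_EXAMPLE, which may be a further contributor; building the comparison from a dictionary avoids that kind of slip.

```python
def average(total, count):
    # Point 1: force floating point division (this matters on Python 2).
    return float(total) / count


def closest_authors(author_averages, example_average):
    # Point 2: compare absolute distances, not signed differences.
    distances = dict(
        (author, abs(value - example_average))
        for author, value in author_averages.items()
    )
    smallest = min(distances.values())
    # Point 3: return every author tied for the smallest distance.
    return sorted(a for a, d in distances.items() if d == smallest)


def score_authors(stats_by_author, example_stats, enabled_stats):
    # Point 4: the caller chooses which statistics count ("strictness").
    points = dict((author, 0) for author in stats_by_author)
    for stat in enabled_stats:
        averages = dict(
            (author, stats[stat]) for author, stats in stats_by_author.items()
        )
        for author in closest_authors(averages, example_stats[stat]):
            points[author] += 1
    return points


# Placeholder values, invented for illustration only.
stats_by_author = {
    "D.J. Machale": {"letters_per_word": 4.5, "words_per_sentence": 12.0},
    "Suzanne Collins": {"letters_per_word": 3.0, "words_per_sentence": 20.0},
    "Richard Peck": {"letters_per_word": 4.1, "words_per_sentence": 13.0},
}
example_stats = {"letters_per_word": 4.0, "words_per_sentence": 12.5}

print(score_authors(stats_by_author, example_stats, ["letters_per_word"]))
print(score_authors(stats_by_author, example_stats,
                    ["letters_per_word", "words_per_sentence"]))
```

With these placeholder numbers, taking min() of the signed letters-per-word differences would select Suzanne Collins (3.0 - 4.0 = -1.0) even though she is the furthest from the example; the absolute distances select Richard Peck instead. The words-per-sentence statistic ends in a tie, so two authors each receive a point for it.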


-- 
DaveA

#64729

From	Gregory Ewing <greg.ewing@canterbury.ac.nz>
Date	2014-01-25 20:25 +1300
Message-ID	<bkh772Fkc9iU1@mid.individual.net>
In reply to	#64673

theguy wrote:
> I so far have
> three different authors in the program and have already put in the example
> text but for some reason, the program always leans toward one specific
> author, Suzanne Collins, no matter what insane number I try to put in or how
> much I tinker with the coding.

It's obvious what's happening here: all the other
authors have heavily borrowed from Suzanne Collins.
You've created a plagiarism detector! :-)

-- 
Greg

#64893

From	alex23 <wuwei23@gmail.com>
Date	2014-01-28 17:31 +1000
Message-ID	<lc7md4$bq0$1@dont-email.me>
In reply to	#64673

On 24/01/2014 8:05 PM, theguy wrote:
> I have a science project that involves designing a program which can examine a bit of text with the author's name given, then figure out who the author is if another piece of example text without the name is given.

This sounds like exactly the sort of thing NLTK was made for. Here's an 
example of using it for this requirement:

http://www.aicbt.com/authorship-attribution/
